Posted by Derek Ethier · Jan 12, 2018 at 7:50 PM

Overview of indexing, and alternatives to consider (discuss)

Morgan and Anil,

Sorry for the title on this thread! Couldn't come up with a better one on a Friday afternoon. ;)

The following is based on our call yesterday. I've summarized the workflow as it was described in ISLG and (proposed) for P3 with dtSearch/PDF-to-HTML conversion and followed it with some suggested reading on alternatives to dtSearch for DevIT to evaluate.

Workflow
When I refer to the "application", I mean P3 and the ISLG rebuild respectively, not the currently running version of the latter.

1. An admin uploads a new PDF and the application saves it to a folder on the web server. An external Windows service (developed by DevIT) monitors this folder and, when new files are added, converts them to HTML (30-40 seconds for a 400-500k file). Currently, the user is notified when the upload is complete, but not when the final conversion is complete.
2. Once the PDF has been converted to HTML, dtSearch picks up the HTML and indexes the content.
3. The application search interface queries dtSearch using a variety of search parameters, invoking the full-text search capabilities of dtSearch. Given that there is no native .Net Core support for the dtSearch DLL, DevIT is proposing a similar approach: a separate MVC service that the application queries and that in turn invokes the dtSearch interface. Results are returned from dtSearch to this separate MVC service, and then to the main application. The application uses these results to reference the HTML files (preview, text highlight etc.) for the search results and other related functionality.

Missing above is any additional database activity that may take place that links any of the meta data (tags etc.) to the uploaded PDF and converted HTML. We touched on it briefly, but didn't fully clarify this process. The assumption right now is that the external Windows service referenced in 1. is also handling that workload (Anil, please confirm this assumption).

As mentioned in the call, a background process for item 1 is the best way to proceed. Anything that takes longer than 5-10 seconds (even that is a big stretch) in a web application is not ideal and should be handled via a background process or a separate thread.

I mentioned the following as a potential option for background job handling in .Net (.Net Core is supported too).

Hangfire
https://www.hangfire.io/

It integrates pretty seamlessly into an application to fire off a job that is handled outside of the web request and that the main thread can poll to check on its status. This is useful functionality if the UX for the admin requires a final notification once the document has been converted to HTML.

Another option is for the application to poll for the existence of the converted HTML file and the final existence is confirmation of the successful conversion.

The latter might be the best approach given the time already invested in the service that monitors the PDF folder. The potential risk is that the existence of a file (or not) may not give the application (and, by extension, the admin) enough information if something went awry during the conversion.
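The file-existence polling described above could look roughly like the following. This is a minimal sketch in Python for illustration only (the application itself is .Net); the naming rule in `html_path_for`, the function names, and the timeout values are all assumptions, not part of the DevIT service.

```python
import time
from pathlib import Path


def html_path_for(pdf_path: Path, html_dir: Path) -> Path:
    """Hypothetical naming rule: report.pdf -> <html_dir>/report.html."""
    return html_dir / (pdf_path.stem + ".html")


def wait_for_html(pdf_path: Path, html_dir: Path,
                  timeout_s: float = 120.0, interval_s: float = 2.0) -> str:
    """Poll until the converted file appears or the timeout passes.

    Returns "ready" or "timed_out". A timeout cannot tell a slow
    conversion apart from a failed one, which is the risk noted above.
    """
    target = html_path_for(pdf_path, html_dir)
    deadline = time.monotonic() + timeout_s
    while True:
        if target.exists():
            return "ready"
        if time.monotonic() >= deadline:
            return "timed_out"
        time.sleep(interval_s)
```

Note that "timed_out" is the only negative signal available here; reporting an actual failure would need the conversion service to record one somewhere the application can read.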

dtSearch alternatives

This isn't meant as a slight on dtSearch. These are just some options we might want to look into further, as there are a number of really good full-text search engines/services available that may provide additional features or, at the very least, drop the requirement for that second external interface for the dtSearch DLL.

Elasticsearch
https://www.elastic.co/

This is likely the biggest option in this space right now. It's fully open-source, and can be installed on Windows (it is primarily meant for Linux, but they have a Windows installer and service). Support can be purchased from Elastic if required.

It is built on top of the powerful Lucene full-text search engine, but provides an easy-to-consume API in front of it that supports standard RESTful actions and returns JSON.

They provide a plugin that supports ingesting documents: it parses a number of document types (HTML, PDF, Word etc.) and indexes them, similarly to how the dtSearch process works. It is event driven rather than based on folder monitoring, so the application would have to send the document contents to the Elasticsearch API for it to index.
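As a rough illustration of "sending the document contents to the API", the sketch below builds the two JSON requests used with Elasticsearch's ingest attachment processor: a pipeline definition, and an index request carrying the file as base64 in a `data` field. It is Python for illustration only and sends nothing over the network. The pipeline name, index name, and document id are hypothetical, and the `_doc` path segment follows newer Elasticsearch versions (older versions expect a mapping type name there), so treat the path as an assumption to check against the installed version.

```python
import base64
import json

PIPELINE_ID = "attachment"  # hypothetical pipeline name


def pipeline_definition() -> dict:
    """Body for PUT _ingest/pipeline/<PIPELINE_ID>."""
    return {
        "description": "Extract text from uploaded documents",
        "processors": [{"attachment": {"field": "data"}}],
    }


def index_request(index: str, doc_id: str, content: bytes) -> dict:
    """Describe the request that would index one document via the pipeline."""
    body = {"data": base64.b64encode(content).decode("ascii")}
    return {
        "method": "PUT",
        # Path shape is version-dependent; see the note above.
        "path": f"/{index}/_doc/{doc_id}?pipeline={PIPELINE_ID}",
        "body": json.dumps(body),
    }
```

Using a separate index name per application (for example one for P3 and one for the ISLG rebuild) is one simple way to share a single Elasticsearch environment.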

They also provide .Net clients that fully support .Net Core, so they will work without requiring an additional application.

It supports clustering, and name-spacing is easy if you want to use the same Elasticsearch environment for both the re-built ISLG application and P3.

Solr
http://lucene.apache.org/solr/

Like Elasticsearch, Solr is built on top of Lucene for its full-text search capabilities. Solr provides similar REST-based interfaces and can output responses in XML, CSV and JSON.

There are no official .Net libraries for Solr, but there are a few community options available that support .Net Core natively.

There are a lot of similarities between the two products, highlighted in detail here. However, the largest difference that impacts P3/ISLG is that Elasticsearch has a full-fledged company (Elastic) behind it, while Solr is an Apache open-source product. There are third parties that will provide Solr support though.

For both of the options above, the facet capabilities built into Lucene might be worth exploring. Facets provide some additional search options that a more traditional taxonomy (hierarchical, text-based) may not support, such as numeric or date-based data, and they can be derived automatically through indexing rather than assigned manually.
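To make the facet idea concrete, here is a small conceptual sketch in plain Python. It is not the Lucene, Elasticsearch, or Solr API; it only shows how facet counts can be derived from fields already on the indexed documents. The sample documents and field names are made up for illustration.

```python
from collections import Counter
from datetime import date

# Hypothetical indexed documents.
documents = [
    {"title": "Document A", "tags": ["zoning"], "published": date(2016, 3, 1)},
    {"title": "Document B", "tags": ["zoning", "parking"], "published": date(2016, 9, 15)},
    {"title": "Document C", "tags": ["parking"], "published": date(2017, 1, 20)},
]


def facet_counts(docs, key):
    """Count documents per facet value; `key` extracts one value or a list."""
    counts = Counter()
    for doc in docs:
        value = key(doc)
        values = value if isinstance(value, list) else [value]
        counts.update(values)
    return dict(counts)


by_year = facet_counts(documents, lambda d: d["published"].year)
by_tag = facet_counts(documents, lambda d: d["tags"])
```

Here the year facet is derived from a date field rather than assigned by hand, which is the kind of automatic grouping a manually maintained taxonomy does not give you.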

Another feature in Lucene is boosting. This allows the index definition to determine whether certain data should be weighted differently in queries. For example, text in a heading would bubble up higher in search results than similar text in document bodies (a very basic example).
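As a hedged illustration, the sketch below builds an Elasticsearch `multi_match` query body in which a heading field is weighted three times higher than the body field using the `field^boost` syntax. Note that this is a query-time boost, which is a variation on the index-definition boost described above. The field names `heading` and `body` are assumptions about how the converted HTML might be mapped; the code is Python for illustration only and does not call Elasticsearch.

```python
def boosted_query(text: str, heading_boost: float = 3.0) -> dict:
    """Build a search body that weights matches in `heading` above `body`."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                # "heading^3" means a match in heading counts 3x.
                "fields": [f"heading^{heading_boost:g}", "body"],
            }
        }
    }
```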

An interesting project came up while looking into this:

Ambar
https://ambar.cloud/

Specifically, as it was developed as an alternative to dtSearch.

It's relatively new and largely an amalgamation of various open source tools (redis, mongo). We have never used it for any project, and it likely won't work on Windows. This isn't a recommendation to look into it as an alternative, but it is worth noting that they chose Elasticsearch as the primary text-search engine that their tools rely on.

There are other options, but these Lucene-based products are really the big open-source players in this space, and the ones we have the most experience with. Both have good support for .Net, support options available, and very good APIs.

As discussed though, the difficulty will be translating the search parameters that the application currently relies on to the APIs that these tools provide (or the abstractions provided by the various Nuget packages) and assessing what the effort will be to port existing logic over for both the request and response handling in the application.

In any event, there's a lot here to look through and I hope that this helps in some way. If there are any questions, or a follow-up call would be beneficial, let us know.

Have a good weekend!

Comments & Events

Morgan Maguire, CEO

Thanks for this Derek. I'll take a closer look at this over the weekend.

Anil, please add your thoughts, and then we'll decide on next steps.

Morgan

Jan 13, 2018 at 1:04 AM Notified 9 people

Anil Vaghela

Hello Derek and Morgan,

Missing above is any additional database activity that may take place that links any of the meta data (tags etc.) to the uploaded PDF and converted HTML. We touched on it briefly, but didn't fully clarify this process. The assumption right now is that the external Windows service referenced in 1. is also handling that workload (Anil, please confirm this assumption).
Tagging will be done by the P3 application itself. There is no need for any external application for tagging. Tagging will be done on the HTML files converted by the background service.

Hangfire
Please note that the purpose of using a Windows service is not only to manage the processing time in the background. For some reason, the Adobe Pro API gives a COM error when we try to convert a PDF document to HTML from an IIS application on Carbon60. We assume that Hangfire will produce the same COM error if it runs through IIS. We will check this tool in more detail and get back to you with any further comments.

Thanks for dtSearch alternatives. We will explore these and get back to you.

Jan 16, 2018 at 1:13 PM Notified 9 people

Derek Ethier

Please note that the purpose of using a Windows service is not only to manage the processing time in the background. For some reason, the Adobe Pro API gives a COM error when we try to convert a PDF document to HTML from an IIS application on Carbon60. We assume that Hangfire will produce the same COM error if it runs through IIS. We will check this tool in more detail and get back to you with any further comments.

Anil, correct. It would have the same error. I was suggesting it as a possible means to poll the external service so that the user could be alerted when the file conversion was completed.

Another option: a bit of JS that periodically polls an application endpoint that checks for the existence of the file would do the trick too. But if there is an error in the process, additional work might be required to handle the exceptions.

Jan 16, 2018 at 3:19 PM Notified 9 people

Morgan Maguire, CEO

Just to keep the discussion going, and to get to a conclusion on these issues.

Anil, using the proposed Windows service without Hangfire, would we have the ability to notify admin users when the file conversion is complete? If not maybe we should consider this an option as Derek has suggested (assuming you're able to set it up without the error).

Re dtSearch, let us know whether you think Derek's suggestions are viable alternatives (taking into account the legacy of knowledge you and your team currently have with dtSearch). To use a cliche phrase, I don't want to throw out the baby with the bathwater, particularly if the other options are only marginally better than dtSearch.

Also, could you clarify the impacts of running dtSearch through a separate MVC service? Would there be any discernible difference to users when using tools that rely on dtSearch? If so, what would they be, because that is my primary concern?

Thanks,

Morgan

Jan 17, 2018 at 6:00 PM Notified 9 people

Anil Vaghela

Hello Morgan,

Anil, using the proposed Windows service without Hangfire, would we have the ability to notify admin users when the file conversion is complete? If not maybe we should consider this an option as Derek has suggested (assuming you're able to set it up without the error).
As a solution, we are planning to add an "HTML Available" column to the document list so that the admin will be able to see whether the HTML file is ready or not. Please let us know your thoughts.

Re dtSearch, let us know whether you think Derek's suggestions are viable alternatives (taking into account the legacy of knowledge you and your team currently have with dtSearch). To use a cliche phrase, I don't want to throw out the baby with the bathwater, particularly if the other options are only marginally better than dtSearch.
We will explore all options and get back to you with some comments by Monday.

Also, could you clarify the impacts of running dtSearch through a separate MVC service? Would there be any discernible difference to users when using tools that rely on dtSearch? If so, what would they be, because that is my primary concern?
There will not be any visible difference to the user with a separate web service. If dtSearch were in our P3 project directly, we would not need to write extra logic to highlight the search keyword; with a separate web service, we would need to write that extra logic.
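For a sense of scale, the kind of extra highlighting logic mentioned above might be as small as the sketch below. It is Python for illustration only (the application is .Net) and makes a simplifying assumption: it works on plain text and a single keyword, not on the converted HTML markup or on hit offsets returned by a search engine. The function name is hypothetical.

```python
import html
import re


def highlight(text: str, keyword: str) -> str:
    """Escape plain text for HTML and wrap case-insensitive matches in <mark>."""
    if not keyword:
        return html.escape(text)
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    parts = []
    last = 0
    for match in pattern.finditer(text):
        # Escape the text between matches, then wrap the match itself.
        parts.append(html.escape(text[last:match.start()]))
        parts.append("<mark>" + html.escape(match.group(0)) + "</mark>")
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)
```

Highlighting inside already-converted HTML is harder than this, because matches must not be inserted inside tags or attributes.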

Jan 18, 2018 at 7:48 AM Notified 9 people

Morgan Maguire, CEO

Hi Anil,

The "HTML Available" column sounds like a solution that should work. Would it be possible for you to provide some mockups within TargetProcess that illustrate how this would work?

Great. Look forward to hearing your feedback on the viability of the alternatives.

Great. If there's no difference to users and you're confident this will not affect overall performance long-term, I'm fine with this solution (assuming the alternatives are not viable).

Thanks,

Morgan

Jan 18, 2018 at 5:26 PM Notified 9 people

Anil Vaghela

Hi Morgan,

Please see attached screenshot.

Attachment: html_available.jpg (118 KB)

Jan 19, 2018 at 9:59 AM Notified 9 people

Morgan Maguire, CEO

Thanks Anil.

Megan, Melissa and Kevin, would what Anil has proposed above satisfy the UX requirements, or should further modifications be made?

Morgan

Jan 19, 2018 at 5:57 PM Notified 10 people

Megan Goodacre

Hi Morgan, I believe so. If the HTML Processed value is false and shows an "X", will the service retry in the background? Or does the user have to trigger a new attempt to convert to HTML?

Thanks
Megan

Jan 19, 2018 at 8:58 PM Notified 11 people

Morgan Maguire, CEO

Good point.

Anil, how would the system deal with a conversion failure?

Also, perhaps we should use a different icon to represent that the conversion is pending. For example, a clock or a loading icon.

Morgan

Jan 19, 2018 at 9:06 PM Notified 11 people

Anil Vaghela

Hello Morgan and Megan,

If the conversion fails, the service will retry it.

Sure Morgan, we will put a different icon for pending conversion.
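One way to picture the status column with a separate pending state and automatic retries is the sketch below. It is a Python illustration only, not the DevIT service's actual design: the class and status names are hypothetical, and the retry limit is an assumption (the thread only states that the service retries, not how many times).

```python
from enum import Enum


class HtmlStatus(Enum):
    PENDING = "pending"  # e.g. shown with a clock or loading icon
    READY = "ready"      # HTML is available
    FAILED = "failed"    # e.g. shown with an "X"


MAX_ATTEMPTS = 3  # hypothetical retry limit, not stated in the thread


class ConversionRecord:
    """Tracks conversion attempts for one uploaded document."""

    def __init__(self, document_id: int) -> None:
        self.document_id = document_id
        self.attempts = 0
        self.status = HtmlStatus.PENDING

    def record_attempt(self, succeeded: bool) -> None:
        """Update the status after the service tries a conversion."""
        self.attempts += 1
        if succeeded:
            self.status = HtmlStatus.READY
        elif self.attempts >= MAX_ATTEMPTS:
            self.status = HtmlStatus.FAILED
        else:
            # Still pending: the service is expected to retry.
            self.status = HtmlStatus.PENDING
```

With three states, the admin can distinguish "not converted yet" from "conversion gave up", which a single true/false column cannot express.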

Jan 22, 2018 at 10:29 AM Notified 11 people

Morgan Maguire, CEO

Ok.

Megan, Melissa and Kevin, does that make sense to you?

Morgan

Jan 22, 2018 at 9:49 PM Notified 11 people