TOLOGIX - ISLG App Rebuild

Problem with HTML excerpts in Full Text Search Results

Assigned to
Harsh Parikh, Tech Lead at DevIT; Radomir Mladenovic, Contegra
Notes
Further to the video below, the excerpts that the Full Text Search generates from HTML documents do not include the formatting needed to display quotation marks and other symbols (e.g., ¶). Could you please ensure these excerpts are formatted like those in the other research tools?


Comments & Events

Harsh Parikh, Tech Lead at DevIT
Hi Morgan,

The HTML and PDF text is handled by dtSearch, so we have no control over it. dtSearch has to provide the formatted text.
Morgan Maguire, CEO
Ok. Thanks, Harsh.

Radomir, is there anything you could do over the next 2 days to resolve this issue with the excerpts?

Thanks,

Morgan
Radomir Mladenovic, Contegra
Harsh, this is on the web application side. The search service returns the highlighted paragraph as HTML. The paragraph character is properly encoded as an HTML entity - see https://dev.w3.org/html5/html-author/charref
I guess you're adding the paragraph to the page as text instead of raw HTML.
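To illustrate the difference, here is a minimal Python sketch. The excerpt string is made up for the example and is not actual output from the search service; on the page itself, the equivalent fix would be to insert the fragment as raw HTML rather than as text.

```python
import html

# Hypothetical excerpt containing HTML entities (not real service output).
excerpt = "&para; 12. The &ldquo;tribunal&rdquo; held that&hellip;"

# Added to a page as plain text, the ampersands get escaped again,
# so the reader sees the literal entity names instead of the symbols.
as_text = html.escape(excerpt)

# Interpreted as HTML, the entities resolve to the intended characters.
as_html = html.unescape(excerpt)

print(as_text)
print(as_html)
```

If the first form is what reaches the browser, the paragraph sign and quotation marks show up as entity codes, which matches the symptom described in the notes.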
Harsh Parikh, Tech Lead at DevIT
Hi Radomir,

As discussed in today's call, the JSON we get from dtSearch contains Unicode text, so please look into this and provide feedback.
Radomir Mladenovic, Contegra
Harsh, please send me a sample HTML file for which this happens. I don't see the problem with content from PDF, so it might be something with the HTML or the paragraph extraction.
Harsh Parikh, Tech Lead at DevIT
Hi Radomir,

I have attached a sample HTML file where we get Unicode letters when clicking on a paragraph number.

Radomir Mladenovic, Contegra
Harsh, the indexer I sent you today also addresses the HTML entity issue in paragraph highlights. Please recreate the FTS index and test.
Harsh Parikh, Tech Lead at DevIT
Hi Radomir,

After updating the FTS indexes, the HTML excerpt Unicode issue is resolved.

However, when I pass the following search request to get the text, it doesn't fetch the text from the PDF (page 4):

{"searchRequest":"tribunal","SearchType":"3","Stemming":"false","WordNetSynonyms":"false","Fuzzy":"false","Fuzziness":"1","paraId":"EE073142D5FC08C5509D1A05E23D7D27#NA=="}

But in the same document, if I pass this search request to get the text, it displays the text (page 8):

{"searchRequest":"tribunal","SearchType":"3","Stemming":"false","WordNetSynonyms":"false","Fuzzy":"false","Fuzziness":"1","paraId":"EE073142D5FC08C5509D1A05E23D7D27#OA=="}

I have attached the PDF file that the text is fetched from. With the old indexing, both of the above search requests worked.
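One observation that may help when reproducing this: the two requests differ only in the part of paraId after the "#". Decoded as Base64, those suffixes give 4 and 8, which match the page numbers mentioned above. This reading is inferred from these two samples only and is not confirmed behaviour of the search service; split_para_id is a hypothetical helper written for this sketch.

```python
import base64


def split_para_id(para_id: str) -> tuple[str, str]:
    """Split a paraId at '#' and Base64-decode the suffix.

    Assumption: the suffix is Base64-encoded ASCII, as the two
    sample requests in this thread suggest.
    """
    doc_part, _, suffix = para_id.partition("#")
    decoded = base64.b64decode(suffix).decode("ascii")
    return doc_part, decoded


failing = split_para_id("EE073142D5FC08C5509D1A05E23D7D27#NA==")
working = split_para_id("EE073142D5FC08C5509D1A05E23D7D27#OA==")

print(failing[1], working[1])  # prints: 4 8
```

Both requests share the same part before the "#", so the missing text appears to be specific to the paragraph addressed by the first suffix rather than to the document as a whole.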

Radomir Mladenovic, Contegra
Harsh, I sent you a fix for the missing paragraph in folder 2021-03-27/search
Harsh Parikh, Tech Lead at DevIT
Hi Morgan,

This issue is resolved on staging.islg. Please check and confirm.
Morgan Maguire, CEO
Looks great, Harsh. The only outstanding item in the FTS is changing the default setting for the search so that Stemming and Fuzzy Typo are enabled for 1 letter:


Thanks,

Morgan
Harsh Parikh, Tech Lead at DevIT
Hi Morgan,

The above change has been made on staging.islg. Please check and confirm.
Morgan Maguire, CEO
Morgan Maguire completed this to-do.