Basecamp Export

From the list: ISLG Team Feedback - complete before launch

✔ Problem with HTML excerpts in Full Text Search Results

Morgan M. added this on Mar 22, 2021
Completed Mar 30, 2021 by Morgan M.

Assigned to: Harsh P., Radomir M.
Notes: Further to the video below, within the Full Text Search the excerpts generated from HTML documents are not including the formatting required to present quotation marks and other symbols (e.g., ¶). Could you please ensure these excerpts have similar formatting to the other research tools.

Problems with presentation of excerpts in full text search (22-Mar-21).mp4 (2.89 MB)

Comments & Events

Harsh Parikh, Tech Lead

Morgan, the HTML and PDF text is handled by dtSearch, so we have no control over it. dtSearch has to provide the formatted text.

Mar 23, 2021 at 6:13 AM Notified 3 people

Morgan Maguire, CEO

Ok. Thanks, Harsh.

Radomir, is there anything you could do over the next 2 days to resolve this issue with the excerpts?

Thanks,

Morgan

Mar 23, 2021 at 7:29 PM Notified 3 people

Radomir Mladenovic

Harsh, this is on the web application side. The search service returns the highlighted paragraph as HTML. The paragraph character is properly encoded as an HTML entity; see https://dev.w3.org/html5/html-author/charref
I guess you're adding the paragraph to the page as text instead of raw HTML.
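
The difference between adding the paragraph as text and adding it as raw HTML can be sketched in Python. This is a minimal illustration, not the application's code: the sample excerpt and the helper names `render_as_text`, `render_as_raw_html` and `visible_text` are hypothetical, and the only assumption taken from the thread is that the search service returns a highlighted paragraph as HTML containing entities such as `&para;`.

```python
import html

# Hypothetical excerpt, shaped like a highlighted paragraph returned as HTML.
EXCERPT = "&para; 12. The <b>tribunal</b> held that &ldquo;the claim&rdquo; was admissible."


def render_as_text(fragment: str) -> str:
    """Escape the fragment, which is what happens when it is added to a page as text."""
    return html.escape(fragment, quote=False)


def render_as_raw_html(fragment: str) -> str:
    """Leave the fragment unchanged so the browser parses its tags and entities."""
    return fragment


def visible_text(markup: str) -> str:
    """Approximate what a browser displays by decoding entities (tags are not stripped)."""
    return html.unescape(markup)


# Added as text: the entity is escaped a second time, so the reader sees "&para;".
print(visible_text(render_as_text(EXCERPT)))

# Added as raw HTML: the entity is decoded once, so the reader sees the pilcrow sign.
print(visible_text(render_as_raw_html(EXCERPT)))
```

Adding a fragment as raw HTML is only appropriate when it comes from a trusted source, such as the application's own search service.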

Mar 23, 2021 at 8:03 PM Notified 3 people

Harsh Parikh, Tech Lead

Radomir, as discussed in today's call, the JSON from dtSearch gives us Unicode text, so please look into this and provide feedback.

Mar 24, 2021 at 8:48 AM Notified 3 people

Radomir Mladenovic

Harsh, please send me a sample HTML file for which this happens. I don't see the problem with content from PDFs, so it might be something with the HTML or the paragraph extraction.

Mar 24, 2021 at 10:26 PM Notified 3 people

Harsh Parikh, Tech Lead

Radomir, I have attached a sample HTML file where we get Unicode characters when clicking on a paragraph number.

IC-0012-10.html (788 KB)

Mar 25, 2021 at 5:43 AM Notified 3 people

Radomir Mladenovic

Harsh, the indexer I sent you today also addresses the HTML entity issue in paragraph highlights. Please recreate the FTS index and test.

Mar 26, 2021 at 12:05 PM Notified 3 people

Harsh Parikh, Tech Lead

Radomir, after updating the FTS indexes, the HTML excerpt Unicode issue is resolved.

However, when I pass the following search request to get the text, it doesn't fetch the text from the PDF (page 4):

{"searchRequest":"tribunal","SearchType":"3","Stemming":"false","WordNetSynonyms":"false","Fuzzy":"false","Fuzziness":"1","paraId":"EE073142D5FC08C5509D1A05E23D7D27#NA=="}

But in the same document, if I pass this search request to get the text, it displays the text (page 8):

{"searchRequest":"tribunal","SearchType":"3","Stemming":"false","WordNetSynonyms":"false","Fuzzy":"false","Fuzziness":"1","paraId":"EE073142D5FC08C5509D1A05E23D7D27#OA=="}

I have attached the PDF file from which the text is fetched. With the old indexing, both of the above search requests worked.
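
The two requests quoted above differ only in the part of `paraId` after the `#`. As a small sketch, the Python below compares the two requests and decodes that suffix as Base64: `NA==` decodes to `4` and `OA==` decodes to `8`, which matches the page numbers mentioned. Reading the suffix as a page number and the prefix as a document identifier is an inference from these two examples, not something confirmed by the search service, and `split_para_id` and `differing_keys` are hypothetical helpers.

```python
import base64
import json
from typing import List, Tuple

# The two requests exactly as quoted in the thread.
FAILING_REQUEST = '{"searchRequest":"tribunal","SearchType":"3","Stemming":"false","WordNetSynonyms":"false","Fuzzy":"false","Fuzziness":"1","paraId":"EE073142D5FC08C5509D1A05E23D7D27#NA=="}'
WORKING_REQUEST = '{"searchRequest":"tribunal","SearchType":"3","Stemming":"false","WordNetSynonyms":"false","Fuzzy":"false","Fuzziness":"1","paraId":"EE073142D5FC08C5509D1A05E23D7D27#OA=="}'


def split_para_id(para_id: str) -> Tuple[str, str]:
    """Split a paraId at '#' and Base64-decode the suffix (assumed to be ASCII)."""
    prefix, _, suffix = para_id.partition("#")
    return prefix, base64.b64decode(suffix).decode("ascii")


def differing_keys(a: dict, b: dict) -> List[str]:
    """Return the keys whose values differ between two parsed requests."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


failing = json.loads(FAILING_REQUEST)
working = json.loads(WORKING_REQUEST)

print(differing_keys(failing, working))   # only "paraId" differs
print(split_para_id(failing["paraId"]))   # suffix decodes to "4"
print(split_para_id(working["paraId"]))   # suffix decodes to "8"
```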

UN-0040-44 - Yukos v. Russia - Respondent Appeal.pdf (31.5 MB)

Mar 27, 2021 at 6:40 AM Notified 3 people

Radomir Mladenovic

Harsh, I sent you a fix for the missing paragraph in folder 2021-03-27/search.

Mar 27, 2021 at 1:45 PM Notified 3 people

Harsh Parikh, Tech Lead

Morgan, this issue is resolved on staging.islg. Please check and confirm.

Mar 28, 2021 at 7:31 AM Notified 3 people

Morgan Maguire, CEO

Looks great, Harsh. The only outstanding item in the FTS is changing the default setting for the search so that Stemming and Fuzzy Typo are enabled for 1 letter:

image.png (194 KB)
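
As a rough sketch of the requested defaults, the Python below builds a search request using the field names from the requests quoted earlier in this thread, with `Stemming` and `Fuzzy` switched on and `Fuzziness` kept at `"1"`. That these fields correspond to the Stemming and Fuzzy Typo settings in the interface, and that a `Fuzziness` of `"1"` means 1 letter, is an assumption; `build_default_request` is a hypothetical helper, not part of the application.

```python
import json

# Field names and string-typed values follow the requests quoted in this thread.
# SearchType and WordNetSynonyms are carried over unchanged from those requests.
DEFAULT_OPTIONS = {
    "SearchType": "3",
    "Stemming": "true",          # assumed to map to the Stemming setting
    "WordNetSynonyms": "false",
    "Fuzzy": "true",             # assumed to map to the Fuzzy Typo setting
    "Fuzziness": "1",            # assumed to mean a tolerance of 1 letter
}


def build_default_request(term: str, **overrides: str) -> str:
    """Serialize a search request for `term`, applying defaults and then any overrides."""
    request = {"searchRequest": term, **DEFAULT_OPTIONS, **overrides}
    return json.dumps(request, separators=(",", ":"))


print(build_default_request("tribunal"))
```

A caller that needs an exact match could pass overrides such as `Stemming="false"` and `Fuzzy="false"`.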

Thanks,

Morgan

Mar 28, 2021 at 5:54 PM Notified 4 people

Harsh Parikh, Tech Lead

Morgan, the above change has been made on staging.islg. Please check and confirm.

Mar 30, 2021 at 9:29 AM Notified 4 people

Morgan Maguire, CEO

Looks good, Harsh, on both staging.islg and app.islg.

Note the issue I experienced with the searches in the Subject Navigator, FTS and Dispute Documents noted here: Re: 2021-03-25 08.19.13 Weekly Tologix-DevIT-Industrial Meeting 81175607260.mp4 - TOLOGIX - ISLG App Rebuild. However, I will assume this is just an indexing issue for the time being, and will mark this to-do complete.

Thanks,

Morgan

Mar 30, 2021 at 6:12 PM Notified 4 people

Morgan Maguire completed this to-do.

Mar 30, 2021 at 6:12 PM