TOLOGIX - ISLG App Rebuild

Letter and punctuation spacing in migrated HTMLs

Assigned to
Martin Laporte, CTO at Tologix
Notes
Hi Martin,

As we discussed today, content reviewers encountered a letter and punctuation spacing issue when they reviewed old migrated Automated and Manual ISLG HTMLs. As long as the HTML was an accurate representation of the text of the original PDF, we asked them to leave it 'as is' in the interest of time. We'd like to address this now and find a way to fix HTMLs with this spacing issue.

Issue: There are HTMLs with a space between a word and the punctuation mark that follows it (e.g., commas, periods, semicolons, closing brackets).
For example, “word ,” or “word .”, etc.

Chris Thomas came across an example of this today in IC/0209/18 (Infinito Gold v. Costa Rica Award). 

Using IC/0209/18 as an example: when I use the “find” function in the HTML and enter a space character followed by a comma (or period, etc.), I am able to find all the occurrences.

Like this:
“ .” (there are 56 in the HTML)
or
“ ,” (there are 351 in the HTML)
or
“ ;” (there is 1 in the HTML)
or
“ )” (there are 32 in the HTML)
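
The manual "find" counts above could also be gathered in one pass. Below is a minimal Python sketch, assuming the document text has already been extracted from the HTML as a plain string; the function name and the sample sentence are illustrative only and are not taken from app.islg.

```python
import re
from collections import Counter

# Punctuation marks listed in the note above; extend as needed.
PUNCTUATION = ".,;)"

# One or more spaces directly before one of the punctuation marks.
SPACE_BEFORE_PUNCT = re.compile(r" +([" + re.escape(PUNCTUATION) + r"])")


def count_space_before_punct(text):
    """Count how often a space precedes each punctuation mark."""
    return Counter(m.group(1) for m in SPACE_BEFORE_PUNCT.finditer(text))


# Illustrative sample text (not from a real document).
sample = "relying on Biwater , the Claimant asserts ( see above ) . In Sistem , the tribunal agreed ;"
counts = count_space_before_punct(sample)
for mark in PUNCTUATION:
    print(repr(" " + mark), counts[mark])
```

A pattern like this would also match legitimate text such as " .5", so the matches would still need a quick review before any replacement.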

Is there a way to "find and replace" these issues in IC/0209/18? We want to make sure Analysis is saved before trying to make any changes to the HTML text, and that Analysis is not disrupted. This is a migrated document, so shifted highlighted text may not be an issue we need to worry about for this one.

If we can fix this issue in IC/0209/18, the next step would be to find a way to search app.islg for other documents that have this spacing issue.

Thank you,
Irit
cc Paul, I will update Chris as we learn what options and solutions we have.

Comments & Events

Martin Laporte, CTO at Tologix
Hi Irit,

We believe it will be feasible to remove these extra spaces without affecting the existing tags and analysis.

Harsh Parikh (Tech Lead at DevIT) will first test this manually against a few documents. Once he confirms that tags are not affected, I will write a script and execute it in 2 phases:
  • Phase 1: the script will generate a spreadsheet with a list of documents that will be affected. You and Paul will be able to review the list and give it your approval.
  • Phase 2: the script will perform a find-and-replace against all instances found in Phase 1.
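
The two phases above can be sketched roughly as follows. This is only an illustration under stated assumptions: documents are represented as a hypothetical in-memory mapping of document ID to plain text, the "spreadsheet" is produced as CSV text, and the IDs `DOC-A`, `DOC-B` and `DOC-C` are made up. How app.islg actually stores documents is not covered in this thread.

```python
import csv
import io
import re

# One or more spaces directly before a period, comma, semicolon or closing bracket.
SPACE_BEFORE_PUNCT = re.compile(r" +([.,;)])")


def phase1_report(documents):
    """Phase 1: return CSV text listing each affected document and its match count."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["document_id", "occurrences"])
    for doc_id, text in sorted(documents.items()):
        count = len(SPACE_BEFORE_PUNCT.findall(text))
        if count:
            writer.writerow([doc_id, count])
    return buffer.getvalue()


def phase2_fix(documents, approved_ids):
    """Phase 2: return corrected text for the approved documents only."""
    return {
        doc_id: SPACE_BEFORE_PUNCT.sub(r"\1", documents[doc_id])
        for doc_id in approved_ids
    }


# Hypothetical documents, for illustration only.
documents = {
    "DOC-A": "In Sistem , the tribunal found a breach .",
    "DOC-B": "No spacing issue here.",
    "DOC-C": "See the award ( paragraph 12 ) ; it is final.",
}
report = phase1_report(documents)
print(report)

# Only documents approved after review of the Phase 1 list are changed.
fixed = phase2_fix(documents, ["DOC-A"])
print(fixed["DOC-A"])
```

Keeping Phase 2 limited to an explicit list of approved IDs means nothing is changed that was not reviewed in the Phase 1 list.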

We will have more details by next week.

Thanks,
-Martin
Martin Laporte, CTO at Tologix
Hi Irit,

Would it be easy for you to find 2-3 documents that have the spacing issues?
Additionally, it would be ideal if these documents have at least one tag where the space issue is found.

For example, say the document is tagged as follows:
Second, relying on Biwater , the Claimant asserts that investment tribunals have repeatedly confirmed that denial of justice is not a requirement for a judicial measure to amount to an expropriation. 1022 For instance, in Rumeli, the tribunal held that “the final decision of Kazakhstan’s Supreme Court affirming the compulsory redemption of the claimant’s shares amounted to unlawful expropriation, even though the decision was made ‘in accordance with due process of law.’” 1023 In Sistem , the tribunal found that the invalidation of a share purchase agreement constituted an expropriation because it had the effect of abrogating the claimant’s ownership rights in a hotel. As noted by the tribunal in Sistem , States are “not immune from liability for this expropriation simply because the state organs that had carried out the expropriation were judicial entities.” 1024

Thanks,
-Martin
Martin Laporte, CTO at Tologix
Thank you, Irit.

Hi Harsh: please refer to Irit's example above for your test.

Thanks,
-Martin
Martin Laporte, CTO at Tologix
Hi Irit,

Harsh is requesting another sample HTML similar to the one you found (IC/0209/18 Infinito Gold Ltd. v. Republic of Costa Rica, ICSID Case No. ARB/14/5, Award, 03 June 2021).

Would it be possible for you to find another one with a similar setup (the " ," issue, with at least one tag where the space issue is found)?

Thanks,
-Martin
Harsh Parikh, Tech Lead at DevIT
Hi Martin,

I have checked 2 documents for this task and found the following:

  1. If a tag already exists and we then remove a space, it creates an issue: the highlighted position of the tag shifts by the number of spaces removed. If we remove 2 spaces, the highlighted portion shifts by 2 positions.
  2. If we remove the space when no tag exists, it works fine.
  3. If the whole paragraph is highlighted rather than an excerpt, removing the space also works fine.
Only in the 1st case is the highlight position disturbed, when we remove a space within a tag.

Cc: Irit, Paul
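
To illustrate the 1st case, here is a small Python sketch. It assumes a tag is stored as start and end character offsets into the document text; that storage model is an assumption made for illustration and has not been confirmed in this thread. The offsets and sample text are hypothetical.

```python
import re

# One or more spaces directly before a period, comma, semicolon or closing bracket.
SPACE_BEFORE_PUNCT = re.compile(r" +([.,;)])")

original = "relying on Biwater , the Claimant asserts"

# Hypothetical tag stored as (start, end) character offsets into the text.
tag_start, tag_end = 11, 33
highlight_before = original[tag_start:tag_end]

# Remove the extra space but leave the stored offsets untouched.
fixed = SPACE_BEFORE_PUNCT.sub(r"\1", original)
highlight_after = fixed[tag_start:tag_end]

print(repr(highlight_before))  # 'Biwater , the Claimant'
print(repr(highlight_after))   # 'Biwater, the Claimant ' (end is off by one)
```

One space was removed inside the tagged range, so the unchanged end offset now reaches one character past the intended excerpt.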
Martin Laporte, CTO at Tologix
Thanks, Harsh.

Hi Paul and Irit, it looks like we will not be able to remove the spaces by writing a simple script.

We could write a more complex script that would:
  1. Find instances of incorrect spacing
  2. For each document with incorrect spacing:
    1. Look up each instance of incorrect spacing against the database and determine whether the incorrect spacing is within a tag.
      1. If within a tag, correct the spacing in the document and update the database with the new position of the tag
      2. If not within a tag, correct the spacing in the document
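
The core of such a script, correcting the spacing while recomputing tag positions, might look like the Python sketch below. It assumes tags are stored as (start, end) character offsets with an exclusive end, which is an assumption for illustration only; the function name and sample data are hypothetical, and the database lookup and update steps are left out.

```python
import re
from bisect import bisect_left

# One or more spaces that sit directly before a period, comma, semicolon
# or closing bracket. The lookahead keeps the punctuation out of the match.
SPACE_BEFORE_PUNCT = re.compile(r" +(?=[.,;)])")


def fix_spacing_and_remap(text, tags):
    """Remove the extra spaces and shift tag offsets to match the new text.

    `tags` is a list of (start, end) character offsets (end exclusive).
    """
    # Positions of every space that will be deleted, in ascending order.
    removed = [
        i
        for m in SPACE_BEFORE_PUNCT.finditer(text)
        for i in range(m.start(), m.end())
    ]
    fixed = SPACE_BEFORE_PUNCT.sub("", text)

    def remap(offset):
        # Shift left by the number of deleted characters before this offset.
        return offset - bisect_left(removed, offset)

    return fixed, [(remap(start), remap(end)) for start, end in tags]


# Hypothetical text and tags, for illustration only.
original = "relying on Biwater , the Claimant asserts"
tags = [(11, 33), (21, 33)]
fixed, new_tags = fix_spacing_and_remap(original, tags)
for start, end in new_tags:
    print((start, end), repr(fixed[start:end]))
```

Because every offset is shifted by the number of deleted characters before it, this sketch adjusts tags that contain a removed space as well as tags that start after one. The real tag storage would need to be confirmed before relying on this approach.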

This would not be a straightforward project, and we would have to test extensively before applying it to Production, since we would be changing tag locations.

My recommendation is to create a DevOps item for this, but delay the work until we have a better sense of how we want to manage content moving forward.

Thanks,
-Martin
Paul Moon
Hi Martin:

Please create a DevOps item, and I'll place it as a low-priority item.

Thanks,

Paul
Paul Moon
Hi Martin:

I see the DevOps item is included in the current sprint. Should it be in this sprint?

Thanks,

Paul
Martin Laporte, CTO at Tologix
Hi Paul,

Thanks for catching this. It was a mistake; it should have been put in the main backlog.
I have now moved the item to the backlog.

Thanks,
-Martin
Paul Moon
Sounds good.
Paul Moon
Hi Irit and Martin:

Is this something that the PDF to HTML comparison tool could largely carry out, or should we still keep this as a backlog item for the next sprint?

Please let me know.

Thanks,

Paul
Irit Weinfeld
Hi Paul,

It seems that correcting the spacing in any tagged document can disrupt its tags. So maybe we just focus on new HTMLs?
We can easily catch these spacing issues during content review. The issue is document-specific; not every document has it.

We can discuss this over Zoom if it's easier.

Irit
Paul Moon
Hi Irit:

As long as we don't have to adjust spacing post-tagging, we can take your approach. Let's discuss this tomorrow.

Thanks,

Paul