TOLOGIX - ISLG Maintenance

Project dealing with all ongoing maintenance of the current ISLG application (www.investorstatelawguide.com and dev.investorstatelawguide.com).

Problem with copy/pasting text from PDF.js

Assigned to
Anil Vaghela; Harsh Parikh, Tech Lead at DevIT; Ryan Knuth, Customer Support Manager at Industrial
Notes
Further to the video below (no audio), there is a problem with copy/pasting text from the PDF.js viewer. When the text is pasted into a separate document, random line breaks appear throughout it.


Comments & Events

Harsh Parikh, Tech Lead at DevIT
Hi Morgan and Ryan,

We are looking into this and will update you soon.
Harsh Parikh, Tech Lead at DevIT
Hi Morgan,

We are still doing R&D on this task and will update you soon. We are not sure yet, but we may need to update the PDF.js version.
Morgan Maguire, CEO
Ok. Thanks for the update Harsh. Look forward to seeing progress on this soon.

Morgan
Morgan Maguire, CEO
Hi Harsh,

Could you provide an update on this to-do? We should ensure this is given a higher priority.

Thanks,

Morgan 
Harsh Parikh, Tech Lead at DevIT
Hi Morgan,

We looked into this. The issue is caused by the HTML structure generated by PDF.js itself. Changes are needed in PDF.js so that the generated HTML uses specific HTML tags.

We need to ask the Contegra team whether they are able to do this in the PDF hit highlighter tool.

Please suggest.
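
For context, a minimal sketch of one possible client-side workaround, assuming the unwanted breaks arrive in the copied text as single newlines while genuine paragraph breaks arrive as blank lines. That assumption has not been verified against ISLG documents. The name joinWrappedLines is hypothetical; it is not part of PDF.js or the Contegra highlighter.

```typescript
// Hypothetical helper, not a PDF.js API.
// Joins lines that were split by the viewer's layout into one paragraph,
// while keeping blank lines as paragraph separators.
function joinWrappedLines(text: string): string {
  return text
    .replace(/\r\n?/g, "\n") // normalise Windows and old Mac line endings
    .split(/\n{2,}/) // assumption: blank lines mark real paragraph breaks
    .map((paragraph) =>
      paragraph
        .split("\n")
        .map((line) => line.trim())
        .filter((line) => line.length > 0)
        .join(" ")
    )
    .filter((paragraph) => paragraph.length > 0)
    .join("\n\n");
}
```

This only post-processes copied text; it does not change the HTML that PDF.js generates.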
Morgan Maguire, CEO
Hi Harsh,

OK. I'll reach out to Contegra to see if they have encountered the problem and have a proposed solution. However, this may be an issue with PDF.js, which I understand is developed by Mozilla.

Morgan 
Morgan Maguire, CEO
Hi Harsh and Anil,

Further to the recommendation provided by Contegra, please implement the proposed solution to the printing problem, and then we'll continue to explore how to resolve the copy/paste issue.

Morgan 
Ryan Knuth, Customer Support Manager at Industrial
Hi Harsh and Anil,

Morgan had suggested we open an issue for PDF.js on GitHub related to the copy/paste issues. You would be able to explain the issue more technically than I would, so could one of you take care of this? Hopefully someone in the community can help us out.

https://github.com/mozilla/pdf.js/issues

Thanks!

Ryan
Harsh Parikh, Tech Lead at DevIT
Hi Ryan,

We have posted an issue on GitHub. Please review the following URL to see the post.
https://github.com/mozilla/pdf.js/issues/10003

Let us know if you want to add/edit something.
Ryan Knuth, Customer Support Manager at Industrial
Looks great. Thanks for submitting it, Harsh.

Ryan
Morgan Maguire, CEO
Hi Harsh,

Have you had a chance to review the other posts on this issue (https://github.com/mozilla/pdf.js/labels/4-text-selection)? They may contain a possible solution to this problem.

Also, have we started working on the printing fix recommended by Contegra?

Thanks,

Morgan
Morgan Maguire, CEO
Hi Harsh,

Thanks for looping me into the email chain with Contegra. It appears we have the printing issue resolved, and the copy/paste issue resolved with the exception of the line breaks. However, is this solution only applicable to documents viewed in the PDF highlighter through the Full Text Search, or does it apply to the PDF.js viewer across the entire system?

Thanks,

Morgan 
Harsh Parikh, Tech Lead at DevIT
Hi Morgan,

Yes. We have resolved both the printing and copy/paste issues, with the exception of the line breaks, on both dev.islg and www.islg.

This solution is also applied to the PDF.js viewer across the entire system on both dev.islg and www.islg.
Morgan Maguire, CEO
Hi Harsh,

I've performed some tests. The printing issue appears to be resolved; however, I'm still having some issues with copy/pasting, where it is inserting additional spacing between words.

Thanks,

Morgan

Harsh Parikh, Tech Lead at DevIT
Hi Morgan,

We have tried to reproduce the above issue using the example you showed in the video, but it is working fine on our end.

We have also tried copying and pasting different paragraphs in different scenarios (e.g., in the PDF.js viewer, in a new window in the viewer, and after downloading the PDF file). In all scenarios it looks good.

Please note that we tested copy/paste with Microsoft Word 2016.

Ryan, could you please check on your end?

Please see the following Word document for your reference.
Morgan Maguire, CEO
Hi Harsh,

The sample you've provided has the issue I've pointed out above. I've attached it again, and underlined the sections where additional spaces have been inserted between words. This contrasts with the version from the Google Chrome PDF viewer. PDF.js is better in some ways (e.g., there are no line breaks at the end of each line); however, the additional space issue will cause large problems for users, because removing these spaces would be very tedious.

Thanks,

Morgan

 
Ryan Knuth, Customer Support Manager at Industrial
In Morgan's example, it looks like the extra spaces appear because the original PDF has justified text, which injects extra spacing between words to make each line of a paragraph the same width.


Harsh, I'm not sure if it's at all possible for us to strip the extra spaces when the text is copied? I suspect we're at the mercy of PDF.js at that point?
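
On the question of stripping extra spaces at copy time, a minimal sketch follows. It assumes the extra spacing reaches the clipboard as runs of ordinary spaces, tabs, or non-breaking spaces, which has not been confirmed for the documents in this thread. The name collapseExtraSpaces is hypothetical and is not a PDF.js function.

```typescript
// Hypothetical helper, not a PDF.js API.
// Collapses each run of spaces, tabs, or non-breaking spaces into a single
// space and trims each line, leaving line breaks untouched.
function collapseExtraSpaces(text: string): string {
  return text
    .split("\n")
    .map((line) => line.replace(/[ \t\u00a0]+/g, " ").trim())
    .join("\n");
}
```

One way to apply it would be a "copy" event listener on the viewer that reads window.getSelection().toString(), passes it through the function, writes the result with event.clipboardData.setData("text/plain", ...), and calls event.preventDefault(). Whether that interacts cleanly with the PDF hit highlighter is untested.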
Morgan Maguire, CEO
Hi Ryan,

Understood. Unfortunately, justified text is the standard format for legal documents, so this will occur very frequently across the document collection. I understand that this may be a fundamental flaw in PDF.js, but let's ensure we're exhausting every possible solution before we move on.

Thanks,

Morgan 
Harsh Parikh, Tech Lead at DevIT
Hi Morgan and Ryan,

Thanks for the feedback.

The extra spaces between words might be caused by the PDF content.

If you copy/paste a paragraph from the following URL, you will see that extra spaces are not inserted between words (e.g., copy/paste paragraph 4.873 or 4.874).

https://www.investorstatelawguide.com/ResearchTools/SubjectNavigator?toc=content&id=50&kwList=26140,26184,38605,52667,52169,52667&exList=&selectedNodeID=52667&search=&ci=52667&searchBranchLevel=#52667

For the Google Chrome PDF viewer issue, could you please share a short video? When we tried copying and pasting a paragraph using the Google Chrome PDF viewer, it worked fine. We are not able to see any line breaks after each line.
Harsh Parikh, Tech Lead at DevIT
Hi Ryan,

Please paste the paragraph into the Word document using the "Keep Text Only" option and check once again.
Morgan Maguire, CEO
Hi Harsh,

As requested, the video below should hopefully clarify what the problem is with the spacing between words when copy/pasting justified text from within the PDF.js viewer. The example used in the video is here: https://www.investorstatelawguide.com/ResearchTools/SubjectNavigator?toc=content&id=50&kwList=26140,26184,38605,52169,52169&exList=&selectedNodeID=52169&search=&ci=52169&searchBranchLevel=#52169 and here is a sample of the Word document created in the video:

As I've explained in the video, I believe this is an issue that has been identified and addressed in this string of GitHub posts: https://github.com/mozilla/pdf.js/labels/4-text-selection. Please review and determine whether a solution to the problem is possible, and continue to work out a solution with Radomir Mladenovic from Contegra. I want to ensure we've exhausted all possible solutions to the problem before attributing this to an unsolvable flaw in PDF.js.

Thanks,

Morgan
Harsh Parikh, Tech Lead at DevIT
Hi Morgan,

Thanks for sharing the above video. It helped us find the actual problem.

We are trying to find a solution, and I have also emailed Radomir Mladenovic regarding the above issues.
Morgan Maguire, CEO
Great. Glad to hear it, Harsh.

Hopefully a solution is on the horizon.

Thanks,

Morgan 
Morgan Maguire, CEO
Hi Harsh,

Got your email about the latest fix on this issue. Although there is still an issue with the line break between each line of text, it is better than before, and consistent with other PDF viewers. Marking this to-do complete.

Thanks,

Morgan 
Morgan Maguire, CEO
Morgan Maguire completed this to-do.