Basecamp Export

From the list: Bugs / Issues

✔ Words placed out-of-order when added to Block Properties and HTML (UN/0152/06)(UN/0152/07)

Irit W. added this on Apr 17, 2022
Completed Apr 26, 2022 by Irit W.

Assigned to: Martin L.
Notes: Words are placed out of order in Block Properties and HTML. This is happening throughout the PDF-to-HTML conversion for UN/0152/06 and UN/0152/07.

json file

UN015206-20220417174939.json 225 KB • Download

PDF, TITLE first page

image.png 22.6 KB • Download

image.png 18 KB • Download

***

PDF para 1

image.png 98.6 KB • Download

image.png 27 KB • Download

***

PDF para 17(iv)

image.png 76.6 KB • Download

image.png 27.2 KB • Download

Comments & Events

Irit Weinfeld

Martin,

UN/0152/07 is currently a work-in-progress. If you want to see if the same scrambling of words is happening for you as well, you can try starting at paragraph 32.
If you would like, you can delete paragraph 32 that I entered. I saved all files here, including up-to-date json files:
Desktop Converter files - Tologix - Desktop Converter

Thanks,
Irit

Apr 19, 2022 at 3:40 PM Notified 1 person

Martin Laporte, CTO

Irit,

There seems to be a bug with the component we use to extract text from the PDF.
I will contact their Support and report back.

Can I share the PDF file with them?

-Martin

Apr 19, 2022 at 4:14 PM Notified 1 person

Irit Weinfeld

Yes, absolutely. The PDF is saved here:
UN/152/07 (WIP) - Tologix - Desktop Converter

Apr 19, 2022 at 4:21 PM Notified 1 person

Irit Weinfeld

I was wondering if the issue was document-specific, as the PDF is not the best quality.

Apr 19, 2022 at 4:21 PM Notified 1 person

Martin Laporte, CTO

I was wondering the same thing, but if you open the PDF in Acrobat, select the text and copy, then paste in Notepad, you will see that the text is pasted in the correct order.
This leads me to believe that the issue is with the component we use.

I'll open a ticket with them and report back shortly.

Apr 19, 2022 at 4:27 PM Notified 1 person

Martin Laporte, CTO

Irit, this is fixed in the latest version.
There is a new "OCR Mode" option in the main menu, under Document:

image.png 13.1 KB • Download

It will always default to Basic (which is the option we've been using). But when dealing with more "difficult" documents (such as ones that have obviously been scanned), you can switch to Advanced.

This will trigger an automatic reload of the PDF (your project will be saved and reloaded).

Note that the Advanced option is much slower to load the PDF, as it does a deep OCR analysis of the document. A document that takes 10 seconds to load with Basic might take several minutes to load with Advanced. You can follow the progress in the status bar at the bottom, which will say "Document Ready" once the analysis is complete.

image.png 19 KB • Download

But the end result from my testing is that the accuracy (and order) of the text is much better.

Thanks,
-Martin

Apr 21, 2022 at 4:50 PM Notified 1 person

Irit Weinfeld

Thank you, Martin. I will let you know when I try this out next week.

Irit

Apr 23, 2022 at 12:12 AM Notified 1 person

Irit Weinfeld

Martin,

This works! I'm marking this to-do as complete.

Thanks,
Irit

Apr 26, 2022 at 5:03 PM Notified 1 person

Irit Weinfeld completed this to-do.

Apr 26, 2022 at 5:03 PM