TOLOGIX - Infrastructure LawGuide (ILG)

PDF to HTML Conversions

Assigned to
Anil Vaghela, Devaang Bhatt, Morgan Maguire (CEO), Ryan Knuth (Customer Support Manager at Industrial)
Due on
Notes
Further to the conversation between Morgan Maguire (CEO), Devaang Bhatt and Prerak Shah (Executive Director & Jt CEO at DevIT), we are going to address the PDF to HTML conversion problem through the following processes.

P3 Prototype:

It is imperative that we have good-quality HTML documents available on the system when the content analysis process begins on September 18th. So far, the automated PDF to HTML converter has had trouble producing high-quality HTML documents, because some of the source PDF documents are of poor quality. As a result, we will need resources to intervene and correct deficiencies in the HTML after the conversion is complete. The next steps are:
  • When admin site development is complete, Anil Vaghela will wipe all the data from the system
  • Morgan Maguire will upload and convert all relevant documents required for the prototype
  • Anil Vaghela and Devaang Bhatt will organize resources to examine and correct deficiencies in the uploaded HTML documents
It's important to note that the 14 September 2018 deadline is a hard deadline, because we are scheduling the launch of the prototype in conjunction with a conference that we are attending on November 5-6, and we need at least 5 weeks to perform the content analysis.

Post-Prototype

We need to automate the PDF to HTML process as much as possible to make the P3LG and new ISLG projects feasible. As a result, Prerak Shah, Devaang Bhatt and Anil Vaghela will start a project to develop a more sophisticated system for automated PDF to HTML conversion that produces better-quality documents and identifies possible deficiencies in them. Manual intervention will probably still be required in certain circumstances; however, the system will minimize interventions and make them more efficient. It is understood that this is a longer-term project that will require more R&D.

Please add any comments or questions to this to-do, and we'll get the process started as soon as development of the admin site for the prototype is complete (front-end development can continue while content analysis is being performed).

Comments & Events

Anil Vaghela
Hello Morgan,

I had a discussion with Devaang regarding your call. During that call, you discussed shortlisting the documents from the given dataset that require manual intervention. Please send us those shortlisted PDF files so that our HTML team can start working on them.

Admin development: Instead of waiting for admin development to finish before uploading new data to the system, it would be good to run both activities (admin development and data upload) in parallel. Admin development will not affect the data, because we are assuming that data will currently be uploaded only for HTML repair, and that task assignment (for editing, tagging and review) will start after Sept 14. Please let us know your thoughts.
 
Morgan Maguire, CEO
Sounds good, Anil. I'll organize things with Irit Weinfeld when she is back from holidays on Thursday. In the meantime, could we wipe all the data from the prototype so that we can start fresh? Also, everyone, please ensure that you remove any testing data created on the system after you've performed your tests.

Thanks,

Morgan
Anil Vaghela
Hello Morgan,

We have cleared all data from p3lg. You can create a test folder that we will remove later.
Ryan Knuth, Customer Support Manager at Industrial
Hi Anil,

I'm getting a 404 at http://p3lg.tologix.com/ right now. Could you please take a look?

Ryan
Anil Vaghela
Hello Ryan,

Please check again.
Ryan Knuth, Customer Support Manager at Industrial
Thanks, Anil. It's back up for me.
Morgan Maguire, CEO
Hi Anil and Ryan,

The data wipe looks good. I've renamed the folder Ryan created to "Testing Folder". Please upload any testing documents into this folder, and we'll create a separate folder with all the prototype documents when Irit Weinfeld is back on Thursday.

Thanks,

Morgan
Ryan Knuth, Customer Support Manager at Industrial
Sounds good, thanks Morgan.
Morgan Maguire, CEO
Hi Anil and Devaang,

All the relevant P3 prototype documents and projects have been uploaded to the admin site under the P3 Prototype Documents folder. I've gone through all the relevant HTMLs, and added column P to the Master List spreadsheet under the Document List tab to provide a brief description of what work needs to be done on each document: https://islg.egnyte.com/dl/GxO0zehOLo. Fortunately, most of the documents are in decent condition, and only require formatting changes. Others also require removing images, and then 6 documents require a full review to ensure the accuracy of the converted text. 

For the formatting changes, please ensure that all margins, indenting and fonts are consistent across all the documents. Also, assuming it doesn't require a significant amount of time, it would be ideal if all the documents had a functioning table of contents.

For the images, all non-text should be removed from the documents. This includes logos and signatures. You can maintain tables, but ensure there is consistent formatting.
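
The image-removal step could be sketched roughly as follows. This is only a minimal illustration, not the conversion team's actual tooling: the function name `strip_images` is hypothetical, and the pattern assumes that no attribute value contains a literal ">" character. Tables and all other markup are left untouched.

```python
import re

# Matches <img ...> and self-closing <img ... /> tags, case-insensitively.
# Assumption: attribute values never contain a literal ">" character.
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)


def strip_images(html: str) -> str:
    """Return the HTML with every <img> tag removed; other markup is kept."""
    return IMG_TAG.sub("", html)


sample = '<p><img src="logo.png" alt="Logo">Schedule 13</p><table><tr><td>1</td></tr></table>'
print(strip_images(sample))
```

Logos or signatures embedded in other ways (for example as CSS backgrounds) would not be caught by this pattern and would still need a manual check.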

For text quality checks, these are documents where the original PDF was of poor quality, so the accuracy of the converted text is suspect and needs review and correction. I suspect these documents will require a significant amount of time, but thankfully this is limited to 6 documents.

Let me know if you have any questions or need further clarification on any of the above.

As I've stated previously, we'd like to have all these documents ready to go in excellent quality by September 15th. I suggest targeting September 10th as the due date to ensure we have time to address any issues after the process is complete.

Thanks,

Morgan
Anil Vaghela
Hello Morgan,

We have formatted 2 files as follows:

BC/0001/0014: We need to create such files manually from scratch because the OCR output from the Adobe Pro API is poor, so they require more effort. There are 6 such files in total, as you mentioned.
URL: http://p3lg.tologix.com/formattedfiles/BC-0001-0014%20-%20Canada%20Line%20-%20Concession%20Agreement%20-%20Schedule%2013.html

ON/0001/0020: This file only needed formatting, so it required less effort than the one above.
URL: http://p3lg.tologix.com/formattedfiles/ON-0001-0020%20-%20Highway%20407%20East%20Phase%201%20-%20Project%20Agreement%20-%20Schedule%2023.html

Please review both and let us know your feedback so that we can continue with this method.
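
For reference, links of the form shown above can be derived from a file name by percent-encoding it. The sketch below is illustrative only: the helper name `formatted_file_url` is hypothetical, and the base path is simply copied from the sample links in this comment.

```python
from urllib.parse import quote

# Base path copied from the sample links in this comment.
BASE_URL = "http://p3lg.tologix.com/formattedfiles/"


def formatted_file_url(file_name: str) -> str:
    """Percent-encode a file name (spaces become %20) and append it to BASE_URL."""
    return BASE_URL + quote(file_name)


url = formatted_file_url("BC-0001-0014 - Canada Line - Concession Agreement - Schedule 13.html")
print(url)
```

Hyphens and dots are left as-is by `quote`; only characters such as spaces are encoded.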
Morgan Maguire, CEO
Hi Anil,

The sample documents above look pretty good. However, I have a few comments on fixes and considerations for the rest of the documents.

BC/0001/0014:
ON/0001/0020:
Thanks,

Morgan
Anil Vaghela
Hello Morgan and Kevin,

I am not able to find the Aktiv font anywhere. Could you please provide it?
Kevin Andrews, Industrial
Anil,

The Aktiv font is provided by Typekit. You will need to add the following to the <head> of your documents.

It needs to be applied to both the entire site <head> and the document iframe <head>.

<link rel="stylesheet" href="https://use.typekit.net/zcm8ovf.css">

and then set the font in CSS.

font-family: "Aktiv Grotesk Extended", "Helvetica Neue", "Arial", sans-serif;
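
If the link has to be added to many converted documents, a small script could do it. The following is a minimal sketch under stated assumptions rather than a prescribed approach: the helper name `add_typekit_link` is hypothetical, and it assumes each document already contains a closing </head> tag. It would need to be run on both the site template and the document loaded in the iframe.

```python
import re

# The stylesheet link quoted above.
TYPEKIT_LINK = '<link rel="stylesheet" href="https://use.typekit.net/zcm8ovf.css">'
HEAD_CLOSE = re.compile(r"</head\s*>", re.IGNORECASE)


def add_typekit_link(html: str) -> str:
    """Insert the Typekit link just before </head>, unless it is already there."""
    if TYPEKIT_LINK in html:
        return html  # already present, leave the document unchanged
    match = HEAD_CLOSE.search(html)
    if match is None:
        raise ValueError("document has no </head> tag")
    pos = match.start()
    return html[:pos] + TYPEKIT_LINK + "\n" + html[pos:]


doc = "<html><head><title>T</title></head><body></body></html>"
print(add_typekit_link(doc))
```

Running the function twice on the same document leaves it unchanged, so it is safe to re-run over a folder of files.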
Anil Vaghela
OK, thanks Kevin!
Morgan Maguire, CEO
Hi Anil,

Further to the latest sample of PDF to HTML converted documents: http://p3lg.tologix.com/htmlfiles.zip, my comments are outlined in the video below and are as follows:
  • BC-0001-0008 and BC-0001-0014: these appear to be unfinished documents. I'll comment further when they have been completed.
  • BC-0002-0023 (same comments apply to CO-0001-0015 and SK-0001-0040)
Anil Vaghela
Hello Morgan,

It looks like you are viewing these documents directly from the zip file, which is causing the formatting issues in BC-0001-0008 and BC-0001-0014. Could you please unzip the folder and check again?
Morgan Maguire, CEO
Hi Anil,

You're right. Extracting the documents has fixed the issues with BC-0001-0008 and BC-0001-0014. However, further to the screenshot below, I'm still seeing some indent issues throughout the document: 
I also noticed that spaces are getting added between the " and the defined terms. Let's ensure this doesn't occur across the document collection:

Thanks,

Morgan
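
One way to catch the stray spaces inside quoted defined terms across the collection is a small clean-up pass over the document text. The sketch below is illustrative only: `tighten_quotes` is a hypothetical helper, the sample terms are made up, and the straight-quote rule assumes quotes are balanced so that they pair up from left to right. It is meant for text content rather than raw markup.

```python
import re

# Straight quotes: assumes balanced quotes, paired left to right.
STRAIGHT = re.compile(r'"\s*([^"]*?)\s*"')
# Curly quotes: opening and closing marks are distinct characters.
CURLY = re.compile(r"“\s*([^“”]*?)\s*”")


def tighten_quotes(text: str) -> str:
    """Remove whitespace just inside each pair of quotation marks."""
    text = STRAIGHT.sub(r'"\1"', text)
    return CURLY.sub(r"“\1”", text)


print(tighten_quotes('" Concessionaire " means the party'))
print(tighten_quotes("“ Province ” and “ Lands ”"))
```

An unbalanced quotation mark would cause later quotes to pair incorrectly, so any changes should be reviewed before they are saved.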
Anil Vaghela
Hello Morgan,

During HTML conversion of the file "BC-0003-0009 - Fort St John Hospital - Project Ageement - Schedule 8", we found odd data on pages 23 to 34 that is not very readable. For now, we have left it out of the HTML conversion. We hope this is fine.
Morgan Maguire, CEO
Hi Anil,

Yes, tables like this can be omitted from the HTML document. However, perhaps there should be something that indicates that a section has been omitted:

APPENDIX 8A

FUNCTIONAL UNITS, UNIT DEDUCTION AMOUNTS, RECTIFICATION PERIODS

[see original PDF document]
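
A placeholder like the one above could be generated consistently with a small helper. This is a sketch only: the function name `omitted_section` and the choice of <h2>, <h3> and <p> tags are assumptions, not an agreed template.

```python
from html import escape


def omitted_section(title: str, subtitle: str,
                    note: str = "[see original PDF document]") -> str:
    """Build a placeholder block for a section left out of the HTML version."""
    return (
        f"<h2>{escape(title)}</h2>\n"
        f"<h3>{escape(subtitle)}</h3>\n"
        f"<p>{escape(note)}</p>"
    )


print(omitted_section(
    "APPENDIX 8A",
    "FUNCTIONAL UNITS, UNIT DEDUCTION AMOUNTS, RECTIFICATION PERIODS",
))
```

The `note` argument can be changed if a different marker is preferred for a given document.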

Thanks,

Morgan
Anil Vaghela
Hi Morgan,

For formulas like the one in the screenshot below, we need to use "/" for division instead of the horizontal line shown, as we are unable to set such a horizontal line in the HTML. We hope this is fine.

Morgan Maguire, CEO
Hi Anil, yes, that's fine.

Thanks,

Morgan
Anil Vaghela
Morgan Maguire, CEO
Hi Anil,

These documents are looking really good. I found one issue with a redaction in BC-0002-0011. The boxes referring to Section 18 can be omitted, but we still want to indicate that text has been redacted.
Thanks,

Morgan
Anil Vaghela
Hi Morgan,

Please advise whether the word [REDACTED] will work for redactions, or whether we need to keep the black background.
Morgan Maguire, CEO
Hi Anil,

Let's use [REDACTED]. Also, if we can do this across other documents as well, that would be ideal.

Thanks,

Morgan 
Anil Vaghela
OK, thanks Morgan!
Anil Vaghela
Morgan Maguire, CEO
Hi Anil,

The documents above look great. However, there are a couple of issues:
  • BC-0003-0009: For the appendices, there are typos in the table of contents. Also, Appendices 8D to 8F were omitted from the document. I understand that all the information in the applicable tables has been redacted/deleted, but we still need to include the tables with the appropriate [DELETED] indicators to show we didn't miss these aspects of the document.
  • FL-0001-0018: this document is not working: http://p3lg.tologix.com/final/FL-0001-0018/FL-0001-0018.html
Thanks,

Morgan
Anil Vaghela
Hello Morgan,

We have uploaded the following 5 files for your review. One file (BC-0001-0001) is still under QC because it has many typos; it will be completed by tomorrow.

http://p3lg.tologix.com/final/CA-0001-0001/CA-0001-0001.html
http://p3lg.tologix.com/final/AB-0001-0001/AB-0001-0001.html
http://p3lg.tologix.com/final/CO-0001-0001/CO-0001-0001.html
http://p3lg.tologix.com/final/ON-0001-0001/ON-0001-0001.html
http://p3lg.tologix.com/final/MD-0001-0001/MD-0001-0001.html

We have also resolved the above issues for BC-0003-0009 and FL-0001-0018.

Please review and let us know your feedback.
Morgan Maguire, CEO
Hi Anil,

BC-0003-0009 and FL-0001-0018 look great, and so do all the other documents in your comment above.

Thanks,

Morgan

 
Morgan Maguire, CEO
Hi Anil,

BC-0001-0001 looks good. Assuming the TOC is created, is this the last one? If that's the case, let's get these documents onto the system in the Prototype to replace all the HTML documents currently in the P3 Prototype Documents folder. Please clean out any testing data (projects, documents, values) that is not relevant to these specific documents. The prototype needs to be in a presentable state for content analysis.

Also, once we get the production environment cleaned up and ready for content analysis, we need to sync the content with the development environment, and cease all development work/testing directly on the production environment. All work and testing will be performed on the development environment, and then migrated to production through scheduled migrations (similar to ISLG).

Thanks,

Morgan
Anil Vaghela
Hello Morgan,

The TOC has been created for BC-0001-0001. We have replaced all the HTML files in the P3 Prototype Documents folder with the latest versions and cleaned the testing data on p3lg.

We have also synced both environments. Please check and let us know if anything needs attention.
Morgan Maguire, CEO
Hi Anil,

BC-0001-0001 looks great, as do all the other documents within the prototype. We're ready to proceed with content analysis.

I'll mark this to-do complete, and we'll deal with the broader PDF to HTML conversion issue through other discussions. 

Thank you and the rest of the team for getting this done.

Morgan 
Morgan Maguire, CEO
Morgan Maguire completed this to-do.