✔ PDF to HTML Conversions
Completed by Morgan M.
- Assigned to: Anil V., Devaang B., Morgan M., Ryan K.
- Due on
- Notes
-
Further to Morgan, Devaang and Prerak's conversation, we are going to address the PDF to HTML conversion problem through the following processes.
P3 Prototype:
It is imperative that we have good quality HTML documents available on the system when the content analysis process begins on September 18th. So far, the automated PDF to HTML converter has had trouble producing high quality HTML documents, because some of the source PDF documents are poor quality. As a result, we will need resources to intervene and correct deficiencies in the HTML after the conversion is complete. Next steps are the following:
- When admin site development is complete, Anil will wipe all the data from the system
- Morgan will upload and convert all relevant documents required for the prototype
- Anil and Devaang will organize resources to examine and correct deficiencies in the uploaded HTML documents
It's important to note that the 14 September 2018 deadline is a hard deadline, because we are scheduling the launch of the prototype in conjunction with a conference that we are attending on November 5-6, and we need at least 5 weeks to perform the content analysis.
Post-Prototype
We need to automate the PDF to HTML process as much as possible to make the P3LG and new ISLG projects feasible. As a result, Prerak, Devaang and Anil will start a project to develop a more sophisticated system for automated PDF to HTML conversion that will produce better quality documents and identify possible deficiencies in documents. Manual intervention will probably be required in certain circumstances; however, the system will minimize interventions and make them more efficient. It is understood that this is a longer-term project that will require more R&D.
Please add any comments or questions to this to-do, and we'll get the process started as soon as development of the admin site for the prototype is complete (front-end development can continue while content analysis is being performed).
I had discussion with
Admin development: Instead of waiting for admin development to finish before uploading new data to the system, it would be good to do both activities in parallel (i.e. admin development and data upload). Admin development will not affect the data, since we are assuming that data will currently be uploaded only for HTML repair, and that task assignment (for editing, tagging and review) will start after Sept 14. Please let us know your thoughts.
Thanks,
Morgan
We have cleared all data from p3lg. You can create a test folder that we will remove later.
I'm getting a 404 at http://p3lg.tologix.com/ right now. Could you please take a look?
Ryan
Please check again.
The data wipe looks good. I've renamed the folder Ryan created to "Testing Folder". Please upload any testing documents into this folder, and we'll create a separate folder with all the prototype documents when
Thanks,
Morgan
All the relevant P3 prototype documents and projects have been uploaded to the admin site under the P3 Prototype Documents folder. I've gone through all the relevant HTMLs, and added column P to the Master List spreadsheet under the Document List tab to provide a brief description of what work needs to be done on each document: https://islg.egnyte.com/dl/GxO0zehOLo. Fortunately, most of the documents are in decent condition, and only require formatting changes. Others also require removing images, and then 6 documents require a full review to ensure the accuracy of the converted text.
For the formatting changes, please ensure that all margins, indenting and fonts are consistent across all the documents. Also, assuming it doesn't require a significant amount of time, it would be ideal if all the documents had a functioning table of contents.
For the images, all non-text should be removed from the documents. This includes logos and signatures. You can maintain tables, but ensure there is consistent formatting.
For text quality checks, these are documents where the original PDF was poor quality, so the accuracy of the converted text is suspect and needs review and correction. I suspect these documents will require a significant amount of time, but thankfully this is limited to 6 documents.
Let me know if you have any questions or need further clarification on any of the above.
As I've stated previously, we'd like to have all these documents ready to go in excellent quality by September 15th. I suggest targeting September 10th as the due date to ensure we have time to address any issues after the process is complete.
Thanks,
Morgan
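The image-removal step described above could be partly automated before manual review. The following is only a minimal sketch, assuming the converted files are plain HTML in which logos and signatures appear as `<img>` tags. The function name `strip_images` is hypothetical and not part of the existing converter.

```python
import re

# Matches any <img ...> tag, including self-closing forms, case-insensitively.
# Assumption: no attribute value contains a literal ">" character.
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)


def strip_images(html: str) -> str:
    """Return the HTML with every <img> tag removed; tables and text are untouched."""
    return IMG_TAG.sub("", html)
```

Tables are left in place, so consistent table formatting would still need a separate pass.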
We have formatted 2 files as follows:
BC/0001/0014: We need to create such files manually from scratch, because the OCR output from the Adobe Pro API is not good, so they require more effort. There are 6 such files in total, as you mentioned.
URL:http://p3lg.tologix.com/formattedfiles/BC-0001-0014%20-%20Canada%20Line%20-%20Concession%20Agreement%20-%20Schedule%2013.html
ON/0001/0020: We have only formatted this file, so it required less effort compared to the one above.
URL:http://p3lg.tologix.com/formattedfiles/ON-0001-0020%20-%20Highway%20407%20East%20Phase%201%20-%20Project%20Agreement%20-%20Schedule%2023.html
Please review both and let us know your feedback so that we can continue with this method.
The sample documents above look pretty good. However, I have a few comments on fixes, and considerations for the rest of the documents.
BC/0001/0014:
Morgan
I am not able to find the Aktiv font anywhere. Can you please provide it?
The Aktiv font is provided by typekit. You will need to add this into the <head> of your documents.
This will need to be applied to both the entire site <head> and the document iframe <head>.
<link rel="stylesheet" href="https://use.typekit.net/zcm8ovf.css">
and then set the font in CSS.
font-family: "Aktiv Grotesk Extended", "Helvetica Neue", "Arial", sans-serif;
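Since the link has to go into the <head> of every document, the step could be scripted rather than done by hand. This is a minimal sketch only, assuming each file is read as a string and contains a closing `</head>` tag. The helper name `add_typekit_link` is hypothetical.

```python
import re

# The stylesheet link quoted in the comment above.
TYPEKIT_HREF = "https://use.typekit.net/zcm8ovf.css"
TYPEKIT_LINK = f'<link rel="stylesheet" href="{TYPEKIT_HREF}">'

HEAD_CLOSE = re.compile(r"</head\s*>", re.IGNORECASE)


def add_typekit_link(html: str) -> str:
    """Insert the Typekit link just before </head>, unless it is already there."""
    if TYPEKIT_HREF in html:
        return html  # already linked, so leave the document unchanged
    match = HEAD_CLOSE.search(html)
    if match is None:
        raise ValueError("document has no </head> tag")
    return html[:match.start()] + TYPEKIT_LINK + "\n" + html[match.start():]
```

The same function would be run on both the site page and the document loaded in the iframe, since each has its own <head>.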
Further to the latest sample of PDF to HTML converted documents: http://p3lg.tologix.com/htmlfiles.zip, my comments are outlined in the video below and are as follows:
Morgan
It looks like you are viewing these documents directly from the zip file, which is causing the formatting issues in BC-0001-0008 and BC-0001-0014. Can you please unzip the folder and check again?
We have resolved indentation issues in the attached file. Please review and let me know if it works.
You're right. Extracting the documents has fixed the issues with BC-0001-0008 and BC-0001-0014. However, further to the screenshot below, I'm still seeing some indent issues throughout the document:
I also noticed that spaces are getting added between the " and the defined terms. Let's ensure this doesn't occur across the document collection:
Thanks,
Morgan
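The stray spaces between quotation marks and defined terms could be checked for across the collection with a small script. This is a rough sketch under stated assumptions: the documents use straight double quotes, quotes are balanced within a line, and the function runs on extracted text rather than raw markup (attribute values also use quotes). The name `tighten_defined_terms` is hypothetical.

```python
import re

# A straight-quoted span with optional spaces or tabs just inside either quote mark.
PADDED_QUOTE = re.compile(r'"[ \t]*([^"\n]*?)[ \t]*"')


def tighten_defined_terms(text: str) -> str:
    """Remove padding between quote marks and the quoted term."""
    return PADDED_QUOTE.sub(r'"\1"', text)


print(tighten_defined_terms('the " Project Agreement" means'))
```

Unbalanced quotes on a line would be paired incorrectly, so the output should be spot-checked rather than trusted blindly.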
During HTML conversion of the file "BC-0003-0009 - Fort St John Hospital - Project Ageement - Schedule 8", we found odd data on pages 23 to 34 that is not very readable. For now we have left it out of the HTML conversion. Hope this is fine.
Yes, tables like this can be omitted from the HTML document. However, perhaps there should be something that indicates that a section has been omitted:
APPENDIX 8A
FUNCTIONAL UNITS, UNIT DEDUCTION AMOUNTS, RECTIFICATION PERIODS
[see original PDF document]
Thanks,
Morgan
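To keep the omitted-section marker consistent across documents, the placeholder could be generated rather than typed each time. This is a minimal sketch, assuming each title line and the note are wrapped in their own `<p>` element. The markup choice and the name `omitted_section` are assumptions, not something the converter already does.

```python
from html import escape


def omitted_section(title_lines, note="[see original PDF document]"):
    """Build an HTML fragment: one paragraph per title line, then the note."""
    parts = [f"<p>{escape(line)}</p>" for line in title_lines]
    parts.append(f"<p>{escape(note)}</p>")
    return "\n".join(parts)


print(omitted_section([
    "APPENDIX 8A",
    "FUNCTIONAL UNITS, UNIT DEDUCTION AMOUNTS, RECTIFICATION PERIODS",
]))
```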
For formulas like the one in the screenshot, we need to use "/" for division instead of the horizontal line shown in the screenshot below, as such a horizontal line cannot be set using HTML. Hope this is fine.
Thanks,
Morgan
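One thing to watch when flattening a fraction to "/" is operator precedence: a multi-term numerator or denominator needs parentheses to keep the original meaning. This is a small sketch of that rule. The function name `inline_fraction` and the example symbols are hypothetical.

```python
import re

# A "simple" operand is a single name or number, e.g. "A" or "12.5".
SIMPLE_OPERAND = re.compile(r"[\w.]+")


def inline_fraction(numerator: str, denominator: str) -> str:
    """Write a stacked fraction on one line, parenthesizing compound operands."""
    def wrap(expr: str) -> str:
        expr = expr.strip()
        return expr if SIMPLE_OPERAND.fullmatch(expr) else f"({expr})"

    return f"{wrap(numerator)} / {wrap(denominator)}"


print(inline_fraction("A + B", "C"))
```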
We have completed all the HTML files; however, we are still working on the final review and QC.
Please review and provide feedback on the 9 files below, for which the final review is done at our end.
http://p3lg.tologix.com/final/BC-0001-0008/BC-0001-0008.html
http://p3lg.tologix.com/final/BC-0002-0011/BC-0002-0011.html
http://p3lg.tologix.com/final/BC-0002-0023/BC-0002-0023.html
http://p3lg.tologix.com/final/BC-0003-0001/BC-0003-0001.html
http://p3lg.tologix.com/final/CO-0001-0015/CO-0001-0015.html
http://p3lg.tologix.com/final/FL-0001-0002/FL-0001-0002.html
http://p3lg.tologix.com/final/MD-0001-0002/MD-0001-0002.html
http://p3lg.tologix.com/final/MD-0001-0039/MD-0001-0039.html
http://p3lg.tologix.com/final/SK-0001-0040/SK-0001-0040.html
I will provide more files for your review tomorrow.
Please note that we haven't linked the TOC yet. We will do this once all the files are complete.
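Linking the TOC could be scripted once headings carry anchors. This is a minimal sketch, assuming section headings are `<h1>` to `<h3>` elements with `id` attributes. That structure, and the names `HeadingCollector` and `build_toc`, are assumptions about the converted files, not a description of them.

```python
from html import escape
from html.parser import HTMLParser

HEADING_TAGS = ("h1", "h2", "h3")


class HeadingCollector(HTMLParser):
    """Collect (id, text) for every h1-h3 element that carries an id."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._current_id = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._current_id = dict(attrs).get("id")  # None if there is no id
            self._text = []

    def handle_data(self, data):
        if self._current_id is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._current_id is not None:
            self.headings.append((self._current_id, "".join(self._text).strip()))
            self._current_id = None


def build_toc(html: str) -> str:
    """Return a <ul> of links, one per heading that has an id."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    items = [
        f'<li><a href="#{escape(hid)}">{escape(text)}</a></li>'
        for hid, text in collector.headings
    ]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```

Headings without an `id` are skipped, so a missing TOC entry is a quick signal that a heading still needs an anchor.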
These documents are looking really good. I found one issue with a redaction in BC-0002-0011. The boxes referring to Section 18 can be omitted, but we still want to indicate that text has been redacted.
Thanks,
Morgan
Please advise whether the word [REDACTED] will work for redactions, or whether we need to use a black background.
Let's use [REDACTED]. Also, if we can do this across other documents as well, that would be ideal.
Thanks,
Morgan
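Applying [REDACTED] across the other documents could be done with a search-and-replace pass. This is a sketch only: it assumes redaction boxes are emitted as `<span class="redaction-box">...</span>`, which is a hypothetical class name. The actual markup in the converted files would need to be checked first.

```python
import re

# Hypothetical markup: redaction boxes are assumed to be non-nested spans
# with class "redaction-box"; the real files may use something else.
REDACTION_BOX = re.compile(r'<span class="redaction-box">.*?</span>', re.DOTALL)


def mark_redactions(html: str) -> str:
    """Replace each assumed redaction box with the literal text [REDACTED]."""
    return REDACTION_BOX.sub("[REDACTED]", html)


print(mark_redactions('<p>Fee: <span class="redaction-box">&nbsp;</span> per year</p>'))
```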
Our team worked today to complete the design review and QC for the remaining files, and has completed the following 13 files for your review. We have also integrated the TOC and [Redacted] in these files. Please review and let us know your feedback.
http://p3lg.tologix.com/final/BC-0002-0001/BC-0002-0001.html
http://p3lg.tologix.com/final/BC-0002-0002/BC-0002-0002.html
http://p3lg.tologix.com/final/BC-0003-0009/BC-0003-0009.html
http://p3lg.tologix.com/final/CA-0001-0002/CA-0001-0002.html
http://p3lg.tologix.com/final/CA-0001-0014/CA-0001-0014.html
http://p3lg.tologix.com/final/FL-0001-0001/FL-0001-0001.html
http://p3lg.tologix.com/final/FL-0001-0018/FL-0001-0018.html
http://p3lg.tologix.com/final/ON-0001-0002/ON-0001-0002.html
http://p3lg.tologix.com/final/ON-0001-0015/ON-0001-0015.html
http://p3lg.tologix.com/final/ON-0001-0020/ON-0001-0020.html
http://p3lg.tologix.com/final/SK-0001-0001/SK-0001-0001.html
http://p3lg.tologix.com/final/SK-0001-0002/SK-0001-0002.html
http://p3lg.tologix.com/final/BC-0001-0014/BC-0001-0014.html
For the remaining 6 files, HTML conversion and design review are already complete, but those files are lengthy and contain many typos, so QC is taking more time. We hope that all 6 remaining files will be ready for your review on Monday.
The documents above look great. However, there are a couple of issues:
Morgan
We have uploaded the following 5 files for your review. There is still one file under QC (BC-0001-0001) with many typos, which will be completed by tomorrow.
http://p3lg.tologix.com/final/CA-0001-0001/CA-0001-0001.html
http://p3lg.tologix.com/final/AB-0001-0001/AB-0001-0001.html
http://p3lg.tologix.com/final/CO-0001-0001/CO-0001-0001.html
http://p3lg.tologix.com/final/ON-0001-0001/ON-0001-0001.html
http://p3lg.tologix.com/final/MD-0001-0001/MD-0001-0001.html
We have also resolved the above issues for BC-0003-0009 and FL-0001-0018.
Please review and let us know your feedback.
BC-0003-0009 and FL-0001-0018 look great, and so do all the other documents in your comment above.
Thanks,
Morgan
Our QC team has just completed the review for BC-0001-0001. We are still working on the TOC, but in the meantime you can review it and provide your feedback.
http://p3lg.tologix.com/final/BC-0001-0001/BC-0001-0001.html
BC-0001-0001 looks good. Assuming the TOC is created, is this the last one? If that's the case, let's get these documents on the system in the Prototype to replace all the HTML documents currently in the P3 Prototype Documents folder. Please clean out any testing data (projects, documents, values) that are not relevant to these specific documents. The prototype needs to be in a presentable state for content analysis.
Also, once we get the production environment cleaned up and ready for content analysis, we need to sync the content with the development environment, and cease all development work/testing directly on the production environment. All work and testing will be performed on the development environment, and then migrated to production through scheduled migrations (similar to ISLG).
Thanks,
Morgan
The TOC is created for BC-0001-0001. We have replaced the files in the P3 Prototype Documents folder with the latest HTML files and cleaned the testing data on p3lg.
We have also synced both environments. Please check and let us know if anything needs attention.
BC-0001-0001 looks great, as do all the other documents within the prototype. We're ready to proceed with content analysis.
I'll mark this to-do complete, and we'll deal with the broader PDF to HTML conversion issue through other discussions.
Thank you and the rest of the team for getting this done.
Morgan