TOLOGIX - ISLG App Rebuild

HTML Conversion Problem

Hello all,

Further to my call with Devaang Bhatt this morning, we need to address a major issue that will affect both the ISLG rebuild and P3LG. A major assumption of both projects is that we are able to automate the process of converting PDF documents to consistent, high-quality HTML documents. However, so far, we have been unable to come up with a solution to this problem. The automated converter for the P3 prototype has failed to produce documents of sufficient quality, and we had to resort to an offline, manual process to produce the documents necessary for the prototype. This process was a temporary solution to get the prototype ready by November 1st, but it's not a viable solution for ISLG and P3LG, which will require converting 500,000 to 1 million pages of PDFs to HTML for both products.

We need to have an in-depth discussion on the issue to map the way forward. I will be in Europe for meetings with ISLG clients next week, but let's schedule a call with the DevIT and Industrial teams between my meetings. Please provide your availability between 1pm and 4pm Paris time next week.

I can't overstate the importance of solving this problem. Failing to find a viable method for automating the PDF to HTML conversion process will dramatically alter the design of the ISLG rebuild, and could scuttle P3LG entirely.

Thanks,

Morgan

 

Comments & Events

Morgan Maguire, CEO
Hello all,

Following up on my message above, Devaang, Anil and Stephen, could you please confirm your availability for a call to discuss the PDF to HTML issue during one of the following times:

Wednesday, September 26th 13:30-14:30 London time
Thursday, September 27th 13:30-14:30 Paris time
Thursday, September 27th 16:30-19:00 Paris time
Friday, September 28th 12:00-14:30 Paris time
Friday, September 28th 16:30-19:00 Paris time

Stephen, please send out meeting details once we've confirmed a time for the call.

Thanks,

Morgan 
Stephen Ceresia, Industrial
Hi Morgan,

The following times would be better for us:

  • Wednesday, Sept. 26th: 14:00 London time (GMT+1) / 9:00 Ottawa time (GMT-4)
  • Thursday, Sept. 27th: 17:00 Paris time (GMT+2) / 11:00 Ottawa time (GMT-4)

Devaang and Anil, let me know if either of these works for you and I'll send out the invite.

Thanks,
Stephen
Morgan Maguire, CEO
Hi Stephen, I think you might have your time zones mixed up. 13:30 London time is 8:30am Ottawa time.

Morgan 
Stephen Ceresia, Industrial
Hi Morgan - sorry, let me try that again. (Times edited in my previous comment)
Morgan Maguire, CEO
Great. Thanks, Stephen.

Devaang and Anil, let us know if either of those times works for you.

Morgan
Anil Vaghela
Hello Morgan and Stephen,

Wednesday, Sept. 26th: 14:00 London time (GMT+1) / 9:00 Ottawa time (GMT-4) would be fine for us.
Morgan Maguire, CEO
Hi Anil and Stephen,

Apologies, but I now have a client meeting 13:30-14:30 London time (GMT+1). Anil, can you make the Thursday, Sept. 27th call at 17:00 Paris time (GMT+2) work?

Thanks,

Morgan 
Stephen Ceresia, Industrial
Hi Morgan,

That works for us. Anil, please confirm if that's OK for you and I'll update the invite.

Thanks,
Stephen
Anil Vaghela
Hello Morgan and Stephen,

Thursday, Sept. 27th: 17:00 Paris time would work for us. 
Morgan Maguire, CEO
Hello all,

As I’ve mentioned above, it is urgent that we resolve this problem to ensure the ISLG and P3LG projects can proceed as planned. We cannot afford any delay in solving it. As a result, in advance of the call on Thursday, work needs to be done to ensure everyone understands the issues and is prepared to put forward viable solutions. The call is not meant for a general discussion of the issues. We are having the call to finalize the plan for implementing a solution. To get this discussion started, I have created a number of questions below to flesh out the issues. Please respond to these questions in comments, and we’ll work towards a solution on Thursday.

Please provide details on how the current PDF to HTML converter is set up in the P3 prototype (Anil):
  • What software is being used to perform the conversions?
  • What options are available for adjusting the output of the conversions (e.g., can we ensure all conversions are produced in accordance with the Brand Identity Guide)?
What other options are available for PDF to HTML converters?
  • What other OCR software options are available (e.g., Nuance, ABBYY)?
    • Do they produce better quality output than the current converter?
    • Do they allow us to manipulate the output to conform to our style requirements?
    • Do they allow us to automate the generation of the Table of Contents?
  • Can we use cloud-based OCR APIs (e.g., Google Cloud Vision, Microsoft Cognitive Services, Amazon AWS)? Note that I came across this article comparing these different APIs: https://dataturks.com/blog/compare-image-text-recognition-apis.php
    • Do they produce better quality output than the current converter?
    • Do they allow us to manipulate the output to conform to our style requirements?
    • Do they allow us to automate the generation of the ToC?  
How can we automatically identify which documents require manual intervention in the PDF to HTML conversion?
  • PDFs in ISLG and P3LG have varying degrees of quality. Assuming low quality documents will require some level of manual intervention, what automated systems can be developed to assess the quality of a set of PDF documents in advance of conversion?
    • Is it possible to develop an application that can assess the quality of the PDF, and determine whether manual intervention is necessary to produce a high quality HTML document?
  • This aspect is important, because it will allow us to assess the overall quality of the document collection, and determine what percentage requires manual intervention. It is not viable for us to individually assess the quality of all 5300 documents on ISLG. 
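As a rough illustration of what such an automated assessment could look like, here is a minimal Python sketch. It assumes the text has already been extracted from each PDF by some other tool, and it scores how much of that text looks like ordinary words or numbers. The function names, the word patterns and the 0.8 threshold are all hypothetical choices for illustration, not part of any existing converter, and a real system would likely need additional signals.

```python
import re

# Hypothetical patterns: ordinary words (with optional apostrophes/hyphens) and numbers.
WORD_RE = re.compile(r"^[A-Za-z]+(?:['-][A-Za-z]+)*$")
NUMBER_RE = re.compile(r"^\d+(?:[.,]\d+)*$")
VOWELS = set("aeiouyAEIOUY")
EDGE_PUNCTUATION = ".,;:!?()[]\"'"


def looks_plausible(token: str) -> bool:
    """Return True if a token looks like a normal word or a number."""
    core = token.strip(EDGE_PUNCTUATION)
    if not core:
        return False
    if NUMBER_RE.match(core):
        return True
    return bool(WORD_RE.match(core)) and any(c in VOWELS for c in core)


def text_quality_score(text: str) -> float:
    """Fraction of whitespace-separated tokens that look plausible (0.0 to 1.0)."""
    tokens = text.split()
    if not tokens:
        # No text at all, e.g. a scanned PDF with no text layer.
        return 0.0
    return sum(1 for t in tokens if looks_plausible(t)) / len(tokens)


def needs_manual_review(text: str, threshold: float = 0.8) -> bool:
    """Flag a document for manual intervention when its score is below the threshold."""
    return text_quality_score(text) < threshold
```

Running this over the extracted text of every document would give a first estimate of what percentage of the collection falls below the threshold, without anyone having to open each file.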
Please provide your input in comments below.

Thanks,

Morgan
Morgan Maguire, CEO
In addition to the above, here are some details on the cloud based OCR APIs:
I also found this post, discussing a project tackling a problem similar to ours: https://aws.amazon.com/blogs/publicsector/serverless-optical-character-recognition-in-support-of-nasa-astronaut-safety/

Morgan
Devaang Bhatt
Hello all,

I will let Anil provide you with information separately on how the PDF to HTML conversion was achieved for the P3 PoC. This was more of a manual effort than an automated one.

But today I had a brainstorming session with Anil and a few other members of my team on how to get as close as possible to a fully automated solution to this conversion challenge. We did some out-of-the-box thinking, setting aside all existing tools such as Adobe Acrobat Pro. We now have a concept for developing a fully customized automation tool, which we believe can help us overcome this challenge for the long term. The concept is based on the premise that we will define a ‘standard’ HTML structure for P3 or the ISLG rebuild. All PDFs, irrespective of how they are formatted, will eventually be converted to this ‘standard’ HTML structure. There are many advantages to this approach, which I shall explain during our upcoming call this Thursday.

The following is envisaged from this tool.


1. Take a PDF file as an input

2. Extract TEXT from the PDF. We are exploring whether we can develop an algorithm which extracts the TEXT and reasonably maintains the formatting too.

3. The tool will then take this TEXT as an input and tag it based on the ‘standard’ HTML structure that we have defined. The tagging logic will be derived from the knowledge and lessons learnt during the conversions we executed recently. The tool will automatically identify TOC contents and structures and also relate it to the relevant content in the TEXT body. Will explain more on the call.

4. The tool will also provide a lot of QA features such as checking tags, spell check, removing extra spaces, removing tab marks, etc.

5. The end result will be a ‘standard’ HTML file.
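To make steps 3 to 5 a little more concrete, here is a minimal Python sketch of the flow from extracted text to a ‘standard’ HTML file with a table of contents. It assumes step 2 has already produced plain text, with blank lines between blocks. The heading rule (a short block starting with a section number), the class and function names, and the HTML layout are hypothetical placeholders, not the tagging logic the tool would actually use.

```python
import html
import re
from dataclasses import dataclass
from typing import List

# Hypothetical rule: a short block that starts with a section number is a heading.
HEADING_RE = re.compile(r"^\d+(?:\.\d+)*\.?\s+\S")
MAX_HEADING_LENGTH = 80


@dataclass
class Block:
    kind: str  # "heading" or "paragraph"
    text: str


def clean(text: str) -> str:
    """QA step: collapse tabs, line breaks and repeated spaces into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def tag_blocks(raw_text: str) -> List[Block]:
    """Split extracted text on blank lines and tag each block."""
    blocks = []
    for chunk in re.split(r"\n\s*\n", raw_text):
        text = clean(chunk)
        if not text:
            continue
        is_heading = bool(HEADING_RE.match(text)) and len(text) <= MAX_HEADING_LENGTH
        blocks.append(Block("heading" if is_heading else "paragraph", text))
    return blocks


def render_html(blocks: List[Block]) -> str:
    """Render tagged blocks as a table of contents followed by the body."""
    toc, body = [], []
    for index, block in enumerate(blocks):
        text = html.escape(block.text)
        if block.kind == "heading":
            anchor = f"sec-{index}"
            toc.append(f'<li><a href="#{anchor}">{text}</a></li>')
            body.append(f'<h2 id="{anchor}">{text}</h2>')
        else:
            body.append(f"<p>{text}</p>")
    toc_html = "<nav><ol>\n" + "\n".join(toc) + "\n</ol></nav>"
    return toc_html + "\n" + "\n".join(body)
```

For example, `render_html(tag_blocks(extracted_text))` would return one HTML string in which every heading has an anchor and a matching TOC entry. The real tagging rules would come from the lessons learnt during the recent manual conversions.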

The development of such a tool is expected to take 4-5 months using 2 resources, as there is a lot of R & D involved.

I am optimistic about this approach as the tool can be used for both P3 and ISLG going forward. This optimism is based on absolute knowledge of technical possibilities.

Let’s talk more on this.

Best regards,
Devaang Bhatt | Associate Vice President, International Business
Microsoft Specialist, MCP
Morgan Maguire, CEO
Hi Devaang,

Thank you for the note above. This sounds like a promising solution to the problem. Juan Silva and Ryan Knuth (Industrial), please provide your feedback on this proposal.

Devaang, for step 2 in your proposal above, what are the advantages of developing the text extraction algorithm ourselves, rather than relying on a third party application for such purposes? Also, would there be any benefit in developing this algorithm in conjunction with the cloud-based OCR APIs mentioned above?

For steps 3-5, I think we should thoroughly document the parameters for the "standard" HTML file. The standard used in P3 will differ from ISLG, and we may want to introduce other nuances depending on the type of document we're converting. Let's ensure we integrate this into the process.

Thanks,

Morgan
Juan Silva, Industrial
Hi Morgan,

It is a bit difficult to give an opinion without knowing the exact problems you were facing before.

However, from what I have gathered reading some of the Basecamp threads, I understand that the general problem is with the structure of the output HTML from the PDF conversion.

The approach described by Devaang does make sense in that context. As I understand it, the plan is to extract the text components from the source PDF and give them semantic meaning before building a standard HTML output.

I wonder if it would make sense to store those extracted pieces of text with their semantic meaning in a database, instead of just writing them to an HTML output file. This would allow you to have multiple structured HTML templates that you can use to render the same document.

This way the PDF document representation would be a database model, and not just a mere file. This would provide a bit more robustness to the implementation and be ready for more advanced features you may want to implement around the document.
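A minimal sketch of that idea in Python, using an in-memory SQLite database so it runs on its own. The table layout, the role names ("heading", "paragraph") and the two templates are hypothetical and only meant to show how one stored document could be rendered in more than one way.

```python
import html
import sqlite3

# Hypothetical templates: each maps a semantic role to an HTML pattern.
# Roles missing from a template are skipped when rendering.
TEMPLATES = {
    "full": {"heading": "<h2>{}</h2>", "paragraph": "<p>{}</p>"},
    "outline": {"heading": "<li>{}</li>"},
}


def create_store() -> sqlite3.Connection:
    """Create an in-memory database with one row per extracted block."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE blocks (
            doc_id   TEXT    NOT NULL,
            position INTEGER NOT NULL,
            role     TEXT    NOT NULL,
            text     TEXT    NOT NULL,
            PRIMARY KEY (doc_id, position)
        )
        """
    )
    return conn


def save_blocks(conn, doc_id, blocks):
    """Store (role, text) pairs in document order."""
    rows = [(doc_id, i, role, text) for i, (role, text) in enumerate(blocks)]
    conn.executemany("INSERT INTO blocks VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def render(conn, doc_id, template_name):
    """Render a stored document with the chosen template."""
    template = TEMPLATES[template_name]
    rows = conn.execute(
        "SELECT role, text FROM blocks WHERE doc_id = ? ORDER BY position",
        (doc_id,),
    )
    return "\n".join(
        template[role].format(html.escape(text))
        for role, text in rows
        if role in template
    )
```

With this layout, adding a new presentation of a document means adding a template, not re-running the PDF conversion.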

Again that may or may not make sense depending on your plans for the tool in the future.
Morgan Maguire, CEO
Hi Juan,

Thank you for the input above. That sounds like a good idea. Different HTML templates will probably be necessary depending on the type of document we're converting (e.g., we'll want to tag Dispute Documents differently from Treaties & Rules in ISLG).

However, I'm still concerned about the quality of the text extraction from the PDF. Back to my question above, would it be advisable to develop our own application for performing that task, or should we explore the options available through third parties?

Talk to you all soon on the call.

Morgan

P.S. please ensure no one is on speaker phone, because I want to ensure I can understand everyone clearly.
Morgan Maguire, CEO
Hello all,

Following up on our call from last Thursday, we are going to resolve the PDF to HTML conversion issue through implementing the plan above. As a result, next steps are the following:
  • Perform R&D and finalize the solution for PDF text extraction. The solution should have the capacity to accurately extract text from poor quality PDFs (e.g., PDFs that are the product of scanned copies of printed documents). Third party API options vs. an in-house solution will be considered as part of the R&D.
  • Organize and finalize tagging structure for HTML templates. Templates will vary depending on the type of document (e.g., dispute document vs. legal instrument) and product (e.g., ISLG vs. P3LG). Next week, I will provide samples of each document type with descriptions of the document structure.
Devaang, could you please confirm the above? Also, could you please confirm how we're going to staff the project, given that we have previously discussed increasing the number of resources dedicated to ISLG?

Thanks,

Morgan 
Devaang Bhatt
Hi Morgan,

Yes, your brief on the scope of work is in-line with what we discussed and I confirm this.

Regarding the staffing, we would need 2 dedicated resources. We can pull one from the P3 team right now as the work load is less and add one new resource. When we see the tasks on P3 picking up then we can return the P3 resource and replace with a new one then. I am sure this clarifies your staffing question. Yet, do let me know if you need further clarity.

The new resource will start immediately.

Best regards,
Devaang Bhatt | Associate Vice President, International Business
Microsoft Specialist, MCP
Morgan Maguire, CEO
Hi Devaang,

Moving one resource from P3 makes sense given that development work on that platform will be slower until we decide to move forward with a full product (probably at the beginning of next year). However, perhaps we should pull the additional resource from the existing ISLG resources, because we have almost concluded all work needed on the current ISLG site, and those resources should be available for work on this task.

Assuming we are able to resolve this issue quickly, we can then add the additional developer and QA at the beginning of November as we previously discussed when we are ready to start building the framework for the new application.

Please confirm if this makes sense. Thanks,

Morgan
Morgan Maguire, CEO
Hi Devaang,

Further to my previous comment, could you confirm how we're allocating development resources? Also, could you provide us with a timeline for the PDF extraction R&D? I know it's difficult to put a deadline on R&D, but I want to ensure we keep on track with our overall development timeline for the rebuild project.

Thanks,

Morgan
Morgan Maguire, CEO
Hello all,

I'm posting this message on Devaang's behalf because he wasn't able to access Basecamp for some reason:

Hi Morgan,
 
I reviewed the current workloads with Anil and agree with you that we should be able to free up two resources (one each from ISLG and P3) to undertake the PDF to HTML conversion tool development. The R & D is definitely tricky, but I hope to have a fine blueprint by the 2nd or 3rd week of December 2018. This timeline takes into account our annual Diwali Festival vacations from November 7th to November 14th.
 
Will keep you updated on the progress as it happens or every 10 days.
Morgan Maguire, CEO
Hi Devaang,

The timeline on PDF to HTML R&D sounds fine. Devaang and Anil, please keep us updated with progress as it develops.

Please confirm the current allocation of DevIT resources:
  • 2 full-time developers on PDF to HTML R&D (1 from ISLG and 1 from P3)
  • 2 full-time developers on ISLG (1 to be added on November 1st)
  • 1 full-time QA manager on ISLG (November 1st)
  • 1 full-time developer on P3
  • 1 full-time developer on Machine Learning R&D
  • 1 part-time team lead (Anil Vaghela)
If this is correct, we should have a call to discuss next steps on all projects. I want to ensure all resources are kept busy now that maintenance work on ISLG is slowing down, and ensure we are instituting more thorough project management over the development team as it expands.

I propose having a call on Tuesday, October 16th at 8:00am Vancouver time. Devaang, Anil, Ryan, Stefanie Gibson (UX Researcher at Industrial), Melissa (and Stephen if he is available), please confirm you're available and I'll send out call details.

Thanks,

Morgan
  
Melissa Cowell, General Manager at Industrial
Morgan,

Wednesday at 8am Vancouver time works better for the Industrial team.

Mel
Morgan Maguire, CEO
Hi Melissa,

I've got a conflict on the Wednesday and Thursday morning. Ryan, can you make the call, and then brief everyone else?

Thanks,

Morgan
Anil Vaghela
Hello Morgan and Ryan,

I apologize for the late reply. Devaang is in Dubai for GITEX Technology Week and will be back this weekend. Also, we have a holiday on 18th October for the Dussehra festival. Please let us know whether we can organize this meeting next week.
Ryan Knuth, Customer Support Manager at Industrial
Hi Morgan and Anil,

I'm at a conference on the 18th as well and out of office the afternoon of the 17th.

For next week my only unavailability is after 4pm EDT on the 22nd and I'm at another conference on the 25th.

Thanks!

Ryan
Morgan Maguire, CEO
Hi Anil and Ryan,

OK. It looks like scheduling a call might be difficult right now. However, I still need clarification on what the DevIT teams are working on to ensure we are allocating resources effectively. Anil, could you please confirm that resources are allocated as follows:
  • 2 full-time developers on PDF to HTML R&D (1 from ISLG and 1 from P3)
  • 2 full-time developers on ISLG (1 to be added on November 1st)
  • 1 full-time QA manager on ISLG (November 1st)
  • 1 full-time developer on P3
  • 1 full-time developer on Machine Learning R&D
  • 1 part-time team lead (Anil Vaghela)
Also, please provide a detailed description on what each team is working on for the next two weeks.

Let's schedule a call for the week of October 29 - November 1. Devaang, Anil, Ryan, Stephen and Melissa, can you confirm whether you're available for a call on Tuesday, October 30th at 8:00am Vancouver time?

Thanks,

Morgan 
Morgan Maguire, CEO
Hi Anil,

In addition to the above, I suggest examining the current roles for the 2 developers dedicated to ongoing maintenance of ISLG. We have completed all the outstanding to-dos on Basecamp, and the only work required until we start on the new site will be to address the occasional technical issue reported by users (or issues that arise from Ryan's security scan). Therefore, I suggest dedicating these resources as much as possible to supporting the PDF to HTML R&D, because it's critical that we come up with a timely solution to this issue.

Thanks,

Morgan
Anil Vaghela
Hello Morgan,

The above resource allocation seems OK. Regarding the Machine Learning R & D resource, I will discuss with Devaang once he is back and let you know.
  • 2 full-time developers on PDF to HTML R&D (1 from ISLG and 1 from P3) 
    • One developer will be working on R & D to separate low quality PDF files from the list of PDFs.
    • Second developer will be working on R & D to extract text from low quality PDF files.
  • 2 full-time developers on ISLG 
    • One developer will be working on ISLG Rebuild technical architecture design.
    • Another developer will be working on occasional ISLG support tasks and PDF to HTML R & D to format HTML output generated by Adobe Pro SDK.
  • 1 full-time developer on P3 
    • Will be working on issues related to tag excerpts formatting and display vertical lines for highlighted paragraphs. 
  • 1 full-time developer on Machine Learning R&D 
    • Currently the team is researching the Azure ML platform, such as whether the free tier of Azure ML is sufficient, what the limitations of the free tier are, and what we'll need in production.
      Along with this, they are also looking at the component aspect of this research in terms of the Software/Hardware/Service that will be needed.
  • 1 part-time team lead (Anil Vaghela)
    • Will be managing all above work and working on PDF to HTML R & D
Please review above and let me know if you need any further clarification.
Ryan Knuth, Customer Support Manager at Industrial
Hi Morgan,

We're good for Oct. 30 at 8am PDT.

Thanks!

Ryan
Morgan Maguire, CEO
Great. Thanks, Ryan. Anil and Devaang, will Tuesday, October 30th at 8:00am Vancouver time work for you?

Also, Anil, a couple of follow-up questions about the allocation of resources:
  • 2 full-time developers on PDF to HTML R&D (1 from ISLG and 1 from P3) 
  • 2 full-time developers on ISLG 
    • One developer will be working on ISLG Rebuild technical architecture design.
      • Question: I was under the impression that we should finalize the design of the admin site before starting the technical architecture. Why are we changing approaches? Note that the wireframes will be finalized this week, and then we'll start generating the user stories.
    • Another developer will be working on occasional ISLG support tasks and PDF to HTML R & D to format HTML output generated by Adobe Pro SDK.
      • Question: following on my question above, are we examining cloud-based ML solutions to this problem? My impression is that Adobe has produced unsatisfactory results so far.
  • 1 full-time developer on P3 
    • Will be working on issues related to tag excerpts formatting and display vertical lines for highlighted paragraphs. 
      • Question: are we working on producing the admin site search within the tagging application? Not having this complete is hindering our content analysis, and forcing us to bring aspects of the process offline.
  • 1 full-time developer on Machine Learning R&D 
    •  Currently team is doing research on Azure ML platform aspect such as whether Free tier of Azure ML is sufficient or not. What are the limitation of free tier and what we'll need in production.
      Along with this, they are also looking at component aspect of this research in terms of Software/Hardware/Service that will be needed.
      • Question: how is Contegra's auto-search getting integrated into this process?
  • 1 part-time team lead (Anil Vaghela)
    • Will be managing all above work and working on PDF to HTML R & D
Morgan Maguire, CEO
Hi Anil,

Further to a discussion with Ryan on the above, have we considered utilizing the ML developer to help us address the PDF to HTML conversion issue? In particular, can they help us understand the ability to utilize ML cloud applications for text extraction (Google Cloud: https://cloud.google.com/vision/docs/ocr; 
Microsoft Azure: https://azure.microsoft.com/en-ca/services/cognitive-services/computer-vision/; 
Amazon AWS: https://docs.aws.amazon.com/rekognition/latest/dg/text-detection.html)?

Thanks,

Morgan
Anil Vaghela
Hi Morgan,

  • 2 full-time developers on PDF to HTML R&D (1 from ISLG and 1 from P3) 
  • 2 full-time developers on ISLG 
    • One developer will be working on ISLG Rebuild technical architecture design.
      • Question: I was under the impression that we should finalize the design of the admin site before starting the technical architecture. Why are we changing approaches? Note that the wireframes will be finalized this week, and then we'll start generating the user stories.
        • There is no dependency between the admin site design and the technical architecture of the project. The technical architecture is a fundamental piece that we need to work on before starting the actual user stories for the admin design. Please let me know your thoughts.
    • Another developer will be working on occasional ISLG support tasks and PDF to HTML R & D to format HTML output generated by Adobe Pro SDK.
      • Question: following on my question above, are we examining cloud-based ML solutions to this problem? My impression is that Adobe has produced unsatisfactory results so far.
        • Yes, we are also examining cloud-based solutions to extract text from PDF files. In addition to that, we are also doing R & D on other aspects, e.g. 1. Whether we can alter the pre-generated HTML files to produce the expected output using different tools. 2. How to format extracted text into a standard HTML template.
        • Please note that text extraction is not the only issue; the other issue is formatting the extracted text in such a way that we can generate standard HTML files. To achieve this we are exploring other tools too, e.g. Adobe SDK, Soda PDF etc.
  • 1 full-time developer on P3 
    • Will be working on issues related to tag excerpts formatting and display vertical lines for highlighted paragraphs. 
      • Question: are we working on producing the admin site search within the tagging application? Not having this complete is hindering our content analysis, and forcing us to bring aspects of the process offline.
        • We have already completed this and uploaded it to p3lg for your review.
  • 1 full-time developer on Machine Learning R&D 
    •  Currently team is doing research on Azure ML platform aspect such as whether Free tier of Azure ML is sufficient or not. What are the limitation of free tier and what we'll need in production.
      Along with this, they are also looking at component aspect of this research in terms of Software/Hardware/Service that will be needed.
      • Question: how is Contegra's auto-search getting integrated into this process?
        • Once we have Contegra ready on the server we will explore this and will let you know.
Please review above and let me know if you need any further clarification.
Anil Vaghela
Hello Morgan,

30th October 8.00 am would be fine for me. I will ask Devaang once he is available next week.
Morgan Maguire, CEO
Hi Anil,

Thank you for answering my questions above. Everything appears to be on track. I'll send a calendar invite for the call on the 30th, and hopefully Devaang can confirm.

Thanks,

Morgan
Morgan Maguire, CEO
Hello everyone,

I just sent an updated calendar invite for the call tomorrow morning, which includes the following agenda:

Meeting Agenda 
  1. Development Update – Devaang and Anil
    1. Update of latest developments from development team, including:
      1. Description of resource allocation between projects
      2. Progress report on PDF to HTML R&D
      3. Progress report on ISLG Rebuild technical architecture design
      4. Progress report on Machine Learning R&D
  2. Update on ISLG rebuild – Melissa, Ryan and Morgan
    1. Update on latest development for ISLG Rebuild design
    2. Next steps for development team in process
Let me know if there is anything you want to add to the agenda.

Morgan