HTML Conversion Problem
Hello all,
Further to my call with Devaang this morning, we need to address a major issue that will affect both the ISLG rebuild and P3LG. A major assumption of both the ISLG rebuild and P3LG is that we are able to automate the process of converting PDF documents to consistent, high-quality HTML documents. However, so far, we have been unable to come up with a solution to this problem. The automated converter for the P3 prototype has failed to produce documents of sufficient quality, and we were required to resort to an offline, manual process to produce the documents necessary for the prototype. This process was a temporary solution to get the prototype ready by November 1st, but it's not a viable solution for ISLG and P3LG, which will require converting 500,000 to 1 million pages of PDFs to HTML for both products.
We need to have an in-depth discussion on the issue to map the way forward. I will be in Europe for meetings with ISLG clients next week, but let's schedule a call with the DevIT and Industrial teams between my meetings. Please provide your availability between 1pm and 4pm Paris time next week.
I can't overstate the importance of solving this problem. The lack of a viable method for automating the PDF to HTML conversion process would dramatically alter the design of the ISLG rebuild, and could potentially scuttle P3LG entirely.
Thanks,
Morgan
Following up on my message above,
Wednesday, October 26th 13:30-14:30 London time
Thursday, October 27th 13:30-14:30 Paris time
Thursday, October 27th 16:30-19:00 Paris time
Friday, October 28th 12:00-2:30 Paris time
Friday, October 28th 16:30-19:00 Paris time
Thanks,
Morgan
The following times would be better for us:
Thanks,
Stephen
Morgan
Wednesday Sept. 26th: 14:00 London time (GMT+1) / 9:00 Ottawa time (GMT-5) would be fine for us.
Apologies, but I now have a client meeting 13:30-14:30 London time (GMT+1).
Thanks,
Morgan
That works for us.
Thanks,
Stephen
Thursday, Sept. 27th: 17:00 Paris time would work for us.
As I’ve mentioned above, it is urgent that we resolve this problem to ensure the ISLG and P3LG projects can proceed as planned. We cannot afford any delay in solving it. As a result, in advance of the call on Thursday, work needs to be done to ensure everyone understands the issues and is prepared to put forward viable solutions. The call is not meant for a general discussion of the issues. We are having the call to finalize the plan for implementing a solution. To get this discussion started, I have created a number of questions below to flesh out the issues. Please respond to these questions in comments, and we’ll work towards a solution on Thursday.
Please provide details on how the current PDF to HTML converter is set up in the P3 prototype.
Thanks,
Morgan
Morgan
I will let Anil provide you with information separately on how the PDF to HTML conversion was achieved for the P3 PoC. It was comparatively more of a manual effort than an automated one.
However, today I had a brainstorming session with Anil and a few other members of my team on how to get as close as possible to an automated solution to this conversion challenge. We thought outside the box, setting aside all existing tools such as Adobe Acrobat Pro. We now have a concept for developing a fully customized automation tool, which we believe can help us overcome this challenge for the long term. The concept is based on the premise that we will define a ‘standard’ HTML structure for P3 or the ISLG rebuild. All PDFs, irrespective of how they are formatted, will eventually be converted to this ‘standard’ HTML structure. There are many advantages to this approach, which I will explain during our upcoming call this Thursday.
The following is envisaged from this tool.
1. Take a PDF file as an input
2. Extract TEXT from the PDF. We are exploring whether we can develop an algorithm that extracts the TEXT and reasonably maintains the formatting too.
3. The tool will then take this TEXT as an input and tag it based on the ‘standard’ HTML structure that we have defined. The tagging logic will be derived from the knowledge and lessons learnt during the conversions we executed recently. The tool will automatically identify TOC contents and structures and also relate them to the relevant content in the TEXT body. I will explain more on the call.
4. The tool will also provide a lot of QA features such as checking tags, spell checking, removing extra spaces, removing tab marks, etc.
5. The end result will be a ‘standard’ HTML file.
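
Steps 3 to 5 above can be illustrated with a minimal sketch. It is only an illustration under stated assumptions, not the tool described in this thread: it assumes the TEXT has already been extracted as a list of lines (step 2 is not shown), and the heading heuristic, the function names, and the `standard-doc` class name are all hypothetical.

```python
import html
import re

# Hypothetical heuristic: a short line such as "1. Scope" or "1.1 Definitions"
# is treated as a heading. A real tagger would need far richer rules.
HEADING_RE = re.compile(r"^(\d{1,2}(?:\.\d{1,2})*)\.?\s+(\S.*)$")


def clean_line(line: str) -> str:
    """QA step: replace tab marks and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", line.replace("\t", " ")).strip()


def tag_lines(lines):
    """Tagging step: classify each non-empty line as a heading or a paragraph."""
    tagged = []
    for raw in lines:
        text = clean_line(raw)
        if not text:
            continue
        match = HEADING_RE.match(text)
        if match and len(text) <= 80:
            # Heading depth follows the numbering: "1" -> h2, "1.1" -> h3.
            level = min(match.group(1).count(".") + 2, 6)
            tagged.append((f"h{level}", text))
        else:
            tagged.append(("p", text))
    return tagged


def to_standard_html(lines):
    """Output step: render the tagged lines into one fixed HTML structure."""
    body = "\n".join(
        f"<{tag}>{html.escape(text)}</{tag}>" for tag, text in tag_lines(lines)
    )
    return f'<article class="standard-doc">\n{body}\n</article>'
```

Calling `to_standard_html(["1.  Scope\t", "This  treaty\tapplies."])` would yield an `h2` heading followed by a `p` paragraph, with the extra spaces and tab marks removed.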
The development of such a tool is expected to take 4-5 months using 2 resources, as there is a lot of R&D involved.
I am optimistic about this approach as the tool can be used for both P3 and ISLG going forward. This optimism is based on absolute knowledge of technical possibilities.
Let’s talk more on this.
Best regards,
Devaang Bhatt | Associate Vice President, International Business
Microsoft Specialist, MCP
Thank you for the note above. This sounds like a promising solution to the problem.
For steps 3-5, I think we should thoroughly document the parameters for the "standard" HTML file. The standard used in P3 will differ from ISLG, and we may want to introduce other nuances depending on the type of document we're converting. Let's ensure we integrate this into the process.
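
One possible way to document these parameters is as explicit, per-product profiles that the converter reads. The sketch below is an assumption for illustration only: the field names, class names, and document types are hypothetical and are not a structure agreed in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HtmlProfile:
    """Documented parameters of one 'standard' HTML variant (all names hypothetical)."""
    product: str
    document_type: str
    root_class: str
    heading_tags: tuple = ("h1", "h2", "h3")
    extra_tags: tuple = ()


# Hypothetical profiles: one default per product, plus one document-type nuance.
PROFILES = {
    ("P3", "default"): HtmlProfile("P3", "default", "p3-doc"),
    ("ISLG", "default"): HtmlProfile("ISLG", "default", "islg-doc"),
    ("ISLG", "type_a"): HtmlProfile("ISLG", "type_a", "islg-type-a", extra_tags=("section",)),
}


def profile_for(product: str, document_type: str) -> HtmlProfile:
    """Return the specific profile, falling back to the product's default."""
    return PROFILES.get((product, document_type), PROFILES[(product, "default")])
```

Keeping the profiles in one place means a new nuance for a document type becomes a new entry rather than a change to the conversion logic.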
Thanks,
Morgan
It is a bit difficult to give an opinion without knowing the exact problems you were facing before.
However, from what I have gathered reading some of the Basecamp threads, I understand that the general problem is with the structure of the output HTML from the PDF conversion.
The approach described by Devaang does make sense in that context. As I understand it, the plan is to extract the text components from the source PDF and give them semantic meaning before building a standard HTML output.
I wonder if it would make sense to store those extracted pieces of text with their semantic meaning in a database, instead of just writing them to an HTML output file. This would allow you to have multiple structured HTML templates that you can use to render the same document.
This way the PDF document representation would be a database model, and not just a mere file. This would provide a bit more robustness to the implementation and be ready for more advanced features you may want to implement around the document.
Again that may or may not make sense depending on your plans for the tool in the future.
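
To make the database idea concrete, here is a minimal sketch using an in-memory SQLite database. The table layout, role names, template names, and sample content are all hypothetical placeholders chosen for illustration; the point is only that the same stored document can be rendered through more than one template.

```python
import html
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE block (
           doc_id   TEXT NOT NULL,
           position INTEGER NOT NULL,
           role     TEXT NOT NULL,   -- semantic meaning, e.g. 'title' or 'paragraph'
           content  TEXT NOT NULL,
           PRIMARY KEY (doc_id, position)
       )"""
)
conn.executemany(
    "INSERT INTO block VALUES (?, ?, ?, ?)",
    [
        ("doc-1", 1, "title", "Sample Agreement"),
        ("doc-1", 2, "paragraph", "The parties agree as follows."),
    ],
)

# Two hypothetical templates: each maps a semantic role to an HTML tag.
TEMPLATES = {
    "template_a": {"title": "h1", "paragraph": "p"},
    "template_b": {"title": "h2", "paragraph": "div"},
}


def render(doc_id: str, template_name: str) -> str:
    """Render the stored blocks of one document with the chosen template."""
    tags = TEMPLATES[template_name]
    rows = conn.execute(
        "SELECT role, content FROM block WHERE doc_id = ? ORDER BY position",
        (doc_id,),
    )
    return "\n".join(
        f"<{tags[role]}>{html.escape(content)}</{tags[role]}>" for role, content in rows
    )
```

Here `render("doc-1", "template_a")` and `render("doc-1", "template_b")` read the same rows but produce different markup, which is the separation between document model and output file suggested above.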
Thank you for the input above. That sounds like a good idea. Different HTML templates will probably be necessary depending on the type of document we're converting (e.g., we'll want to tag Dispute Documents differently from Treaties & Rules in ISLG).
However, I'm still concerned about the quality of the text extraction from the PDF. Back to my question above, would it be advisable to develop our own application for performing that task, or should we explore the options available through third parties?
Talk to you all soon on the call.
Morgan
P.S. please ensure no one is on speaker phone, because I want to ensure I can understand everyone clearly.
Following up on our call from last Thursday, we are going to resolve the PDF to HTML conversion issue by implementing the plan above. As a result, the next steps are the following:
Thanks,
Morgan
Yes, your brief on the scope of work is in line with what we discussed, and I confirm this.
Regarding staffing, we would need 2 dedicated resources. We can pull one from the P3 team right now, as the workload there is light, and add one new resource. When tasks on P3 pick up, we can return the P3 resource and replace them with a new one. I hope this answers your staffing question; do let me know if you need further clarity.
The new resource will start immediately.
Best regards,
Devaang Bhatt | Associate Vice President, International Business
Microsoft Specialist, MCP
Moving one resource from P3 makes sense given that development work on that platform will be slower until we decide to move forward with a full product (probably at the beginning of next year). However, perhaps we should pull the additional resource from the existing ISLG resources, because we have almost concluded all work needed on the current ISLG site and they should be available for work on this task.
Assuming we are able to resolve this issue quickly, we can then add the additional developer and QA at the beginning of November as we previously discussed when we are ready to start building the framework for the new application.
Please confirm if this makes sense. Thanks,
Morgan
Further to my previous comment, could you confirm how we're allocating development resources? Also, could you provide us with a timeline for the PDF extraction R&D? I know it's difficult to put a deadline on R&D, but I want to ensure we keep on track with our overall development timeline for the rebuild project.
Thanks,
Morgan
I'm posting this message on
Hi Morgan,
The timeline on PDF to HTML R&D sounds fine.
Please confirm the current allocation of DevIT resources:
I propose having a call on Tuesday, October 16th at 8:00am Vancouver time.
Thanks,
Morgan
Wednesday at 8am Vancouver time works better for the Industrial team.
Mel
I've got a conflict on the Wednesday and Thursday morning.
Thanks,
Morgan
I apologize for the late reply. Devaang is in Dubai for GITEX Technology Week and will be back this weekend. Also, we have a holiday on 18th October for the Dussehra festival. Please let us know whether we can organize this meeting next week.
I'm at a conference on the 18th as well and out of office the afternoon of the 17th.
For next week my only unavailability is after 4pm EDT on the 22nd and I'm at another conference on the 25th.
Thanks!
Ryan
OK. It looks like scheduling a call might be difficult right now. However, I still need clarification on what the DevIT teams are working on to ensure we are allocating resources effectively.
Let's schedule a call for the week of October 29 - November 1.
Thanks,
Morgan
In addition to the above, I suggest examining the current roles for the 2 developers dedicated to ongoing maintenance of ISLG. We have completed all the outstanding to-dos on Basecamp, and the only work required until we start on the new site will be to address the occasional technical issue that arises as reported by users (or issues that arise from
Thanks,
Morgan
The resource allocation above seems OK. Regarding machine learning R&D resources, I will discuss with Devaang once he is back and let you know.
Along with this, they are also looking at the component aspects of this research in terms of the software, hardware, and services that will be needed.
We're good for Oct. 30 at 8am PDT.
Thanks!
Ryan
Also,
Microsoft Azure: https://azure.microsoft.com/en-ca/services/cognitive-services/computer-vision/;
Amazon AWS: https://docs.aws.amazon.com/rekognition/latest/dg/text-detection.html)?
Further to a discussion with
Microsoft Azure: https://azure.microsoft.com/en-ca/services/cognitive-services/computer-vision/;
Amazon AWS: https://docs.aws.amazon.com/rekognition/latest/dg/text-detection.html)?
Thanks,
Morgan
30th October at 8:00 am would be fine for me. I will ask Devaang once he is available next week.
Thank you for answering my questions above. Everything appears to be on track. I'll send a calendar invite for the call on the 30th, and hopefully
Thanks,
Morgan
I just sent an updated calendar invite for the call tomorrow morning, which includes the following agenda:
Meeting Agenda
Morgan