Basecamp Export

Morgan Maguire, CEO

Hello all,

Following up on my message above, Devaang, Anil and Stephen, could you please confirm availability for a call to discuss the PDF to HTML issue during one of the following times:

Wednesday, September 26th 13:30-14:30 London time
Thursday, September 27th 13:30-14:30 Paris time
Thursday, September 27th 16:30-19:00 Paris time
Friday, September 28th 12:00-14:30 Paris time
Friday, September 28th 16:30-19:00 Paris time

Stephen, please send out meeting details once we've confirmed a time for the call.

Thanks,

Morgan

Sep 19, 2018 at 6:04 PM Notified 7 people

Stephen Ceresia

Hi Morgan,

The following times would be better for us:

Wednesday Sept. 26th: 14:00 London time (GMT+1) / 9:00 Ottawa time (GMT-4)
Thursday, Sept. 27th: 17:00 Paris time (GMT+2) / 11:00 Ottawa time (GMT-4)

Devaang, Anil, let me know if either of these works for you and I'll send out the invite.

Thanks,
Stephen

Sep 19, 2018 at 6:19 PM Notified 7 people

Morgan Maguire, CEO

Hi Stephen, I think you might have your time zones mixed up. 13:30 London time is 8:30am Ottawa time.

Morgan

Sep 19, 2018 at 6:23 PM Notified 7 people

Stephen Ceresia

Hi Morgan - sorry, let me try that again. (Times edited in my previous comment)

Sep 19, 2018 at 6:40 PM Notified 7 people

Morgan Maguire, CEO

Great. Thanks Stephen.

Devaang and Anil, let us know if either of those times works for you.

Morgan

Sep 19, 2018 at 8:49 PM Notified 7 people

Anil Vaghela

Hello Morgan and Stephen,

Wednesday Sept. 26th: 14:00 London time (GMT+1) / 9:00 Ottawa time (GMT-4) would be fine for us.

Sep 20, 2018 at 10:04 AM Notified 7 people

Morgan Maguire, CEO

Hi Anil and Stephen,

Apologies, but I now have a client meeting 13:30-14:30 London time (GMT+1).

Anil, can you make the call on Thursday, Sept. 27th at 17:00 Paris time (GMT+2)?

Thanks,

Morgan

Sep 20, 2018 at 4:28 PM Notified 7 people

Stephen Ceresia

Hi Morgan,

That works for us.

Anil, please confirm if this is OK for you and I'll update the invite.

Thanks,
Stephen

Sep 20, 2018 at 4:31 PM Notified 7 people

Anil Vaghela

Hello Morgan and Stephen,

Thursday, Sept. 27th: 17:00 Paris time would work for us.

Sep 21, 2018 at 5:53 AM Notified 7 people

Morgan Maguire, CEO

Hello all,

As I’ve mentioned above, it is urgent that we resolve this problem to ensure the ISLG and P3LG projects can proceed as planned. We cannot afford any delay in solving it. As a result, in advance of the call on Thursday, work needs to be done to ensure everyone understands the issues and is prepared to put forward viable solutions. The call is not meant for a general discussion of the issues. We are having the call to finalize the plan for implementing a solution. To get this discussion started, I have created a number of questions below to flesh out the issues. Please respond to these questions in comments, and we’ll work towards a solution on Thursday.

Please provide details on how the current PDF to HTML converter is set up in the P3 prototype. - Anil

What software is being used to perform the conversions?
What options are available for adjusting the output of the conversions (e.g., can we ensure all conversions are produced in accordance with the Brand Identity Guide)?

What other options are available for PDF to HTML converters?

What other OCR software options are available (e.g., Nuance, ABBYY)?
- Do they produce better quality output than the current converter?
- Do they allow us to manipulate the output to conform to our style requirements?
- Do they allow us to automate the generation of the Table of Contents?
Can we use cloud-based OCR APIs (e.g., Google Cloud Vision, Microsoft Cognitive Services, Amazon AWS)? Note I came across this article comparing these different APIs: https://dataturks.com/blog/compare-image-text-recognition-apis.php
- Do they produce better quality output than the current converter?
- Do they allow us to manipulate the output to conform to our style requirements?
- Do they allow us to automate the generation of the ToC?

How can we automatically identify which documents require manual intervention in the PDF to HTML conversion?

PDFs in ISLG and P3LG have varying degrees of quality. Assuming low quality documents will require some level of manual intervention, what automated systems can be developed to assess the quality of a set of PDF documents in advance of conversion?
- Is it possible to develop an application that can assess the quality of the PDF, and determine whether manual intervention is necessary to produce a high quality HTML document?
This aspect is important, because it will allow us to assess the overall quality of the document collection, and determine what percentage requires manual intervention. It is not viable for us to individually assess the quality of all 5300 documents on ISLG.
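One possible shape for such a quality check is sketched below. This is only a rough illustration, not an agreed design: it assumes the text of each page has already been extracted by some other tool (which tool is still an open question in this thread), and it flags a document when very little text came out or when most tokens do not look like words. The thresholds and function names are hypothetical and would need tuning against a hand-checked sample of ISLG and P3LG documents.

```python
from dataclasses import dataclass

# Hypothetical thresholds; tune them against a manually reviewed sample.
MIN_CHARS_PER_PAGE = 200
MIN_WORDLIKE_RATIO = 0.8
PUNCTUATION = ".,;:!?()[]\"'"


@dataclass
class QualityReport:
    chars_per_page: float
    wordlike_ratio: float
    needs_manual_review: bool


def is_wordlike(token: str) -> bool:
    """A token counts as word-like if it is purely letters or purely digits
    once surrounding punctuation is stripped."""
    core = token.strip(PUNCTUATION)
    return core.isalpha() or core.isdigit()


def assess_pages(pages: list[str]) -> QualityReport:
    """Score one document from its already-extracted page texts."""
    tokens = [t for page in pages for t in page.split()]
    total_chars = sum(len(page.strip()) for page in pages)
    chars_per_page = total_chars / len(pages) if pages else 0.0
    wordlike = sum(1 for t in tokens if is_wordlike(t))
    ratio = wordlike / len(tokens) if tokens else 0.0
    flagged = chars_per_page < MIN_CHARS_PER_PAGE or ratio < MIN_WORDLIKE_RATIO
    return QualityReport(chars_per_page, ratio, flagged)


def share_needing_review(collection: dict[str, list[str]]) -> float:
    """Fraction of documents in a collection flagged for manual review."""
    if not collection:
        return 0.0
    flagged = sum(
        1 for pages in collection.values()
        if assess_pages(pages).needs_manual_review
    )
    return flagged / len(collection)
```

Running `share_needing_review` over the whole collection would give the percentage estimate asked for above without anyone opening each document individually. A heuristic like this will misjudge some documents, so spot checks would still be needed.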

Please provide your input in comments below.

Thanks,

Morgan

Sep 21, 2018 at 5:20 PM Notified 7 people

Morgan Maguire, CEO

In addition to the above, here are some details on the cloud based OCR APIs:

Google Cloud: https://cloud.google.com/vision/docs/ocr
Microsoft Azure: https://azure.microsoft.com/en-ca/services/cognitive-services/computer-vision/
Amazon AWS: https://docs.aws.amazon.com/rekognition/latest/dg/text-detection.html

I also found this post, discussing a project tackling a problem similar to ours: https://aws.amazon.com/blogs/publicsector/serverless-optical-character-recognition-in-support-of-nasa-astronaut-safety/

Morgan

Sep 21, 2018 at 5:50 PM Notified 7 people

Devaang Bhatt

Hello all,

I will let Anil provide you with information separately on how the PDF to HTML conversion was achieved for the P3 PoC. This was comparatively more of a manual effort than an automated one.

But today I had a brainstorming session with Anil and a few other members of my team on how to get as close as possible to an automated solution for this conversion challenge. We did some out-of-the-box thinking, setting aside all existing tools such as Adobe Acrobat Pro. We now have a concept for developing a fully customized automation tool, which we believe can help us overcome this challenge for the long term. The concept is based on the premise that we will define a ‘standard’ HTML structure for P3 or the ISLG rebuild. All PDFs, irrespective of how they are formatted, will eventually convert to this ‘standard’ HTML structure. There are many advantages to this approach, which I shall explain during our upcoming call this Thursday.

The following is envisaged from this tool.

1. Take a PDF file as an input

2. Extract TEXT from the PDF. We are exploring whether we can develop an algorithm that extracts the TEXT and reasonably maintains the formatting too.

3. The tool will then take this TEXT as an input and tag it based on the ‘standard’ HTML structure that we have defined. The tagging logic will be derived from the knowledge and lessons learnt during the conversions we executed recently. The tool will automatically identify TOC contents and structures and also relate it to the relevant content in the TEXT body. Will explain more on the call.

4. The tool will also provide a lot of QA features such as checking tags, spell check, removing extra spaces, removing tab marks, etc.

5. The end result will be a ‘standard’ HTML file.
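Steps 3 to 5 could look roughly like the sketch below. It is an illustration under stated assumptions, not the tool described above: it assumes the extracted TEXT has one paragraph or heading per line, treats a short line that starts with a section number (such as "1." or "1.1") as a heading, and uses an invented "standard" structure (a `nav` table of contents followed by `h2` and `p` elements). The real tagging rules and the real standard structure are still to be defined.

```python
import html
import re

# Assumed heading pattern: a section number such as "1." or "2.3" followed by a title.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(\S.*)$")
MAX_HEADING_LENGTH = 80  # hypothetical cut-off to avoid tagging numbered sentences


def clean_line(line: str) -> str:
    """QA step: replace tab marks and collapse extra spaces."""
    return re.sub(r"\s+", " ", line.replace("\t", " ")).strip()


def tag_text(text: str) -> str:
    """Tag extracted text as headings and paragraphs and build a linked TOC."""
    toc, body = [], []
    for raw in text.splitlines():
        line = clean_line(raw)
        if not line:
            continue
        match = HEADING.match(line)
        if match and len(line) <= MAX_HEADING_LENGTH:
            anchor = "sec-" + match.group(1).replace(".", "-")
            title = html.escape(line)
            toc.append(f'<li><a href="#{anchor}">{title}</a></li>')
            body.append(f'<h2 id="{anchor}">{title}</h2>')
        else:
            body.append(f"<p>{html.escape(line)}</p>")
    return "\n".join(['<nav class="toc"><ol>', *toc, "</ol></nav>", *body])
```

Each TOC entry links to the heading it was derived from, which is one simple way to relate the TOC to the content in the body. A paragraph that happens to begin with a number would be mis-tagged as a heading here, so real rules would need to be stricter.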

The development of such a tool is expected to take 4-5 months using 2 resources, as there is a lot of R&D involved.

I am optimistic about this approach as the tool can be used for both P3 and ISLG going forward. This optimism is based on absolute knowledge of technical possibilities.

Let’s talk more on this.

Best regards,
Devaang Bhatt | Associate Vice President, International Business
Microsoft Specialist, MCP

Sep 25, 2018 at 2:25 PM Notified 7 people

Morgan Maguire, CEO

Hi Devaang,

Thank you for the note above. This sounds like a promising solution to the problem.

Juan, Ryan, please provide your feedback on this proposal.

Devaang, for step 2 in your proposal above, what are the advantages of developing the text extraction algorithm ourselves, rather than relying on a third party application for such purposes? Also, would there be any benefit in developing this algorithm in conjunction with the cloud-based OCR APIs mentioned above?

For steps 3-5, I think we should thoroughly document the parameters for the "standard" HTML file. The standard used in P3 will differ from ISLG, and we may want to introduce other nuances depending on the type of document we're converting. Let's ensure we integrate this into the process.

Thanks,

Morgan

Sep 26, 2018 at 4:02 AM Notified 7 people

Juan Silva

Hi Morgan,

It is a bit difficult to give an opinion without knowing the exact problems you were facing before.

However, from what I have gathered reading some of the Basecamp threads, I understand that the general problem is with the structure of the output HTML from the PDF conversion.

The approach described by Devaang does make sense in that context. As I understand it, the plan is to extract the text components from the source PDF and give them semantic meaning before building a standard HTML output.

I wonder if it would make sense to store those extracted pieces of text with their semantic meaning in a database, instead of just writing them to an HTML output file. This would allow you to have multiple structured HTML templates that you can use to render the same document.

This way the PDF document representation would be a database model, and not just a mere file. This would provide a bit more robustness to the implementation and be ready for more advanced features you may want to implement around the document.
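As a minimal sketch of this idea, the example below stores each extracted piece of text with a semantic role in a database and renders the same document through more than one HTML template. Everything specific here is an assumption made for illustration: the `segment` table, the role names, and the two templates ("treaty" and "dispute") are invented, and SQLite is used only because it ships with Python.

```python
import html
import sqlite3

# Hypothetical schema: one row per extracted piece of text.
SCHEMA = """
CREATE TABLE segment (
    doc_id   TEXT NOT NULL,
    position INTEGER NOT NULL,
    role     TEXT NOT NULL,
    content  TEXT NOT NULL,
    PRIMARY KEY (doc_id, position)
);
"""

# Hypothetical templates mapping each semantic role to markup.
TEMPLATES = {
    "treaty": {
        "title": "<h1>{}</h1>",
        "heading": "<h2>{}</h2>",
        "paragraph": "<p>{}</p>",
    },
    "dispute": {
        "title": '<h1 class="case">{}</h1>',
        "heading": "<h3>{}</h3>",
        "paragraph": '<p class="para">{}</p>',
    },
}


def store(conn: sqlite3.Connection, doc_id: str, segments: list[tuple[str, str]]) -> None:
    """Save (role, content) pairs in document order."""
    conn.executemany(
        "INSERT INTO segment VALUES (?, ?, ?, ?)",
        [(doc_id, i, role, content) for i, (role, content) in enumerate(segments)],
    )


def render(conn: sqlite3.Connection, doc_id: str, template_name: str) -> str:
    """Render a stored document with the chosen template."""
    template = TEMPLATES[template_name]
    rows = conn.execute(
        "SELECT role, content FROM segment WHERE doc_id = ? ORDER BY position",
        (doc_id,),
    )
    return "\n".join(template[role].format(html.escape(content)) for role, content in rows)
```

With a connection created by `sqlite3.connect(":memory:")` and initialised with `conn.executescript(SCHEMA)`, calling `render` twice with different template names produces two HTML versions of one stored document, so a change to a template does not require converting the PDF again.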

Again that may or may not make sense depending on your plans for the tool in the future.

Sep 27, 2018 at 1:57 PM Notified 7 people

Morgan Maguire, CEO

Hi Juan,

Thank you for the input above. That sounds like a good idea. Different HTML templates will probably be necessary depending on the type of document we're converting (e.g., we'll want to tag Dispute Documents differently from Treaties & Rules in ISLG).

However, I'm still concerned about the quality of the text extraction from the PDF. Back to my question above, would it be advisable to develop our own application for performing that task, or should we explore the options available through third parties?

Talk to you all soon on the call.

Morgan

P.S. please ensure no one is on speaker phone, because I want to ensure I can understand everyone clearly.

Sep 27, 2018 at 2:37 PM Notified 7 people

Morgan Maguire, CEO

Hello all,

Following up on our call from last Thursday, we are going to resolve the PDF to HTML conversion issue through implementing the plan above. As a result, next steps are the following:

Perform R&D and finalize solution for PDF text extraction. Solution should have the capacity to accurately extract text from poor quality PDFs (e.g., PDFs that are the product of scanned copies of printed documents). Third party API options vs. an in-house solution will be considered as part of R&D.
Organize and finalize tagging structure for HTML templates. Templates will vary depending on the type of document (e.g., dispute document vs. legal instrument) and product (e.g., ISLG vs. P3LG). Next week, I will provide samples of each document type with descriptions of the document structure.

Devaang, could you please confirm the above? Also, could you please confirm how we're going to staff the project, given that we have previously discussed increasing the number of resources dedicated to ISLG?

Thanks,

Morgan

Oct 01, 2018 at 6:55 AM Notified 7 people

Devaang Bhatt

Hi Morgan,

Yes, your brief on the scope of work is in-line with what we discussed and I confirm this.

Regarding the staffing, we would need 2 dedicated resources. We can pull one from the P3 team right now as the work load is less and add one new resource. When we see the tasks on P3 picking up then we can return the P3 resource and replace with a new one then. I am sure this clarifies your staffing question. Yet, do let me know if you need further clarity.

The new resource will start immediately.

Best regards,
Devaang Bhatt | Associate Vice President, International Business
Microsoft Specialist, MCP

Oct 01, 2018 at 11:48 AM Notified 7 people

Morgan Maguire, CEO

Hi Devaang,

Moving one resource from P3 makes sense given that development work on that platform will be slower until we decide to move forward with a full product (probably at the beginning of next year). However, perhaps we should pull the additional resource from existing ISLG resources, because we have almost concluded all work needed on the current ISLG site, and those resources should be available for work on this task.

Assuming we are able to resolve this issue quickly, we can then add the additional developer and QA at the beginning of November as we previously discussed when we are ready to start building the framework for the new application.

Please confirm if this makes sense. Thanks,

Morgan

Oct 01, 2018 at 12:55 PM Notified 7 people

Morgan Maguire, CEO

Hi Devaang,

Further to my previous comment, could you confirm how we're allocating development resources? Also, could you provide us with a timeline for the PDF extraction R&D? I know it's difficult to put a deadline on R&D, but I want to ensure we keep on track with our overall development timeline for the rebuild project.

Thanks,

Morgan

Oct 10, 2018 at 7:26 PM Notified 8 people

Morgan Maguire, CEO

Hello all,

I'm posting this message on Devaang's behalf because he wasn't able to access Basecamp for some reason:

Hi Morgan,

I reviewed the current workloads with Anil and agree with you that we should be able to free-up two resources (one each from ISLG and P3) to undertake the PDF to HTML conversion tool development. The R & D is definitely tricky but I hope to have a fine blueprint by 2nd or 3rd week of December 2018. This timeline is considering the fact that we will have our annual Diwali Festival vacations from November 7th to November 14th.

Will keep you updated on the progress as it happens or every 10 days.

Oct 11, 2018 at 4:08 PM Notified 8 people

Morgan Maguire, CEO

Hi Devaang,

The timeline on PDF to HTML R&D sounds fine.

Devaang and Anil, please keep us updated with progress as it develops.

Please confirm the current allocation of DevIT resources:

2 full-time developers on PDF to HTML R&D (1 from ISLG and 1 from P3)
2 full-time developers on ISLG (1 to be added on November 1st)
1 full-time QA manager on ISLG (November 1st)
1 full-time developer on P3
1 full-time developer on Machine Learning R&D
1 part-time team lead ( Anil )

If this is correct, we should have a call to discuss next steps on all projects. I want to ensure all resources are kept busy now that maintenance work on ISLG is slowing down, and ensure we are instituting more thorough project management over the development team as it expands.

I propose having a call on Tuesday, October 16th at 8:00am Vancouver time.

Devaang, Anil, Ryan, Stefanie, Melissa (and Stephen if he is available), please confirm you're available and I'll send out call details.

Thanks,

Morgan

Oct 11, 2018 at 4:19 PM Notified 8 people

Melissa Cowell, General Manager

Morgan

Wednesday at 8am Vancouver time works better for the Industrial team.

Mel

Oct 11, 2018 at 4:52 PM Notified 8 people

Morgan Maguire, CEO

Hi Melissa,

I've got a conflict on the Wednesday and Thursday morning.

Ryan, can you make the call, and then you can brief everyone else?

Thanks,

Morgan

Oct 11, 2018 at 5:28 PM Notified 8 people

Anil Vaghela

Hello Morgan and Ryan,

I apologize for the late reply. Devaang is in Dubai for GITEX Technology Week and will be back this weekend. Also, we have a holiday on 18th October for the Dussehra festival. Please let us know whether we can organize this meeting next week.

Oct 15, 2018 at 6:00 AM Notified 8 people

Ryan Knuth, Customer Support Manager

Hi Morgan and Anil,

I'm at a conference on the 18th as well and out of office the afternoon of the 17th.

For next week my only unavailability is after 4pm EDT on the 22nd and I'm at another conference on the 25th.

Thanks!

Ryan

Oct 15, 2018 at 2:23 PM Notified 8 people

Morgan Maguire, CEO

Hi Anil and Ryan,

OK. It looks like scheduling a call might be difficult right now. However, I still need clarification on what the DevIT teams are working on to ensure we are allocating resources effectively.

Anil, could you please confirm that resources are allocated as follows:

2 full-time developers on PDF to HTML R&D (1 from ISLG and 1 from P3)
2 full-time developers on ISLG (1 to be added on November 1st)
1 full-time QA manager on ISLG (November 1st)
1 full-time developer on P3
1 full-time developer on Machine Learning R&D
1 part-time team lead ( Anil )

Also, please provide a detailed description on what each team is working on for the next two weeks.

Let's schedule a call for the week of October 29 - November 1.

Devaang, Anil, Ryan, Stephen and Melissa, can you confirm whether you're available for a call on Tuesday, October 30th at 8:00am Vancouver time?

Thanks,

Morgan

Oct 15, 2018 at 5:16 PM Notified 8 people

Morgan Maguire, CEO

Hi Anil,

In addition to the above, I suggest examining the current roles for the 2 developers dedicated to ongoing maintenance of ISLG. We have completed all the outstanding to-do's on Basecamp, and the only work required until we start on the new site will be to address the occasional technical issue reported by users (or issues that arise from Ryan's security scan). Therefore, I suggest dedicating these resources as much as possible to support the PDF to HTML R&D, because it's critical that we come up with a timely solution to this issue.

Thanks,

Morgan

Oct 15, 2018 at 6:35 PM Notified 8 people

Anil Vaghela

Hello Morgan,

The above resource allocation seems OK. Regarding machine learning R&D resources, I will discuss with Devaang once he is back and let you know.

2 full-time developers on PDF to HTML R&D (1 from ISLG and 1 from P3)
- One developer will be working on R & D to separate low quality PDF files from the list of PDFs.
- Second developer will be working on R & D to extract text from low quality PDF files.
2 full-time developers on ISLG
- One developer will be working on ISLG Rebuild technical architecture design.
- Another developer will be working on occasional ISLG support tasks and PDF to HTML R & D to format HTML output generated by Adobe Pro SDK.
1 full-time developer on P3
- Will be working on issues related to tag excerpts formatting and display vertical lines for highlighted paragraphs.
1 full-time developer on Machine Learning R&D
- The team is currently researching the Azure ML platform, including whether the Free tier of Azure ML is sufficient, what its limitations are, and what we'll need in production.
  Along with this, they are also looking at the components (software, hardware, services) that will be needed.
1 part-time team lead ( Anil )
- Will be managing all above work and working on PDF to HTML R & D

Please review above and let me know if you need any further clarification.

Oct 16, 2018 at 9:26 AM Notified 8 people

Ryan Knuth, Customer Support Manager

Hi Morgan,

We're good for Oct. 30 at 8am PDT.

Thanks!

Ryan

Oct 16, 2018 at 3:13 PM Notified 8 people

Morgan Maguire, CEO

Great. Thanks Ryan.

Anil and Devaang, will Tuesday, October 30th at 8:00am Vancouver time work for you?

Also, Anil, a couple of follow-up questions about the allocation of resources:

2 full-time developers on PDF to HTML R&D (1 from ISLG and 1 from P3)
- One developer will be working on R & D to separate low quality PDF files from the list of PDFs.
- Second developer will be working on R & D to extract text from low quality PDF files.
  - Question: Are we investigating the viability of using third party applications for text extraction as part of R&D (e.g., Google Cloud: https://cloud.google.com/vision/docs/ocr;
    Microsoft Azure: https://azure.microsoft.com/en-ca/services/cognitive-services/computer-vision/;
    Amazon AWS: https://docs.aws.amazon.com/rekognition/latest/dg/text-detection.html)?
2 full-time developers on ISLG
- One developer will be working on ISLG Rebuild technical architecture design.
  - Question: I was under the impression that we should finalize the design of the admin site before starting the technical architecture. Why are we changing approaches? Note that the wireframes will be finalized this week, and then we'll start generating the user stories.
- Another developer will be working on occasional ISLG support tasks and PDF to HTML R & D to format HTML output generated by Adobe Pro SDK.
  - Question: following on my question above, are we examining cloud-based ML solutions to this problem? My impression is that Adobe has produced unsatisfactory results so far.
1 full-time developer on P3
- Will be working on issues related to tag excerpts formatting and display vertical lines for highlighted paragraphs.
  - Question: are we working on producing the admin site search within the tagging application? Not having this complete is hindering our content analysis, and forcing us to bring aspects of the process offline.
1 full-time developer on Machine Learning R&D
- The team is currently researching the Azure ML platform, including whether the Free tier of Azure ML is sufficient, what its limitations are, and what we'll need in production.
  Along with this, they are also looking at the components (software, hardware, services) that will be needed.
  - Question: how is Contegra's auto-search getting integrated into this process?
1 part-time team lead ( Anil )
- Will be managing all above work and working on PDF to HTML R & D

Oct 16, 2018 at 3:52 PM Notified 8 people

Morgan Maguire, CEO

Hi Anil,

Further to a discussion with Ryan on the above, have we considered utilizing the ML developer to help us address the PDF to HTML conversion issue? In particular, can they help us understand whether we can utilize ML cloud applications for text extraction (Google Cloud: https://cloud.google.com/vision/docs/ocr;
Microsoft Azure: https://azure.microsoft.com/en-ca/services/cognitive-services/computer-vision/;
Amazon AWS: https://docs.aws.amazon.com/rekognition/latest/dg/text-detection.html)?

Thanks,

Morgan

Oct 16, 2018 at 5:27 PM Notified 8 people

Anil Vaghela

Hi Morgan,

2 full-time developers on PDF to HTML R&D (1 from ISLG and 1 from P3)
- One developer will be working on R & D to separate low quality PDF files from the list of PDFs.
- Second developer will be working on R & D to extract text from low quality PDF files.
  - Question: Are we investigating the viability of using third party applications for text extraction as part of R&D (e.g., Google Cloud: https://cloud.google.com/vision/docs/ocr;
    Microsoft Azure: https://azure.microsoft.com/en-ca/services/cognitive-services/computer-vision/;
    Amazon AWS: https://docs.aws.amazon.com/rekognition/latest/dg/text-detection.html)?
    - Yes, we will investigate cloud-based OCRs and other OCR tools as well for text extraction.
2 full-time developers on ISLG
- One developer will be working on ISLG Rebuild technical architecture design.
  - Question: I was under the impression that we should finalize the design of the admin site before starting the technical architecture. Why are we changing approaches? Note that the wireframes will be finalized this week, and then we'll start generating the user stories.
    - There is no relation between admin site design and technical architecture of the project. Technical architecture is a fundamental thing on which we need to work before starting actual user stories of admin design. Please let me know your thoughts.
- Another developer will be working on occasional ISLG support tasks and PDF to HTML R & D to format HTML output generated by Adobe Pro SDK.
  - Question: following on my question above, are we examining cloud-based ML solutions to this problem? My impression is that Adobe has produced unsatisfactory results so far.
    - Yes, we are also examining cloud-based solutions to extract text from PDF files. In addition to that, we are doing R & D on other aspects, e.g. 1. Whether we can alter the pre-generated HTML files to produce the expected output using different tools. 2. How to format extracted text into a standard HTML template.
    - Please note that text extraction is not the only issue. The other issue is formatting the extracted text in a way that generates standard HTML files. To achieve this we are exploring other tools too, e.g. Adobe SDK, Soda PDF, etc.
1 full-time developer on P3
- Will be working on issues related to tag excerpts formatting and display vertical lines for highlighted paragraphs.
  - Question: are we working on producing the admin site search within the tagging application? Not having this complete is hindering our content analysis, and forcing us to bring aspects of the process offline.
    - We have already completed this and uploaded on p3lg for your review.
1 full-time developer on Machine Learning R&D
- The team is currently researching the Azure ML platform, including whether the Free tier of Azure ML is sufficient, what its limitations are, and what we'll need in production.
  Along with this, they are also looking at the components (software, hardware, services) that will be needed.
  - Question: how is Contegra's auto-search getting integrated into this process?
    - Once we have Contegra ready on the server we will explore this and will let you know.

Please review above and let me know if you need any further clarification.

Oct 17, 2018 at 6:59 AM Notified 8 people

Anil Vaghela

Hello Morgan,

30th October 8.00 am would be fine for me. I will ask Devaang once he is available next week.

Oct 17, 2018 at 1:48 PM Notified 8 people

Morgan Maguire, CEO

Hi Anil,

Thank you for answering my questions above. Everything appears to be on-track. I'll send a calendar invite for the call on the 30th, and hopefully Devaang can confirm.

Thanks,

Morgan

Oct 17, 2018 at 2:43 PM Notified 8 people

Morgan Maguire, CEO

Hello everyone,

I just sent an updated calendar invite for the call tomorrow morning, which includes the following agenda:

Meeting Agenda

Development Update – Devaang and Anil
1. Update of latest developments from development team, including:
  1. Description of resource allocation between projects
  2. Progress report on PDF to HTML R&D
  3. Progress report on ISLG Rebuild technical architecture design
  4. Progress report on Machine Learning R&D
Update on ISLG rebuild – Melissa, Ryan and Morgan
1. Update on latest development for ISLG Rebuild design
2. Next steps for development team in process

Let me know if there is anything you want to add to the agenda.

Morgan

Oct 29, 2018 at 8:54 PM Notified 7 people

Morgan Maguire, CEO

https://jusmundi.com/

Oct 30, 2018 at 3:31 PM Notified 7 people

TOLOGIX - ISLG App Rebuild

HTML Conversion Problem
