TOLOGIX - ISLG App Rebuild

Web Scraping ICSID Case data

Hi Martin and Paul,

Following up on our discussion today, this is probably a good sample for us to work with on the case data we're hoping to scrape from the ICSID site: https://icsid.worldbank.org/cases/case-database/case-detail?CaseNo=ARB/98/2

Perhaps as a first step, Martin, would it be possible for you to extract the data from the fields on this page and insert it into a spreadsheet? Then we can take a look at what data can be extracted and how we can migrate it into the field structure used for ISLG Dispute forms: https://app.investorstatelawguide.com/Admin/ContentTypeData/DisputeList

Long-term, maybe there could even be a way to automatically generate a completed Dispute form that would avoid the step of having to manually upload the data into the fields.

Thanks,

Morgan

Comments & Events

Martin Laporte, CTO at Tologix
Hi Morgan,

I am currently evaluating a couple of web scraping solutions, and they all require that we manually identify and name each field. 

Do we need all fields from each case? Or can we narrow it down to a finite list?
If we can narrow the list down, it would help me if someone could provide me with a list of fields to extract.

Thanks,
--Martin
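
For illustration, the "finite list of named fields" approach could look roughly like the sketch below. It assumes, purely hypothetically, that the page exposes each field as a `<dt>` label followed by a `<dd>` value; the real ICSID markup is not described in this thread and may differ (or be rendered by JavaScript). The names `LabelValueParser` and `extract_fields` are made up for this example.

```python
from html.parser import HTMLParser


class LabelValueParser(HTMLParser):
    """Collect <dt>label</dt><dd>value</dd> pairs (hypothetical markup)."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)  # the agreed list of field labels to keep
        self.fields = {}
        self._tag = None           # "dt" or "dd" while inside one of them
        self._label = None         # most recent <dt> text
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._tag = tag
            self._text = []

    def handle_data(self, data):
        if self._tag:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        # Collapse runs of whitespace and trim the ends.
        text = " ".join("".join(self._text).split())
        if tag == "dt":
            self._label = text
        elif self._label in self.wanted:
            self.fields[self._label] = text
        self._tag = None


def extract_fields(html, wanted):
    """Return {label: value} for the labels listed in `wanted`."""
    parser = LabelValueParser(wanted)
    parser.feed(html)
    parser.close()
    return parser.fields
```

With a narrowed list, only the labels passed in `wanted` end up in the output; everything else on the page is ignored.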
Morgan Maguire, CEO
Ok. Sounds good, Martin. I can generate the list of relevant fields for you. However, is there an easy way for you to create a spreadsheet of the data on the sample page above based on the HTML? Otherwise, I'll need to copy/paste things directly from the page.

Morgan
Martin Laporte, CTO at Tologix
Hi Morgan,

I have extracted the sample page into an Excel spreadsheet.
See attached.

--Martin
Morgan Maguire, CEO
Perfect. Thanks, Martin. I'll use this to generate a table of correspondence identifying the fields we'll need to extract and connecting them to the relevant fields in the ISLG master list.

Morgan
Morgan Maguire, CEO
Hi Martin and Paul,

Following up on the above, I've taken the spreadsheet produced by Martin, isolated the fields that we'll want to collect data from, and inserted columns indicating the applicable fields within ISLG.

Why don't we have a call to discuss and I can take you through what I've done?

Thanks,

Morgan 

Morgan Maguire, CEO
Hi Paul and Martin,

Following up on the above, are you both available for a call at 10am PT tomorrow to discuss? I'll send a calendar invite out.

Morgan 
Morgan Maguire, CEO
Hi Martin,

An updated copy of the spreadsheet above is available here: https://islg.egnyte.com/dl/LgIaIPBu1l.

Also, here is a screenshot to try and describe how we need to map the data concerning the Reconstituted fields:

Let me know if you need further detail to explain how this works.

Thanks,

Morgan
Martin Laporte, CTO at Tologix
Thanks, Morgan. The diagram will be helpful.
Martin Laporte, CTO at Tologix
Hi Morgan,

Below is a sample of what I envision the scraper will be outputting.
Let me know if you have any questions.
Morgan Maguire, CEO
This looks great, Martin. How about the Reconstituted fields? Were you able to generate output for those?

Morgan 
Martin Laporte, CTO at Tologix
Hi Morgan, my screenshot above was just a mock-up so we can be aligned on what the output will look like. But I don't anticipate having any problem with the reconstituted fields.

I will have something functional by early next week.

Thanks,
--Martin
Morgan Maguire, CEO
Ok. Great. Thanks, Martin.

Morgan
Martin Laporte, CTO at Tologix
Hi Morgan and Paul,

I have a script that can read all fields of any ICSID case, and export to CSV.
I will start looking at implementing the reconstituted fields next.

I ran the script against 18 completed cases (including ARB/98/2) and have attached the CSV output.
Please take a look and provide feedback if you have any.

Thanks,
--Martin
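
As a rough sketch of the "read all fields, export to CSV" step (not the actual script, which is not shown in this thread), scraped cases could be flattened into one row per case and field. The dictionary layout, the column names `case_no`, `field`, `value`, and the function names are assumptions made for this example.

```python
import csv
import io


def cases_to_rows(cases):
    """Flatten cases into one row per (case, field) pair."""
    rows = []
    for case in cases:
        for field, value in case["fields"].items():
            rows.append({"case_no": case["case_no"], "field": field, "value": value})
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text; the csv module handles quoting of commas."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["case_no", "field", "value"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder data only; the field name and value are not real ICSID output.
sample_cases = [{"case_no": "ARB/98/2", "fields": {"Example Field": "example value"}}]
csv_text = rows_to_csv(cases_to_rows(sample_cases))
```

A long format like this keeps the column set fixed even when different cases expose different fields.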
Morgan Maguire, CEO
Hi Martin,

This looks great. However, I noticed the output hasn't separated fields that have been combined into one field (e.g., Claimant(s)/Nationality(ies)) and that fields with multiple values are getting combined into one value within the same cell (e.g., Arbitrators). Is there a way for us to deal with this at the initial output stage, or will we need to organize this when we start connecting the data to the applicable ISLG fields?

Thanks,

Morgan
Martin Laporte, CTO at Tologix
Hi Morgan,

Yes, I'll further develop the script, so we can split the field values that contain multiple values.

I hope to have everything done by Monday.

--Martin
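
A minimal sketch of the two kinds of splitting discussed above: separating several values held in one cell, and separating a combined "Name (Nationality)" entry. The separator and the parenthesized-nationality format are assumptions, since the thread does not show the raw ICSID values; `split_multi` and `split_party` are hypothetical helper names.

```python
import re

# Assumed shape of a combined entry: "Name (Nationality)".
PARTY_PATTERN = re.compile(r"\s*(?P<name>.+?)\s*\((?P<nationality>[^()]*)\)\s*$")


def split_multi(value, sep=";"):
    """Split one cell holding several values into a list of trimmed values."""
    return [part.strip() for part in value.split(sep) if part.strip()]


def split_party(entry):
    """Split 'Name (Nationality)' into its two parts.

    If no trailing parenthesized part is found, the whole entry is kept
    as the name and the nationality is left empty.
    """
    match = PARTY_PATTERN.match(entry)
    if match is None:
        return {"name": entry.strip(), "nationality": ""}
    return {
        "name": match.group("name"),
        "nationality": match.group("nationality").strip(),
    }
```

For example, `split_party("Acme Holdings Ltd. (Canadian)")` (a made-up party) yields a name of `Acme Holdings Ltd.` and a nationality of `Canadian`. Doing this at the output stage means each value gets its own row before any ISLG mapping is attempted.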
Morgan Maguire, CEO 👍
Paul Moon
Looks good, Martin.

I noticed what I've commented in the screenshot below.

Two things:
1. Party representative indication is required under column B or C to distinguish different values; and
2. We don't need party representative locations (if it makes the scraping script easier).

Thanks,

Paul
Martin Laporte, CTO at Tologix
Hi Morgan and Paul,

Just an update that I am still working on this. The parsing rules and field matching are more complex than I anticipated.

Thanks,
--Martin
Martin Laporte, CTO at Tologix
Hi Morgan and Paul,

I was able to complete the first (and most challenging) part of the project, which was to fully extract an ICSID case and correctly split the values when dealing with multiple entries under one field.

I'm now also able to capture not only the various phases of the case, but also the subsections within that phase (column C).

Now that we are able to reliably extract and parse case data, my next step will be to work on matching ICSID fields with ISLG fields (columns F, G, H).

I'm attaching the latest export file for ARB/98/2.
Let me know if you have any questions/feedback.
Paul Moon
Hi Martin:

This looks good. Please note that not every row would have a corresponding equivalent on ISLG (e.g., Outcome of Proceeding, Language(s) of Proceeding), so I can identify all of them if you'd like. Let me know.

Thanks,

Paul
Martin Laporte, CTO at Tologix
Thanks, Paul.

I am using Morgan's spreadsheet (https://islg.egnyte.com/dl/LgIaIPBu1l) to help me match the relevant ISLG fields.

I'm attaching a work-in-progress of what the final output will look like. I've only matched a few fields so far, but I think this will give you a good idea of what the final product will look like.
The relevant columns are F, G and H.
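
To illustrate the matching step, here is a small sketch of a lookup that annotates each extracted row with an ISLG field and leaves it blank when there is no equivalent (as Paul noted for rows such as Outcome of Proceeding). The ISLG field name below is a placeholder, not taken from the spreadsheet, and the row layout and the `islg_field` column are assumptions for this example.

```python
# Hypothetical mapping; the real ISLG field names live in the spreadsheet.
ICSID_TO_ISLG = {
    "Arbitrators": "Tribunal Member (placeholder ISLG field)",
}


def match_islg_field(icsid_field, mapping=ICSID_TO_ISLG):
    """Return the mapped ISLG field name, or '' when none is defined."""
    return mapping.get(icsid_field.strip(), "")


def annotate(rows, mapping=ICSID_TO_ISLG):
    """Return copies of the rows with an added 'islg_field' column."""
    return [
        dict(row, islg_field=match_islg_field(row["field"], mapping))
        for row in rows
    ]
```

Keeping the mapping in one table makes it easy to review against the spreadsheet and to spot rows that intentionally have no ISLG counterpart.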

Morgan Maguire, CEO
This looks great, Martin. There's some data we'll need to strip out of the values when the field mapping is complete, but the spreadsheet immediately above looks like exactly what we're looking for.

Morgan