Web Scraping ICSID Case data
Hi Martin and Paul,
Following up on our discussion today, this is probably a good sample for us to work with on the case data we're hoping to scrape from the ICSID site: https://icsid.worldbank.org/cases/case-database/case-detail?CaseNo=ARB/98/2
Perhaps as a first step, Martin, would it be possible for you to extract the data from the fields on this page and insert it into a spreadsheet? Then we can take a look at what data can be extracted and how we can migrate it into the field structure used for ISLG Dispute forms: https://app.investorstatelawguide.com/Admin/ContentTypeData/DisputeList
Long-term, maybe there could even be a way to automatically generate a completed Dispute form that would avoid the step of having to manually upload the data into the fields.
Thanks,
Morgan
I am currently evaluating a couple of web scraping solutions, and they all require that we manually identify and name each field.
Do we need all fields from each case? Or can we narrow it down to a finite list?
If we can narrow the list down, it would help me if someone could provide me with a list of fields to extract.
Thanks,
--Martin
Morgan
I have extracted the sample page into an Excel spreadsheet.
See attached.
--Martin
Morgan
Following up on above, I've taken the spreadsheet produced by
Why don't we have a call to discuss and I can take you through what I've done?
Thanks,
Morgan
Following up on the above, are you both available for a call at 10am PT tomorrow to discuss? I'll send out a calendar invite.
Morgan
An updated copy of the spreadsheet above is available here: https://islg.egnyte.com/dl/LgIaIPBu1l.
Also, here is a screenshot to try and describe how we need to map the data concerning the Reconstituted fields:
Let me know if you need further detail on how this works.
Thanks,
Morgan
Below is a sample of what I envision the scraper will be outputting.
Let me know if you have any questions.
Morgan
I will have something functional by early next week.
Thanks,
--Martin
Morgan
I have a script that can read all fields of any ICSID case, and export to CSV.
I will start looking at implementing the reconstituted fields next.
I ran the script against 18 completed cases (including ARB/98/2) and have attached the CSV output.
Please take a look and provide feedback if you have any.
Thanks,
--Martin
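For reference, the CSV export step described above could look roughly like the sketch below. This is not Martin's actual script: the function name `export_cases_to_csv`, the column headers, and the sample data are all placeholders chosen for illustration, and the scraping itself is assumed to have already produced a label-to-value dictionary per case.

```python
import csv
import io


def export_cases_to_csv(cases, out):
    """Write one CSV row per (case number, field label, value).

    `cases` maps a case number to a dict of field label -> raw value.
    The three-column layout is an assumption, not the real export format.
    """
    writer = csv.writer(out)
    writer.writerow(["case_no", "field", "value"])
    for case_no, fields in cases.items():
        for label, value in fields.items():
            writer.writerow([case_no, label, value])


# Placeholder data only; these are not real ICSID field values.
sample = {"ARB/98/2": {"Field A": "value 1", "Field B": "value 2, with comma"}}

buffer = io.StringIO()
export_cases_to_csv(sample, buffer)
csv_text = buffer.getvalue()
```

Using the `csv` module rather than joining strings by hand means values that contain commas or quotes are escaped correctly.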
This looks great. However, I noticed the output hasn't separated fields that have been combined into one field (e.g., Claimant(s)/Nationality(ies)), and that fields with multiple values are getting combined into one value within the same cell (e.g., Arbitrators). Is there a way for us to deal with this at the initial output stage, or will we need to organize this when we start connecting the data to the applicable ISLG fields?
Thanks,
Morgan
Yes, I'll develop the script further so that it splits fields containing multiple values.
I hope to have everything done by Monday.
--Martin
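As an illustration of the splitting discussed above, here is a minimal sketch. It rests on two assumptions that the thread does not confirm: that multiple values in one cell are separated by a semicolon, and that a combined Claimant(s)/Nationality(ies) entry has the form "Name (Nationality)". The helper names `split_multi` and `split_name_nationality` are hypothetical, and the real parsing rules may well be more involved.

```python
import re


def split_multi(value, sep=";"):
    """Split a cell holding several values into a list of trimmed values.

    The semicolon separator is an assumption about the raw export.
    """
    return [part.strip() for part in value.split(sep) if part.strip()]


# Assumed shape of a combined entry: "Name (Nationality)".
_NAME_NAT = re.compile(r"^(?P<name>.*?)\s*\((?P<nat>[^()]*)\)\s*$")


def split_name_nationality(entry):
    """Return (name, nationality); nationality is "" if none is found."""
    match = _NAME_NAT.match(entry)
    if match:
        return match.group("name"), match.group("nat")
    return entry.strip(), ""


# Placeholder values, not real case data.
arbitrators = split_multi("Arbitrator A; Arbitrator B")
claimant = split_name_nationality("Claimant X (Country Y)")
```

Each element of `arbitrators` can then be written to its own row or column, and `claimant` gives the name and nationality as separate values.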
I've noted a couple of points as comments in the screenshot below.
Two things:
1. Party representative indication is required under column B or C to distinguish different values; and
2. We don't need party representative locations (if it makes the scraping script easier).
Thanks,
Paul
Just an update that I am still working on this. The parsing rules and field matching are more complex than I anticipated.
Thanks,
--Martin
I have completed the first (and most challenging) part of the project, which was to fully extract an ICSID case and correctly split the values when there are multiple entries under one field.
I'm now also able to capture not only the various phases of the case, but also the subsections within each phase (column C).
Now that we are able to reliably extract and parse case data, my next step will be to work on matching ICSID fields with ISLG fields (columns F, G, H).
I'm attaching the latest export file for ARB/98/2.
Let me know if you have any questions/feedback.
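To make the field-matching step concrete, here is a rough sketch of one way it could work: a lookup table from ICSID labels to ISLG field names. The ISLG names used here (`islg_claimant`, `islg_arbitrator`) are invented placeholders, not the real ISLG field names, and leaving the ISLG field blank when there is no match is an assumption about how unmatched rows should be handled.

```python
# Hypothetical mapping; the right-hand names are placeholders,
# not actual ISLG Dispute form fields.
FIELD_MAP = {
    "Claimant(s)/Nationality(ies)": "islg_claimant",
    "Arbitrators": "islg_arbitrator",
}


def map_rows(rows, field_map=FIELD_MAP):
    """Attach an ISLG field name to each (ICSID label, value) pair.

    Labels without an entry in `field_map` get an empty ISLG field,
    so unmatched rows stay visible in the output for manual review.
    """
    mapped = []
    for label, value in rows:
        mapped.append(
            {
                "icsid_field": label,
                "value": value,
                "islg_field": field_map.get(label, ""),
            }
        )
    return mapped


# Placeholder rows, not real case data.
result = map_rows([("Arbitrators", "Arbitrator A"), ("Some other field", "x")])
```

Keeping the mapping in a single table means new matches can be added without touching the extraction code.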
This looks good. Please note that not every row will have an equivalent on ISLG (e.g., Outcome of Proceeding, Language(s) of Proceeding). I can identify all of them if you'd like. Let me know.
Thanks,
Paul
I am using
I'm attaching a work-in-progress version of the output. I've only matched a few fields so far, but I think this will give you a good idea of what the final product will look like.
The relevant columns are F, G and H.
Morgan