covidatlas / li

Next-generation serverless crawler for COVID-19 data

License: Apache License 2.0

```shell
npm i
npm start
```
We have schema definitions for scrapers, but not for what they output. We should enforce things like administrative levels and measured quantities (e.g. cases, deaths, etc.).
Along with this, I propose we write a scraper skeleton, with a list of data requirements and pseudo-code (or a code template of some sort) showing what needs to be done and when.
Schema should be validated in `yarn test`.
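To make that concrete, here is a minimal sketch of output validation that could run under `yarn test`. The allowed levels and field names below are assumptions for illustration, not the project's actual schema:

```javascript
// Sketch of scraper-output validation (hypothetical field names).
// Enforces allowed administrative levels and integer measured quantities.
const ALLOWED_LEVELS = ['country', 'state', 'county', 'city']
const NUMERIC_FIELDS = ['cases', 'deaths', 'recovered', 'tested']

function validateOutput (record) {
  const errors = []
  if (!ALLOWED_LEVELS.includes(record.aggregate)) {
    errors.push(`invalid administrative level: ${record.aggregate}`)
  }
  for (const field of NUMERIC_FIELDS) {
    if (field in record && !Number.isInteger(record[field])) {
      errors.push(`${field} must be an integer, got ${record[field]}`)
    }
  }
  return errors
}
```

A real implementation would likely delegate to a JSON Schema validator such as `ajv` (mentioned elsewhere in these issues) rather than hand-rolling the checks.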
There are a number of significant events that affect data, either directly or in a delayed fashion, and may impact projections. Things like:
Add hourly scheduled task runner event (to run off the `invokes` table) for operating the `crawler` and `scraper` events.
The "flattening the curve" approach is built on the concept of medical system capacity which, when exceeded, leads to a substantial collapse of care capacity and, in turn, skyrocketing deaths.
It would be great to look for sources of data on the capacity of the medical system, and since it's changing (regions are pushing to increase it, to give more room for the flattened curve), it should be a data point per date.
❌ Arunachal Pradesh, iso1:IN: ?
❌ Assam, iso1:IN: ?
❌ Jharkhand, iso1:IN
Today's (28 March) version of the timeseries file doesn't distinguish data for New York State vs. New York City. Here are the two rows for yesterday (27 March) that match `state == 'NY'` and `is.na(county)`:
| city | county | state | country | population | lat | long | url | cases | deaths | recovered | active | tested | growthFactor | date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NA | NA | NY | USA | 19453561 | 42.76081 | -75.84097 | https://coronavirus.health.ny.gov/county-county-breakdown-positive-cases | 44635 | NA | NA | NA | NA | 1.197998 | 2020-03-27 |
| NA | NA | NY | USA | 8398748 | 40.70684 | -73.97834 | https://coronavirus.health.ny.gov/county-county-breakdown-positive-cases | 25398 | NA | NA | NA | NA | 1.187211 | 2020-03-27 |
Based on `population`, the top row is for the entire state and the second is for NYC. Perhaps someone was trying to fix covidatlas/coronadatascraper#399 and accidentally set `city` to `NA`?
Below is the list of sources (outside of the US) that we are currently aware of. If you know of a source that is not on this list, please file an issue, and we will update this document!
Status definition:
✅ Source is actively scraped by this project
🏗️ Someone is working on a scraper for this source
🐛 Source is buggy, needs fixing
ArcGIS maintains a list of all dashboards by country: https://www.arcgis.com/apps/opsdashboard/index.html#/a9419e61cb6f4521a15baf78be309b35
Contains data for a number of Latin American countries:
https://github.com/DataScienceResearchPeru/covid-19_latinoamerica
| Status | ISO Code | Name | URL | Notes |
|---|---|---|---|---|
| | TN | Tunisia Ministry of Health | https://services6.arcgis.com/BiTAc9ApDDtL9okN/arcgis/rest/services/Statistiques_par_gouvernorat_(nouvelle_donn%C3%A9e)/FeatureServer/0/query | |
| ✅ | ZA | COVID 19 Data for South Africa | https://github.com/dsfsi/covid19za | |
| Status | ISO Code | Name | URL | Notes |
|---|---|---|---|---|
| ✅ | BR | Secretaria de Vigilância em Saúde do Ministério da Saúde | https://covid.saude.gov.br/ | |
| ✅ | CA | Public Health Agency of Canada | https://health-infobase.canada.ca/src/data/covidLive/covid19.csv | |
| | CA | | https://resources-covid19canada.hub.arcgis.com/app/82e586188b7049e1896b771cd4875815 | Provides data at the health district level |
| 🏗️ covidatlas/coronadatascraper#788 | CA-NS | Government of Nova Scotia | https://novascotia.ca/coronavirus/data/COVID-19-data.csv | |
| | GT | Ministry of Health of Guatemala | https://www.mspas.gob.gt/index.php/noticias/coronavirus-2019-ncov | |
| ✅ | PR | Gobierno de Puerto Rico Departamento de Salud | http://www.salud.gov.pr/Pages/coronavirus.aspx | |
| | SV | Ministry of Health of El Salvador | https://covid19.gob.sv/ | |
| ✅ | VI | United States Virgin Islands Department of Health | https://doh.vi.gov/covid19usvi | |
`parse.number('')`
It's null?
This is a tough one. It seems like it should return null, which will cause a validation error, and if the scraper author wants to return zero for an empty string, they can do so explicitly: `parse.number(parse.string(whatever) || 0)`
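For illustration, a hedged sketch of what such a null-returning `parse.number` might look like; this is not the library's actual implementation:

```javascript
// Hypothetical sketch: return null for empty input so that downstream
// validation can flag the missing value instead of silently recording 0.
function parseNumber (value) {
  const str = String(value == null ? '' : value).trim()
  if (str === '') return null               // let validation catch the gap
  const n = Number(str.replace(/[^0-9.-]/g, ''))  // strip separators like ','
  return Number.isNaN(n) ? null : n
}
```

A scraper that genuinely wants zero for empty input can still opt in explicitly, as the issue suggests, rather than getting zero by default.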
Looks like the St. Louis County data on the MO health site is lagging behind. More accurate data is being linked on the county page:
STL County Covid-19 Home: https://stlouisco.com/Your-Government/County-Executive/COVID-19
Arc Map: https://stlcogis.maps.arcgis.com/apps/MapSeries/index.html?appid=6ae65dea4d804f2ea4f5d8ba79e96df1
@paulboal I noticed you are the maintainer of the MO scraper, so I wanted to bring this to your attention. I'm going to update my forked repo, and will happily create a PR when I'm finished if this source benefits you all as well.
https://www.argentina.gob.ar/coronavirus/informe-diario
This is from the federal government. They are publishing two PDFs per day. "Vespertino" = evening, "Matutino" = morning. They're probably meeting minutes.
Pros:
Cons:
Exact scope on the API is still coming into view; this issue should be for discussing and designing the 1.0 API.
https://sbcph.maps.arcgis.com/apps/opsdashboard/index.html#/44bb35c804c44c8281da6d82ee602dff
San Bernardino County COVID-19 Dashboard
It seems to be as much as a day ahead of the Mercury News.
{^_^}
Scope of the `annotator` event is not yet clear; we need to work closely with @hyperknot to determine the best means for tagging additional datasets (geo, metadata such as population and hospital beds, etc.) to our locations.
For the states DC, VT and NV, `aggregate` is set to `state` in `locations.json`, though county-level data is in the dataset.
See also covidatlas/coronadatascraper#264 and covidatlas/coronadatascraper#312; it seems like `locations.json` was not updated.
To my understanding, `aggregate` matches the type of record (country, state, county, city) if this is the lowest available level of data.
If `aggregate` is `county` on a state record, this is aggregated county data.
A couple of ways this could work:
- an `errors` array as an argument that can be pushed to (feels weird, man)
- `this = { errors: [] }` (breaks a lot of scrapers)
- a `region` array as an argument you can push data to; can throw at any time (i.e. throw at the end of the scraper to indicate a non-fatal error)

Is there anyone who has the right to edit Wikidata articles for counties? Basically it means 50+ edits on Wikidata, which means the account is "autoconfirmed".
Right now I have that level, but it's quite tedious to fix all populations alone and I'd be happy if someone could help me.
Some of the locations are less important and everyone can edit them, like these ones in Panama:
https://www.wikidata.org/wiki/Q217138
Other ones are in the "top 3000" items and only people with confirmed accounts can edit them. But basically editing the less important features would allow someone to get to this autoconfirmed level.
So who would like to help by entering population information?
Need to add missing in:
Mexico
https://coronavirus.gob.mx/, from the federal government's ministry of health.
This is more of a bookmark than anything else - just caching this will be difficult as it seems to be nested more deeply than Argentina.
At the bottom of the site there are some videos / links in some sort of auto-scrolling frame. Each day it appears they have a press conference, and it seems each one gets a page: e.g. April 4. URLs for those appear easy to generate:
https://coronavirus.gob.mx/YYYY/MM/DD/conferencia-D-de-mmm/
where `mmm` is the full month name in Spanish, all lower case, and `D` is the unpadded day of the month.
For example, I spot checked March 4th and it exists:
https://coronavirus.gob.mx/2020/03/04/conferencia-4-de-marzo/
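A sketch of generating those URLs, assuming the pattern holds for every date (only the March 4th URL has been spot-checked):

```javascript
// Generate the daily press-conference URL for coronavirus.gob.mx,
// following the YYYY/MM/DD/conferencia-D-de-mmm pattern described above.
const MONTHS_ES = ['enero', 'febrero', 'marzo', 'abril', 'mayo', 'junio',
  'julio', 'agosto', 'septiembre', 'octubre', 'noviembre', 'diciembre']

function conferenceUrl (date) {
  const yyyy = date.getFullYear()
  const mm = String(date.getMonth() + 1).padStart(2, '0')
  const dd = String(date.getDate()).padStart(2, '0')
  const d = date.getDate()                 // day without zero-padding
  const mmm = MONTHS_ES[date.getMonth()]   // full month name, lower case
  return `https://coronavirus.gob.mx/${yyyy}/${mm}/${dd}/conferencia-${d}-de-${mmm}/`
}
```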
Each of the press conference pages links to a PDF with a link whose text is "Comunicado técnico". URLs for those PDFs seem pretty consistent except for one number I can't decipher. e.g.
https://www.gob.mx/cms/uploads/attachment/file/538947/Comunicado_Tecnico_Diario_COVID-19_2020.03.04.pdf
https://www.gob.mx/cms/uploads/attachment/file/545219/Comunicado_Tecnico_Diario_COVID-19_2020.04.03.pdf
https://www.gob.mx/cms/uploads/attachment/file/545266/Comunicado_Tecnico_Diario_COVID-19_2020.04.04.pdf
The content of the PDF can apparently change. I can't imagine doing anything but manual data entry on this one. Currently our source for Mexico is https://github.com/CSSEGISandData/COVID-19, but at least the more recent PDFs here have death counts per state (though not case counts).
E.g.: the number of deaths is high in New York, but the value reported here is zero.
Data correction for death and recovered cases
The total for New York City isn't matching the sum of the 5 counties for 3/25. Is this because the city/county sources are different?
Description
In coviddatascraper, PR covidatlas/coronadatascraper#835 provides support for ArcGIS data pagination. Some json result sets are too big to return in a single response, so the requests will need to manage that. Presumably, similar to GitHub API, they provide a "nextResultSet" token or similar in the response, and then clients can requery with that as a token.
We'd need to manage that for both crawls and scrapes. Presumably this could be managed with lambdas, but the cache file naming convention will need to be page-aware, and return all files.
Describe the solution you'd like
One possibility: include the page number, indexed from zero, after the cache key (or `name`), e.g., `<datetime>-<name>-<page>-<sha>.<ext>.gz`. If there is only one page (which will be true in most cases), `page` would be 0, there won't be any other data sets, and the thing passed to `scrape` would just be the content.
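The proposed naming could be sketched as follows; this is the proposal under discussion, not a shipped convention:

```javascript
// Build a page-aware cache filename: <datetime>-<name>-<page>-<sha>.<ext>.gz
// Single-page sources simply get page 0.
function cacheFilename (datetime, name, page, sha, ext) {
  return `${datetime}-${name}-${page}-${sha}.${ext}.gz`
}
```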
e.g. AUS has one row for the entire country marked as "state", which should be "country", and it has 6 rows for the states Australian Capital Territory, New South Wales, Northern Territory, Queensland, South Australia and Victoria which are blank but should be marked "state".
If data is provided the above check must be true, or the data is invalid
Colombia (COL)
http://www.ins.gov.co/Noticias/Paginas/Coronavirus.aspx
(National Institute of Health). But see below.
The source URL above has a bunch of Infograms embedded. Each one can be opened in a tab, and then you can snoop the data sources using Chrome's network inspector.
The data is in an array of HTML chunks, e.g.:
```json
[
  "<font face=\"Montserrat, sans-serif\" color=\"#ed1e79\" style=\"font-size: 22px;\"><b>1.485</b></font>",
  "<font face=\"Montserrat, sans-serif\" color=\"\" style=\"font-size: 13px;\">Casos <b>Confirmados en Colombia</b></font>",
  "boyPath"
],
```
Shows 1,485 confirmed cases.
This is a table structured as an array of rows. The header row is:
"ID de caso" - case ID
"Fecha de diagnóstico" - date of diagnosis
"Ciudad de ubicación" - city
"Departamento o Distrito" - state or district (assuming that's a county)
"Atención**" - status. They note that "recuperado" (recovered) requires two negative tests.
"Edad" - age
"Sexo" - gender
"Tipo*" - type of case. "Importado" (which they define as having come from a country with confirmed COVID-19 cases) or "relacionado" (confirmed to have had contact with someone who has COVID-19)
"País de procedencia" - Country considered the source of the infection for this patient
Status can be:
"casa" - self-quarantining at home (I'm assuming, based on what I've seen in other Latin American countries)
"fallecido" - deceased
"recuperado" - recovered; requires two negative tests to confirm.
"hospital" - hospitalized
"hospital UCI" - intensive care
One series is total cases, deaths, and recoveries, the other one is a weekly count of tests processed and test backlog.
I also found some open sources in the ArcGIS hub - https://hub.arcgis.com/search?categories=covid-19&collection=Dataset
You can get JSONs out of all of these.
The license on each of these implies that they are from the same government entity as the Infograms above.
There are different dataset hashes but evidently choosing which data you want is only a function of the number after the underscore.
https://hub.arcgis.com/datasets/esri-colombia::colombia-covid19-coronavirus-procedencia-de-los-casos/data?selectedAttribute=CASOS
CSV: https://opendata.arcgis.com/datasets/3a505d6969c149f98b122fb0a6fd1e7e_4.csv
https://hub.arcgis.com/datasets/esri-colombia::colombia-covid19-coronavirus-departamento/data
CSV: https://opendata.arcgis.com/datasets/ed48c4ce9ca94d5499f1c327f8f532f1_1.csv
https://hub.arcgis.com/datasets/esri-colombia::colombia-covid19-coronavirus-municipio/data
CSV: https://opendata.arcgis.com/datasets/53beb24d21f146c38a42db63c92e3460_0.csv
This is the one we want; includes population, population density, total cases, total active cases, total deaths, and total recovered.
https://hub.arcgis.com/datasets/esri-colombia::colombia-covid19-coronavirus-detalle-de-los-casos/data
CSV: https://opendata.arcgis.com/datasets/0e14099fac45422896d50bd52292faea_3.csv
For the country as a whole; includes new/total cases, deaths, and recoveries.
https://hub.arcgis.com/datasets/esri-colombia::colombia-covid19-coronavirus-casos-diarios/data
CSV: https://opendata.arcgis.com/datasets/782122624f364fbdbd7e287b96c4a358_6.csv
Like the title says: go back in time to get cache files from the Wayback Machine.
The official JHU ISO mapping file didn't contain the right ISO codes for some islands.
These are the following:
Anguilla needs to be added to stateMap
Aruba needs to be added to stateMap
Bermuda needs to be added to stateMap
Bonaire, Sint Eustatius and Saba needs to be added to stateMap
British Virgin Islands needs to be added to stateMap
Cayman Islands needs to be added to stateMap
Channel Islands needs to be added to stateMap
Curacao needs to be added to stateMap
Diamond Princess needs to be added to stateMap
Falkland Islands (Malvinas) needs to be added to stateMap
Faroe Islands needs to be added to stateMap
French Guiana needs to be added to stateMap
French Polynesia needs to be added to stateMap
Gibraltar needs to be added to stateMap
Grand Princess needs to be added to stateMap
Guadeloupe needs to be added to stateMap
Isle of Man needs to be added to stateMap
Martinique needs to be added to stateMap
Mayotte needs to be added to stateMap
Montserrat needs to be added to stateMap
New Caledonia needs to be added to stateMap
Recovered needs to be added to stateMap
Reunion needs to be added to stateMap
Saint Barthelemy needs to be added to stateMap
Saint Pierre and Miquelon needs to be added to stateMap
Sint Maarten needs to be added to stateMap
St Martin needs to be added to stateMap
Turks and Caicos Islands needs to be added to stateMap
You can get this output by running `yarn start -l JHU`.
Task:
- For combined areas like Channel Islands, write it as a list of strings, like `['JE', 'GG']`
- For ships, like Diamond Princess, put `-`
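As a sketch, the requested entries might look like this. Only the Channel Islands and ship conventions come from the task above; the Anguilla value is a guess at the format:

```javascript
// Illustrative stateMap additions (values other than the two conventions
// from the task are assumptions about the target format).
const stateMapAdditions = {
  'Anguilla': 'AI',                 // assumption: plain ISO code
  'Channel Islands': ['JE', 'GG'],  // combined area: a list of strings
  'Diamond Princess': '-',          // ship: put '-'
}
```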
Description
Right now, we repeat logic for checking if a date exists in a timeseries source. NYT, JHU, everyone's logic is similar.
Describe the solution you'd like
processTimeseries(dateColumn, scrapeDate, processFn)
Describe alternatives you've considered
Hand coding the same bugs 4 times.
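A minimal sketch of such a helper; the `rows` parameter is an addition for illustration, and only the other three arguments come from the proposed signature:

```javascript
// Shared date-filtering logic for timeseries sources (NYT, JHU, etc.):
// find the rows matching scrapeDate and hand them to the source's processFn.
function processTimeseries (rows, dateColumn, scrapeDate, processFn) {
  const matching = rows.filter(row => row[dateColumn] === scrapeDate)
  if (matching.length === 0) {
    throw new Error(`no data for ${scrapeDate} in timeseries`)
  }
  return processFn(matching)
}
```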
The Directorate of Health and The Department of Civil Protection and Emergency Management (Government of Iceland)
Meets our minimum requirements for sources. Provides:
Location name: Iceland
URL:
Data page, with charts and links to CSVs (note language selection at top right; the orange button is the cookie acceptance). This appears to simply be embedding this:
https://e.infogram.com/e3205e42-19b3-4e3a-a452-84192884450d
Beneath each chart there's a link to a CSV, but it can't simply be copied and pasted (this is some sort of Tableau-type thing I think)
We generate reports, and downstream consumers are affected by data format changes. (e.g., new fields, see Slack note). Changing the schema impacts them, which may reduce traction for us as well!
If we have a schema and versioning, we can validate, and can report. This could be an automatic script, shouldn't require too much handholding.
Use `ajv` to potentially generate a schema, and save it in `schemas`. Reports could include a "version" field.

The latest date in https://raw.githubusercontent.com/daenuprobst/covid19-cases-switzerland/master/covid19_cases_switzerland.csv is 2020-03-27.
The repo looks to be updated recently.
The repo also says that it's aggregating from another repo:
https://github.com/openZH/covid_19
Maybe we should track the openZH repo instead?
In some scrapers, we're making justifiable assumptions about how to interpret the data (e.g., covidatlas/coronadatascraper#572 - KOR quarantines). For scrapers, we could hardcode these caveats in the scrapers, and perhaps include them in the source output, e.g.:
```json
[
  {
    "county": "Los Angeles County",
    "state": "California",
    "country": "United States",
    ...
    "url": "http://www.publichealth.lacounty.gov/media/Coronavirus/",
    "cases": 0,
    "deaths": 0,
    "caveats": [
      "some_data_here"
    ],
    ...
  }
]
```
Perhaps these assumptions could be rolled up to the higher levels:
```json
"caveats": [
  "LA, CA: some_data_here",
  "PA: penn. caveats here"
]
```
Publicize assumptions
For testing/regression, I don't think we'd need to check the caveats field, as it might change over time. One sanity check would be enough.
They have a timeseries API, we should use that one:
https://covidtracking.com/api
I'm frustrated that my Pandemic Estimator takes a long time fetching the whole dataset, while I display only one location at a time.
I'd like an endpoint for a single location. Instead of https://coronadatascraper.com/timeseries-byLocation.json, something like:
https://coronadatascraper.com/timeseries/location/meta.json
https://coronadatascraper.com/timeseries/location/france.json
https://coronadatascraper.com/timeseries/location/france/normandie.json
In `meta.json` there would be a subset of what's in timeseries-byLocation.json, without the `dates` prop, so that I can populate the dropdown/autofill for the user to select the necessary location.
I've considered splitting timeseries-byLocation.json as part of the dashboard logic, but it's a stupid idea - a lot better to do it as part of this repo so everyone would benefit from it.
Also, I'd love to have aggregation beyond the country level - by region and world total. If you wish to provide that, it should be taken into account when organizing the API paths.
Hungary
Entered daily from the official government website, which uses images.
I suppose a scraper should be counting the same number of states/counties, as well as the same number of data points, day-to-day. We could warn if this goes down, and maybe if it goes up too (though countries may have an incomplete table if some states have no cases yet?).
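A sketch of the warn-on-drop check; names and shape are illustrative:

```javascript
// Warn when a source returns fewer locations than it did the previous day,
// which usually indicates a scraping problem rather than a real change.
function checkLocationCount (sourceName, todayCount, yesterdayCount) {
  const warnings = []
  if (todayCount < yesterdayCount) {
    warnings.push(
      `${sourceName}: location count dropped from ${yesterdayCount} to ${todayCount}`
    )
  }
  return warnings
}
```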
Develop a scraper that can ingest a source list and do nothing with it other than cache.
This would allow non-technical volunteers to vet and contribute sources from around the world so that we can start caching them. Many sources don't have time series data so it's a "race against time" if we want to eventually have temporal data for everything.
As per @chunder's suggestion, I started a spreadsheet (WIP) that this scraper would draw from.
Can't run `./start` without Sandbox; it should warn if it can't run!
All source scrapers (both `crawl` and `scrape`) should be subject to end-to-end integration tests, wherein both are exercised against the live cache or the internet.
Crawl: if a function, should execute and return a valid url or an object containing `{ url, cookie }`
Scrape: should load out of the live production cache and return a well-formed result.
If the cache misses, the integration test runner can invoke a crawl for that source and write it to disk locally to complete the test.
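A rough stub of that flow; function names and shapes are assumptions, and a real runner would hit the live cache and internet rather than an in-memory map:

```javascript
// End-to-end test stub: exercise crawl, then scrape against the cache,
// falling back to a live fetch (fetchAndWrite) on a cache miss.
async function integrationTest (source, cache, fetchAndWrite) {
  const crawl = typeof source.crawl === 'function' ? await source.crawl() : source.crawl
  const url = typeof crawl === 'string' ? crawl : crawl && crawl.url
  if (!url) throw new Error('crawl did not return a valid url')
  let cached = cache.get(url)
  if (cached === undefined) {
    cached = await fetchAndWrite(url)   // cache miss: crawl live, write locally
    cache.set(url, cached)
  }
  return source.scrape(cached)          // should be a well-formed result
}
```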
States with reported deaths that are not in today's data:
Compared to https://coronavirus.1point3acres.com/en
data missing 03-23; 04-01 data repeated on 04-02; 04-03 data significantly lower than 04-02 (04-01); 04-03 data repeated on 04-04 and 04-05; 04-06 data missing.
04-03 data drop possibly related to 03-27 data -- same case number value.
The same source is being used throughout.
cc @camjc
Exact scope on `source` failure reporting is not yet entirely clear; this issue should be for discussion and scoping.
When a `source` suddenly stops reporting data, desired outcomes include:
Currently we are summing up state counts from county counts in the NYT dataset. The numbers do not match their state counts.
We should use their state level dataset and not sum it up ourselves.
In `staging` + `production`:
- `crawler` should write cache data to S3
- `scraper` should read cache data from S3

Locally:
- `crawler` should write cache data locally
- `scraper` should attempt to read cache data from S3, and fall back to local data sources

We should report if it has not been updated at all. This would catch errors like the NJ dataset changing URLs but leaving the old one accessible.
We should look at migrating independent cities from being listed as counties to being listed as cities at some point. This includes but is not limited to: Baltimore City, St. Louis City, and some 38 cities in VA.
Not sure of the value of this, but we could look at the day-on-day multiplier and make sure it's under some threshold. I guess look at historical data and add some padding. E.g. if cases go up 10x in one day, there's potentially some weird scraping.
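Sketched below; the 10x figure comes from this issue, and any real threshold would come from historical data plus padding:

```javascript
// Flag suspicious day-on-day growth in a reported count.
// A multiplier above the threshold suggests a scraping problem.
function suspiciousGrowth (yesterdayCases, todayCases, threshold = 10) {
  if (!yesterdayCases) return false   // can't compute a multiplier from zero
  return todayCases / yesterdayCases > threshold
}
```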
Many patterns are starting to emerge in the way data is stored and parsed, such as tables with left-hand labels, etc.
Though it's likely that many of these will still have to be handled case-by-case, they should be generalized wherever possible in `lib/parse.js` as configurable functions.
In Li, what used to be called `scrapers` are now called `sources`, and they live in `src/shared/sources`.
The shape has changed, but the core scraper logic should largely remain the same. The new source shape needs docs. All sources have a simple unit test validation pass prior to commit (see below).
`npm run migration:status` gives a report:
```
MacBook-Air:li jeff$ npm run migration:status

> [email protected] migration:status /Users/jeff/Documents/Projects/li
> node tools/report-migration-status.js

Getting commits in /Users/jeff/Documents/Projects/coronadatascraper/src/shared/scrapers for 156 files from covidatlas/coronadatascraper.git/master ...
... done.
Getting commits in /Users/jeff/Documents/Projects/li/src/shared/sources for 10 files from covidatlas/li.git/master ...
... done.

========================================
key      CDS path         li?   up-to-date?
---      --------         ---   -----------

DONE (7)
--------
au       AU/index.js      yes   yes
gb-sct   GB/SCT/index.js  yes   yes
in       IN/index.js      yes   yes
...

NEEDS UPDATING (0)
------------------

REMAINING (149)
---------------
at       AT/index.js      no    -
au-act   AU/ACT/index.js  no    -
...
```
For `src/shared/sources`:
- the `scrape` function should be sync; an async `scrape` function is likely an antipattern, and should be justifiable
- `scrape` functions should not call to the internet for anything; if they need to, please let me know and we'll figure out how to make that generic
- `scrape` functions should not have unique external dependencies; again, if they need to, please let me know and we'll figure out how to make that generic

We may want to run our existing scrapers through a script to parse, move things around, and output with something like escodegen; if so, please do not put those files into `src/shared/sources` – that is the production sources directory, and only known (or expected)-working sources should live there.
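For orientation, here is a hedged sketch of a minimal source following those rules. The exact source shape is documented elsewhere; the property names and structure below are illustrative assumptions, not the definitive format:

```javascript
// Illustrative minimal source: a sync scrape with no network calls
// and no unique external dependencies.
const exampleSource = {
  country: 'iso1:XX',   // hypothetical location key
  friendly: { name: 'Example Health Dept', url: 'https://example.com' },
  scrapers: [
    {
      startDate: '2020-03-01',
      crawl: [{ type: 'json', url: 'https://example.com/covid.json' }],
      scrape (json) {
        // sync, pure transformation of the cached payload
        return { cases: json.cases, deaths: json.deaths }
      }
    }
  ]
}

module.exports = exampleSource
```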