covidatlas / li

Next-generation serverless crawler for COVID-19 data

License: Apache License 2.0

```shell
npm i
npm start
```
We have schema definitions for scrapers, but not for what they output. We should enforce things like administrative levels and measured quantities (e.g. cases, deaths, etc.).
Along with this, I propose we write a scraper skeleton, with a list of data requirements and pseudo-code (or a code template of some sort) showing what needs to be done and when.
Schema should be validated in `yarn test`.
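To make that concrete, here is a minimal sketch of output validation that could run under `yarn test`. The allowed levels and field names below are assumptions for illustration, not the project's actual schema:

```javascript
// Sketch of scraper-output validation (hypothetical field names).
// Enforces allowed administrative levels and integer measured quantities.
const ALLOWED_LEVELS = ['country', 'state', 'county', 'city']
const NUMERIC_FIELDS = ['cases', 'deaths', 'recovered', 'tested']

function validateOutput (record) {
  const errors = []
  if (!ALLOWED_LEVELS.includes(record.aggregate)) {
    errors.push(`invalid administrative level: ${record.aggregate}`)
  }
  for (const field of NUMERIC_FIELDS) {
    if (field in record && !Number.isInteger(record[field])) {
      errors.push(`${field} must be an integer, got ${record[field]}`)
    }
  }
  return errors
}
```

A real implementation would likely delegate to a JSON Schema validator such as `ajv` (mentioned elsewhere in these issues) rather than hand-rolling the checks.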
There are a number of significant events that affect data, either directly or in a delayed fashion, and may impact projections. Things like:
Add hourly scheduled task runner event (to run off the `invokes` table) for operating the `crawler` and `scraper` events.
The "flattening the curve" approach is built on the concept of medical system capacity which, when exceeded, leads to a substantial collapse of care capacity and, in turn, skyrocketing deaths.
It would be great to look for sources of data on the capacity of the medical system, and since it's changing (regions are pushing to increase it, to give more room for the flattened curve), it should be a data point per date.
❌ Arunachal Pradesh, iso1:IN: ?
❌ Assam, iso1:IN: ?
❌ Jharkhand, iso1:IN
Today's (28 March) version of the timeseries file doesn't distinguish data for New York State vs. New York City. Here are the two rows for yesterday (27 March) that match `state == 'NY'` and `is.na(county)`:
| city | county | state | country | population | lat | long | url | cases | deaths | recovered | active | tested | growthFactor | date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NA | NA | NY | USA | 19453561 | 42.76081 | -75.84097 | https://coronavirus.health.ny.gov/county-county-breakdown-positive-cases | 44635 | NA | NA | NA | NA | 1.197998 | 2020-03-27 |
| NA | NA | NY | USA | 8398748 | 40.70684 | -73.97834 | https://coronavirus.health.ny.gov/county-county-breakdown-positive-cases | 25398 | NA | NA | NA | NA | 1.187211 | 2020-03-27 |
Based on `population`, the top row is for the entire state and the second is for NYC. Perhaps someone was trying to fix covidatlas/coronadatascraper#399 and accidentally set `city` to `NA`?
Below is the list of sources (outside of the US) that we are currently aware of. If you know of a source that is not on this list, please file an issue, and we will update this document!
Status definition:
✅ Source is actively scraped by this project
🏗️ Someone is working on a scraper for this source
🐛 Source is buggy, needs fixing
ArcGIS maintains a list of all dashboards by country: https://www.arcgis.com/apps/opsdashboard/index.html#/a9419e61cb6f4521a15baf78be309b35
Contains data for a number of Latin American countries:
https://github.com/DataScienceResearchPeru/covid-19_latinoamerica
| Status | ISO Code | Name | URL | Notes |
|---|---|---|---|---|
| | TN | Tunisia Ministry of Health | https://services6.arcgis.com/BiTAc9ApDDtL9okN/arcgis/rest/services/Statistiques_par_gouvernorat_(nouvelle_donn%C3%A9e)/FeatureServer/0/query | |
| ✅ | ZA | COVID 19 Data for South Africa | https://github.com/dsfsi/covid19za | |
| Status | ISO Code | Name | URL | Notes |
|---|---|---|---|---|
| ✅ | BR | Secretaria de Vigilância em Saúde do Ministério da Saúde | https://covid.saude.gov.br/ | |
| ✅ | CA | Public Health Agency of Canada | https://health-infobase.canada.ca/src/data/covidLive/covid19.csv | |
| | CA | | https://resources-covid19canada.hub.arcgis.com/app/82e586188b7049e1896b771cd4875815 | Provides data at the health district level |
| 🏗️ covidatlas/coronadatascraper#788 | CA-NS | Government of Nova Scotia | https://novascotia.ca/coronavirus/data/COVID-19-data.csv | |
| | GT | Ministry of Health of Guatemala | https://www.mspas.gob.gt/index.php/noticias/coronavirus-2019-ncov | |
| ✅ | PR | Gobierno de Puerto Rico Departamento de Salud | http://www.salud.gov.pr/Pages/coronavirus.aspx | |
| | SV | Ministry of Health of El Salvador | https://covid19.gob.sv/ | |
| ✅ | VI | United States Virgin Islands Department of Health | https://doh.vi.gov/covid19usvi | |
`parse.number('')`
It's null?
This is a tough one. It seems like it should return null, which will cause a validation error, and if the scraper author wants to return zero for an empty string, they can do so explicitly: `parse.number(parse.string(whatever) || 0)`
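For illustration, a hedged sketch of what such a null-returning `parse.number` might look like; this is not the library's actual implementation:

```javascript
// Hypothetical sketch: return null for empty input so that downstream
// validation can flag the missing value instead of silently recording 0.
function parseNumber (value) {
  const str = String(value == null ? '' : value).trim()
  if (str === '') return null               // let validation catch the gap
  const n = Number(str.replace(/[^0-9.-]/g, ''))  // strip separators like ','
  return Number.isNaN(n) ? null : n
}
```

A scraper that genuinely wants zero for empty input can still opt in explicitly, as the issue suggests, rather than getting zero by default.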
Looks like the St. Louis County data on the MO health site is lagging behind. More accurate data is being linked on the county page:
STL County Covid-19 Home: https://stlouisco.com/Your-Government/County-Executive/COVID-19
Arc Map: https://stlcogis.maps.arcgis.com/apps/MapSeries/index.html?appid=6ae65dea4d804f2ea4f5d8ba79e96df1
@paulboal I noticed you are the maintainer of the MO scraper, so I wanted to bring this to your attention. I'm going to update my forked repo, and will happily create a PR when I'm finished if this source benefits you all as well.
https://www.argentina.gob.ar/coronavirus/informe-diario
This is from the federal government. They are publishing two PDFs per day. "Vespertino" = evening, "Matutino" = morning. They're probably meeting minutes.
Pros:
Cons:
Exact scope on the API is still coming into view; this issue should be for discussing and designing the 1.0 API.
https://sbcph.maps.arcgis.com/apps/opsdashboard/index.html#/44bb35c804c44c8281da6d82ee602dff
San Bernardino County COVID-19 Dashboard
It seems to be as much as a day ahead of the Mercury News.
{^_^}
Scope of the `annotator` event is not yet clear; we need to work closely with @hyperknot to determine the best means for tagging additional datasets (geo, metadata such as population and hospital beds, etc.) to our locations.
For the states DC, VT and NV, `aggregate` is set to `state` in `locations.json`, though county-level data is in the dataset.
See also covidatlas/coronadatascraper#264 and covidatlas/coronadatascraper#312; it seems like `locations.json` was not updated.
To my understanding, `aggregate` matches the type of record (country, state, county, city) if this is the lowest available level of data.
If `aggregate` is `county` on a state record, this is aggregated county data.
A couple of ways this could work:
- an `errors` array as an argument that can be pushed to (feels weird, man)
- `this = { errors: [] }` (breaks a lot of scrapers)
- a `region` array as an argument you can push data to; can throw at any time (i.e. throw at the end of the scraper to indicate a non-fatal error)

Is there anyone who has the right to edit Wikidata articles for counties? Basically it means 50+ edits on Wikidata, which means the account is "autoconfirmed".
Right now I have that level, but it's quite tedious to fix all populations alone and I'd be happy if someone could help me.
Some of the locations are less important and everyone can edit them, like these ones in Panama:
https://www.wikidata.org/wiki/Q217138
Other ones are in the "top 3000" items and only people with confirmed accounts can edit them. But basically editing the less important features would allow someone to get to this autoconfirmed level.
So who would like to help by entering population information?
Need to add missing in:
Mexico
https://coronavirus.gob.mx/, from the federal government's ministry of health.
This is more of a bookmark than anything else - just caching this will be difficult as it seems to be nested more deeply than Argentina.
At the bottom of the site there are some videos / links in some sort of auto-scrolling frame. Each day it appears they have a press conference, and it seems each one gets a page: e.g. April 4. URLs for those appear easy to generate:
https://coronavirus.gob.mx/YYYY/MM/DD/conferencia-D-de-mmm/
where `mmm` is the full month name in Spanish, all lower case, and `D` is the unpadded day of the month.
For example, I spot checked March 4th and it exists:
https://coronavirus.gob.mx/2020/03/04/conferencia-4-de-marzo/
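A sketch of generating those URLs, assuming the pattern holds for every date (only the March 4th URL has been spot-checked):

```javascript
// Generate the daily press-conference URL for coronavirus.gob.mx,
// following the YYYY/MM/DD/conferencia-D-de-mmm pattern described above.
const MONTHS_ES = ['enero', 'febrero', 'marzo', 'abril', 'mayo', 'junio',
  'julio', 'agosto', 'septiembre', 'octubre', 'noviembre', 'diciembre']

function conferenceUrl (date) {
  const yyyy = date.getFullYear()
  const mm = String(date.getMonth() + 1).padStart(2, '0')
  const dd = String(date.getDate()).padStart(2, '0')
  const d = date.getDate()                 // day without zero-padding
  const mmm = MONTHS_ES[date.getMonth()]   // full month name, lower case
  return `https://coronavirus.gob.mx/${yyyy}/${mm}/${dd}/conferencia-${d}-de-${mmm}/`
}
```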
Each of the press conference pages links to a PDF with a link whose text is "Comunicado técnico". URLs for those PDFs seem pretty consistent except for one number I can't decipher. e.g.
https://www.gob.mx/cms/uploads/attachment/file/538947/Comunicado_Tecnico_Diario_COVID-19_2020.03.04.pdf
https://www.gob.mx/cms/uploads/attachment/file/545219/Comunicado_Tecnico_Diario_COVID-19_2020.04.03.pdf
https://www.gob.mx/cms/uploads/attachment/file/545266/Comunicado_Tecnico_Diario_COVID-19_2020.04.04.pdf
The content of the PDF can apparently change. I can't imagine doing anything but manual data entry on this one. Currently our source for Mexico is https://github.com/CSSEGISandData/COVID-19, but at least the more recent PDFs here have death counts per state (though not case counts).
E.g.: the number of deaths is high in New York, but the value reported here is zero.
Data correction for death and recovered cases
The total for New York City isn't matching the sum of the 5 counties for 3/25. Is this because the city/county sources are different?
Description
In coviddatascraper, PR covidatlas/coronadatascraper#835 provides support for ArcGIS data pagination. Some json result sets are too big to return in a single response, so the requests will need to manage that. Presumably, similar to GitHub API, they provide a "nextResultSet" token or similar in the response, and then clients can requery with that as a token.
We'd need to manage that for both crawls and scrapes. Presumably this could be managed with lambdas, but the cache file naming convention will need to be page-aware, and return all files.
Describe the solution you'd like
One possibility: include the page number, indexed from zero, after the cache key (or `name`), e.g., `<datetime>-<name>-<page>-<sha>.<ext>.gz`. If there is only one page (which will be true in most cases), `page` would be 0, there won't be any other data sets, and the thing passed to `scrape` would just be the content.
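The proposed naming could be sketched as follows; this is the proposal under discussion, not a shipped convention:

```javascript
// Build a page-aware cache filename: <datetime>-<name>-<page>-<sha>.<ext>.gz
// Single-page sources simply get page 0.
function cacheFilename (datetime, name, page, sha, ext) {
  return `${datetime}-${name}-${page}-${sha}.${ext}.gz`
}
```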
e.g. AUS has one row for the entire country marked as "state", which should be "country", and it has 6 rows for the states Australian Capital Territory, New South Wales, Northern Territory, Queensland, South Australia and Victoria which are blank but should be marked "state".
If data is provided the above check must be true, or the data is invalid
Colombia (COL)
http://www.ins.gov.co/Noticias/Paginas/Coronavirus.aspx
(National Institute of Health). But see below.
The source URL above has a bunch of Infograms embedded. Each one can be opened in a tab, and then you can snoop the data sources using Chrome's network inspector.
The data is in an array of HTML chunks, e.g.:
```json
[
  "<font face=\"Montserrat, sans-serif\" color=\"#ed1e79\" style=\"font-size: 22px;\"><b>1.485</b></font>",
  "<font face=\"Montserrat, sans-serif\" color=\"\" style=\"font-size: 13px;\">Casos <b>Confirmados en Colombia</b></font>",
  "boyPath"
],
```
Shows 1,485 confirmed cases.
This is a table structured as an array of rows. The header row is:
"ID de caso" - case ID
"Fecha de diagnóstico" - date of diagnosis
"Ciudad de ubicación" - city
"Departamento o Distrito" - state or district (assuming that's a county)
"Atención**" - status. They note that "recuperado" (recovered) requires two negative tests.
"Edad" - age
"Sexo" - gender
"Tipo*" - type of case. "Importado" (which they define as having come from a country with confirmed COVID-19 cases) or "relacionado" (confirmed to have had contact with someone who has COVID-19)
"País de procedencia" - Country considered the source of the infection for this patient
Status can be:
"casa" - self-quarantining at home (I'm assuming, based on what I've seen in other Latin American countries)
"fallecido" - deceased
"recuperado" - recovered; requires two negative tests to confirm.
"hospital" - hospitalized
"hospital UCI" - intensive care
One series is total cases, deaths, and recoveries, the other one is a weekly count of tests processed and test backlog.
I also found some open sources in the ArcGIS hub - https://hub.arcgis.com/search?categories=covid-19&collection=Dataset
You can get JSONs out of all of these.
The license on each of these implies that they are from the same government entity as the Infograms above.
There are different dataset hashes but evidently choosing which data you want is only a function of the number after the underscore.
https://hub.arcgis.com/datasets/esri-colombia::colombia-covid19-coronavirus-procedencia-de-los-casos/data?selectedAttribute=CASOS
CSV: https://opendata.arcgis.com/datasets/3a505d6969c149f98b122fb0a6fd1e7e_4.csv
https://hub.arcgis.com/datasets/esri-colombia::colombia-covid19-coronavirus-departamento/data
CSV: https://opendata.arcgis.com/datasets/ed48c4ce9ca94d5499f1c327f8f532f1_1.csv
https://hub.arcgis.com/datasets/esri-colombia::colombia-covid19-coronavirus-municipio/data
CSV: https://opendata.arcgis.com/datasets/53beb24d21f146c38a42db63c92e3460_0.csv
This is the one we want; includes population, population density, total cases, total active cases, total deaths, and total recovered.
https://hub.arcgis.com/datasets/esri-colombia::colombia-covid19-coronavirus-detalle-de-los-casos/data
CSV: https://opendata.arcgis.com/datasets/0e14099fac45422896d50bd52292faea_3.csv
For the country as a whole; includes new/total cases, deaths, and recoveries.
https://hub.arcgis.com/datasets/esri-colombia::colombia-covid19-coronavirus-casos-diarios/data
CSV: https://opendata.arcgis.com/datasets/782122624f364fbdbd7e287b96c4a358_6.csv
Like the title says: go back in time to get cache files from the Wayback Machine.
The official JHU ISO mapping file didn't contain the right ISO codes for some islands.
These are the following:
Anguilla needs to be added to stateMap
Aruba needs to be added to stateMap
Bermuda needs to be added to stateMap
Bonaire, Sint Eustatius and Saba needs to be added to stateMap
British Virgin Islands needs to be added to stateMap
Cayman Islands needs to be added to stateMap
Channel Islands needs to be added to stateMap
Curacao needs to be added to stateMap
Diamond Princess needs to be added to stateMap
Falkland Islands (Malvinas) needs to be added to stateMap
Faroe Islands needs to be added to stateMap
French Guiana needs to be added to stateMap
French Polynesia needs to be added to stateMap
Gibraltar needs to be added to stateMap
Grand Princess needs to be added to stateMap
Guadeloupe needs to be added to stateMap
Isle of Man needs to be added to stateMap
Martinique needs to be added to stateMap
Mayotte needs to be added to stateMap
Montserrat needs to be added to stateMap
New Caledonia needs to be added to stateMap
Recovered needs to be added to stateMap
Reunion needs to be added to stateMap
Saint Barthelemy needs to be added to stateMap
Saint Pierre and Miquelon needs to be added to stateMap
Sint Maarten needs to be added to stateMap
St Martin needs to be added to stateMap
Turks and Caicos Islands needs to be added to stateMap
You can get this output by running `yarn start -l JHU`.
Task:
- For combined areas like Channel Islands, write it as a list of strings, like `['JE', 'GG']`
- For ships, like Diamond Princess, put `-`
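As a sketch, the requested entries might look like this. Only the Channel Islands and ship conventions come from the task above; the Anguilla value is a guess at the format:

```javascript
// Illustrative stateMap additions (values other than the two conventions
// from the task are assumptions about the target format).
const stateMapAdditions = {
  'Anguilla': 'AI',                 // assumption: plain ISO code
  'Channel Islands': ['JE', 'GG'],  // combined area: a list of strings
  'Diamond Princess': '-',          // ship: put '-'
}
```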
Description
Right now, we repeat logic for checking if a date exists in a timeseries source. NYT, JHU, everyone's logic is similar.
Describe the solution you'd like
processTimeseries(dateColumn, scrapeDate, processFn)
Describe alternatives you've considered
Hand coding the same bugs 4 times.
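A minimal sketch of such a helper; the `rows` parameter is an addition for illustration, and only the other three arguments come from the proposed signature:

```javascript
// Shared date-filtering logic for timeseries sources (NYT, JHU, etc.):
// find the rows matching scrapeDate and hand them to the source's processFn.
function processTimeseries (rows, dateColumn, scrapeDate, processFn) {
  const matching = rows.filter(row => row[dateColumn] === scrapeDate)
  if (matching.length === 0) {
    throw new Error(`no data for ${scrapeDate} in timeseries`)
  }
  return processFn(matching)
}
```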
The Directorate of Health and The Department of Civil Protection and Emergency Management (Government of Iceland)
Meets our minimum requirements for sources. Provides:
Location name: Iceland
URL:
Data page, with charts and links to CSVs (note language selection at top right; the orange button is the cookie acceptance). This appears to simply be embedding this:
https://e.infogram.com/e3205e42-19b3-4e3a-a452-84192884450d
Beneath each chart there's a link to a CSV, but it can't simply be copied and pasted (this is some sort of Tableau-type thing I think)
We generate reports, and downstream consumers are affected by data format changes. (e.g., new fields, see Slack note). Changing the schema impacts them, which may reduce traction for us as well!
If we have a schema and versioning, we can validate, and can report. This could be an automatic script, shouldn't require too much handholding.
Use `ajv` to potentially generate a schema, and save it in `schemas`. Reports could include a "version" field.

The latest date in https://raw.githubusercontent.com/daenuprobst/covid19-cases-switzerland/master/covid19_cases_switzerland.csv is 2020-03-27.
The repo looks to be updated recently.
The repo also says that it's aggregating from another repo:
https://github.com/openZH/covid_19
Maybe we should track the openZH repo instead?
In some scrapers, we're making justifiable assumptions about how to interpret the data (e.g., covidatlas/coronadatascraper#572 - KOR quarantines). For scrapers, we could hardcode these caveats in the scrapers, and perhaps include them in the source output, e.g.:
```json
[
  {
    "county": "Los Angeles County",
    "state": "California",
    "country": "United States",
    ...
    "url": "http://www.publichealth.lacounty.gov/media/Coronavirus/",
    "cases": 0,
    "deaths": 0,
    "caveats": [
      "some_data_here"
    ],
    ...
  }
]
```
Perhaps these assumptions could be rolled up to the higher levels:
```json
"caveats": [
  "LA, CA: some_data_here",
  "PA: penn. caveats here"
]
```
Publicize assumptions
For testing/regression, I don't think we'd need to check the caveats field, as it might change over time. One sanity check would be enough.
They have a timeseries API, we should use that one:
https://covidtracking.com/api
I'm frustrated that my Pandemic Estimator takes a long time fetching the whole dataset, while I display only one location at a time.
I'd like an endpoint for a single location. Instead of https://coronadatascraper.com/timeseries-byLocation.json, something like:
https://coronadatascraper.com/timeseries/location/meta.json
https://coronadatascraper.com/timeseries/location/france.json
https://coronadatascraper.com/timeseries/location/france/normandie.json
In `meta.json` there would be a subset of what's in timeseries-byLocation.json, without the `dates` prop, so that I can populate the dropdown/autofill for the user to select the necessary location.
I've considered splitting timeseries-byLocation.json as part of the dashboard logic, but it's a stupid idea - a lot better to do it as part of this repo so everyone would benefit from it.
Also, I'd love to have aggregation beyond the country level - by region and world total. If you wish to provide that, it should be taken into account when organizing the API paths.
Hungary
Entered daily from the official government website, which uses images.
I suppose a scraper should be counting the same number of states/counties, as well as the same number of data points, day-to-day. We could warn if this goes down, and maybe if it goes up too (though countries may have an incomplete table if some states have no cases yet?).
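A sketch of the warn-on-drop check; names and shape are illustrative:

```javascript
// Warn when a source returns fewer locations than it did the previous day,
// which usually indicates a scraping problem rather than a real change.
function checkLocationCount (sourceName, todayCount, yesterdayCount) {
  const warnings = []
  if (todayCount < yesterdayCount) {
    warnings.push(
      `${sourceName}: location count dropped from ${yesterdayCount} to ${todayCount}`
    )
  }
  return warnings
}
```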
Develop a scraper that can ingest a source list and do nothing with it other than cache.
This would allow non-technical volunteers to vet and contribute sources from around the world so that we can start caching them. Many sources don't have time series data so it's a "race against time" if we want to eventually have temporal data for everything.
As per @chunder's suggestion, I started a spreadsheet (WIP) that this scraper would draw from.
Can't run `./start` without Sandbox; it should warn if it can't run!
All source scrapers (both `crawl` and `scrape`) should be subject to end-to-end integration tests, wherein both are exercised against the live cache or the internet.
Crawl: if a function, should execute and return a valid url or an object containing `{ url, cookie }`
Scrape: should load out of the live production cache and return a well-formed result.
If the cache misses, the integration test runner can invoke a crawl for that source and write it to disk locally to complete the test.
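A rough stub of that flow; function names and shapes are assumptions, and a real runner would hit the live cache and internet rather than an in-memory map:

```javascript
// End-to-end test stub: exercise crawl, then scrape against the cache,
// falling back to a live fetch (fetchAndWrite) on a cache miss.
async function integrationTest (source, cache, fetchAndWrite) {
  const crawl = typeof source.crawl === 'function' ? await source.crawl() : source.crawl
  const url = typeof crawl === 'string' ? crawl : crawl && crawl.url
  if (!url) throw new Error('crawl did not return a valid url')
  let cached = cache.get(url)
  if (cached === undefined) {
    cached = await fetchAndWrite(url)   // cache miss: crawl live, write locally
    cache.set(url, cached)
  }
  return source.scrape(cached)          // should be a well-formed result
}
```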
States with reported deaths that are not in today's data:
Compared to https://coronavirus.1point3acres.com/en
data missing 03-23; 04-01 data repeated on 04-02; 04-03 data significantly lower than 04-02 (04-01); 04-03 data repeated on 04-04 and 04-05; 04-06 data missing.
04-03 data drop possibly related to 03-27 data -- same case number value.
The same source is being used throughout.
cc @camjc
Exact scope on `source` failure reporting is not yet entirely clear; this issue should be for discussion and scoping.
When a `source` suddenly stops reporting data, desired outcomes include:
Currently we are summing up state counts from county counts in the NYT dataset. The numbers do not match their state counts.
We should use their state level dataset and not sum it up ourselves.
In `staging` + `production`:
- `crawler` should write cache data to S3
- `scraper` should read cache data from S3

Locally:
- `crawler` should write cache data locally
- `scraper` should attempt to read cache data from S3, and fall back to local data sources

We should report if it has not been updated at all. This would catch errors like the NJ dataset changing URLs but leaving the old one accessible.
We should look at migrating independent cities from being listed as counties to being listed as cities at some point. This includes but is not limited to: Baltimore City, St. Louis City, and some 38 cities in VA.
Not sure of the value of this, but we could look at the day-on-day multiplier and make sure it's under some threshold. I guess look at historical data and add some padding. E.g. if cases go up 10x in one day, there's potentially some weird scraping.
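Sketched below; the 10x figure comes from this issue, and any real threshold would come from historical data plus padding:

```javascript
// Flag suspicious day-on-day growth in a reported count.
// A multiplier above the threshold suggests a scraping problem.
function suspiciousGrowth (yesterdayCases, todayCases, threshold = 10) {
  if (!yesterdayCases) return false   // can't compute a multiplier from zero
  return todayCases / yesterdayCases > threshold
}
```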
Many patterns are starting to emerge in the way data is stored and parsed, such as tables with left-hand labels, etc.
Though it's likely that many of these will still have to be handled case-by-case, they should be generalized wherever possible in `lib/parse.js` as configurable functions.
In Li, what used to be called `scrapers` are now called `sources`, and they live in `src/shared/sources`.
The shape has changed, but the core scraper logic should largely remain the same. The new source shape needs docs. All sources have a simple unit test validation pass prior to commit (see below).
`npm run migration:status` gives a report:
```
MacBook-Air:li jeff$ npm run migration:status

> [email protected] migration:status /Users/jeff/Documents/Projects/li
> node tools/report-migration-status.js

Getting commits in /Users/jeff/Documents/Projects/coronadatascraper/src/shared/scrapers for 156 files from covidatlas/coronadatascraper.git/master ...
... done.
Getting commits in /Users/jeff/Documents/Projects/li/src/shared/sources for 10 files from covidatlas/li.git/master ...
... done.

========================================
key      CDS path         li?   up-to-date?
---      --------         ---   -----------

DONE (7)
--------
au       AU/index.js      yes   yes
gb-sct   GB/SCT/index.js  yes   yes
in       IN/index.js      yes   yes
...

NEEDS UPDATING (0)
------------------

REMAINING (149)
---------------
at       AT/index.js      no    -
au-act   AU/ACT/index.js  no    -
...
```
For `src/shared/sources`:
- the `scrape` function should be sync; an async `scrape` function is likely an antipattern, and should be justifiable
- `scrape` functions should not call to the internet for anything; if they need to, please let me know and we'll figure out how to make that generic
- `scrape` functions should not have unique external dependencies; again, if they need to, please let me know and we'll figure out how to make that generic

We may want to run our existing scrapers through a script to parse, move things around, and output with something like escodegen; if so, please do not put those files into `src/shared/sources` – that is the production sources directory, and only known (or expected)-working sources should live there.
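For orientation, here is a hedged sketch of a minimal source following those rules. The exact source shape is documented elsewhere; the property names and structure below are illustrative assumptions, not the definitive format:

```javascript
// Illustrative minimal source: a sync scrape with no network calls
// and no unique external dependencies.
const exampleSource = {
  country: 'iso1:XX',   // hypothetical location key
  friendly: { name: 'Example Health Dept', url: 'https://example.com' },
  scrapers: [
    {
      startDate: '2020-03-01',
      crawl: [{ type: 'json', url: 'https://example.com/covid.json' }],
      scrape (json) {
        // sync, pure transformation of the cached payload
        return { cases: json.cases, deaths: json.deaths }
      }
    }
  ]
}

module.exports = exampleSource
```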