
li's Issues

Develop a schema for scraper output

We have schema definitions for scrapers, but not for what they output. We should enforce things like administrative levels and measured quantities (e.g. cases, deaths, etc.).

Along with this, I propose we write a scraper skeleton, with a list of data requirements and pseudo-code (or a code template of some sort) showing what needs to be done and when.

Schema should be validated in yarn test.

@jzohrab @shaperilio
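A minimal sketch of what the `yarn test` validation could look like. The field names and rules here are assumptions for illustration; the real schema would live alongside the existing scraper schema definitions.

```javascript
// Hypothetical output schema: field names and rules are illustrative only.
const outputSchema = {
  state: { type: 'string', required: true },
  county: { type: 'string', required: false },
  cases: { type: 'number', required: true },
  deaths: { type: 'number', required: false }
}

// Return a list of human-readable validation errors for one scraper record.
function validateOutput (record) {
  const errors = []
  for (const [ field, rule ] of Object.entries(outputSchema)) {
    const value = record[field]
    if (value === undefined) {
      if (rule.required) errors.push(`missing required field: ${field}`)
      continue
    }
    if (typeof value !== rule.type) errors.push(`${field} should be ${rule.type}`)
  }
  return errors
}

console.log(validateOutput({ state: 'CA', cases: 42 }))     // []
console.log(validateOutput({ state: 'CA', cases: 'lots' })) // [ 'cases should be number' ]
```

A real implementation would likely use a JSON Schema validator instead of a hand-rolled check, but the test-time hook would be the same: fail the suite when any record produces a non-empty error list.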

Significant events data

There are a number of significant events that affect the data, either directly or in a delayed fashion, and that may impact projections. Things like:

  • Social distancing imposed
  • Shelter-in-place imposed
  • Lockdown imposed
  • Testing policy change (only performed when result would alter treatment)
  • Mass mask use recommended/advised/required

Add task `runner` event

Add an hourly scheduled task runner event (running off the invokes table) for operating the crawler and scraper events.

Hospital beds capacity data

The flattening-the-curve approach rests on the concept of medical system capacity: when it is exceeded, the capacity to treat patients substantially collapses and, in turn, deaths skyrocket.

It would be great to look for data sources that provide information on the capacity of the medical system. Since that capacity changes over time (regions are pushing to increase it, to make more room under the flattened curve), it should be a data point per date.

Timeseries file does not distinguish New York State and New York City

Today's (28 March) version of the timeseries file doesn't distinguish data for New York State vs. New York City. Here are the two rows for yesterday (27 March) that match state == 'NY' and is.na(county) :

| city | county | state | country | population | lat | long | url | cases | deaths | recovered | active | tested | growthFactor | date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NA | NA | NY | USA | 19453561 | 42.76081 | -75.84097 | https://coronavirus.health.ny.gov/county-county-breakdown-positive-cases | 44635 | NA | NA | NA | NA | 1.197998 | 2020-03-27 |
| NA | NA | NY | USA | 8398748 | 40.70684 | -73.97834 | https://coronavirus.health.ny.gov/county-county-breakdown-positive-cases | 25398 | NA | NA | NA | NA | 1.187211 | 2020-03-27 |

Based on population, the top row is for the entire state and the second is for NYC. Perhaps someone was trying to fix covidatlas/coronadatascraper#399 and accidentally set city to NA?

List of known data sources

Below is the list of sources (outside of the US) that we are currently aware of. If you know of a source that is not on this list, please file an issue and we will update this document!

Status definition:

✅ Source is actively scraped by this project
🏗️ Someone is working on a scraper for this source
🐛 Source is buggy, needs fixing

Aggregate sources

ArcGIS maintains a list of all dashboards by country: https://www.arcgis.com/apps/opsdashboard/index.html#/a9419e61cb6f4521a15baf78be309b35

Contains data for a number of Latin American countries:
https://github.com/DataScienceResearchPeru/covid-19_latinoamerica

Africa

| Status | ISO Code | Name | URL | Notes |
| --- | --- | --- | --- | --- |
|  | TN | Tunisia Ministry of Health | https://services6.arcgis.com/BiTAc9ApDDtL9okN/arcgis/rest/services/Statistiques_par_gouvernorat_(nouvelle_donn%C3%A9e)/FeatureServer/0/query |  |
|  | ZA | COVID 19 Data for South Africa | https://github.com/dsfsi/covid19za |  |

Asia/Pacific

| Status | ISO Code | Name | URL | Notes |
| --- | --- | --- | --- | --- |
|  | AF | Afghanistan COVID-19 Stats by Province | https://docs.google.com/spreadsheets/d/1F-AMEDtqK78EA6LYME2oOsWQsgJi4CT3V_G4Uo-47Rg/edit#gid=2026243039 | Need to verify source |
|  | AU-ACT | ACT Government Health | https://www.health.act.gov.au/about-our-health-system/novel-coronavirus-covid-19 |  |
|  | AU | WA Health | https://ww2.health.wa.gov.au/Articles/A_E/Coronavirus/COVID19-statistics |  |
|  | AU-NSW | NSW Government Health | https://www.health.nsw.gov.au/_layouts/feed.aspx?xsl=1&web=/news&page=4ac47e14-04a9-4016-b501-65a23280e841&wp=baabf81e-a904-44f1-8d59-5f6d56519965&pageurl=/news/Pages/rss-nsw-health.aspx |  |
|  | AU-NT | Northern Territory Government | https://coronavirus.nt.gov.au/ |  |
|  | AU-QLD | QLD Government Health | https://www.health.qld.gov.au/news-events/doh-media-releases |  |
|  | AU-SA | SA Health | https://www.sahealth.sa.gov.au/wps/wcm/connect/public+content/sa+health+internet/health+topics/health+topics+a+-+z/covid+2019/latest+updates/confirmed+and+suspected+cases+of+covid-19+in+south+australia |  |
|  | AU-VIC | Victoria State Government Health and Human Services | https://www.dhhs.vic.gov.au/media-hub-coronavirus-disease-covid-19 |  |
|  | AU | Australian Government Department of Health | https://www.health.gov.au/news/health-alerts/novel-coronavirus-2019-ncov-health-alert/coronavirus-covid-19-current-situation-and-case-numbers |  |
|  | ID | Ministry of Health Republic of Indonesia | https://www.kemkes.go.id |  |
|  | IN |  | https://www.mohfw.gov.in/ |  |
|  | KR | Ministry of Health and Welfare | http://ncov.mohw.go.kr/en/bdBoardList.do?brdId=16&brdGubun=162&dataGubun=&ncvContSeq=&contSeq=&board_id= |  |
|  | NZ | New Zealand Government Ministry of Health | https://www.health.govt.nz/our-work/diseases-and-conditions/covid-19-novel-coronavirus/covid-19-current-situation/covid-19-current-cases |  |
|  | RU | Rospotrebnadzor | https://yandex.ru/maps/api/covid?csrfToken= |  |
|  | SA |  | https://datasource.kapsarc.org/explore/dataset/saudi-arabia-coronavirus-disease-covid-19-situation/download/?format=csv&disjunctive.daily_cumulative=true&disjunctive.region=true&refine.daily_cumulative=Daily&timezone=America/Los_Angeles&lang=en&csv_separator=%2C |  |
|  | TH | Department of Disease Control Thailand | https://ddc.moph.go.th/viralpneumonia/eng/index.php |  |

Europe

| Status | ISO Code | Name | URL | Notes |
| --- | --- | --- | --- | --- |
|  | AT | Austrian Ministry of Health | https://info.gesundheitsministerium.at |  |
|  | AT | COVID-19/SARS-COV-2 Cases in EU | https://github.com/covid19-eu-zh/covid19-eu-data |  |
|  | BE | Sciensano | https://epistat.wiv-isp.be/covid/ |  |
|  | CH | covid19-cases-switzerland | https://github.com/daenuprobst/covid19-cases-switzerland/ |  |
| 🐛 | CY | Official website for Cyprus Open Data | https://www.data.gov.cy/sites/default/files/CY%20Covid19%20Daily%20Statistics_6.csv |  |
|  | CZ | Ministry of Health of the Czech Republic | https://onemocneni-aktualne.mzcr.cz/ |  |
|  | DE | Dr. Jan-Philip Gehrcke | https://raw.githubusercontent.com/jgehrcke/covid-19-germany-gae/master/data.csv |  |
|  | EE | Estonia Health and Welfare Information Systems Center | https://opendata.digilugu.ee/opendata_covid19_test_results.csv |  |
|  | ES | Antonio Delgado | https://github.com/datadista/datasets/tree/master/COVID%2019 |  |
|  | FI | THL | https://media-koronatilanne.hub.arcgis.com/datasets/bae4e3d772534e58998a9e4ff5c0bf7e_0/data?geometry=-37.220%2C58.184%2C87.936%2C71.082&orderBy=sairaanhoitopiiri&showData=true | Reports health district data |
|  | FR | Santé publique France | https://www.data.gouv.fr/fr/organizations/sante-publique-france/datasets-resources.csv |  |
|  | GB | GOV.UK | https://coronavirus.data.gov.uk/ |  |
| 🐛 | GB-SCT |  | https://www.gov.scot/coronavirus-covid-19/ |  |
|  | HR | Koronavirus.hr | https://www.koronavirus.hr/ | Reports each region in individual pages |
| 🏗️ covidatlas/coronadatascraper#660 | HU | Koronamonitor | https://docs.google.com/spreadsheets/d/1e4VEZL1xvsALoOIq9V2SQuICeQrT5MtWfBm32ad7i8Q/edit#gid=311133316 |  |
|  | IE | Ireland Open Data Portal | http://opendata-geohive.hub.arcgis.com/datasets/d9be85b30d7748b5b7c09450b8aede63_0.csv?outSR={"latestWkid":3857,"wkid":102100} |  |
|  | IT | pcm-dpc/COVID-19 | https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-regioni/dpc-covid19-ita-regioni.csv |  |
|  | IS | Ministry of Health of Iceland | https://www.covid.is/data | All the actual data is here: https://e.infogram.com/e3205e42-19b3-4e3a-a452-84192884450d?src=embed |
|  | LT | Ministry of Health of the Republic of Lithuania | https://services.arcgis.com/XdDVrnFqA9CT3JgB/arcgis/rest/services/covid_locations/FeatureServer/0/query |  |
|  | LV | Latvia Management of the Center for Disease Prevention and Control | https://services7.arcgis.com/g8j6ESLxQjUogx9p/arcgis/rest/services/Latvia_covid_novadi/FeatureServer/0/query |  |
|  | NL | CoronaWatchNL | https://github.com/J535D165/CoronaWatchNL |  |
|  | NO | vg.no | https://www.vg.no/spesial/2020/corona/ | Has a hidden API which could be scraped; news organization |
|  | PL | covid19-eu-zh | https://github.com/covid19-eu-zh/covid19-eu-data |  |
|  | PL | Ministry of Health of the Republic of Poland | https://www.gov.pl/web/koronawirus/wykaz-zarazen-koronawirusem-sars-cov-2 |  |
|  | PT | dssg-pt/covid19pt-data | https://github.com/dssg-pt/covid19pt-data | Could be low quality |
|  | RO | geo-spatial.org | https://covid19.geo-spatial.org/despre | Provides an API |
|  | RS | Serbia Open Data | https://covid19.data.gov.rs/?locale=en |  |
|  | SE | Public Health Agency of Sweden | https://services5.arcgis.com/fsYDFeRKu1hELJJs/arcgis/rest/services/FOHM_Covid_19_FME_1/FeatureServer/1/query |  |
|  | SI | COVID-19 Sledilnik | https://raw.githubusercontent.com/slo-covid-19/data/master/csv/stats.csv |  |
|  | UA | National Security and Defense Council of Ukraine | https://api-covid19.rnbo.gov.ua/data?to= | Append YYYY-MM-DD to the URL |

North/South America

| Status | ISO Code | Name | URL | Notes |
| --- | --- | --- | --- | --- |
|  | BR | Secretaria de Vigilância em Saúde do Ministério da Saúde | https://covid.saude.gov.br/ |  |
|  | CA | Public Health Agency of Canada | https://health-infobase.canada.ca/src/data/covidLive/covid19.csv |  |
|  | CA |  | https://resources-covid19canada.hub.arcgis.com/app/82e586188b7049e1896b771cd4875815 | Provides data at the health district level |
| 🏗️ covidatlas/coronadatascraper#788 | CA-NS | Government of Nova Scotia | https://novascotia.ca/coronavirus/data/COVID-19-data.csv |  |
|  | GT | Ministry of Health of Guatemala | https://www.mspas.gob.gt/index.php/noticias/coronavirus-2019-ncov |  |
|  | PR | Gobierno de Puerto Rico Departamento de Salud | http://www.salud.gov.pr/Pages/coronavirus.aspx |  |
|  | SV | Ministry of Health of El Salvador | https://covid19.gob.sv/ |  |
|  | VI | United States Virgin Islands Department of Health | https://doh.vi.gov/covid19usvi |  |

parse.number('') should return null, not 0

Description

See title.

Steps to reproduce

  1. parse.number('')
  2. It's 0!

Expected behavior

It's null?

Additional context

This is a tough one. It seems like it should return null, which will cause a validation error, and if the scraper author wants to return zero for empty string, they can explicitly do: parse.number(parse.string(whatever) || 0)
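A minimal sketch of the proposed behavior. This is not the project's actual `parse` module, just an illustration of the null-on-empty semantics described above:

```javascript
// Sketch of the proposed parse.number semantics: an empty (or unparseable)
// string yields null so validation can flag it; scrapers that really mean
// zero must opt in explicitly, e.g. parse.number(str) || 0.
const parse = {
  number (value) {
    const str = String(value).trim()
    if (str === '') return null
    const n = Number(str.replace(/,/g, '')) // tolerate thousands separators
    return Number.isNaN(n) ? null : n
  }
}

console.log(parse.number(''))      // null
console.log(parse.number('1,485')) // 1485
console.log(parse.number('0'))     // 0
```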

Scraper for St. Louis County, MO

Looks like the St. Louis County data on the MO health site is lagging behind. More accurate data is being linked on the county page:
STL County Covid-19 Home: https://stlouisco.com/Your-Government/County-Executive/COVID-19
Arc Map: https://stlcogis.maps.arcgis.com/apps/MapSeries/index.html?appid=6ae65dea4d804f2ea4f5d8ba79e96df1

@paulboal I noticed you are the maintainer of the MO scraper, so I wanted to bring this to your attention. I'm going to update my forked repo, and I'll happily create a PR when I'm finished if this source benefits you all as well.

Scraper for Argentina

https://www.argentina.gob.ar/coronavirus/informe-diario

This is from the federal government. They publish two PDFs per day: "Vespertino" = evening, "Matutino" = morning. They're probably meeting minutes.

Pros:

  • They maintain previous days' files online (but we should start caching just in case).
  • Later PDFs have cases tabulated by province.

Cons:

  • PDFs
  • Inconsistent filenames (must parse HTML links to get PDFs)
  • Additional information / data in paragraph form

API design

Exact scope on the API is still coming into view; this issue should be for discussing and designing the 1.0 API.

Aggregate is state though county data is available for DC, VT, NV

For DC, VT, and NV, aggregate is set to state in locations.json even though county-level data is present in the dataset.

See also covidatlas/coronadatascraper#264 and covidatlas/coronadatascraper#312; it seems locations.json was not updated.

To my understanding, aggregate matches the type of record (country, state, county, city) when that is the lowest available level of data. If aggregate is county on a state record, that record is aggregated county data.
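Under that reading, the two kinds of record would look roughly like this (field values invented for illustration):

```javascript
// A state-level row whose numbers were built by summing county rows:
// aggregate names the level that was rolled up.
const stateRecord = {
  country: 'USA',
  state: 'VT',
  aggregate: 'county',
  cases: 123
}

// A county row at the lowest available level carries no aggregate flag.
const countyRecord = {
  country: 'USA',
  state: 'VT',
  county: 'Chittenden County',
  cases: 45
}

console.log(stateRecord.aggregate) // county
```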

Ability for scraper to push errors down the pipeline without throwing

A couple ways this could work:

  1. Scraper called with errors array as an argument that can be pushed to (feels weird man)
  2. Scraper called with this = { errors: [] } (breaks a lot of scrapers)
  3. Scraper called with region array as an argument you can push data to, can throw at any time (i.e. throw at the end of the scraper to indicate a non-fatal error)
  4. Other ways?

Add populations to Wikidata

Is there anyone who has the right to edit Wikidata items for counties? Basically this requires 50+ edits on Wikidata, at which point the account becomes "autoconfirmed".

Right now I have that level, but it's quite tedious to fix all populations alone and I'd be happy if someone could help me.

Some of the locations are less important and anyone can edit them, like this one in Panama:
https://www.wikidata.org/wiki/Q217138

Others are in the "top 3000" items, and only autoconfirmed accounts can edit them. But editing the less important items first would let someone reach the autoconfirmed level.

So who would like to help by entering population information?

We need to add the missing populations in:

  • Slovenia
  • Ireland
  • Poland
  • Lithuania
  • South Korea
  • Panama
  • Sebastopol, Russia

Add scraper for MEX

Location name

Mexico

Source URL

https://coronavirus.gob.mx/, from the federal government's ministry of health.

Notes/comments

This is more of a bookmark than anything else. Just caching this will be difficult, as it seems to be nested more deeply than Argentina's source.

At the bottom of the site there are some videos/links in some sort of auto-scrolling frame. They appear to hold a press conference each day, and each one gets a page, e.g. April 4. URLs for those appear easy to generate:

https://coronavirus.gob.mx/YYYY/MM/DD/conferencia-D-de-mmm/

where mmm is the full month name in Spanish, all lower case.

For example, I spot checked March 4th and it exists:
https://coronavirus.gob.mx/2020/03/04/conferencia-4-de-marzo/
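A small sketch of generating those URLs from a date. The pattern is inferred from the March 4 spot check above and has not been verified across all dates:

```javascript
// Spanish month names, lowercase, as used in the URL slug.
const MESES = [ 'enero', 'febrero', 'marzo', 'abril', 'mayo', 'junio',
  'julio', 'agosto', 'septiembre', 'octubre', 'noviembre', 'diciembre' ]

// Build the press-conference page URL for an ISO date (YYYY-MM-DD).
// Note the day in the slug is unpadded while the path segments are padded.
function conferenceUrl (isoDate) {
  const [ year, month, day ] = isoDate.split('-')
  const mes = MESES[Number(month) - 1]
  return `https://coronavirus.gob.mx/${year}/${month}/${day}/conferencia-${Number(day)}-de-${mes}/`
}

console.log(conferenceUrl('2020-03-04'))
// https://coronavirus.gob.mx/2020/03/04/conferencia-4-de-marzo/
```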

Each of the press conference pages links to a PDF with a link whose text is "Comunicado técnico". URLs for those PDFs seem pretty consistent except for one number I can't decipher. e.g.

https://www.gob.mx/cms/uploads/attachment/file/538947/Comunicado_Tecnico_Diario_COVID-19_2020.03.04.pdf
https://www.gob.mx/cms/uploads/attachment/file/545219/Comunicado_Tecnico_Diario_COVID-19_2020.04.03.pdf
https://www.gob.mx/cms/uploads/attachment/file/545266/Comunicado_Tecnico_Diario_COVID-19_2020.04.04.pdf

The content of the PDF can apparently change. I can't imagine doing anything but manual data entry on this one. Currently our source for Mexico is https://github.com/CSSEGISandData/COVID-19, but at least the more recent PDFs here have death counts per state (though not case counts).

Death and recovered data issue

US - New York: death and recovered cases have null values.

E.g., the number of deaths in New York is high, but it is reported here as zero.

Issue details

Data correction needed for death and recovered cases.

NYC numbers

The total for New York City isn't matching the sum of the 5 counties for 3/25. Is this because the city/county sources are different?

Support data source pagination

Description

In coronadatascraper, PR covidatlas/coronadatascraper#835 provides support for ArcGIS data pagination. Some JSON result sets are too big to return in a single response, so requests need to manage that. Presumably, similar to the GitHub API, the server provides a "nextResultSet" token (or similar) in the response, and clients can re-query with that token.

We'd need to manage that for both crawls and scrapes. Presumably this could be managed with lambdas, but the cache file naming convention will need to be page-aware, and return all files.

Describe the solution you'd like

One possibility: include the page number, indexed from zero, after the cache key (or name), e.g. <datetime>-<name>-<page>-<sha>.<ext>.gz. If there is only one page (which will be true in most cases), page would be 0, there won't be any other data sets, and the thing passed to scrape would just be the content.
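A minimal sketch of that naming scheme. The `datetime`, `sha`, and `ext` pieces are placeholders computed elsewhere in the pipeline; this only shows how the page index would slot in:

```javascript
// Build a page-aware cache file name: <datetime>-<name>-<page>-<sha>.<ext>.gz
// Page defaults to 0, covering the common single-page case.
function cacheFileName ({ datetime, name, page = 0, sha, ext }) {
  return `${datetime}-${name}-${page}-${sha}.${ext}.gz`
}

console.log(cacheFileName({
  datetime: '2020-04-10t21_00_00.000z',
  name: 'default',
  sha: 'abc123',
  ext: 'json'
}))
// 2020-04-10t21_00_00.000z-default-0-abc123.json.gz
```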

Aggregate column is not marked correctly for all countries

e.g. AUS has one row for the entire country marked "state", which should be "country", and six rows for Australian Capital Territory, New South Wales, Northern Territory, Queensland, South Australia, and Victoria, which are blank but should be marked "state".

Add scraper for Colombia

Location name

Colombia (COL)

Source URL

http://www.ins.gov.co/Noticias/Paginas/Coronavirus.aspx
(National Institute of Health). But see below.

Notes/comments

The source URL above has a bunch of Infograms embedded. Each one can be opened in a tab, and then you can snoop the data sources using Chrome's network inspector.

Summary data

https://infogram.com/api/live/flex/5eb73bf0-6714-4bac-87cc-9ef0613bf697/c9a25571-e7c5-43c6-a7ac-d834a3b5e872?

The data is in an array of HTML chunks, e.g.:

[
"<font face=\"Montserrat, sans-serif\" color=\"#ed1e79\" style=\"font-size: 22px;\"><b>1.485</b></font>",
"<font face=\"Montserrat, sans-serif\" color=\"\" style=\"font-size: 13px;\">Casos <b>Confirmados en Colombia</b></font>",
"boyPath"
],

Shows 1,485 confirmed cases.

Number of cases by "departamento" (state)

https://infogram.com/api/live/flex/5e0d85ae-48a4-4899-a679-5ee9aab66d4b/266e0a29-b843-4891-9da4-12325531507b?

Status of positive cases (e.g. hospitalized, deceased, etc.)

https://infogram.com/api/live/flex/de2e4d7c-f649-409e-a874-a7f3f6033ef1/f9098f49-e26a-4843-8291-e78cb0d9aef0?

Breakdown by gender and age

https://infogram.com/api/live/flex/de2e4d7c-f649-409e-a874-a7f3f6033ef1/406f17bb-9a08-4b76-9984-63941d87a790?

List of cases

https://infogram.com/api/live/flex/bc384047-e71c-47d9-b606-1eb6a29962e3/664bc407-2569-4ab8-b7fb-9deb668ddb7a?

This is a table structured as an array of rows. The header row is:
"ID de caso" - case ID
"Fecha de diagnóstico" - date of diagnosis
"Ciudad de ubicación" - city
"Departamento o Distrito" - state or district (assuming that's a county)
"Atención**" - status. They note that "recuperado" (recovered) requires two negative tests.
"Edad" - age
"Sexo" - gender
"Tipo*" - type of case. "Importado" (which they define as having come from a country with confirmed COVID-19 cases) or "relacionado" (confirmed to have had contact with someone who has COVID-19)
"País de procedencia" - Country considered the source of the infection for this patient

Status can be:
"casa" - self-quarantining at home (I'm assuming here based on what I've seen in other Latin American countries.
"fallecido" - deceased
"recuperado" - recovered; requires two negative tests to confirm.
"hospital" - hospitalized
"hospital UCI" - intensive care

Time series and test data

https://infogram.com/api/live/flex/bc384047-e71c-47d9-b606-1eb6a29962e3/523ca417-2781-47f0-87e8-1ccc2d5c2839?

One series is total cases, deaths, and recoveries, the other one is a weekly count of tests processed and test backlog.

Additional sources

I also found some open sources in the ArcGIS hub: https://hub.arcgis.com/search?categories=covid-19&collection=Dataset

You can get JSONs out of all of these.

The license on each of these implies that they are from the same government entity as the Infograms above.

There are different dataset hashes but evidently choosing which data you want is only a function of the number after the underscore.

Source of cases

https://hub.arcgis.com/datasets/esri-colombia::colombia-covid19-coronavirus-procedencia-de-los-casos/data?selectedAttribute=CASOS
CSV: https://opendata.arcgis.com/datasets/3a505d6969c149f98b122fb0a6fd1e7e_4.csv

Number of confirmed cases by state

https://hub.arcgis.com/datasets/esri-colombia::colombia-covid19-coronavirus-departamento/data
CSV: https://opendata.arcgis.com/datasets/ed48c4ce9ca94d5499f1c327f8f532f1_1.csv

Cases by municipality

https://hub.arcgis.com/datasets/esri-colombia::colombia-covid19-coronavirus-municipio/data
CSV: https://opendata.arcgis.com/datasets/53beb24d21f146c38a42db63c92e3460_0.csv

This is the one we want; includes population, population density, total cases, total active cases, total deaths, and total recovered.

Case details

https://hub.arcgis.com/datasets/esri-colombia::colombia-covid19-coronavirus-detalle-de-los-casos/data
CSV: https://opendata.arcgis.com/datasets/0e14099fac45422896d50bd52292faea_3.csv

Time series

For the country as a whole; includes new/total cases, deaths, and recoveries.
https://hub.arcgis.com/datasets/esri-colombia::colombia-covid19-coronavirus-casos-diarios/data
CSV: https://opendata.arcgis.com/datasets/782122624f364fbdbd7e287b96c4a358_6.csv

Extend JHU stateMap with more islands

The official JHU ISO mapping file didn't contain the right ISO codes for some islands.

The following need to be added to stateMap:

  • Anguilla
  • Aruba
  • Bermuda
  • Bonaire, Sint Eustatius and Saba
  • British Virgin Islands
  • Cayman Islands
  • Channel Islands
  • Curacao
  • Diamond Princess
  • Falkland Islands (Malvinas)
  • Faroe Islands
  • French Guiana
  • French Polynesia
  • Gibraltar
  • Grand Princess
  • Guadeloupe
  • Isle of Man
  • Martinique
  • Mayotte
  • Montserrat
  • New Caledonia
  • Recovered
  • Reunion
  • Saint Barthelemy
  • Saint Pierre and Miquelon
  • Sint Maarten
  • St Martin
  • Turks and Caicos Islands

You can get this output by running yarn start -l JHU.

Task:

  1. Look up the iso codes for the above islands from this list: https://github.com/hyperknot/country-levels/blob/master/docs/iso1_list.md
  2. Add them to the stateMap here:
    https://github.com/covidatlas/coronadatascraper/blob/fbc0414eece55acb2fa6926d9d8cd051baac4877/src/shared/scrapers/JHU.js#L9-L13

For combined areas like Channel Islands, write the value as a list of strings, like ['JE', 'GG'].

For ships, like Diamond Princess, put -.
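A sketch of what the additions might look like. The ISO codes below are taken from the iso1 list linked above but should be verified against it before committing; this is a partial, illustrative subset, not the full mapping:

```javascript
// Hypothetical stateMap additions. Plain strings are ISO 3166-1 codes;
// combined areas become a list of codes, and ships get '-'.
const stateMap = {
  Anguilla: 'AI',
  Aruba: 'AW',
  Bermuda: 'BM',
  'Cayman Islands': 'KY',
  'Channel Islands': [ 'JE', 'GG' ], // combined area: Jersey + Guernsey
  Curacao: 'CW',
  'Diamond Princess': '-',           // ship, no ISO code
  'Faroe Islands': 'FO',
  Gibraltar: 'GI',
  'Isle of Man': 'IM'
}

console.log(stateMap['Channel Islands']) // [ 'JE', 'GG' ]
```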

Scraper for Iceland (ISL)

The Directorate of Health and The Department of Civil Protection and Emergency Management (Government of Iceland)

Meets our minimum requirements for sources. Provides:

  1. Tests per day (timeseries since 2/27)
  2. Number of new infections per day (since 2/28)
  3. Percentage of diagnoses during quarantine (since 2/28)
  4. Snapshots of: gender split, origin of infection, infections and quarantines by region, age distribution, number of confirmed cases, number in isolation, number hospitalized, number in intensive care, number recovered, number in quarantine, number out of quarantine, number of tests.
  5. Deaths (4 as of this writing) are given in a short paragraph, with the number spelled out in words.

Location name: Iceland

URL:
The data page has charts and links to CSVs (note the language selection at top right; the orange button is the cookie acceptance). It appears to simply embed this:
https://e.infogram.com/e3205e42-19b3-4e3a-a452-84192884450d

Beneath each chart there's a link to a CSV, but it can't simply be copied and pasted (this is some sort of Tableau-type embed, I think).

Validate generated reports with schema, report schema changes

Description

We generate reports, and downstream consumers are affected by data format changes (e.g. new fields; see Slack note). Changing the schema impacts them, which may reduce traction for us as well!

If we have a schema and versioning, we can validate and report changes. This could be an automated script; it shouldn't require too much handholding.

Why do you need this feature or component?

  • Good policy :-)
  • Helps consumers

Additional context/notes

Feature: Add "caveats" for scrapers

Description

In some scrapers, we're making justifiable assumptions about how to interpret the data (e.g., covidatlas/coronadatascraper#572 - KOR quarantines). For scrapers, we could hardcode these caveats in the scrapers, and perhaps include them in the source output, e.g.:

[
  {
    "county": "Los Angeles County",
    "state": "California",
    "country": "United States",
...
    "url": "http://www.publichealth.lacounty.gov/media/Coronavirus/",
    "cases": 0,
    "deaths": 0,
    "caveats": [
        "some_data_here"
   ],
...
  }
]

Perhaps these assumptions could be rolled up to the higher levels:

    "caveats": [
        "LA, CA: some_data_here",
        "PA: penn. caveats here"
   ]

Why do you need this feature or component?

Publicize assumptions

Notes

For testing/regression, I don't think we'd need to check the caveats field, as it might change over time. One sanity check would be enough.

Granular data files

Description

I'm frustrated that my Pandemic Estimator takes a long time fetching the whole dataset, while I display only one location at a time.

Describe the solution you'd like

I'd like an endpoint for a single location. Instead of https://coronadatascraper.com/timeseries-byLocation.json, something like:

  • https://coronadatascraper.com/timeseries/location/meta.json
  • https://coronadatascraper.com/timeseries/location/france.json
  • https://coronadatascraper.com/timeseries/location/france/normandie.json
  • ...

Where meta.json would contain a subset of what's in timeseries-byLocation.json, without the dates prop, so that I can populate the dropdown/autofill for the user to select the necessary location.

Describe alternatives you've considered

I've considered splitting timeseries-byLocation.json as part of the dashboard logic, but that's a poor idea; it's a lot better to do it as part of this repo so everyone benefits from it.

Notes

Also, I'd love to have aggregation beyond the country level: by region and world total. If you wish to provide that, it should be taken into account when organizing the API paths.

Check if count of things scraped goes down

A scraper should presumably produce the same number of states/counties, and the same number of data points, day to day. We could warn if this goes down, and maybe if it goes up too (though countries may have an incomplete table while some states have no cases yet?).
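A minimal sketch of the drop check. The baseline (yesterday's counts) would come from the previous run's output; the count keys here are illustrative:

```javascript
// Compare today's counts against yesterday's and collect warnings for
// any metric that fell. Rising counts are allowed by default.
function countWarnings (yesterday, today) {
  const warnings = []
  for (const [ key, prev ] of Object.entries(yesterday)) {
    const curr = today[key] || 0
    if (curr < prev) warnings.push(`${key}: count fell from ${prev} to ${curr}`)
  }
  return warnings
}

console.log(countWarnings({ counties: 58, dataPoints: 230 },
                          { counties: 57, dataPoints: 235 }))
// [ 'counties: count fell from 58 to 57' ]
```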

Cache-only scraper

Description

Develop a scraper that can ingest a source list and do nothing with it other than cache.

Why do you need this feature or component?

This would allow non-technical volunteers to vet and contribute sources from around the world so that we can start caching them. Many sources don't have time series data so it's a "race against time" if we want to eventually have temporal data for everything.

Additional context

As per @chunder's suggestion, I started a spreadsheet (WIP) that this scraper would draw from.

End to end integration tests for `sources`

All source scrapers (both crawl and scrape) should be subject to end-to-end integration tests, wherein both are exercised against the live cache or the internet.

Crawl: if a function, should execute and return a valid url or object containing { url, cookie }

Scrape: should load out of the live production cache and return a well-formed result.

If the cache misses, the integration test runner can invoke a crawl for that source and write it to disk locally to complete the test.

AK, US: Data Inconsistency

  • Data missing on 03-23
  • 04-01 data repeated on 04-02
  • 04-03 data significantly lower than 04-02 (04-01)
  • 04-03 data repeated on 04-04 and 04-05
  • 04-06 data missing

The 04-03 data drop is possibly related to the 03-27 data: same case number value.

The same source is used throughout.

timeseries-jhu-ak.xlsx

Translate Japanese patient status

Japan's prefecture-level data appears to be a list of patients. If we get the status column translated, we can probably get more than just cases. (The screenshot of the patient list is not reproduced here.)

`source` failure reporting

Exact scope on source failure reporting is not yet entirely clear; this issue should be for discussion and scoping.

When a source suddenly stops reporting data, desired outcomes include:

  • Updating some dataset somewhere that makes this visible on a dashboard
  • Possibly alerting slack
  • Possibly alerting the maintainer directly

use NYT states dataset for state counts

Currently we are summing state counts from county counts in the NYT dataset. The numbers do not match their state counts.

We should use their state level dataset and not sum it up ourselves.

Add S3 integration

In staging + production:

  • crawler should write cache data to S3
  • scraper should read cache data from S3

Locally:

  • crawler should write cache data locally
  • scraper should attempt to read cache data from S3, and fall back to local data sources
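The read path with fallback can be sketched as below. The `s3Read` and `localRead` functions are injected stand-ins for the real S3 and filesystem readers (not actual project APIs):

```javascript
// Try the S3 cache first; on any failure, fall back to the local cache.
// Readers are injected so the same logic works in staging and locally.
async function readCache (key, { s3Read, localRead }) {
  try {
    return await s3Read(key)
  }
  catch (err) {
    return localRead(key) // e.g. read from the local cache directory
  }
}
```

Locally, `s3Read` would hit the production bucket and `localRead` the on-disk cache; in staging and production, the local fallback could simply throw.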

USA Independent Cities (County -> City)

We should look at migrating independent cities from being listed as counties to being listed as cities at some point. This includes but is not limited to: Baltimore City, St. Louis City, and some 38 cities in VA.

Establish reasonable upper bounds to warn at

Not sure of the value of this, but we could look at the day-on-day multiplier and make sure it's under some threshold. I guess look at historical data and add some padding. E.g., if cases go up 10x in one day, there's potentially some weird scraping.
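A tiny sketch of that check. The 10x threshold is illustrative; a real bound would be tuned against historical data as suggested above:

```javascript
// Flag a day-on-day case jump above maxMultiplier as suspicious.
// A zero/undefined baseline yields no multiplier, so it is not flagged here.
function suspiciousGrowth (yesterdayCases, todayCases, maxMultiplier = 10) {
  if (!yesterdayCases) return false
  return todayCases / yesterdayCases > maxMultiplier
}

console.log(suspiciousGrowth(100, 150))  // false
console.log(suspiciousGrowth(100, 1500)) // true
```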

Generalize parsing patterns

Many patterns are starting to emerge in the way data is stored and parsed, such as tables with left-hand labels, etc.

Though it's likely that many of these will still have to be case-by-case, they should be generalized if at all possible in lib/parse.js as a configurable function.

Port CDS `scrapers` to Li `sources`

In Li, what used to be called scrapers are now called sources, and they live in src/shared/sources.

The shape has changed, but the core scraper logic should largely remain the same. The new source shape needs docs. All sources have a simple unit test validation pass prior to commit (see below).

Example sources

Current shape validated here

Migration status

npm run migration:status gives a report:

MacBook-Air:li jeff$ npm run migration:status

> [email protected] migration:status /Users/jeff/Documents/Projects/li
> node tools/report-migration-status.js

Getting commits in /Users/jeff/Documents/Projects/coronadatascraper/src/shared/scrapers for 156 files from covidatlas/coronadatascraper.git/master ...
... done.
Getting commits in /Users/jeff/Documents/Projects/li/src/shared/sources for 10 files from covidatlas/li.git/master ...
... done.


========================================

key                           CDS path                         li?  up-to-date?
---                           --------                         ---  -----------

DONE (7)
--------
au                            AU/index.js                      yes  yes 
gb-sct                        GB/SCT/index.js                  yes  yes 
in                            IN/index.js                      yes  yes 
...

NEEDS UPDATING (0)
------------------

REMAINING (149)
---------------
at                            AT/index.js                      no   -   
au-act                        AU/ACT/index.js                  no   -   
...

Notes

  • Small supporting datasets are ok (example: https://github.com/covidatlas/li/blob/master/src/shared/sources/nl/mapping.json), but mocks/fixtures etc. should not come over to src/shared/sources
  • If large additional vendored datasets are needed for geo, please let me know and we'll figure out the best approach
  • Helper functions: thus far, in the interest of keeping things as light and tidy as possible, I'm only bringing over helper functions as needed, on a case-by-case basis
    • Please be judicious about what helpers you're proposing porting over
    • What you do port, please ensure it's tested
  • Each scrape function should be sync; an async scrape function is likely an antipattern, and should be justifiable
  • scrape functions should not call to the internet for anything; if they need to, please let me know and we'll figure out how to make that generic
  • scrape functions should not have unique external dependencies; again, if they need to, please let me know and we'll figure out how to make that generic

We may want to run our existing scrapers through a script to parse, move things around, and output with something like escodegen; if so, please do not put those files into src/shared/sources – that is the production sources directory, and only known (or expected)-working sources should live there.
