
datasets / covid-19


Novel Coronavirus 2019 time series data on cases

Home Page: https://datahub.io/core/covid-19

Python 100.00%
coronavirus coronavirus-disease covid datapackage data-package covid-19 covid19-data dataset

covid-19's People

Contributors

actions-user, anuveyatsu, aravindnair430, jochym, kant, krunal-darji, morisset, nirabpudasaini, pidugusundeep, rufuspollock, trevorwinstral, weileizeng, zelima


covid-19's Issues

Executing process.py on 3/11/2020 gets ValidationError

Here's the traceback:

Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/dataflows/base/schema_validator.py", line 49, in schema_validator
row[f.name] = f.cast_value(row.get(f.name))
File "/usr/local/lib/python3.7/site-packages/tableschema/field.py", line 149, in cast_value
).format(field=self, value=value))
datapackage.exceptions.CastError: Field "Deaths" can't cast value "None" for type "number" with format "default"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "process.py", line 60, in
dump_to_path()
File "/usr/local/lib/python3.7/site-packages/dataflows/base/flow.py", line 12, in results
return self._chain().results(on_error=on_error)
File "/usr/local/lib/python3.7/site-packages/dataflows/base/datastream_processor.py", line 96, in results
for res in ds.res_iter
File "/usr/local/lib/python3.7/site-packages/dataflows/base/datastream_processor.py", line 96, in
for res in ds.res_iter
File "/usr/local/lib/python3.7/site-packages/dataflows/base/schema_validator.py", line 46, in schema_validator
for i, row in enumerate(iterator):
File "/usr/local/lib/python3.7/site-packages/dataflows/processors/dumpers/dumper_base.py", line 69, in row_counter
for row in iterator:
File "/usr/local/lib/python3.7/site-packages/dataflows/processors/dumpers/file_dumper.py", line 76, in rows_processor
for row in resource:
File "/usr/local/lib/python3.7/site-packages/dataflows/base/schema_validator.py", line 51, in schema_validator
if not on_error(resource['name'], row, i, e):
File "/usr/local/lib/python3.7/site-packages/dataflows/base/schema_validator.py", line 22, in raise_exception
raise ValidationError(res_name, row, i, e)
dataflows.base.schema_validator.ValidationError:
ROW: {'Date': datetime.date(2020, 3, 11), 'Province/State': 'Anhui', 'Country/Region': 'Mainland China', 'Lat': Decimal('31.8257'), 'Long': Decimal('117.2264'), 'Confirmed': None, 'Recovered': None, 'Deaths': 'None'}
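The failing row shows Deaths arriving as the literal string 'None' rather than a real missing value, which the number cast then rejects. A minimal pre-processing sketch (the helper name and field list are illustrative, not part of process.py) that coerces such strings before schema validation:

```python
def clean_row(row, numeric_fields=("Confirmed", "Recovered", "Deaths")):
    """Coerce literal 'None' strings (and empty strings) in numeric fields
    to real None so the schema validator treats them as missing values."""
    for field in numeric_fields:
        value = row.get(field)
        if isinstance(value, str) and value.strip() in ("None", ""):
            row[field] = None
    return row
```

Running rows through a step like this ahead of the schema validator would avoid the CastError without altering genuinely numeric values.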

Confirmed Cases missing

The current dataset on 4/7/2020 shows 0 cases for North Dakota, and the overall total for 4/6/2020 is off by about 140k.

FAQs (WIP)

Why this dataset? (After all, the authoritative one is elsewhere.)

Ans: well-structured data, packaged as a Data Package so you have tools to ingest it into your system of choice, and reliably kept up to date ...

Why this dashboard? After all, there are many others.

We provide a dashboard that is simple and well designed, but primarily because it is open source and easy for others to reuse.

Who's behind this?

@rufuspollock and colleagues at @datopian who have worked in #opendata and #opensource and #datasets for many years.

State Data Missing for US

I imported this file yesterday and it included state data for US - when I refreshed this morning, the data is now missing.

  • time-series-19-covid-combined_csv.csv

[optimization] Move longitude and latitude data to a separate CSV

As a user of the covid-19 data, I want the latitude and longitude data in a separate CSV file from the other data, so that it optimizes the use of the data by cutting down the file sizes, loading times, etc.

Acceptance criteria

  • Latitude and longitude data is moved to a separate CSV file
  • A new datapackage.json is created for the new CSV
  • A new visualization is created for it
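A sketch of the proposed split, assuming the combined rows carry the Province/State, Country/Region, Lat, and Long columns shown elsewhere in this repo; file writing is left out and the function name is illustrative:

```python
def split_latlon(combined_rows):
    """Split rows into (data_rows, latlon_rows): latitude and longitude
    are factored out into one lookup row per Province/State + Country/Region,
    so the (much larger) time-series file no longer repeats them."""
    latlon = {}
    data = []
    for row in combined_rows:
        key = (row["Province/State"], row["Country/Region"])
        latlon[key] = {"Province/State": key[0], "Country/Region": key[1],
                       "Lat": row["Lat"], "Long": row["Long"]}
        data.append({k: v for k, v in row.items() if k not in ("Lat", "Long")})
    return data, list(latlon.values())
```

Each output list would then be dumped to its own CSV, with the new datapackage.json describing both resources.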

unable to open database file

Hi,
When I try to run it in a Jupyter notebook, I get the following error:

OperationalError: unable to open database file

API based on the Data Package

As a lot of people want to connect from dashboards and get filtered/streaming access to the data, it would be good to also set up an (example) wrapper with API endpoints.

See also https://github.com/Quintessential-SFT/Covid-19-API and https://github.com/dataletsch/panoptikum/blob/master/app.py

Design (from @rufuspollock)

Jobs to be done: I want to get the latest data for my country / region.

url: coronavirus.api.datahub.io

Desired API

GET /country/{name or code} => (in reverse date order)
[ 
 {
  date: 
  confirmed: ...
  deaths: 
 }
]
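The core of that endpoint could be sketched as a plain filter over rows parsed from countries-aggregated.csv (column names assumed from that file; a web framework would wrap this in the GET /country/{name or code} route):

```python
from datetime import date

def country_series(rows, name):
    """Return the records for one country in reverse date order,
    shaped like the desired /country/{name} response."""
    matches = [r for r in rows if r["Country"] == name]
    matches.sort(key=lambda r: r["Date"], reverse=True)
    return [{"date": r["Date"].isoformat(),
             "confirmed": r["Confirmed"],
             "deaths": r["Deaths"]} for r in matches]
```

Code-based lookup (e.g. "IT" as well as "Italy") would only need an alias table in front of the Country comparison.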

API-ifying a Data Package

Can we take inspiration from https://github.com/simonw/datasette?

We have a datapackage.json - let's auto-API-ify it.

e.g. suppose we have a table cases.csv

Country, Date, Value

Each table => a url ...

/cases?field=x

Values => sub-urls

Dimension

Adding an id (??)

/cases/{country}/{date}
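A minimal sketch of the datasette-style idea, assuming only the structure of datapackage.json (resources with names and schema fields); each resource becomes a URL and its fields become the candidate query parameters:

```python
def routes_from_datapackage(datapackage):
    """Derive one URL route per resource in a datapackage.json dict.
    The returned mapping pairs each route with the resource's field
    names, which would become filterable query params, e.g. /cases?field=x."""
    routes = {}
    for resource in datapackage.get("resources", []):
        fields = [f["name"] for f in
                  resource.get("schema", {}).get("fields", [])]
        routes["/" + resource["name"]] = fields
    return routes
```

Sub-URLs per dimension value (/cases/{country}/{date}) could then be generated from whichever fields are flagged as dimensions.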

Wrong Numbers for Spain on 12/March/2020

Data for Spain on 12 March 2020 is wrong; it looks like you accidentally copied the values from 11 March 2020.

Hope you can fix this.

Edit: the file is countries-aggregated.csv

Executing on 3/14/2020 gets ValidationError & CastError

CastError                                 Traceback (most recent call last)
~/.local/lib/python3.8/site-packages/dataflows/base/schema_validator.py in schema_validator(resource, iterator, field_names, on_error)
     48             for f in schema_fields:
---> 49                 row[f.name] = f.cast_value(row.get(f.name))
     50         except CastError as e:

~/.local/lib/python3.8/site-packages/tableschema/field.py in cast_value(self, value, constraints)
    145             if cast_value == config.ERROR:
--> 146                 raise exceptions.CastError((
    147                     'Field "{field.name}" can\'t cast value "{value}" '

CastError: Field "Deaths" can't cast value "None" for type "number" with format "default"
During handling of the above exception, another exception occurred:

ValidationError                           Traceback (most recent call last)
<ipython-input-11-4036c1aa3210> in <module>
     18 extra_value = {'name': 'Case', 'type': 'number'}
     19 
---> 20 Flow(
     21       load(f'{BASE_URL}{CONFIRMED}'),
     22       load(f'{BASE_URL}{RECOVERED}'),

~/.local/lib/python3.8/site-packages/dataflows/base/flow.py in results(self, on_error)
     10 
     11     def results(self, on_error=None):
---> 12         return self._chain().results(on_error=on_error)
     13 
     14     def process(self):

~/.local/lib/python3.8/site-packages/dataflows/base/datastream_processor.py in results(self, on_error)
     92     def results(self, on_error=None):
     93         ds = self._process()
---> 94         results = [
     95             list(schema_validator(res.res, res, on_error=on_error))
     96             for res in ds.res_iter

~/.local/lib/python3.8/site-packages/dataflows/base/datastream_processor.py in <listcomp>(.0)
     93         ds = self._process()
     94         results = [
---> 95             list(schema_validator(res.res, res, on_error=on_error))
     96             for res in ds.res_iter
     97         ]

~/.local/lib/python3.8/site-packages/dataflows/base/schema_validator.py in schema_validator(resource, iterator, field_names, on_error)
     44         field_names = [f.name for f in schema.fields]
     45     schema_fields = [f for f in schema.fields if f.name in field_names]
---> 46     for i, row in enumerate(iterator):
     47         try:
     48             for f in schema_fields:

~/.local/lib/python3.8/site-packages/dataflows/processors/dumpers/dumper_base.py in row_counter(self, resource, iterator)
     67     def row_counter(self, resource, iterator):
     68         counter = 0
---> 69         for row in iterator:
     70             counter += 1
     71             yield row

~/.local/lib/python3.8/site-packages/dataflows/processors/dumpers/file_dumper.py in rows_processor(self, resource, writer, temp_file)
     74 
     75     def rows_processor(self, resource, writer, temp_file):
---> 76         for row in resource:
     77             writer.write_row(row)
     78             yield row

~/.local/lib/python3.8/site-packages/dataflows/base/schema_validator.py in schema_validator(resource, iterator, field_names, on_error)
     49                 row[f.name] = f.cast_value(row.get(f.name))
     50         except CastError as e:
---> 51             if not on_error(resource['name'], row, i, e):
     52                 continue
     53 

~/.local/lib/python3.8/site-packages/dataflows/base/schema_validator.py in raise_exception(res_name, row, i, e)
     20 
     21 def raise_exception(res_name, row, i, e):
---> 22     raise ValidationError(res_name, row, i, e)
     23 
     24 

ValidationError: 
ROW: {'Date': datetime.date(2020, 3, 14), 'Province/State': None, 'Country/Region': 'Thailand', 'Lat': Decimal('15.0'), 'Long': Decimal('101.0'), 'Confirmed': None, 'Recovered': None, 'Deaths': 'None'}
----

Regional granularity

Country-level comparisons are quite limiting; it is difficult to draw meaning about the impact of measures. For instance, mortality and intensive-care cases at country level are under- or over-estimated depending on whether co-morbidities are considered, or after a health system collapses. The statistics are already much more granular for the United States in the Johns Hopkins dataset, for the Italian regions, and for the Swiss cantons. It would be good to build on the work here to go beyond a country ranking.

Inconsistent file formatting

The data files have inconsistent file formatting making it difficult to write code which works on all files.

Header examples: Last Update changes to Last_Update, Confirmed changes to FIPS.

Changes to Country/Region: UK changes to United Kingdom.

Compare files '02-03-2020.csv' to '03-26-2020.csv' for example.
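Until the files themselves are made consistent, a small alias table can normalize the variants on read. This is a sketch only: the tables below cover just the renames mentioned in this issue plus one illustrative header variant, not a complete mapping:

```python
# Alias tables are illustrative, not exhaustive.
HEADER_ALIASES = {
    "Last Update": "Last_Update",
    "Province/State": "Province_State",  # assumed variant, same slash-to-underscore pattern
    "Country/Region": "Country_Region",
}
COUNTRY_ALIASES = {
    "UK": "United Kingdom",
}

def normalize_headers(headers):
    """Map older header spellings onto the newer ones so downstream
    code can address all daily files with one set of column names."""
    return [HEADER_ALIASES.get(h.strip(), h.strip()) for h in headers]

def normalize_country(name):
    """Map renamed country labels onto their current spelling."""
    return COUNTRY_ALIASES.get(name, name)
```

Applying both normalizers right after parsing each daily file would let the same downstream code handle '02-03-2020.csv' and '03-26-2020.csv' alike.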

[workflow] Actions pipeline stuck

Your action workflow seems to have been stuck for the last 10 hours.
Possibly something went wrong at the step Run pip install -r scripts/requirements.txt.


Italy has wrong data for March 23

I was updating my dashboards on https://corona.deleu.dev and I noticed completely flat data for Italy.

The dataset shows

2020-03-21,Italy,,43.0,12.0,53578,6072,4825
2020-03-22,Italy,,43.0,12.0,59138,7024,5476
2020-03-23,Italy,,43.0,12.0,59138,7024,5476

when in reality it should be

2020-03-21,Italy,,43.0,12.0,53578,6072,4825
2020-03-22,Italy,,43.0,12.0,59138,7024,5476
2020-03-23,Italy,,43.0,12.0,63927,7432,6077

effected vs affected

In the intro you say "effect", which is correct. But where you say "effected" it should be "affected".

To effect something means to bring it about; to be affected means to be influenced or changed by something.

Push fixes to upstream repo

Can we try to upstream our changes to the upstream repo? It may be tough as they have a lot of open PRs and a lot of noise right now. We initially planned (back in February) to put in a PR for datapackage.json (and maybe even a refactor of the file structure), but this may be hard now (they are certainly unlikely to change the file structure).

However, may still be worth trying to push data bugfixes.

Data Update

Hi,
when will the data be updated? Thanks, bye, Alberto

Dashboard for this

Create a simple dashboard similar to e.g. https://carbon.datahub.io or https://london.datahub.io to present this information and provide an open source basis for others to quickly create their own dashboards, especially per country.

Tasks

  • Design the dashboard
  • Sketch out dashboard
  • Implement

Implement

Analysis

Mockup


Charting libraries

v1 - worldwide data with key figures and choropleth map

v2 - added line chart with cumulative cases in top 5 countries


v3 - ability to select a country and showing a graph with cumulative cases, deaths per day and new cases per day


v4 - added figure for showing cases per 100k population


v5 - added choropleth map (again)



Charts to do

  • Time series of cases
  • Choropleth of cases by country

Needs Analysis

Domain Model

Value: (new confirmed) cases, deaths, recovered

Dimensions:

  • Time
  • Country
    • SubCountry i.e. Province/State
    • City

Job Stories

Key figures (for world and per country)

When wanting to know about the situation, I want to see key figures such as the total number of people infected/recovered/died, so that I understand the current status of the situation in the world.

  • In my country, in my locality

Specific items:

  • How many total cases? [single figure]
  • How many total cases (over time) i.e. cumulative? [time series]
  • How many cases "per day" over time [time series]
  • What is the mortality rate? (how that has changed over time?)
  • Cases in specific locations (lon, lat and by country)
  • Total Case by country (now)
  • Case by country (over time)

"What's happening in my country" => Ditto but just with my country

What's changed

  • When I see the COVID-19 dashboard, I want to see a figure showing change of total number of people affected in last 24h (something like stock market price), so that I can know if it's getting better or not.

Secondary

  • When I see the COVID-19 dashboard, I want to check number of cases per capita, so that I can compare my country against others.

Tertiary

  • When I see the COVID-19 dashboard, I want to see viz showing some correlation with economic indicators (by country), so that I can assess the economic impact.

Meta

  • When I see the COVID-19 dashboard, I want to be able to share it via twitter/facebook/instagram, so that my friends/colleagues can also check it out.
  • ...

State-wise data for the US

Hello,

I saw that some countries (e.g., China, Canada, Australia) have state/province data, but the US does not. Is there a reason that there is only data for the US as a whole?

Thanks!

Open Source Helps!

Thanks for your work to help people in need! Your site has been added! I currently maintain the Open-Source-COVID-19 page, which collects open source projects related to COVID-19, including maps, data, news, APIs, analysis, and medical and supply information. Please share it with anyone who might need the information in the list or may contribute to some of those projects. You are also welcome to recommend more projects.

http://open-source-covid-19.weileizeng.com/

Cheers!

NYT data (for the US)

NYT now have data - just for the US. https://github.com/nytimes/covid-19-data

But it's not open ...

In light of the current public health emergency, The New York Times Company is
providing this database under the following free-of-cost, perpetual,
non-exclusive license. Anyone may copy, distribute, and display the database, or
any part thereof, and make derivative works based on it, provided (a) any such
use is for non-commercial purposes only and (b) credit is given to The New York
Times in any public display of the database, in any publication derived in part
or in full from the database, and in any other public use of the data contained
in or derived from the database.

Dataset Design

Value: (new confirmed) cases, deaths, recovered

Dimensions:

  • Time
  • Country
    • SubCountry i.e. Province/State
Province/State,Country/Region,Lat,Long,date,case
Anhui,Mainland China,31.8257,117.2264,2020-03-04,6
Anhui,Mainland China,31.8257,117.2264,2020-03-05,6
Anhui,Mainland China,31.8257,117.2264,2020-03-06,6
Beijing,Mainland China,40.1824,116.4142,2020-01-22,0
Beijing,Mainland China,40.1824,116.4142,2020-01-23,0
Beijing,Mainland China,40.1824,116.4142,2020-01-24,0
Beijing,Mainland China,40.1824,116.4142,2020-01-25,0

Perfect dataset

Would go with cumulative numbers (we can always difference to get per-day values).

  • What about country totals? Do we compute them and put them in the file (e.g. if Country is null it is the total), or do we aggregate in the browser / elsewhere?
Country,Province,Date,Confirmed,Death,Recovered
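The differencing mentioned above is cheap to do downstream; a minimal sketch (cumulative-to-daily, with the first day treated as starting from zero):

```python
def daily_from_cumulative(series):
    """Given a date-ordered list of cumulative counts, return the
    per-day new counts. The first day's 'new' value equals its
    cumulative value."""
    return [cur - prev for prev, cur in zip([0] + series[:-1], series)]

daily_from_cumulative([2, 5, 5, 9])  # [2, 3, 0, 4]
```

This is why storing cumulative numbers loses nothing: per-day values are always recoverable, while the reverse requires knowing the starting point.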

province2latlon

Province,Lat,Lon

404 on the recovery url

On running process.py, I get a 404 on the recovered URL. This one:
RECOVERED = 'time_series_19-covid-Recovered.csv'

Maybe this is just some temporary url bug, but I thought I'd let you know.

Meanwhile, I have managed to get the script to run by commenting out all references to the recovered portion of the data, which is less than ideal.

Great job!

Blog post updating on progress so far

Blog post(s) to put on datahub.io/blog highlighting progress on this dataset plus all the work by others. Could also blog specific stuff e.g. the modelling background.

@Liyubov do you want to lead on this? I suggest drafting blog posts in markdown in HackMD so that they can be reviewed and then added to datahub.io/blog easily.

Potential Posts

  • How we are collecting and data packaging the data
  • An overview on the data, dashboarding and modelling efforts going on in the ecosystem
  • An overview of modelling approaches

Canada Recovery Data

Not seeing recovery data for Canada, but it is being updated in the Johns Hopkins data.

Those are the only NA's I'm seeing. Great work on this - thanks a ton.

Add clinical trials information

Add data about the current clinical trials being conducted against COVID-19.

This might (or might not) involve scraping some clinical trials registries (e.g. EUCTR, ICTRP etc.).

I will self-assign as I wanted to get them anyway and can't think of a better place to put them. The only caveat is that I will try to patch some of the OpenTrials collectors in order to do that, and that might not be the straightest (or most obvious) path to extract that information.

docs: methodology

Great stuff! I'm planning to use the API for my dashboard Pandemic Estimator, but I wish your API had better documentation of the methodology. I use JHU directly and I know what chaos it is; the most blatant example is that they provide "cumulative" data that quite often is not actually cumulative in practice. And the whole change of file formats, etc.

Can you please describe the methodology for how you deal with this? What's from JHU and what's from CSBS? What has been omitted, what has been "adjusted", and how? Thank you!

Romania data lagging one full day

First of all, congrats on the project! It took me almost no time to synchronize my excel workbook with your csv raw data. Thank you !

I have one issue: Romania data is lagging a full day. Do you think you could refresh the dataset faster or at another time? Or please advise how to proceed.

Thanks Again !

Automate keeping data up to date by pulling data from upstream

We want to automate collecting the data every day (or even every half day?). Since the upstream repo is updated at 23:59 GMT (once a day), we can run our update script right after that time, e.g. 00:00 GMT.

Acceptance criteria

  • The repo is updated at least every day
  • The new dataset is pushed to datahub.io/core/covid-19

Tasks

  • Build action - #15
    • Create github actions to:
      • setup python project
      • install dependencies
      • run the update scripts
      • commit changes and push to the repo (master branch)
    • Run it on a schedule at 00:00 GMT
    • Run it on master branch only
    • Setup github token so the action is authorized to push to the repo
  • Deploy action (to datahub.io) - 4c98133
    • prepare datapackage.json for the dataset
    • setup node project
    • install data-cli via npm/yarn
    • run data push command
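The build-action steps above could look something like the following workflow sketch. File layout, branch name, and action versions are assumptions; only scripts/requirements.txt and process.py appear in the issues above:

```yaml
# Sketch of a scheduled update workflow (paths and versions assumed).
name: Update data
on:
  schedule:
    - cron: "0 0 * * *"   # 00:00 GMT daily, just after upstream's 23:59 GMT update
jobs:
  update:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
      - run: pip install -r scripts/requirements.txt
      - run: python process.py
      - run: |
          git config user.name "actions-user"
          git add -A
          git commit -m "Automated data update" || true   # no-op when nothing changed
          git push
```

The push step relies on the GitHub token being configured so the action is authorized to push to master, per the task list above.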

Future

France aggregated count is down from yesterday, why?

France's confirmed count has an issue:
82 2020-04-12 133670
83 2020-04-13 137875
84 2020-04-14 131361
Why is the number going down from yesterday? As it is an aggregated number, it has to grow or stagnate ...
Thanks for any clarification.
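Drops like this could be caught automatically before publishing; a small sketch that flags every place a cumulative series decreases:

```python
def find_decreases(series):
    """Return (index, previous, current) for every position where a
    cumulative series drops, e.g. the France confirmed counts above."""
    return [(i, series[i - 1], series[i])
            for i in range(1, len(series))
            if series[i] < series[i - 1]]

find_decreases([133670, 137875, 131361])  # [(2, 137875, 131361)]
```

Running such a monotonicity check per country in the update pipeline would surface upstream corrections (or errors) as soon as they appear.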

Admin2/City field missing in US data

Since the following commit, ab35560, the "Admin2" field has been missing from the US CSV files. In my case, I was using this field to filter data by US city, and now I can only do so by state. Can this field be added back into the US datasets?

Reading in the data via read_csv gives NA results for Canada on 29 March

read_csv("time-series-19-covid-combined.csv", col_names = TRUE) gives 68 NA values for Confirmed and Deaths in the last update on 29 March 2020 for Canada. I cannot immediately see why, but I did pull the data into Excel and that works fine. It seems just the read_csv function is not working on this latest update.
