blankerl / dxy-covid-19-data


COVID-19/2019-nCoV Infection Time Series Data Warehouse

Home Page: https://lab.isaaclin.cn/nCoV/

License: MIT License

Python 100.00%

2019-ncov data-warehouse

dxy-covid-19-data's Introduction

2019 Novel Coronavirus Epidemic Time-Series Data Warehouse

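This project is a time-series data warehouse for the 2019 novel coronavirus (COVID-19/2019-nCoV) epidemic. The data source is Ding Xiang Yuan (DXY).

Recently, several university faculty members and students contacted me hoping to use these data for research. However, they were not familiar with using the API or processing JSON data, so I built this data warehouse. It directly publishes CSV files that most statistical software can open, in the hope of reducing everyone's workload.

The data are collected by the 2019 novel coronavirus real-time crawler.

The program runs every day at 00:00, and the data are pushed to Release.

Due to server bandwidth constraints, since March 19, 2020, the API endpoints /nCoV/api/overall and /nCoV/api/area no longer return time-series data; the time-series data are available in the json folder. If your requests use the latest=0 parameter, you need to change them; otherwise no change is needed.

Because my time is limited, I do not accept requests for customized data. If you have further requirements, please process the data yourself.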

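Data Description

Some data are counted more than once. As described in Issue #21, the city-level data for Henan Province contain both a "南阳（含邓州）" (Nanyang, including Dengzhou) entry and a separate "邓州" (Dengzhou) entry, so Dengzhou's figures are counted twice when summing.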
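One way to avoid the double count when summing is to skip any city that another row already declares as included. The sketch below is only an illustration: the function name and the counts are made up, and it assumes the "（含…）" suffix is the only form in which such inclusions appear.

```python
import re

# Matches names such as "南阳（含邓州）" and captures the embedded city, "邓州".
INCLUDES = re.compile(r"（含(.+?)）")


def province_total(city_counts):
    """Sum city-level counts, skipping cities already folded into another row."""
    included = set()
    for name in city_counts:
        match = INCLUDES.search(name)
        if match:
            included.add(match.group(1))
    return sum(count for name, count in city_counts.items() if name not in included)


# Made-up counts, for illustration only.
sample = {"南阳（含邓州）": 100, "邓州": 10, "郑州": 50}
total = province_total(sample)  # the separate "邓州" row is ignored
```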

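Data Anomalies

So far, some time-series data for Zhejiang and Hubei provinces have been found to be anomalous. A possible reason is that Ding Xiang Yuan's data are entered manually and some values may be mistyped. For example, one crawl recorded 537 cured cases in Zhejiang, and the figure was corrected back to the normal value a few minutes later.
Issue #110 reports that the confirmed counts for Changchun and Jilin City in Jilin Province were swapped in Ding Xiang Yuan's March 15 update. To keep the data complete, I did not modify these records; please adjust them manually when using the data.

The crawler only collects and stores data that Ding Xiang Yuan has made public; it does not detect or handle outliers. If you use these data for research, please clean them yourself. I have also opened a channel in the Issues for reporting anomalous data: you can report potential anomalies directly in that issue, and I will check and handle them periodically.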
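As a starting point for such cleaning, short-lived entry errors in a cumulative series can be flagged by comparing each snapshot with its neighbours. This is a minimal sketch, not the project's method: the function name is hypothetical, and it assumes the series should normally be non-decreasing, which does not hold when counts are legitimately revised downward.

```python
def suspicious_points(values):
    """Return indexes whose value breaks an otherwise non-decreasing run.

    A point is flagged when its two neighbours are consistent with each other
    (previous <= next) but the point itself lies outside that range.
    """
    flagged = []
    for i in range(1, len(values) - 1):
        prev, cur, nxt = values[i - 1], values[i], values[i + 1]
        if prev <= nxt and not (prev <= cur <= nxt):
            flagged.append(i)
    return flagged


# Only 537 comes from the README; the neighbouring values are invented.
cured = [240, 242, 537, 245, 250]
flagged = suspicious_points(cured)
```

Flagged points still need manual review, since the check cannot tell a typo from a genuine correction.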

dxy-covid-19-data's Issues

Incorrect English name for the United Kingdom in DXYArea.json

"countryEnglishName": "United Kiongdom",
It should be
United Kingdom

Python Analysis

Hi,
I created a Python Analysis repo here:

https://github.com/jianxu305/nCov2019_analysis

Could you please add this to reference? Thanks.

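Could the data be stored by date?

Thanks to the author for the effort.
Currently json/csv contain only the latest data.
Could the data be stored by date under the json/csv directories? Combined with gitcdn, this would help relieve server load, save the author's server costs, and make the data more convenient to use.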

Incorrect death count at [2020-02-09 08:10:06.607]

"3795" appears to have an extra 3.

'https://img1.dxycdn.com/2020/0208/356/3395436496692894611-135.png', 'https://img1.dxycdn.com/2020/0208/599/3395474215095538530-135.png', 'https://img1.dxycdn.com/2020/0208/502/3395474230127927756-135.png', 'https://img1.dxycdn.com/2020/0208/704/3395474279520515356-135.png', 'https://img1.dxycdn.com/2020/0208/629/3395474292405418005-135.png']",,,36833,27657,2602,3795,6101.0,,,,,,该字段已替换为说明1,易感人群：人群普遍易感。老年人及有基础疾病者感染后病情较重，儿童及婴幼儿也有发病,潜伏期：一般为 3～7 天，最长不超过 14 天，潜伏期内可能存在传染性，其中无症状病例传染性非常罕见,宿主：野生动物，可能为中华菊头蝠,,,病毒：新型冠状病毒 2019-nCoV,传染源：新冠肺炎的患者。无症状感染者也可能成为传染源。,传播途径：经呼吸道飞沫、气溶胶传播、接触传播是主要的传播途径。消化道等传播途径尚待明确。,疑似病例数来自国家卫健委数据，目前为全国数据，未分省市自治区等,,[],2020-02-09 08:10:06.607

Can you review the pull request?

I am trying to put in some analysis codes into the project. Can you please review the pull request?

If you feel ok to merge that into master, then I can directly contribute to this project this way. Please let me know if you have any concerns. Thanks.

Redundant data

I noticed this consolidated data for the whole of China.

Could it be removed, since there are already per-province data for China?

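If working with the CSV file is inconvenient, take a look at this data repository, which already stores the data grouped by province and region

CSV format

JSON array format

JSON dictionary format

The JSON on the homepage can be loaded directly for front-end use, for example:

Details for Hainan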

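Since the 13th, Hubei has included clinically diagnosed cases in its counts, which caused the numbers to surge. Could a separate column be added for clinically diagnosed cases?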

Possible duplicate

This may duplicate https://github.com/wuhan-support/dataset/tree/master/epidemic_history

Many suspected-case counts are zero

Someone raised this before and the issue was closed, but the problem has not been solved. Why is that?

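Data differ from those published by the National Health Commission

Take the file DXYOverall.json as an example: I use git to pull the last committed version of the data file each day and treat it as that day's data, but it conflicts with the National Health Commission data for the following day.

For example, this is the last data for March 18: 86 new confirmed cases, 11 new deaths, and 12 new suspected cases.

This is the National Health Commission data: 34 new confirmed cases, 8 new deaths (8 in Hubei), and 23 new suspected cases.
The two sets of figures clearly differ substantially.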

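Your project has been listed in OpenSourceWuhan

Hello!

Thank you for your contribution to Wuhan and the open-source community. Your project has been listed in OpenSourceWuhan. If you have any questions, please open an issue.

OpenSourceWuhan
It gathers the projects on open-source platforms that support Wuhan and serves as an entry point linking them together.

Site:
https://weileizeng.github.io/OpenSourceWuhan/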

Suggest changing the English name of 陕西 to Shaanxi

so that it can be distinguished from the English name of 山西 (Shanxi).
see http://www.stats.gov.cn/tjsj/ndsj/2019/indexeh.htm

Double quote missing!

Hi,
There's a double quote missing in DXYArea.json line 475.
Adrien

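Question about 2019-nCoV cumulative data

May I ask:
1. Where can I find the cumulative confirmed count as of January 29, 2020 (any time between 10:00 and 23:00)?
2. Where can I find the cumulative confirmed count as of January 30, 2020 (any time between 10:00 and 23:00)?
Thank you!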

Fields in csv/DXYArea.csv contain the comma delimiter

**维吾尔自治区,Xinjiang,兵团第八师石河子市,"Shihezi, Xinjiang Production and Construction Corps 8th Division",70,0,10,1,3,0,0,1,2020-02-15 22:08:57.299
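The field in question is wrapped in double quotes, which is standard CSV quoting, so a CSV-aware parser reads it as one field while a plain split on commas does not. A minimal sketch with Python's csv module, using the row above with its first field left out:

```python
import csv
import io

# Row taken from the issue, minus the leading province-name field.
line = (
    'Xinjiang,兵团第八师石河子市,'
    '"Shihezi, Xinjiang Production and Construction Corps 8th Division",'
    '70,0,10,1,3,0,0,1,2020-02-15 22:08:57.299'
)

naive = line.split(",")                       # breaks the quoted field in two
parsed = next(csv.reader(io.StringIO(line)))  # honours the double quotes

print(len(naive), len(parsed))
print(parsed[2])
```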

Abnormal data on 2020-02-03

The Feb 3 CSV data contain logically duplicated entries with cityName "南阳" and "南阳（含邓州）", "商丘" and "商丘（含永城）", etc. As a result, counts aggregated at the province level will be too large for Feb 3.

If this project doesn't clean this data, it would be better to notify users so that they are aware of it. Thanks.

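Copyright inquiry

Hello, thank you very much for providing such rich data on this platform.
We are planning to use the COVID-19 data for research. May we use the data you provide directly? Do you have any other requirements? Please let us know. Thank you!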

English version of Area Data - DXYArea.csv

Hi,
Thank you for sharing this dataset. But it would be great if you can provide the dataset with the names of cities and provinces in English.

Thank you,

Best Regards,

Mahasen

Many garbled characters in DXYAreas.json

Some records are also mixed up and invalid, for example:

Just search for 'Bhutan' to take a look; there are 3 results.

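Provide the full data in JSON, as is done for CSV

Hello,

I think the JSON data could also provide the full data, as the CSV data do. Alternatively, two versions could be provided, one for latest=0 and one for latest=1. At present the JSON version only provides the latest data.

The CSV data you provide do not have as many fields as the JSON, for example the country-related country* fields, so obtaining the full data requires the API, which adds load to it. If the JSON data also included everything, users could derive their own formats or pick specific fields directly from the JSON without calling the API, reducing the load.

The Ding Xiang Yuan web page also has worldwide data. Could that be crawled and turned into a CSV file as well?

Thanks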

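Anomaly report

Hello developer, I noticed that the per-province data in your CSV folder contain no data for Hong Kong, **, and Macao, although the API test does return them. I look forward to your reply. Thank you.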

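The newly generated data seem to be formatted inconsistently

Hello, the newly generated CSV data seem to be formatted inconsistently: sometimes the updateTime field is in the middle and sometimes it moves to the end, and when it is in the middle the counts become floating-point numbers. Are two scheduled jobs running at the same time?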
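Until the output is stable, a reader can protect itself by looking columns up by header name rather than position and by normalising counts that arrive as floats. The sketch below is illustrative only: the function name and both sample snippets are invented, and only the column names cityName, confirmedCount and updateTime are taken from this page.

```python
import csv
import io


def read_rows(text):
    """Yield rows keyed by header name, with confirmedCount normalised to int."""
    for row in csv.DictReader(io.StringIO(text)):
        record = dict(row)
        # int(float(...)) accepts both "6101" and "6101.0".
        record["confirmedCount"] = int(float(record["confirmedCount"]))
        yield record


# Two invented snippets with different column orders and number formats.
variant_a = "cityName,updateTime,confirmedCount\nA,2020-02-09 08:10:06.607,6101.0\n"
variant_b = "cityName,confirmedCount,updateTime\nA,6101,2020-02-09 08:10:06.607\n"

rows_a = list(read_rows(variant_a))
rows_b = list(read_rows(variant_b))
```

Both variants produce the same records, so downstream code does not depend on which layout a given file uses.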

failure in db connection

I failed to connect to your database; the error is "getaddrinfo failed", which I think might be due to an improper URI. Your kind help is very much appreciated!

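Anomalous data

Feedback on the DXYArea data

In the Wuhan death data for February 14, one row has 1124, which affects data cleaning (I take the daily maximum for my statistics, so this interferes a lot):
湖北省,武汉,51986,0,4131,1426,35991,0,2286,1124,2020-02-14 08:10:27.048
It should be 1106.

The Wuhan cured data for February 2 contain a value of 252, which is also wrong:
湖北省,武汉,9074,0,215,294,4109,0,252,224,2020-02-02 18:23:15.451

Also, why not crawl the daily new figures as well? Only totals are available. The new figures can be obtained by subtracting the previous day's total from the next day's, but this is sometimes inaccurate because cases get written off: the cumulative total for a later day may even be lower than that of the previous day, so the computed increase is negative, which affects statistics and trend assessment.
I wrote a dedicated script just to derive the new figures; it would be great if they could be crawled directly.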
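The derivation described here can be sketched as follows. This is not the reporter's script: the function name and sample values are invented. It takes the last snapshot of each day instead of the maximum, so a mistyped spike that is later corrected does not win, and it marks negative differences as revisions instead of hiding them.

```python
def daily_increments(snapshots):
    """Return {date: (increment, revised)} from (updateTime, cumulative) pairs.

    The last snapshot of each day is taken as that day's total. `revised` is
    True when the total went down, i.e. the source wrote cases off.
    """
    last_per_day = {}
    for update_time, count in sorted(snapshots):
        # Timestamps look like "2020-02-14 08:10:27.048"; the first 10
        # characters are the date. Later snapshots overwrite earlier ones.
        last_per_day[update_time[:10]] = count

    result = {}
    previous = None
    for day, count in last_per_day.items():
        if previous is not None:
            delta = count - previous
            result[day] = (delta, delta < 0)
        previous = count
    return result


# Invented cumulative counts, for illustration only.
snapshots = [
    ("2020-02-01 09:00:00.000", 100),
    ("2020-02-01 21:00:00.000", 120),
    ("2020-02-02 21:00:00.000", 150),
    ("2020-02-03 21:00:00.000", 145),
]
increments = daily_increments(snapshots)
```

The first day has no increment because there is no earlier total to subtract from.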

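Data type of updateTime

For the time-series data on epidemic changes, what is the data type of the updateTime (data update time) field in /nCoV/api/area, and how should it be parsed?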
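The CSV rows quoted elsewhere on this page carry updateTime as text such as "2020-02-15 22:08:57.299". The API format is not stated here; the sketch below assumes, without confirmation from the source, that the API returns an integer count of milliseconds since the Unix epoch. Both helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_csv_time(text):
    """Parse the textual form seen in the CSV files (no timezone attached)."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")


def parse_epoch_ms(milliseconds):
    """Parse an integer millisecond timestamp (assumed API format) as UTC."""
    return EPOCH + timedelta(milliseconds=milliseconds)


from_csv = parse_csv_time("2020-02-15 22:08:57.299")
from_api = parse_epoch_ms(1580000000000)  # arbitrary sample value
```

Using integer timedelta arithmetic avoids the rounding that dividing by 1000 into a float can introduce.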

Duplicate data for : /overall api and area?provinceEng=China !

The "confirmedCount" value for
https://lab.isaaclin.cn/nCoV/api/area?provinceEng=China AND
https://lab.isaaclin.cn/nCoV/api/overall?latest=0

is the same. Please have a look at it!

suspectedCount in DXYArea.json are all 0

This is not consistent with the data in other files, and it is also not correct right now.

Duplicated Documents Removed

Thank you for your support.

Recently, while browsing the database, I found that Ding Xiang Yuan updates its data abnormally: the createTime and modifyTime fields of a large amount of overseas data and a small amount of mainland China data change even when the epidemic figures do not. As a result, foreign data were stored in the database multiple times, and the only differences between the duplicates are createTime and modifyTime.

I suspect that Ding Xiang Yuan changes the createTime and modifyTime fields whenever the data of any country/province/city are modified, which causes this problem. Therefore, in the two most recent updates to the real-time crawler, ced5fda and 540ae98, I removed these two fields, so similar problems will not occur in the future.

For the historical data, I deleted the duplicate entries according to the following rules:

Keep the data obtained the first time and delete the remaining duplicates. For example, if the same epidemic data were obtained at three different time points, only the data from the first time point are retained and the data from the other two are deleted.
Screen for duplicates on the epidemic-data fields only. If the operator who entered the data differs, both documents are retained even when the numbers are identical.

(The wording above may be imprecise; see the linked reference.)

A total of 12716 duplicate documents were removed.

From the latest update d166029 onward, duplicate entries are no longer retained. To trace the duplicated entries, check c8d6947 and earlier data.
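The two rules can be sketched roughly as below. This is an illustration under assumptions, not the author's actual cleanup code: the function name is hypothetical, documents are assumed to be flat dictionaries ordered oldest first, updateTime is assumed to be the crawl-time field, and a document counts as a duplicate if any earlier document has identical remaining fields.

```python
# Fields ignored when comparing documents. createTime and modifyTime are named
# in the announcement; treating updateTime the same way is an assumption.
IGNORED_FIELDS = {"createTime", "modifyTime", "updateTime"}


def drop_duplicates(documents):
    """Keep the first document for each distinct set of non-time fields."""
    seen = set()
    kept = []
    for doc in documents:  # assumed ordered by crawl time, oldest first
        key = tuple(sorted((k, v) for k, v in doc.items() if k not in IGNORED_FIELDS))
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


# Invented documents: the second differs only in its timestamps, while the
# third has a different operator and is therefore kept.
docs = [
    {"countryName": "X", "confirmedCount": 5, "operator": "a", "createTime": 1, "modifyTime": 1},
    {"countryName": "X", "confirmedCount": 5, "operator": "a", "createTime": 2, "modifyTime": 2},
    {"countryName": "X", "confirmedCount": 5, "operator": "b", "createTime": 3, "modifyTime": 3},
]
kept = drop_duplicates(docs)
```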

cannot download with read.csv()

DTdxy <- read.csv("https://raw.githubusercontent.com/BlankerL/DXY-COVID-19-Data/master/csv/DXYArea.csv", header = TRUE, stringsAsFactors = FALSE)

It raises an error:

Error in file(file, "rt") :
cannot open the connection to 'https://raw.githubusercontent.com/BlankerL/DXY-COVID-19-Data/master/csv/DXYArea.csv'

With a warning message:

Warning message:
In file(file, "rt") :
URL 'https://raw.githubusercontent.com/BlankerL/DXY-COVID-19-Data/master/csv/DXYArea.csv': status was 'Couldn't connect to server'

And when I copy the https://raw.githubusercontent.com/BlankerL/DXY-COVID-19-Data/master/csv/DXYArea.csv to browser, the connection cannot be opened either.

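Hello, what is currentConfirmedCount? confirmedCount is the number of confirmed cases, but the API documentation does not describe currentConfirmedCount. Literally it means the current confirmed count.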

The interface seems to be out of order

Timeline data

Can you add a data timeline api to check the tendency of the virus?

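Sync JSON-format files

Hello, I am a contributor to Wuhan2020 and have recently been making charts with the data you crawl.

Could JSON-format files also be synced to this repository?

https://lab.isaaclin.cn/nCoV/ seems to be under heavy load. I have been careful not to overuse it and only update once a day, but now it is not reachable at all... Having JSON files in this repository might relieve the load on the API. Having

https://lab.isaaclin.cn/nCoV/api/area
https://lab.isaaclin.cn/nCoV/api/area?latest=0
https://lab.isaaclin.cn/nCoV/api/overall?latest=0

would, I think, cover most applications.

Thanks!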

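Add data for other countries, Hong Kong, and ** to the CSV file

Thank you for your project. I have two questions.

I previously analyzed data obtained from the API, but access to the API from outside networks seems unstable, so I would now like to add a way to read the CSV files in this data repository. To stay consistent with the data returned by the API, could you, in the area file:

Include data for other countries, Hong Kong, and ** in the CSV
Then add the countryName and countryEnglishName columns
Add a provinceShortName column

Thanks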

Missing quotation mark after "法属波 on line 475 of json/DXYArea.json

undefined:475
            "countryName": "法属波,
                                ^

SyntaxError: Unexpected token
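A consumer can locate this kind of error before using the file by catching the decoder's exception, which carries the line and column. A minimal sketch with Python's standard json module; the function name is hypothetical and the broken snippet is a shortened stand-in for the real record, not a copy of the file.

```python
import json


def find_json_error(text):
    """Return (line, column, message) for the first syntax error, or None."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return err.lineno, err.colno, err.msg
    return None


# Stand-in for the broken record: the closing quote after 法属波 is missing.
broken = '{\n  "countryName": "法属波,\n  "confirmedCount": 1\n}'
error = find_json_error(broken)
```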
