yiranjing / coronavirus-epidemic-covid-19

👩🏻‍⚕️ Covid-19 estimation and forecasting using statistical models; statistical-model forecasts for novel coronavirus pneumonia (Jan 2020)

statistical-models visualization python coronavirus epidemic-model covid-19 seir-model gradient-descent ridge-regression

coronavirus-epidemic-covid-19's Introduction

Estimating and Forecasting the 2019-nCoV Novel Coronavirus Outbreak

简体中文 | English

Date: January 2020

Contents:

  1. Estimate and forecast the 2019-nCoV outbreak in Wuhan

    MSE, basic SEIR model, sentiment analysis; see: How the SEIR model works

    • Model 1: estimate the number of infections in Wuhan at the time of the lockdown
    • Model 2: simulate and forecast the number of infections and the epidemic peak after the Wuhan lockdown
  2. Forecast the nationwide epidemic trend over the next two months using real-time DXY (丁香园) data

    Author: Shih Heng Lo (provided the model idea and guidance); Yiran Jing.
    Baseline: ridge regression, improved by a dynamic SEIR model

    • Nationwide trend forecast
    • Trend forecasts for Hubei province and for regions outside Hubei

Key limitations of the models below:

  1. The conclusions are highly sensitive to the models' assumptions. (It is hard to collect sufficiently accurate and complete information, so some assumptions may not be appropriate.)
    • Each model's sensitivity analysis adjusts some of these assumptions
  2. The models below are all very simple and do not incorporate enough data, so the conclusions are only rough estimates
    • However, since the forecasts target only urban Wuhan, models this simple may well be sufficient
    • The models will be continually updated as new information arrives

Main reasons why forecasting the epidemic's trajectory is difficult:

  1. Much about 2019-nCoV remains unknown
    • For example, we cannot correctly detect all infected patients: about 17% of patients show no obvious symptoms but can still transmit the virus to others
  2. We cannot obtain the true historical data; official figures understate the actual situation, especially for Wuhan
    • For example, we have no way of knowing how many patients had actually been detected at the time of the lockdown
  3. The continual introduction of new official policies strongly affects the epidemic's trajectory
    • For example, traffic restrictions, mandatory home quarantine (which in turn caused many within-household infections), and the three new hospitals in Wuhan that began admitting large numbers of patients after February 5
    • These rapidly changing policies all have a large effect on epidemic control, yet when we forecast with a model, a key premise is that no new policies will be introduced

Data

Real-time data is scraped and stored as CSV files


Estimating and Forecasting the 2019-nCoV Outbreak in Wuhan (Models 1 and 2)

On January 23, 2020, Wuhan, a major transportation hub, was locked down, leaving nine million people confined to the urban area. Before that, five million people had left Wuhan for the Spring Festival. The airport's international passenger traffic is estimated at 19 million.

Given how quickly the novel coronavirus spreads and how drastically Wuhan's resident population changed around the lockdown, I chose different models to estimate the number of infections in Wuhan before and after the lockdown, drawing mainly on recently published papers and on official data.

  • Author: Yiran Jing
  • Key conclusion (for the city of Wuhan only): as of January 23, Wuhan had more than 38,500 infected plus confirmed cases, with a 95% confidence interval of (30,000, 48,470), calculated from the number of infections detected overseas as of January 29 and estimated using 2018 traffic data.

Method: Considering Wuhan is the major air and train transportation hub of China, we use the number of cases exported from Wuhan internationally as the sample, assume the number of infected people follows a Poisson distribution, and then calculate the 95% confidence interval by the profile likelihood method. A sensitivity analysis follows.

Reference: report2 (Jan 21)
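For intuition, the sketch below shows a profile-likelihood interval under the stated Poisson assumption; the `x` and `p` values are illustrative placeholders, not the study's actual inputs.

import numpy as np
from scipy.stats import poisson, chi2

def profile_likelihood_ci(x, p, alpha=0.05, n_max=200_000):
    """Interval for the total number of cases N, given x overseas
    detections and detection probability p, with X ~ Poisson(N * p)."""
    n_grid = np.arange(1, n_max)
    loglik = poisson.logpmf(x, n_grid * p)
    # keep every N whose log-likelihood is within the chi-square cutoff of the maximum
    cutoff = loglik.max() - chi2.ppf(1 - alpha, df=1) / 2
    inside = n_grid[loglik >= cutoff]
    return inside.min(), inside.max()

# e.g. profile_likelihood_ci(x=7, p=0.0017)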

  • Author: Yiran Jing

Method: Deterministic SEIR (susceptible-exposed-infectious-recovered) model with sensitivity analysis

Reference: Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak (Jan 31)
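For reference, a minimal deterministic SEIR sketch consistent with the stated R0 = 2.9 and 14-day recovery period; the incubation period, initial compartment sizes, and horizon are illustrative assumptions, not the repository's exact inputs.

import numpy as np
from scipy.integrate import odeint

def seir(y, t, beta, sigma, gamma, N):
    S, E, I, R = y
    dS = -beta * S * I / N                # susceptibles become exposed
    dE = beta * S * I / N - sigma * E     # exposed become infectious
    dI = sigma * E - gamma * I            # infectious recover or die
    dR = gamma * I
    return dS, dE, dI, dR

N = 9_000_000                  # population remaining in Wuhan after the lockdown
gamma = 1 / 14                 # 14-day average recovery period (stated above)
beta = 2.9 * gamma             # from R0 = beta / gamma = 2.9
sigma = 1 / 5.2                # assumed ~5-day incubation period
y0 = (N - 1500, 1000, 500, 0)  # illustrative initial compartments
t = np.linspace(0, 120, 121)   # 120-day horizon
S, E, I, R = odeint(seir, y0, t, args=(beta, sigma, gamma, N)).T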

  • Key conclusions (for the city of Wuhan only): (based on official data from 2019-12-08 to 2020-02-02)
    • The estimated initial transmission rate R0 (basic reproduction number) is 2.9
    • Even under very optimistic assumptions, the number of active infections in Wuhan is predicted to exceed 14,000 (peak value, not cumulative), with the peak arriving in mid-to-late February at the earliest (the highest point of the red line in the figure below); by the time the epidemic ends, cumulative cases in Wuhan total roughly 50,000 (the green line)
    • Reality check 1: given insufficient medical resources and official undercounting, the actual peak number of patients in Wuhan is likely between 16,000 and 25,000
    • Reality check 2: the transmission risk remained high after the lockdown and before February 5, mainly because many patients infected their whole households; after February 5, three new hospitals in Wuhan came into use, so the transmission risk should drop noticeably

      According to official media reports on February 2, patients were not being detected promptly and isolation measures were poorly implemented. Given this, the actual peak in Wuhan could well exceed 100,000 or even 150,000. Update: after February 5, the three newly built hospitals in Wuhan began admitting patients (6,000 beds in total), so the transmission risk should now be dropping noticeably, since more patients can be hospitalized (treated / forcibly isolated)

    • Combining reality checks 1 and 2, the actual number of patients in Wuhan (peak value, not cumulative) should be between 25,000 and 100,000
    • The lockdown has a very significant effect on epidemic control: the model estimates that without the lockdown, with only patient isolation, the peak in Wuhan could reach 200,000

Method: Dynamic SEIR (susceptible-exposed-infectious-recovered) model, fitted by gradient descent. Models are compared on the test score (MAPE) of the last 5 days; the baseline is ridge regression. Reference: Dynamic SIR model
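For intuition, here is a minimal sketch of this approach, not the repository's exact code: β(t) is given an assumed exponential-decay form and its two parameters are fitted to the observed infected series by gradient descent with central-difference gradients; σ, γ, the initial guess, the learning rate, and the step count are all illustrative assumptions.

import numpy as np

def rollout_infected(params, E0, I0, days, N, sigma=1/5.2, gamma=1/14):
    """Discrete dynamic SEIR rollout; returns the infected series I_t."""
    b0, c = params
    S, E, I = N - E0 - I0, E0, I0
    out = []
    for t in range(days):
        beta = b0 * np.exp(-c * t)      # assumed time-decaying contact rate
        inflow = beta * S * I / N       # new exposures on day t
        # the whole right-hand side uses the day-t values
        S, E, I = S - inflow, E + inflow - sigma * E, I + sigma * E - gamma * I
        out.append(I)
    return np.array(out)

def fit_beta(observed, E0, I0, N, lr=1e-2, steps=500, h=1e-5):
    """Gradient descent on (b0, c) using central-difference gradients of the MSE."""
    params = np.array([0.5, 0.02])      # illustrative initial guess
    for _ in range(steps):
        grad = np.zeros(2)
        for j in range(2):
            bump = np.zeros(2); bump[j] = h
            up = np.mean((rollout_infected(params + bump, E0, I0, len(observed), N) - observed) ** 2)
            dn = np.mean((rollout_infected(params - bump, E0, I0, len(observed), N) - observed) ** 2)
            grad[j] = (up - dn) / (2 * h)
        params -= lr * grad / (np.linalg.norm(grad) + 1e-12)  # normalized step
    return params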

  • Key conclusions (nationwide): (based on official data from 2019-12-08 to 2020-02-13)
    • The number of active confirmed cases will peak above 60,000, with the peak expected before February 20
    • The transmission rate is now effectively under control, down from an initial R0 > 3 to below 0.5
    • By early April, the nationwide number of active infections will fall below 4,000
  • Main model assumptions (see: How the SEIR model works):
    • Constant total population: given the international flight bans, strict home-quarantine measures, and the disease's relatively low fatality rate, this assumption roughly holds
    • In the SEIR model, people in the incubation period are not initially infectious; in reality, however, 2019-nCoV is already quite infectious in its early stage
    • The average recovery period is assumed to be 14 days, similar to SARS
    • The number of undetected cases in incubation is assumed to be 4-5 times the number of suspected cases, and the fatality rate is assumed to be 2%

The red line is the forecast trend of currently infected cases. Notes:

  • Removed: cured or dead
  • Death: removed population * fatality rate
  • Exposed: patients in the incubation period
  • Susceptible: healthy people at risk of being infected
  • Infected: confirmed (and isolated) patients

Actual model performance

The mean absolute percentage error (MAPE) is a measure of the prediction accuracy of a forecasting method in statistics. The MAPE of confirmed cases using data between 2020-02-14 and 2020-02-22 is 0.0066. The figure below visualizes the real observations and the SEIR model predictions for the next 9 days. Overall, the SEIR model predicts the peak timing and the general trend well.
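For reference, a minimal sketch of the MAPE computation used for this comparison (the example array names are placeholders, not the repository's data):

import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error of a forecast."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return np.mean(np.abs((actual - predicted) / actual))

# e.g. mape(observed_confirmed, seir_predictions) -> 0.0066 means an average error of 0.66%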

Dynamic contact rate β as a function of time t

Optimization algorithm: Gradient Descent


Steps for fetching the data:

  1. Fetch the latest data from DXY (丁香园)
## Update data from DXY
$ cd ../data_processing && python DXY_AreaData_query.py # save data out to data folder.

Visualization

The best nationwide visualization I have seen so far


Overseas visualization (English version)

CoronaTracker Analytics Dashboard


Current study notes and tasks related to the epidemic, along with next directions, are updated here: Project

If you are interested in epidemic-related data analysis and visualization, please contact me!

coronavirus-epidemic-covid-19's People

Contributors

chrisyifanjin, williamwu88


coronavirus-epidemic-covid-19's Issues

Loss of the dynamic SEIR model in model3

When the dynamic SEIR model in model3 makes predictions, the loss does not change within each epoch, so the final estimated beta does not converge and keeps decreasing as the number of epochs grows

Thoughts about visualisation and data

Hi, @YiranJing

http://github.com/BlankerL/DXY-2019-nCoV-Crawler

Here are three data visualizations:

1:
http://cuihuan.net/wuhan/news.html
http://github.com/cuihuan/2020_wuhan

Nationwide data:
total nationwide nCoV count and daily new nCoV cases

I think putting suspected/confirmed together and deaths/cured together would make the changes in the data easier to see.

Per-province data:
again split into the two groups: suspected/confirmed and cured/deaths

2:
http://yiqing.ahusmart.com/
https://github.com/hack-fang/nCov

This visualization is built on a map of China, but only a few provinces are marked clearly. I think we could make it so that clicking a province zooms into that province's map, with numbers down to each city, and also add the charts from the first visualization, so we can see both the changes in the data and the real-time counts.

3:
https://github.com/Moyck/2019NCOV

This one is an APK, the Android application package format. It has a slider for adjusting the date; I don't think we need that, because we can use charts that change over time. It is mainly aimed at middle-aged and older users, so it is quite simple.

API:
https://lab.isaaclin.cn/nCoV/
This site has descriptions of every variable in all the datasets

https://github.com/BlankerL/DXY-2019-nCoV-Data
This repository contains all the data, including:
nationwide data: DXYOverall.csv
regional data: DXYArea.csv
news data: DXYNews.csv
rumor data: DXYRumors.csv

I think we can use its DXY data, take each day's last update as that day's totals and increments, and visualize the changes in totals and increments with one day as the time unit.

Derivative of the loss with respect to beta in model3

Could you explain which formula the derivative of the loss with respect to beta in model3 is based on? I understand how the loss is differentiated with respect to I, but how is the derivative of I with respect to beta computed? Thanks
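For what it is worth, here is a hedged sketch of the chain rule involved, assuming a discrete dynamic SEIR update of the usual form (model3's exact update may differ):

\begin{aligned}
E_{t+1} &= E_t + \beta_t \frac{S_t I_t}{N} - \sigma E_t, \qquad
I_{t+1} = I_t + \sigma E_t - \gamma I_t,\\
\frac{\partial L}{\partial \beta_t}
  &= \frac{\partial L}{\partial I_{t+2}}
     \cdot \frac{\partial I_{t+2}}{\partial E_{t+1}}
     \cdot \frac{\partial E_{t+1}}{\partial \beta_t}
   = \frac{\partial L}{\partial I_{t+2}} \cdot \sigma \cdot \frac{S_t I_t}{N}.
\end{aligned}

That is, beta affects I only through E one step earlier, which is where the extra factor σ·S_t·I_t/N comes from.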

Model SEIR Spelling

Hi Yiran,

Thank you for your presentation to the group; I benefited a lot from it.

A friendly reminder: the model is spelled SEIR, but the readme has a typo spelling it SIER

Yijun

dataset format and coding improvements

Hi @chrisyifanjin,

Can you please do some further work on data processing?

1. Can you please add more information, in the same format as:

https://github.com/pdtyreus/coronavirus-ds/blob/master/data/snapshot_jan25_12pm.csv
(please use English column names only, which is easier for non-Chinese readers)

2. Define a Python function for data processing, for example:

import glob
import pandas as pd

def data_cleaning(folder_path):
    """
    Combine data from multiple CSV files within the given folder, and restructure them
    """
    all_filenames = glob.glob(f'{folder_path}/*.csv')
    combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames], ignore_index=True)
    cleaned_data = ...  # restructuring steps elided
    return cleaned_data

folder_path = '../data/China'
cleaned_data  = data_cleaning(folder_path)

Another example

from pyspark.sql import DataFrame
from pyspark.sql.functions import to_timestamp
from pyspark.sql.types import FloatType

def preprocess_data(df: DataFrame) -> DataFrame:
    """
    Apply data processing:
        1) Rename columns
        2) Cast column types
    """
    # 1) Rename column
    df = df.withColumnRenamed("POS Margin on Net Sales", "Margin")

    # 2) Convert the `df` columns to FloatType()
    #    (convertColumn is a user-defined helper that casts each named column)
    columns = ['NetSales', 'QtySold', 'Margin', 'StockQty']
    df = convertColumn(df, columns, FloatType())
    # Convert the Date column to a timestamp
    df = df.withColumn("Date", to_timestamp(df.Date, "yyyyMMdd"))

    return df

3. Please use relative paths instead of absolute paths for the files, so we can run your code without changing the file paths:

i.e. instead of

combined_data=combined_csv.to_csv('/Users/jinyifan/Desktop/Coronavirus-Epidemic-2019-nCov/Data_processing/Data_pro_China.csv',header=True, index=False)

## relative path
combined_data=combined_csv.to_csv('../Data/Conbined_data/China/Data_pro_China.csv',header=True, index=False)
  4. (Optional) You can combine Combined_data_China.ipynb and Combined_data_International.ipynb into a single notebook,
    because they use quite similar code (they can refer to the same function); just define different paths!

Incorrect CI result

The calculation of the 95% CI is incorrect compared to the paper's output.

def calculate_conf_interval(self, alpha=0.05):
    """
    Calculate the confidence interval by MLE:
        Suppose the population follows a binomial distribution Bin(p, N)
        p: probability any one case will be detected overseas
        N: negative binomially distributed function of X (number of cases detected outside mainland China)
    """
    pass

See notebook:
https://github.com/YiranJing/Coronavirus-Epidemic-2019-nCov/blob/master/Estimating%20current%20cases%20in%20Wuhan.ipynb

Corresponding description in the paper:

Confidence intervals can be calculated from the observation that the number of cases
detected overseas, X, is binomially distributed as Bin(p,N), where p = probability any one
case will be detected overseas, and N is the total number of cases. N is therefore a negative
binomially distributed function of X. The results in Table 1 are maximum likelihood estimates
obtained using this negative binomial likelihood function.

I am confused about how to use MLE in the negative binomial likelihood function.

I have currently tried two approaches:

from scipy.stats import nbinom

def calculate_conf_interval(self, alpha=0.05):
    # based on the formula mentioned in the paper
    p = self.calculate_pro_detected_overseas()  # 0.0017373684210526318
    # the estimated number based on the formula mentioned in the paper
    n = self.calculate_total_cases_inWuhan()  # 4029
    # our sample is the number of cases detected overseas
    k = self.international.cases  # 7
    mean, var, skew, kurt = nbinom.stats(k, p, moments='mvsk')

    # Lower bound of the 95% density interval
    lb = nbinom.ppf(alpha/2, k, p)
    # Upper bound of the 95% density interval
    ub = nbinom.ppf(1 - alpha/2, k, p)
    return (lb, ub)  # incorrect!!! this is the density interval, not the distribution of the mean

or

import numpy as np
from scipy import stats
from scipy.stats import nbinom

def calculate_conf_interval(self, alpha=0.05):
    p = self.calculate_pro_detected_overseas()
    n = self.calculate_total_cases_inWuhan()
    # our sample is the number of cases detected overseas
    k = self.international.cases

    mean, var, skew, kurt = nbinom.stats(k, p, moments='mvsk')

    margin = stats.norm.ppf(1 - alpha/2) * np.sqrt(var)
    return (mean - margin, mean + margin)  # traditional normal-approximation CI
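For what it is worth, one possible reading of the quoted passage (an assumption about the paper's intent, not a confirmed fix): if we imagine counting cases until x of them are detected overseas, the total N is x plus a negative binomial count of undetected cases, so shifted nbinom quantiles bound N. A sketch with the inputs from the code comments above:

from scipy.stats import nbinom

# Inputs taken from the comments above: x overseas detections, detection probability p
x, p = 7, 0.0017373684210526318
# In scipy's parametrization, nbinom(x, p) counts failures before the x-th success,
# so N - x ~ NB(x, p); its mean x*(1-p)/p plus x recovers x/p ≈ 4029.
lb = x + nbinom.ppf(0.025, x, p)
ub = x + nbinom.ppf(0.975, x, p)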
