yiranjing / coronavirus-epidemic-covid-19

👩🏻‍⚕️ Covid-19 estimation and forecasting using statistical models; statistical-model forecasts for novel coronavirus pneumonia (Jan 2020)

statistical-models visualization python coronavirus epidemic-model covid-19 seir-model gradient-descent ridge-regression

coronavirus-epidemic-covid-19's Introduction

Estimating and Forecasting the 2019-nCoV Novel Coronavirus Outbreak

简体中文 | English

Date: January 2020

Contents:

  1. Estimate and forecast the 2019-nCoV outbreak in Wuhan

    MSE, basic SEIR model, sentiment analysis; see: How the SEIR model works

    • Model 1: estimate the number of infections in Wuhan at the time of the lockdown
    • Model 2: simulate and forecast the number of infections and the epidemic peak after the Wuhan lockdown
  2. Forecast the nationwide epidemic trend over the next two months using real-time DXY (丁香园) data

    Author: Shih Heng Lo (provided the model idea and guidance); Yiran Jing.
    Baseline: ridge regression, improved by a dynamic SEIR model

    • Nationwide trend forecast
    • Trend forecasts for Hubei province and for regions outside Hubei

Key limitations of the models below:

  1. The conclusions are highly sensitive to the models' assumptions. (It is hard to collect sufficiently accurate and complete information, so some assumptions may not be appropriate.)
    • Each model's sensitivity analysis adjusts some of these assumptions
  2. The models below are all very simple and do not incorporate enough data, so the conclusions are only rough estimates
    • However, since the forecasts target only urban Wuhan, models this simple may well be sufficient
    • The models will be continually updated as new information arrives

Main reasons why forecasting the epidemic's trajectory is difficult:

  1. Much about 2019-nCoV remains unknown
    • For example, we cannot correctly detect all infected patients: about 17% of patients show no obvious symptoms but can still transmit the virus to others
  2. We cannot obtain the true historical data; official figures understate the actual situation, especially for Wuhan
    • For example, we have no way of knowing how many patients had actually been detected at the time of the lockdown
  3. The continual introduction of new official policies strongly affects the epidemic's trajectory
    • For example, traffic restrictions, mandatory home quarantine (which in turn caused many within-household infections), and the three new hospitals in Wuhan that began admitting large numbers of patients after February 5
    • These rapidly changing policies all have a large effect on epidemic control, yet when we forecast with a model, a key premise is that no new policies will be introduced

Data

Real-time data is scraped and stored as CSV files


Estimating and Forecasting the 2019-nCoV Outbreak in Wuhan (Models 1 and 2)

On January 23, 2020, Wuhan, a major transportation hub, was locked down, leaving nine million people confined to the urban area. Before that, five million people had left Wuhan for the Spring Festival. The airport's international passenger traffic is estimated at 19 million.

Given how quickly the novel coronavirus spreads and how drastically Wuhan's resident population changed around the lockdown, I chose different models to estimate the number of infections in Wuhan before and after the lockdown, drawing mainly on recently published papers and on official data.

  • Author: Yiran Jing
  • Key conclusion (for the city of Wuhan only): as of January 23, Wuhan had more than 38,500 infected plus confirmed cases, with a 95% confidence interval of (30,000, 48,470), calculated from the number of infections detected overseas as of January 29 and estimated using 2018 traffic data.

Method: Considering Wuhan is the major air and train transportation hub of China, we use the number of cases exported from Wuhan internationally as the sample, assume the number of infected people follows a Poisson distribution, and then calculate the 95% confidence interval by the profile likelihood method. A sensitivity analysis follows.

Reference: report2 (Jan 21)
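For intuition, the sketch below shows a profile-likelihood interval under the stated Poisson assumption; the `x` and `p` values are illustrative placeholders, not the study's actual inputs.

import numpy as np
from scipy.stats import poisson, chi2

def profile_likelihood_ci(x, p, alpha=0.05, n_max=200_000):
    """Interval for the total number of cases N, given x overseas
    detections and detection probability p, with X ~ Poisson(N * p)."""
    n_grid = np.arange(1, n_max)
    loglik = poisson.logpmf(x, n_grid * p)
    # keep every N whose log-likelihood is within the chi-square cutoff of the maximum
    cutoff = loglik.max() - chi2.ppf(1 - alpha, df=1) / 2
    inside = n_grid[loglik >= cutoff]
    return inside.min(), inside.max()

# e.g. profile_likelihood_ci(x=7, p=0.0017)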

  • Author: Yiran Jing

Method: Deterministic SEIR (susceptible-exposed-infectious-recovered) model with sensitivity analysis

Reference: Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak (Jan 31)
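For reference, a minimal deterministic SEIR sketch consistent with the stated R0 = 2.9 and 14-day recovery period; the incubation period, initial compartment sizes, and horizon are illustrative assumptions, not the repository's exact inputs.

import numpy as np
from scipy.integrate import odeint

def seir(y, t, beta, sigma, gamma, N):
    S, E, I, R = y
    dS = -beta * S * I / N                # susceptibles become exposed
    dE = beta * S * I / N - sigma * E     # exposed become infectious
    dI = sigma * E - gamma * I            # infectious recover or die
    dR = gamma * I
    return dS, dE, dI, dR

N = 9_000_000                  # population remaining in Wuhan after the lockdown
gamma = 1 / 14                 # 14-day average recovery period (stated above)
beta = 2.9 * gamma             # from R0 = beta / gamma = 2.9
sigma = 1 / 5.2                # assumed ~5-day incubation period
y0 = (N - 1500, 1000, 500, 0)  # illustrative initial compartments
t = np.linspace(0, 120, 121)   # 120-day horizon
S, E, I, R = odeint(seir, y0, t, args=(beta, sigma, gamma, N)).T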

  • Key conclusions (for the city of Wuhan only): (based on official data from 2019-12-08 to 2020-02-02)
    • The estimated initial transmission rate R0 (basic reproduction number) is 2.9
    • Even under very optimistic assumptions, the number of active infections in Wuhan is predicted to exceed 14,000 (peak value, not cumulative), with the peak arriving in mid-to-late February at the earliest (the highest point of the red line in the figure below); by the time the epidemic ends, cumulative cases in Wuhan total roughly 50,000 (the green line)
    • Reality check 1: given insufficient medical resources and official undercounting, the actual peak number of patients in Wuhan is likely between 16,000 and 25,000
    • Reality check 2: the transmission risk remained high after the lockdown and before February 5, mainly because many patients infected their whole households; after February 5, three new hospitals in Wuhan came into use, so the transmission risk should drop noticeably

      According to official media reports on February 2, patients were not being detected promptly and isolation measures were poorly implemented. Given this, the actual peak in Wuhan could well exceed 100,000 or even 150,000. Update: after February 5, the three newly built hospitals in Wuhan began admitting patients (6,000 beds in total), so the transmission risk should now be dropping noticeably, since more patients can be hospitalized (treated / forcibly isolated)

    • Combining reality checks 1 and 2, the actual number of patients in Wuhan (peak value, not cumulative) should be between 25,000 and 100,000
    • The lockdown has a very significant effect on epidemic control: the model estimates that without the lockdown, with only patient isolation, the peak in Wuhan could reach 200,000

Method: Dynamic SEIR (susceptible-exposed-infectious-recovered) model, fitted by gradient descent. Models are compared on the test score (MAPE) of the last 5 days; the baseline is ridge regression. Reference: Dynamic SIR model
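For intuition, here is a minimal sketch of this approach, not the repository's exact code: β(t) is given an assumed exponential-decay form and its two parameters are fitted to the observed infected series by gradient descent with central-difference gradients; σ, γ, the initial guess, the learning rate, and the step count are all illustrative assumptions.

import numpy as np

def rollout_infected(params, E0, I0, days, N, sigma=1/5.2, gamma=1/14):
    """Discrete dynamic SEIR rollout; returns the infected series I_t."""
    b0, c = params
    S, E, I = N - E0 - I0, E0, I0
    out = []
    for t in range(days):
        beta = b0 * np.exp(-c * t)      # assumed time-decaying contact rate
        inflow = beta * S * I / N       # new exposures on day t
        # the whole right-hand side uses the day-t values
        S, E, I = S - inflow, E + inflow - sigma * E, I + sigma * E - gamma * I
        out.append(I)
    return np.array(out)

def fit_beta(observed, E0, I0, N, lr=1e-2, steps=500, h=1e-5):
    """Gradient descent on (b0, c) using central-difference gradients of the MSE."""
    params = np.array([0.5, 0.02])      # illustrative initial guess
    for _ in range(steps):
        grad = np.zeros(2)
        for j in range(2):
            bump = np.zeros(2); bump[j] = h
            up = np.mean((rollout_infected(params + bump, E0, I0, len(observed), N) - observed) ** 2)
            dn = np.mean((rollout_infected(params - bump, E0, I0, len(observed), N) - observed) ** 2)
            grad[j] = (up - dn) / (2 * h)
        params -= lr * grad / (np.linalg.norm(grad) + 1e-12)  # normalized step
    return params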

  • Key conclusions (nationwide): (based on official data from 2019-12-08 to 2020-02-13)
    • The number of active confirmed cases will peak above 60,000, with the peak expected before February 20
    • The transmission rate is now effectively under control, down from an initial R0 > 3 to below 0.5
    • By early April, the nationwide number of active infections will fall below 4,000
  • Main model assumptions (see: How the SEIR model works):
    • Constant total population: given the international flight bans, strict home-quarantine measures, and the disease's relatively low fatality rate, this assumption roughly holds
    • In the SEIR model, people in the incubation period are not initially infectious; in reality, however, 2019-nCoV is already quite infectious in its early stage
    • The average recovery period is assumed to be 14 days, similar to SARS
    • The number of undetected cases in incubation is assumed to be 4-5 times the number of suspected cases, and the fatality rate is assumed to be 2%

The red line is the forecast trend of currently infected cases. Notes:

  • Removed: cured or dead
  • Death: removed population * fatality rate
  • Exposed: patients in the incubation period
  • Susceptible: healthy people at risk of being infected
  • Infected: confirmed (and isolated) patients

Actual model performance

The mean absolute percentage error (MAPE) is a measure of the prediction accuracy of a forecasting method in statistics. The MAPE of confirmed cases using data between 2020-02-14 and 2020-02-22 is 0.0066. The figure below visualizes the real observations and the SEIR model predictions for the next 9 days. Overall, the SEIR model predicts the peak timing and the general trend well.
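For reference, a minimal sketch of the MAPE computation used for this comparison (the example array names are placeholders, not the repository's data):

import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error of a forecast."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return np.mean(np.abs((actual - predicted) / actual))

# e.g. mape(observed_confirmed, seir_predictions) -> 0.0066 means an average error of 0.66%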

Dynamic contact rate β as a function of time t

Optimization algorithm: Gradient Descent


Steps for fetching the data:

  1. Fetch the latest data from DXY (丁香园)
## Update data from DXY
$ cd ../data_processing && python DXY_AreaData_query.py # save data out to data folder.

Visualization

The best nationwide visualization I have seen so far


Overseas visualization (English version)

CoronaTracker Analytics Dashboard


Current study notes and tasks related to the epidemic, along with next directions, are updated here: Project

If you are interested in epidemic-related data analysis and visualization, please contact me!

coronavirus-epidemic-covid-19's People

Contributors

chrisyifanjin, williamwu88


coronavirus-epidemic-covid-19's Issues

Loss of the dynamic SEIR model in model3

When the dynamic SEIR model in model3 makes predictions, the loss does not change within each epoch, so the final estimated beta does not converge and keeps decreasing as the number of epochs grows

Thoughts about visualisation and data

Hi, @YiranJing

http://github.com/BlankerL/DXY-2019-nCoV-Crawler

Here are three data visualizations:

1:
http://cuihuan.net/wuhan/news.html
http://github.com/cuihuan/2020_wuhan

Nationwide data:
total nationwide nCoV count and daily new nCoV cases

I think putting suspected/confirmed together and deaths/cured together would make the changes in the data easier to see.

Per-province data:
again split into the two groups: suspected/confirmed and cured/deaths

2:
http://yiqing.ahusmart.com/
https://github.com/hack-fang/nCov

This visualization is built on a map of China, but only a few provinces are marked clearly. I think we could make it so that clicking a province zooms into that province's map, with numbers down to each city, and also add the charts from the first visualization, so we can see both the changes in the data and the real-time counts.

3:
https://github.com/Moyck/2019NCOV

This one is an APK, the Android application package format. It has a slider for adjusting the date; I don't think we need that, because we can use charts that change over time. It is mainly aimed at middle-aged and older users, so it is quite simple.

API:
https://lab.isaaclin.cn/nCoV/
This site has descriptions of every variable in all the datasets

https://github.com/BlankerL/DXY-2019-nCoV-Data
This repository contains all the data, including:
nationwide data: DXYOverall.csv
regional data: DXYArea.csv
news data: DXYNews.csv
rumor data: DXYRumors.csv

I think we can use its DXY data, take each day's last update as that day's totals and increments, and visualize the changes in totals and increments with one day as the time unit.

Derivative of the loss with respect to beta in model3

Could you explain which formula the derivative of the loss with respect to beta in model3 is based on? I understand how the loss is differentiated with respect to I, but how is the derivative of I with respect to beta computed? Thanks
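For what it is worth, here is a hedged sketch of the chain rule involved, assuming a discrete dynamic SEIR update of the usual form (model3's exact update may differ):

\begin{aligned}
E_{t+1} &= E_t + \beta_t \frac{S_t I_t}{N} - \sigma E_t, \qquad
I_{t+1} = I_t + \sigma E_t - \gamma I_t,\\
\frac{\partial L}{\partial \beta_t}
  &= \frac{\partial L}{\partial I_{t+2}}
     \cdot \frac{\partial I_{t+2}}{\partial E_{t+1}}
     \cdot \frac{\partial E_{t+1}}{\partial \beta_t}
   = \frac{\partial L}{\partial I_{t+2}} \cdot \sigma \cdot \frac{S_t I_t}{N}.
\end{aligned}

That is, beta affects I only through E one step earlier, which is where the extra factor σ·S_t·I_t/N comes from.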

Model SEIR Spelling

Hi Yiran,

Thank you for your presentation to the group; I benefited a lot from it.

A friendly reminder: the model is spelled SEIR, but the readme has a typo spelling it SIER

Yijun

dataset format and coding improvements

Hi @chrisyifanjin,

Can you please do some further work on data processing?

1. Can you please add more information, in the same format as:

https://github.com/pdtyreus/coronavirus-ds/blob/master/data/snapshot_jan25_12pm.csv
(please use English column names only, which is easier for non-Chinese readers)

2. Define a Python function for data processing, for example:

import glob
import pandas as pd

def data_cleaning(folder_path):
    """
    Combine data from multiple CSV files within the given folder, and restructure them
    """
    all_filenames = glob.glob(f'{folder_path}/*.csv')
    combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames], ignore_index=True)
    cleaned_data = ...  # restructuring steps elided
    return cleaned_data

folder_path = '../data/China'
cleaned_data  = data_cleaning(folder_path)

Another example

from pyspark.sql import DataFrame
from pyspark.sql.functions import to_timestamp
from pyspark.sql.types import FloatType

def preprocess_data(df: DataFrame) -> DataFrame:
    """
    Apply data processing:
        1) Rename columns
        2) Cast column types
    """
    # 1) Rename column
    df = df.withColumnRenamed("POS Margin on Net Sales", "Margin")

    # 2) Convert the `df` columns to FloatType()
    #    (convertColumn is a user-defined helper that casts each named column)
    columns = ['NetSales', 'QtySold', 'Margin', 'StockQty']
    df = convertColumn(df, columns, FloatType())
    # Convert the Date column to a timestamp
    df = df.withColumn("Date", to_timestamp(df.Date, "yyyyMMdd"))

    return df

3. Please use relative paths instead of absolute paths for the files, so we can run your code without changing the file paths:

i.e. instead of

combined_data=combined_csv.to_csv('/Users/jinyifan/Desktop/Coronavirus-Epidemic-2019-nCov/Data_processing/Data_pro_China.csv',header=True, index=False)

## relative path
combined_data=combined_csv.to_csv('../Data/Conbined_data/China/Data_pro_China.csv',header=True, index=False)
  4. (Optional) You can combine Combined_data_China.ipynb and Combined_data_International.ipynb into a single notebook,
    because they use quite similar code (they can refer to the same function); just define different paths!

Incorrect CI result

The calculation of the 95% CI is incorrect compared to the paper's output.

def calculate_conf_interval(self, alpha=0.05):
    """
    Calculate the confidence interval by MLE:
        Suppose the population follows a binomial distribution Bin(p, N)
        p: probability any one case will be detected overseas
        N: negative binomially distributed function of X (number of cases detected outside mainland China)
    """
    pass

See notebook:
https://github.com/YiranJing/Coronavirus-Epidemic-2019-nCov/blob/master/Estimating%20current%20cases%20in%20Wuhan.ipynb

Corresponding description in the paper:

Confidence intervals can be calculated from the observation that the number of cases
detected overseas, X, is binomially distributed as Bin(p,N), where p = probability any one
case will be detected overseas, and N is the total number of cases. N is therefore a negative
binomially distributed function of X. The results in Table 1 are maximum likelihood estimates
obtained using this negative binomial likelihood function.

I am confused about how to use MLE in the negative binomial likelihood function.

I have currently tried two approaches:

from scipy.stats import nbinom

def calculate_conf_interval(self, alpha=0.05):
    # based on the formula mentioned in the paper
    p = self.calculate_pro_detected_overseas()  # 0.0017373684210526318
    # the estimated number based on the formula mentioned in the paper
    n = self.calculate_total_cases_inWuhan()  # 4029
    # our sample is the number of cases detected overseas
    k = self.international.cases  # 7
    mean, var, skew, kurt = nbinom.stats(k, p, moments='mvsk')

    # Lower bound of the 95% density interval
    lb = nbinom.ppf(alpha/2, k, p)
    # Upper bound of the 95% density interval
    ub = nbinom.ppf(1 - alpha/2, k, p)
    return (lb, ub)  # incorrect!!! this is the density interval, not the distribution of the mean

or

import numpy as np
from scipy import stats
from scipy.stats import nbinom

def calculate_conf_interval(self, alpha=0.05):
    p = self.calculate_pro_detected_overseas()
    n = self.calculate_total_cases_inWuhan()
    # our sample is the number of cases detected overseas
    k = self.international.cases

    mean, var, skew, kurt = nbinom.stats(k, p, moments='mvsk')

    margin = stats.norm.ppf(1 - alpha/2) * np.sqrt(var)
    return (mean - margin, mean + margin)  # traditional normal-approximation CI
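For what it is worth, one possible reading of the quoted passage (an assumption about the paper's intent, not a confirmed fix): if we imagine counting cases until x of them are detected overseas, the total N is x plus a negative binomial count of undetected cases, so shifted nbinom quantiles bound N. A sketch with the inputs from the code comments above:

from scipy.stats import nbinom

# Inputs taken from the comments above: x overseas detections, detection probability p
x, p = 7, 0.0017373684210526318
# In scipy's parametrization, nbinom(x, p) counts failures before the x-th success,
# so N - x ~ NB(x, p); its mean x*(1-p)/p plus x recovers x/p ≈ 4029.
lb = x + nbinom.ppf(0.025, x, p)
ub = x + nbinom.ppf(0.975, x, p)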
