zika's Introduction

Zika Data Repository

This data repository will be used to share publicly available data related to the ongoing Zika epidemic. It is being provided as a resource to the scientific community engaged in the public health response. The data provided here are not official and should be considered provisional and non-exhaustive. The data in reports may change over time, reflecting delays in reporting or changes in classifications. And while accurate representation of the reported data is the objective in the machine readable files shared here, that accuracy is not guaranteed. Before using any of these data, it is advisable to review the original reports and sources, which are provided whenever possible.

If you find the data useful, support data sharing by referencing both the original source and this repository (use DOI below).

DOI

Users

Original reports and data extracted from those reports are categorized by country. For each country, there is a README file (with basic information about currently available datasets), a data guide, and a place names database. Detailed information on formats can be found in the data dictionary. If you find a mistake, please create an issue or (preferably) fix it and submit a pull request.

Contributors

Please see the data dictionary for information on standardization. We are working on getting more countries online. Pull requests will be accepted.

How to contribute

Follow the how to contribute guide to contribute to the CDC zika repo from your fork or local git clone.

Links to other data sources

Additional data

zika's People

Contributors

chendaniely, cmrivers, contatp, daniel-mietchen, dmrodz, eyq9, fabinhojorge, henrichung, karzak, kevinislas2, luismieryteran, majohansson, malon, moiradillon2, mvevans89, nickreich, trvrb, yojimbodurant

zika's Issues

Fix Unknown Municipalities for Colombia

It seems as though the first municipality for each department in Colombia is listed as Unknown in the CO_Places.csv file. Is it possible to fix these? If you can point me in the direction of the script that generates this file, I'd be happy to look into it and submit a pull request.

Thanks

Brazil data guide

The CSV file on GitHub has 2 columns that are not in the current Brazil data guide.

How should microcephaly_indicative_of_infection and microcephaly_zika_positive be coded?

Places metadata for Colombia

There needs to be a CO_Places.csv file with the following columns:

  • location
  • location_type
  • country
  • state_province
  • district_county_municipality
  • city
  • alt_name1
  • alt_name2

While cleaning the Colombia data, the department and municipality columns can be used to create the CO_Places.csv file

After a bit of cleaning I have this:

[image: selection_011]

The first set of questions are:

  1. How do I address Municipio Desconocido, which means "Unknown municipality"? I believe they are unknown counts for a particular municipality, but I could be wrong.
  2. Does department map to state_province and municipality map to district_county_municipality?

Next, there are 94 observations that look like the following:
[image: selection_012]

  1. Is BOGOTA a state and municipality?
  2. Would the [country]-[state/province]-[county/municipality/city] format then be: CO-Bogota-Bogota-Usaquen-Los Cedros?
    • Would the hyphens between cities be confusing?
    • One option is to replace the hyphens in city names with underscores, or to change [country]-[state/province]-[county/municipality/city] to [country]_[state/province]_[county/municipality/city]
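The underscore option can be sketched in base R as follows. This is only an illustration: `make_location` is a hypothetical helper, not part of the repository, and it assumes that spaces inside a name should also become underscores (as in location values such as Haiti-Artibonite-Desssalines_Marchand seen elsewhere in the data).

```r
# Hypothetical helper: build a location key from place-name parts.
# Hyphens and spaces inside a single name become underscores, so that
# "-" is only ever used to separate administrative levels.
make_location <- function(country, ...) {
  parts <- list(...)
  cleaned <- lapply(parts, function(p) gsub("[- ]+", "_", trimws(p)))
  do.call(paste, c(list(country), cleaned, list(sep = "-")))
}

make_location("Colombia", "Bogota", "Usaquen-Los Cedros")
```

With this convention, splitting a location on "-" always yields one element per administrative level, whatever the place names contain.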

cannot use combine-data.R to combine csv files

I have tried to use the combine_data.R file to combine all of the zika csv files in this repository. Given the error and warning messages, I suspect there are multiple csv files that are not following the formatting standards and have additional/fewer columns than required. Code:

tables <- lapply(files, readr::read_csv)
warnings()
# Warning messages:
# 1: Missing column names filled in: 'X10' [10]
# 2: Missing column names filled in: 'X10' [10]
# 3: Missing column names filled in: 'X10' [10]
# 4: Missing column names filled in: 'X10' [10]
# 5: Missing column names filled in: 'X10' [10]
# 6: Missing column names filled in: 'X10' [10], 'X11' [11]
# 7: Missing column names filled in: 'X10' [10], 'X11' [11]
# 8: Missing column names filled in: 'X10' [10]
# 9: Missing column names filled in: 'X10' [10]
# 10: Missing column names filled in: 'X10' [10]
# 11: Missing column names filled in: 'X10' [10]
# 12: Missing column names filled in: 'X10' [10]
# 13: Missing column names filled in: 'X10' [10]
# 14: Missing column names filled in: 'X10' [10]
# 15: Missing column names filled in: 'X10' [10]
# 16: Missing column names filled in: 'X10' [10]
# 17: Missing column names filled in: 'X10' [10]
# 18: Missing column names filled in: 'X10' [10]
# 19: Missing column names filled in: 'X10' [10]
# 20: Missing column names filled in: 'X10' [10]

combined_df <- do.call(rbind , tables)
# Error in rbind(deparse.level, ...) : 
#  numbers of columns of arguments do not match
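One possible workaround, sketched in base R: align every table to a fixed set of columns before stacking, since `rbind` fails on data frames whose columns differ. The column list below is an assumption taken from the nine-column header shown in the "NA rows in data" issue, and `combine_tables` is a hypothetical helper, not a function from the repository.

```r
# Assumed standard columns; adjust to match the data dictionary.
standard_cols <- c("report_date", "location", "location_type", "data_field",
                   "data_field_code", "time_period", "time_period_type",
                   "value", "unit")

# Hypothetical helper: keep only the standard columns, add any that are
# missing as NA, convert everything to character, then stack.
combine_tables <- function(tables, cols = standard_cols) {
  aligned <- lapply(tables, function(df) {
    df <- as.data.frame(df, stringsAsFactors = FALSE)
    for (m in setdiff(cols, names(df))) {
      df[[m]] <- rep(NA_character_, nrow(df))
    }
    df <- df[, cols, drop = FALSE]   # drops stray columns such as X10, X11
    df[] <- lapply(df, as.character) # avoids type clashes between files
    df
  })
  do.call(rbind, aligned)
}
```

Dropping the unnamed extra columns silently may hide real data, so it is worth inspecting them first. `dplyr::bind_rows()` is an alternative that fills missing columns with NA instead of failing.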

date format inconsistent

Some dates are in the 2016-02-20 format and others are in the 2/27/2016 format.

current problematic files:

cdc_data_commit <- '05e6c978330da18ee5902cceabeab742f54294f2'

files <- list.files(path = sprintf('data/zika-%s', cdc_data_commit),
                    pattern = '[0-9]{4}-[0-9]{2}-[0-9]{2}.csv$',
                    recursive = TRUE,
                    full.names = TRUE)

tables <- lapply(files, readr::read_csv)

# Collect the indices of tables whose report_date was not parsed as a Date
not_dates <- c()
for (i in seq_along(tables)) {
    print(i)
    print(class(tables[[i]]$report_date))
    if (!inherits(tables[[i]]$report_date, 'Date')) {
        not_dates <- c(not_dates, i)
    }
}
files[not_dates]
 [1] "data/zika-03022e42828e69ce19b448d40fa806545368b348/Colombia/Municipality_Zika/data/Municipality_Zika_2016-02-27.csv"                        
 [2] "data/zika-03022e42828e69ce19b448d40fa806545368b348/Colombia/Municipality_Zika/data/Municipality_Zika_2016-03-05.csv"                        
 [3] "data/zika-03022e42828e69ce19b448d40fa806545368b348/Colombia/Municipality_Zika/data/Municipality_Zika_2016-03-12.csv"                        
 [4] "data/zika-03022e42828e69ce19b448d40fa806545368b348/Colombia/Municipality_Zika/data/Municipality_Zika_2016-03-19.csv"                        
 [5] "data/zika-03022e42828e69ce19b448d40fa806545368b348/Colombia/Municipality_Zika/data/Municipality_Zika_2016-03-26.csv"                        
 [6] "data/zika-03022e42828e69ce19b448d40fa806545368b348/Colombia/Municipality_Zika/data/Municipality_Zika_2016-04-02.csv"                        
 [7] "data/zika-03022e42828e69ce19b448d40fa806545368b348/Colombia/Municipality_Zika/data/Municipality_Zika_2016-04-09.csv"                        
 [8] "data/zika-03022e42828e69ce19b448d40fa806545368b348/Colombia/Municipality_Zika/data/Municipality_Zika_2016-04-16.csv"                        
 [9] "data/zika-03022e42828e69ce19b448d40fa806545368b348/Dominican_Republic/Epidemiological_Bulletin/data/Epidemiological_Bulletin-2016-03-26.csv"
[10] "data/zika-03022e42828e69ce19b448d40fa806545368b348/Dominican_Republic/Epidemiological_Bulletin/data/Epidemiological_Bulletin-2016-04-02.csv"
[11] "data/zika-03022e42828e69ce19b448d40fa806545368b348/Dominican_Republic/Epidemiological_Bulletin/data/Epidemiological_Bulletin-2016-04-09.csv"
[12] "data/zika-03022e42828e69ce19b448d40fa806545368b348/Dominican_Republic/Epidemiological_Bulletin/data/Epidemiological_Bulletin-2016-04-16.csv"
[13] "data/zika-03022e42828e69ce19b448d40fa806545368b348/Dominican_Republic/Epidemiological_Bulletin/data/Epidemiological_Bulletin-2016-04-23.csv"
[14] "data/zika-03022e42828e69ce19b448d40fa806545368b348/United_States/CDC_Report/data/CDC_Report-2016-04-06.csv"                                 
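Until the files are corrected, both formats could be parsed on read. The following is a minimal base R sketch; `parse_report_date` is a hypothetical helper, and it assumes the slash-separated dates are month/day/year, as the 2/27/2016 example suggests.

```r
# Hypothetical helper: parse report dates written either as YYYY-MM-DD
# or as M/D/YYYY (assumed month first). Anything else becomes NA.
parse_report_date <- function(x) {
  x <- trimws(as.character(x))
  iso <- as.Date(x, format = "%Y-%m-%d")
  us  <- as.Date(x, format = "%m/%d/%Y")
  out <- iso
  use_us <- is.na(out)       # fall back only where ISO parsing failed
  out[use_us] <- us[use_us]
  out
}

parse_report_date(c("2016-02-20", "2/27/2016"))
```

Values that match neither format stay NA, which makes them easy to list and fix at the source.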

error in field name for 2017 Week 29 in Colombia

For Colombia/Epidemiolgical_Bulletin_2017-Week-29.csv, there is "ï.." before report_date in the name of that field for some reason. It needs to be removed.

was going to fix but don't have a fork set up and am otherwise not intending to contribute, so am posting as an issue instead.
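A prefix like this usually comes from a UTF-8 byte-order mark at the start of the file, which `read.csv` turns into "ï.." when it sanitizes column names. That diagnosis is an assumption about this particular file. As a sketch, the hypothetical helper below strips the prefix after reading:

```r
# Hypothetical helper: remove a leading byte-order mark from column names,
# both in raw form and in the "ï.." form produced by read.csv.
clean_names <- function(nms) {
  nms <- sub("^\ufeff", "", nms)          # raw UTF-8 BOM character
  nms <- sub("^\u00ef\\.\\.", "", nms)    # BOM after name sanitizing
  nms
}

# Usage: names(df) <- clean_names(names(df))
```

Alternatively, `read.csv(path, fileEncoding = "UTF-8-BOM")` drops the mark while reading, and re-saving the file without a BOM fixes it at the source.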

Rise then dip in Texas totals from March 9-23

I noticed in the dataset the CDC reported Texas had 19 travel-associated cases on March 9, then 34 on March 16, then 23 on March 23. I thought this was an anomaly because how can the number of new cases drop?

I spoke with a spokesman, and he told me Texas had double counted what they sent to the CDC. Would it be possible to get the correct number for Texas in the March 16 update? Thanks!

NA rows in data

After stacking the data, there are a bunch of rows that are all NA; this probably comes from opening a file in a spreadsheet program.

> combined_df[is.na(combined_df$report_date), ]
# A tibble: 7 x 9
  report_date location location_type data_field data_field_code time_period time_period_type value  unit
       <date>    <chr>         <chr>      <chr>           <chr>       <chr>            <chr> <chr> <chr>
1        <NA>                                                          <NA>             <NA>  <NA>      
2        <NA>                                                          <NA>             <NA>  <NA>      
3        <NA>                                                          <NA>             <NA>  <NA>      
4        <NA>                                                          <NA>             <NA>  <NA>      
5        <NA>                                                          <NA>             <NA>  <NA>      
6        <NA>                                                          <NA>             <NA>  <NA>      
7        <NA>                                                          <NA>             <NA>  <NA>      
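Such rows could be filtered out after stacking. A minimal base R sketch follows; `drop_empty_rows` is a hypothetical helper that treats a row as empty when every field is NA or blank, matching the output above.

```r
# Hypothetical helper: drop rows in which every column is NA or an
# empty/whitespace-only string.
drop_empty_rows <- function(df) {
  blank_cols <- lapply(df, function(col) {
    is.na(col) | trimws(as.character(col)) == ""
  })
  all_blank <- Reduce(`&`, blank_cols)
  df[!all_blank, , drop = FALSE]
}

# Usage: combined_df <- drop_empty_rows(combined_df)
```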

Need data validation script

#31 (spaces), #32 (NA), and #35 (missing location) address some problems with the raw data.

Should have a separate script to validate all data so users can check for 'bad' data
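As a starting point, such a script might look like the base R sketch below. Everything in it is an assumption rather than the repository's actual tooling: `validate_table` is a hypothetical function, the expected columns are taken from the nine-column header shown in the "NA rows in data" issue, and the checks only cover the problems mentioned in these issues (column mismatches, date format, stray spaces, missing locations).

```r
# Assumed standard columns; adjust to match the data dictionary.
expected_cols <- c("report_date", "location", "location_type", "data_field",
                   "data_field_code", "time_period", "time_period_type",
                   "value", "unit")

# Hypothetical validator: returns a character vector describing the
# problems found in one table (empty if none were found).
validate_table <- function(df, cols = expected_cols) {
  problems <- character(0)

  missing_cols <- setdiff(cols, names(df))
  extra_cols <- setdiff(names(df), cols)
  if (length(missing_cols) > 0) {
    problems <- c(problems, paste("missing columns:",
                                  paste(missing_cols, collapse = ", ")))
  }
  if (length(extra_cols) > 0) {
    problems <- c(problems, paste("unexpected columns:",
                                  paste(extra_cols, collapse = ", ")))
  }

  if ("report_date" %in% names(df)) {
    d <- as.character(df$report_date)
    bad <- !is.na(d) & !grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", d)
    if (any(bad)) {
      problems <- c(problems, sprintf(
        "%d report_date value(s) not in YYYY-MM-DD format", sum(bad)))
    }
    if (any(is.na(d))) {
      problems <- c(problems, sprintf(
        "%d row(s) with missing report_date", sum(is.na(d))))
    }
  }

  if ("location" %in% names(df)) {
    loc <- as.character(df$location)
    blank <- is.na(loc) | trimws(loc) == ""
    padded <- !blank & loc != trimws(loc)
    if (any(blank)) {
      problems <- c(problems, sprintf(
        "%d row(s) with missing location", sum(blank)))
    }
    if (any(padded)) {
      problems <- c(problems, sprintf(
        "%d location value(s) with leading/trailing spaces", sum(padded)))
    }
  }

  problems
}
```

Running it over every file, for example with `lapply(tables, validate_table)`, would give a per-file list of issues to fix.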

Any explanation for negative counts?

Great job pulling this data together, and many thanks for making it public!

From what I gather, most of these counts are cumulative, and if you difference them, you get a surprising number of negative values. Anyone know why? It looks like someone is aware of this issue, and censoring them to 0 --

if (any(confirmed_codes != "")) {
  confirmed[2:length(confirmed)][confirmed[2:length(confirmed)] < 0] <- 0
}
if (any(suspected_codes != "")) {
  suspected[2:length(suspected)][suspected[2:length(suspected)] < 0] <- 0
}

BTW, I'm developing an R package that provides some more tools for exploring this data -- https://github.com/cpsievert/zikar
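To see where negative values arise, the differences can be flagged rather than censored. The sketch below uses a hypothetical helper, `weekly_increments`, and takes the three Texas counts from the "Rise then dip in Texas totals" issue as example input.

```r
# Hypothetical helper: difference a cumulative series and flag the steps
# where the running total went down instead of silently setting them to 0.
weekly_increments <- function(cumulative) {
  inc <- diff(cumulative)
  data.frame(
    index = seq_along(inc) + 1L,  # position of the later report
    increment = inc,
    negative = inc < 0
  )
}

# Texas travel-associated cases reported on March 9, 16 and 23
weekly_increments(c(19, 34, 23))
```

A negative increment points to a downward revision of the cumulative total, such as the double counting described for Texas, so flagged rows are candidates for checking against the original reports.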

incomplete entry for Colombia pre-2017

The only data that have been entered for Colombia before 2017 pertained specifically to pregnant women. It would be really helpful to have confirmed and suspected case numbers for the population as a whole. By department would be a good place to start if the more numerous municipalities are slowing things down.

Github Pages

I think this repository should have a GitHub Pages site soon; it could display the current README.md in a friendlier, more graphical way.

Haiti Admin 1 location is not the country

The first location in the Haiti data should be Haiti. All but one of the values have the country listed as the first location value.

E.g., should be Haiti-Artibonite-Desssalines_Marchand not just Artibonite-Desssalines_Marchand
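A fix along these lines could be applied when cleaning the file. This is a base R sketch with a hypothetical helper, `add_country_prefix`, which assumes location levels are separated by "-" as in the example above.

```r
# Hypothetical helper: prepend the country to any location value that
# does not already start with it.
add_country_prefix <- function(location, country = "Haiti") {
  has_prefix <- location == country |
    startsWith(location, paste0(country, "-"))
  ifelse(has_prefix, location, paste(country, location, sep = "-"))
}

add_country_prefix(c("Artibonite-Desssalines_Marchand", "Haiti-Ouest"))
```

Values that already begin with the country are left unchanged, so the function can be run safely over the whole column.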
