cdcepi / zika

Data repository of publicly available Zika data

License: Apache License 2.0

R 2.85% HTML 97.15%

zika's Introduction

Zika Data Repository

This data repository will be used to share publicly available data related to the ongoing Zika epidemic. It is being provided as a resource to the scientific community engaged in the public health response. The data provided here are not official and should be considered provisional and non-exhaustive. The data in reports may change over time, reflecting delays in reporting or changes in classifications. And while accurate representation of the reported data is the objective in the machine readable files shared here, that accuracy is not guaranteed. Before using any of these data, it is advisable to review the original reports and sources, which are provided whenever possible.

If you find the data useful, support data sharing by referencing both the original source and this repository (use DOI below).

Users

Original reports and data extracted from those reports are categorized by country. For each country, there is a README file (with basic information about currently available datasets), a data guide, and a place names database. Detailed information on formats can be found in the data dictionary. If you find a mistake, please create an issue or (preferably) fix it and submit a pull request.

Contributors

Please see the data dictionary for information on standardization. We are working on getting more countries online. Pull requests will be accepted.

How to contribute

Follow the how to contribute guide to contribute to the CDC zika repo from your fork or local git clone.

Links to other data sources

Additional data

zika's People

Contributors

Stargazers

Watchers

Forkers

nickreich aashanand margareta-c ndssl cmrivers chendaniely trvrb rydela am3rica daniel-mietchen quranw vsolano luismieryteran lalalammy jverma yojimbodurant mattk7 dinacmistry edygarcia breandan adamkucharski rocket-ron majacaci00 ballykea karzak moiradillon2 glambertx rhodnius myeyeisopen anarinsk vessy yuhangwang fabinhojorge geoyi anergictcell yamini1473 dmellop zysong havarisa ltnoce ian-flores avisec sumendar malon lizhihao2013 kaiut titaniumtroop arnholdinstitute anushabala marcusmariano jbarreto1 jldelda rehrenk1 spatial-time-r taoc2016 jmarsh422 fredericksilva srava22s tizianas apulaski jatinrajani fdzul nm-usaid heatxg dbolivar91 mmschmitz ktargows nelsonyao-mindset ashminahacks osyahbana beatrizneto lrexxx gallaghg rborchering lematt1991 odongow qberto jtstacruz brunolucian kath-o-reilly danielelsav mshimizu12 chrisjmello rajeevak40 kevinislas2 harvardlmp mvevans89 jindalpankaj aleesham puzuwe amoslintw syher-rumak aileia tasabeehosama suecavalheiro kevin-ghorbani-gallup tejasingawale lusiferajay gdodeva lifenoteasy

zika's Issues

United_States-Florida-Miami-Dade_County has 4 location levels instead of 3

the data dictionary specifies:
[country]-[state/province]-[county/municipality/city]

but there is data that looks like this: United_States-Florida-Miami-Dade_County
which has 4 parts when split on -
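
One way to cope with hyphenated place names is to cap the number of parts and fold any extra pieces back into the last level. The sketch below is only an illustration: `split_location` and its `max_parts` argument are hypothetical names, not part of the repository's code.

```r
# Split a location string into at most `max_parts` levels.
# Any extra hyphen-separated pieces are rejoined into the last level,
# so "Miami-Dade_County" stays together as one county name.
split_location <- function(location, max_parts = 3) {
  parts <- strsplit(location, "-", fixed = TRUE)[[1]]
  if (length(parts) > max_parts) {
    head_parts <- parts[seq_len(max_parts - 1)]
    tail_part <- paste(parts[max_parts:length(parts)], collapse = "-")
    parts <- c(head_parts, tail_part)
  }
  parts
}

split_location("United_States-Florida-Miami-Dade_County")
```

This assumes that only the last level can contain a hyphen, which may not hold for every country in the repository.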

Fix Unknown Municipalities for Colombia

It seems as though the first municipality for each department in Colombia is listed as Unknown in the CO_Places.csv file. Is it possible to fix these? If you can point me in the direction of the script that generates this file, I'd be happy to look into it and submit a pull request.

Thanks

data_field and data_field_code swapped

In the Colombia municipalities data set, these fields are swapped in the top row for the dates 2016-07-23 and 2016-07-30.

Brazil data guide

the csv file on github here

has 2 columns that are not in the current brazil data guide.

how should microcephaly_indicative_of_infection and microcephaly_zika_positive be coded?

Places metadata for Colombia

There needs to be a CO_Places.csv file with the following columns:

location
location_type
country
state_province
district_county_municipality
city
alt_name1
alt_name2

While cleaning the Colombia data, the department and municipality columns can be used to create the CO_Places.csv file

After a bit of cleaning I have this:

The first set of questions are:

How do I address the Municipio Desconocido, which means "Unknown municipality"? I believe these are counts for which the municipality is unknown, but I could be wrong.
Does department map to state_province and municipality map to district_county_municipality?

Next, there are 94 observations that look like the following:

Is BOGOTA both a state and a municipality?
Would the [country]-[state/province]-[county/municipality/city] format then be: CO-Bogota-Bogota-Usaquen-Los Cedros?
- Would the hyphens between cities be confusing?
- One option is to replace the hyphens in city names with underscores, or to change [country]-[state/province]-[county/municipality/city] to [country]_[state/province]_[county/municipality/city]
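
The first option (underscores inside names, hyphens only between levels) could be sketched as follows. `make_location` is a hypothetical helper, and treating spaces the same way as hyphens is an assumption based on names such as El_Salvador elsewhere in the data.

```r
# Build a location key: hyphens separate levels, underscores replace
# any spaces or hyphens that occur inside a single place name.
make_location <- function(...) {
  parts <- trimws(c(...))
  parts <- gsub("[ -]+", "_", parts)
  paste(parts, collapse = "-")
}

make_location("Colombia", "Bogota", "Los Cedros")
make_location("United_States", "Florida", "Miami-Dade County")
```

With this convention a key can always be split on hyphens without ambiguity, at the cost of losing the original hyphen in names like Miami-Dade.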

inconsistent NA values in El Salvador

Missing values in Epidemiological_Bulletin-2016-03-12.csv are written as N/A in the file.

They should be just NA.
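
Until the file is corrected, a reader can treat both spellings as missing at import time. This is a minimal sketch using base R; the inline CSV text and the location names in it are made up for illustration.

```r
# Illustrative two-row extract; the real files have more columns.
csv_text <- "location,value\nEl_Salvador,N/A\nEl_Salvador-Ahuachapan,3\n"

# Treat "", "NA" and "N/A" all as missing values while reading.
df <- read.csv(text = csv_text,
               na.strings = c("", "NA", "N/A"),
               stringsAsFactors = FALSE)

is.na(df$value)
```

readr::read_csv accepts a comparable `na` argument for the same purpose.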

cannot use combine-data.R to combine csv files

I have tried to use the combine_data.R file to combine all of the zika csv files in this repository. Given the error and warning messages, I suspect there are multiple csv files that are not following the formatting standards and have additional/fewer columns than required. Code:

tables <- lapply(files, readr::read_csv)
warnings()
# Warning messages:
# 1: Missing column names filled in: 'X10' [10]
# 2: Missing column names filled in: 'X10' [10]
# 3: Missing column names filled in: 'X10' [10]
# 4: Missing column names filled in: 'X10' [10]
# 5: Missing column names filled in: 'X10' [10]
# 6: Missing column names filled in: 'X10' [10], 'X11' [11]
# 7: Missing column names filled in: 'X10' [10], 'X11' [11]
# 8: Missing column names filled in: 'X10' [10]
# 9: Missing column names filled in: 'X10' [10]
# 10: Missing column names filled in: 'X10' [10]
# 11: Missing column names filled in: 'X10' [10]
# 12: Missing column names filled in: 'X10' [10]
# 13: Missing column names filled in: 'X10' [10]
# 14: Missing column names filled in: 'X10' [10]
# 15: Missing column names filled in: 'X10' [10]
# 16: Missing column names filled in: 'X10' [10]
# 17: Missing column names filled in: 'X10' [10]
# 18: Missing column names filled in: 'X10' [10]
# 19: Missing column names filled in: 'X10' [10]
# 20: Missing column names filled in: 'X10' [10]

combined_df <- do.call(rbind , tables)
# Error in rbind(deparse.level, ...) : 
#  numbers of columns of arguments do not match
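
To find which files break the rbind, each table's header can be compared with the expected one. The sketch below assumes the nine column names that appear in the combined tibble elsewhere in these issues; `find_nonconforming` is a hypothetical helper, not something in combine_data.R.

```r
# Column names expected in every data file (assumed from the combined table).
expected_cols <- c("report_date", "location", "location_type", "data_field",
                   "data_field_code", "time_period", "time_period_type",
                   "value", "unit")

# Return the indices of tables whose header differs from the expected one.
find_nonconforming <- function(tables, expected = expected_cols) {
  ok <- vapply(tables,
               function(tbl) identical(names(tbl), expected),
               logical(1))
  which(!ok)
}
```

Given the `files` and `tables` objects from the issue, `files[find_nonconforming(tables)]` would list the paths that need attention.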

date format inconsistent

Some dates are in the 2016-02-20 format and others are in the 2/27/2016 format.

current problematic files:

cdc_data_commit <- '05e6c978330da18ee5902cceabeab742f54294f2'

files <- list.files(path = sprintf('data/zika-%s', cdc_data_commit),
                    pattern = '[0-9]{4}-[0-9]{2}-[0-9]{2}\\.csv$',
                    recursive = TRUE,
                    full.names = TRUE)

tables <- lapply(files, readr::read_csv)

not_dates <- c()
for(i in seq_along(tables)){
    print(i)
    print(class(tables[[i]]$report_date))
    # inherits() is safe even when an object has more than one class
    if(!inherits(tables[[i]]$report_date, 'Date')){
        not_dates <- c(not_dates, i)
    }
}
files[not_dates]

 [1] "data/zika-03022e42828e69ce19b448d40fa806545368b348/Colombia/Municipality_Zika/data/Municipality_Zika_2016-02-27.csv"                        
 [2] "data/zika-03022e42828e69ce19b448d40fa806545368b348/Colombia/Municipality_Zika/data/Municipality_Zika_2016-03-05.csv"                        
 [3] "data/zika-03022e42828e69ce19b448d40fa806545368b348/Colombia/Municipality_Zika/data/Municipality_Zika_2016-03-12.csv"                        
 [4] "data/zika-03022e42828e69ce19b448d40fa806545368b348/Colombia/Municipality_Zika/data/Municipality_Zika_2016-03-19.csv"                        
 [5] "data/zika-03022e42828e69ce19b448d40fa806545368b348/Colombia/Municipality_Zika/data/Municipality_Zika_2016-03-26.csv"                        
 [6] "data/zika-03022e42828e69ce19b448d40fa806545368b348/Colombia/Municipality_Zika/data/Municipality_Zika_2016-04-02.csv"                        
 [7] "data/zika-03022e42828e69ce19b448d40fa806545368b348/Colombia/Municipality_Zika/data/Municipality_Zika_2016-04-09.csv"                        
 [8] "data/zika-03022e42828e69ce19b448d40fa806545368b348/Colombia/Municipality_Zika/data/Municipality_Zika_2016-04-16.csv"                        
 [9] "data/zika-03022e42828e69ce19b448d40fa806545368b348/Dominican_Republic/Epidemiological_Bulletin/data/Epidemiological_Bulletin-2016-03-26.csv"
[10] "data/zika-03022e42828e69ce19b448d40fa806545368b348/Dominican_Republic/Epidemiological_Bulletin/data/Epidemiological_Bulletin-2016-04-02.csv"
[11] "data/zika-03022e42828e69ce19b448d40fa806545368b348/Dominican_Republic/Epidemiological_Bulletin/data/Epidemiological_Bulletin-2016-04-09.csv"
[12] "data/zika-03022e42828e69ce19b448d40fa806545368b348/Dominican_Republic/Epidemiological_Bulletin/data/Epidemiological_Bulletin-2016-04-16.csv"
[13] "data/zika-03022e42828e69ce19b448d40fa806545368b348/Dominican_Republic/Epidemiological_Bulletin/data/Epidemiological_Bulletin-2016-04-23.csv"
[14] "data/zika-03022e42828e69ce19b448d40fa806545368b348/United_States/CDC_Report/data/CDC_Report-2016-04-06.csv"
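
Until the files are standardized, both formats can be parsed on read. The following is a rough sketch in base R; `parse_report_date` is a hypothetical helper, and it assumes the slash format is month/day/year, as 2/27/2016 suggests.

```r
# Parse dates written either as 2016-02-20 or as 2/27/2016.
parse_report_date <- function(x) {
  x <- as.character(x)
  iso <- as.Date(x, format = "%Y-%m-%d")   # NA where the ISO format does not match
  us <- as.Date(x, format = "%m/%d/%Y")    # NA where the slash format does not match
  iso[is.na(iso)] <- us[is.na(iso)]        # fill ISO failures from the slash parse
  iso
}

parse_report_date(c("2016-02-20", "2/27/2016"))
```

Values that match neither format stay NA, so they can still be found and fixed by hand.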

error in field name for 2017 Week 29 in Colombia

For Colombia/Epidemiolgical_Bulletin_2017-Week-29.csv, there is "ï.." before report_date in the name of that field for some reason. It needs to be removed.

I was going to fix it, but I don't have a fork set up and am otherwise not intending to contribute, so I am posting this as an issue instead.
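
A prefix like "ï.." on the first column name is what R typically shows when a CSV file starts with a UTF-8 byte order mark, although that cause is an assumption here and has not been checked against the file. If so, the mark can be dropped while reading. `read_zika_csv` is a hypothetical wrapper, and the temporary file below only simulates such a header.

```r
# Read a CSV, removing a leading UTF-8 byte order mark if one is present.
read_zika_csv <- function(path) {
  read.csv(path, fileEncoding = "UTF-8-BOM", stringsAsFactors = FALSE)
}

# Simulate a file that starts with the three BOM bytes EF BB BF.
tmp <- tempfile(fileext = ".csv")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)),
           charToRaw("report_date,value\n2017-07-22,1\n")),
         tmp)

names(read_zika_csv(tmp))
```

Re-saving the file without the mark would fix it at the source for every reader.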

It looks like some Mexico files are corrupted

Apparently all of the files changed within the last 11 days.
For example:
https://github.com/cdcepi/zika/blob/master/Mexico/DGE_Zika/data/DGE_Zika-2016-11-26.csv
https://github.com/cdcepi/zika/blob/master/Mexico/DGE_Zika/data/DGE_Zika-2016-12-24.csv

Rise then dip in Texas totals from March 9-23

I noticed in the dataset the CDC reported Texas had 19 travel-associated cases on March 9, then 34 on March 16, then 23 on March 23. I thought this was an anomaly because how can the number of new cases drop?

I spoke with a spokesman, and he told me Texas had double counted what they sent to the CDC. Would it be possible to get the correct number for Texas in the March 16 update? Thanks!

NA rows in data

After stacking the data, there are a bunch of rows that are all NA. This probably comes from opening a file in a spreadsheet program.

> combined_df[is.na(combined_df$report_date), ]
# A tibble: 7 x 9
  report_date location location_type data_field data_field_code time_period time_period_type value  unit
       <date>    <chr>         <chr>      <chr>           <chr>       <chr>            <chr> <chr> <chr>
1        <NA>                                                          <NA>             <NA>  <NA>      
2        <NA>                                                          <NA>             <NA>  <NA>      
3        <NA>                                                          <NA>             <NA>  <NA>      
4        <NA>                                                          <NA>             <NA>  <NA>      
5        <NA>                                                          <NA>             <NA>  <NA>      
6        <NA>                                                          <NA>             <NA>  <NA>      
7        <NA>                                                          <NA>             <NA>  <NA>
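
Rows like these can be dropped after stacking. In this sketch a cell counts as blank if it is NA or an empty string, since the output above shows both; `drop_blank_rows` is a hypothetical helper.

```r
# Remove rows in which every cell is NA or an empty (or whitespace-only) string.
drop_blank_rows <- function(df) {
  blank <- do.call(cbind, lapply(df, function(col) {
    is.na(col) | !nzchar(trimws(as.character(col)))
  }))
  keep <- rowSums(!blank) > 0
  df[keep, , drop = FALSE]
}

# Small made-up example: the second row is entirely blank.
example_df <- data.frame(
  report_date = as.Date(c("2016-03-05", NA, "2016-03-12")),
  location = c("El_Salvador", "", "El_Salvador"),
  value = c("3", NA, "5"),
  stringsAsFactors = FALSE
)
drop_blank_rows(example_df)
```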

breakdown of sa and jb cases

Hi All - I noticed on http://www.cdc.gov/zika/geo/united-states.html that we are now breaking down the number of pregnant, sexually acquired, and Guillain-Barré syndrome cases for states and territories. Do we want to capture this information? If so, where?

Are the scripts used to parse the PDF files available?

I'm curious about how the CSV files were generated and was hoping to look at the scripts that were used for parsing the PDF documents found in the repository. Are these available somewhere?

Need data validation script

#31 (spaces) #32 (NA) and #35 (missing location) address some problems with the raw data.

Should have a separate script to validate all data so users can check for 'bad' data
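
A starting point for such a script might look like the sketch below. It checks only the problems mentioned in the issues listed here (column names, spaces in locations, N/A strings, non-ISO dates). `validate_table` and its messages are hypothetical, and the expected column list is assumed from the combined data shown in another issue.

```r
expected_cols <- c("report_date", "location", "location_type", "data_field",
                   "data_field_code", "time_period", "time_period_type",
                   "value", "unit")

# Return a character vector describing problems found in one table
# (empty if none of the checks fail).
validate_table <- function(df, expected = expected_cols) {
  problems <- character(0)
  if (!identical(names(df), expected)) {
    problems <- c(problems, "unexpected column names")
  }
  if ("location" %in% names(df) &&
      any(grepl(" ", df$location, fixed = TRUE))) {
    problems <- c(problems, "spaces in location")
  }
  if ("value" %in% names(df) && any(df$value %in% "N/A")) {
    problems <- c(problems, "N/A used instead of NA")
  }
  if ("report_date" %in% names(df)) {
    dates <- as.character(df$report_date)
    if (any(!is.na(dates) & !grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", dates))) {
      problems <- c(problems, "report_date not in YYYY-MM-DD format")
    }
  }
  problems
}
```

Running it over every table, for example with `lapply(tables, validate_table)`, would give a per-file report that users can scan before analysis.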

Inconsistent location naming for El_Salvador

In the Epidemiological_Bulletin-2016-03-12.csv and Epidemiological_Bulletin-2016-03-05.csv datasets for El Salvador, the locations that have location_type == 'country' have a space in "El Salvador".

Per the data guidelines, they should be "El_Salvador".
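
As a stopgap on the reading side, spaces can be replaced after import. This is a one-line sketch; `normalize_location` is a hypothetical name.

```r
# Replace spaces in location names with underscores.
normalize_location <- function(location) {
  gsub(" ", "_", location, fixed = TRUE)
}

normalize_location(c("El Salvador", "El_Salvador"))
```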

Front-end data dashboard

I've started working on a dashboard for the data.
It can currently be viewed here: https://chendaniely.shinyapps.io/zika_cdc_dashboard/

It is still very much in the alpha stage. Eventually it would be nice to have a link here to the dashboard.

Any explanation for negative counts?

Great job pulling this data together, and many thanks for making it public!

From what I gather, most of these counts are cumulative, and if you difference them, you get a surprising number of negative values. Anyone know why? It looks like someone is aware of this issue, and censoring them to 0 --

zika/code/Plot_cases_by_country/Plot_all.R

Lines 83 to 87 in 03365a5

    
    if (any(confirmed_codes != "")){
      confirmed[2:length(confirmed)][confirmed[2:length(confirmed)] < 0] <- 0 }

    if (any(suspected_codes != "")){
      suspected[2:length(suspected)][suspected[2:length(suspected)] < 0] <- 0 }

BTW, I'm developing an R package that provides some more tools for exploring this data -- https://github.com/cpsievert/zikar
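
As an alternative to censoring at 0, the decreases can be flagged so that they stay visible. The sketch below uses the three Texas totals quoted in another issue on this page (19, 34, 23) as example input; the variable names are illustrative and this is not the approach taken in Plot_all.R.

```r
# Cumulative totals from three consecutive reports.
cumulative <- c(19, 34, 23)

# Week-to-week differences; a negative value means the total went down.
incident <- diff(cumulative)

# Positions in `cumulative` where the total is lower than the previous report.
decreases <- which(incident < 0) + 1

# Mark the affected differences as missing instead of forcing them to 0.
incident_flagged <- replace(incident, incident < 0, NA)
```

Keeping the drop as NA makes it clear that the count was revised, whereas a 0 would read as a week with no new cases.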

NA values in brazil Location

Brazil data for

have NA values under the location but have a value for value, how are those rows interpreted?

US data for the beginning of August?

I've seen I can get the 2016-08-31 data here http://www.cdc.gov/zika/geo/united-states.html
but I would like to study the evolution during August too and the repo only has the reports till July included. Is there a history in CDC where I can download the other weeks of August? or any other way I can get that?

Thank you very much for making this public!

incomplete entry for Colombia pre-2017

The only data that have been entered for Colombia before 2017 pertained specifically to pregnant women. It would be really helpful to have confirmed and suspected case numbers for the population as a whole. By department would be a good place to start if the more numerous municipalities are slowing things down.

samples_received
samples_testable
samples_tested
samples_in_progress

