rfordatascience / tidytuesday
Official repo for the #tidytuesday project
License: Creative Commons Zero v1.0 Universal
From data.world and originally posed as a Tableau Makeover Monday: the cost of beer at MLB stadiums. The original visualization is a good exercise in recreating a comparative plot. This would also be a good use case to explore creating a Shiny app.
Article: https://data.world/makeovermonday/2018w43-what-will-a-beer-cost-you-at-every-major-league-ba
Data - requires login via the link above. You can create an account and log in with GitHub credentials.
https://www.figure-eight.com/data-for-everyone/
Thanks for setting up Tidy Tuesday in the first place!!
2018-12-11 README Data Dictionary reads:
"
inspection_type | This field represents the date of inspection; NOTE: Inspection dates of 1/1/1900 mean an establishment has not yet had an inspection | Date & Time |
---|
"
But inspection_type is not a Date & Time field. You probably meant to put "inspection_date" here; inspection_type already appears at the end of the table. That is:
"
inspection_date | This field represents the date of inspection; NOTE: Inspection dates of 1/1/1900 mean an establishment has not yet had an inspection | Date & Time |
---|
"
Hi @jthomasmock it was great meeting you today. As I mentioned, a couple of cool data resources include the European Social Survey which has an API package available in CRAN (essurvey). Another one is weather data from Environment Canada, which is available via weathercan.
Another one I didn't mention, but I think you may find a pretty fun challenge is bikesharing data, which is made easy to get using the bikedata package.
NOlympics LA conducted a large survey (>1000 respondents) on public opinion regarding the Olympics. This provides real survey responses for TidyTuesday participants to analyze and visualize.
Github link with more info, context and basic analyses.
https://github.com/NOlympicsLA/Olympics-Public-Survey
This isn't linked to any existing article, but there's some interesting open data from OpenLitterMap - a crowdsourced map that captured different types of litter data.
Ireland is currently on the leaderboard with 18k verified pieces of litter.
https://openlittermap.com/en/maps
You need to log in to download, but I've spoken to the dev/owner, who says the data is freely available for anyone to use. It can be accessed via direct download (4.5 MB file): https://openlittermap.com/maps/Ireland/download
Just wanted to let you know that the URL used for the obs_gender variable is incorrect. I believe it should point to the jobs_gender.csv file instead of the gender_earnings.csv file.
I am a glider pilot, and when gliders go above 14,000 feet, the pilot is required to have a supplemental oxygen source.
The magazine of the Soaring Society of America (SSA), Soaring, recently published an article about the lack of oxygen and/or carbon dioxide during flight, and the table caught my eye.
I received permission from the author and editor to post the article and "crowdsource" different means of presenting the data, which could include alternative tabular representations or other more visual means.
Here is the article in full. Hypoxia Article proof.pdf
I've transcribed the Table 1 data into a comma-separated text file, since the table is an image in the article. table1.txt
The author, the editor, and I are very interested in the products of everyone's imaginations! SSA is a non-profit, the author was not paid for his work, and the table originated from Guyton & Hall: Textbook of Medical Physiology, 12th ed. Attribution is all that is requested.
Johnson, D. (2018, August). Hypoxia, Hyperventilation, and Supplemental Oxygen Systems. Soaring, 19-27.
Hall, J. E., & Guyton, A. C. (2011). Guyton and Hall Textbook of Medical Physiology, 12e. Philadelphia: Elsevier Saunders.
Hello! I've started a long-term project on global news analysis. What if we could collect the most popular news from different countries and continents? What if we did it every day for… let's say, a month or a year?
https://medium.com/@storozhenko.dmitry/whats-happened-in-a-world-last-month-world-news-analysis-b7e540d45d64
If somebody would like to join, please let me know:)
The Caselaw Access Project (CAP) has digitized over 40 million pages of US court decisions. What better way to promote civic engagement than through caselaw?!?
Bulk downloads (currently only IL and AR) are available without login. Other states require a login and a user agreement.
The API allows up to 500 cases per user per day.
Service: https://case.law/
Bulk: https://case.law/bulk/
API: https://case.law/api/
Announcement: https://lil.law.harvard.edu/blog/2018/10/29/caselaw-access-project-cap-launches-api-and-bulk-data-service/
https://www.chronicle.com/interactives/student-diversity-2018
"race, ethnicity, and gender of students at 4,342 colleges and universities in the fall of 2016"
I'm pretty sure the data comes from the data table 12-Month Enrollment (12-month unduplicated headcount: 2016-17) found at https://nces.ed.gov/ipeds/datacenter/DataFiles.aspx
*choose 2017 (this is where the 2016 school year data lives), then all surveys
To get the institution names, merge with the Institutional Characteristics (Directory information) table.
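The merge described above could look roughly like the sketch below. It assumes both IPEDS files share the UNITID key and that the directory file stores the institution name in INSTNM; the rows and the total_enrolled column are made up for illustration and are not IPEDS values.

```r
library(dplyr)

# Toy rows standing in for the two downloaded IPEDS files (values are made up)
enrollment <- tibble::tibble(
  UNITID = c(100001L, 100002L, 100003L),
  total_enrolled = c(1200L, 860L, 4300L)   # hypothetical column name
)
directory <- tibble::tibble(
  UNITID = c(100001L, 100002L),
  INSTNM = c("Example State University", "Sample Community College")
)

# Keep every enrollment row and attach the institution name where one exists
enrollment_named <- enrollment %>%
  left_join(directory, by = "UNITID")

# Rows without a directory match end up with NA for the name
unmatched <- enrollment_named %>%
  filter(is.na(INSTNM))
```

A left join keeps institutions that are missing from the directory file, so they can be inspected instead of silently dropped.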
I manually created a spreadsheet which will allow you to merge the Google Review score (as of 2019-02-25) by name of the departure or arrival station. Just wanted to offer to share it in case others would find it useful! I'm new to GitHub, so should you want the file, please let me know the best way to get it to you.
note: I did this by hand and can't speak French very well, so take the scores with a grain of salt
NPR story on who benefits most from FEMA relief. Data are here. Part of their investigation, "How Federal Disaster Money Favors The Rich"
https://github.com/TheEconomist/big-mac-data
Plus more data from The Economist to come: https://medium.economist.com/peeling-back-the-curtain-487bd3be0c47
Data can be found here and is part of a package. I discovered the data here - https://twitter.com/gelliottmorris/status/1089612612474732544.
| 17 | 2018-07-24 | Dallas Animal Shelter FY2017 | Dallas OpenData | Dallas OpenData FY2017 Summary|
Just an idea:
Statistical comparison of climate model data versus paleoclimatological history data, both available through:
https://www.ncdc.noaa.gov/data-access
Brian
Hi there,
The WHO releases tuberculosis (TB) data, which I have wrapped in an R package. It might be a good fit for tidytuesday, as the data itself is pretty interesting (i.e., it covers a full range of countries with a good level of detail on the epidemiology of TB) and there are quite a few angles that can be explored.
The tooling I have built up in the package is meant to facilitate first passes at visualisations and could potentially help newer R users get going with their own more novel visualisations. A general intro to the package is here (also links out to several case studies and gists containing additional visualisations).
No worries if not of interest!
I publish a dataset of all matches played on the AVP and FIVB professional beach volleyball circuits:
https://github.com/BigTimeStats/beach-volleyball
There is some intro code to access and start querying the data. Would be happy to write additional posts to generate some visuals.
Some ideas to work more generally with the data:
Let me know your thoughts.
The Organisation for Economic Co-operation and Development (OECD) has an enormous amount of data on a variety of topics. You could probably do an entire year of tidytuesdays on their data alone.
The only issue is that the data comes in fairly clean, so less of a learning opportunity on the tidying side.
The Baltimore Sun Data Desk has been making their data and analyses public on GitHub:
https://github.com/baltimore-sun-data
Topics include Maryland-specific voter registration, condition of bridges, shootings, and public salaries.
From a Swedish page (https://oppnadata.se/datamangder/#esc_term=nobel), the dataset is in English with these variables: year, category, overallMotivation, id, firstname, surname, motivation, share
The readme states
Include a copy of the code used to create your visualization when you post to Twitter.
Is there a preferred method? For example, tweet the graphic with the hash tag, then reply to your own tweet with a link? Or as a Twitter essay? Or as an attachment? If attachment, then as UTF-8 text file?
Very interesting dataset for the R community: a survey conducted prior to rstudio::conf 2019 about R users and how they are learning R.
https://about.twitter.com/en_us/values/elections-integrity.html#data
"Twitter is making publicly available archives of Tweets and media that we believe resulted from potentially state-backed information operations on our service."
I ran across this dataset on Medium. It is a sample dataset of grocery orders from Instacart, the Uber Eats of grocery delivery. Might be good for both data visualization as well as ML techniques.
Link to article: https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2
Link to dataset: https://www.instacart.com/datasets/grocery-shopping-2017
As seen on Hacker News:
https://www.datafix.com.au/BASHing/2019-03-31.html
Clicking on the link to 2018 data shows a 404 error
Kaggle is having a competition to find a story in their Machine Learning and Data Science Survey results.
https://www.kaggle.com/kaggle/kaggle-survey-2018/home
Challenge ends December 3rd, but there are weekly prizes as well up until that point.
I know it is short notice, but tomorrow is the Transgender Day of Remembrance. Forwards, Rainbow R and Cardiff RUG held a datathon at the weekend to work on a dataset of reports of killings and suicides of transgender people, who will be memorialized on TDoR. More about the datathon can be found here: https://github.com/rlgbtq/TDoR2018. The data is now available as an R package: https://github.com/CaRdiffR/tdor. Although a difficult subject, it would be great if R folk could explore this data for Tidy Tuesday and raise awareness of TDoR.
https://www.synapse.org/#!Synapse:syn16788291/wiki/583310
Malaria data challenge opens Nov 12th
A few months ago, Harper and Palayew[1] published a study looking at whether a signal could be detected in fatal car crashes in the United States on the "4/20" holiday, following a previous study by Staples and Redelmeier[2] that suggested a strong link. Using more robust methods and a more comprehensive time window, Harper and Palayew could not find a signal for 4/20, but could for other holidays, such as July 4.
This is a great example of how charts can mislead based on choices in analysis and plotting.
Some of Harper and Palayew's analysis was done in R, but more was done in Stata and Stan. Their manuscript and their original data/code is at https://osf.io/qnrg6/. I built a script to download their original raw data and tidy it up into datasets they used in their paper as a possible #tidytuesday activity using R. Other dataset creation from the raw data and additional tidying possibilities exist, of course.
The entire script, which includes a couple of starter plots, is at https://github.com/Rmadillo/Harper_and_Palayew/blob/master/Load_Data_and_Clean.R, but you can download and tidy the data from this:
#### Load packages -------------------------------------------------------------
library(haven)
library(tidyverse)
library(lubridate)
#### Acquire raw data ----------------------------------------------------------
# Crash data (from Harper and Palayew)
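# Create the target folder first; download.file() fails if it does not exist
dir.create("~/Downloads/farsp", recursive = TRUE, showWarnings = FALSE)
# mode = "wb" keeps the zip file intact on Windows
download.file("https://osf.io/kj7ub/download", "~/Downloads/farsp/farsp.zip", mode = "wb")
unzip("~/Downloads/farsp/farsp.zip", exdir = "~/Downloads/farsp")
# pattern is a regular expression, not a glob
dta_files = list.files(path = "~/Downloads/farsp", pattern = "\\.dta$", full.names = TRUE)
dta_files = setNames(dta_files, dta_files)
fars = map_df(dta_files, read_dta, .id = "id")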
# Geographic lookup
geog = read_csv("https://www2.census.gov/geo/docs/reference/codes/files/national_county.txt",
                col_names = c("state_name", "state_code", "county_code",
                              "county_name", "FIPS_class_code")) %>%
  # Numeric versions of the codes, named to match the FARS columns
  mutate(state = as.numeric(state_code),
         county = as.numeric(county_code),
         FIPS = paste0(state_code, county_code))
#### Data wrangling ------------------------------------------------------------
# Used https://osf.io/drbge/ Stata code as a guide for cleaning
# All data
# This might take awhile... go get a coffee
all_accidents = fars %>%
  # What are state and county codes/look ups?
  select(id, state, county, month, day, hour, minute, st_case, per_no, veh_no,
         per_typ, age, sex, inj_sev, death_da, death_mo, death_yr,
         death_hr, mod_year, death_mn, death_tm, lag_hrs, lag_mins) %>%
  # CAPS used to avoid conflict with lubridate
  rename(MONTH = month, DAY = day, HOUR = hour, MINUTE = minute) %>%
  mutate_at(vars(MONTH, DAY, HOUR, MINUTE), na_if, 99) %>%
  mutate(crashtime = HOUR * 100 + MINUTE,
         YEAR = as.numeric(gsub("\\D", "", id)) - 10000,
         DATE = as.Date(paste(YEAR, MONTH, DAY, sep = "-")),
         TIME = paste(HOUR, MINUTE, sep = ":"),
         TIMESTAMP = as.POSIXct(paste(DATE, TIME), format = "%Y-%m-%d %H:%M"),
         e420 = case_when(
           MONTH == 4 & DAY == 20 & crashtime >= 1620 & crashtime <= 2359 ~ 1,
           TRUE ~ 0),
         e420_control = case_when(
           MONTH == 4 & (DAY == 20 | DAY == 27) & crashtime >= 1620 & crashtime < 2359 ~ 1,
           TRUE ~ 0),
         d420 = case_when(
           crashtime >= 1620 & crashtime <= 2359 ~ 1,
           TRUE ~ 0),
         sex = factor(case_when(
           sex == 2 ~ "F",
           sex == 1 ~ "M",
           sex >= 8 ~ NA_character_,
           TRUE ~ NA_character_)),
         Period = factor(case_when(
           YEAR < 2004 ~ "Remote (1992-2003)",
           YEAR >= 2004 ~ "Recent (2004-2016)",
           TRUE ~ NA_character_)),
         age_group = factor(case_when(
           age <= 20 ~ "<20y",
           age <= 30 ~ "21-30y",
           age <= 40 ~ "31-40y",
           age <= 50 ~ "41-50y",
           age <= 97 ~ "51-97y",
           age == 98 | age == 99 | age == 998 ~ NA_character_,
           is.na(age) ~ NA_character_,
           TRUE ~ NA_character_))
  ) %>%
  filter(per_typ == 1,
         !is.na(MONTH),
         !is.na(DAY))
# Daily+Time Group
# This should match 420-data.dta observations at https://osf.io/ejz28/
# Verify: dta_orig = read_dta("https://osf.io/ejz28/download")
# arsenal::compare(daily_accidents_time_groups, dta_orig)
daily_accidents_time_groups = all_accidents %>%
  group_by(DATE, d420) %>%
  summarize(fatalities_count = n())
# Daily+Time Group final working data
# Only use data starting in 1992
daily_accidents_time_groups = all_accidents %>%
  filter(YEAR > 1991) %>%
  group_by(DATE, d420) %>%
  summarize(fatalities_count = n())
# Daily final working data
daily_accidents = all_accidents %>%
  filter(YEAR > 1991) %>%
  group_by(DATE) %>%
  summarize(fatalities_count = n())
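The 4:20 pm to 11:59 pm window used in the script can be checked on a few hand-made rows before running it on the full download. This is only a sketch: the rows below are invented, not FARS records, and the in_window helper column is a hypothetical name that does not appear in the script.

```r
library(dplyr)

# Hand-made rows (not FARS data): one crash per row
toy <- tibble::tibble(
  MONTH  = c(4, 4, 4, 7),
  DAY    = c(20, 20, 27, 4),
  HOUR   = c(16, 16, 23, 18),
  MINUTE = c(19, 20, 59, 0)
)

flagged <- toy %>%
  mutate(
    # Encode the time as HHMM, e.g. 16:20 becomes 1620
    crashtime = HOUR * 100 + MINUTE,
    # TRUE from 16:20 through 23:59 inclusive
    in_window = crashtime >= 1620 & crashtime <= 2359,
    # 1 only for crashes on April 20 inside the window
    e420 = as.integer(MONTH == 4 & DAY == 20 & in_window)
  )
```

Only the second row (April 20 at 16:20) is flagged; 16:19 falls one minute before the window, and the other two rows are inside the window but on different days.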
For this dataset, it's especially important to remember the following caveats from the main #tidytuesday page:
"We will have many sources of data and want to emphasize that no causation is implied. There are various moderating variables that affect all data, many of which might not have been captured in these datasets. As such, our guidelines are to use the data provided to practice your data tidying and plotting techniques. Participants are invited to consider for themselves what nuancing factors might underlie these relationships.
The intent of Tidy Tuesday is to provide a safe and supportive forum for individuals to practice their wrangling and data visualization skills independent of drawing conclusions. While we understand that the two are related, the focus of this practice is purely on building skills with real-world data."
[1]. Harper S, Palayew A. The annual cannabis holiday and fatal traffic crashes. BMJ Injury Prevention. Published Online First: 29 January 2019. doi: 10.1136/injuryprev-2018-043068. Manuscript and original data/code at https://osf.io/qnrg6/
[2]. Staples JA, Redelmeier DA. The April 20 cannabis celebration and fatal traffic crashes in the United States. JAMA Intern Med. 2018 Feb;178(4):569–72.
Someone did a web scrape of 1.4 million Medium articles between 8/2017-8/2018, including:
Data:
https://www.kaggle.com/harrisonjansma/medium-stories
Article - removed as the link is no longer active.
https://towardsdatascience.com/i-just-published-a-massive-dataset-of-medium-stories-heres-the-link-to-get-it-889bab324138
US GDP data by county from the Bureau of Economic Analysis.
Multiple data files to explore with varying need for structuring. Data comes in .xls format.
Site & Data: https://www.bea.gov/news/2018/prototype-gross-domestic-product-county-2012-2015
I found the data on high speed trains on the French train company (SNCF) open data site
They have several datasets:
The data on the stations (including the zip code and the GPS coordinates) are in another file: https://frama.link/Qh7-jkJ7 . I am afraid I have not been able to join both data frames; my fuzzy joining and general data wrangling skills are not good enough yet, but maybe some people will be up to the challenge.
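One possible first step for the join, before reaching for fuzzy matching, is to normalise the station names in both files and then do an exact join. The sketch below uses invented rows and hypothetical column names (departure_station, station_name, station_id), so it shows the approach, not the actual SNCF file layout.

```r
library(dplyr)

# Strip accents, upper-case, and collapse punctuation to single spaces
normalise_station <- function(x) {
  x <- stringi::stri_trans_general(x, "Latin-ASCII")
  x <- toupper(x)
  x <- gsub("[^A-Z0-9]+", " ", x)
  trimws(x)
}

# Invented rows; column names are assumptions, not the SNCF schema
trains <- tibble::tibble(
  departure_station = c("PARIS MONTPARNASSE", "NIMES", "ST MALO")
)
stations <- tibble::tibble(
  station_name = c("Paris-Montparnasse", "Nîmes", "Saint-Malo"),
  station_id   = c("A", "B", "C")
)

joined <- trains %>%
  mutate(key = normalise_station(departure_station)) %>%
  left_join(
    stations %>% mutate(key = normalise_station(station_name)),
    by = "key"
  )

# Names that still fail to match (abbreviations etc.) need manual or fuzzy handling
unmatched <- joined %>%
  filter(is.na(station_id))
```

Listing the unmatched names after the exact join keeps the fuzzy or manual step small, since only abbreviations such as "ST" versus "Saint" remain.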
blog - https://schneeworld.netlify.com/2019/01/24/legislative-endurance/
data - https://github.com/unitedstates/congress-legislators
YAML data, which may prove didactic.
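As a rough idea of what working with nested YAML in R involves, the sketch below parses a small inline sample and flattens it to one row per term. The sample records are invented and only imitate a nested id/name/terms layout; the real repository files may differ in fields and structure, and would be read from disk with yaml::read_yaml() instead.

```r
library(yaml)
library(dplyr)

# Invented sample imitating a nested legislator record (not real data)
sample_yaml <- "
- id:
    bioguide: X000001
  name:
    first: Jane
    last: Doe
  terms:
    - type: rep
      start: '2001-01-03'
      end: '2003-01-03'
    - type: sen
      start: '2003-01-03'
      end: '2009-01-03'
- id:
    bioguide: X000002
  name:
    first: John
    last: Roe
  terms:
    - type: rep
      start: '2011-01-05'
      end: '2013-01-03'
"

# yaml.load() returns a nested list: one element per person
legislators <- yaml.load(sample_yaml)

# Flatten to one row per person-term
terms <- bind_rows(lapply(legislators, function(person) {
  bind_rows(lapply(person$terms, function(term) {
    tibble::tibble(
      bioguide  = person$id$bioguide,
      last_name = person$name$last,
      type      = term$type,
      start     = as.Date(term$start),
      end       = as.Date(term$end)
    )
  }))
}))
```

Once the terms are in a flat table, the usual dplyr verbs apply, for example counting terms per person with count(terms, bioguide).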