
tidytuesday's People

Contributors

abichat, dgrtwo, dicook, erictleung, fgazzelloni, francisbarton, gkaramanis, havishak, iamericfletcher, jacquietran, jmcastagnetto, jonthegeek, jthomasmock, jtipton25, kierisi, kjewell, ky-james, mine-cetinkaya-rundel, nrennie, philip-khor, pursuitofdatascience, pythoncoderunicorn, sharlagelfand, sofigs-gt, statsrhian, tanho63, thebioengineer, tracykteal, z3tt, zdelrosario

tidytuesday's Issues

Repository with many data sets

https://www.figure-eight.com/data-for-everyone/

  • Image descriptions
  • Judge emotions about nuclear energy from Twitter
  • Decide whether two English sentences are related
  • Similarity judgment of word combinations
  • Sentiment Analysis – Global Warming/Climate Change
  • Judge Emotion About Brands & Products
  • Colors in 9 Languages
  • Claritin Twitter
  • Sentence plausibility
  • Academy Awards demographics
  • Agreement between long and short sentences
  • Company categorizations (with URLs)

Thanks for setting up Tidy Tuesday in the first place!!

2018-12-11#data-dictionary: inspection_type is not a Date & Time field. You probably meant to put "inspection_date" here.

2018-12-11 README Data Dictionary reads:
"

inspection_type This field represents the date of inspection; NOTE: Inspection dates of 1/1/1900 mean an establishment has not yet had an inspection Date & Time

"
But inspection_type is not a Date & Time field. You probably meant to put "inspection_date" here, since inspection_type already appears at the end of the table, i.e.

"

inspection_date This field represents the date of inspection; NOTE: Inspection dates of 1/1/1900 mean an establishment has not yet had an inspection Date & Time

"

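When reading this column, the 1/1/1900 placeholder can be converted to a missing value so that it does not distort date summaries. A minimal sketch in base R, assuming the dates arrive as month/day/year strings; the sample values are made up and the real column format may differ:

```r
# Hypothetical sample values; the real column format may differ.
inspection_date <- c("1/1/1900", "12/3/2018", "7/15/2017")

# Parse the strings as dates (assumed month/day/year order).
parsed <- as.Date(inspection_date, format = "%m/%d/%Y")

# Treat the 1/1/1900 placeholder as "not yet inspected" (NA).
parsed[parsed == as.Date("1900-01-01")] <- NA

parsed
```
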
more CA fire damage data available

I've got more historical data for fires that occurred in CAL FIRE's jurisdiction, including NUMBER OF FIRES, ACRES BURNED and DOLLAR DAMAGE from 1933 to 2016.
Repo here
Data file available here

Ireland litter data

This isn't linked to any existing article, but there's some interesting open data from OpenLitterMap - a crowdsourced map that captured different types of litter data.

Ireland is currently on the leaderboard with 18k verified pieces of litter.
https://openlittermap.com/en/maps

You need to log in to be able to download, but I've spoken to the dev/owner, who says the data is freely available for anyone to use. It can be accessed via direct download (4.5MB file): https://openlittermap.com/maps/Ireland/download

Wrong URL for obs_gender

Just wanted to let you know that the URL used for the obs_gender variable is incorrect. I believe it should point to the jobs_gender.csv file instead of the gender_earnings.csv file.

Hypoxia data set

I am a glider pilot, and when gliders go above 14,000 feet, the pilot is required to have a supplemental oxygen source.

The magazine of the Soaring Society of America (SSA), Soaring, recently published an article about the lack of oxygen and/or carbon dioxide during flight, and the table caught my eye.

I received permission from the author and editor to post the article and "crowdsource" different means of presenting the data, which could include alternative tabular representations or other more visual means.

Here is the article in full. Hypoxia Article proof.pdf

I've transcribed the Table 1 data into a comma-separated text file, since the table is an image in the article. table1.txt

The author, the editor, and I are very interested in the products of everyone's imaginations! SSA is a non-profit, the author was not paid for his work, and the table originated from Guyton & Hall: Textbook of Medical Physiology, 12th ed. Attribution is all that is requested.

Johnson, D. (2018, August). Hypoxia, Hyperventilation, and Supplemental Oxygen Systems. Soaring, 19-27.

Hall, J. E., & Guyton, A. C. (2011). Guyton and Hall Textbook of Medical Physiology, 12e. Philadelphia: Elsevier Saunders.

Caselaw Access Project - Bulk Download or API

The Caselaw Access Project (CAP) has digitized over 40 million pages of US court decisions. What better way to promote civic engagement than through caselaw?!?

Use the bulk download (currently only IL and AR) without login. Other states with login and user agreement.
Use API for up to 500 cases per user per day.

Service: https://case.law/
Bulk: https://case.law/bulk/
API: https://case.law/api/
Announcement: https://lil.law.harvard.edu/blog/2018/10/29/caselaw-access-project-cap-launches-api-and-bulk-data-service/

Student Diversity

https://www.chronicle.com/interactives/student-diversity-2018
"race, ethnicity, and gender of students at 4,342 colleges and universities in the fall of 2016"

I'm pretty sure the data comes from the data table 12-Month Enrollment (12-month unduplicated headcount: 2016-17) found at https://nces.ed.gov/ipeds/datacenter/DataFiles.aspx
*choose 2017 (this is where the 2016 school year data lives), then all surveys

To get the institution names, merge with the Institutional Characteristics (Directory information) table.
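
The merge could look something like the sketch below. The two data frames are toy stand-ins for the enrollment and directory tables, and the column names (unit_id, total_enrolled, institution_name) and values are assumptions for illustration, not the actual IPEDS field names:

```r
# Toy stand-ins for the two tables; column names and values are assumptions.
enrollment <- data.frame(
  unit_id = c(1, 2, 3),
  total_enrolled = c(1200, 560, 9800)
)
directory <- data.frame(
  unit_id = c(1, 2, 4),
  institution_name = c("College A", "College B", "College D")
)

# Left join: keep every enrollment row, attach the name where one exists.
enrollment_named <- merge(enrollment, directory, by = "unit_id", all.x = TRUE)
enrollment_named
```

Using all.x = TRUE keeps institutions that have no directory match, so missing names show up as NA instead of silently dropping rows.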

Google Reviews for Train Stations 2019-02-26

I manually created a spreadsheet which will allow you to merge the Google Review score (as of 2019-02-25) by name of the departure or arrival station. Just wanted to offer to share it in case others would find it useful! I'm new to GitHub, so should you want the file, please let me know the best way to get it to you.

note: I did this by hand and can't speak French very well, so take the scores with a grain of salt

Anime Dataset?

Hi, I find this anime list dataset fascinating, and I wonder if you could do a webcast on it. The link to the dataset is here: .

It shows the ranking of anime based on different criteria.

WHO Tuberculosis Data

Hi there,

The WHO releases Tuberculosis (TB) data, which I have wrapped in an R package. It might be a good fit for tidytuesday, as the data itself is pretty interesting (i.e., it covers a full range of countries with a good level of detail on the epidemiology of TB) and there are quite a few angles that can be explored.

The tooling I have built up in the package is meant to facilitate first passes at visualisations and could potentially help newer R users get going with their own more novel visualisations. A general intro to the package is here (also links out to several case studies and gists containing additional visualisations).

No worries if not of interest!

Beach Volleyball Match Data

I publish a dataset of all matches played on the AVP and FIVB professional beach volleyball circuits:

https://github.com/BigTimeStats/beach-volleyball

There is some intro code to access and start querying the data. Would be happy to write additional posts to generate some visuals.

Some ideas to work more generally with the data:

  • City/Country could be used to generate geographic maps and/or travel patterns from tournament to tournament or season to season
  • Contains both the men's and women's tours along with demographic information
  • Contains match stats data like aces, errors, kills, etc. that can be analyzed further

Let me know your thoughts.

OECD Data

The Organisation for Economic Co-operation and Development (OECD) has an enormous amount of data on a variety of topics. You could probably do an entire year of tidytuesdays on their data alone.

https://data.oecd.org/

The only issue is that the data comes in fairly clean, so less of a learning opportunity on the tidying side.

Clarify Code posting on Twitter

The readme states

Include a copy of the code used to create your visualization when you post to Twitter.

Is there a preferred method? For example, tweet the graphic with the hash tag, then reply to your own tweet with a link? Or as a Twitter essay? Or as an attachment? If attachment, then as UTF-8 text file?

November 20th Transgender Day of Remembrance data

I know it is short notice, but tomorrow is the Transgender Day of Remembrance. Forwards, Rainbow R and Cardiff RUG held a datathon at the weekend to work on a dataset of reports of killings and suicides of transgender people, who will be memorialized on TDoR. More about the datathon can be found here: https://github.com/rlgbtq/TDoR2018. The data is now available as an R package: https://github.com/CaRdiffR/tdor. Although a difficult subject, it would be great if R folk could explore this data for Tidy Tuesday and raise awareness of TDoR.

Proposal for week of 4/20: is the cannabis "holiday" related to car crashes?

A few months ago, Harper and Palayew[1] published a study examining whether a signal could be detected in fatal car crashes in the United States around the "4/20" holiday, following a previous study by Staples and Redelmeier[2] that suggested a strong link. Using more robust methods and a more comprehensive time window, Harper and Palayew could not find a signal for 4/20, but could for other holidays, such as July 4.

This is a great example of how charts can mislead based on choices in analysis and plotting.

Some of Harper and Palayew's analysis was done in R, but more was done in Stata and Stan. Their manuscript and their original data/code are at https://osf.io/qnrg6/. I built a script to download their original raw data and tidy it into the datasets they used in their paper, as a possible #tidytuesday activity using R. Other dataset creation from the raw data and additional tidying possibilities exist, of course.

The entire script, which includes a couple of starter plots, is at https://github.com/Rmadillo/Harper_and_Palayew/blob/master/Load_Data_and_Clean.R, but you can download and tidy the data from this:

#### Load packages -------------------------------------------------------------

library(haven)
library(tidyverse)
library(lubridate)

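#### Acquire raw data ----------------------------------------------------------

# Crash data (from Harper and Palayew)
# Create the target folder first; download.file() fails if it does not exist
dir.create("~/Downloads/farsp", recursive = TRUE, showWarnings = FALSE)
# mode = "wb" keeps the zip file intact on Windows
download.file("https://osf.io/kj7ub/download", "~/Downloads/farsp/farsp.zip", mode = "wb")
unzip("~/Downloads/farsp/farsp.zip", exdir = "~/Downloads/farsp")

# pattern is a regular expression, not a glob, so match the extension explicitly
dta_files = list.files(path = "~/Downloads/farsp", pattern = "\\.dta$", full.names = TRUE)
# Name each path by itself so map_df() can record the source file in "id"
dta_files = setNames(dta_files, dta_files)

fars = map_df(dta_files, read_dta, .id = "id") 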

# Geographic lookup
geog = read_csv("https://www2.census.gov/geo/docs/reference/codes/files/national_county.txt",
                col_names = c("state_name", "state_code", "county_code", 
                              "county_name", "FIPS_class_code")) %>%
    # Numeric state and county codes, named to match the crash data columns
    mutate(state = as.numeric(state_code),
           county = as.numeric(county_code),
           FIPS = paste0(state_code, county_code))

#### Data wrangling ------------------------------------------------------------
# Used https://osf.io/drbge/ Stata code as a guide for cleaning

# All data
# This might take awhile... go get a coffee
all_accidents = fars %>%
    # What are state and county codes/look ups?
    select(id, state, county, month, day, hour, minute, st_case, per_no, veh_no,
           per_typ, age, sex, inj_sev, death_da, death_mo, death_yr, 
           death_hr, mod_year, death_mn, death_tm, lag_hrs, lag_mins) %>%
    # CAPS used to avoid conflict with lubridate
    rename(MONTH = month, DAY = day, HOUR = hour, MINUTE = minute) %>%
    mutate_at(vars(MONTH, DAY, HOUR, MINUTE), na_if, 99) %>%
    mutate(crashtime = HOUR * 100 + MINUTE,
           YEAR = as.numeric(gsub("\\D", "", id)) - 10000,
           DATE = as.Date(paste(YEAR, MONTH, DAY, sep = "-")),
           TIME = paste(HOUR, MINUTE, sep = ":"),
           TIMESTAMP = as.POSIXct(paste(DATE, TIME), format = "%Y-%m-%d %H:%M"), 
           e420 = case_when(
               MONTH == 4 & DAY == 20 & crashtime >= 1620 & crashtime <= 2359 ~ 1,
               TRUE ~ 0),
           e420_control = case_when(
               MONTH == 4 & (DAY == 20 | DAY == 27) & crashtime >= 1620 & crashtime < 2359 ~ 1,
               TRUE ~ 0),
           d420 = case_when(
               crashtime >= 1620 & crashtime <= 2359 ~ 1,
               TRUE ~ 0),
           sex = factor(case_when(
               sex == 2 ~ "F",
               sex == 1 ~ "M",
               sex >= 8 ~ NA_character_,
               TRUE ~ NA_character_)),
           Period = factor(case_when(
               YEAR < 2004  ~ "Remote (1992-2003)",
               YEAR >= 2004 ~ "Recent (2004-2016)",
               TRUE ~ NA_character_)),
           age_group = factor(case_when(
               age <= 20 ~ "<20y",
               age <= 30 ~ "21-30y",
               age <= 40 ~ "31-40y",
               age <= 50 ~ "41-50y",
               age <= 97 ~ "51-97y",
               age == 98 | age == 99 | age == 998 ~ NA_character_,
               is.na(age) ~ NA_character_,
               TRUE ~ NA_character_))
           ) %>%
    filter(per_typ == 1, 
           !is.na(MONTH),
           !is.na(DAY))

# Daily+Time Group
# This should match 420-data.dta observations at https://osf.io/ejz28/ 
# Verify: dta_orig = read_dta("https://osf.io/ejz28/download")
# arsenal::compare(daily_accidents_time_groups, dta_orig)
daily_accidents_time_groups = all_accidents %>%
    group_by(DATE, d420) %>%
    summarize(fatalities_count = n())

# Daily+Time Group final working data
# Only use data starting in 1992
daily_accidents_time_groups = all_accidents %>%
    filter(YEAR > 1991) %>%
    group_by(DATE, d420) %>%
    summarize(fatalities_count = n())

# Daily final working data
daily_accidents = all_accidents %>%
    filter(YEAR > 1991) %>%
    group_by(DATE) %>%
    summarize(fatalities_count = n())
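
To make the crashtime encoding in the script easier to follow, here is a small base R sketch of the same arithmetic on made-up hour and minute values. It is only an illustration of how the 16:20 to 23:59 window is flagged, not part of the original analysis:

```r
# Made-up hour/minute values for illustration only.
hour   <- c(16, 16, 23, 9)
minute <- c(19, 20, 59, 5)

# Same encoding as the script: 16:20 becomes 1620, 09:05 becomes 905.
crashtime <- hour * 100 + minute

# 1 when the crash falls between 16:20 and 23:59 inclusive, otherwise 0.
d420 <- ifelse(crashtime >= 1620 & crashtime <= 2359, 1, 0)

data.frame(hour, minute, crashtime, d420)
```

Because the hour is multiplied by 100, comparing crashtime against 1620 and 2359 behaves like comparing clock times, which is why the script can use plain numeric comparisons.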

For this dataset, it's especially important to remember the following caveats from the main #tidytuesday page:

"We will have many sources of data and want to emphasize that no causation is implied. There are various moderating variables that affect all data, many of which might not have been captured in these datasets. As such, our guidelines are to use the data provided to practice your data tidying and plotting techniques. Participants are invited to consider for themselves what nuancing factors might underlie these relationships.

The intent of Tidy Tuesday is to provide a safe and supportive forum for individuals to practice their wrangling and data visualization skills independent of drawing conclusions. While we understand that the two are related, the focus of this practice is purely on building skills with real-world data."


[1]. Harper S, Palayew A. The annual cannabis holiday and fatal traffic crashes. BMJ Injury Prevention. Published Online First: 29 January 2019. doi: 10.1136/injuryprev-2018-043068. Manuscript and original data/code at https://osf.io/qnrg6/

[2]. Staples JA, Redelmeier DA. The April 20 cannabis celebration and fatal traffic crashes in the United States. JAMA Intern Med. 2018 Feb;178(4):569–72.

Medium Article Scrape & Analysis

Someone did a web scrape of 1.4 million Medium articles between 8/2017-8/2018, including:

  • Title
  • Sub-title
  • Author
  • Publication Date
  • Tags
  • Read-Time
  • Claps-Received
  • Story URL
  • Author URL

Data:
https://www.kaggle.com/harrisonjansma/medium-stories

Article - removed as the link is no longer active.
https://towardsdatascience.com/i-just-published-a-massive-dataset-of-medium-stories-heres-the-link-to-get-it-889bab324138

Github:
https://github.com/harrisonjansma/Analyzing_Medium

Data on French High Speed Train Delays

I found data on high-speed trains on the open data site of the French train company (SNCF).
They have several datasets:

The data on the stations (including the zip code and the GPS coordinates) are in another file: https://frama.link/Qh7-jkJ7 . I am afraid I have not been able to join the two data frames; my fuzzy joining and general data wrangling skills are not good enough yet, but maybe some people will be up to the challenge.
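
One possible starting point for the join is to normalise the station names and then pick the closest match by edit distance. The sketch below uses only base R (adist() from the utils package); the station names are made up, the real files may spell them differently, and accented letters would need extra handling that is not shown here:

```r
# Made-up station names; the real files may spell them differently.
delay_stations <- c("PARIS LYON", "LILLE", "MARSEILLE ST CHARLES")
station_file   <- c("Paris-Lyon", "Lille", "Marseille St-Charles", "Bordeaux")

# Normalise: upper case, then replace runs of other characters with one space.
normalise <- function(x) trimws(gsub("[^A-Z0-9]+", " ", toupper(x)))

# Edit distance between every pair of normalised names
# (rows: delay stations, columns: station file).
distances <- adist(normalise(delay_stations), normalise(station_file))

# For each delay station, pick the closest name in the station file.
best <- apply(distances, 1, which.min)

matches <- data.frame(
  delay_station   = delay_stations,
  matched_station = station_file[best],
  distance        = distances[cbind(seq_along(best), best)]
)
matches
```

Keeping the distance column makes it easy to review doubtful matches by hand before trusting the join.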
