sr-tidytuesday

Table of Contents

  1. Introduction
    1. Tools and Notes
  2. Project 2: TV's Golden Age is real.
    1. Dataset: link
    2. Questions
      1. How many unique genres?
      2. How many unique titles per genre?
      3. How many shows were released every year, in each genre?
      4. Find the average rating of each title across seasons
      5. Average rating variation across genres. Which genres seem more popular?
      6. Average number of seasons and the connection with the average rating
      7. Top 10 shows by number of seasons and average rating, and their connection with genre
    3. Code notebook
    4. Todo Alternate and cleaner method of generating flags
  3. Project 1: Federal R&D spending by agency.
    1. Dataset: link
    2. Notes
    3. Viz posted on Twitter
    4. Code

Introduction

This repo contains my code explorations of datasets available via the #TidyTuesday project. The idea is to evolve towards a document based on literate programming principles, with the code side by side with the exploration and analysis.

The latest project is on top.

Tools and Notes

  • I use ESS (and Emacs) for all my workflows. This repo is a single Org document, with each exploration under its own heading and code blocks enclosed within Org-babel headers (a sketch of the block format follows this list).
  • Completed code snippets are tangled into scripts in the 00_scripts folder.
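
For reference, a minimal sketch of one such block (the :tangle target and header arguments here are only illustrative):

#+BEGIN_SRC R :session :tangle 00_scripts/example.R :results output
  tv_ratings_raw %>% glimpse()
#+END_SRC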

Project 2: TV's Golden Age is real.

Dataset: link

Questions

DONE How many unique genres?

How many unique titles per genre?

How many shows were released every year, in each genre?

Find the average rating of each title across seasons

Average rating variation across genres. Which genres seem more popular?

Average number of seasons and the connection with the average rating

Top 10 shows by number of seasons and average rating, and their connection with genre

Code notebook

Loading Libraries and reading in data

                                        # Loading libraries
library("easypackages")
libraries("tidyverse", "tidyquant")
                                        # Reading in the data directly from github
tv_ratings_raw  <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-01-08/IMDb_Economist_tv_ratings.csv")

Let's get a quick overview of the data available.

tv_ratings_raw %>% skimr::skim()
tv_ratings_raw %>% glimpse()

Observations:

  1. We have 7 variables, and all of them appear to be useful.
  2. There are no missing values.
  3. Each serial has a unique titleId. This can be useful to separate out serials from seasons (a quick check follows below).
  4. What does the share column imply?
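
As a quick check on observation 3 (a minimal sketch):

## Each (titleId, seasonNumber) pair should appear exactly once
tv_ratings_raw %>%
  count(titleId, seasonNumber) %>%
  filter(n > 1)

## Distinct serials vs total rows (rows are serial-seasons)
tv_ratings_raw %>%
  summarise(n_serials = n_distinct(titleId), n_rows = n())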

To get deeper insight into these serials, the genres have to be separated out. This will allow analysis of the serials by genre.

Viewing the data shows that individual genres are separated by commas. Though I initially started with 6 splits, the maximum number of genres per title turns out to be 3, and several titles have fewer than 3.
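
The maximum can be confirmed directly by counting commas (a quick sketch against the raw data loaded above):

## Each additional genre adds one comma, so genres per title = commas + 1
tv_ratings_raw %>%
  mutate(n_genres = str_count(genres, ",") + 1) %>%
  count(n_genres)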

So how many unique genres do we have?

                                        # Note: the unique genres can also be obtained by starting with map(unique), and with some further processing.

## test <- tv_ratings_raw %>%
##   separate(genres,
##            into = c("genre1_", "genre2_", "genre3_"),
##            sep = ',' ,
##            remove = FALSE
##            ) %>%
##   select(ends_with("_")) %>%
##   map(unique)

all_genres <- tv_ratings_raw %>%
  separate(genres,
           into = c("genre1_", "genre2_", "genre3_"),
           sep = ',',
           remove = FALSE
           ) %>%
  select(ends_with("_")) %>%
  gather(
    key = genres_col,
    value = genre_all,
    na.rm = TRUE
  ) %>%
  select(genre_all) %>%
  unique() %>%
  ## A one-row tibble whose column names are the unique genres
  spread(
    key = genre_all,
    value = genre_all
  )

There are 22 unique genre categories. To enable further analysis and make the data more suitable for machine learning, the genres can be added to the dataset as separate columns, with a logical value denoting whether the serial falls into that category or not. In addition, it would be nice to standardise the formatting of the genre names, especially ones like Sci-Fi and Reality-TV.

With 22 genre columns to consider, a tidyeval function might help.

                                        # Prefix the genre columns with g_ and append them to the raw data.
                                        # NB: bind_rows also adds one helper row holding only the genre names;
                                        # its flags are zeroed out by the flag function below.
names(all_genres) <- str_c("g_", names(all_genres))
tv_ratings_conditioned <- bind_rows(tv_ratings_raw, all_genres)

all_genres %>% glimpse()

genre_flag_fn <- function(data, search_col = genres, search_pattern, flag_col )
{
  ## Capture the bare column names supplied by the caller (tidyeval)
  flag_col <- enquo(flag_col)
  flag_col_name <- quo_name(flag_col)
  search_col <- enquo(search_col)
  ## Set the flag column to 1 wherever the pattern occurs in the search column
  data %>%
    mutate(!!flag_col_name := case_when(
               str_detect(!!search_col, search_pattern)  ~ 1,
               TRUE ~ 0
             )
           )
}

tv_ratings_genre_sep_tbl <- tv_ratings_conditioned %>%
  genre_flag_fn(search_pattern = "Action"      , flag_col = g_Action      ) %>%
  genre_flag_fn(search_pattern = "Adventure"   , flag_col = g_Adventure   ) %>%
  genre_flag_fn(search_pattern = "Animation"   , flag_col = g_Animation   ) %>%
  genre_flag_fn(search_pattern = "Biography"   , flag_col = g_Biography   ) %>%
  genre_flag_fn(search_pattern = "Comedy"      , flag_col = g_Comedy      ) %>%
  genre_flag_fn(search_pattern = "Crime"       , flag_col = g_Crime       ) %>%
  genre_flag_fn(search_pattern = "Documentary" , flag_col = g_Documentary ) %>%
  genre_flag_fn(search_pattern = "Drama"       , flag_col = g_Drama       ) %>%
  genre_flag_fn(search_pattern = "Family"      , flag_col = g_Family      ) %>%
  genre_flag_fn(search_pattern = "Fantasy"     , flag_col = g_Fantasy     ) %>%
  genre_flag_fn(search_pattern = "History"     , flag_col = g_History     ) %>%
  genre_flag_fn(search_pattern = "Horror"      , flag_col = g_Horror      ) %>%
  genre_flag_fn(search_pattern = "Music"       , flag_col = g_Music       ) %>%
  genre_flag_fn(search_pattern = "Musical"     , flag_col = g_Musical     ) %>%
  genre_flag_fn(search_pattern = "Mystery"     , flag_col = g_Mystery     ) %>%
  genre_flag_fn(search_pattern = "Reality-TV"  , flag_col = `g_Reality-TV`) %>%
  genre_flag_fn(search_pattern = "Romance"     , flag_col = g_Romance     ) %>%
  genre_flag_fn(search_pattern = "Sci-Fi"      , flag_col = `g_Sci-Fi`    ) %>%
  genre_flag_fn(search_pattern = "Sport"       , flag_col = g_Sport       ) %>%
  genre_flag_fn(search_pattern = "Thriller"    , flag_col = g_Thriller    ) %>%
  genre_flag_fn(search_pattern = "War"         , flag_col = g_War         ) %>%
  genre_flag_fn(search_pattern = "Western"     , flag_col = g_Western     )

Considering the serials released each year, what was the distribution of genres among those releases? A release here could also be a new season of an existing serial, since a serial will generally run at least a season or two unless it is terribly unpopular. Either way, what was the overall mood and distribution per genre?

tv_genre_y <- tv_ratings_genre_sep_tbl %>%
  mutate(dt_year = year(date)) %>%
  group_by(dt_year) %>%
  ## Count each serial only once per year - and do this before
  ## titleId is dropped from the table
  filter(!duplicated(titleId)) %>%
  select(-c(seasonNumber,
            titleId,
            title,
            av_rating,
            share,
            genres,
            date)) %>%
  summarise_if(.predicate = is.numeric, .funs = sum) %>%
  ungroup() %>%
  arrange(desc(dt_year))
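
The per-year counts can then be reshaped and plotted along these lines (a minimal sketch, assuming the columns produced above):

tv_genre_y %>%
  ## Drop the helper row with no date (an artifact of the bind_rows step)
  filter(!is.na(dt_year)) %>%
  gather(key = genre, value = n_releases, starts_with("g_")) %>%
  ggplot(aes(x = dt_year, y = n_releases, color = genre)) +
  geom_line() +
  labs(x = "Year", y = "Serials released", color = "Genre")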

TODO Alternate and cleaner method of generating flags

                                        # Recompute the unique genres as above: a one-row tibble whose column names are the genre names
all_genres <- tv_ratings_raw %>%
  separate(genres,
           into = c("genre1_", "genre2_", "genre3_"),
           sep = ',',
           remove = FALSE
           ) %>%
  select(ends_with("_")) %>%
  gather(
    key = genres_col,
    value = genre_all,
    na.rm = TRUE
  ) %>%
  select(genre_all) %>%
  unique() %>%
  spread(
    key = genre_all,
    value = genre_all
  )




                                        # The genre names themselves double as the search patterns
search_pattern <- names(all_genres)


genre_flag_multi_fn <- function(data, search_col = genres, search_patterns)
{
  search_col <- enquo(search_col)
  ## Extract the column to search through, once
  search_vec <- data %>% pull(!!search_col)
  ## Build one 0/1 flag column per pattern, named g_<pattern>
  flag_cols <- map_dfc(
    set_names(search_patterns, str_c("g_", search_patterns)),
    ~ if_else(str_detect(search_vec, .x), 1, 0)
  )
  ## Append all the flag columns in a single step
  bind_cols(data, flag_cols)
}

tv_ratings_raw %>%
  genre_flag_multi_fn(search_patterns = search_pattern) %>%
  glimpse()

Project 1: Federal R&D spending by agency.

Dataset: link

Notes

The current code explores only the Climate Spending portion of the dataset.

Code

                                        # Loading libraries
library("easypackages")
libraries("tidyverse", "tidyquant")

                                        # Reading in data directly from github
climate_spend_raw  <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-02-12/climate_spending.csv", col_types = "cin")


                                        # This initial conditioning need not have involved the date manipulation, as the year extracted from a date object is still a double.
climate_spend_conditioned <- climate_spend_raw %>%
  mutate(year_dt = str_glue("{year}-01-01")) %>%
  mutate(year_dt = as.Date(year_dt)) %>%
  mutate(gcc_spending_txt = scales::dollar(gcc_spending,
                                           scale = 1e-09,
                                           suffix = "B"
                                           )
         )

climate_spend_dept_y <- climate_spend_conditioned %>%
  group_by(department, year_dt = year(year_dt)) %>%
  summarise(
    tot_spend_dept_y = sum(gcc_spending)) %>%
  mutate(tot_spend_dept_y_txt = tot_spend_dept_y %>%
           scales::dollar(scale = 1e-09,
                          suffix = "B")
         ) %>%
  ungroup()

glimpse(climate_spend_dept_y)

climate_spend_plt_fn <- function(
                               data,
                               y_range_low = 2000,
                               y_range_hi  = 2010,
                               ncol = 3,
                               caption = ""
                               )
{
  data %>%
    filter(year_dt >= y_range_low & year_dt <= y_range_hi) %>%
    ## Plot the numeric totals: the pre-formatted text column would be
    ## treated as a discrete axis
    ggplot(aes(y = tot_spend_dept_y, x = department, fill = department)) +
    geom_col() +
    facet_wrap(~ year_dt,
               ncol = ncol,
               scales = "free_y") +
    scale_y_continuous(labels = scales::dollar_format(scale = 1e-09,
                                                      suffix = "B")) +
    theme_tq() +
    scale_fill_tq(theme = "dark") +
    theme(
      axis.text.x = element_text(angle = 45,
                                 hjust = 1.2),
      legend.position = "none",
      plot.background = element_rect(fill = "#f7f7f7")
    ) +
    labs(
      title = str_glue("Federal R&D budget towards Climate Change: {y_range_low}-{y_range_hi}"),
      x = "Department",
      y = "Total Budget $ Billion",
      subtitle = "NASA literally dwarfs all the other departments, getting to spend upwards of 1.1 Billion dollars every year since 2000.",
      caption = caption
    )
}

climate_spend_plt_fn(climate_spend_dept_y,
                     y_range_low = 2000,
                     y_range_hi = 2017,
                     caption = "#TidyTuesday:\nDataset 2019-02-12\nShreyas Ragavan"
                       )


## The remaining code is partially complete and is in place for further exploration planned in the future.

## Code to download all the data.
## fed_rd <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-02-12/fed_r_d_spending.csv")
## energy_spend <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-02-12/energy_spending.csv")
## climate_spend <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-02-12/climate_spending.csv")

## climate_spend_pct_all <- climate_spend_conditioned %>%
##   group_by(year_dt = year(year_dt)) %>%
##   summarise(
##     tot_spend_all_y = sum(gcc_spending)
##   ) %>%
##   mutate(tot_spend_all_y_txt = tot_spend_all_y %>%
##            scales::dollar(scale = 1e-09,
##                           suffix = "B"
##                           )
##          )%>%
##   ungroup() %>%
##   mutate(tot_spend_all_lag = lag(tot_spend_all_y, 1)) %>%
##   tidyr::fill(tot_spend_all_lag ,.direction = "up") %>%
##   mutate(tot_spend_all_pct = (tot_spend_all_y - tot_spend_all_lag)/ tot_spend_all_y,
##          tot_spend_all_pct_txt = scales::percent(tot_spend_all_pct, accuracy = 1e-02)
##          )
