
TidyX

Hosts

Ellis Hughes and Patrick Ward.

Ellis has been working with R since 2015 and has a background working as a statistical programmer in support of both Statistical Genetics and HIV Vaccines, and currently works as a Data Science Lead. He also runs the Seattle UseR Group.

Patrick's current work centers on research and development in professional sport, with an emphasis on data analysis in American football. Previously, he was a sport scientist within the Nike Sports Research Lab. His research interests include training and competition analysis as they apply to athlete health, injury, and performance.

Description

The goal of TidyX is to explain how R code works. We focus on explaining topics we find interesting or submissions from our viewers. Historically, we explained how submissions to the #TidyTuesday Project worked to help promote the great work being done there.

In this repository, you will find copies of the code we've explained, and the code we wrote to show the concept on a new dataset.

To submit code for review, email us at [email protected]

To watch more episodes, go to our YouTube channel.

Patreon

If you appreciate what we are doing and would like to support TidyX, please consider signing up to be a patron through Patreon.

https://www.patreon.com/Tidy_Explained

TidyX Episodes

  • Episode 1: Introduction and Treemaps!

  • Episode 2: The Office, Sentiment, and Wine

  • Episode 3: TBI, Polar Plots and the NBA

  • Episode 4: A New Hope, {Patchwork} and Interactive Plots

  • Episode 5: Tour de France and {gganimate}

  • Episode 6: Lollipop Charts

  • Episode 7: GDPR Faceting

  • Episode 8: Broadway Line Tracing

  • Episode 9: Tables and Animal Crossing

  • Episode 10: Volcanoes and Plotly

    • Ellis and Patrick explore this week's #TidyTuesday dataset!
  • Episode 11: Time Series and Bayes

  • Episode 12: Cocktails with Thomas Mock

  • Episode 13: Marble Races and Bump Plots

  • Episode 14: African American Achievements

  • Episode 15: Juneteenth and Census Tables

    • Ellis and Patrick show US Census tables in a report, broken down into divisions, highlighting values using {colortable}
    • Source Code
  • Episode 16: Caribou Migrations and NBA Shots on Basket

  • Episode 17: Uncanny X-men and Feature Engineering

  • Episode 18: Coffee and Random Forest

  • Episode 19: Astronauts and Dashboards

  • Episode 20: Cocktails with David Robinson

  • Episode 21: The Birds

  • Episode 22: European Energy and Ball Hogs

  • Episode 23: Mailbag and Expected Wins

    • Ellis and Patrick go into our mailbag and focus on a request we recently had on loops and functions.
    • Source Code
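
The loops-and-functions idea from this episode can be sketched in base R. This is a hypothetical example, not the episode's code; the team names and numbers are made up:

```r
# Wrap repeated summary logic in a function, then apply it across groups
# with a loop.
summarize_team <- function(df, team) {
  team_df <- df[df$team == team, ]
  data.frame(team = team, mean_pts = mean(team_df$pts))
}

games <- data.frame(
  team = c("A", "A", "B", "B"),
  pts  = c(100, 110, 95, 105)
)

results <- list()
for (tm in unique(games$team)) {
  results[[tm]] <- summarize_team(games, tm)
}

do.call(rbind, results)  # one row per team
```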
  • Episode 24: Waffle plots and Shiny

  • Episode 25: Intro To Shiny

    • This is the start of a series of episodes covering more in-depth uses for {shiny}, an R package for creating web applications by Joe Cheng. In this episode we cover the basics of Shiny and explain the concept of reactive programming.
    • Source Code
  • Episode 26: Labels and ShinyCARMELO - Part 1

  • Episode 27: LIX and ShinyCARMELO - Part 2

  • Episode 28: Nearest Neighbors and ReactiveValues

    • This week Ellis and Patrick explore how to perform career analysis and projections using the KNN algorithm. Using those concepts, we jump into part three of our Shiny demo series, where we have Shiny execute a KNN for our input players. We show how to create an action button to execute our code, and use reactiveValues to store the results for plotting!
    • Source Code
  • Episode 29: Palettes and Random Effects

  • Episode 30: Tweet Sentiment

    • Patrick and Ellis were inspired by all the sentiment analysis performed for #TidyTuesday this week, so we decided to look at tweets to show and comment on additional things to be aware of when doing sentiment analysis. Using {rtweet}, we pull over 50,000 tweets that used #Debate2020, and discuss how context is incredibly important to analysis.
    • Source Code
  • Episode 31: Reactable

    • This week's #TidyTuesday dataset was on NCAA Women's Basketball Tournament appearances. Patrick and Ellis have shown in the past how tables can be used for data visualization, and wanted to learn about another table package. {reactable} is a really cool-looking package, so we spend some time showing how to use it, apply column definitions, and even embed html widgets within the table!
    • Source Code
  • Episode 32: Shiny with Eric Nantz

    • This week's #TidyTuesday dataset was a super fun one. Ellis and Patrick are joined by Eric Nantz, who created a Shiny app to explore and animate the data. We talk through several new Shiny concepts, like using {golem}, crosstalk, and other Shiny packages like {bs4Dash}!

    • UseR Highlighted: Eric Nantz

    • Source Code

  • Episode 33: Beer and State Maps

  • Episode 34: Wind and Maps

  • Episode 35: Rectangles

  • Episode 36: Animated Plotly

    • This week's #TidyTuesday dataset was on mobile and landline subscriptions across the world. We saw lots of animated plots this week and wanted to add our own. Using {plotly}, we make an interactive plot that animates across time to show how GDP is related to the raw subscription numbers. We also do some exploration with line plots.
    • Source Code
  • Episode 37: Code Review

    • Looking back at one's code can show you just how far you have come. Sparked by a conversation between Ben Baldwin (@benbaldwin), Patrick, and Ellis, this week's episode is on code review and refactoring. Ben went into his past and furnished a set of code for us to try to refactor. In the spirit of things, neither of us looked closely at the code ahead of time, and we recorded our initial reactions and the process of refactoring Ben's code into a function that could be applied to multiple datasets!
    • UseR Highlighted: Ben Baldwin
    • Original Tweet
    • Tweet Source Code
    • TidyX Source Code
  • Episode 38: Polar Plots

  • Episode 39: Imputing Missingness

    • This week we reach into our mailbag to answer a request from Eric Fletcher (@iamericfletcher) on imputing NAs. In this video we scrape 2013 draft data and use various techniques to impute missing times for the three-cone event. We also attempt to discuss Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR), but we decide at the end to leave it to the professionals.
    • Source Code
  • Episode 40: Inspiring Women and Plotly

  • Episode 41: Worm Charts with Alice Sweeting

    • Alice Sweeting (@alicesweeting) joins us as a guest explainer this week! We are very excited to have her on as she explains how she worked through creating a worm chart of a Super Netball game! She talks about common techniques she uses to process data, mixing base R with the tidyverse. Then we spend some time discussing Alice's background, current role, and advice for folks looking to get started in sports analytics or R programming in general.

    • UseR Highlighted: Alice Sweeting

    • Source Code

  • Episode 42: Highlighting Lines

  • Episode 43: Funnel Plots, Plotly, and Hockey

    • With no #TidyTuesday dataset this week, we decide to continue working through our learning of plotly, this time using a tool known as a funnel plot.
    • Source Code
  • Episode 44: Transit Costs, steps, and Plotly Maps

  • Episode 45: NHL Pythagorean Wins and Regression

    • This week we reflect on the past year and combine techniques from multiple episodes. We scrape multiple tables from the Hockey Reference website, use regular expressions to clean and organize the data, and use for loops to determine the optimal Pythagorean win exponent. We visualize the data using several different techniques, like scatter and lollipop charts. We show some fun tools for regularizing values for linear regressions, and how to predict and visualize the results.
    • Source Code
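
The for-loop grid search for a Pythagorean win exponent can be sketched like this. The team records below are made up for illustration, not the scraped Hockey Reference data:

```r
# Toy data standing in for scraped team stats
pf   <- c(250, 200, 180)        # goals for
pa   <- c(200, 210, 190)        # goals against
wpct <- c(0.70, 0.45, 0.40)     # observed win percentage

# Grid of candidate exponents k for the prediction pf^k / (pf^k + pa^k)
exps <- seq(1, 5, by = 0.01)
err  <- numeric(length(exps))

for (i in seq_along(exps)) {
  k      <- exps[i]
  pred   <- pf^k / (pf^k + pa^k)
  err[i] <- sum((pred - wpct)^2)  # squared error vs observed win pct
}

exps[which.min(err)]  # exponent with the smallest error on this toy data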
  • Episode 46: Circle Plots, NHL Salaries, and Logistic Regression

  • Episode 47: NHL Win Probabilities and GT Tables

    • This week we play with a new technique for optimization: the optim function! We scrape the 2019-2020 NHL season to generate power rankings for every NHL team and a home-ice edge. We can use this to predict each team's winning probability! We then combine that with season summary data to generate a pretty {gt} table!
    • Source Code
  • Episode 48: NBA Point Simulations

    • In this episode we show how to scrape the current NBA season's scores and build a simple game simulator. Using {purrr} with some base R functions, we generate outputs and show how to simulate thousands of games to produce outcome predictions.
    • Source Code
  • Episode 49: MLB Batting Simulations

    • We continue looking at simulations this week, but this time for individual players. Using {Lahman}, we pull the 2019 MLB player batting stats and visualize them using histograms and density plots. Next, to generate confidence intervals around players' batting averages, we use rbinom() combined with techniques from the {tidyverse} to make simulation easy. Finally, we visualize the data using {gt} combined with sparklines.
    • Source Code
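
The rbinom() idea can be sketched with made-up numbers. The episode uses real {Lahman} stats; the batting average and at-bat count here are assumptions:

```r
set.seed(42)                 # for reproducibility
true_avg <- 0.280            # assumed batting average
at_bats  <- 550              # assumed at-bats in a season

# Simulate 10,000 seasons of hits, then convert to batting averages
sim_hits <- rbinom(n = 10000, size = at_bats, prob = true_avg)
sim_avg  <- sim_hits / at_bats

quantile(sim_avg, c(0.025, 0.975))  # ~95% interval around .280
```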
  • Episode 50: MLB Batting Simulations

    • Another MLB batting episode. This time we use the James-Stein estimator (paper below) to apply a shrinkage estimate to player batting averages and get a "true" estimate, removing luck. Using {Lahman}, we pull the 2018 MLB player batting stats and explain how to implement the estimator. Next, we compare estimates against the 2019 season. Finally, we visualize the data using {gt}, with header spans and cell styling. For the grand finale, we combine this {gt} table of batting averages with plots using {patchwork}!
    • Source Code
  • Episode 51: Deploying Models with Shiny

    • Sharing the results of a modeling effort is an important skill for any data scientist. However, just sharing the weight of each predictor is often not good enough to get buy-in from stakeholders who are understandably skeptical of your results. Using the power of Shiny, you can show your stakeholders exactly how your model interprets and then predicts the results. In this episode, we use the {palmerpenguins} package with {randomForest} to build a model that predicts the species of a new penguin. With Shiny, we then deploy our model to allow users to record a new penguin's attributes and see whether the model thinks it is an Adelie, Chinstrap, or Gentoo! The output is a boxplot indicating the model's probability for each species given the inputs.
    • Source Code
  • Episode 52: Too Many Gentoo with Xaringan

    • There are too many Gentoo, your PI proclaims. In this week's episode Patrick and Ellis talk about how to use the {xaringan} package to produce reproducible HTML presentations using Rmarkdown syntax. We discuss how we looked at "raw" tech data and used summary statistics to compare against the gold-standard {palmerpenguins} package from Dr. Allison Horst and Dr. Alison Hill, with data from Dr. Kristen Gorman. We use last week's highly powerful machine learning model to generate predictions of species, and generate a confusion matrix of our data vs the predictions. Finally, we talk about the value of basing your presentation on Rmd and being able to update it at the click of a button.
    • Source Code
  • Episode 53: MLB Pitch Classification Introduction

    • This week we start a series on using machine learning to automate pitch classification. In this first episode, we discuss ways to start looking at your data and questions to formulate. We use hierarchical clustering in a few different ways to start to see relationships between the different pitch types and the statistics captured around each pitch!
    • Source Code
  • Episode 54: MLB Pitch Classification 2 - KNN, Caret and UMAP

    • In the second episode on using machine learning to automate pitch classification from PitchF/X data, we apply the k-nearest-neighbors algorithm as our first attempt at classification. We start by using the results from our naive hierarchical clustering to select 4 groups and apply the KNN algorithm. We then look at how we could evaluate the performance of the model, both with total misclassification and within-class misclassification. Then we use {caret} to optimize for the best clustering and compare the results. Finally, we use UMAP to perform dimensionality reduction, visualizing multiple dimensions as two and viewing relationships within the clusters.
    • Source Code
  • Episode 55: MLB Pitch Classification 3 - Decision Trees, Random Forests, optimization

    • For the third episode in the series on using machine learning to automate pitch classification from PitchF/X data, we talk about decision trees and their famous variant: random forests. We start by discussing what a decision tree is and its value. We visualize the results and discuss the quality of the fit. Then we expand on decision trees using the random forest algorithm and discuss its performance. Finally, we use {caret} and {doParallel} to do a grid search for the optimal mtry, using parallel processes to speed up the search!
    • Source Code
  • Episode 56: MLB Pitch Classification 4 - XGBoost

    • We now turn to the famous XGBoost algorithm in the fourth episode of the series on using machine learning to automate pitch classification from PitchF/X data. We start by training with default parameters and observe some tricks to make training faster. Then we use {caret} and {doParallel} to do a grid search for optimal training settings, and discuss the merits and disadvantages of using ever more complicated ML models.
    • Source Code
  • Episode 57: MLB Pitch Classification 5 - Naive Bayes Classification

    • We naively turn to Bayes... okay, I'm done. In this episode we use the Naive Bayes classifier from the {e1071} package to classify pitches from our PitchF/X data. We discuss briefly how this algorithm works and review its performance against the other tree-based algorithms we've used so far.
    • Source Code
  • Episode 58: MLB Pitch Classification 6 - TensorFlow

    • The next model type is one that has generated a lot of excitement over the last decade with the promise of "AI": deep learning. Using the {keras} package from RStudio, we attempt to train a model to automate pitch classification from PitchF/X data. We talk about the differences to consider when building a deep learning algorithm and the data prep that must be done. We finally review the results and talk a bit about black-box ML models.
    • Source Code
  • Episode 59: MLB Pitch Classification 7 - Class Imbalance and Model Evaluation Intro

    • Throughout this series, we've been attempting to predict pitch type using PitchF/X data. However, we have not directly addressed a major flaw in our data: class imbalance. The four-seam fastball makes up nearly 37% of our data! In this episode we apply a couple of techniques to help address the class imbalance, and look at ways to evaluate our models' performance. We talk about the pros and cons to consider, and set up for the last episode of the series.
    • Source Code
  • Episode 60: MLB Pitch Classification 8 - Model Evaluation and Visualization

    • This week we apply everything we have learned over the last several weeks to pick the best model for our project. As a reminder, we are attempting to predict pitch type using a subset of PitchF/X data. We productionalize our evaluations by writing a series of functions that allow quick iteration across multiple input types and capture of information. Finally, we visualize the evaluations using two {gt} tables. Thank you all so much for joining us for this mini-series on ML models and being with us as we hit episode 60. This has been a wonderful ride!
    • Source Code
  • Episode 61: Data Cleaning - Regular Expressions

    • Okay, we've gotta say it - there is nothing "regular" about regular expressions. BUT that does not mean they are not an incredibly valuable tool in your programming toolbox. In this episode we go through how to apply regular expressions to a dataset and talk through some of the common tokens you might use when applying a regular expression to your dataset.
    • Source Code
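
A few of the common tokens discussed can be sketched in base R; the strings below are made-up examples, not the episode's dataset:

```r
heights <- c("6-4", "6'2\"", "76 in")

grepl("^[0-9]", heights)     # ^ anchor + character class: starts with a digit
gsub("[^0-9]", "", heights)  # negated class: keep only the digits
sub("(\\d)\\D+(\\d+).*", "\\1 ft \\2 in", "6-4")  # capture groups + backreferences
```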
  • Episode 62: Data Cleaning - REGEX applied & stringr

    • This week we continue using regex, this time talking about applying it to generate data for plots. Additionally, we discuss techniques such as grouping, and use the {stringr} package for its str_* variants of the base R regex functions.
    • Source Code
  • Episode 63: Data Cleaning - REGEX lookarounds & Player Gantt Charts

    • We lookaround with regex this week, showing an alternative approach to setting anchors in your regular expressions using lookarounds. We apply this to extracting player substitutions. Then we calculate the number of stints and their durations to create a player Gantt chart for Game 2 of the Eastern Conference Playoffs between the Miami Heat and Milwaukee Bucks.
    • Source Code
  • Episode 64: Data Cleaning - Ugly Excel Files Part 1

    • Ugly data. Ugly EXCEL data. That's pretty common to come across as a data scientist. People unfamiliar with how to format data are often the ones creating the excel files you work with. This week, Patrick and Ellis talk through some techniques to handle these data and turn them into usable form. Patrick wrote up this week's example, parsing through the data to generate a nice data.frame from the ugly excel example.
    • Source Code
  • Episode 65: Data Cleaning - Ugly Excel Files Part 2

    • This week Ellis works through the ugly excel file, writing out the code live as he goes, and explaining how to break up the parsing into nice, bite-size pieces and generalize them. Patrick is there asking questions and clarifying how things worked. At the end of the cast they end up with similar data.frames, ready to munge for final processing.
    • Source Code
  • Episode 66: Data Cleaning - Ugly Excel Files Part 3

    • Now that we have the excel file into a nice format, we go over the final pieces of processing to turn the incorrectly formatted fields into usable data. We talk about generating date objects, ifelse vs if_else, and have some fun!
    • Source Code
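
The ifelse vs if_else distinction mentioned here comes down to attributes: base ifelse() drops the Date class. A small sketch with made-up dates:

```r
dates <- as.Date(c("2021-01-01", "2021-06-01"))
flag  <- c(TRUE, FALSE)

# Base ifelse() strips the Date class and returns the underlying numbers
ifelse(flag, dates, as.Date("2000-01-01"))

# One base R workaround: build the output vector and index into it
out <- rep(as.Date("2000-01-01"), length(flag))
out[flag] <- dates[flag]
out  # still a Date vector
```

dplyr's if_else() preserves the class (and type-checks both branches), which is why the episode compares the two.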
  • Episode 67: Data Cleaning - Viewer Submitted Excel File

    • For the first time in over a year and 65 episodes, Ellis and Patrick are in the same room! This week they work on a viewer-submitted excel file. After last week's episode, we put out a call to our viewers to submit the ugly data they see so we can try to help. GitHub user MikePrt submitted a file from the UK Government's statistics organisation, the Office for National Statistics (ONS), as an example. We extract the data and produce a simple plot.
    • Source Code
  • Episode 68: Data Cleaning - Ugly Excel Files Part 4 - Saving Outputs

    • We continue our series on data cleaning and discuss sharing your outputs. Patrick and Ellis go over a few output file formats and two different excel libraries that give you differing levels of control over the outputs.
    • Source Code
  • Episode 69: Modern Pentathlons with Mara Averick

    • Ellis and Patrick are joined today by Mara Averick, a Developer Advocate for RStudio. We conclude our series on messy excel data by talking through cleaning an excel file from the UIPM and reasoning out what the fields and scoring are. Then we talk about Mara's role, career history, and advice she has for our viewers.
    • UseR Highlighted: Mara Averick
    • Source Code
  • Episode 70: Databases with {dplyr}

    • Making friends with your friendly database administrator is a great way to improve your effectiveness as a data scientist in your organization. But what do you do if you don't know any SQL? We present {dbplyr} by the folks at RStudio. Easily connect, interact with and send queries to databases using familiar dplyr syntax and commands.
    • Source Code
  • Episode 71: Databases in R | Exploring Your Database with NBA data

    • Being handed a database without knowing its contents or where to start can be daunting. We talk about techniques you can use to start exploring it just like any other dataset: getting a list of the tables in the database and their column names, and writing SQL to get the head of a table.
    • Source Code
  • Episode 72: Databases in R | Shiny and Databases

    • The fastest way for a data scientist to multiply their impact is to enable their customers to do the analysis themselves (with guardrails, of course). Shiny provides a great user interface; combining this with some basic queries your clients may want improves response time and allows them to search to their heart's content. This week we show you a simple way to add interactivity with your database, using {shiny} to query teams' mean point differential at home across the 2001-2002 season.
    • Source Code
  • Episode 73: Databases in R | Shiny,Databases, and Reactive Polling

    • Now that we have a Shiny app that allows our users to access and interact with the data in our database, how do we make sure the user configuration shows the most up-to-date information for selection? This is done through reactive polling: a timed check for updates to the database that refreshes the UI selection interface accordingly. We discuss the benefits and how to use the reactivePoll function combined with an observeEvent function to really supercharge our Shiny app!
    • Source Code
  • Episode 74: Databases with R | Joins in SQL vs Local

    • Continuing the SQL/database saga, we look at joins. We scrape a bunch of play-by-play information and game info, and look at generating a database with this information. We then compare the speed of joining tables locally versus within the SQL database!
    • Source Code
  • Episode 75: Databases with R | Joins, databases, and commits in Shiny

    • Now that we have a database full of data, and a shiny app to play with it, how do we capture and share the information across our users using the database? In this episode we share how we might create a sample database filled with play-by-play NBA data and create a shiny app to allow a coach or SME to review and add comments to the data as they review it. Then, they can decide to commit and save their thoughts for the future!
    • Source Code
  • Episode 76: Databases with R | Polling databases in Shiny

    • In Episode 75 we introduced the idea of committing changes from a Shiny app to a database. But what about scenarios with multiple users? Ellis and Patrick explore an idea for polling the database and pushing updates that were committed to the database into the active views of the rest of the users. We use reactive polling, as introduced in Episode 73, along with updating reactiveValues.
    • Source Code
  • Episode 77: Tidymodels - LM

    • tidymodels is an ecosystem of packages developed by RStudio (Max Kuhn and Julia Silge, to name a few) to help folks apply good modeling practices from cleaned data through to a fully productionalized model. We are going to step through and learn how to apply tidymodels together. The first episode is on applying a simple linear model, compared with the base R method!
    • Source Code
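
For reference, the base R side of that comparison looks like this, using the built-in mtcars data rather than the episode's dataset (the tidymodels equivalent is sketched in a comment, assuming parsnip's standard interface):

```r
# Base R: fit a simple linear model of fuel economy on weight
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)

# The tidymodels version of the same fit would be roughly:
#   linear_reg() |> set_engine("lm") |> fit(mpg ~ wt, data = mtcars)
```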
  • Episode 78: Tidymodels - Splits and Recipes

    • tidymodels is an ecosystem of packages developed by RStudio (Max Kuhn and Julia Silge, to name a few) to help folks apply good modeling practices from cleaned data through to a fully productionalized model. In the second episode, we discuss how to set up your test/train splits as well as data preprocessing using the {recipes} package in conjunction with {workflows}! This smooths out and applies good practices simply and effectively to make data prep for modeling a breeze.
    • Source Code
  • Episode 79: Tidymodels - Cross-validation and Metrics

    • In the third episode on tidymodels, we continue our data prep and model training by exploring cross-validation and metrics. Ellis and Patrick show how to set up a 5-fold cross-validation set on your training split, as well as fitting a tidymodels workflow! We finally show how to display and extract model evaluation metrics.
    • Source Code
  • Episode 80: Tidymodels - Decision Trees and Tuning

    • In the fourth episode on tidymodels, we sort out how to do parameter tuning of a model using the {tune} package. We set up a grid to train across and select the best model based on model metrics. We then retrain this model on the full training set and evaluate its performance against the final test set.
    • Source Code
  • Episode 81: Tidymodels - Logistic Regression with GLM

    • This week we look at how to perform a logistic regression using the tidymodels framework. In the fifth episode on tidymodels, we show how to set up a logistic regression using GLM, perform a custom test/train split on the data, and calculate metrics such as ROC AUC, kappa, and accuracy. We visualize the performance and evaluate how well our model performed.
    • Source Code
  • Episode 82: Tidymodels - Logistic Regression with GLM

    • Continuing our look at classification models via tidymodels, this week we tackle a multiple classification problem using random forests. We show how to tune your model, extract the optimal workflow, and then train it against your full training set and compare its performance on the test set. We calculate performance metrics such as ROC AUC and visualize the results.
    • Source Code
  • Episode 83: Tidymodels - Naive Bayes of Penguins

    • Naive Bayes is the model we apply in this week's tidymodels series. We look at how to perform a multiple classification problem using Naive Bayes, applied via the {discrim} package from tidymodels, with the {klaR} package supplying the engine. We show how to evaluate your model using 5-fold cross-validation, then train it against your full training set and compare its performance on the test set. We calculate performance metrics such as ROC AUC and visualize the results.
    • Source Code
  • Episode 84: Tidymodels - Workflow Sets and model selection

    • Tidymodels makes it simple to try a multitude of model types by separating the preprocessing from the model type and creating a standardized way to apply different models. Workflow sets take this a step further, letting you train and compare these models at the same time, just like tuning. Using data from Kaggle, we look at how to perform model fitting for three model types and select the best workflow to train on our full train set and compare against our held-out test set. We calculate performance metrics such as RMSE and R-squared and visualize the results.
    • Source Code
  • Episode 85: Tidymodels - Tuning Workflow Sets

    • Tidymodels makes it simple to try a multitude of model types by separating the preprocessing from the model type and creating a standardized way to apply different models. In this episode we show how you can use workflow sets along with tuning to create optimal models. Using wine data from Kaggle, we look at two different recipes and three different models requiring different levels of tuning. We select the best workflow and optimal tuned parameters to train on our full train set and compare against our held-out test set. We calculate performance metrics such as RMSE and R-squared and visualize the results.
    • Source Code
  • Episode 86: Tidymodels - Julia Silge and Tune Racing

    • This week we are thrilled to have Dr. Julia Silge from RStudio join us to talk about tidymodels. Julia is one of the software engineers we have to thank for tidymodels and the ecosystem of packages that help us perform our data preprocessing and modeling steps with ease! In this episode we have a short interview with Julia, where she talks a bit about her background, her current role, and tidymodels. We then jump into explaining some code she wrote and shared in one of her own screencasts on training an XGBoost model to predict home runs. One unique part of it is that Julia applies tune racing, making the tuning run faster using some clever comparisons to make sure only the best models continue to get trained across all cross folds. Patrick and Ellis ask questions throughout on how the code works and Julia's philosophies.
    • Julia Silge's Blog Post on Racing Methods
  • Episode 87: Advent of Code Day 6 - Efficient Problem Solving

    • This week we take a look at a problem from the Advent of Code, specifically Day 6. Advent of Code is a fun time of year where the data science community comes together to solve a series of 25 problems posed by Eric Wastl. The goal is to see who can solve the problems quickly and efficiently. It also provides an opportunity to work on problems unlike most of what you see in your day-to-day job. We work on finding an efficient solution to Day 6, Lanternfish. The fish reproduce at a standard rate, but calculating how many exist after a certain number of days is trivial for a small number of days and quickly becomes too large for your computer if you approach the problem the wrong way!
    • Source Code
  • Episode 88: Advent of Code Day 7 - For Loops and Lookup Vectors

    • We work on finding an efficient solution to Advent of Code Day 7: Whales. We need to find the most efficient location to align a series of crab submarines in order to escape, under several different constraints. We discuss how to set up an efficient for loop and create a lookup vector!
    • Source Code
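
The lookup-vector idea can be sketched with the published Day 7 example input (the real puzzle input is much larger):

```r
positions <- c(16, 1, 2, 0, 4, 2, 7, 1, 2, 14)  # AoC 2021 Day 7 example

# Precompute the total fuel cost for every candidate alignment position,
# then look answers up by name instead of recomputing.
targets <- min(positions):max(positions)
fuel <- numeric(length(targets))
for (i in seq_along(targets)) {
  fuel[i] <- sum(abs(positions - targets[i]))  # part 1: constant cost per step
}
names(fuel) <- targets

fuel[which.min(fuel)]  # cheapest position (2) costs 37 on the example
```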
  • Episode 89: Tables for Research

    • We reach into our mailbag this week to answer a question from one of our viewers. In one of our episodes we talked about how you could extract coefficients from your fit models using the {broom} package. However, how would one turn that into a publication ready table? In this episode we use {gt} by Rich Iannone to convert our coefficients data.frame into a nice, publication-ready table!
    • Source Code
  • Episode 90: Rmarkdown Guide - RMD Formatting

    • Rmarkdown is an incredible tool that is widely used by R analysts to combine prose and code into a beautiful symphony of reproducible outputs and information sharing. However, some of the setup can be confusing for a newcomer. We are starting a series to share some of the knowledge that helps users get going on their Rmarkdown journey. This week we start on the bones and structure of Rmarkdown documents: we talk about markdown syntax, setting up your text to format as expected, and adding some code chunks!
    • Source Code
  • Episode 91: Rmarkdown Guide - Code Chunk Options & Figure Options

    • Rmarkdown is an incredible tool that is widely used by R analysts to combine prose and code into a beautiful symphony of reproducible outputs and information sharing. However, some of the setup can be confusing for a newcomer. This week continues where we left off, talking through common chunk options that modify how your code and its outputs appear in the resulting output, and whether it even gets run at all. Then we cover common chunk options that modify figure outputs, which are incredibly useful! Finally, we start an Rmarkdown report to demonstrate how we would use these options in a real report.
    • Source Code
  • Episode 92: Rmarkdown Guide - Formatting Tabs for HTML outputs

    • This week's episode features a trick for making tabsets in your HTML outputs in Rmarkdown, as well as some advice on how to start organizing your code within an Rmarkdown document. Using the palmerpenguins dataset, we show how to make your code chunks super easy to update, and things to think about when making your output.
    • Source Code
  • Episode 93: Rmarkdown Guide - YAML Header

    • The YAML header controls the macro-level behaviors of your Rmarkdown document: the output type, the title, author, date, custom styling, table of contents, and more. In this episode we cover the basic YAML header contents and how to add this customization to your Rmarkdown documents. We also show two example outputs, for HTML and Word.
    • Source Code
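As a sketch of the kind of header discussed in the episode, a hypothetical YAML block for an HTML report with a table of contents (plus a Word fallback) might look like the following; the title, author, and styling values here are illustrative, not from the episode:

```yaml
---
title: "Weekly Report"
author: "TidyX"
date: "2023-01-01"
output:
  html_document:
    toc: true
    toc_float: true
  word_document: default
---
```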
  • Episode 94: Rmarkdown Guide - Parameterized Reports

    • Parameterized reports allow data scientists to multiply their impact by reducing the amount of work needed to produce new reports. Using the YAML header, a data scientist can set parameters that change based on user inputs to create customized reports at the click of a button. In this episode we go over the basics of adding a parameter, how to set its value either interactively or programmatically, and how to use the parameter in your code. Then we create a custom example that pulls NBA basketball data for multiple years and displays a team of interest.
    • Source Code
  • Episode 95: Rmarkdown Guide - Interactive Reports with htmlwidgets

    • So far in our series on rmarkdown, we have covered ways to generate reports, sometimes dynamically running them with parameters. This week we cover how you can generate html reports with embedded interactivity from htmlwidgets. These widgets allow the users to inspect and explore the data embedded in the report. This sort of technique is used a lot and there are a number of html widgets in the R ecosystem. In this episode we demonstrate how to explore baseball data using interactive plots from plotly and datatables from the DT package.
    • Source Code
  • Episode 96: Rmarkdown Guide - ASIS Outputs

    • This week we discuss a fun Rmarkdown chunk option: results. This little argument can have a big impact on the look and output of our Rmarkdown reports, and gives the developer a lot of power to change the behavior and content of the report based on the results of the code. It can also make what could be a tedious task in Rmarkdown super fast!
    • Source Code
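As a minimal sketch of the idea (the team names below are made up), code placed in a chunk declared with the option results = "asis" can cat() raw markdown that the rendered report then treats as real headings and paragraphs:

```r
# In an Rmd chunk declared with the option results = "asis",
# cat() output is injected as raw markdown instead of console text.
teams <- c("SEA", "LAD", "NYY")
for (tm in teams) {
  cat("## Team:", tm, "\n\n")
  cat("A short summary paragraph about", tm, "would go here.\n\n")
}
```

This is what makes the "tedious task" fast: one loop can emit a whole section per team instead of hand-writing each heading.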
  • Episode 97: Sampling, Simulation, and Intro to Bayes - Base R Distributions

    • A powerful tool in the R toolbox is the set of distribution functions included in base R. These functions allow data scientists to explore a variety of potential distributions, simulate data, and explore possibilities. This week we go over the meaning of the p, q, d, and r prefixes of the distribution functions and work through examples of how to use them with baseball data from the {Lahman} package.
    • Source Code
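A quick sketch of the four prefixes, shown on the normal distribution (standing in for the baseball data used in the episode):

```r
# d = density, p = cumulative probability, q = quantile, r = random draws
dnorm(0)                       # density of N(0, 1) at 0, about 0.399
pnorm(1.96)                    # P(X <= 1.96), about 0.975
qnorm(0.975)                   # the 97.5th percentile, about 1.96
set.seed(42)
rnorm(5, mean = 100, sd = 15)  # five random draws from N(100, 15)
```

The same four prefixes exist for the other built-in distributions (dbinom/pbinom/qbinom/rbinom, dpois/..., dgamma/..., and so on).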
  • Episode 98: Sampling, Simulation, and Intro to Bayes - Sampling and Bootstraps

    • sample is a fun and useful base R function that selects a sample of n values from a vector at random. This has important implications for setting up bootstrap resamples of existing datasets. This week we go through the differences between simulation and resampling, and work through some simple resampling setups that will be the foundation for the next few episodes.
    • Source Code
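A minimal resampling sketch with made-up data: one bootstrap resample, then a percentile interval for the mean built from many resamples:

```r
set.seed(2021)
x <- rnorm(50, mean = 10, sd = 2)  # stand-in for an observed sample

# one bootstrap resample: same size as the data, drawn with replacement
resample <- sample(x, size = length(x), replace = TRUE)

# bootstrap the mean 1000 times and take a 95% percentile interval
boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))
quantile(boot_means, c(0.025, 0.975))
```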
  • Episode 99: Sampling, Simulation, and Intro to Bayes - Basic Bayes

    • Applying what we have learned these last few weeks, we are ready for Bayesian statistics and Bayes' theorem! This week we work through the concept behind Bayes and attempt to talk through it in more approachable terms. We then apply the theorem to a few different cases to help solidify our understanding.
    • Source Code
  • Episode 100: Sampling, Simulation, and Intro to Bayes - Beta Bayes

    • Continuing our series on Bayes, this week we learn about the conjugate prior of the binomial distribution: the beta distribution! Applying what we learned about Bayes' theorem last week, we work through an example where we evaluate the performance of a basketball player in a drill where the average participant hits 65% of their shots, and this person hit 16 of 20. We discuss how to calculate credible intervals and update our analysis as we get more data on this player!
    • Source Code
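The episode's numbers can be sketched with a conjugate beta-binomial update. The Beta(65, 35) prior below matches the 65% average but its strength (a pseudo-sample of 100 shots) is our assumption, not the episode's:

```r
a0 <- 65; b0 <- 35        # assumed prior: mean 0.65 from 100 pseudo-shots
made <- 16; missed <- 4   # this player's drill result: 16 of 20

a1 <- a0 + made           # conjugate beta-binomial update:
b1 <- b0 + missed         # add makes to shape1, misses to shape2
a1 / (a1 + b1)                  # posterior mean, 81/120 = 0.675
qbeta(c(0.025, 0.975), a1, b1)  # 95% credible interval
```

Updating with more data is just repeating the same addition with the new makes and misses.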
  • Episode 101: Sampling, Simulation, and Intro to Bayes - Poisson/Gamma

    • Ever wonder how you could estimate the probability of a rate? Enter the Poisson distribution. Armed with "lambda", which is both the mean and the variance of the distribution, we are able to simulate and calculate probabilities for counts of occurrences, such as points scored in a game by a player. However, to apply Bayes' theorem and get credible intervals, we need a continuous prior: enter the conjugate prior, the gamma distribution. We use this to perform Bayesian updating and calculate credible intervals that give us insight on a new player on our pretend basketball team.
    • Source Code
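A hypothetical sketch of the gamma-Poisson update; the prior and the game totals below are made up for illustration:

```r
shape0 <- 20; rate0 <- 1      # assumed prior: mean scoring rate of 20 ppg
pts <- c(18, 25, 22, 15, 30)  # made-up points from five observed games

shape1 <- shape0 + sum(pts)   # conjugate gamma-Poisson update:
rate1  <- rate0 + length(pts) # add total counts and number of games
shape1 / rate1                          # posterior mean rate, 130/6
qgamma(c(0.025, 0.975), shape1, rate1)  # 95% credible interval for lambda
```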
  • Episode 102: Sampling, Simulation, and Intro to Bayes - Normal-Normal Conjugate

    • This week we take a look at the most common, but also potentially the most confusing, distribution for our purposes: the normal distribution. We discuss how a Bayesian looks at and uses a normal distribution, where the mean and standard deviation each have their own distribution. We apply a simplifying assumption to work through a problem in which we determine the probability of a basketball player being above average in a made-up efficiency metric, and we demonstrate Bayesian updating as we gain new information on the player.

    • Source Code

  • Episode 103: Sampling, Simulation, and Intro to Bayes - Normal-Gibbs Sampler

    • In the final episode of this series on Bayes, we use learnings from several prior episodes to apply a new technique: Gibbs sampling. This tool is used when multiple parameters are being evaluated, each with its own distribution. We continue with the example from last week, but demonstrate how a Gibbs sampler can generate a posterior distribution without having fixed the mean and standard deviation of a player's efficiency metric. We also show a simple function that wraps what we have learned in a simple API.

    • Source Code

  • Episode 104: R Classes and Objects - dates and POSIXt

    • This week we go on a date. Well, we talk about a date. Okay, okay, we talk about how to look at and use date and datetime objects in R. We start with a high-level overview of the object systems that exist in R, and then reach into our mailbag to answer a question about lubridate. We talk about the fundamentals of Date and POSIXt objects and ways to use them. Then we go over some of the difficulties of their behavior and how the {lubridate} package makes dealing with dates much simpler.

    • Source Code

  • Episode 105: R Classes and Objects - Base

    • In the past 104 episodes, we realized we never spent time going over the base object types in R, how to build them up, and how to access them. This is something we have done in every episode, but this is the week we go over the mechanics of how it all works. We use four base object types: logical (boolean), integer, numeric, and character, and show you how to build vectors, matrices, and data.frames. We also go over how we think about objects.

    • Source Code

  • Episode 106: R Classes and Objects - Factors

    • Until R 4.0.0, "why is stringsAsFactors TRUE by default?" was a common question from many a new R programmer. In this episode we discuss the mysterious factor object in base R. Why does it exist, how do you use it, and how do you work with it are the questions we attempt to answer here. We demonstrate converting vectors to and from factors, how factors impact regression models, and how to use factors to control plots!

    • Source Code
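A small sketch of the mechanics (the position labels are made up): a factor stores integer codes plus a level table, and the level order is what models and plots respect.

```r
positions <- c("G", "F", "C", "F", "G", "G")

f <- factor(positions, levels = c("G", "F", "C"))
levels(f)        # the level order we chose, not alphabetical
as.integer(f)    # the integer codes stored underneath: 1 2 3 2 1 1
table(f)         # counts come out in level order

# change the reference level a model like lm() would use
f2 <- relevel(f, ref = "C")

as.character(f)  # and back to plain strings
```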

  • Episode 107: R Classes and Objects - Lists, Part 1

    • Listy, list, lists. This episode we talk about one of our favorite, most flexible objects in R: the list. These objects can do almost anything, because they just don't care. Ellis and Patrick talk about how to create lists, how they can nest and contain different object types, how to extract their contents, and how to iterate over them. They introduce the {purrr} package and the valuable map family of functions, and compare them to some of the apply family of functions in base R.

    • Source Code

  • Episode 108: R Classes and Objects - Lists, Part 2

    • Listy, list, lists. AGAIN. This episode we continue our talk about lists. Last week we showed some methods to create and work with lists, and this week we show a variety of ways that lists can be used. We demonstrate summary statistics gathering, recording model results, and even looping over a list to generate a PDF report!

    • Source Code

  • Episode 109: R Classes and Objects - Making an S3 Object, Part 1

    • So far we have discussed the EXISTING objects included in base R. But our viewers may remember mention of additional object systems: S3, S4, RC, and R6. In this episode we introduce the idea of making your own object in the S3 object system. Ever wonder how a tibble was made and how so many functions "just work" with it? Here we start to give you some insight into this idea by creating our own object and its own print method. Then we demo how to write a function that serves as a constructor for that object!

    • Source Code
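A minimal sketch of the constructor-plus-method pattern; the "player" class and its fields are invented here for illustration, not taken from the episode:

```r
# constructor: build a list, then tag it with a class attribute
new_player <- function(name, team, ppg) {
  structure(list(name = name, team = team, ppg = ppg), class = "player")
}

# a print method that S3 dispatch picks up automatically for "player" objects
print.player <- function(x, ...) {
  cat(x$name, "(", x$team, ") -", x$ppg, "ppg\n")
  invisible(x)
}

p <- new_player("A. Example", "SEA", 21.4)
print(p)  # dispatches to print.player()
```

This is the same mechanism that makes a tibble print differently from a plain data.frame: the class attribute routes generic functions to class-specific methods.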

  • Episode 110: R Classes and Objects - Making an S3 Object - Part 2 - S3 Tournament

    • We extend the idea of creating our own objects this week by demonstrating "S3 in practice". We pretend to be a Data Scientist for a local sports betting company. The season has just ended for a local sports league, and we want to predict who will win the whole enchilada. First we need to sort out how we will simulate a single game. We create objects representing teams, and a series of functions to predict team performance and eventually a game winner!

    • Source Code

  • Episode 111: Nate Latshaw, UFC Data, and data.table

    • This week we are joined by the one and only Nate Latshaw. Nate is a software engineer and open source contributor, making amazing visualizations of UFC data in R. Some of Nate's work includes a complex shiny app that offers a lot of different ways to explore UFC fighter data. This week we are walked through how some of the visualizations are made, get a quick introduction to data.table, and get an inside look at how Nate creates such amazing visualizations. After the code, we talk about Nate's career, his experience in the open source community, and advice for those looking to start their own open source projects!

    • Nate can be found at @NateLatshaw

    • Source Code

  • Episode 112: R Classes and Objects - Making an S3 Object - Part 3 - S3 Tournament

    • This is the final episode on creating and applying S3 objects. We discuss some comments we received from viewers asking why S3 objects rather than a named list, and then get down to the business of completing our single-elimination tournament. We create an object to represent a matchup, then abstract up to a tournament round, and finally the full tournament.

    • Source Code

  • TidyX Episode 113 | R Classes and Objects - Making an S4 Object - Part 1

    • We move on to the next, and possibly one of the more divisive (is that possible?), object systems in R: the S4 system. This system takes the free-wheeling S3 object class and says no more. Everything must be clearly defined up front, from the content of your object to its methods. We discuss some basics of why we use objects before getting into the nitty-gritty of creating a few objects using the S4 system. We create a "print" method to demonstrate how to create a custom method, and show how to make a constructor.

    • Source Code

  • TidyX Episode 114 | camcorder R package

    • We take a step away from R objects to talk about a package that Ellis has been developing for the past two years: camcorder. Ellis talks about why he wrote the package, the ideas behind it, and why folks might find value in it. He walks through an example of how to use the package, and gives a few call-outs for folks that have supported the project throughout its two years.

    • Source Code

  • TidyX Episode 115 | R Classes and Objects - Making an S4 Object, Part 2 - S4 Tournament

    • We finally come back to S4 objects this week and talk about how one might use an S4 object IRL. We look back at what we did for S3 objects and decide to use the same context, but this time talk about how we would solve the problem in S4 instead of S3. We talk about creating new S4 generics and methods, and incorporate some viewer suggestions to create an object holding the results of a simulated game! We also open with a quick tangent about an amusing thread by Danielle Navarro (https://twitter.com/djnavarro/status/1565515145488797696) on S3 chaos.

    • Source Code

  • TidyX Episode 116 | R Classes and Objects - Making an S4 Object, Part 3 - S4 Tournament

    • This week we close out our S4 discussion by finalizing our code to simulate tournaments. Last week we built up simulated games; now we simulate tournament rounds, create new S4 classes and methods, and ultimately simulate a full tournament. We also update how we call the likely winner by simulating the tournament 1000 times.

    • Source Code

  • TidyX Episode 117 | Creating Participant IDs

    • Ever wonder how you can use tidyverse tools to create unique identifiers for your experiment records? Wonder no longer. This week we show you how to use the lesser-known cur_group_id() function to get the group id number to serve as a participant ID. Then we demonstrate how you can also use joins, and finally discuss creating an index for grouping observations of the same participant using integer division!

    • Source Code
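A small sketch of the cur_group_id() approach with made-up names, assuming {dplyr} is available:

```r
library(dplyr)

df <- tibble(
  name  = c("Ann", "Ann", "Bob", "Bob", "Cat"),
  value = 1:5
)

ids <- df %>%
  group_by(name) %>%
  mutate(participant_id = sprintf("P%03d", cur_group_id())) %>%
  ungroup()

ids  # every row for the same name shares one participant_id
```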

  • TidyX Episode 118 | Windowing Functions with {zoo} and tidyverse

    • What technology lets you see through a wall? Windows. This episode we take a look at the ever-useful tidyverse and {zoo} package and how we can perform windowed (rolling) calculations. We celebrate Albert Pujols hitting 700 career home runs by looking at his career home run totals and working through examples of different windowing calculations. We show how these calculations can be used as part of your visualization to add context.

    • Source Code
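With made-up yearly home run totals (not Pujols' actual numbers), a sketch of a few windowed calculations using {zoo}:

```r
library(zoo)

hr <- c(37, 47, 34, 43, 46, 41, 49, 32)  # made-up yearly HR totals

cumsum(hr)                                       # running career total
rollmean(hr, k = 3, fill = NA, align = "right")  # trailing 3-year average
rollapply(hr, width = 3, FUN = max,
          fill = NA, align = "right")            # trailing 3-year max
```

These rolling columns drop straight into a mutate() call, which is how they end up annotating a plot.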

  • TidyX Episode 119 | Slice n' Dicing data with tidyverse

    • This week we look at a common function we use to select random subsets of data in the tidyverse: slice_sample. This function is the successor to sample_n and sample_frac, and allows us to quickly and easily grab n rows, or a proportion of the rows, in a single line of code. We go over a few different arguments and setups people might use with this function!

    • Source Code
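A sketch of the common setups on the built-in mtcars data, assuming {dplyr}:

```r
library(dplyr)

set.seed(5)
five_rows <- mtcars %>% slice_sample(n = 5)        # five random rows
quarter   <- mtcars %>% slice_sample(prop = 0.25)  # 25% of the rows

per_group <- mtcars %>%        # two random rows per cylinder group
  group_by(cyl) %>%
  slice_sample(n = 2) %>%
  ungroup()
```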

  • TidyX Episode 120 | Working with columns in Tidyverse

    • Selecting, renaming, and moving columns around is an incredibly common task for data scientists. So much so that there are loads of little helpers embedded in the tidyverse to improve quality of life. This episode we highlight some of these helpers, such as the starts_with, ends_with, matches, and where functions, along with the super important relocate and rename functions to move columns around and rename them, respectively. Finally, we close by going over the differences between the dplyr::pull function and the purrr::pluck function.

    • Source Code
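A sketch of the helpers on dplyr's built-in starwars data:

```r
library(dplyr)

h_cols  <- starwars %>% select(starts_with("h"))  # height, hair_color, homeworld
colors  <- starwars %>% select(ends_with("color"))
nums    <- starwars %>% select(where(is.numeric))

renamed <- starwars %>% rename(character = name)  # new_name = old_name
moved   <- starwars %>% relocate(species, .after = name)

heights <- starwars %>% pull(height)  # one column as a plain vector
```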

  • TidyX Episode 121 | Tell me what you want - user submitted data

    • This week we get into some data engineering problems provided by a viewer! Ellis and Patrick are provided some wide data containing simulated measurements of a few patients after surgery. Our job is to turn this into useful long data based on what we were provided. Using tools from the last few weeks, we demonstrate how to use mutate, relocate, rename, pivot_longer, and pivot_wider. We show two approaches, one using more advanced regex and pivoting tools, to make the data useful.

    • Source Code

  • TidyX Episode 122 | Event based data and filtering

    • Event-based time series data is a super common type of data. But it does come with some unique challenges. This week we talk through some techniques a person might use to explore and filter this data into something more useful. We simulate event data where three participants have N observations and at any one of these observations an event may occur. We calculate number of events, time between events, how to get n observations post each event, and how to grab observations from two named events!

    • Source Code

  • TidyX Episode 123 | Criss Cross Apple Sauce - Crossing in Tidyverse

    • Crossing vectors and dataframes to generate new data or compare existing data is a very common practice in data analysis. Whether generating values to allow you to grid search or comparing values, there are helpers in R to make this process much easier. The tidyr crossing function and its familiars make this process a piece of cake. We explore the behaviors of these functions and give an example of how they can be useful!

    • Source Code
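A minimal sketch of tidyr's crossing function, using hypothetical tuning values for a grid search:

```r
library(tidyr)

# every combination of two sets of tuning values, e.g. for a grid search
grid <- crossing(
  mtry  = c(2, 4, 6),
  trees = c(100, 500)
)

grid  # 3 x 2 = 6 rows, one per combination
```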

  • TidyX Episode 124 | Combining Multiple Conditions

    • This episode we work through a problem that was submitted by a colleague of Patrick: "I have multiple different potential values that I want to report based on a reference value. What would be a good way to combine them? case_when and ifelse don't seem to be doing it". We walk through the scenario, explain why case_when and ifelse fail, and provide a few solutions!

    • Source Code

  • TidyX Episode 125 | Combining Multiple Conditions, Followup

    • We reach into our mailbag to answer questions submitted by our viewers from our last episode. We go over a suggestion from @datadavidz to use the enframe function, explain how to un-rowwise your tibble, and give a solution to a similar problem submitted by Jeff Rothschild!

    • Source Code

  • TidyX Episode 126 | Keeping duplicates on pivoting

    • This week we pick up a problem that you too may have faced - pivoting your data and not getting the expected format due to some unexpected content in your data. This week we go through an example from Patrick, where we want to pivot values and keep the duplicated values independent. We work through a few different approaches to explain the thought process and how you too can preserve duplicates on pivot_wider.

    • Source Code

  • TidyX Episode 127 | Fuzzy Wuzzy Joiny Tools

    • How do you match two datasets that have ever-so-slightly different spellings for the values you want to match on? In comes fuzzy matching! This week we pick up a question from one of our Patreon patrons on how to match the names of different sports ball players across multiple sources. We generate a simple example using a "name bank" or reference dataset along with some simulated scraped data, and show you two ways to do the matching. Ellis shows us how to use agrep/agrepl from base R, and Patrick walks through an example from the {fuzzyjoin} package!

    • Source Code
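A sketch of the base R side of the episode, agrep/agrepl; the names below are made up stand-ins for a "name bank":

```r
bank  <- c("Ken Griffey Jr.", "Ichiro Suzuki", "Edgar Martinez")
dirty <- "Edgar Martines"  # a scraped, slightly misspelled name

# approximate matching: allow up to one edit (insert/delete/substitute)
agrep(dirty, bank, max.distance = 1)         # index of the fuzzy match
agrepl(dirty, bank, max.distance = 1)        # the logical version
bank[agrepl(dirty, bank, max.distance = 1)]  # the cleaned-up name
```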

  • TidyX Episode 128 | Data formats as data - AOC Day 1

    • We solve Day 1 of Advent of Code in two ways this week. Ellis bases his approach on base R, applying a loop and pre-allocating a vector, while Patrick reads the data in as a data.frame and applies tidyverse functions to come to the same conclusions. We discuss how sometimes the format of the data can itself be informative, and how you should approach processing when that matters.

    • Source Code

  • TidyX Episode 129 | Generating Snowflakes

  • TidyX Episode 130 | Independent Interactive Reports with Plotly

    • Ellis and Patrick got a question from a viewer asking how they can share interactive reports with their stakeholders without using shiny! Well, the answer is right in front of us in the use of Rmarkdown to generate html reports combined with the power of htmlwidgets from plotly. We generate a report that can be shared through a single file, that provides some fun interactivity to look at baseball batting averages.
    • Source Code
  • TidyX Episode 131 | Player Selection in Shiny

    • This week we work on a problem most sports scientists have likely dealt with: how to select a player by name when there are multiple players with the same name! We show two ways, first using selectInput and creating unique records for each player in the selection choices, and then using DT and the ability datatables have to create reactive inputs when they are clicked.
    • Source Code
  • TidyX Episode 132 | Fuzzy Matching Shiny

    • Expanding on what was done in episode 127, and continuing the theme from the last few episodes on shiny, we demonstrate how you too can create a shiny app that empowers your non-programmer team to perform their own fuzzy matching. We use a collection of different techniques, including uiOutput, datatables, and a download handler!
    • Source Code
  • TidyX Episode 133 | Intro to Flexdashboard - Flexing your Dashboard

    • Somehow, for 132 episodes, we have not done a flexdashboard! This changes now. A flexdashboard is an advanced Rmarkdown that allows you to create serverless dashboards. Nicely format and display your content for your stakeholders in interactive websites, and move away from manually creating them or using Excel.
    • Source Code
  • TidyX Episode 134 | Conditional Styling with DT

    • DT offers a lot of power to users through the ability to quickly make interactive tables in R. However, that is not its only superpower. Offering a number of formatting functions to style the contents, both for visual display and string formatting, there are many options for a power user. We go through the basics and some advanced skills, like formatting based on another column or styling an entire row.
    • Source Code
  • TidyX Episode 135 | Github cron jobs

    • This week, we discuss using cron jobs in Github to automate the process of scraping webpages at a set cadence (e.g., every morning at 6am).
    • Source Code
  • TidyX Episode 136 | Fuzzy Joins on Dates

    • Sometimes. That's the thing. Some Times. We answer a viewer question extending a prior episode: how do I join participant sample data to the closest date of an event within N days? We work through the problem solving out loud and try to give you the tools to solve this problem too!
    • Source Code

TidyX's People

Contributors

thebioengineer · pw2
