
bcongelio / nfl-analytics-with-r-book

51 stars · 2 watchers · 15 forks · 389.03 MB

The repo for Introduction to NFL Analytics with R (published with CRC Press)

Home Page: https://bradcongelio.com/nfl-analytics-with-r-book/

License: Creative Commons Zero v1.0 Universal

CSS 0.09% R 39.01% TeX 42.63% Lua 5.39% SCSS 0.49% JavaScript 12.40%
analytics nfl sports-analytics

nfl-analytics-with-r-book's Introduction

nfl-analytics-with-r-book

The repo for Introduction to NFL Analytics with R (forthcoming with CRC Press).

It is slated for publication in December 2023.

nfl-analytics-with-r-book's People

Contributors

addison-mcghee, bcongelio, ngiangre, rohanalexander, shen3340, tibbles-and-tribbles


nfl-analytics-with-r-book's Issues

Section 3.2.0.1 - Garbage Time Metrics

Currently listed as over 95% or under .05%.

Obviously, .05% is a typo; it should be 5% (a win probability of 0.05).

Author Note: I kind of want to run a garbage time filter with .05 just to see how hilariously skewed the results are.
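For reference, a common garbage-time filter keeps only plays where win probability stays between 5% and 95%. A minimal sketch, assuming nflfastR-style play-by-play with a `wp` column (not necessarily the book's exact code):

```r
library(nflreadr)
library(dplyr)

pbp <- load_pbp(2022)

# Keep "competitive" plays: win probability between 5% and 95%
# (0.05 and 0.95 -- not .0005, which would discard nearly everything)
competitive_plays <- pbp %>%
  filter(wp >= 0.05 & wp <= 0.95)
```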

Tibble not sorted in descending order of avg_cpoe

2.5.4: NFL Data and the mutate() verb.
I noticed that the following command includes an "arrange()" verb that should list the values in descending order of avg_cpoe:
airyards_cpoe_mutate %>%
  filter(ay_distance == "Medium") %>%
  arrange(-avg_cpoe) %>%
  slice(1:10)

but in the tibble displayed in the book, and when I copy/paste into RStudio, the tibble is arranged in ascending alphabetical order of passer:

# A tibble: 82 × 3
# Groups:   passer [82]
passer ay_distance avg_cpoe

1 A.Brown Medium -6.09
2 A.Cooper Medium -43.1
3 A.Dalton Medium 5.01
4 A.Rodgers Medium 2.02
5 B.Allen Medium 44.5
6 B.Hoyer Medium -44.0
7 B.Mayfield Medium -15.4
8 B.Perkins Medium 0.266
9 B.Purdy Medium 9.72
10 B.Rypien Medium 18.0
# ℹ 72 more rows

Any idea why?
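A likely cause: the tibble coming out of the earlier summarize() is still grouped by passer, and slice() operates within groups. With 82 one-row groups, slice(1:10) returns every row, and dplyr orders grouped output by the group keys, i.e., alphabetically by passer. A minimal sketch of the fix, reusing the airyards_cpoe_mutate object from the book:

```r
library(dplyr)

# Drop the grouping before arranging and slicing so the sort
# and the top-10 cut apply to the whole tibble, not per group
airyards_cpoe_mutate %>%
  filter(ay_distance == "Medium") %>%
  ungroup() %>%
  arrange(-avg_cpoe) %>%
  slice(1:10)
```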

Section 3.1.1 - clear up data re. Mahomes and Manning

In trying to explain that load_player_stats() brings in passing, rushing, and receiving yards for each player, I quickly switched to Peyton Manning to showcase that the data provides, for example, his rushing EPA (-0.740, if you were wondering) which is a stat that is not all that useful considering what we know about Peyton.

I switch back to using Mahomes in the examples right after. A bit of editing in that section will make it clearer what I am trying to highlight about the data: getting information pertaining to Mahomes' rushing EPA is perhaps relevant (especially compared to Peyton's).

Typo in 1.3

After running oline_snap_counts for the first time ...

Are these five plays necessarily the same five that started the game as the two tackles, two guards, and one center? Perhaps not. But, hypothetically, a team’s starting center could have been injured a series or two into the game and the second-string center played the bulk of the snaps in that game.

... where "plays" should be "players."

5.3.1.3 - Number of k typo

At the beginning of 5.3.1.3:

As a result of the PCA process, we know we will be grouping the running back into four distinctive clusters, so we will create a value in our environment called k and set it to 4.

... but the code is k <- 3 (which is correct). Change the 4 to 3 in the intro.

5.2.3 Typo

The paragraph reads:

"To that end, there are three distinct types of logistic regressions: binary, multinomial, and ordinal (we will only be covering binary and ordinal in this chapter). All three allow for working with various types of dependent variables."

However, I cover binary and multinomial without covering ordinal. Just swap "ordinal" for "multinomial" in the ().

Removing reference to "line of best fit" in initial rushing yards over expected model

As outlined by NeuroNisl in the Discord:

The graph shows the “line of best fit” between Actual Rushing Yards and Expected Rushing Yards. Those running backs that outperformed the model’s expectations (that is, more actual rushing yards than expected rushing yards) fall below the line, while those that underperformed the model are above the line of best fit.

Wouldn't this only be true if you plotted the linear function y = x instead of the linear regression result, which doesn't necessarily need to be this function?

He is correct. It is not a correlation (as in the linear regression chapter), and this needs a slight edit to remove the terminology.
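If a reference line is still wanted on the plot, the identity line y = x is the one that actually separates over- and under-performers. A hedged ggplot2 sketch, with illustrative (hypothetical) object and column names:

```r
library(ggplot2)

# Over/under-performance relative to the model is measured against
# the identity line y = x, not a fitted regression line.
# ryoe_data, expected_yards, and actual_yards are illustrative names.
ggplot(ryoe_data, aes(x = expected_yards, y = actual_yards)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed")
```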

Typo in 1.3.1 code example

Hi, loving this book so far! I ran into a quick issue in section 1.3.1. The code offered is:
oline_snap_counts <- oline_snap_counts %>%
  group_by(game_id, team) %>%
  arrange(player, .by_group = TRUE)

but line 3 should use
arrange(player, .group_by = TRUE)

basically the words ".group_by" are just reversed. I was able to figure it out quickly with ChatGPT, but in case people want to copy and paste without fuss, I thought it was worth highlighting.

Thanks!

Error in Group By

Hello, for section 3.5.12 I think there is an issue with this section:

dakota_composite <- nflreadr::load_player_stats(2022) %>%
  filter(position == "QB") %>%
  group_by(player_display_name, season) %>%
  summarize(attempts = sum(attempts, na.rm = TRUE),
            mean_dakota = mean(dakota, na.rm = TRUE)) %>%
  filter(attempts >= 200)

This caused the resulting plot to have a lot more points than the example given. Once I removed season from the group by it worked as expected.

Update: Never mind, I was doing something wrong.

First k-means graph data is opposite the example

rushing_kmeans_data <- vroom("http://nfl-book.bradcongelio.com/kmeans-data")

rusher_names <- rushing_kmeans_data$player
rusher_ids <- rushing_kmeans_data$player_id

rushers_pca <- rushing_kmeans_data |> 
  select(-player, -player_id)

rownames(rushers_pca) <- rusher_names

rushers_pca <- prcomp(rushers_pca, center = TRUE, scale = TRUE)

fviz_pca_biplot(rushers_pca, geom = c("point", "text"), ggtheme = nfl_analytics_theme()) + 
  xlim(-6, 3) + labs(title = "PCA Biplot: PC1 and PC2") + 
  xlab("PC1 - 35.8%") + ylab("PC2 - 24.6%")

On Chapter 5's first K-means plot, the PC2 data seems flipped with the example plot. On my plot, it shows Dalvin Cook with -3 PC2 but on your graph it shows Cook with +3 PC2. Was wondering why the data seems opposite for every player.

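One likely explanation: the sign of each principal component is arbitrary, so prcomp() can legitimately return PC2 multiplied by -1 on a different platform or BLAS build without changing the analysis at all. A sketch of flipping PC2 to match the book's orientation, reusing the rushers_pca object from the code above:

```r
# PC signs are not identified: flipping a component's sign flips
# every score and loading on that axis but changes nothing about
# the variance explained or the clustering that follows.
rushers_pca$rotation[, 2] <- -rushers_pca$rotation[, 2]
rushers_pca$x[, 2]        <- -rushers_pca$x[, 2]
```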

Chapter 4.6

"When charting passing attempts in football, a player’s individual passing yards are aggregated under gross yards with all lost yardage resulting from a sack being included. On the other hand, team passing yards are always presented in net yards and any lost yardage from sacks are not included. "

Shouldn't this be the other way around? Don't net yards include sack losses while gross yards exclude them?

Section 2.5.3 code cleanup opportunity

From an nflverse Discord comment:

Hey @bcongelio, looking at this section in the book (2.5.3): is the additional filter to exclude NAs on the down necessary, since you're already asking for a specific down?

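For context, dplyr's filter() already drops rows where the condition evaluates to NA, and NA == 1 evaluates to NA, so an explicit NA check on down is redundant once a specific down is requested. A minimal sketch, assuming a play-by-play data frame pbp with a down column:

```r
library(dplyr)

# filter() keeps only rows where the condition is TRUE; rows where
# down is NA evaluate to NA and are dropped automatically, so:
pbp %>% filter(down == 1)

# ...returns the same rows as the more explicit:
pbp %>% filter(!is.na(down), down == 1)
```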

Reproducing graphs from Chapter 1.

I was interested in your book since I'm unfamiliar with football (NFL) analytics. When I tried to reproduce your output from Chapter 1, the text labels on the ggplot2 graphs didn't appear; there were warnings that no font could be found for the family "Roboto". I have installed and loaded all 34 packages that you list in Table 0.1.

How can I get that particular font installed? I apologize if you've discussed this somewhere in your book.

Jim Albert
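One common way to get Roboto working (not necessarily the book's own setup) is to pull it from Google Fonts with the sysfonts package and enable showtext before plotting:

```r
library(sysfonts)
library(showtext)

# Download Roboto from Google Fonts and register it with the
# "Roboto" family name that the book's theme expects
font_add_google("Roboto")

# Route all subsequent graphics text rendering through showtext
showtext_auto()
```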

5.3.1.1 Typo

In #1 (Variable Scaling) ...

The scaling process computers the data frame so that each variable has a standard deviation of one and a mean of zero, with the equation to produce these results below.

"computers" = "computes"
