understatr's Introduction

understatr

Overview

An R package to help with retrieving tidy understat data.

Install

understatr is not likely to be submitted to CRAN. Get the latest development version from GitHub:

remotes::install_github('ewenme/understatr')

Use

library(understatr)

Check currently available leagues/seasons:

get_leagues_meta()
#> # A tibble: 48 × 4
#>    league_name  year season    url                                        
#>    <chr>       <dbl> <chr>     <chr>                                      
#>  1 EPL          2021 2021/2022 https://understat.com/league/EPL/2021      
#>  2 EPL          2020 2020/2021 https://understat.com/league/EPL/2020      
#>  3 EPL          2019 2019/2020 https://understat.com/league/EPL/2019      
#>  4 EPL          2018 2018/2019 https://understat.com/league/EPL/2018      
#>  5 EPL          2017 2017/2018 https://understat.com/league/EPL/2017      
#>  6 EPL          2016 2016/2017 https://understat.com/league/EPL/2016      
#>  7 EPL          2015 2015/2016 https://understat.com/league/EPL/2015      
#>  8 EPL          2014 2014/2015 https://understat.com/league/EPL/2014      
#>  9 La liga      2021 2021/2022 https://understat.com/league/La%20liga/2021
#> 10 La liga      2020 2020/2021 https://understat.com/league/La%20liga/2020
#> # … with 38 more rows

Get stats for a team’s playing squad in a league season:

get_team_players_stats(team_name = "Manchester City", year = 2018)
#> 
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#>   player_id = col_double(),
#>   player_name = col_character(),
#>   games = col_double(),
#>   time = col_double(),
#>   goals = col_double(),
#>   xG = col_double(),
#>   assists = col_double(),
#>   xA = col_double(),
#>   shots = col_double(),
#>   key_passes = col_double(),
#>   yellow_cards = col_double(),
#>   red_cards = col_double(),
#>   position = col_character(),
#>   team_name = col_character(),
#>   npg = col_double(),
#>   npxG = col_double(),
#>   xGChain = col_double(),
#>   xGBuildup = col_double()
#> )
#> # A tibble: 21 × 19
#>    player_id player_name     games  time goals    xG assists     xA shots key_passes
#>        <dbl> <chr>           <dbl> <dbl> <dbl> <dbl>   <dbl>  <dbl> <dbl>      <dbl>
#>  1       619 Sergio Agüero      33  2515    21 19.9        8  5.23    118         34
#>  2       618 Raheem Sterling    34  2788    17 15.9       10 10.8      77         66
#>  3       337 Leroy Sané         31  1866    10  6.98      10  8.10     56         40
#>  4       750 Riyad Mahrez       27  1333     7  6.62       4  5.01     54         24
#>  5      3635 Bernardo Silva     36  2851     7  8.20       7  8.63     62         71
#>  6      5543 Gabriel Jesus      29   993     7 12.6        3  2.65     43         21
#>  7       314 Ilkay Gündogan     31  2133     6  4.21       3  4.97     43         43
#>  8       617 David Silva        33  2426     6  8.13       8 10.1      51         73
#>  9      2498 Aymeric Laporte    35  3059     3  3.75       3  0.839    26         13
#> 10       447 Kevin De Bruyne    19   965     2  1.47       2  6.65     31         36
#> # … with 11 more rows, and 9 more variables: yellow_cards <dbl>,
#> #   red_cards <dbl>, position <chr>, team_name <chr>, npg <dbl>, npxG <dbl>,
#> #   xGChain <dbl>, xGBuildup <dbl>, year <dbl>

Get stats for a player across all seasons:

get_player_seasons_stats(player_id = 618)
#> 
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#>   position = col_character(),
#>   games = col_double(),
#>   goals = col_double(),
#>   shots = col_double(),
#>   time = col_double(),
#>   xG = col_double(),
#>   assists = col_double(),
#>   xA = col_double(),
#>   key_passes = col_double(),
#>   year = col_double(),
#>   team_name = col_character(),
#>   yellow = col_double(),
#>   red = col_double(),
#>   npg = col_double(),
#>   npxG = col_double(),
#>   xGChain = col_double(),
#>   xGBuildup = col_double(),
#>   player_name = col_character()
#> )
#> # A tibble: 8 × 19
#>   position games goals shots  time    xG assists     xA key_passes  year
#>   <chr>    <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>  <dbl>      <dbl> <dbl>
#> 1 FWL          3     1     7   128  1.71       0  0.111          1  2021
#> 2 AML         31    10    70  2539 12.1        7  6.63          39  2020
#> 3 FWL         33    20   100  2678 19.8        1  7.21          48  2019
#> 4 AML         34    17    77  2788 15.9       10 10.8           66  2018
#> 5 Sub         33    18    87  2594 18.8       11  8.84          55  2017
#> 6 AMR         33     7    64  2532  8.11       6  5.50          46  2016
#> 7 AML         31     6    52  1943  7.15       2  3.25          35  2015
#> 8 AML         35     7    84  3059  8.79       7  6.04          75  2014
#> # … with 9 more variables: team_name <chr>, yellow <dbl>, red <dbl>, npg <dbl>,
#> #   npxG <dbl>, xGChain <dbl>, xGBuildup <dbl>, player_id <dbl>,
#> #   player_name <chr>
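As a rough sketch of what can be done with this output, the block below derives goals minus xG and per-90 rates. To stay runnable offline it rebuilds a small data frame from the rounded values printed above rather than calling the package, so the numbers are approximate; `seasons`, `delta_xG`, `goals_p90` and `xG_p90` are names chosen here for illustration.

```r
# Values copied from the example output above (player_id = 618).
# The printed xG values are rounded, so results are approximate.
seasons <- data.frame(
  year  = c(2021, 2020, 2019, 2018, 2017, 2016, 2015, 2014),
  goals = c(1, 10, 20, 17, 18, 7, 6, 7),
  xG    = c(1.71, 12.1, 19.8, 15.9, 18.8, 8.11, 7.15, 8.79),
  time  = c(128, 2539, 2678, 2788, 2594, 2532, 1943, 3059)
)

# Goals minus expected goals: positive means the player outscored xG.
seasons$delta_xG <- seasons$goals - seasons$xG

# Rates per 90 minutes played.
seasons$goals_p90 <- seasons$goals / seasons$time * 90
seasons$xG_p90    <- seasons$xG / seasons$time * 90

# Show the seasons in chronological order.
seasons[order(seasons$year), c("year", "goals", "xG", "delta_xG", "goals_p90")]
```

With the package and a network connection available, the same column arithmetic should apply directly to the tibble returned by get_player_seasons_stats().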

Issues

If you encounter a clear bug, please file a minimal reproducible example on GitHub. For questions and other discussion, try Stack Overflow or e-mail.

Disclaimer

While there is no official notice on the site condoning web scraping activity, Understat’s support team has previously confirmed (via e-mail exchange, 8th November 2018) that their data is free to use for non-commercial purposes. This stance is subject to change.

Also, be polite and attribute the source.


Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

understatr's Issues

mature team-level stats functionality

Currently, team-level stats are fetched for a whole league via get_league_teams_stats(). It should be possible to get team-level stats for a single club, and also to get data on different game situations e.g. set-pieces. See a team's page for an example.

type.convert error for get_league_teams_stats and get_players_seasons_stats


The type.convert call in the two functions above throws an error saying that the first argument must be of mode character. Is there an additional package needed, or is it an issue with my operating system?

All functions in team.R and players.R give me the same error, except for get_team_meta().

library(reprex)
library(understatr)
league_data <- get_league_teams_stats(league_name = "EPL", year = 2018)
Error in type.convert(teams_df) : 
  the first argument must be of mode character
player_data <- get_player_stats(player_id = 882)
Error in get_player_stats(player_id = 882) : 
  could not find function "get_player_stats"
player_data <- get_player_matches_stats(player_id = 882)
Error in type.convert(player_data) : 
  the first argument must be of mode character
Sys.info()
                                                                                           sysname 
                                                                                          "Darwin" 
                                                                                           release 
                                                                                          "18.6.0" 
                                                                                           version 
"Darwin Kernel Version 18.6.0: Thu Apr 25 23:16:27 PDT 2019; root:xnu-4903.261.4~2/RELEASE_X86_64" 
                                                                                          nodename 
                                                                                        "---" 
                                                                                           machine 
                                                                                          "x86_64" 
                                                                                             login 
                                                                                             "--" 
                                                                                              user 
                                                                                             "--" 
                                                                                    effective_user 
                                                                                             "--" 


Wrong entries for home/away (h_a)

First of all thank you for this package!

As the title states, home/away is wrongly allocated for some teams on some matchdays. When I downloaded the data for the German Bundesliga and ran some summaries, I found that in 2019 (the season before the current one) some teams had fewer than 17 home or away matches. For 2014 to 2018 it seems to be fine.
Therefore, I had a closer look at week 24 of the 2019 season. The object team_data has all seasons. In week 24 of season 2019, Cologne and Bremen are listed as away teams (they should be home) and M.Gladbach as the home team (it should be away).
(please see https://www.sport.de/fussball/deutschland-bundesliga/se31723/2019-2020/ro100673/spieltag/md8/ergebnisse-und-tabelle/)

Thanks and kind regards

team_data %>%
  filter(year == 2019 & week == 24) %>%
  select(year, team_name, h_a, pts, npxGD, week) %>%
  arrange(h_a)
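One way to run the check described in this issue across a whole season is to count home and away matches per team and flag any imbalance. The sketch below uses a small made-up data frame in place of the reporter's `team_data` (the team names and rows are hypothetical), and it assumes a completed double round-robin season, where home and away counts should be equal.

```r
# Hypothetical stand-in for the issue's `team_data`; only the columns
# needed for the check are included.
team_data <- data.frame(
  year      = 2019,
  team_name = rep(c("Team A", "Team B"), each = 4),
  h_a       = c("h", "a", "h", "a",  "h", "h", "h", "a"),
  stringsAsFactors = FALSE
)

# Cross-tabulate matches by team and venue within one season.
season <- team_data[team_data$year == 2019, ]
venue_counts <- as.data.frame.matrix(table(season$team_name, season$h_a))

# In a finished double round-robin, home and away counts should match.
venue_counts$balanced <- venue_counts$h == venue_counts$a
venue_counts
```

Rows where `balanced` is FALSE point at teams whose h_a values deserve a closer look, although an unfinished season would also show up as unbalanced.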

Could not find function "get_data_element"

get_match_shots(match_id = 11662)
Error in get_data_element(match_data, "shotsData") :
could not find function "get_data_element"
In addition: Warning messages:
1: In if (wmc == TRUE & type != "score") { :
the condition has length > 1 and only the first element will be used
2: In if (gf == TRUE & type != "score") { :
the condition has length > 1 and only the first element will be used

I would have thought this is a function defined within the understatr package. Could you assist, please?

Connection Timeout Issue

Getting connection timeout errors upon trying to run each command - any help appreciated!

get_leagues_meta()
Error in open.connection(x, "rb") :
Timeout was reached: Connection timed out after 10000 milliseconds
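If the timeouts are intermittent, retrying the call a few times may be enough. The helper below is a minimal sketch in base R; `with_retry` is a hypothetical name and not part of understatr, and it will not help if the site is unreachable or blocked from your network.

```r
# Hypothetical helper: call `f()` up to `times` times, pausing `wait`
# seconds between attempts, and re-raise the last error if all fail.
with_retry <- function(f, times = 3, wait = 1) {
  stopifnot(times >= 1)
  for (i in seq_len(times)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    message("Attempt ", i, " failed: ", conditionMessage(result))
    if (i < times) Sys.sleep(wait)
  }
  stop(result)
}

# Usage (needs the package and a network connection, so left commented out):
# leagues <- with_retry(function() understatr::get_leagues_meta())
```

Wrapping the call in a function keeps the helper generic, so the same wrapper works for any of the package's getters.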

Error in get_league_team_stats

get_league_team_stats(league_name = 'Ligue 1', year = 2020) throws this error:

Error in rbind(deparse, level, ...) : numbers of columns of arguments do not match.

url encoding in `get_league_teams_stats()`

It seems that there is an issue with La liga, Serie A, and Ligue 1 with get_league_teams_stats(), presumably due to the space in the league names. (I've used this function before for these leagues and did not have the same issue, so perhaps there was a change on the understat website?)

Wrapping the url generating line with URLencode() seems to fix the issue.

library(understatr)
get_league_teams_stats('EPL', 2020)
#> # A tibble: 760 x 25
#>    h_a      xG   xGA  npxG  npxGA  deep deep_allowed scored missed  xpts result
#>    <chr> <dbl> <dbl> <dbl>  <dbl> <int>        <int>  <int>  <int> <dbl> <chr> 
#>  1 h     0.805 0.850 0.805 0.0885    17            2      1      0 1.16  w     
#>  2 a     2.03  0.535 2.03  0.535     10            5      3      0 2.46  w     
#>  3 h     3.08  1.66  3.08  1.66       7           18      7      2 2.26  w     
#>  4 a     0.874 0.672 0.874 0.672      7            4      1      0 1.53  w     
#>  5 h     1.50  2.38  1.50  2.38       7           20      0      3 0.824 l     
#>  6 h     2.45  1.00  1.69  1.00       5            2      3      4 2.39  l     
#>  7 a     1.99  1.39  1.99  1.39      16            6      3      0 1.81  w     
#>  8 h     1.77  1.50  1.77  1.50       6            4      1      2 1.62  l     
#>  9 a     2.39  0.572 1.63  0.572      8            2      1      2 2.68  l     
#> 10 a     1.27  1.14  0.508 1.14       5            6      1      0 1.49  w     
#> # ... with 750 more rows, and 14 more variables: date <date>, wins <int>,
#> #   draws <int>, loses <int>, pts <int>, npxGD <dbl>, ppda.att <int>,
#> #   ppda.def <int>, ppda_allowed.att <int>, ppda_allowed.def <int>,
#> #   team_id <chr>, team_name <chr>, league_name <chr>, year <dbl>
get_league_teams_stats('La liga', 2020)
#> Error in open.connection(x, "rb"): HTTP error 400.

library(stringr)
library(rvest)
library(jsonlite)
library(tibble)
get_league_teams_stats2 <- function(league_name, year) {
  
  stopifnot(is.character(league_name))
  
  home_url <- "https://understat.com"
  # construct league url
  league_url <- URLencode(stringr::str_glue("{home_url}/league/{league_name}/{year}"))
  
  # read league page
  league_page <- rvest::read_html(league_url)
  
  # locate script tags
  teams_data <- understatr:::get_script(league_page)
  
  # isolate teams data
  teams_data <- understatr:::get_data_element(teams_data, "teamsData")
  
  # pick out JSON string
  teams_data <- sub(".*?\\'(.*)\\'.*", "\\1", teams_data)
  
  # parse JSON
  teams_data <- jsonlite::fromJSON(teams_data, simplifyDataFrame = TRUE,
                                   flatten = TRUE)
  
  # get teams data
  teams_data <- lapply(
    teams_data, function(x) {
      df <- x$history
      df$team_id <- x$id
      df$team_name <- x$title
      df
    })
  
  # convert to df
  teams_df <- do.call("rbind", teams_data)
  
  # add reference fields
  teams_df$league_name <- league_name
  teams_df$year <- as.numeric(year)
  
  # fix col classes
  teams_df$date <- as.Date(teams_df$date, "%Y-%m-%d")
  
  tibble::as_tibble(teams_df)
  
}

get_league_teams_stats2('La liga', 2020)
#> # A tibble: 760 x 25
#>    h_a      xG   xGA  npxG npxGA  deep deep_allowed scored missed  xpts result
#>    <chr> <dbl> <dbl> <dbl> <dbl> <int>        <int>  <int>  <int> <dbl> <chr> 
#>  1 a     2.21  1.10  2.21  1.10     13            3      3      1 2.15  w     
#>  2 h     1.05  0.375 1.05  0.375     8            0      1      0 1.95  w     
#>  3 a     1.26  1.06  1.26  1.06      2            3      1      1 1.53  d     
#>  4 a     0.699 1.09  0.699 1.09      5            5      0      1 0.946 l     
#>  5 h     1.99  0.198 1.99  0.198     5            2      0      1 2.74  l     
#>  6 a     1.15  1.33  1.15  1.33      6            3      1      2 1.17  l     
#>  7 h     1.15  0.499 0.408 0.499     3            5      1      0 2.06  w     
#>  8 h     2.23  2.08  2.23  2.08      5            3      4      2 1.46  w     
#>  9 a     1.70  0.168 1.70  0.168    11            2      1      0 2.65  w     
#> 10 h     0.534 1.70  0.534 1.70      6            4      0      1 0.397 l     
#> # ... with 750 more rows, and 14 more variables: date <date>, wins <int>,
#> #   draws <int>, loses <int>, pts <int>, npxGD <dbl>, ppda.att <int>,
#> #   ppda.def <int>, ppda_allowed.att <int>, ppda_allowed.def <int>,
#> #   team_id <chr>, team_name <chr>, league_name <chr>, year <dbl>

Created on 2021-08-28 by the reprex package (v2.0.0)
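The effect of the URLencode() fix can be seen in isolation, without touching the network. This small sketch builds the same league URL with base R's paste0() instead of str_glue() (a substitution made here to avoid extra dependencies) and shows how the space in the league name is escaped.

```r
home_url <- "https://understat.com"
league_name <- "La liga"
year <- 2020

# The raw URL contains a literal space, which is what triggers HTTP 400.
raw_url <- paste0(home_url, "/league/", league_name, "/", year)

# URLencode() escapes the space but leaves ":" and "/" alone by default.
encoded_url <- utils::URLencode(raw_url)
encoded_url
#> [1] "https://understat.com/league/La%20liga/2020"
```

League names without spaces, such as EPL, pass through unchanged, so applying the encoding unconditionally should be safe.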

get_team_players_stats returns error

understatr::get_team_players_stats(team = "Chelsea", 2019)
#> Error in type.convert(players_data, as.is = TRUE): the first argument must be of mode character

Error in open.connection(x, "rb") : SSL certificate problem: certificate has expired error

I'm struggling to retrieve data from understat. I get the following error: Error in open.connection(x, "rb") : SSL certificate problem: certificate has expired. I attached more info below.

library(understatr)
library(reprex)

get_leagues_meta()
#> Error in open.connection(x, "rb"): SSL certificate problem: certificate has expired
get_team_players_stats(team_name = "Manchester City", year = 2018)
#> Error in open.connection(x, "rb"): SSL certificate problem: certificate has expired
get_player_seasons_stats(player_id = 2371)
#> Error in open.connection(x, "rb"): SSL certificate problem: certificate has expired
get_league_teams_stats('EPL', 2020)
#> Error in open.connection(x, "rb"): SSL certificate problem: certificate has expired
reprex()

R version 3.6.3 (2020-02-29)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
understatr package version 1.0.1.90

Any help would be appreciated, thanks!

Can't install understatr - Error : object 'str_glue' is not exported by 'namespace:stringr'


I'm getting this problem when trying to install the package in R. Has anyone come across this, or am I missing something?

> remotes::install_github('ewenme/understatr')
Downloading GitHub repo ewenme/understatr@master
* installing *source* package 'understatr' ...
** R
** preparing package for lazy loading
Error : object 'str_glue' is not exported by 'namespace:stringr'
ERROR: lazy loading failed for package 'understatr'
* removing 'C:/Program Files/Microsoft/ML Server/R_SERVER/library/understatr'

Thank You.
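An error like this usually means the installed stringr does not provide str_glue(), and one plausible cause (an assumption here, not confirmed in the thread) is an outdated stringr in that R library. The check below is a small sketch using base R only; `exports_function` is a hypothetical helper name.

```r
# Hypothetical helper: TRUE if package `pkg` is installed and exports `fn`.
exports_function <- function(pkg, fn) {
  requireNamespace(pkg, quietly = TRUE) && fn %in% getNamespaceExports(pkg)
}

if (!exports_function("stringr", "str_glue")) {
  message("stringr is missing or does not export str_glue(); ",
          "updating it with install.packages(\"stringr\") may help.")
}
```

If the check reports a problem, updating stringr and then re-running the install_github() call would be the first thing to try.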

SSL certificate problem

For each function the same problem:
Error in open.connection(x, "rb") : 
  SSL certificate problem: certificate has expired

type.convert error (similar to issues #7 & #10)

I'm getting the same type.convert error reported in issues #7 and #10

get_league_teams_stats(league_name = "EPL", year = 2018)
#> Error in get_league_teams_stats(league_name = "EPL", year = 2018): could not find function "get_league_teams_stats"
get_player_stats(player_id = 882)
#> Error in get_player_stats(player_id = 882): could not find function "get_player_stats"
get_player_matches_stats(player_id = 882)
#> Error in get_player_matches_stats(player_id = 882): could not find function "get_player_matches_stats"
Sys.info()
#>                                       sysname 
#>                                       "Linux" 
#>                                       release 
#>                           "4.15.0-91-generic" 
#>                                       version 
#> "#92-Ubuntu SMP Fri Feb 28 11:09:48 UTC 2020" 
Created on 2020-03-24 by the reprex package (v0.3.0)

Also, I was able to make the functions get_league_teams_stats() & get_player_matches_stats() work by removing the type.convert line from the code (as was the case with leobeta92 from issue #10)
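Rather than dropping the conversion entirely, a possible alternative is to convert column by column. The message "the first argument must be of mode character" is what type.convert() reports when it is handed a data frame on R versions where it only accepts vectors (support for data frames is believed to have arrived in R 3.5.0, but treat that as an assumption). The sketch below uses a made-up `players_data` frame, not real package output.

```r
# Hypothetical stand-in for the scraped data: every column arrives as text.
players_data <- data.frame(
  player_name = c("Player A", "Player B"),
  goals       = c("21", "17"),
  xG          = c("19.9", "15.9"),
  stringsAsFactors = FALSE
)

# Convert each column separately; type.convert() on a character vector
# works on old and new R alike. as.is = TRUE keeps strings as character.
players_data[] <- lapply(players_data, function(col) {
  utils::type.convert(as.character(col), as.is = TRUE)
})

str(players_data)
```

Assigning into `players_data[]` keeps the object a data frame while replacing every column with its converted version.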

Trailing Garbage

Hiya,

Having a play with this package (it's great thanks). But have come across this error message:

"Error: parse error: trailing garbage
Chain":"0","xGBuildup":"0"}] [] [] []
(right here) ------^"

The code which triggers this is below:

# get EPL team data
epl_team_stats <- get_league_teams_stats(league_name = "EPL", year = 2018)
epl_team_stats

# get EPL player data
epl_player_stats <- purrr::map_dfr(unique(epl_team_stats$team_name), get_team_players_stats, year = 2018)

# determine the historical average delta_xG for EPL players
n <- epl_player_stats$player_id

xG_coeff_calc <- function(n) {
  Player <- data.frame(get_player_seasons_stats(player_id = n))
  Player$delta_xG <- Player$goals - Player$xG
  Player <- subset(Player, select = c("xG", "goals", "delta_xG"))
  mean(Player[, "delta_xG"])
}

epl_player_stats$xG_coeff <- lapply(n, xG_coeff_calc)

I'm quite new to all of this but I think this problem originates from the code in the package? I'm not sure though... Do you know of a solution?

Thanks!
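To narrow down which player triggers the parse error, one option is to catch errors per player instead of letting the whole loop stop. The sketch below is illustrative only: `safe_mean_delta_xG` and `fake_fetch` are hypothetical names, and the fake fetcher stands in for get_player_seasons_stats() so the example runs offline.

```r
# Hypothetical wrapper: returns NA instead of stopping when one player fails.
safe_mean_delta_xG <- function(player_id, fetch) {
  tryCatch({
    player <- as.data.frame(fetch(player_id))
    mean(player$goals - player$xG)
  }, error = function(e) {
    message("player_id ", player_id, " failed: ", conditionMessage(e))
    NA_real_
  })
}

# Stand-in fetcher so the sketch runs offline; id 2 simulates the parse error.
fake_fetch <- function(player_id) {
  if (player_id == 2) stop("parse error: trailing garbage")
  data.frame(goals = c(10, 5), xG = c(8, 6))
}

ids <- c(1, 2, 3)
xG_coeff <- vapply(ids, safe_mean_delta_xG, numeric(1), fetch = fake_fetch)

# Player ids worth inspecting by hand.
ids[is.na(xG_coeff)]
```

With the real package, passing `fetch = understatr::get_player_seasons_stats` and the player_id column as `ids` should work the same way, and vapply() returns a plain numeric vector rather than the list that lapply() produces.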
