
nba_scraper's Introduction


This package is no longer maintained as of 2021/01/30. Any outstanding or new issues will not be fixed.

nba_scraper

This is a Python package that scrapes the NBA's API and produces the play by play of games as either a csv file or a pandas dataframe. The package has two main functions: scrape_game, which scrapes an individual game or a list of specific games, and scrape_season, which scrapes an entire season of regular season games.

The scraper goes back to the 1999-2000 season and will pull the play by play along with who was on the court at the time of each play. Various other statistics may be calculated as well.

As of version 1.0.8 the scraper can scrape WNBA games as well as NBA games. Just call wnba_scrape_game instead of scrape_game; the parameters and usage are exactly the same as the scrape_game function. It is confirmed to go back to the 2005 season, possibly further, but that hasn't been tested. Be warned that it is much slower than the NBA scraper due to the extra API calls needed to pull in player names, which are readily available in the NBA API itself.
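
A quick usage sketch, mirroring the scrape_game examples below; the WNBA game id used here is a placeholder for illustration, not a verified id.

import nba_scraper.nba_scraper as ns

# same call pattern as scrape_game; the game id below is a placeholder
wnba_df = ns.wnba_scrape_game([1021800001])

# or write a csv instead, exactly as with scrape_game
ns.wnba_scrape_game([1021800001], data_format='csv', data_dir='file/path')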

Installation

To install this package just type this at the command line:

pip install nba_scraper

Usage

scrape_game

The default data format is a pandas dataframe; you can change this to csv with the data_format parameter. The default file path is the user's home directory; you can change this with the data_dir parameter.

import nba_scraper.nba_scraper as ns

# if you want to return a dataframe
# you can pass the function a list of strings or integers
# all nba game ids have two leading zeros, but you can omit these
# to make it easier to create lists of game ids; the scraper adds them back on
nba_df = ns.scrape_game([21800001, 21800002])

# if you want a csv: if you don't pass a file path, the default is the
# home directory
ns.scrape_game([21800001, 21800002], data_format='csv', data_dir='file/path')

scrape_season

The data_format and data_dir keywords are used the exact same way as in scrape_game. Instead of game ids, though, you pass the season you want scraped to the function. The season is a four-digit year and must be an integer.

import nba_scraper.nba_scraper as ns

#scrape a season
nba_df = ns.scrape_season(2019)

# if you want a csv: if you don't pass a file path, the default is the
# home directory
ns.scrape_season(2019, data_format='csv', data_dir='file/path')

scrape_date_range

This allows you to scrape all regular season games in the date range passed to the function. As of right now it will not scrape playoff games. Dates must be passed in the format YYYY-MM-DD.

import nba_scraper.nba_scraper as ns

# scrape a date range
nba_df = ns.scrape_date_range('2019-01-01', '2019-01-03')

# if you want a csv: if you don't pass a file path, the default is the
# home directory
ns.scrape_date_range('2019-01-01', '2019-01-03', data_format='csv', data_dir='file/path')

Contact

If you have any troubles or bugs, please open an issue/bug report. If you have any improvements/suggestions, please submit a pull request. If it falls outside those two areas, please feel free to email me at [email protected].

nba_scraper's People

Contributors

dependabot[bot], harryshomer, mcbarlowe


nba_scraper's Issues

get_date_games function not pulling games before game id 021800110

The get_date_games function in the scrape_functions module is not pulling the early-season game_ids for the 2019 season. I discovered this while writing integration tests for the function. I'm assigning this to myself, but if someone wants to jump on it I'd be happy to accept a pull request. @HarryShomer, any advice would help, but if you're busy don't worry, I'll handle it. All tests in the test_integration.py file will need to pass before merging into master.

Error scraping game 21800053

It looks like, for some reason, either a player is not found in the dataframe or the lineups are returning an empty list; this needs to be looked at.

>>> test = ns.scrape_game([21800053])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mbarlowe/code/python/nba_scraper/nba_scraper/nba_scraper.py", line 98, in scrape_game
    scraped_games.append(sf.main_scrape(f"00{game}"))
  File "/Users/mbarlowe/code/python/nba_scraper/nba_scraper/scrape_functions.py", line 1138, in main_scrape
    game_df))
  File "/Users/mbarlowe/code/python/nba_scraper/nba_scraper/scrape_functions.py", line 959, in get_lineup
    ['player1_name'].unique()[0]) for x in away_lineups[0]]
  File "/Users/mbarlowe/code/python/nba_scraper/nba_scraper/scrape_functions.py", line 959, in <listcomp>
    ['player1_name'].unique()[0]) for x in away_lineups[0]]
IndexError: index 0 is out of bounds for axis 0 with size 0
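
A minimal defensive sketch of the kind of guard that would avoid this IndexError, assuming the pandas dataframe and column names shown in the traceback; the helper name is illustrative, not the package's actual fix.

import pandas as pd

# illustrative guard: only look up a name when the player id actually
# appears in the play-by-play dataframe, otherwise fall back to the raw id
def id_name_pairs(dataframe, player_ids):
    pairs = []
    for player_id in player_ids:
        names = dataframe.loc[dataframe["player1_id"] == player_id, "player1_name"].unique()
        pairs.append((player_id, names[0] if len(names) else str(player_id)))
    return pairs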

Add functionality for all NBA seasons on API

The NBA API goes back to the 1999 season. The main issue with pulling those seasons in is that the data.nba.com API, which has the xy locations for events, only goes back 4 years. Work on removing all calls to that API for seasons older than 2016.
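
A hedged sketch of one way such a gate could look, assuming the season start year can be read from the padded game id (e.g. "0021800001" encodes the 2018-19 season, "0029900026" the 1999-2000 season); the helper name is illustrative, not the package's actual implementation.

# hypothetical helper, not part of nba_scraper's public API
def needs_xy_locations(game_id, cutoff_season=2016):
    """Return True if the data.nba.com xy-location API should be called.

    The padded game id encodes the season start year in its fourth and
    fifth characters: "18" -> 2018-19, "99" -> 1999-2000.
    """
    two_digit_year = int(game_id[3:5])
    season_start = 1900 + two_digit_year if two_digit_year >= 90 else 2000 + two_digit_year
    return season_start >= cutoff_season

# only request xy locations for seasons the data.nba.com API still covers
if needs_xy_locations("0029900026"):
    pass  # safe to call the data.nba.com endpoint here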

Playoff Games: List Index Out of Range

I have been trying to get the play by play for playoff games and running into the following error:

nba_df = ns.scrape_game([41700151])
Scraping game id: 0041700151
Traceback (most recent call last):

  File "<ipython-input-28-ffa52b1d949b>", line 1, in <module>
    nba_df = ns.scrape_game([41700151])

  File "/anaconda3/lib/python3.7/site-packages/nba_scraper/nba_scraper.py", line 33, in scrape_game
    scraped_games.append(sf.scrape_pbp(f"00{game}"))

  File "/anaconda3/lib/python3.7/site-packages/nba_scraper/scrape_functions.py", line 635, in scrape_pbp
    clean_df = get_lineups(clean_df)

  File "/anaconda3/lib/python3.7/site-packages/nba_scraper/scrape_functions.py", line 211, in get_lineups
    away_ids_names = [(x, dataframe[dataframe['player1_id'] == x]['player1_name'].unique()[0]) for x in away_lineups[0]]

IndexError: list index out of range

From some very initial checking, it looks like regular season game ids start with a 2 while playoff game ids start with a 4.
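
Based on that observation, a small illustrative check of the game-type digit could look like the sketch below; only the regular season (2) and playoff (4) digits are taken from this report, and the helper itself is hypothetical.

# illustrative check of the third character of a zero-padded game id
GAME_TYPE_DIGITS = {"2": "regular season", "4": "playoffs"}

def game_type(game_id):
    """Classify a padded game id such as "0041700151" by its third character."""
    return GAME_TYPE_DIGITS.get(game_id[2], "unknown")

print(game_type("0041700151"))  # playoffs
print(game_type("0021800001"))  # regular season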

Issues with 19/20 season, game id 0020200577

The scraper seems to be having issues with the current season, specifically something to do with the game with id 0020200577.

Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    ns.scrape_date_range('2019-10-22', '2020-02-10', data_format='csv')
  File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\site-packages\nba_scraper\nba_scraper.py", line 78, in scrape_date_range
    scraped_games.append(sf.main_scrape(game))
  File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\site-packages\nba_scraper\scrape_functions.py", line 688, in main_scrape
    game_df = scrape_pbp(v2_dict)
  File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\site-packages\nba_scraper\scrape_functions.py", line 112, in scrape_pbp
    if pbp_v2_df.game_id.unique()[0] == "0020200577":

Thanks

Fix bugs in WNBA Scraper

The WNBA scraper works, but not for all games; the bugs that keep it from working for all games need to be fixed.

'HOME' KeyError on Windows

When running the scraper in the Windows Command Prompt, I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users*****\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\nba_scraper\nba_scraper.py", line 40, in <module>
    def scrape_date_range(date_from, date_to, data_format='pandas', data_dir=f"{os.environ['HOME']}/nbadata.csv"):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.7_3.7.1520.0_x64__qbz5n2kfra8p0\lib\os.py", line 679, in __getitem__
    raise KeyError(key) from None
KeyError: 'HOME'

Windows doesn't have the HOME environment variable; it uses USERPROFILE instead.
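One way the default path could be made cross-platform is os.path.expanduser, which resolves to USERPROFILE on Windows and HOME elsewhere; the signature below mirrors the traceback above, but this is only a sketch under that assumption, not necessarily the project's actual fix.

import os

# expanduser("~") works on both Windows and Unix, avoiding the KeyError
# raised by os.environ['HOME'] on Windows
def scrape_date_range(date_from, date_to, data_format='pandas',
                      data_dir=os.path.join(os.path.expanduser("~"), "nbadata.csv")):
    ...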

Issue scraping game 0029900026

Issue when scraping game 0029900026 due to insufficient players to fill out who is on the court; a fix is incoming. It is odd, though, that the API would return fewer than five players who played in the period.
Traceback (most recent call last):
  File "scrape_seasons.py", line 8, in <module>
    ns.scrape_game([game], data_format="csv", data_dir="seasons/19992000")
  File "/Users/MattBarlowe/.virtualenvs/historical_scrape/lib/python3.6/site-packages/nba_scraper/nba_scraper.py", line 109, in scrape_game
    scraped_games.append(sf.main_scrape(f"00{game}"))
  File "/Users/MattBarlowe/.virtualenvs/historical_scrape/lib/python3.6/site-packages/nba_scraper/scrape_functions.py", line 646, in main_scrape
    get_lineup(game_df[game_df["period"] == period].copy(), lineups, game_df,)
  File "/Users/MattBarlowe/.virtualenvs/historical_scrape/lib/python3.6/site-packages/nba_scraper/scrape_functions.py", line 621, in get_lineup
    period_df.iat[i, 75] = away_ids_names[4][0]
IndexError: list index out of range

Unable to scrape game 0021600559

Probably similar to issue #7, where a player did nothing in the game, so their name can't be pulled.

Traceback (most recent call last):
  File "get_season.py", line 8, in <module>
    ns.scrape_game([season], data_format='csv', data_dir=f'~/nbafiles/{season}nbapbp.csv')
  File "/Users/MattBarlowe/.virtualenvs/dataenv/lib/python3.6/site-packages/nba_scraper/nba_scraper.py", line 98, in scrape_game
    scraped_games.append(sf.main_scrape(f"00{game}"))
  File "/Users/MattBarlowe/.virtualenvs/dataenv/lib/python3.6/site-packages/nba_scraper/scrape_functions.py", line 1135, in main_scrape
    game_df))
  File "/Users/MattBarlowe/.virtualenvs/dataenv/lib/python3.6/site-packages/nba_scraper/scrape_functions.py", line 998, in get_lineup
    ['player1_name'].unique()[0]) for x in home_lineups[0]]
IndexError: list index out of range

Refactor Code

The source code needs to be refactored to allow proper testing for continuous integration, as the NBA API blacklists a lot of the IP addresses those services run on.

`nba_scraper` hangs at scrapetime

The nba_scraper package hangs while scraping the first game passed to it and never returns any data. This is due to the NBA API now requiring extra headers in the API call; this will be corrected in the next version, and a git commit will be pushed today.
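
A minimal sketch of sending browser-like headers with the request. The exact header set stats.nba.com requires changes over time, so these values, and the example endpoint, are assumptions rather than the headers the package actually adopted.

import requests

# treat these values as an assumption, not the package's exact fix
HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0 Safari/537.36"),
    "Referer": "https://stats.nba.com/",
    "Accept-Language": "en-US,en;q=0.9",
}

# example endpoint; the play-by-play URL the scraper actually hits may differ
url = "https://stats.nba.com/stats/playbyplayv2?GameID=0021800001&StartPeriod=0&EndPeriod=10"
response = requests.get(url, headers=HEADERS, timeout=30)
response.raise_for_status()
pbp_json = response.json()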

Getting a JSONDecodeError

I'm receiving a JSONDecodeError after game ID 0021800254 when scraping the 2019 season...

JSONDecodeError                           Traceback (most recent call last)
<ipython-input-5-fa487e61689b> in <module>
----> 1 nba_df = ns.scrape_season(2019)

/opt/anaconda3/lib/python3.8/site-packages/nba_scraper/nba_scraper.py in scrape_season(season, data_format, data_dir)
    189         else:
    190             print(f"Scraping game id: 00{game}")
--> 191             scraped_games.append(sf.main_scrape(f"00{game}"))
    192 
    193     if len(scraped_games) == 0:

/opt/anaconda3/lib/python3.8/site-packages/nba_scraper/scrape_functions.py in main_scrape(game_id)
    705         game_df = game_df[game_df["period"] < 5]
    706     for period in range(1, game_df["period"].max() + 1):
--> 707         lineups = get_lineup_api(game_id, period)
    708         periods.append(
    709             get_lineup(game_df[game_df["period"] == period].copy(), lineups, game_df,)

/opt/anaconda3/lib/python3.8/site-packages/nba_scraper/scrape_functions.py in get_lineup_api(game_id, period)
    373 
    374     lineups_req = requests.get(url, headers=USER_AGENT)
--> 375     lineup_req_dict = json.loads(lineups_req.text)
    376 
    377     return lineup_req_dict

/opt/anaconda3/lib/python3.8/json/__init__.py in loads(s, cls, object_hook, parse_float, parse_int, parse_constant,     object_pairs_hook, **kw)
    355             parse_int is None and parse_float is None and
    356             parse_constant is None and object_pairs_hook is None and not kw):
--> 357         return _default_decoder.decode(s)
    358     if cls is None:
    359         cls = JSONDecoder

/opt/anaconda3/lib/python3.8/json/decoder.py in decode(self, s, _w)
    335 
    336         """
--> 337         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    338         end = _w(s, end).end()
    339         if end != len(s):

/opt/anaconda3/lib/python3.8/json/decoder.py in raw_decode(self, s, idx)
    353             obj, end = self.scan_once(s, idx)
    354         except StopIteration as err:
--> 355             raise JSONDecodeError("Expecting value", s, err.value) from None
    356         return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
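
A hedged sketch of one way to guard that json.loads call: retry the request and only parse the body once it is actually JSON. The function and parameter names are illustrative, not the package's API.

import time
import requests

# illustrative retry wrapper around the lineup request shown in the traceback;
# url/headers stand in for whatever the scraper already builds
def get_json_with_retry(url, headers, retries=3, backoff=2.0):
    for attempt in range(retries):
        resp = requests.get(url, headers=headers, timeout=30)
        try:
            return resp.json()  # raises ValueError on an empty or non-JSON body
        except ValueError:
            # the stats API sometimes returns an empty/HTML body when throttled,
            # which produces "Expecting value: line 1 column 1 (char 0)"
            time.sleep(backoff * (attempt + 1))
    raise RuntimeError(f"no valid JSON after {retries} attempts: {url}")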

Invalid literal in get_lineup

I've been pulling data on a daily basis, but today I seem to be getting this error from the get_lineup function:

nba_df = ns.scrape_season(2019)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-fa487e61689b> in <module>
----> 1 nba_df = ns.scrape_season(2019)

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/nba_scraper/nba_scraper.py in scrape_season(season, data_format, data_dir)
    132     for game in game_ids:
    133         print(f"Scraping game id: 00{game}")
--> 134         scraped_games.append(sf.main_scrape(f"00{game}"))
    135 
    136     nba_df = pd.concat(scraped_games)

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/nba_scraper/scrape_functions.py in main_scrape(game_id)
   1177         periods.append(get_lineup(game_df[game_df['period'] == period].copy(),
   1178                                   home_lineup_dict, away_lineup_dict,
-> 1179                                   game_df))
   1180     game_df = pd.concat(periods).reset_index(drop=True)
   1181 

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/nba_scraper/scrape_functions.py in get_lineup(period_df, home_lineup_dict, away_lineup_dict, dataframe)
   1121             print('home_ids:', home_ids_names[0][1])
   1122             period_df.iat[i, 62] = home_ids_names[0][0]
-> 1123             period_df.iat[i, 61] = home_ids_names[0][1]
   1124             period_df.iat[i, 64] = home_ids_names[1][0]
   1125             period_df.iat[i, 63] = home_ids_names[1][1]

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/indexing.py in __setitem__(self, key, value)
   2285         key = list(self._convert_key(key, is_setter=True))
   2286         key.append(value)
-> 2287         self.obj._set_value(*key, takeable=self._takeable)
   2288 
   2289 

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/frame.py in _set_value(self, index, col, value, takeable)
   2809             if takeable is True:
   2810                 series = self._iget_item_cache(col)
-> 2811                 return series._set_value(index, value, takeable=True)
   2812 
   2813             series = self._get_item_cache(col)

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/series.py in _set_value(self, label, value, takeable)
   1221         try:
   1222             if takeable:
-> 1223                 self._values[label] = value
   1224             else:
   1225                 self.index._engine.set_value(self._values, label, value)

ValueError: invalid literal for int() with base 10: 'Al Horford'

Has the data changed on the NBA side? I've made a few changes on my end, but I don't think this comes from my code additions. I'll try to push those changes when I feel they're 100% necessary and correct.
