This is a collection of tools used to help with the annual March Madness bracket prediction tradition. This is a rewrite of the ancient C# tools for scraping and data aggregation.
The scope of this project includes tools for:
- scraping of files (downloaded web pages) for data
- Ruby 1.9
Given a file that contains data that can be found with regular expressions,
use the ./bin/scrape.rb
tool finding the tokens and outputting them in tab-
delimited format to stdout. Note that this could be done with a combination
of grep
and sed
to the same effect.
./bin/scrape.rb source_file regex [--headers]
Example:
./bin/scrape.rb data/teams.html "<a\s+[^>]*href=['\"]?[^'\"\s]*/team/_/id/(?<team_id>\d+)/[^'\"\s]*['\"]?[^>]*>\s*(?<team_name>[^<]*?)\s*</a>"
... creates a tab-delimited file with two columns: team_id
and team_name
.
Add the argument --headers
to the end to add a row at the top of the file
for team_id
and team_name
.
All steps in this example workflow assume you are working from ./data
.
curl http://example.com/teams > teams.html
../bin/scrape.rb teams.html "<a\s+[^>]*href=['\"]?[^'\"\s]*/team/_/id/(?<team_id>\d+)/[^'\"\s]*['\"]?[^>]*>\s*([^<]*?)\s*</a>" > team_ids.tsv
cat team_ids.tsv | while read line; do curl "http://example.com/team/schedule/id/$line" > schedules_$line.html; sleep 60; done
ls schedules_*.html | while read line; do id=$(echo $line | grep -oP '\d+'); ../bin/scrape.rb $line "<a\s+[^>]*href=['\"]/ncb/recap\?gameId=(?<game_id>\d+)" > game_list_$id.tsv; done
cat game_list_*.tsv | sort | uniq | while read line; do curl "http://example.com/boxscore?gameId=$line" > boxscore_$line.html; sleep 60; curl "http://example.com/playbyplay?gameId=$line" > playbyplay_$line.html; sleep 60; done
Note: to skip files that have already been downloaded, use the following:
cat game_list_*.tsv | sort | uniq | while read line; do if [ ! -f boxscore_$line.html ]; then curl "http://example.com/boxscore?gameId=$line" > boxscore_$line.html; sleep 30; fi; if [ ! -f playbyplay_$line.html ]; then curl "http://example.com/playbyplay?gameId=$line" > playbyplay_$line.html; sleep 30; fi; done