Coder Social home page Coder Social logo

march_madness's Introduction

March Madness Toolkit

This is a collection of tools used to help with the annual March Madness bracket prediction tradition. This is a rewrite of the ancient C# tools for scraping and data aggregation.

Tools

The scope of this project includes tools for:

  • scraping of files (downloaded web pages) for data

Dependencies

  • Ruby 1.9

Usage

Scraping for tokens in a file

Given a file that contains data that can be found with regular expressions, use the ./bin/scrape.rb tool finding the tokens and outputting them in tab- delimited format to stdout. Note that this could be done with a combination of grep and sed to the same effect.

./bin/scrape.rb source_file regex [--headers]

Example:

./bin/scrape.rb data/teams.html "<a\s+[^>]*href=['\"]?[^'\"\s]*/team/_/id/(?<team_id>\d+)/[^'\"\s]*['\"]?[^>]*>\s*(?<team_name>[^<]*?)\s*</a>"

... creates a tab-delimited file with two columns: team_id and team_name. Add the argument --headers to the end to add a row at the top of the file for team_id and team_name.

Example workflow

All steps in this example workflow assume you are working from ./data.

Download a list of teams from your favorite source

curl http://example.com/teams > teams.html

Scrape team IDs off of the downloaded page

../bin/scrape.rb teams.html "<a\s+[^>]*href=['\"]?[^'\"\s]*/team/_/id/(?<team_id>\d+)/[^'\"\s]*['\"]?[^>]*>\s*([^<]*?)\s*</a>" > team_ids.tsv

Download the schedule from each of the teams

cat team_ids.tsv | while read line; do curl "http://example.com/team/schedule/id/$line" > schedules_$line.html; sleep 60; done

Get the game IDs from each of the teams' schedules

ls schedules_*.html | while read line; do id=$(echo $line | grep -oP '\d+'); ../bin/scrape.rb $line "<a\s+[^>]*href=['\"]/ncb/recap\?gameId=(?<game_id>\d+)" > game_list_$id.tsv; done

Download all of the boxscores and play-by-play pages

cat game_list_*.tsv | sort | uniq | while read line; do curl "http://example.com/boxscore?gameId=$line" > boxscore_$line.html; sleep 60; curl "http://example.com/playbyplay?gameId=$line" > playbyplay_$line.html; sleep 60; done

Note: to skip files that have already been downloaded, use the following:

cat game_list_*.tsv | sort | uniq | while read line; do if [ ! -f boxscore_$line.html ]; then curl "http://example.com/boxscore?gameId=$line" > boxscore_$line.html; sleep 30; fi; if [ ! -f playbyplay_$line.html ]; then curl "http://example.com/playbyplay?gameId=$line" > playbyplay_$line.html; sleep 30; fi; done

march_madness's People

Contributors

warmer avatar

Watchers

 avatar

Forkers

aromatt

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.