yichenzhaonb / steam-scraper

This project forked from prncc/steam-scraper

Scrapers for games and game reviews from steampowered.com.

Home Page: https://intoli.com/blog/steam-scraper/

Shell 9.77% Python 90.23%

Steam Scraper

This repository contains Scrapy spiders for crawling products and scraping all user-submitted reviews from the Steam game store. A few scripts for more easily managing and deploying the spiders are included as well.

The code accompanies the article "Scraping the Steam Game Store", published on the Scrapinghub blog and the Intoli blog.

Installation

After cloning the repository with

git clone git@github.com:prncc/steam-scraper.git

start and activate a Python 3.6+ virtualenv with

cd steam-scraper
virtualenv -p python3.6 env
. env/bin/activate

Install Python requirements via:

pip install -r requirements.txt

On macOS, you can install Python 3.6 via Homebrew:

brew install python3

On Ubuntu, you can follow the instructions posted on askubuntu.com.

Crawling the Products

The purpose of ProductSpider is to discover product pages on the Steam product listing and extract useful metadata from them. A neat feature of this spider is that it automatically navigates through Steam's age verification checkpoints. You can initiate the multi-hour crawl with

scrapy crawl products -o output/products_all.jl --logfile=output/products_all.log --loglevel=INFO -s JOBDIR=output/products_all_job -s HTTPCACHE_ENABLED=False

When it completes you should have metadata for all games on Steam in output/products_all.jl. Here's some example output:

{
  'app_name': 'Cold Fear™',
  'developer': 'Darkworks',
  'early_access': False,
  'genres': ['Action'],
  'id': '15270',
  'metascore': 66,
  'n_reviews': 172,
  'price': 9.99,
  'publisher': 'Ubisoft',
  'release_date': '2005-03-28',
  'reviews_url': 'http://steamcommunity.com/app/15270/reviews/?browsefilter=mostrecent&p=1',
  'sentiment': 'Very Positive',
  'specs': ['Single-player'],
  'tags': ['Horror', 'Action', 'Survival Horror', 'Zombies', 'Third Person', 'Third-Person Shooter'],
  'title': 'Cold Fear™',
  'url': 'http://store.steampowered.com/app/15270/Cold_Fear/'
}
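
Once the crawl finishes, the feed can be inspected with a few lines of Python. The sketch below assumes that the .jl file holds one JSON object per line (Scrapy's JSON lines feed format) and that the field names match the sample above. The names load_products and genre_counts are illustrative and not part of this repository.

```python
import json
from collections import Counter


def load_products(path):
    """Read a JSON lines feed: one JSON object per non-empty line."""
    products = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                products.append(json.loads(line))
    return products


def genre_counts(products):
    """Count how often each genre appears; items without genres are skipped."""
    counts = Counter()
    for product in products:
        counts.update(product.get("genres") or [])
    return counts
```

For example, genre_counts(load_products("output/products_all.jl")).most_common(5) would list the five most frequent genres in the crawl.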

Extracting the Reviews

The purpose of ReviewSpider is to scrape all user-submitted reviews of a particular product from the Steam community portal. By default, it starts from the URLs listed in its test_urls attribute:

class ReviewSpider(scrapy.Spider):
    name = 'reviews'
    test_urls = [
        "http://steamcommunity.com/app/316790/reviews/?browsefilter=mostrecent&p=1",  # Grim Fandango
        "http://steamcommunity.com/app/207610/reviews/?browsefilter=mostrecent&p=1",  # The Walking Dead
        "http://steamcommunity.com/app/414700/reviews/?browsefilter=mostrecent&p=1"   # Outlast 2
    ]

It can alternatively ingest a text file containing URLs such as

http://steamcommunity.com/app/316790/reviews/?browsefilter=mostrecent&p=1
http://steamcommunity.com/app/207610/reviews/?browsefilter=mostrecent&p=1
http://steamcommunity.com/app/414700/reviews/?browsefilter=mostrecent&p=1

via the url_file command line argument:

scrapy crawl reviews -o reviews.jl -a url_file=url_file.txt -s JOBDIR=output/reviews
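
For orientation, here is a minimal sketch of how such a URL file could be read and how the product id relates to each URL. It is not the spider's actual code: read_start_urls and product_id_from_url are hypothetical helpers, and skipping blank lines is an assumption about the file format.

```python
import re


def read_start_urls(path):
    """Return the non-empty lines of a URL file, stripped of whitespace."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def product_id_from_url(url):
    """Extract the numeric app id from a Steam community review URL, or None."""
    match = re.search(r"/app/(\d+)/", url)
    return match.group(1) if match else None
```

A spider could iterate over read_start_urls(url_file) and issue one request per URL; the id returned by product_id_from_url is a string such as '414700'.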

An output sample:

{
  'date': '2017-06-04',
  'early_access': False,
  'found_funny': 5,
  'found_helpful': 0,
  'found_unhelpful': 1,
  'hours': 9.8,
  'page': 3,
  'page_order': 7,
  'product_id': '414700',
  'products': 179,
  'recommended': True,
  'text': '3 spooky 5 me',
  'user_id': '76561198116659822',
  'username': 'Fowler'
}

If you want to get all the reviews for all products, split_review_urls.py will remove duplicate entries from products_all.jl and shuffle the review URLs into several text files. This provides a convenient way to split up your crawl into manageable pieces. The whole job takes a few days with Steam's generous rate limits.
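
The idea of deduplicating and splitting can be sketched as follows. This is not the contents of split_review_urls.py: the function names are hypothetical, duplicates are assumed to be identified by their reviews_url field, and the output naming simply mirrors the review_urls_01.txt file used later in this document.

```python
import json
import random


def unique_review_urls(jl_path):
    """Collect each distinct reviews_url once, preserving first-seen order."""
    seen = set()
    urls = []
    with open(jl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            url = json.loads(line).get("reviews_url")
            if url and url not in seen:
                seen.add(url)
                urls.append(url)
    return urls


def split_into_chunks(urls, n_chunks, seed=None):
    """Shuffle the URLs and deal them round-robin into n_chunks lists."""
    shuffled = list(urls)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_chunks] for i in range(n_chunks)]


def write_chunks(chunks, prefix="output/review_urls_"):
    """Write each chunk to <prefix>01.txt, <prefix>02.txt, and so on."""
    paths = []
    for i, chunk in enumerate(chunks, start=1):
        path = f"{prefix}{i:02d}.txt"
        with open(path, "w", encoding="utf-8") as f:
            f.writelines(url + "\n" for url in chunk)
        paths.append(path)
    return paths
```

Shuffling before splitting spreads products with many reviews across the chunks, so no single file is likely to dominate the crawl time.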

Deploying to a Remote Server

This section briefly explains how to run the crawl on one or more t1.micro AWS instances.

First, create an Ubuntu 16.04 t1.micro instance and name it scrapy-runner-01 in your ~/.ssh/config file:

Host scrapy-runner-01
     User ubuntu
     HostName <server's IP>
     IdentityFile ~/.ssh/id_rsa

A hostname of this form is expected by the scrapydee.sh helper script included in this repository. Make sure you can connect with ssh scrapy-runner-01.

Remote Server Setup

The tool that will actually run the crawl is scrapyd running on the remote server. To set things up, first install Python 3.6:

sudo add-apt-repository ppa:jonathonf/python-3.6
sudo apt update
sudo apt install python3.6 python3.6-dev virtualenv python-pip

Then, install scrapyd and the remaining requirements in a dedicated run directory on the remote server:

mkdir run && cd run
virtualenv -p python3.6 env
. env/bin/activate
pip install scrapy scrapyd botocore smart_getenv

You can run scrapyd from the virtual environment with

scrapyd --logfile /home/ubuntu/run/scrapyd.log &

You may wish to use something like screen to keep the process alive if you disconnect from the server.

Controlling the Job

You can issue commands to the scrapyd process running on the remote machine using a simple HTTP JSON API. First, create an egg for this project:

python setup.py bdist_egg

Copy the egg and your review url file to scrapy-runner-01 via

scp output/review_urls_01.txt scrapy-runner-01:/home/ubuntu/run/
scp dist/steam_scraper-1.0-py3.6.egg scrapy-runner-01:/home/ubuntu/run

and add it to scrapyd's job directory via

ssh -f scrapy-runner-01 'cd /home/ubuntu/run && curl http://localhost:6800/addversion.json -F project=steam -F egg=@steam_scraper-1.0-py3.6.egg'

Opening port 6800 to TCP traffic coming from your home IP would allow you to issue this command without going through SSH. If this command doesn't work, you may need to edit scrapyd.conf to contain

bind_address = 0.0.0.0

in the [scrapyd] section. Note that there is also a scrapyd-client project for deploying eggs to scrapyd-equipped servers. I chose not to use it because it doesn't know about servers already set up in ~/.ssh/config and so requires repetitive configuration.

Finally, start the job with something like

ssh scrapy-runner-01 'curl http://localhost:6800/schedule.json -d project=steam -d spider=reviews -d url_file="/home/ubuntu/run/review_urls_01.txt" -d jobid=part_01 -d setting=FEED_URI="s3://'$STEAM_S3_BUCKET'/%(name)s/part_01/%(time)s.jl" -d setting=AWS_ACCESS_KEY_ID='$AWS_ACCESS_KEY_ID' -d setting=AWS_SECRET_ACCESS_KEY='$AWS_SECRET_ACCESS_KEY' -d setting=LOG_LEVEL=INFO'

This command assumes you have set up an S3 bucket and the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables. It should be pretty easy to customize it for non-S3 output, however.
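
The same request can be issued from Python instead of curl. The sketch below uses only the standard library and relies on scrapyd's schedule.json conventions: Scrapy settings are sent as repeated setting=NAME=VALUE fields, and any other field is passed to the spider as an argument. The helper names build_schedule_payload and schedule_job are hypothetical, and the default base URL assumes the code runs where port 6800 is reachable, for example on the server itself.

```python
import json
import urllib.parse
import urllib.request


def build_schedule_payload(project, spider, jobid, spider_args=None, settings=None):
    """Build the form-encoded body for scrapyd's schedule.json endpoint."""
    fields = [("project", project), ("spider", spider), ("jobid", jobid)]
    # Extra fields are forwarded to the spider as arguments (e.g. url_file).
    fields.extend((spider_args or {}).items())
    # Each Scrapy setting becomes its own 'setting=NAME=VALUE' field.
    fields.extend(
        ("setting", f"{name}={value}") for name, value in (settings or {}).items()
    )
    return urllib.parse.urlencode(fields).encode("utf-8")


def schedule_job(payload, base_url="http://localhost:6800"):
    """POST the payload to scrapyd and return the decoded JSON response."""
    request = urllib.request.Request(f"{base_url}/schedule.json", data=payload)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A call such as build_schedule_payload("steam", "reviews", "part_01", {"url_file": "/home/ubuntu/run/review_urls_01.txt"}, {"LOG_LEVEL": "INFO"}) produces the body for a job like the one above, minus the S3 settings; passing the result to schedule_job sends it.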

The scrapydee.sh helper script included in the scripts directory of this repository has some shortcuts for issuing commands to scrapyd-equipped servers with hostnames of the form scrapy-runner-01. For example, the command

./scripts/scrapydee.sh status 1
# Executing status()...
# On server(s): 1.

will run the status() function defined in scrapydee.sh on scrapy-runner-01. See that file for more command examples. You can also run each of the included commands on multiple servers. First, change the all() function within scrapydee.sh to match the number of servers you have configured. Then, issue a command such as

./scripts/scrapydee.sh status all

The output is a bit messy, but it's a quick and easy way to run this job.
