scrapy-bench

Benchmarking CLI for Scrapy

(The project is still in development.)

A command-line interface for benchmarking Scrapy that reflects real-world usage.

Why?

  • Currently, the built-in scrapy bench command simply spawns a spider that aggressively crawls randomly generated links at high speed.
  • The speed thus obtained, while useful for rough comparisons, does not actually reflect a real-world scenario.
  • The actual speed varies with the Python version and the Scrapy version.

Current Features

  • Spawns a CPU-intensive spider that follows a fixed number of links on a static snapshot of the site Books to Scrape.
  • Follows a real-world scenario in which various details of each book are extracted and stored in a .csv file.
  • A broad crawl benchmark that uses 1000 copies of the site Books to Scrape, generated dynamically using Twisted. The server script (server.py) is included in the repository.
  • A micro-benchmark that tests LinkExtractor() by extracting links from a collection of HTML pages.
  • A micro-benchmark that tests extraction using CSS selectors on a collection of HTML pages.
  • A micro-benchmark that tests extraction using XPath on a collection of HTML pages.
  • Profiling of the benchmarks with vmprof, with optional upload of the results to the vmprof website.
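
Each micro-benchmark amounts to timing an extraction function over a corpus of pages. A rough, dependency-free sketch of the link-extraction case (the real benchmark uses Scrapy's LinkExtractor, not the stdlib parser used here):

```python
import time
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect href values from <a> tags (a crude stand-in for LinkExtractor)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def benchmark_links(pages):
    """Time link extraction over a list of HTML strings; return (links, seconds)."""
    start = time.perf_counter()
    links = []
    for html in pages:
        parser = AnchorCollector()
        parser.feed(html)
        links.extend(parser.links)
    return links, time.perf_counter() - start

# A toy corpus standing in for the collection of downloaded pages.
pages = ['<a href="page1.html">one</a><a href="page2.html">two</a>'] * 100
links, elapsed = benchmark_links(pages)
print(len(links), "links extracted in", elapsed, "seconds")
```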

Options

  • --n-runs option to perform more than one iteration of the spider, improving precision.
  • --only_result option to display only the results.
  • --upload_result option to upload the results to a local Codespeed instance for easier comparison.
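
The effect of --n-runs is essentially repeated sampling: taking several timed readings of the same workload and aggregating them is less noisy than a single reading. A minimal sketch of the idea, with a toy workload standing in for a spider run (none of this is scrapy-bench's actual code):

```python
import statistics
import time

def time_once(workload):
    """Time a single run of a workload callable, in seconds."""
    start = time.perf_counter()
    workload()
    return time.perf_counter() - start

def time_many(workload, n_runs):
    """Take n_runs readings and summarize them, mirroring the --n-runs idea."""
    readings = [time_once(workload) for _ in range(n_runs)]
    return {
        "mean": statistics.mean(readings),
        "stdev": statistics.stdev(readings) if n_runs > 1 else 0.0,
        "runs": n_runs,
    }

# Toy CPU-bound workload in place of a real spider run.
result = time_many(lambda: sum(range(100_000)), n_runs=5)
print(result)
```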

Installation

For Ubuntu

  • First, download the static snapshot of the website Books to Scrape. This can be done using wget:

      wget --mirror --convert-links --adjust-extension --page-requisites --no-parent \
          http://books.toscrape.com/index.html
    
  • Then link the downloaded site into /var/www/html:

      sudo ln -s `pwd`/books.toscrape.com/ /var/www/html/
    
  • nginx is required to serve the website, so it must be installed and configured. Once it is running, you should be able to see the site in your browser.

  • If nginx is not yet installed, run:

      sudo apt-get update
      sudo apt-get install nginx
    
  • For the broad crawl, use the server.py file to serve the various copies of the local Books to Scrape mirror, which should already be in /var/www/html.

  • Add the following entries to the /etc/hosts file:

    127.0.0.1    domain1
    127.0.0.1    domain2
    127.0.0.1    domain3
    127.0.0.1    domain4
    127.0.0.1    domain5
    127.0.0.1    domain6
    127.0.0.1    domain7
    127.0.0.1    domain8
    ....................
    127.0.0.1    domain1000
    
  • This points sites such as http://domain1:8880/index.html to the original site served at http://localhost:8880/index.html.
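
Typing a thousand entries by hand is error-prone, so a short script can generate the block instead. A sketch (the gen_hosts.py name is just an example; append its output with something like `python gen_hosts.py | sudo tee -a /etc/hosts`):

```python
# Print the domain1..domain1000 /etc/hosts entries used by the broad crawl.
entries = [f"127.0.0.1    domain{i}" for i in range(1, 1001)]
print("\n".join(entries))
```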

There are 130 HTML files in sites.tar.gz, which were downloaded using download.py from the Alexa top sites list.

There are 200 HTML files in bookfiles.tar.gz, which were downloaded using download.py from the website Books to Scrape.

The download.py spider dumps each response body as Unicode text to a file. The list of top sites was taken from Alexa.
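
That dump step amounts to writing each response body out as a text file. A hedged sketch of the behavior (download.py itself is a Scrapy spider; the helper name and filenames here are illustrative only):

```python
import pathlib
import tempfile

def dump_bodies(responses, outdir):
    """Write each (name, body) pair to a UTF-8 text file, mirroring download.py's dump step."""
    outdir = pathlib.Path(outdir)
    written = []
    for name, body in responses:
        path = outdir / f"{name}.html"
        path.write_text(body, encoding="utf-8")
        written.append(path.name)
    return written

with tempfile.TemporaryDirectory() as tmp:
    names = dump_bodies([("site1", "<html>one</html>"), ("site2", "<html>two</html>")], tmp)
print(names)
```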

  • Do the following to complete the installation:

    git clone https://github.com/scrapy/scrapy-bench.git  
    cd scrapy-bench/  
    virtualenv env  
    . env/bin/activate   
    pip install --editable .
    

Usage

Usage: scrapy-bench [OPTIONS] COMMAND [ARGS]...

  A benchmark suite for Scrapy.

Options:
  --n-runs INTEGER  Take multiple readings for the benchmark.
  --only_result     Display only the results.
  --upload_result   Upload the results to a local Codespeed instance.
  --book_url TEXT   Use with the bookworm command. The URL of books.toscrape.com on your local machine.
  --vmprof          Profile the benchmarker with vmprof.
  --help            Show this message and exit.

Commands:
  bookworm       Spider to scrape locally hosted site
  broadworm      Broad crawl spider to scrape locally hosted...
  cssbench       Micro-benchmark for extraction using css
  linkextractor  Micro-benchmark for LinkExtractor()
  xpathbench     Micro-benchmark for extraction using xpath

Contributors

parth-vader, malloxpb, lopuhin, redapple

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.