ScrapeMeAgain

ScrapeMeAgain is a Python 3 web scraper. It uses multithreading (ThreadPoolExecutor) and multiprocessing to get the work done faster, and it stores the collected data in an SQLite database.
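
As a rough illustration of the thread-pool-plus-SQLite pattern described above, here is a minimal sketch. It is not ScrapeMeAgain's actual code: `fetch`, `scrape`, and the `pages` table are hypothetical names, `fetch` only simulates a download, and the multiprocessing layer is left out.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    # Hypothetical stand-in for a real HTTP request: returns (url, fake size).
    return url, len(url)


def scrape(urls, db_path=":memory:", workers=4):
    # Fetch all URLs concurrently in worker threads.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        rows = list(pool.map(fetch, urls))

    # Write from the calling thread only, so the SQLite connection
    # is never shared between threads.
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, size INTEGER)"
    )
    conn.executemany("INSERT OR REPLACE INTO pages VALUES (?, ?)", rows)
    conn.commit()
    return conn
```

Collecting results first and writing them in one place keeps database access single-threaded, which sidesteps SQLite's restrictions on sharing a connection across threads.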

Installation

pip install scrapemeagain

System requirements

Tor in combination with Privoxy is used for anonymity (i.e. regular IP address changes).
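
For orientation, the sketch below shows one way a Python client can send requests through a local Privoxy instance that forwards to Tor. It assumes Privoxy listens on its default address, 127.0.0.1:8118; `make_proxied_opener` is a hypothetical helper, not part of ScrapeMeAgain, and requesting a new Tor identity is not covered.

```python
import urllib.request

# Assumption: Privoxy runs locally on its default port and forwards to Tor.
PRIVOXY_URL = "http://127.0.0.1:8118"


def make_proxied_opener(proxy_url=PRIVOXY_URL):
    # Route both HTTP and HTTPS traffic through the given proxy.
    proxy = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(proxy)
```

With Tor and Privoxy running, `make_proxied_opener().open(url)` would send the request through the proxy instead of connecting directly.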

Using Docker and Docker Compose is the preferred (and easier) way to use ScrapeMeAgain.

If you are not using Docker, you have to install and set up Tor and Privoxy on your system manually. For further information about installation and configuration, refer to their official documentation.

Usage

You have to provide your own database table description and an actual scraper class which must follow the BaseScraper interface. See examples/examplescraper for more details.

Dockerized

With Docker, it is possible to use multiple Tor IPs at the same time and, as long as you do not abuse this, to scrape data faster.

The easiest way to start is to duplicate examples/examplescraper and then rename, update, and expand the scraper and its related classes as needed.

Your scraper must provide config.py and main_dockerized.py. These two file names are hardcoded throughout the codebase.

scrapemeagain-compose.py dynamically creates a docker-compose.yml that orchestrates the scraper instances. The first scraper (scp1) acts as the master scraper and therefore hosts the controller app (see dockerized/controller).
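
To make the idea of a generated compose file concrete, here is a simplified sketch. It does not reproduce what scrapemeagain-compose.py actually emits: `build_compose` and `to_yaml` are hypothetical helpers, the service fields are assumptions, and the image tag is the placeholder used elsewhere in this README.

```python
def build_compose(instances, image="dusanmadar/scrapemeagain:x.y.z"):
    # Build a compose-like mapping with services scp1..scpN.
    services = {}
    for i in range(1, instances + 1):
        name = f"scp{i}"
        service = {"image": image, "container_name": name}
        if i > 1:
            # Assumption: workers start after the master scraper (scp1).
            service["depends_on"] = ["scp1"]
        services[name] = service
    return {"version": '"3"', "services": services}


def to_yaml(node, indent=0):
    # Minimal YAML renderer for nested dicts and flat lists of strings.
    pad = "  " * indent
    lines = []
    for key, value in node.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.extend(to_yaml(value, indent + 1))
        elif isinstance(value, list):
            lines.append(f"{pad}{key}:")
            lines.extend(f"{pad}  - {item}" for item in value)
        else:
            lines.append(f"{pad}{key}: {value}")
    return lines
```

Printing `"\n".join(to_yaml(build_compose(3)))` to stdout mirrors the way the real script's output is piped into `docker-compose -f -`.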

  1. Get the Docker host IP
ip addr show docker0

NOTE Your Docker interface name may be different from docker0.

  2. Run examplesite on the Docker host IP
python3 examples/examplescraper/examplesite/app.py 172.17.0.1

NOTE Your Docker host IP may be different from 172.17.0.1.

  3. Start docker-compose
scrapemeagain-compose.py -s $(pwd)/examples/examplescraper -c tests.integration.fake_config | docker-compose -f - up

NOTE A special config file path is provided: -c tests.integration.fake_config. This is required only for test/demo purposes. You don't have to provide specific config for a real/production scraper.
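
Step 1 above reads the Docker host IP by hand. If you would rather script it, the following sketch extracts the first IPv4 address from the output of `ip addr show`. Both `parse_ipv4` and `docker_host_ip` are hypothetical helpers, not part of ScrapeMeAgain, and the default interface name docker0 is an assumption that may not match your system.

```python
import re
import subprocess


def parse_ipv4(ip_addr_output):
    # Return the first IPv4 address in `ip addr show <iface>` output, or None.
    match = re.search(
        r"^\s*inet (\d+(?:\.\d+){3})/", ip_addr_output, re.MULTILINE
    )
    return match.group(1) if match else None


def docker_host_ip(interface="docker0"):
    # Assumption: the `ip` tool is installed and the interface exists.
    output = subprocess.run(
        ["ip", "addr", "show", interface],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_ipv4(output)
```

The value returned by `docker_host_ip()` can then be passed to examplesite/app.py in place of 172.17.0.1.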

Local

  1. Run examplesite
python3 examples/examplescraper/examplesite/app.py
  2. Start examplescraper
python3 examples/examplescraper/main.py

NOTE You may need to update your PYTHONPATH, e.g. export PYTHONPATH=$PYTHONPATH:$(pwd)/examples.

Development

To simplify running integration tests with the latest changes:

  • replace image: dusanmadar/scrapemeagain:x.y.z with image: scp:latest in the scrapemeagain/dockerized/docker-compose.yml template

  • and make sure to rebuild the image locally before running tests, e.g.

docker build . -t scp:latest && python -m unittest discover -p test_integration.py

Legacy

The Python 2.7 version of ScrapeMeAgain, which also provides geocoding capabilities, is available under the legacy branch and is no longer maintained.
