
sitescrapper's Introduction

Install the dependencies using

pip install -r requirements.txt

Before installing the requirements, make sure that:

1. The libffi-dev and libssl-dev libraries are installed on your system; they are required for HTTPS support.

2. PhantomJS 1.9.7 is installed.

3. Python 2.7.6 is installed.

Start scraping by executing

The current version of the code uses Python 2.7. To scrape a site using the Twisted version of the library, execute:

python tornado_spider.py --url='http://www.example.com'

To also detect JavaScript errors (this takes considerably longer), execute:

python tornado_spider.py --jserrors --url='http://www.example.com'

The Twisted version opens a large number of connections to the server, which can produce conditions similar to a denial-of-service (DoS) attack and may cause pages to return 503 errors. If that happens, lower the maximum concurrent connections setting in the config.py file.
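To illustrate why a lower cap relieves the server, here is a minimal sketch of a concurrency limit. It is not the project's implementation: it uses standard-library threads rather than Twisted or Tornado, and `crawl_with_cap` and the `fetch` callback are hypothetical names introduced for this example. Only the setting name `MAX_CONCURRENT_REQUESTS_PER_SERVER` comes from config.py.

```python
import threading

# Same name as the config.py setting; the value here is only an example.
MAX_CONCURRENT_REQUESTS_PER_SERVER = 10


def crawl_with_cap(urls, fetch, limit=MAX_CONCURRENT_REQUESTS_PER_SERVER):
    """Call fetch(url) for every URL with at most `limit` calls in flight.

    Returns the highest number of simultaneous calls that was observed.
    `fetch` is a hypothetical callback supplied by the caller.
    """
    slots = threading.BoundedSemaphore(limit)
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def worker(url):
        with slots:  # blocks while `limit` fetches are already running
            with lock:
                state["active"] += 1
                state["peak"] = max(state["peak"], state["active"])
            try:
                fetch(url)
            finally:
                with lock:
                    state["active"] -= 1

    threads = [threading.Thread(target=worker, args=(u,)) for u in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"]
```

With a cap in place, the number of requests the server sees at any moment is bounded by `limit` no matter how many URLs are queued, which is the effect lowering the setting is meant to achieve.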

Configurations

The scraper can be configured through its config.py file. The available settings are as follows:

Starting URL

START_URL = http://www.example.com/

Maximum number of concurrent requests made to the server; too high a value can choke the server

MAX_CONCURRENT_REQUESTS_PER_SERVER = 10

Idle ping count, used to determine when the process should terminate

IDLE_PING_COUNT = 10

Comma-separated subdomains to skip

DOMAINS_TO_BE_SKIPPED=sub1.example.com,sub2.example.com
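As a rough sketch of how two of these settings could be consumed, the snippet below parses the comma-separated skip list and counts idle pings. The helper names (`parse_skipped_domains`, `should_skip`, `IdleCounter`) are hypothetical and not taken from the project. The idle-ping logic in particular is an assumption: the README does not say exactly how `IDLE_PING_COUNT` is applied, so this shows one plausible reading in which the crawl stops after that many consecutive checks with no pending requests.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2.7

# Example values mirroring the settings listed above.
DOMAINS_TO_BE_SKIPPED = "sub1.example.com,sub2.example.com"
IDLE_PING_COUNT = 10


def parse_skipped_domains(raw):
    """Turn the comma-separated setting into a set of lowercase host names."""
    return set(part.strip().lower() for part in raw.split(",") if part.strip())


def should_skip(url, skipped):
    """Return True when the URL's host is one of the skipped subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return host in skipped


class IdleCounter(object):
    """Assumed termination rule: stop after `limit` consecutive idle pings."""

    def __init__(self, limit=IDLE_PING_COUNT):
        self.limit = limit
        self.idle_pings = 0

    def ping(self, pending_requests):
        """Record one periodic check; return True when the crawl should stop."""
        if pending_requests == 0:
            self.idle_pings += 1
        else:
            self.idle_pings = 0  # any activity resets the count
        return self.idle_pings >= self.limit
```

For example, `should_skip("http://sub1.example.com/a", parse_skipped_domains(DOMAINS_TO_BE_SKIPPED))` is true, while the same check for `http://www.example.com/` is false.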

Limitations

1. Currently the utility doesn't scrape pages that are only reachable after logging in.

2. Handling localhost-based URLs might require some tweaking.

3. Currently only Python 2.7.6 is supported.

sitescrapper's People

Contributors

jayeshpowar
