
MIT license

Crawler

A modular, extensible, yet fairly simple web crawler. The basic idea behind this project is to develop a generic web crawler that can optionally run user-specified modules to parse pages as they are crawled.

Installation

To clone the source code:

git clone https://github.com/msimms/Crawler

To install the dependencies:

cd Crawler
python setup.py install

Results

Results are stored in MongoDB, so you should have a MongoDB installation handy.

The crawler will create a database, creatively named crawlerdb, containing a collection called pages. The pages collection holds one document per crawled page. Each document stores the URL and last-visited timestamp, along with anything added by a website module.
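The stored documents can be inspected with any MongoDB client. Below is a minimal sketch of reading results back out; the field names `url` and `last_visited` are assumptions about the schema, not verified against the crawler source.

```python
# Sketch: summarize crawled pages. Works on any iterable of page
# documents (plain dicts here), so it can also be fed a pymongo cursor.
# The "url" field name is an assumption about the crawler's schema.

def summarize_pages(docs):
    """Return (count, urls) for an iterable of page documents."""
    urls = [doc.get("url") for doc in docs]
    return len(urls), urls

# Example usage against a live database (requires a running MongoDB
# and pymongo; field names are assumptions):
#
#   from pymongo import MongoClient
#   pages = MongoClient("localhost", 27017)["crawlerdb"]["pages"]
#   count, urls = summarize_pages(pages.find().limit(10))
```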

Usage

python Crawler.py 
    [--file <name of a file from which to harvest URLs>]
    [--url <URL from which to start crawling>]
    [--rate <crawl rate, in seconds>]
    [--max-depth <maximum crawl depth>]
    [--min-revisit-secs <minimum number of seconds before allowing a URL to be revisited>]
    [--website-modules <comma-separated list of the Python modules that will parse each page>]
    [--mongodb-addr <URL of the mongodb instance which will store the results, defaults to localhost:27017>]
    [--crawl-other-websites]
    [--verbose]

Seeding the crawler is done with either the --file or --url parameter.

Extending

As this is a modular web crawler, it supports modules for dealing with specific websites. This is done by subclassing the ParseModule class and then passing the name of that class to the crawler using the --website-modules option. Multiple modules can be specified by separating them with commas. Data returned by a module is stored in the database, along with the raw page source.
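A minimal sketch of what a website module might look like. The ParseModule base class comes from the crawler source; the `parse` method name, its signature, and the returned-dict convention are assumptions for illustration, as is the stub base class included here so the sketch runs on its own.

```python
# Sketch of a custom website module. The real ParseModule lives in the
# crawler source; this stub stands in for it so the example runs alone.
# The parse() signature and returned-dict convention are assumptions.

class ParseModule:
    """Stand-in for the crawler's ParseModule base class."""
    def parse(self, url, page_source):
        return {}

class FooParser(ParseModule):
    """Hypothetical module for foo.com pages."""
    def parse(self, url, page_source):
        # Extract whatever this site cares about; returned keys would
        # be stored in the page's MongoDB document by the crawler.
        title = ""
        start = page_source.find("<title>")
        if start >= 0:
            end = page_source.find("</title>", start)
            if end >= 0:
                title = page_source[start + len("<title>"):end].strip()
        return {"page_title": title}
```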

Examples

python Crawler.py --url https://foo.com --website-modules foo.py --verbose --min-revisit-secs 86400

The above example will crawl links from foo.com, parsing the results using the foo.py module, while printing verbose output. Pages will not be revisited unless it has been more than one day since the last visit.

python Crawler.py --url https://foo.com --website-modules foo.py --verbose --crawl-other-websites --min-revisit-secs 86400

The above example will crawl links from foo.com as well as any page linked to from foo.com. Otherwise, it is the same as the previous example.

License

This library is released under the MIT license, see LICENSE for details.
