crawler
----------------------------------

This is a quick and dirty /simple/ web crawler, and a work in progress. 
If you want it to do anything besides discover new links, you will need to 
write some code. It tries to respect robots.txt, and can be scaled fairly well 
across several machines thanks to the magic of celery and rabbitmq.

If you're planning to use this for something serious, please acquire more beer 
first; it will make the experience more tolerable.

Libraries:
	- urllib2 for document fetching
	- celery for task distribution
	- rabbitmq as the default message broker
	- BeautifulSoup to extract links from HTML, even if the HTML is badly 
	  written
	- urlparse to make URLs absolute and to extract domain names
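
As a rough illustration of how these libraries fit together (this is not the 
crawler's actual internal code; the helper name extract_links and the old 
BeautifulSoup 3 import style are assumptions chosen to match this Python 2 
code base), fetching a page and extracting absolute links might look like:

	import urllib2
	import urlparse
	from BeautifulSoup import BeautifulSoup

	def extract_links(url):
	    # Fetch the page and parse it, tolerating badly written HTML
	    html = urllib2.urlopen(url).read()
	    soup = BeautifulSoup(html)
	    # Make every href absolute against the page URL
	    links = []
	    for a in soup.findAll('a', href=True):
	        links.append(urlparse.urljoin(url, a['href']))
	    return links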

To use the crawler:

1) Start rabbitmq

2) If you're planning to run workers outside the local computer:
	- edit celeryconfig.py to point to the ip/hostname where you are 
	  running the rabbitmq server (see the sketch after this list).
	- copy the program to the worker nodes

3) Start celeryd in the crawler folder on the worker nodes.

4) Run example.py --seed url1 url2 url3 ...
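
For step 2, the part of celeryconfig.py you would edit is the broker settings. 
A minimal sketch, assuming the old-style celery configuration keys used with 
celeryd (the actual file in this repository may contain more settings, and the 
hostname below is only a placeholder):

	BROKER_HOST = "rabbitmq.example.org"   # ip/hostname of the RabbitMQ server
	BROKER_PORT = 5672                     # default AMQP port
	BROKER_USER = "guest"
	BROKER_PASSWORD = "guest"
	BROKER_VHOST = "/"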


Examples
----------------------------------

Output of a test-run with a (too) low crawl-delay:

	Suspended crawl to 1298964969.suspended_crawl
	Downloaded bytes: 2603580
	Discovered links: 1449
	Discovered domains: 102
	Runtime: 40.9467639923 seconds
	Most prolific referrer was www. --- .com with an average of
	50.0 outgoing links per page.

To explore the results, fire up an interactive Python shell and run:

	>>> from crawler import resume
	>>> crawler = resume("1298964969.suspended_crawl")

For example, these might be interesting:

	>>> print crawler.domains.keys()
	>>> print crawler.urls.values()

To resume the crawling process:

	>>> crawler.crawl()


To resume a killed crawler outside of the Python shell, call the script 
with --resume filename

Assumptions, limitations, improvements
--------------------------------------

Assumptions:
- The web consists only of HTML (!)
- We don't need to get every single document; there are always others.
- Only HTTP results are of interest (optional HTTPS without certificate checking is
  available by passing schemes=["http", "https"] to the crawler)
- We're crawling a small subset of the internet, so keeping some stuff in memory is ok.
- The goal was not to reimplement Googlebot in a few days :)

Good stuff:
- The crawler can add and remove workers while running
- RabbitMQ can be clustered, live (thanks, Erlang..)
- Respects robots.txt (optionally), including crawl-delay; fetches robots.txt asynchronously
  and delays further crawling of a domain until robots.txt is parsed or found missing
- Can be restricted to crawling specific domains
- Accepts custom stopping criteria in the form of a function which can inspect the
  crawl state (see the sketch after this list)
- In case of a crash, kill or completion, the crawl state is saved so it can either be
  resumed or inspected to find any problems.
- Crawls pretty fast after ramp-up
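
The stopping criterion is just a callable that inspects the crawl state and 
decides when to stop. The exact signature and the way it is registered are not 
documented here, so the example below is a hypothetical sketch; the only 
attribute it relies on, crawler.urls, is the one printed in the shell examples 
above:

	def stop_after_1000_links(crawler):
	    # Stop once 1000 URLs have been discovered. How this callable is
	    # passed to the crawler is an assumption, not documented API.
	    return len(crawler.urls) >= 1000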

Limitations:
- The document ranking function is probably quite bad (it's no PageRank)
- The crawl oversight process is not distributed (but has been found capable of
  keeping up with hundreds of workers)
- Too much info is kept in memory for this to be usable on a grand scale,
  but it lets us get statistics easily
- Doesn't reuse connections

Suggested improvements:
- Crawl oversight should be made distributed
- Decouple link structure data storage from crawling
- Change to urllib3 to reuse connections (then again, maybe not: it would force us to
  track which worker is talking to which server)
- Make your own crawler.rank(document) and termination criteria (see the sketch after this list)
- WARNING: uses pickle, which is unsafe on data from untrusted sources. Write some other method of saving
  state!
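
As a starting point for a custom ranking function, here is a hypothetical 
replacement for crawler.rank(document). Whether the crawler hands rank() raw 
HTML or something already parsed is an assumption; this sketch assumes raw 
HTML and simply scores a page by its number of outgoing links:

	from BeautifulSoup import BeautifulSoup

	def rank(document):
	    # More outgoing links == higher score. No attempt at anything
	    # PageRank-like, just a trivially replaceable placeholder.
	    soup = BeautifulSoup(document)
	    return len(soup.findAll('a', href=True))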
