retresco / spyder

A Python web crawler using Tornado and ZeroMQ
Home Page: http://retresco.github.com/Spyder/
License: Apache License 2.0
Right now there is nothing like Heritrix's hop depth in Spyder. That means it will crawl all URLs in the crawl scope, no matter how many links away from the seeds they are.
There are many areas where I want to set such a hop depth.
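A hop-depth limit of this kind could be sketched roughly as follows. The field name `hop_depth`, the dict-based optional vars, and the limit of 5 are all assumptions for illustration, not Spyder's actual API:

```python
# Hypothetical sketch of a Heritrix-style hop-depth limit. The "hop_depth"
# field and the dict-shaped optional vars are assumptions, not Spyder code.

MAX_HOP_DEPTH = 5  # assumed configurable limit


def within_hop_depth(curi_optional_vars, max_depth=MAX_HOP_DEPTH):
    """Return True if the URI's hop count is still below the limit."""
    depth = int(curi_optional_vars.get("hop_depth", 0))
    return depth < max_depth


def child_vars(parent_vars):
    """Optional vars for a link extracted from the parent URI:
    copy the parent's vars and increment the hop counter."""
    child = dict(parent_vars)
    child["hop_depth"] = int(parent_vars.get("hop_depth", 0)) + 1
    return child
```

The frontier would then simply drop extracted links for which `within_hop_depth` returns False.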
In one of the internal crawler projects we have a CouchDB sink. Migrate this into Spyder itself.
All management stuff should be handled via the main socket instead of separate sockets.
Create a How-To area in the documentation, covering the MultipleHostFrontier to start with.
This is not bad, but it would be cleaner if the fetcher worked on the CrawlUri rather than on the DataMessage.
It would be very nice to have some kind of statistics.
When a URL in the CouchDB redirects to a 404 error page instead of returning a 404 status code in the response, the 404 error page will be indexed by Solr.
If a worker dies while processing a CrawlUri, the Master may never notice that this CrawlUri has not finished. This can cause problems especially when crawling many hosts, since that host remains blocked from then on.
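One way to address this would be a lease-based watchdog in the master: every CrawlUri handed to a worker gets a deadline, and if no result arrives in time the URI is requeued and the host unblocked. A minimal sketch, with all names hypothetical:

```python
import time

# Sketch of a lease table the master could keep. Nothing here is existing
# Spyder code; the timeout value and the url-keyed dict are assumptions.


class UriLeaseTable:
    def __init__(self, timeout=300.0):
        self._timeout = timeout
        self._in_flight = {}  # url -> deadline (epoch seconds)

    def checkout(self, url, now=None):
        """Record that a worker started on this URL."""
        now = time.time() if now is None else now
        self._in_flight[url] = now + self._timeout

    def ack(self, url):
        """The worker reported a result; drop the lease."""
        self._in_flight.pop(url, None)

    def expired(self, now=None):
        """URLs whose worker presumably died; caller should requeue them."""
        now = time.time() if now is None else now
        dead = [u for u, deadline in self._in_flight.items() if deadline < now]
        for u in dead:
            del self._in_flight[u]
        return dead
```

The master would call `expired()` periodically (e.g. from a Tornado timer) and feed the returned URLs back into the frontier.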
With the 0.1.7 release, the master shuts itself down when there is an error in the sink. This is bad; instead, mark the URI with a special error code and continue crawling!
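The suggested behaviour could look something like the following sketch. The error code value, the `process` method name, and the dict-shaped CrawlUri are illustrative assumptions:

```python
# Instead of letting a sink exception kill the master, catch it, tag the
# CrawlUri with an error status, and keep crawling. All names are hypothetical.

CODE_SINK_ERROR = 799  # assumed internal error code for sink failures


def store_safely(sink, curi):
    try:
        sink.process(curi)
    except Exception:
        # Mark the URI and move on; do NOT shut the master down.
        curi["status"] = CODE_SINK_ERROR
    return curi
```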
Right now, the credentials are only added to newly extracted links when they are relative and there is no URL Tag in the HTML.
Credentials in the seeds should be handled correctly also with respect to the host path.
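Propagating seed credentials correctly could be sketched like this: if an extracted absolute link points to the same host as an authenticated seed, re-attach the `user:password` part. This is purely illustrative, not Spyder's current behaviour:

```python
from urllib.parse import urlsplit, urlunsplit

# Sketch of carrying seed credentials over to extracted links on the same
# host. The function name and the same-host rule are assumptions.


def add_credentials(link, seed):
    """Return `link` with the seed's user:password attached, if both point
    at the same host and the seed actually carries credentials."""
    seed_parts = urlsplit(seed)
    if "@" not in seed_parts.netloc:
        return link  # seed has no credentials; nothing to do
    creds, seed_host = seed_parts.netloc.rsplit("@", 1)
    link_parts = urlsplit(link)
    if link_parts.netloc == seed_host:
        return urlunsplit(link_parts._replace(netloc=creds + "@" + seed_host))
    return link
```

Respecting the host *path* as well (as the item above asks) would add one more check, e.g. that the link's path starts with the seed's path.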
Right now, the sinks are attached to the master. This means that all content is transferred to the master before being stored by the sink. IMHO it would be better to attach the sinks to the workers. Depending on your sink/storage this could mean more HTTP connections to the $DB, but I consider this not to be my problem! :)
At least the Workers and the Log Sink should be supervised and automatically restarted. In addition, rework spyder.masterprocess and spyder.workerprocess to daemonize cleanly.
Evaluate whether working with XREQ/XREP is cleaner; it should remove at least one open socket. If it is, rework the Master and Worker to use those instead of the PUSH/PULL and PUB/SUB patterns.
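For reference, XREQ/XREP are the old ZeroMQ names for DEALER/ROUTER. A minimal pyzmq round trip showing how a single bidirectional socket pair could replace a PUSH/PULL plus PUB/SUB pair; this is a generic sketch, not Spyder code, and the identity and messages are made up:

```python
import zmq

# DEALER/ROUTER (formerly XREQ/XREP) round trip over inproc transport.
# The ROUTER side learns each peer's identity and can address replies
# to a specific worker -- one socket pair instead of two.

ctx = zmq.Context.instance()

router = ctx.socket(zmq.ROUTER)   # master side: routes by peer identity
router.bind("inproc://master")

dealer = ctx.socket(zmq.DEALER)   # worker side
dealer.setsockopt(zmq.IDENTITY, b"worker-1")
dealer.connect("inproc://master")

dealer.send(b"ready")                        # worker announces itself
identity, payload = router.recv_multipart()  # master sees [identity, body]
router.send_multipart([identity, b"crawl http://example.com/"])
task = dealer.recv()                         # worker gets its assignment
```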
There's a problem with datetime objects finding their way into the queues. That must not happen, because the queue logic is based on time.mktime timestamps and thus on floats.
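One fix would be to normalise timestamps at the queue boundary so the priority logic only ever compares floats. A sketch (using `calendar.timegm`, assuming naive UTC datetimes; the function name is made up):

```python
import calendar
from datetime import datetime

# Convert any incoming timestamp to a float epoch before it enters the
# queue, so datetime objects can never reach the float-based queue logic.


def as_epoch(value):
    """Return a float epoch for either a number or a naive UTC datetime."""
    if isinstance(value, datetime):
        return float(calendar.timegm(value.utctimetuple()))
    return float(value)
```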
Create an abstract class for processors that gives each processor similar behaviour. This class should introduce a basic chain of decision methods that determine whether the given processor should work on the current CrawlUri.
Ideas for decision methods:
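Such a decision chain could be sketched as follows. The specific decision methods shown (scheme and content-type checks) are only examples of such ideas, and the class and field names are assumptions, not existing Spyder code:

```python
# Sketch of an abstract processor with a chain of decision methods.
# Subclasses override individual checks; the base class combines them.


class AbstractProcessor:
    def accepts_scheme(self, curi):
        return True  # override e.g. to restrict to http/https

    def accepts_content_type(self, curi):
        return True  # override e.g. to restrict to text/html

    def should_process(self, curi):
        """The decision chain: every check must pass."""
        return self.accepts_scheme(curi) and self.accepts_content_type(curi)

    def __call__(self, curi):
        if self.should_process(curi):
            return self.process(curi)
        return curi  # pass through untouched

    def process(self, curi):
        raise NotImplementedError


class HtmlLinkExtractor(AbstractProcessor):
    """Example subclass: only touches HTML responses."""

    def accepts_content_type(self, curi):
        return curi.get("content_type", "").startswith("text/html")

    def process(self, curi):
        curi["processed_by"] = "HtmlLinkExtractor"
        return curi
```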