retresco / spyder

A Python web crawler using Tornado and ZeroMQ
Home Page: http://retresco.github.com/Spyder/
License: Apache License 2.0
Right now there is nothing like Heritrix's hop depth in Spyder. That means it will crawl all URLs in the crawl scope, no matter how many links away from the seeds they are.
There are many areas where I want to set such a hop depth.
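A hop-depth limit of this kind could be sketched roughly as follows. The field name `hop_depth`, the dict-based optional vars, and the limit of 5 are all assumptions for illustration, not Spyder's actual API:

```python
# Hypothetical sketch of a Heritrix-style hop-depth limit. The "hop_depth"
# field and the dict-shaped optional vars are assumptions, not Spyder code.

MAX_HOP_DEPTH = 5  # assumed configurable limit


def within_hop_depth(curi_optional_vars, max_depth=MAX_HOP_DEPTH):
    """Return True if the URI's hop count is still below the limit."""
    depth = int(curi_optional_vars.get("hop_depth", 0))
    return depth < max_depth


def child_vars(parent_vars):
    """Optional vars for a link extracted from the parent URI:
    copy the parent's vars and increment the hop counter."""
    child = dict(parent_vars)
    child["hop_depth"] = int(parent_vars.get("hop_depth", 0)) + 1
    return child
```

The frontier would then simply drop extracted links for which `within_hop_depth` returns False.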
In one of the internal crawler projects we have a CouchDB sink. Migrate this into Spyder itself.
All management stuff should be handled via the main socket instead of separate sockets.
Create a How-To area in the documentation, covering the MultipleHostFrontier to start with.
This is not bad, but it would be cleaner if the fetcher worked on the CrawlUri rather than on the DataMessage.
It would be very nice to have some kind of statistics.
When a URL in the CouchDB redirects to a 404 error page instead of returning a 404 status code in the response, the 404 error page will be indexed by Solr.
If a worker dies while processing a CrawlUri, the Master may never notice that this CrawlUri has not finished. This can cause problems especially when crawling many hosts, since that host remains blocked from then on.
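One way to address this would be a lease-based watchdog in the master: every CrawlUri handed to a worker gets a deadline, and if no result arrives in time the URI is requeued and the host unblocked. A minimal sketch, with all names hypothetical:

```python
import time

# Sketch of a lease table the master could keep. Nothing here is existing
# Spyder code; the timeout value and the url-keyed dict are assumptions.


class UriLeaseTable:
    def __init__(self, timeout=300.0):
        self._timeout = timeout
        self._in_flight = {}  # url -> deadline (epoch seconds)

    def checkout(self, url, now=None):
        """Record that a worker started on this URL."""
        now = time.time() if now is None else now
        self._in_flight[url] = now + self._timeout

    def ack(self, url):
        """The worker reported a result; drop the lease."""
        self._in_flight.pop(url, None)

    def expired(self, now=None):
        """URLs whose worker presumably died; caller should requeue them."""
        now = time.time() if now is None else now
        dead = [u for u, deadline in self._in_flight.items() if deadline < now]
        for u in dead:
            del self._in_flight[u]
        return dead
```

The master would call `expired()` periodically (e.g. from a Tornado timer) and feed the returned URLs back into the frontier.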
With the 0.1.7 release, the master shuts itself down when there is an error in the sink. This is bad; instead, mark the URI with a special error code and continue crawling!
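The suggested behaviour could look something like the following sketch. The error code value, the `process` method name, and the dict-shaped CrawlUri are illustrative assumptions:

```python
# Instead of letting a sink exception kill the master, catch it, tag the
# CrawlUri with an error status, and keep crawling. All names are hypothetical.

CODE_SINK_ERROR = 799  # assumed internal error code for sink failures


def store_safely(sink, curi):
    try:
        sink.process(curi)
    except Exception:
        # Mark the URI and move on; do NOT shut the master down.
        curi["status"] = CODE_SINK_ERROR
    return curi
```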
Right now, the credentials are only added to newly extracted links when they are relative and there is no URL Tag in the HTML.
Credentials in the seeds should be handled correctly also with respect to the host path.
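Propagating seed credentials correctly could be sketched like this: if an extracted absolute link points to the same host as an authenticated seed, re-attach the `user:password` part. This is purely illustrative, not Spyder's current behaviour:

```python
from urllib.parse import urlsplit, urlunsplit

# Sketch of carrying seed credentials over to extracted links on the same
# host. The function name and the same-host rule are assumptions.


def add_credentials(link, seed):
    """Return `link` with the seed's user:password attached, if both point
    at the same host and the seed actually carries credentials."""
    seed_parts = urlsplit(seed)
    if "@" not in seed_parts.netloc:
        return link  # seed has no credentials; nothing to do
    creds, seed_host = seed_parts.netloc.rsplit("@", 1)
    link_parts = urlsplit(link)
    if link_parts.netloc == seed_host:
        return urlunsplit(link_parts._replace(netloc=creds + "@" + seed_host))
    return link
```

Respecting the host *path* as well (as the item above asks) would add one more check, e.g. that the link's path starts with the seed's path.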
Right now, the sinks are attached to the master. This means that all content is transferred to the master before being stored by the sink. IMHO it would be better to attach the sinks to the workers. Depending on your sink/storage this could mean more HTTP connections to the $DB, but I consider this not to be my problem! :)
At least the Workers and the Log Sink should be supervised and automatically restarted. In addition, rework spyder.masterprocess and spyder.workerprocess to daemonize cleanly.
Evaluate whether working with XREQ/XREP is cleaner; it should remove at least one open socket. If it is, rework the Master and Worker to use those instead of the PUSH/PULL and PUB/SUB patterns.
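For reference, XREQ/XREP are the old ZeroMQ names for DEALER/ROUTER. A minimal pyzmq round trip showing how a single bidirectional socket pair could replace a PUSH/PULL plus PUB/SUB pair; this is a generic sketch, not Spyder code, and the identity and messages are made up:

```python
import zmq

# DEALER/ROUTER (formerly XREQ/XREP) round trip over inproc transport.
# The ROUTER side learns each peer's identity and can address replies
# to a specific worker -- one socket pair instead of two.

ctx = zmq.Context.instance()

router = ctx.socket(zmq.ROUTER)   # master side: routes by peer identity
router.bind("inproc://master")

dealer = ctx.socket(zmq.DEALER)   # worker side
dealer.setsockopt(zmq.IDENTITY, b"worker-1")
dealer.connect("inproc://master")

dealer.send(b"ready")                        # worker announces itself
identity, payload = router.recv_multipart()  # master sees [identity, body]
router.send_multipart([identity, b"crawl http://example.com/"])
task = dealer.recv()                         # worker gets its assignment
```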
There's a problem with datetime objects finding their way into the queues. That must not happen, because the queue logic is based on time.mktime timestamps and thus on floats.
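One fix would be to normalise timestamps at the queue boundary so the priority logic only ever compares floats. A sketch (using `calendar.timegm`, assuming naive UTC datetimes; the function name is made up):

```python
import calendar
from datetime import datetime

# Convert any incoming timestamp to a float epoch before it enters the
# queue, so datetime objects can never reach the float-based queue logic.


def as_epoch(value):
    """Return a float epoch for either a number or a naive UTC datetime."""
    if isinstance(value, datetime):
        return float(calendar.timegm(value.utctimetuple()))
    return float(value)
```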
Create an abstract class for processors that gives each processor similar behaviour. This class should introduce a basic chain of decision methods that determine whether the given processor should work on the current CrawlUri.
Ideas for decision methods:
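Such a decision chain could be sketched as follows. The specific decision methods shown (scheme and content-type checks) are only examples of such ideas, and the class and field names are assumptions, not existing Spyder code:

```python
# Sketch of an abstract processor with a chain of decision methods.
# Subclasses override individual checks; the base class combines them.


class AbstractProcessor:
    def accepts_scheme(self, curi):
        return True  # override e.g. to restrict to http/https

    def accepts_content_type(self, curi):
        return True  # override e.g. to restrict to text/html

    def should_process(self, curi):
        """The decision chain: every check must pass."""
        return self.accepts_scheme(curi) and self.accepts_content_type(curi)

    def __call__(self, curi):
        if self.should_process(curi):
            return self.process(curi)
        return curi  # pass through untouched

    def process(self, curi):
        raise NotImplementedError


class HtmlLinkExtractor(AbstractProcessor):
    """Example subclass: only touches HTML responses."""

    def accepts_content_type(self, curi):
        return curi.get("content_type", "").startswith("text/html")

    def process(self, curi):
        curi["processed_by"] = "HtmlLinkExtractor"
        return curi
```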