jculvey / roboto
A web crawler/scraper/spider for nodejs
I think it's a common case that similar kinds of info (e.g. images, names, etc.) are accessible via different CSS queries and require different parsing logic.
How about providing a parser for each domain, kind of like how handlers in express.js can be registered for different routes?
Any chance it might make it in sometime soon?
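A hypothetical shape for the proposed API (the registerParser method and per-domain dispatch do not exist in roboto; this only sketches the request, reusing the parse callback signature the library already uses):

// Hypothetical per-domain parsers, analogous to express.js route handlers.
crawler.registerParser('example-shop.com', function(response, $) {
  return { name: $('.product-name').text(), image: $('.product-photo img').attr('src') };
});

crawler.registerParser('another-site.org', function(response, $) {
  return { name: $('#title').text(), image: $('meta[property="og:image"]').attr('content') };
});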
It's really unnecessary to have all those node_modules bundled with the framework. It would be better to just define a .gitignore file to ignore them; they would be pulled again automatically with npm install.
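For reference, the whole fix is one standard .gitignore entry (nothing roboto-specific); after that, a fresh checkout restores its dependencies from package.json by running npm install:

node_modules/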
Hi,
I'm trying to crawl a page (like http://021-online.com) which has charset=gb2312.
When I try to read the head/title field using cheerio (which is embedded), I get garbled characters instead of the proper text.
Am I missing some configuration, or is it a bug that prevents roboto from properly crawling non-UTF-8 pages?
It may also be a problem with cheerio rather than roboto itself.
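For comparison, here is a minimal decoding sketch outside roboto, assuming the page really is gb2312 (the standalone request/iconv-lite/cheerio usage is illustration only, not roboto configuration):

var request = require('request');
var iconv = require('iconv-lite');
var cheerio = require('cheerio');

// Fetch the raw bytes (encoding: null yields a Buffer instead of a utf-8 string),
// decode them as gb2312, and only then hand the markup to cheerio.
request({ url: 'http://021-online.com', encoding: null }, function(err, res, body) {
  if (err) throw err;
  var html = iconv.decode(body, 'gb2312');
  var $ = cheerio.load(html);
  console.log($('head title').text());
});

If this prints the title correctly, the issue is in how the response body is decoded before it reaches cheerio.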
Hello,
Is it possible to get some metrics, like request time or request latency?
It'd be great if we could create custom queues or frontiers and inject them into the crawler as an option or parameter.
What I want to do is use a database to store which URLs I have visited and how many times. I also want my own custom logic for URL queuing; for example, I could inject least-recently-visited URLs back into the queue when it becomes empty, to refresh existing data.
It'd be great if that queuing logic were split into a separate module that we could implement ourselves. Let me know if you understand what I'm trying to say; English is not really my native language.
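A hypothetical shape for such a pluggable frontier (the frontier option and its method names are assumptions, not an existing roboto interface; a real implementation would be backed by a database):

// Hypothetical: an injectable frontier that keeps the queuing logic in user code.
var myFrontier = {
  push: function(url, done) { /* record the url and its visit count in a database */ done(); },
  next: function(done) { /* hand back the next url to crawl, or null when empty */ done(null, null); },
  size: function(done) { done(null, 0); }
};

var crawler = new roboto.Crawler({
  startUrls: ['http://example.com/'],
  frontier: myFrontier   // assumed option name; roboto does not currently accept this
});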
Will this cause an infinite crawl for a bigger site? What strategy can be used to crawl the website efficiently?
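One common way to bound a crawl is to cap link depth and keep the crawler on the starting domain; the options below all appear in the example code later on this page, with placeholder values (whether they are enough for a very large site is another question):

var crawler = new roboto.Crawler({
  startUrls: ['http://example.com/'],
  constrainToRootDomains: true,  // don't wander off to other domains
  obeyRobotsTxt: true,           // respect robots.txt exclusions
  maxDepth: 5                    // stop following links after 5 hops from the start urls
});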
On many websites you are only interested in one kind of page, like a product details page. But what if there are no links to other product detail pages from there? The crawler would get stuck.
I think it could work for the crawler to crawl pages freely (respecting the robots.txt file, of course) but only parse the qualified pages to extract items.
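A user-level approximation with the existing parseField hook: keep following links everywhere, but only emit fields for URLs that look like product pages (the URL pattern and selector here are made-up examples):

// Only extract data when the url matches a product-details pattern;
// other pages are still crawled for their links but produce no fields.
var PRODUCT_URL = /\/products\/\d+/;   // example pattern, adjust per site

crawler.parseField('productName', function(response, $) {
  if (!PRODUCT_URL.test(response.url)) return;   // skip non-product pages
  return $('h1.product-name').text();
});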
Hello,
I didn't find anything in the docs, so I figured I would ask the question here. When the docs refer to the process of "Downloading and processing" the page, where does that information go? And does it get cleared after it's been "processed"? How is roboto when it comes to memory consumption? Do these tasks require a lot of memory? And if you "download and process" a ton of pages (like just letting it run forever and scrape every link it can find) will memory eventually become an issue?
Can't crawl https://facebook.com; the result is:
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO Beginning Crawl.
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO Crawling https://www.facebook.com/
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO Finished crawl.
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> Started: Thursday, May 14th 2015, 5:04:40 pm
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> Duration: 0 hours, 0 minutes, 0 seconds.
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> Page(s) crawled: 0
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> Page(s) nofollowed: 0
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> Request count: 0
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> Request errors: 0
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> 200 OK count : 0
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> 301 count : 0
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> 302 count : 0
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> 404 count : 0
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> Filtered due to
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> Disallowed domain: 0
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> Blacklist: 0
The code is:
var roboto = require('roboto');
var html_strip = require('htmlstrip-native').html_strip;

var stripOptions = {
  include_script: false,
  include_style: false,
  compact_whitespace: true
};

var crawler = new roboto.Crawler({
  startUrls: [
    'https://www.facebook.com/'
  ],
  constrainToRootDomains: true,
  obeyRobotsTxt: false,
  obeyNofollow: false,
  maxDepth: 50
});

/*
Dead/broken links start
var sites = [];
var deadLinks = {
  E503: [],
  E500: [],
  E404: [],
  E403: [],
  E302: []
};
deadLinks.E503.push();
*/

// Returns the time it took to make the request for the current page.
// Assumes a 'Start-time' header was attached to the outgoing request elsewhere.
crawler.parseField('requestTime', function(response) {
  return (new Date().getTime() - response.request.headers['Start-time']);
});

// Returns the url of the current page
crawler.parseField('url', function(response) {
  return response.url;
});

// Returns the status code of the current page
crawler.parseField('statusCode', function(response) {
  return response.statusCode;
});

// Returns the title of the current page
crawler.parseField('title', function(response, $) {
  return $('head title').text();
});

// Returns the word count of the current page
crawler.parseField('TextCount', function(response, $) {
  var html = $('body').html();
  if (html) {
    return countWords(html_strip(html, stripOptions));
  }
});

crawler.on('item', function(item) {
  //console.log(item.url);
});

crawler.on('finish', function() {
});

crawler.crawl();

/*
  Functions
*/
function countWords(s) {
  s = s.replace(/(^\s*)|(\s*$)/gi, ""); // strip leading and trailing whitespace
  s = s.replace(/[ ]{2,}/gi, " ");      // collapse 2 or more spaces to 1
  s = s.replace(/\n /, "\n");           // drop a space at the start of a line
  return s.split(' ').length;
}
I would like a way for roboto to use the logger provided by the containing application.
Possible approach (not tested): add an option such as customLogger (or something like that), set this._log = this.option.customLogger || log;, replace the log.* calls with this._log.*, and pass the logger through with itemLogger(this._log).
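From the caller's side the intent would look roughly like this (the customLogger option is the hypothetical one proposed above, not something roboto accepts today; any object with the usual log methods would do):

// Hypothetical: hand roboto the host application's logger instead of its built-in one.
var appLogger = {
  debug: console.log,
  info: console.log,
  warn: console.warn,
  error: console.error
};

var crawler = new roboto.Crawler({
  startUrls: ['http://example.com/'],
  customLogger: appLogger   // assumed option name from the proposal above
});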
Thanks for some awesome code BTW!
I think I saw it in the roadmap.
It would be nice if you could stop and then resume roboto so it does not start over from the beginning/startUrls. I think it could be achieved via de/serialization, so that when you stop and start it, it loads its previous state.
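A rough sketch of the de/serialization idea (the pendingUrls/visitedUrls properties are assumptions about internal state that roboto does not currently expose; this only illustrates the proposal):

var fs = require('fs');

// Hypothetical: dump the crawler's pending queue and visited set when stopping...
function saveState(crawler, file) {
  fs.writeFileSync(file, JSON.stringify({
    pending: crawler.pendingUrls,   // assumed internal state, not a real property
    visited: crawler.visitedUrls    // assumed internal state, not a real property
  }));
}

// ...and restore it before the next crawl() so the crawl resumes where it left off.
function loadState(crawler, file) {
  var state = JSON.parse(fs.readFileSync(file, 'utf8'));
  crawler.pendingUrls = state.pending;
  crawler.visitedUrls = state.visited;
}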
Roboto is very nice and useful; the one thing missing for me is a request path. When roboto crawls a page, it could store the URLs it visited to reach that page (the path to the current page). This is helpful when you crawl pages and need to know the context or the previous/parent page (e.g. it could be a category page).
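A user-level approximation with the existing hooks: while each page is parsed, remember it as the parent of every link it contains, so that when a child page later produces an item its referring page can be looked up (this bookkeeping lives outside roboto; the parents map is illustration only and tracks one level rather than the full path):

var url = require('url');

// Map from absolute child url -> the page it was first discovered on.
var parents = {};

crawler.parseField('links', function(response, $) {
  $('a[href]').each(function() {
    var child = url.resolve(response.url, $(this).attr('href'));
    if (!parents[child]) parents[child] = response.url;
  });
});

crawler.on('item', function(item) {
  console.log(item.url, 'was reached from', parents[item.url] || '(start url)');
});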