jculvey / roboto
A web crawler/scraper/spider for nodejs
I think it's a common case that similar kinds of info (e.g. images, names, etc.) are accessible via different CSS queries and require different parsing logic.
How about providing a parser for each domain, kind of like how handlers in express.js can be registered for different routes?
Any chance it might make it in sometime soon?
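A hypothetical shape for the proposed API (the registerParser method and per-domain dispatch do not exist in roboto; this only sketches the request, reusing the parse callback signature the library already uses):

// Hypothetical per-domain parsers, analogous to express.js route handlers.
crawler.registerParser('example-shop.com', function(response, $) {
  return { name: $('.product-name').text(), image: $('.product-photo img').attr('src') };
});

crawler.registerParser('another-site.org', function(response, $) {
  return { name: $('#title').text(), image: $('meta[property="og:image"]').attr('content') };
});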
It's really unnecessary to have all those node_modules bundled with the framework. It would be better to just define a .gitignore file to ignore them; they would be pulled again automatically with npm install.
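For reference, the whole fix is one standard .gitignore entry (nothing roboto-specific); after that, a fresh checkout restores its dependencies from package.json by running npm install:

node_modules/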
Hi,
I'm trying to crawl a page (like http://021-online.com) which has charset=gb2312.
When I try to read the head/title field using cheerio (which is embedded), I get garbled characters instead of the proper text.
Am I missing some configuration, or is it a bug that prevents roboto from properly crawling non-UTF-8 pages?
It may also be a problem with cheerio rather than roboto itself.
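For comparison, here is a minimal decoding sketch outside roboto, assuming the page really is gb2312 (the standalone request/iconv-lite/cheerio usage is illustration only, not roboto configuration):

var request = require('request');
var iconv = require('iconv-lite');
var cheerio = require('cheerio');

// Fetch the raw bytes (encoding: null yields a Buffer instead of a utf-8 string),
// decode them as gb2312, and only then hand the markup to cheerio.
request({ url: 'http://021-online.com', encoding: null }, function(err, res, body) {
  if (err) throw err;
  var html = iconv.decode(body, 'gb2312');
  var $ = cheerio.load(html);
  console.log($('head title').text());
});

If this prints the title correctly, the issue is in how the response body is decoded before it reaches cheerio.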
Hello,
Is it possible to get some metrics, like request time or request latency?
It'd be great if we could create custom queues or frontiers and inject them into the crawler as an option or parameter.
What I want to do is use a database to store which URLs I have visited and how many times. I also want my own custom logic for URL queuing; for example, I could inject least-recently-visited URLs back into the queue when it becomes empty, to refresh existing data.
It'd be great if that queuing logic were split into a separate module that we could implement ourselves. Let me know if you understand what I'm trying to say; English is not really my native language.
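A hypothetical shape for such a pluggable frontier (the frontier option and its method names are assumptions, not an existing roboto interface; a real implementation would be backed by a database):

// Hypothetical: an injectable frontier that keeps the queuing logic in user code.
var myFrontier = {
  push: function(url, done) { /* record the url and its visit count in a database */ done(); },
  next: function(done) { /* hand back the next url to crawl, or null when empty */ done(null, null); },
  size: function(done) { done(null, 0); }
};

var crawler = new roboto.Crawler({
  startUrls: ['http://example.com/'],
  frontier: myFrontier   // assumed option name; roboto does not currently accept this
});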
Will this cause an infinite crawl for a bigger site? What strategy can be used to crawl the website efficiently?
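One common way to bound a crawl is to cap link depth and keep the crawler on the starting domain; the options below all appear in the example code later on this page, with placeholder values (whether they are enough for a very large site is another question):

var crawler = new roboto.Crawler({
  startUrls: ['http://example.com/'],
  constrainToRootDomains: true,  // don't wander off to other domains
  obeyRobotsTxt: true,           // respect robots.txt exclusions
  maxDepth: 5                    // stop following links after 5 hops from the start urls
});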
On many websites you are only interested in one kind of page, like a product details page. But what if there are no links to other product detail pages from there? The crawler would get stuck.
I think it could work for the crawler to crawl pages freely (respecting the robots.txt file, of course) but only parse the qualified pages to extract items.
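A user-level approximation with the existing parseField hook: keep following links everywhere, but only emit fields for URLs that look like product pages (the URL pattern and selector here are made-up examples):

// Only extract data when the url matches a product-details pattern;
// other pages are still crawled for their links but produce no fields.
var PRODUCT_URL = /\/products\/\d+/;   // example pattern, adjust per site

crawler.parseField('productName', function(response, $) {
  if (!PRODUCT_URL.test(response.url)) return;   // skip non-product pages
  return $('h1.product-name').text();
});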
Hello,
I didn't find anything in the docs, so I figured I would ask the question here. When the docs refer to the process of "Downloading and processing" the page, where does that information go? And does it get cleared after it's been "processed"? How is roboto when it comes to memory consumption? Do these tasks require a lot of memory? And if you "download and process" a ton of pages (like just letting it run forever and scrape every link it can find) will memory eventually become an issue?
Can't crawl https://facebook.com; the result is:
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO Beginning Crawl.
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO Crawling https://www.facebook.com/
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO Finished crawl.
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> Started: Thursday, May 14th 2015, 5:04:40 pm
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> Duration: 0 hours, 0 minutes, 0 seconds.
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> Page(s) crawled: 0
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> Page(s) nofollowed: 0
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> Request count: 0
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> Request errors: 0
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> 200 OK count : 0
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> 301 count : 0
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> 302 count : 0
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> 404 count : 0
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> Filtered due to
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> Disallowed domain: 0
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> Blacklist: 0
The code is:
var roboto = require('roboto');
var html_strip = require('htmlstrip-native').html_strip;

var stripOptions = {
  include_script: false,
  include_style: false,
  compact_whitespace: true
};

var crawler = new roboto.Crawler({
  startUrls: [
    'https://www.facebook.com/'
  ],
  constrainToRootDomains: true,
  obeyRobotsTxt: false,
  obeyNofollow: false,
  maxDepth: 50
});

/*
Dead/broken links start
var sites = [];
var deadLinks = {
  E503: [],
  E500: [],
  E404: [],
  E403: [],
  E302: []
};
deadLinks.E503.push();
*/

// Returns the time it took to make the request for the current page.
// Assumes a 'Start-time' header was attached to the outgoing request elsewhere.
crawler.parseField('requestTime', function(response) {
  return (new Date().getTime() - response.request.headers['Start-time']);
});

// Returns the url of the current page
crawler.parseField('url', function(response) {
  return response.url;
});

// Returns the status code of the current page
crawler.parseField('statusCode', function(response) {
  return response.statusCode;
});

// Returns the title of the current page
crawler.parseField('title', function(response, $) {
  return $('head title').text();
});

// Returns the word count of the current page
crawler.parseField('TextCount', function(response, $) {
  var html = $('body').html();
  if (html) {
    return countWords(html_strip(html, stripOptions));
  }
});

crawler.on('item', function(item) {
  //console.log(item.url);
});

crawler.on('finish', function() {
});

crawler.crawl();

/*
  Functions
*/
function countWords(s) {
  s = s.replace(/(^\s*)|(\s*$)/gi, ""); // strip leading and trailing whitespace
  s = s.replace(/[ ]{2,}/gi, " ");      // collapse 2 or more spaces to 1
  s = s.replace(/\n /, "\n");           // drop a space at the start of a line
  return s.split(' ').length;
}
I would like a way for roboto to use the logger provided by the containing application.
Possible approach (not tested): add an option such as customLogger (or something like that), set this._log = this.option.customLogger || log;, replace the log.* calls with this._log.*, and pass the logger through with itemLogger(this._log).
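From the caller's side the intent would look roughly like this (the customLogger option is the hypothetical one proposed above, not something roboto accepts today; any object with the usual log methods would do):

// Hypothetical: hand roboto the host application's logger instead of its built-in one.
var appLogger = {
  debug: console.log,
  info: console.log,
  warn: console.warn,
  error: console.error
};

var crawler = new roboto.Crawler({
  startUrls: ['http://example.com/'],
  customLogger: appLogger   // assumed option name from the proposal above
});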
Thanks for some awesome code BTW!
I think I saw it in the roadmap.
It would be nice if you could stop and then resume roboto so it does not start over from the beginning/startUrls. I think it could be achieved via de/serialization, so that when you stop and start it, it loads its previous state.
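A rough sketch of the de/serialization idea (the pendingUrls/visitedUrls properties are assumptions about internal state that roboto does not currently expose; this only illustrates the proposal):

var fs = require('fs');

// Hypothetical: dump the crawler's pending queue and visited set when stopping...
function saveState(crawler, file) {
  fs.writeFileSync(file, JSON.stringify({
    pending: crawler.pendingUrls,   // assumed internal state, not a real property
    visited: crawler.visitedUrls    // assumed internal state, not a real property
  }));
}

// ...and restore it before the next crawl() so the crawl resumes where it left off.
function loadState(crawler, file) {
  var state = JSON.parse(fs.readFileSync(file, 'utf8'));
  crawler.pendingUrls = state.pending;
  crawler.visitedUrls = state.visited;
}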
Roboto is very nice and useful; the one thing missing for me is a request path. When roboto crawls a page, it could store the URLs it visited to reach that page (the path to the current page). This is helpful when you crawl pages and need to know the context or the previous/parent page (e.g. it could be a category page).
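A user-level approximation with the existing hooks: while each page is parsed, remember it as the parent of every link it contains, so that when a child page later produces an item its referring page can be looked up (this bookkeeping lives outside roboto; the parents map is illustration only and tracks one level rather than the full path):

var url = require('url');

// Map from absolute child url -> the page it was first discovered on.
var parents = {};

crawler.parseField('links', function(response, $) {
  $('a[href]').each(function() {
    var child = url.resolve(response.url, $(this).attr('href'));
    if (!parents[child]) parents[child] = response.url;
  });
});

crawler.on('item', function(item) {
  console.log(item.url, 'was reached from', parents[item.url] || '(start url)');
});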