
roboto's People

Contributors

arsalandotme, f1ames, halfnelson, jculvey, olavtenbosch


roboto's Issues

Different Parsers for different domains

I think it's a common case that similar kinds of information (e.g. images, names, etc.) are accessible via different CSS queries on different domains and require different parsing logic.
How about providing a parser for each domain, similar to how methods in express.js can be set as handlers for different routes?
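
One way to approximate this with the existing parseField API (the same one used in the examples further down) is to dispatch inside the callback based on the response URL's hostname. A minimal sketch; the hostnames and selectors are placeholders:

var roboto = require('roboto');
var url = require('url');

var crawler = new roboto.Crawler({
  startUrls: ['http://shop-a.example/']
});

// Per-domain parsing functions, keyed by hostname (placeholders).
var titleParsers = {
  'shop-a.example': function($) { return $('h1.product-title').text(); },
  'shop-b.example': function($) { return $('#name').text(); }
};

crawler.parseField('title', function(response, $) {
  var host = url.parse(response.url).hostname;
  var parse = titleParsers[host];
  // Fall back to a generic parser for domains without a specific one.
  return parse ? parse($) : $('head title').text();
});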

Remove the node_modules folder and add a .gitignore

It's really unnecessary to have all those node_modules bundled with the framework. It would be better to define a .gitignore file that ignores them; they would be pulled in again automatically with npm install.
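
For reference, the only entry needed would be something like:

# .gitignore
node_modules/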

Unable to crawl a page with encoding other than utf8

Hi,

I'm trying to crawl a page (like http://021-online.com) which has charset=gb2312.
When I try to read the head/title field using the embedded cheerio, I get garbage instead of the proper characters.

Am I missing some configuration, or is it a bug that prevents roboto from properly crawling non-utf8 pages?
It may also be a problem with cheerio rather than roboto itself.
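
As a workaround outside of roboto, the page can be fetched as a raw buffer and converted to UTF-8 before handing it to cheerio, for example with the iconv-lite package. A minimal sketch using the URL and charset from above:

var request = require('request');
var iconv = require('iconv-lite');
var cheerio = require('cheerio');

request({ url: 'http://021-online.com', encoding: null }, function(err, res, body) {
  if (err) throw err;
  // body is a raw Buffer because encoding is null; decode it from gb2312
  var html = iconv.decode(body, 'gb2312');
  var $ = cheerio.load(html);
  console.log($('head title').text());
});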

Custom queuing logic?

It'd be great if we could create custom queues or frontiers and inject them into the crawler as an option or parameter.
What I want to do is use a database to store which URLs I have visited and how many times. I also want my own custom logic for URL queuing. For example, I could inject the least recently visited URLs back into the queue when it becomes empty, to refresh existing data.

It'd be great if that queuing logic were a separate module that we could implement ourselves. Let me know if you understand what I'm trying to say; English is not really my native language.
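
Something like the following hypothetical frontier module is what I have in mind; none of this exists in roboto today, it only illustrates the shape of the thing I would like to inject:

// Hypothetical queue/frontier module backed by a database (sqlite3-style API assumed).
function DatabaseFrontier(db) {
  this.db = db;
}

// Called whenever the crawler discovers a new link.
DatabaseFrontier.prototype.enqueue = function(url, done) {
  this.db.run('INSERT OR IGNORE INTO frontier (url, visits) VALUES (?, 0)', [url], done);
};

// Called whenever the crawler needs the next url; least recently visited first.
DatabaseFrontier.prototype.dequeue = function(done) {
  this.db.get('SELECT url FROM frontier ORDER BY visits ASC LIMIT 1', done);
};

// Hypothetical injection point, e.g.:
// var crawler = new roboto.Crawler({ startUrls: [...], frontier: new DatabaseFrontier(db) });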

Infinite crawl

Will this cause an infinite crawl on bigger sites? What strategy can be used to crawl a website efficiently?
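
Until there is a dedicated strategy, the options that appear in the examples further down can at least bound a crawl. A sketch:

var roboto = require('roboto');

var crawler = new roboto.Crawler({
  startUrls: ['http://example.com/'],
  constrainToRootDomains: true, // stay on the starting domain(s)
  obeyRobotsTxt: true,
  maxDepth: 3                   // stop following links after a few hops
});

crawler.crawl();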

Crawl any page, parse only whitelisted and not blacklisted pages

On many websites, you are only interested in one kind of page, like a product details page. But what if there are no links to other product detail pages from there? The crawler would get stuck.
I think it could work so that the crawler crawls pages freely (respecting the robots.txt file, of course) but only parses the qualifying pages to extract items.
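
With the current API this can be approximated by filtering in the 'item' handler (item.url is available, as in the Facebook example further down); the product-page pattern and saveItem are placeholders:

crawler.on('item', function(item) {
  // Only keep items from pages that look like product detail pages.
  if (/\/product\/\d+/.test(item.url)) {
    saveItem(item); // placeholder for whatever persistence you use
  }
});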

Question about roboto and memory usage

Hello,

I didn't find anything in the docs, so I figured I would ask the question here. When the docs refer to the process of "downloading and processing" a page, where does that information go, and does it get cleared after it's been processed? How does roboto fare when it comes to memory consumption? Do these tasks require a lot of memory? And if you download and process a ton of pages (like just letting it run forever and scraping every link it can find), will memory eventually become an issue?

Can't crawl facebook.com

Can't crawl https://facebook.com; the result is:
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO Beginning Crawl.
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO Crawling https://www.facebook.com/
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO Finished crawl.
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> Started: Thursday, May 14th 2015, 5:04:40 pm
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> Duration: 0 hours, 0 minutes, 0 seconds.
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> Page(s) crawled: 0
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> Page(s) nofollowed: 0
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> Request count: 0
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> Request errors: 0
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> 200 OK count : 0
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> 301 count : 0
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> 302 count : 0
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> 404 count : 0
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> Filtered due to
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> Disallowed domain: 0
[Thu May 14 2015 17:04:40 GMT+0200 (CEST)] INFO >> Blacklist: 0
The code is:
var roboto = require('roboto');
var html_strip = require('htmlstrip-native').html_strip;
var stripOptions = {
include_script : false,
include_style : false,
compact_whitespace : true
};

var crawler = new roboto.Crawler({
startUrls: [
'https://www.facebook.com/'
],
constrainToRootDomains: true,
obeyRobotsTxt: false,
obeyNofollow: false,
maxDepth: 50,
});

/*

Dead/broken links start

var sites = [];
var deadLinks = {
E503:[],
E500:[],
E404:[],
E403:[],
E302:[]
};

deadLinks.E503.push();
*/
// Returns time it takes to make the request for current site
crawler.parseField('requestTime', function(response) {

return (new Date().getTime()-response.request.headers['Start-time']);
});
// Returns url of current site
crawler.parseField('url', function(response) {
return response.url;
});
// Returns status code of current site
crawler.parseField('statusCode', function(response) {
return response.statusCode;
});
// returns title of current site
crawler.parseField('title', function(response, $) {
return $('head title').text();
});
// returns count of words on current site
crawler.parseField('TextCount', function(response, $) {
var html = $('body').html();
if (html) {
return countWords(html_strip(html, stripOptions));
}
});

crawler.on('item', function(item) {
//console.log(item.url);

});

crawler.on('finish', function() {

});

crawler.crawl();

/*
Functions
*/
function countWords(s){
  s = s.replace(/(^\s*)|(\s*$)/gi,""); // exclude start and end white-space
  s = s.replace(/[ ]{2,}/gi," ");      // 2 or more spaces to 1
  s = s.replace(/\n /,"\n");           // exclude newline with a start spacing
  return s.split(' ').length;
}

Provide external logger

Would like a way for roboto to use the logger provided by the containing application.

Possible approach (not tested; rough sketch after the list):

  1. Add an option customLogger (or something like that)
  2. In constructor add this._log = this.option.customLogger || log;
  3. Replace all log.* with this._log.*
  4. Add a parameter to itemLogger constructor to accept external log object itemLogger(this._log)
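
A rough sketch of steps 1-3, using the names from the list above and assuming the constructor keeps its options on something like this.options:

// Inside roboto's Crawler constructor (sketch, not the actual source):
this._log = this.options.customLogger || log; // fall back to the built-in logger

// Then existing calls such as log.info(...) become:
this._log.info('Beginning Crawl.');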

Thanks for some awesome code BTW!

Stop/resume

I think I saw it on the roadmap.
It would be nice if you could stop and then resume roboto so it does not start over from the beginning/startUrls. I think this could be achieved via de/serialization, so that when you stop and start it, it saves and reloads its previous state.
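
Purely hypothetical, but the serialized state could be as simple as a JSON file of pending URLs that the crawler writes on stop and reads on start:

var fs = require('fs');

// Hypothetical: called when the crawl is stopped.
function saveState(pendingUrls) {
  fs.writeFileSync('crawl-state.json', JSON.stringify({ pending: pendingUrls }));
}

// Hypothetical: called on startup; the result would seed the queue instead of startUrls.
function loadState() {
  if (!fs.existsSync('crawl-state.json')) return null;
  return JSON.parse(fs.readFileSync('crawl-state.json', 'utf8')).pending;
}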

Add path for the item

Roboto is very nice and useful; the one thing missing for me is a request path. When roboto crawls pages, it could store the URLs it visited to reach the current page (the path to the current page). This is helpful when you crawl pages and need to know the context or the previous/parent page (e.g. a category page).
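
What I imagine is an extra field on the emitted item, something like this hypothetical shape:

// Hypothetical item with the requested path field:
// {
//   url: 'http://example.com/products/42',
//   title: '...',
//   path: [
//     'http://example.com/',           // start url
//     'http://example.com/category/7', // parent page (e.g. a category page)
//     'http://example.com/products/42' // current page
//   ]
// }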
