Coder Social home page Coder Social logo

node-scraper's Introduction

node-scraper

A little module that makes scraping websites a little easier. Uses node.js and jQuery.

Installation

Via npm:

$ npm install scraper

Examples

Simple

First argument is an url as a string, second is a callback which exposes a jQuery object with your scraped site as "body" and third is an object from the request containing info about the url.

var scraper = require('scraper');
scraper('http://search.twitter.com/search?q=javascript', function(err, jQuery) {
    if (err) {throw err}

    jQuery('.msg').each(function() {
        console.log(jQuery(this).text().trim()+'\n');
    });
});

"Advanced"

First argument is an object containing settings for the "request" instance used internally, second is a callback which exposes a jQuery object with your scraped site as "body" and third is an object from the request containing info about the url.

var scraper = require('scraper');
scraper(
    {
       'uri': 'http://search.twitter.com/search?q=nodejs'
           , 'headers': {
               'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)'
           }
    }
    , function(err, $) {
        if (err) {throw err}

        $('.msg').each(function() {
            console.log($(this).text().trim()+'\n');
        });
    }
);

Parallel

First argument is an array containing either strings or objects, second is a callback which exposes a jQuery object with your scraped site as "body" and third is an object from the request containing info about the url.

You can also add rate limiting to the fetcher by adding an options object as the third argument containing 'reqPerSec': float.

var scraper = require('scraper');
scraper(
    [
        'http://search.twitter.com/search?q=javascript'
        , 'http://search.twitter.com/search?q=css'
        , {
            'uri': 'http://search.twitter.com/search?q=nodejs'
            , 'headers': {
                'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)'
            }
        }
        , 'http://search.twitter.com/search?q=html5'
    ]
    , function(err, $) {
        if (err) {throw err;}

        $('.msg').each(function() {
            console.log($(this).text().trim()+'\n');
        });
    }
    , {
        'reqPerSec': 0.2 // Wait 5sec between each external request
    }
);

Arguments

First (required)

Contains the info about what page/pages will be scraped

string

'http://www.nodejs.org'

or

request object

{
   'uri': 'http://search.twitter.com/search?q=nodejs'
       , 'headers': {
           'User-Agent': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)'
       }
}

or

Array (if you want to do fetches on multiple URLs)

[
    urlString
    , urlString
    , requestObject
    , urlString
]

Second (optional)

The callback that allows you do use the data retrieved from the fetch.

function(err, $) {
    if (err) {throw err;}
    
    $('.msg').each(function() {
        console.log($(this).text().trim()+'\n');
    }
}

Third (optional)

This argument is an object containing settings for the fetcher overall.

  • reqPerSec: float; (allows you to throttle your fetches so you don't hammer the server you are scraping)

Depends on

node-scraper's People

Contributors

mape avatar

Watchers

Esco Obong avatar James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.