
n1k0 / pjscrape

This project forked from nrabinowitz/pjscrape


A web-scraping framework written in JavaScript, using PhantomJS and jQuery

Home Page: http://nrabinowitz.github.com/pjscrape/

License: MIT License

Overview

pjscrape is a framework for anyone who's ever wanted a command-line tool for web scraping using JavaScript and jQuery. Built to run with PhantomJS, it lets you scrape pages in a fully rendered, JavaScript-enabled context from the command line, no browser required.

Features

  • Client-side, JavaScript-based scraping environment with full access to jQuery functions
  • Easy, flexible syntax for setting up one or more scrapers
  • Recursive/crawl scraping
  • Delay scrape until a "ready" condition occurs
  • Load your own scripts on the page before scraping
  • Modular architecture for logging and writing/formatting scraped items
  • Client-side utilities for common tasks
  • Growing set of unit tests

Usage

  1. Download and install PhantomJS or PyPhantomJS, version 1.2. To use file-based logging or data writes, you'll need PyPhantomJS with the Save to File plugin (though I think this feature will be rolled into the PhantomJS core in the next version).

  2. Make a config file to define your scraper(s). Config files can set global pjscrape settings via pjs.config() and add one or more scraper suites via pjs.addSuite().

  3. A scraper suite defines a set of scraper functions for one or more URLs. More docs on this coming soon, but a sample config file might look like this:

    pjs.addSuite({
        title: 'My Scraper Suite',
        // single URL or array
        urls: [
            'http://www.example.com/page1',
            'http://www.example.com/page2'
        ],
        // one or more functions, evaluated in the client
        scrapers: [
            function() {
                var items = [];
                $('h2').each(function() {
                    items.push($(this).text());
                });
                return items;
            }
        ]
    });
    

    A simple scraper can be added with the pjs.addScraper() function:

    pjs.addScraper(
        'http://www.example.com/page.html',
        function() {
            return $('h1').first().text();
        }
    );
    
  4. To run pjscrape from the command line, type:

    pyphantomjs /path/to/pjscrape.js my_config_file.js

By default, the log output is pretty verbose, and the scraped data is written as JSON to stdout at the end of the scrape. You can configure logging, formatting, and writing data using pjs.config():

pjs.config({ 
    // options: 'stdout', 'file' (set in config.logFile) or 'none'
    log: 'stdout',
    // options: 'json' or 'csv'
    format: 'json',
    // options: 'stdout' or 'file' (set in config.outFile)
    writer: 'file',
    outFile: 'scrape_output.json'
});
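
As a rough sketch of how the pieces fit together, a single config file might combine pjs.config() and pjs.addSuite() as follows. The file paths, the suite title, and the variable names settings and suite are illustrative assumptions; only the option names (log, logFile, format, writer, outFile, title, urls, scrapers) come from the examples above. The typeof guard is not required by pjscrape; it is there only so the file can be loaded and inspected outside PhantomJS without errors.

```javascript
// Sketch of a complete config file. All paths and names here are hypothetical.

// Global settings: send the verbose log to a file and write the data as JSON.
var settings = {
    log: 'file',
    logFile: 'scrape.log',         // hypothetical path, used because log is 'file'
    format: 'json',
    writer: 'file',
    outFile: 'scrape_output.json'  // used because writer is 'file'
};

// One suite: collect the text of every h2 on two pages.
var suite = {
    title: 'Headings',
    urls: [
        'http://www.example.com/page1',
        'http://www.example.com/page2'
    ],
    scrapers: [
        // Evaluated in the client, where jQuery's $ is available.
        function() {
            var items = [];
            $('h2').each(function() {
                items.push($(this).text());
            });
            return items;
        }
    ]
};

// pjs is only defined when this file is loaded by pjscrape.
if (typeof pjs !== 'undefined') {
    pjs.config(settings);
    pjs.addSuite(suite);
}
```

Keeping the settings in plain objects before handing them to pjs makes it easy to switch between, say, stdout and file output by changing one value.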

Questions?

Comments and questions are welcome at nick (at) nickrabinowitz (dot) com.
