Coder Social home page Coder Social logo

miroshnikov / scrapyteer Goto Github PK

View Code? Open in Web Editor NEW
18.0 1.0 0.0 393 KB

Web crawling & scraping framework for Node.js on top of headless Chrome browser

License: MIT License

TypeScript 51.59% JavaScript 48.41%
scrape scraper scrapy scraping-websites scrapy-crawler crawer web-crawler web-scraping scraping crawling

scrapyteer's Introduction

Scrapyteer

Scrapyteer is a Node.js web scraping framework/tool/library built on top of the headless Chrome browser Puppeteer.
It allows you to scrape both plain html pages and javascript generated content including SPAs (Single-Page Application) of any kind. Scrapyteer offers a small set of functions that forms an easy and concise DSL (Domain Specific Language) for web scraping and allows to define a crawling workflow and a shape of output data.

Examples

Scrapyteer uses a configuration file (scrapyteer.config.js by default). Here are some examples:

Simple example

Search books on amazon.com and get titles and ISBNs of books on the first page of the results.

const { pipe, open, select, enter, $$, $, text } = require('scrapyteer');

module.exports = {
    root: 'https://www.amazon.com',
    parse: pipe(
        open(),     // open amazon.com
        select('#searchDropdownBox', 'search-alias=stripbooks-intl-ship'),  // select 'Books' in dropdown
        enter('#twotabsearchtextbox', 'Web scraping'),   // enter search phrase 'Web scraping'
        $$('.a-section h2'),    // for every H2 on page
        {
            name: text,         // name = inner text of H2 element
            ISBN: pipe(         // go to link and grab ISBN from there if present
                $('a'),
                open(),         // open 'href' attribute of passed A element
                $('#rpi-attribute-book_details-isbn13 .rpi-attribute-value span'), 
                text            // grab inner text of a previously selected element
            )
        }
    )
}
/*
output.json

[
    {
        "name": "Web Scraping with Python: Collecting More Data from the Modern Web  ",
        "ISBN": "978-1491985571"
    },
    ...
]
*/

More elaborate example

Search books on amazon.com, get a number of attributes in JSON lines file and download the cover image of each book to a local directory.

const { pipe, open, select, enter, $$, $, text } = require('scrapyteer');

module.exports = {
    root: 'https://www.amazon.com',
    save: 'books.jsonl',   // saves as jsonl
    parse: pipe(
        open(),     // open amazon.com
        select('#searchDropdownBox', 'search-alias=stripbooks-intl-ship'),  // select 'Books' in dropdown
        enter('#twotabsearchtextbox', 'Web scraping'),   // enter search phrase
        $$('.a-section h2 > a'),    // for every H2 link on page
        open(),         // open 'href' attribute of passed A element
        {
                // on book's page grab all the necessary values
            name: $('#productTitle'),
            ISBN: $('#rpi-attribute-book_details-isbn13 .rpi-attribute-value span'),
            stars: pipe($('#acrPopover > span > a > span'), text, parseFloat),  // number of stars as float
            ratings: pipe($('#acrCustomerReviewLink > span'), text, parseInt),   // convert inner text that looks like 'NNN ratings' into an integer
            cover: pipe(                // save cover image as a file and set cover = file name
                $(['#imageBlockContainer img', '#ebooks-main-image-container img']),     // try several selectors
                save({dir: 'cover-images'})
            )   
        }
    )
}
/*
books.jsonl

{"name":"Web Scraping with Python: Collecting More Data from the Modern Web","ISBN":"978-1491985571","stars":4.6,"ratings":201,"cover":"sitb-sticker-v3-small._CB485933792_.png"}
{"name":"Web Scraping Basics for Recruiters: Learn How to Extract and Scrape Data from the Web","ISBN":null,"stars":4.9,"ratings":15,"cover":"41esb-CVhsL.jpg"}
...
*/

Installation

Locally

npm i -D scrapyteer
npm exec -- scrapyteer --config myconf.js.  # OR npx scrapyteer --config myconf.js

Locally as dependency

npm init
npm i -D scrapyteer

in package.json:

"scripts": {
  "scrape": "scrapyteer --config myconf.js"
}
npm run scrape

Globally

npm install -g scrapyteer
scrapyteer --config myconf.js

Make sure $NODE_PATH points to where global packages are located. If it doesn't, you may need to set it e.g. export NODE_PATH=/path/to/global/node_modules

Configuration options

save

A file name or console object, by default output.json in the current directory.
*.json and *.jsonl are currently supported.
If format is json the data is first collected in memory and then dumped to the file in one go, in jsonl data is written line by line (good for large datasets).

root

The root URL to scrape

parse

The parsing workflow: a pipe function, an object or an array

log

log: true turns on log output for debugging

noRevisit

Set true to not revisit already visited pages

options

    options: {
        browser: {
            headless: false
        }
    }

API

pipe

pipe(...args: any[])

Receives a set of functions and invoke them from left to right supplying the return value of the previous as input for the next. If an argument is not a function, it is converted to one (by indentity).
For objects and arrays all of their items/properties are also parsed.
If the return value is an array, the rest of the function chain will be invoked for all of its items.

open

Opens a given or root url

open(url: string|null = null)

$ / $$

$(selector: string|string[]) => Element|null
$$(selector: string|string[]) => Element[]

Calls querySelector / querySelectorAll on page/element.
If an array of selectors is passed, uses the first one that exists. It is useful if data may be in various places of the DOM.

attr

Returns an element's property value

attr(name: string)

text

Returns a text content of an element

save

save({dir='files'}: {dir: string, saveAs?: (name: string, ext: string) => string})

Saves a link to a file and returns the file name.
saveAs allows to modify a saved file name or extension.

type

Types text into an input

type(inputSelector: string, text: string, delay = 0)

select

Selects one or more values in a select

select(selectSelector: string, ...values: string[])

enter

Types text into an input and presses enter

enter(inputSelector: string, text: string, delay = 0)

scrapyteer's People

Contributors

miroshnikov avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.