
Browsertrix Crawler

Browsertrix Crawler is a simplified browser-based high-fidelity crawling system, designed to run a single crawl in a single Docker container. It is intended as part of a more streamlined replacement for the original Browsertrix.

The original Browsertrix requires managing multiple containers and may be too complex for situations where only a single crawl is needed.

This is an attempt to refactor Browsertrix into a core crawling system driven by puppeteer-cluster and puppeteer.

Features

Thus far, Browsertrix Crawler supports:

  • Single-container, browser-based crawling with multiple headless/headful browsers
  • Support for some behaviors: autoplay to capture video/audio, scrolling
  • Support for direct capture for non-HTML resources
  • Extensible driver script for customizing behavior per crawl or page via Puppeteer (see the sketch below)
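
A custom driver is a small JavaScript module passed in via the --driver option. The following is only a minimal sketch, not the actual default driver: it assumes the crawler invokes the exported async function with the Puppeteer page and the queued URL in a data object (the task shape used by puppeteer-cluster); the file name and the data.url field are illustrative assumptions.

// myDriver.js - a hypothetical custom driver (module shape and data.url are assumptions)
module.exports = async ({ data, page }) => {
  // Navigate with plain Puppeteer; data.url is assumed to hold the queued URL
  await page.goto(data.url, { waitUntil: "load" });

  // Example per-page customization: scroll to the bottom to trigger lazy-loaded content
  await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
};

Such a script would be mounted into the container and selected with --driver.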

Architecture

The Docker container provided here packages up several components used in Browsertrix.

The system uses:

  • oldwebtoday/chrome - to install a recent version of Chrome (currently chrome:84)
  • puppeteer-cluster - for running Chrome browsers in parallel
  • pywb - in recording mode for capturing the content

The crawl produces a single pywb collection at /crawls/collections/<collection name> in the Docker container.

To access the contents of the crawl, the /crawls directory in the container should be mounted to a volume (default in the Docker Compose setup).
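
For example, after a crawl into the default capture collection, the mounted directory will typically contain something like the following (an illustrative layout assuming pywb's standard collection structure; exact file names vary):

crawls/
  collections/
    capture/
      archive/    # WARC files written by pywb during recording
      indexes/    # CDXJ index (populated when --generateCDX is set)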

Crawling Parameters

The image currently accepts the following parameters:

browsertrix-crawler [options]

Options:
      --help         Show help                                         [boolean]
      --version      Show version number                               [boolean]
  -u, --url          The URL to start crawling from          [string] [required]
  -w, --workers      The number of workers to run in parallel
                                                           [number] [default: 1]
      --newContext   The context for each new capture, can be a new: page,
                     session or browser.              [string] [default: "page"]
      --waitUntil    Puppeteer page.goto() condition to wait for before
                     continuing                                [default: "load"]
      --limit        Limit crawl to this number of pages   [number] [default: 0]
      --timeout      Timeout for each page to load (in seconds)
                                                          [number] [default: 90]
      --scope        Regex of page URLs that should be included in the crawl
                     (defaults to the immediate directory of URL)
      --exclude      Regex of page URLs that should be excluded from the crawl.
      --scroll       If set, will autoscroll to bottom of the page
                                                      [boolean] [default: false]
  -c, --collection   Collection name to crawl to (replay will be accessible
                     under this name in pywb preview)
                                                   [string] [default: "capture"]
      --headless     Run in headless mode, otherwise start xvfb
                                                      [boolean] [default: false]
      --driver       JS driver for the crawler
                                     [string] [default: "/app/defaultDriver.js"]
      --generateCDX  If set, generate index (CDXJ) for use with pywb after crawl
                     is done                          [boolean] [default: false]
      --generateWACZ If set, generate a WACZ archive for use with pywb after
                     the crawl is done                [boolean] [default: false]
      --text         If set, extract each page's full text and add it to the
                     pages.jsonl file                 [boolean] [default: false]
      --cwd          Crawl working directory for captures (pywb root). If not
                     set, defaults to process.cwd  [string] [default: "/crawls"]

For the --waitUntil flag, see page.goto waitUntil options.

The default is load, but for static sites, --waitUntil domcontentloaded may be used to speed up the crawl (for example, to avoid waiting for ads to load), while --waitUntil networkidle0 may make sense for dynamic sites.
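
For example, a crawl of a mostly static site could be sped up like this (the URL is illustrative; the flags are as documented above):

docker-compose run crawler crawl --url https://example.com/ --waitUntil domcontentloaded --workers 2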

Example Usage

With Docker-Compose

The included Docker Compose file simplifies building and running a crawl, and sets the options that would otherwise need to be passed to docker run, including mounting a volume.

For example, the following commands demonstrate building the image and running a simple crawl with 2 workers:

docker-compose build
docker-compose run crawler crawl --url https://webrecorder.net/ --generateCDX --collection wr-net --workers 2

In this example, the crawl data is written to ./crawls/collections/wr-net by default.

While the crawl is running, the status of the crawl (provided by puppeteer-cluster monitoring) is printed to the Docker log.

When done, the browsertrix-crawler image can also be used to start a local pywb instance to preview the crawl:

docker run -it -v $(pwd)/crawls:/crawls -p 8080:8080 webrecorder/browsertrix-crawler pywb

Then, loading http://localhost:8080/wr-net/https://webrecorder.net/ should show the recently crawled copy of the https://webrecorder.net/ site.

With docker run

Browsertrix Crawler can also be run directly with docker run, but a few more options are required.

In particular, the --cap-add and --shm-size flags are needed to run Chrome in Docker.

docker run -v $PWD/crawls:/crawls --cap-add=SYS_ADMIN --cap-add=NET_ADMIN --shm-size=1g -it webrecorder/browsertrix-crawler --url https://webrecorder.net/ --workers 2
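
The same run can also produce a WACZ package and extracted page text by adding the corresponding flags, for example (an illustrative combination of the documented options):

docker run -v $PWD/crawls:/crawls --cap-add=SYS_ADMIN --cap-add=NET_ADMIN --shm-size=1g -it webrecorder/browsertrix-crawler --url https://webrecorder.net/ --workers 2 --generateWACZ --text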

Support

Initial support for development of Browsertrix Crawler was provided by Kiwix.

The initial functionality for Browsertrix Crawler was developed to support the zimit project in a collaboration between Webrecorder and Kiwix, and this project has since been split off from zimit into a core component of Webrecorder.

License

AGPLv3 or later, see LICENSE for more details.

Contributors

  • ikreymer
  • atomotic
  • rgaudin
