
Puppeteer Cluster


Create a cluster of puppeteer workers. This library spawns a pool of Chromium instances via Puppeteer and helps to keep track of jobs and errors. This is helpful if you want to crawl multiple pages or run tests in parallel. Puppeteer Cluster takes care of reusing Chromium and restarting the browser in case of errors.

What does this library do?
  • Handling of crawling errors
  • Auto restarts the browser in case of a crash
  • Can automatically retry if a job fails
  • Different concurrency models to choose from (pages, contexts, browsers)
  • Simple to use, small boilerplate
  • Progress view and monitoring statistics (see below)

Installation

Install using your favorite package manager:

npm install --save puppeteer # in case you don't already have it installed 
npm install --save puppeteer-cluster

Alternatively, use yarn:

yarn add puppeteer puppeteer-cluster

Usage

The following is a typical example of using puppeteer-cluster. A cluster is created with 2 concurrent workers. Then a task is defined which includes going to the URL and taking a screenshot. We then queue two jobs and wait for the cluster to finish.

const { Cluster } = require('puppeteer-cluster');

(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 2,
  });

  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    const screen = await page.screenshot();
    // Store screenshot, do something else
  });

  cluster.queue('http://www.google.com/');
  cluster.queue('http://www.wikipedia.org/');
  // many more pages

  await cluster.idle();
  await cluster.close();
})();

Examples

Concurrency implementations

There are different concurrency models, which define how strongly each job is isolated from the others. You set the model via the options when calling Cluster.launch. The default is Cluster.CONCURRENCY_CONTEXT, but it is recommended to always specify explicitly which one you want to use.

The concurrency models and what they share:

  • CONCURRENCY_PAGE: one page for each URL. Shares everything (cookies, localStorage, etc.) between jobs.
  • CONCURRENCY_CONTEXT: one incognito context (see BrowserContext) for each URL. No shared data.
  • CONCURRENCY_BROWSER: one browser (using an incognito page) per URL. If one browser instance crashes for any reason, this does not affect other jobs. No shared data.
  • Custom concurrency (experimental): you can create your own concurrency implementation. Copy one of the files in the concurrency/built-in directory and implement ConcurrencyImplementation, then provide the class to the concurrency option. This part of the library is currently experimental and might break in the future, even in a minor version upgrade, while the version has not reached 1.0. Shared data depends on your implementation.

Typings for input/output (via TypeScript Generics)

To allow proper type checks with TypeScript you can provide generics. In case no types are provided, any is assumed for input and output. See the following minimal example or check out the more complex typings example for more information.

  const cluster: Cluster<string, number> = await Cluster.launch(/* ... */);

  await cluster.task(async ({ page, data }) => {
    // TypeScript knows that data is a string and expects this function to return a number
    return 123;
  });

  // Typescript expects a string as argument ...
  cluster.queue('http://...');

  // ... and will return a number when execute is called.
  const result = await cluster.execute('https://www.google.com');

Debugging

Check out the Puppeteer debugging tips first; your problem might be related to Puppeteer itself rather than puppeteer-cluster. Additionally, you can enable verbose logging to see which data is consumed by which worker and some other cluster information. Set the DEBUG environment variable to puppeteer-cluster:*. See the example below or check out the debug docs for more information.

# Linux
DEBUG='puppeteer-cluster:*' node examples/minimal
# Windows Powershell
$env:DEBUG='puppeteer-cluster:*';node examples/minimal

API

class: Cluster

The Cluster module provides a method to launch a cluster of Chromium instances.

event: 'taskerror'

Emitted when a queued task ends in an error, whether due to a network error, your code throwing, a timeout being hit, etc. The first argument is the error itself. The second argument is the URL or data of the job (as given to Cluster.queue). The third argument is a boolean indicating whether the task will be retried: if retryLimit is set to a value greater than 0, the cluster automatically requeues the job and retries it later. For tasks queued via Cluster.execute, no event is fired.

  cluster.on('taskerror', (err, data, willRetry) => {
      if (willRetry) {
        console.warn(`Encountered an error while crawling ${data}. ${err.message}\nThis job will be retried`);
      } else {
        console.error(`Failed to crawl ${data}: ${err.message}`);
      }
  });

event: 'queue'

Emitted when a task is queued via Cluster.queue or Cluster.execute. The first argument is the object containing the data (if any data is provided). The second argument is the queued function (if any). In case only a function is provided via Cluster.queue or Cluster.execute, the first argument will be undefined. If only data is provided, the second argument will be undefined.
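
A listener in the same style as the taskerror example above; the helper name is hypothetical, and the undefined checks mirror the two cases described:

```javascript
// Sketch of a 'queue' listener (helper name hypothetical). Either argument
// may be undefined: data is missing when only a function was queued, and
// taskFunction is missing when the job uses the cluster-wide task.
function logQueueEvents(cluster) {
  cluster.on('queue', (data, taskFunction) => {
    const dataInfo = data === undefined ? 'no data' : JSON.stringify(data);
    const fnInfo = taskFunction === undefined ? 'default task' : 'per-job task';
    console.log(`Job queued (${dataInfo}, ${fnInfo})`);
  });
}
```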

Cluster.launch(options)

  • options <Object> Set of configurable options for the cluster. Can have the following fields:
    • concurrency <Cluster.CONCURRENCY_PAGE|Cluster.CONCURRENCY_CONTEXT|Cluster.CONCURRENCY_BROWSER|ConcurrencyImplementation> The chosen concurrency model. See Concurrency models for more information. Defaults to Cluster.CONCURRENCY_CONTEXT. Alternatively you can provide a class implementing ConcurrencyImplementation.
    • maxConcurrency <number> Maximal number of parallel workers. Defaults to 1.
    • puppeteerOptions <Object> Object passed to puppeteer.launch. See puppeteer documentation for more information. Defaults to {}.
    • perBrowserOptions <Array<Object>> Object passed to puppeteer.launch for each individual browser. If set, puppeteerOptions will be ignored. Defaults to undefined (meaning that puppeteerOptions will be used).
    • retryLimit <number> How often a failed job is retried before being marked as failed. Ignored by tasks queued via Cluster.execute. Defaults to 0.
    • retryDelay <number> Minimum time that must pass between a job's execution and its retry. Ignored by tasks queued via Cluster.execute. Defaults to 0.
    • sameDomainDelay <number> How much time should pass at minimum between two requests to the same domain. If you use this field, the queued data must be your URL or data must be an object containing a field called url.
    • skipDuplicateUrls <boolean> If set to true, will skip URLs which were already crawled by the cluster. Defaults to false. If you use this field, the queued data must be your URL or data must be an object containing a field called url.
    • timeout <number> Specify a timeout for all tasks. Defaults to 30000 (30 seconds).
    • monitor <boolean> If set to true, will provide a small command line output to provide information about the crawling process. Defaults to false.
    • workerCreationDelay <number> Time between creation of two workers. Set this to a value like 100 (0.1 seconds) in case you want some time to pass before another worker is created. You can use this to prevent a network peak right at the start. Defaults to 0 (no delay).
    • puppeteer <Object> In case you want to use a different puppeteer library (like puppeteer-core or puppeteer-extra), pass the object here. If not set, will default to using puppeteer. When using puppeteer-core, make sure to also provide puppeteerOptions.executablePath.
  • returns: <Promise<Cluster>>

The method launches a cluster instance.
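
For illustration, a launch configuration combining several of these options might look like the following (all values are made up; concurrency would additionally be set to one of the Cluster.CONCURRENCY_* constants):

```javascript
// Illustrative option values only; pass this object to Cluster.launch.
const launchOptions = {
  // concurrency: Cluster.CONCURRENCY_CONTEXT, // the default model
  maxConcurrency: 4,       // up to four parallel workers
  retryLimit: 2,           // retry a failed job up to two times
  retryDelay: 1000,        // wait at least one second before retrying
  timeout: 60000,          // raise the per-task timeout from 30s to 60s
  skipDuplicateUrls: true, // queued data must then be a URL or contain a url field
  monitor: true,           // print progress statistics to the terminal
};
```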

cluster.task(taskFunction)

  • taskFunction <function(string|Object, Page, Object)> Sets the function, which will be called for each job. The function will be called with an object having the following fields:
    • page <Page> The page given by puppeteer, which provides methods to interact with a single tab in Chromium.
    • data The data of the job you provided to Cluster.queue.
    • worker <Object> An object containing information about the worker executing the current job.
      • id <number> ID of the worker. Worker IDs start at 0.
  • returns: <Promise>

Specifies a task for the cluster. A task is called for each job you queue via Cluster.queue. Alternatively you can directly queue the function that you want to be executed. See Cluster.queue for an example.
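
As a sketch of this shape (URL and logging are illustrative, not from the library), a task that records which worker handled which job:

```javascript
// A task in the documented shape; worker.id identifies which worker
// (IDs start at 0) handled the job.
const taskFunction = async ({ page, data: url, worker }) => {
  await page.goto(url);
  const title = await page.title();
  console.log(`Worker ${worker.id} crawled ${url}: ${title}`);
  return title;
};
// Registered as usual: await cluster.task(taskFunction);
```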

cluster.queue([data] [, taskFunction])

  • data Data to be queued. This might be your URL (a string) or a more complex object containing data. The data given will be provided to your task function(s). See [examples] for a more complex usage of this argument.
  • taskFunction <function> Function like the one given to Cluster.task. If a function is provided, this function will be called (only for this job) instead of the function provided to Cluster.task. The function will be called with an object having the following fields:
    • page <Page> The page given by puppeteer, which provides methods to interact with a single tab in Chromium.
    • data The data of the job you provided as first argument to Cluster.queue. This might be undefined in case you only specified a function.
    • worker <Object> An object containing information about the worker executing the current job.
      • id <number> ID of the worker. Worker IDs start at 0.
  • returns: <Promise>

Puts a URL or data into the queue. Alternatively (or even additionally) you can queue functions. See the examples about function queuing for more information: (Simple function queuing, complex function queuing).

Be aware that this function returns a Promise only for backward compatibility; it does not run asynchronously and returns immediately.
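
A sketch of a per-job function (name and URL hypothetical) that could be passed as the second argument:

```javascript
// A per-job task passed as the second argument to cluster.queue; it runs
// instead of the function registered via cluster.task, for this job only.
const extractTitle = async ({ page, data: url }) => {
  await page.goto(url);
  return page.title();
};
// Usage: cluster.queue('https://www.wikipedia.org/', extractTitle);
```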

cluster.execute([data] [, taskFunction])

  • data Data to be queued. This might be your URL (a string) or a more complex object containing data. The data given will be provided to your task function(s). See [examples] for a more complex usage of this argument.
  • taskFunction <function> Function like the one given to Cluster.task. If a function is provided, this function will be called (only for this job) instead of the function provided to Cluster.task. The function will be called with an object having the following fields:
    • page <Page> The page given by puppeteer, which provides methods to interact with a single tab in Chromium.
    • data The data of the job you provided as first argument to Cluster.queue. This might be undefined in case you only specified a function.
    • worker <Object> An object containing information about the worker executing the current job.
      • id <number> ID of the worker. Worker IDs start at 0.
  • returns: <Promise>

Works like Cluster.queue, but returns a Promise which is resolved after the task has been executed. That means the job is still queued, but the script waits for it to finish. If an error happens during execution, this function rejects the Promise with the thrown error, and no "taskerror" event is fired. In addition, tasks queued via execute ignore "retryLimit" and "retryDelay". For an example see the Execute example.
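
Since execute rejects instead of firing a taskerror event, callers may want to wrap it; a minimal sketch with a hypothetical helper name:

```javascript
// Wraps cluster.execute so a task error becomes a result value
// instead of a rejection the caller has to try/catch at every call site.
async function safeExecute(cluster, data) {
  try {
    const value = await cluster.execute(data);
    return { ok: true, value };
  } catch (error) {
    return { ok: false, error };
  }
}
```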

cluster.idle()

Promise is resolved when the queue becomes empty.

cluster.close()

Closes the cluster and all opened Chromium instances including all open pages (if any were opened). It is recommended to run Cluster.idle before calling this function. The Cluster object itself is considered to be disposed and cannot be used anymore.

License

MIT license.

puppeteer-cluster's People

Contributors

apn-carmine, cd9, daniellevinson, dependabot-preview[bot], dependabot-support, greenkeeper[bot], honzamac, hugopoi, ilantc, jackmac92, mhaseebkhan, shannonmoeller, thomasdondorf


puppeteer-cluster's Issues

Will it be delayed for 20 seconds?

The code is as follows:

const { Cluster } = require('puppeteer-cluster');

(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 2,
    sameDomainDelay: 20 * 1000, // Will it be delayed for 20 seconds?
  });

  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    const screen = await page.screenshot();
    // Store screenshot, do something else
  });

  await cluster.queue('http://www.google.com/a.html');
  await cluster.queue('http://www.google.com/b.html');
  await cluster.queue('http://www.google.com/c.html');
  // many more pages

  await cluster.idle();
  await cluster.close();
})();

My question is: if a.html opens first, will b.html and c.html then be delayed by 20 seconds before opening?
I don't understand how this sameDomainDelay is used.

Timeout config not honored

Related code:

  const cluster = await Cluster.launch({
    puppeteerOptions: {
      headless: true,
      ignoreHTTPSErrors: env.IGNORE_HTTPS || false,
      args: ['--disable-http2'],
      timeout: env.PUPPETEER_TIMEOUT || 60000,            //attempt 2
    },
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: parseInt(env.MAX_WORKER) || 4,
    skipDuplicateUrls: false,
    monitor: env.MONITOR === 'true' || false,
    timeout: env.PUPPETEER_TIMEOUT || 60000,              //attempt 1
  });

I've tried setting timeout in the cluster launch options and also passing it to puppeteerOptions; both failed. The log says the timeout was still 30000.

app:cluster:err TimeoutError: Navigation Timeout Exceeded: 30000ms exceeded
  app:cluster:err     at Promise.then (/home/bambang/project/om-screenshoot/node_modules/puppeteer/lib/FrameManager.js:1276:21)
  app:cluster:err   -- ASYNC --
  app:cluster:err     at Frame.<anonymous> (/home/bambang/project/om-screenshoot/node_modules/puppeteer/lib/helper.js:144:27)
  app:cluster:err     at Page.goto (/home/bambang/project/om-screenshoot/node_modules/puppeteer/lib/Page.js:624:49)
  app:cluster:err     at Page.<anonymous> (/home/bambang/project/om-screenshoot/node_modules/puppeteer/lib/helper.js:145:23)
  app:cluster:err     at GenericHandler.processPage (/home/bambang/project/om-screenshoot/src/handlers/v1base.js:47:21)
  app:cluster:err     at GenericHandler.process (/home/bambang/project/om-screenshoot/src/handlers/v1base.js:94:16)
  app:cluster:err     at module.exports (/home/bambang/project/om-screenshoot/src/handlers/site.js:27:24)
  app:cluster:err     at Worker.<anonymous> (/home/bambang/project/om-screenshoot/node_modules/puppeteer-cluster/dist/Worker.js:56:54)
  app:cluster:err     at Generator.next (<anonymous>)
  app:cluster:err     at fulfilled (/home/bambang/project/om-screenshoot/node_modules/puppeteer-cluster/dist/Worker.js:4:58)
  app:cluster:err     at process.internalTickCallback (internal/process/next_tick.js:77:7) +789ms

Any guidance on how to trace/ fix this issue?

Usage with JEST tests in different files?

A common use-case would be to have many different tests spread out over multiple files.

This seems to be exactly what I need to speed up my tests - but I don't understand how to utilise it to run tests in different files in parallel.

Ex;
One test suite in home/tests/e2e/LoginPage.test.js
Another test suite in loan/tests/e2e/OverviewPage.test.js

I understand I could use it within the same test suite - but what about running different test suites in parallel?

Limit number of tasks per browser instance

Is there any way to limit the number of tasks used per browser instance? I'm thinking of something along the lines (perhaps) of tasksPerInstance: 1000: the cluster would track the number of tasks run in a specific browser instance, and whenever that limit is reached it would kill that browser instance and launch another, as a (potential) shield against browser memory growth. It's a technique I've seen used in other process pooling models (I think some of the Apache web server modules let you specify a maximum number of requests a worker process will serve before it is terminated and replaced with a fresh process).

Tests might silently fail

A failing expect call will not lead to an error if it gets caught. See jestjs/jest#3917 for discussion. This might currently lead to failing tests that are not reported, as the generous error handling catches them.

Three options:

  1. Rename taskerror to error which will make sure that Node.js crashes in that case. Users will have to take care of the error handler then.
  2. Enable an option throwOnTaskerror so that task errors will not get caught
  3. Just take care of it in the tests

Cannot find module ../dist

I'm trying to run puppeteer-cluster with the minimal.js example. I'm getting the following error:

  • Windows 7
  • node: v10.15.à
  • npm: v6.4.1

D:\Developpement\NodeJS\minimal>node minimal.js
internal/modules/cjs/loader.js:583
throw err;
^

Error: Cannot find module '../dist'
at Function.Module._resolveFilename (internal/modules/cjs/loader.js:581:15)
at Function.Module._load (internal/modules/cjs/loader.js:507:25)
at Module.require (internal/modules/cjs/loader.js:637:17)
at require (internal/modules/cjs/helpers.js:22:18)
at Object. (D:\Developpement\NodeJS\minimal\minimal.js:1:83)
at Module._compile (internal/modules/cjs/loader.js:689:30)
at Object.Module._extensions..js (internal/modules/cjs/loader.js:700:10)
at Module.load (internal/modules/cjs/loader.js:599:32)
at tryModuleLoad (internal/modules/cjs/loader.js:538:12)
at Function.Module._load (internal/modules/cjs/loader.js:530:3)

With my configuration the directory ../dist does not exist.
I have

24/01/2019 15:10 .
24/01/2019 15:10 ..
24/01/2019 15:10 minimal
24/01/2019 15:04 node_modules

I replaced const { Cluster } = require('../dist'); with const { Cluster } = require('puppeteer-cluster'); and now it works.

Multiple crawl does not crawl all my URLs

When I run puppeteer-cluster with 100 URLs, it only crawls 98 or 99 of them.
Here is my code:

const { Cluster } = require('puppeteer-cluster');

const link = [];
let total = 0;
let start = 3;
const size = process.argv[2];
for (let i = 0; i < size; i++) {
  link.push(process.argv[start++]);
}

(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_PAGE,
    maxConcurrency: 50,
    timeout: 400000,
    monitor: false,
  });

  await cluster.task(async ({ page, data: url }) => {
    const response = await page.goto(url, { timeout: 100000, waitUntil: 'networkidle2' });
    console.log(response.url());

    if (response.status() == 404) {
      console.log('program encountered error');
      return;
    }
    total++; // counts the number of urls

    const hrefs = await page.evaluate(() => {
      const anchors = document.querySelectorAll('a');
      return [].map.call(anchors, a => a.href);
    });
  });

  for (let i = 0; i < size; i++) { await cluster.queue(link[i]); }

  await cluster.idle();
  await cluster.close();

  console.log(total);
  process.exit(0);
})();

Roadmap for v1.0

I'm thinking about what kind of functionality this library should provide before it should be released as v1. I might edit the list in the future:

My goals:

  • (#25) Make sure it's reliable and crawl more than 10 million pages with it (so far the maximum I crawled was ~800k pages)
  • (#9) Improve sameDomainDelay and skipDuplicateUrls. Detection of domains should use TLD.js for example. Documentation should be better. And there should be a way to provide the URL without using data or { url: ... } Not a goal for 1.0 anymore
  • (#28) Optimize the code, fix code smells
  • More tests, get code coverage up to > 90%
  • More documentation on the concurrency types. Maybe make CONCURRENCY_BROWSER the default as it is more robust?
  • More code snippets in the documentation page (for Cluster.queue for example)
  • Provide a cluster.execute function which executes the job with higher priority (does not queue it at the end) and returns a Promise which is resolved when the job is finished. Might also solve this confusion: #10 (comment)
  • Statistics API: How many jobs in queue, how many jobs processes, etc.
  • #41 Offer more functionality, maybe provide a way to use puppeteer-extra?
  • #36 Sandbox Offer a way to run code from users in a sandbox, maybe even Docker? => This can now be implemented via custom concurrency implementations (although there are no custom implementations right now)
  • #70 Improve types

Maybe:

  • Provide a simple but robust data store with the library
  • Rename API: Some parts of API are rather unfortunate
    • concurrency should be concurrencyType
    • maxConcurrency maybe maxWorkers?
  • Provide queue function to the task function for a more functional syntax (so that you don't need to access cluster from inside the task)

Not planned (for now):

  • #8 (comment) Mixed concurrency models
    • Reason: It does not work well together with the idea of having a sandbox (which part of the browser/page/context stuff should be sandboxed then)

An in-range update of ts-jest is breaking the build 🚨

The devDependency ts-jest was updated from 23.1.4 to 23.10.0.

🚨 View failing branch.

This version is covered by your current version range and after updating it in your project the build failed.

ts-jest is a devDependency of this project. It might not break your production code or affect downstream projects, but probably breaks your build or test tools, which may prevent deploying or publishing.

Status Details
  • continuous-integration/travis-ci/push: The Travis CI build failed (Details).
  • coverage/coveralls: First build on greenkeeper/ts-jest-23.10.0 at 0.0% (Details).

Release Notes for 23.10.0

ts-jest, reloaded!

  • lots of new features including full type-checking and internal cache (see changelog)
  • improved performances
  • Babel not required anymore
  • improved (and growing) documentation
  • a ts-jest Slack community where you can find some instant help
  • end-to-end isolated testing over multiple jest, typescript and babel versions
Commits

The new version differs by 293 commits.

  • 0e5ffed chore(release): 23.10.0
  • 3665609 Merge pull request #734 from huafu/appveyor-optimizations
  • 45d44d1 Merge branch 'master' into appveyor-optimizations
  • 76e2fe5 ci(appveyor): cache npm versions as well
  • 191c464 ci(appveyor): try to improve appveyor's config
  • 0f31b42 Merge pull request #733 from huafu/fix-test-snap
  • 661853a Merge branch 'master' into fix-test-snap
  • aa7458a Merge pull request #731 from kulshekhar/dependabot/npm_and_yarn/tslint-plugin-prettier-2.0.0
  • 70775f1 ci(lint): run lint scripts in series instead of parallel
  • a18e919 style(fix): exclude package.json from tslint rules
  • 011b580 test(config): stop using snapshots for pkg versions
  • 7e5a3a1 build(deps-dev): bump tslint-plugin-prettier from 1.3.0 to 2.0.0
  • fbe90a9 Merge pull request #730 from kulshekhar/dependabot/npm_and_yarn/@types/node-10.10.1
  • a88456e build(deps-dev): bump @types/node from 10.9.4 to 10.10.1
  • 54fd239 Merge pull request #729 from kulshekhar/dependabot/npm_and_yarn/prettier-1.14.3

There are 250 commits in total.

See the full diff

FAQ and help

There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.


Your Greenkeeper Bot 🌴

Minor type checking improvement for cluster.queue method

First of all, thank you for your work!
I have a minor suggestion on improving type checking for the cluster.queue() method.
Now we have this:

public async queue(
        data: JobData | TaskFunction,
        taskFunction?: TaskFunction,
    ): Promise<void> {
...
}

As one can inspect, JobData is of type any and it is used both as a first argument to cluster.queue() method as well as the data property of the TaskFunctionArguments interface. This approach does not provide sufficient type checking when we call the cluster.queue() method with two arguments. I'd suggest to use generic types here like this:

type QueueFunction<T> = (arg: QueueFunctionArguments<T>) => Promise<void>;

interface QueueFunctionArguments<T> {
  page: puppeteer.Page;
  data: T;
  worker: {
    id: number;
  };
}

public async queue<T>(
    data: T | TaskFunction,
    taskFunction?: QueueFunction<T>,
): Promise<void> {
...
}

Should queued task take care about closing the page?

My use case is the following: create a cluster with Cluster.CONCURRENCY_BROWSER and never close it.

const { connect } = require('amqplib');
const { Cluster } = require('puppeteer-cluster');
const { crawler, puppeteerOptions, redis } = require('./docroot');
const { Resource } = require('./docroot/Component');

(async ({ RABBITMQ_USER, RABBITMQ_PASS, RABBITMQ_HOST, RABBITMQ_PORT, RABBITMQ_QUEUE, RABBITMQ_THREADS, REDIS_LIST }) => {
  const cluster = await Cluster.launch({
    monitor: true,
    concurrency: Cluster.CONCURRENCY_BROWSER,
    maxConcurrency: Number(RABBITMQ_THREADS),
    puppeteerOptions,
  });

  const channel = await (await connect(`amqp://${RABBITMQ_USER}:${RABBITMQ_PASS}@${RABBITMQ_HOST}:${RABBITMQ_PORT}`)).createChannel();

  channel.assertQueue(RABBITMQ_QUEUE, {
    durable: false,
  });

  await cluster.task(async ({ data, page }) => {
    const { resource, message } = data;
    const metadata = await crawler.crawl(resource, page);

    await redis.rpush(REDIS_LIST, JSON.stringify(metadata));

    channel.ack(message);
  });

  channel.consume(RABBITMQ_QUEUE, message => {
    const content = JSON.parse(message.content.toString('utf8'));
    const resource = new Resource(content.resource);

    if (Array.isArray(content.links_to_check_for)) {
      resource.setLinks(content.links_to_check_for);
    }

    cluster.queue({ resource, message });
  });
})(process.env);

As you can see above, the cluster's queue gets filled once RabbitMQ sends something. This means the process is kind of a daemon and shouldn't be stopped. I'm worried about whether the pages the cluster creates should be closed (await page.close() after const metadata = await crawler.crawl(resource, page);) once they are not needed anymore, or is this done automatically?

Program hang when maxConcurrency is set over 50

First of all, great project!
I tried the example with maxConcurrency set to 50 and 100. What I noticed is that when set to 100, the program will hang somewhere pretty much every time. When set to 50, it will hang sometimes. Not sure what caused this issue. Thanks for any input.

Inter Process Communication

Hello there.
I have multiple instances of puppeteer that scrape data from some sites. After scraping, each instance uses process.send() to output the data so that it can be saved to a database. I would love to know whether it's possible to listen to the data/messages sent by each instance so that they can be saved to the DB, the same way we have the cluster.on('taskerror') event handler, and how to implement it. Regards.

Add more events

Something like:

    cluster.on('monitor', (data) => {
        console.log(data);
    });

Usage in an HTTP environment.

Can I use this in micro / express / etc and be able to have an endpoint process a "screenshot" task and return a value when the task completes?

Is this a thing?

Long-term runs of puppeteer-cluster

I'm gonna document some puppeteer-cluster test runs, to see how the different concurrency types and options work together.

Feel free to add your own runs

An in-range update of debug is breaking the build 🚨

Version 3.2.0 of debug was just published.

Branch Build failing 🚨
Dependency debug
Current Version 3.1.0
Type dependency

This version is covered by your current version range and after updating it in your project the build failed.

debug is a direct dependency of this project, and it is very likely causing it to break. If other packages depend on yours, this update is probably also breaking those in turn.

Status Details
  • continuous-integration/travis-ci/push: The Travis CI build could not complete due to an error (Details).
  • coverage/coveralls: First build on greenkeeper/debug-3.2.0 at 74.478% (Details).

Release Notes 3.2.0

A long-awaited release to debug is available now: 3.2.0.

Due to the delay in release and the number of changes made (including bumping dependencies in order to mitigate vulnerabilities), it is highly recommended maintainers update to the latest package version and test thoroughly.


Minor Changes

Patches

Credits

Huge thanks to @DanielRuf, @EirikBirkeland, @KyleStay, @Qix-, @abenhamdine, @alexey-pelykh, @DiegoRBaquero, @febbraro, @kwolfy, and @TooTallNate for their help!

Commits

The new version differs by 25 commits.

There are 25 commits in total.

See the full diff

FAQ and help

There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.


Your Greenkeeper Bot 🌴

why `await`?

Cool project, but I am confused that why you use await in your example:

const { Cluster } = require('puppeteer-cluster');

(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 2,
  });
  
  // Is `await` necessary?
  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    const screen = await page.screenshot();
    // Store screenshot, do something else
  });

  await cluster.queue('http://www.google.com/');
  await cluster.queue('http://www.wikipedia.org/');
  // many more pages

  await cluster.idle();
  await cluster.close();
})();

When you define a task or add jobs to the queue, why use await? I tried removing them and everything still works.

Browser closes during debugging

Hello,

I have got few questions, not sure if should have created multiple issues.

Question: Using the code below (example code), when I am debugging, the browser window closes suddenly, not letting me finish stepping through my code. Am I missing a config option?

const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 2,
    monitor: true,
    retryLimit: 0,
    puppeteerOptions: {
      headless: false,
      devtools: true,
      defaultViewport: {
        width: 1920,
        height: 1080
      }
    }
  });

await cluster.queue('www.example.com', main);

const main = async ({ page, data: url }) => {
    await page.goto(url);
    const results = await page.evaluate(async () => {
    debugger;
      let title = document.title;
      return title;
    }).then((data) => {
      console.log(data);
    });
  };

thanks

Using in Jest context / Node.js version 6 support

I am interested in exploring using puppeteer cluster in a Jest test context.

I am not able to import or require it without getting an `Unexpected identifier` error on that line.

import Cluster from 'puppeteer-cluster';
// or
const { Cluster } = require('puppeteer-cluster');

Error:

static async launch(options) {
                     ^^^^^^
    SyntaxError: Unexpected identifier

Thanks...

Does puppeteer-cluster have "worker_index" to work with Task Function?

I'm looking for a worker index: simply the number of the worker currently calling the task function.

The code below always writes the screenshot to the same file, screen.png.

const puppeteer = require('puppeteer-core');

const {
    Cluster,
} = require('puppeteer-cluster');

(async () => {
    const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_CONTEXT,
        maxConcurrency: 2,
        puppeteer,
        puppeteerOptions: {
            executablePath: 'C:\\Users\\..\\AppData\\Local\\Google\\Chrome SxS\\Application\\chrome.exe',
        },
    });

    await cluster.task(async ({
        page,
        data: url,
    }) => {
        await page.goto(url);
        await page.screenshot({
            path: './screen.png',
        });
        // Store screenshot, do something else
    });

    await cluster.queue('http://www.google.com/');
    await cluster.queue('http://www.wikipedia.org/');
    // many more pages

    await cluster.idle();
    await cluster.close();
})();

I want something like:

await cluster.task(async ({
    page,
    data: url,
    wIndex,
}) => {
    await page.goto(url);
    await page.screenshot({
        path: `./screen_${wIndex}.png`,
    });
    // Store screenshot, do something else
});

where wIndex is the index of the current worker.

For this specific example, a simple solution is to derive the filename from the URL of the current job (https://github.com/thomasdondorf/puppeteer-cluster/blob/master/examples/minimal.js)

But what if every job works with the same URL?

P.S.: I also want to launch puppeteer with different launch options for each worker.

Is It possible to get browser version?

Hello,

Is it possible to get the version of the browser?
I do this:

const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT,
    maxConcurrency: 2,
    monitor: true,
    retryLimit: 0,
    timeout: 180000,
  });

// Display browser version
// console.log(cluster.browser.version()) ?

// `main` must be defined before it is queued (a `const` is not hoisted)
const main = async ({ page, data: url }) => {
    await page.goto(url);
    const title = await page.evaluate(() => {
      debugger;
      return document.title;
    });
    console.log(title);
  };

await cluster.queue('http://www.example.com', main);

Use "puppeteer-core" instead of "puppeteer"

Is it possible to use "puppeteer-core" instead of "puppeteer", so that I don't need to set the environment variable that skips the Chromium download? Right now I have to manually remove the Chromium package from my distribution.

Unable to return a variable from a queued function

Hi,

I am having a little trouble figuring out how to return a value from a queued function.

Following the function-queuing-complex.js example, I have tried using both return and resolve in extractTitle, since the README says cluster.queue returns a Promise. Both resulted in undefined. A Promise.all doesn't seem to work either. Is this a bug, or am I doing something wrong?

const extractTitle = async ({ page, data: url }) => {
    await page.goto(url);
    const pageTitle = await page.evaluate(() => document.title);

    // How do I return pageTitle to use outside this async function?
};

const task1 = await cluster.queue("https://reddit.com/", extractTitle);
const task2 = await cluster.queue("https://twitter.com/", extractTitle);
Promise.all([task1, task2]).then(result => console.log(result)); // logs undefined

Extensions are not loading on any concurrency model

I'm trying to run puppeteer in a cluster using this library, but with the following code I get no errors and the extension simply doesn't load. The same arguments work perfectly with puppeteer directly.

Anyone have an idea why this is happening?

    cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_PAGE,
        maxConcurrency: 2,
        monitor: false,
        puppeteerOptions: {
            headless: true, // note: headless Chromium does not load extensions
            args: [
                '--no-sandbox',  
                '--disable-gpu',
                '--enable-usermedia-screen-capturing',
                '--allow-http-screen-capture',
                '--auto-select-desktop-capture-source=ppc',
                '--load-extension=' + __dirname+'/chrome-plugin',
                '--disable-extensions-except=' + __dirname+'/chrome-plugin',
                '--disable-infobars',
                '--window-size=1920,1080',
            ],
        }
    });

Improve error documentation or maybe even think about catching "stupid" errors

Currently the library does not catch asynchronously thrown errors. That means code like this can lead to errors:

page.on('dialog', async dialog => {
  await dialog.dismiss();
});

The correct way right now is to put a try catch block around the code inside the function. This is a problem, as the library might still come to a stop when the code is badly written.

  • Option 1: Improve documentation regarding asynchronous errors.
  • Option 2: Use something like process.on('uncaughtException') and/or process.on('unhandledRejection') to handle all kinds of errors. This might interfere with bigger applications that already have this kind of handling built in.

Not sure which one is the way to go. Open to ideas and opinions.

Is it possible to create task queue dynamically

I want to create a Node server for scraping with puppeteer (pass a search term in a GET request to scrape Google search results).

Currently my server cannot process more than 5 parallel requests; beyond that it runs out of memory.

Change license to MIT

This software seems really interesting and useful.

Do you have any plans on changing your open source license from the GNU General Public License 3.0 to something else, such as Apache License 2.0, BSD or MIT?

I'm asking since many individuals and organisations cannot use GPL-licensed software. Thanks.

Crawler on demand instead a queue

Hi Guys,

I am trying to use Express to wrap a small REST API around puppeteer, but as far as I can see the only way to add a new URL is through the cluster queue. My concern is that if I make parallel requests I will receive the wrong answer, i.e. the content of another URL.

My question is: is it possible to run synchronous tasks?

Thanks, and sorry for my bad English.

about CONCURRENCY_PAGE

Here is my scenario: I have a number of URLs that I want to open in parallel, but when URLs share the same domain, each page needs to wait a few seconds before opening, to avoid being blocked by the target webmaster. How do I do that? The settings below do not seem to work:

concurrency: Cluster.CONCURRENCY_PAGE,
maxConcurrency: 10,
retryLimit: 5,              // retry up to 5 times on failure
retryDelay: 2000,           // 2 second delay between retries
sameDomainDelay: 30 * 1000, // delay between pages of the same domain (seems to have no effect)
skipDuplicateUrls: true,    // skip duplicate URLs
workerCreationDelay: 500,   // delay between opening tabs

Use TLD.js for sameDomainDelay

So far the domain extraction just takes the hostname from Node.js, which includes subdomains.

We should be using TLD.js so it works with normal top-level domains and also with multi-part ones like *.co.uk.

Same URL Concurrency

This departs from the traditional idea of "new browser per task" or "new page per task". It is more about keeping a cluster of pages open the entire time and periodically refreshing them.

Why would I want to do this, you ask?...

Let's say I have a page that has d3 charts and I want to turn all the charts into images (my actual product isn't d3 charts). If the charts update in real time and I want a screenshot every 5 minutes (assuming there are 100s of charts), opening a page / browser each time takes a while. If I just kept the tab open and kept screenshotting, then I'd have the screenshots a lot sooner.

Now for my more techy way: I'm exposing a function to the site I'm screenshotting, and that function retrieves arguments from puppeteer/chrome to render specific items on the page.

Pseudo-code

// browser
if (typeof window.getRenderOpts === 'function') {
    window.getRenderOpts().then((opts) => updateChart(opts));
}

// puppeteer
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com');

async function getPageAndLock(): Promise<Page> {
  // ... gets a page that's idle, or waits until one becomes idle ...
}

async function pageIsReady(page: Page): Promise<void> {
  // ... marks the page as idle again ...
}

... async (req, res) => {
    const page = await getPageAndLock();

    await page.evaluate(`render(${JSON.stringify({/* ... */})})`);

    const screenshot = await page.screenshot(/* ... */);

    pageIsReady(page);

    res.send(screenshot);
}

It's probably out of the scope of this library, but I'm not sure if anyone would be interested in this type of concurrency.

I did benchmarks of "new browser per task", "new page per task", and "same page per task", and keeping the page open and taking screenshots periodically is a lot faster. I can dig these benchmarks up if you want me to; this was when I was experimenting.

importing / requiring Cluster

Hi,

thanks for this awesome library :)

Unfortunately, I do not seem to get it to work, as none of the importing / requiring mechanisms seem to work:

const { Cluster } = require('puppeteer-cluster'); -> Cluster = undefined
import { Cluster } from 'puppeteer-cluster'; -> Cluster = undefined
import Cluster from 'puppeteer-cluster'; -> Cluster = {}

I'm on Node v8.11.4

What am I doing wrong?

Can you add an idle event

I have a requirement: a database table holds a lot of URLs that need to be visited one by one. To avoid using too much memory, I want to read them in batches while the program keeps running, hitting the database only every once in a while and then executing that batch. Could you add an idle event for this loop, so the database is only read when the cluster is idle? I wonder if this approach is feasible.

CONCURRENCY_PAGE with headless: false hangs up and breaks

const { Cluster } = require('../dist');

(async () => {
    // Create a cluster with 2 workers
    const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_PAGE,
        maxConcurrency: 2,
        puppeteerOptions: {headless: false}
    });

    // Define a task (in this case: screenshot of page)
    await cluster.task(async ({ page, data: url }) => {
        await page.goto(url);

        const path = url.replace(/[^a-zA-Z]/g, '_') + '.png';
        await page.screenshot({ path });
        console.log(`Screenshot of ${url} saved: ${path}`);
    });

    // Add some pages to queue
    await cluster.queue('https://www.google.com');
    await cluster.queue('https://www.wikipedia.org');
    await cluster.queue('https://github.com/');

    // Shutdown after everything is done
    await cluster.idle();
    await cluster.close();
})();

This only generated screenshots for Wikipedia and GitHub, and the browser also hung for some time.

Problem with headless: true

Hello there.
I'm testing puppeteer-cluster in a project I'm working on, and I have a problem in headless mode.
Because I cannot share the original code, I tried to reproduce the problem using the simple queuing-functions example. When headless is false it works like a charm; when set to true, nothing happens.
Am I missing something?

const { Cluster } = require('puppeteer-cluster');

(async () => {

    const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_CONTEXT,
        maxConcurrency: 3,
        puppeteerOptions: {
            headless: true
        },
        monitor: true
    });

    await cluster.queue(async ({ page }) => {
        await page.goto('http://www.wikipedia.org');
        await page.screenshot({path: 'wikipedia.png'});
    });

    await cluster.queue(async ({ page }) => {
        await page.goto('https://www.google.com/');
        const pageTitle = await page.evaluate(() => document.title);
        console.log('google');
    });

    await cluster.queue(async ({ page }) => {
        await page.goto('https://www.imdb.com/');
        console.log('IMDB');
    });
    await cluster.idle();
    await cluster.close();
})();

puppeteer v1.9.0,
puppeteer-cluster v0.11.2

I would appreciate your help. Thank you.
