Comments (8)
I guess you are calling cluster.queue with invalid URLs? If you want to use skipDuplicateUrls or sameDomainDelay, you need to either provide the URL as a string or put the URL into a url property. That means you have to call the queue function like this:
cluster.queue('http://yoururl.tld/path');
or like this:
cluster.queue({ url: 'http://yoururl.tld/path' });
from puppeteer-cluster.
My code:
db_result = await fetch60(); // fetch 60 URLs from the database
for (let db_row of db_result) {
    await cluster.queue(db_row['link_url']);
}
from puppeteer-cluster.
Is the actual URL given to cluster.queue with the protocol (http(s)://test.tld/...), or are you only giving a part of the URL (test.tld/...)? The latter will not work; the former should.
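One way to check for the protocol before queueing is sketched below. The helper name hasHttpProtocol and the sample rows are hypothetical and not part of puppeteer-cluster; the check relies only on the standard URL class, which throws for strings that lack a scheme.

```javascript
// Hypothetical helper (not part of puppeteer-cluster): returns true only for
// absolute URLs whose scheme is http or https.
function hasHttpProtocol(candidate) {
  try {
    const { protocol } = new URL(candidate);
    return protocol === 'http:' || protocol === 'https:';
  } catch (err) {
    // new URL() throws for strings without a scheme, such as 'test.tld/path'
    return false;
  }
}

// Made-up rows in the same shape as the database result (link_url field).
const sampleRows = [
  { link_url: 'https://test.tld/a' },
  { link_url: 'test.tld/b' },
];

// Only these URLs would be passed on to cluster.queue.
const queueable = sampleRows.map((row) => row.link_url).filter(hasHttpProtocol);
console.log(queueable);
```

Filtering like this only makes the problem visible earlier; URLs stored without a protocol would still need to be corrected at the source.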
from puppeteer-cluster.
With the current settings, what actually happens is that the 10 URLs are all opened at once. What I want is this: because these 10 URLs are under the same domain name, each one should be delayed. Ten tabs can be open at the same time, but the URLs must not be loaded at the same moment; there has to be a delay between them, otherwise the target site flags me as a robot. I tried adding delay code at the top of the task code, but the pages still open at the same time.
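The behaviour described above (tabs may stay open in parallel, but page loads for one domain are spaced out) can be sketched as a small per-domain scheduler. This is only an illustration of the idea, not how puppeteer-cluster implements sameDomainDelay; createDomainThrottle and reserve are hypothetical names.

```javascript
// Hypothetical sketch: hand out staggered start times per hostname.
// Each call reserves the next free slot for that hostname and returns
// how many milliseconds the caller should wait before loading the URL.
function createDomainThrottle(delayMs) {
  const nextAllowed = new Map(); // hostname -> earliest allowed start time (ms)
  return function reserve(url, now = Date.now()) {
    const host = new URL(url).hostname;
    const startAt = Math.max(now, nextAllowed.get(host) || 0);
    nextAllowed.set(host, startAt + delayMs);
    return startAt - now;
  };
}

// Example with a fixed clock (now = 0) so the spacing is easy to follow.
const reserve = createDomainThrottle(1000);
const waits = [
  reserve('https://a.tld/1', 0), // first request to a.tld starts immediately
  reserve('https://a.tld/2', 0), // second request to a.tld waits one delay
  reserve('https://b.tld/1', 0), // a different domain is not delayed
];
console.log(waits);
```

Inside a task function, the returned value could be awaited with a timer before calling page.goto. The slot has to be reserved synchronously, as it is here, so that tasks starting at the same moment do not all receive a wait of zero.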
from puppeteer-cluster.
I understand your scenario and the library supports it.
Please either answer my questions or provide your source code.
from puppeteer-cluster.
const { Cluster } = require('puppeteer-cluster');
// root directory where the rules are stored
const module_path = process.env.my_nodemodules;
// logging
const logger = require(`${module_path}/newLogger.js`);
// shared helper functions
const { delayAsync } = require(`${module_path}/my_common_func.js`);
// ---- database setup ----
const table_models = require('./table_model');
const db_models = require(`${module_path}/mongodb_model`);
const mongoose = require('mongoose');
const DB_URL = 'mongodb://localhost:27017/weburl';
const db = mongoose.createConnection(DB_URL, { useNewUrlParser: true });
const scm_list = new mongoose.Schema({ link_url: String, is_deal: Boolean });
const list_model = db.model('t_list', scm_list);
db.on('connected', function () { console.log('MongoDB connected: ' + DB_URL); });
db.on('error', function (err) { console.log('MongoDB connection failed: ' + err); });
db.on('disconnected', function () { console.log('MongoDB disconnected'); });
// variable parameters
let db_result = [];
main(); // entry point
// ---- function definitions ----
// Fetch the URLs waiting to be scraped; runs every 10 minutes, 60 rows per run
async function fetch60() {
    let promise_me = new Promise(function (resolve, reject) { // wrap the callback in a promise
        list_model
            .find({})
            .where('is_deal').equals(false)
            .limit(60)
            .select('link_url')
            .exec(function (err, data) {
                if (err) {
                    reject(`Failed to fetch pending URLs: ${err}`);
                } else {
                    resolve(data);
                }
            });
    });
    return promise_me;
}
async function main() {
    const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_PAGE,
        maxConcurrency: 5,
        retryLimit: 5, // retry failed jobs up to 5 times
        retryDelay: 2000, // wait 2 seconds between retries
        sameDomainDelay: 30 * 1000, // delay between requests to the same domain (30 s); does not seem to work
        skipDuplicateUrls: true, // skip duplicate URLs
        workerCreationDelay: 500, // delay between opening tabs
        puppeteerOptions: {
            headless: false,
            ignoreHTTPSErrors: true,
            slowMo: 250, // slow down Puppeteer operations
            defaultViewport: { width: 1440, height: 900 }
        }
    });
    cluster.on('taskerror', (err, data) => {
        console.log(`Scraping task failed ${data}: ${err.message}`);
    });
    db_result = await fetch60();
    for (let db_row of db_result) {
        await cluster.queue(db_row['link_url']); // example: https://www.lagou.com/jobs/xxxxx.html
    }
    cluster.task(getHtmlSorce);
    await cluster.idle();
    await cluster.close();
    await db.close();
}
async function getHtmlSorce({ page, data: url }) {
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    const contents = await page.content();
    // ---- store in the database ----
    console.log(contents);
}
from puppeteer-cluster.
@kanxue660 If you solved it, perhaps share the solution so others could learn? Haha
from puppeteer-cluster.
@Rainbowhat The author has fixed this problem.
from puppeteer-cluster.