coder-hxl / x-crawl
Flexible Node.js AI-assisted crawler library
Home Page: https://coder-hxl.github.io/x-crawl/
License: MIT License
When instantiating the crawler with
const myXCrawl = xCrawl();
the following error is thrown:
TypeError: (0 , x_crawl_1.default) is not a function
import xCrawl from 'x-crawl';
async crawl(path: string) {
// 2. Create a crawler instance
const myXCrawl = xCrawl();
myXCrawl.crawlPage(path).then(res => {
const { browser, page } = res.data;
console.log(page);
// Close the browser
browser.close();
});
}
ERROR 389 TypeError: (0 , x_crawl_1.default) is not a function
x-crawl: 8.3.1
Node.js: 16.19.1
Package manager: npm 9.6.1
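The usual cause of this error is CommonJS/ESM interop: without `"esModuleInterop": true` in tsconfig.json, the compiled `x_crawl_1.default` is `undefined` for a CommonJS export, so calling it throws exactly this TypeError. A minimal sketch of a defensive resolver (`resolveDefault` is a hypothetical helper, not part of x-crawl) illustrates the shape of the problem:

```javascript
// Without esModuleInterop, `import xCrawl from 'x-crawl'` compiles to a
// reference to `module.default`, which may be undefined for a CJS export.
// A defensive resolver picks whichever shape the module actually has:
function resolveDefault(mod) {
  // Prefer the module itself if it is already callable, else its default export.
  return typeof mod === 'function' ? mod : (mod.default ?? mod);
}

// Simulated module shapes:
const cjsShape = function xCrawl() { return 'instance'; }; // module.exports = fn
const esmShape = { default: function xCrawl() { return 'instance'; } };

console.log(resolveDefault(cjsShape)()); // instance
console.log(resolveDefault(esmShape)()); // instance
```

In practice, enabling `esModuleInterop` (and `allowSyntheticDefaultImports`) in tsconfig.json is the cleaner fix than a runtime resolver.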
The proxy option of crawlPage has no effect
Environment: Windows 10, Node 16, npm v8.19
I use Clash as a proxy. With the system proxy turned off, crawlData works fine, but the proxy configuration of crawlPage has no effect; requests only go through the system proxy.
The code is as follows:
import xCrawl from 'x-crawl'
// Application instance configuration
const testXCrawl = xCrawl({
proxy: {
urls: [
'https://127.0.0.1:7890'
],
switchByErrorCount: 1,
switchByHttpStatus: [401, 403]
}
})
// Advanced configuration
testXCrawl
.crawlPage({
targets: [
'https://www.google.com',
{
url: 'https://www.google.com',
proxy: { urls: ['http://127.0.0.1:7890'] }
}
],
maxRetry: 3,
proxy: {
urls: [
'http://127.0.0.1:7890',
'http://127.0.0.1:7890'
],
switchByErrorCount: 1,
switchByHttpStatus: [401, 403]
}
})
.then((res) => {})
Suspecting that switchByHttpStatus might be a factor, I modified line 270 of x-crawl's index.mjs and added the following:
if (status === null) { result = true; }
No error is thrown.
x-crawl: 7.0.1
Node.js: 16.18
Package manager: npm 8.19
When requesting data with crawlData and passing request parameters via the params field of CrawlDataDetailTargetConfig, the actual request does not carry the parameters.
Looking at the source, the params field is processed into search and then passed to the https.request() method to make the request.
According to the Node.js docs (HTTP #httprequestoptions-callback | Node.js v18.16.0 Documentation), http.request() expects request parameters to be appended to path.
const pageSize = 30;
const sort = 'time';
const rating = 'all';
const filter = 'all';
const mode = 'list';
const url = `https://${type}.${this.BASE_URL}/people/${uid}/${status}`;
const request: CrawlDataDetailTargetConfig = {
url,
method: "GET",
params: {
start: (page - 1) * pageSize,
sort,
rating,
filter,
mode,
},
data: {},
timeout: 30000,
maxRetry: 0
};
const crawler = Crawl({
enableRandomFingerprint: true
});
const res = await crawler.crawlData(request);
No error is thrown.
x-crawl: 7.0.0
Node.js: 18.15.0
Package manager: npm 8.3.1
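Until the library appends params to the request path itself, a workaround for the issue above is to build the query string into the URL manually. The sketch below mirrors the parameter names from the repro; the base URL is a placeholder:

```javascript
// Workaround sketch: append the query string to the URL manually so the
// request path itself carries the parameters. baseUrl is a placeholder.
const baseUrl = 'https://example.com/people/123/wish';
const params = { start: 0, sort: 'time', rating: 'all', filter: 'all', mode: 'list' };
const search = new URLSearchParams(
  Object.entries(params).map(([k, v]) => [k, String(v)])
).toString();
const urlWithParams = `${baseUrl}?${search}`;
console.log(urlWithParams);
// https://example.com/people/123/wish?start=0&sort=time&rating=all&filter=all&mode=list
```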
For example, I want to crawl a YouTube live stream, but in the saved screenshot there is a button that cannot be clicked. Is there an operation for clicking?
Expected: a click function.
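crawlPage exposes the underlying Puppeteer page object in its result, so standard Puppeteer interactions (click, type, scroll) can run before the screenshot is taken. The sketch below uses a hypothetical helper, `clickAndShoot`; the selector is a placeholder you would find by inspecting the live page:

```javascript
// Sketch: drive the Puppeteer page returned by crawlPage before capturing.
async function clickAndShoot(page, selector, path) {
  await page.waitForSelector(selector); // wait until the button exists
  await page.click(selector);           // perform the click
  return page.screenshot({ path });     // capture the page after the interaction
}

// Usage with x-crawl (selector and URL are placeholders):
// const { browser, page } = (await myXCrawl.crawlPage(liveUrl)).data;
// await clickAndShoot(page, '.some-play-button', './live.png');
// await browser.close();
```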
When the data parameter I pass in is a String, the isDataUndefine check wraps data in another layer of quotes, so the request parameters no longer match what the API expects.
const post = {
url: "https://glxy.mot.gov.cn/company/getCompanyAptitude.do",
method: "post",
headers: {
"Content-Type": "application/x-www-form-urlencoded",
},
data: "page=1",
};
let res = await myXCrawl.crawlData(post);
No error is thrown.
x-crawl: 7.1.1
Node.js: 18.16.0
Package manager: npm 9.5.1
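A sketch of the root cause described above: serializing an already-encoded form string with JSON.stringify wraps it in an extra pair of quotes, so the server receives `"page=1"` (quotes included) instead of `page=1`. The `serializeBody` guard below is hypothetical, only illustrating the fix of stringifying non-string bodies only:

```javascript
// An already-encoded form string gains an extra quote layer when stringified:
console.log(JSON.stringify('page=1')); // "page=1"  (quotes included)

// A guard that only stringifies non-string bodies (hypothetical helper):
function serializeBody(data) {
  return typeof data === 'string' ? data : JSON.stringify(data);
}
console.log(serializeBody('page=1'));    // page=1
console.log(serializeBody({ page: 1 })); // {"page":1}
```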
Batch-downloading data from an API is a very common need. To make it easy to set per-download parameters, I hope every configuration option of crawlFile can accept an array: when a string constant is passed, it applies uniformly to all downloads; when an array is passed, the value is taken by index.
I suggest considering this form in the design of other APIs as well, for more flexible use.
const CODES = ['NTES', 'BIDU', 'JD'];
await myXCrawl.crawlFile({
url: CODES.map((d) => `https://query1.finance.yahoo.com/v7/finance/download/${d}`),
storeDir: './public/data/financial', // a constant means all downloads are saved to this directory, but an array should also be supported
fileName: CODES, // array support
extension: '.csv', // every file's extension is .csv, but an array should also be supported
});
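The proposed scalar-or-array semantics can be sketched with a small per-index resolver (`pick` is a hypothetical helper, not part of x-crawl): a scalar option applies to every download, an array option is read by index.

```javascript
// Resolve an option per download target: scalar applies to all, array by index.
function pick(option, index) {
  return Array.isArray(option) ? option[index] : option;
}

const CODES = ['NTES', 'BIDU', 'JD'];
const plans = CODES.map((code, i) => ({
  url: `https://query1.finance.yahoo.com/v7/finance/download/${code}`,
  storeDir: pick('./public/data/financial', i), // scalar: shared directory
  fileName: pick(CODES, i),                     // array: per-index file name
  extension: pick('.csv', i),                   // scalar: shared extension
}));
console.log(plans[1].fileName); // BIDU
```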
https://github.com/dext7r/puppeteer/blob/master/src/utils/bilibili.ts#L45
On Windows this directory can be created and written to directly.
But on Linux it fails with the following error:
Run pnpm task
> @dext7r/[email protected] task /home/runner/work/puppeteer/puppeteer
> npx ts-node ./src/serve
Start crawling - name: page, mode: async, total: 1
Id: 1 - Crawl does not need to sleep, send immediately
Crawl the final result:
Success - total: 1, ids: [ 1 ]
Error - total: 0, ids: [ ]
Start crawling - name: file, mode: async, total: 9
Id: 1 - Crawl does not need to sleep, send immediately
Id: 2 - Crawl does not need to sleep, send immediately
Id: 3 - Crawl does not need to sleep, send immediately
Id: 4 - Crawl does not need to sleep, send immediately
Id: 5 - Crawl does not need to sleep, send immediately
Id: 6 - Crawl does not need to sleep, send immediately
Id: 7 - Crawl does not need to sleep, send immediately
Id: 8 - Crawl does not need to sleep, send immediately
Id: 9 - Crawl does not need to sleep, send immediately
Error: ENOENT: no such file or directory, mkdir
at Object.mkdirSync (node:fs:1396:3)
at /home/runner/work/puppeteer/puppeteer/node_modules/.pnpm/[email protected][email protected]/node_modules/x-crawl/dist/index.js:46:10
at Array.reduce (<anonymous>)
at mkdirDirSync (/home/runner/work/puppeteer/puppeteer/node_modules/.pnpm/[email protected][email protected]/node_modules/x-crawl/dist/index.js:43:12)
at fileSingleResultHandle (/home/runner/work/puppeteer/puppeteer/node_modules/.pnpm/[email protected][email protected]/node_modules/x-crawl/dist/index.js:951:7)
at /home/runner/work/puppeteer/puppeteer/node_modules/.pnpm/[email protected][email protected]/node_modules/x-crawl/dist/index.js:135:21
at processTicksAndRejections (node:internal/process/task_queues:95:5)
at async Promise.all (index 3) {
errno: -2,
syscall: 'mkdir',
code: 'ENOENT'
}
Error: The operation was canceled.
import xCrawl from 'x-crawl';
import path from 'path';
import fs from 'fs/promises';
const bilibiliXCrawl = xCrawl({ mode: 'async' });
export function getCurrentDate(separator: string = '-') {
// Call the crawlFile API to crawl the images
await bilibiliXCrawl.crawlFile({
targets: [...urls],
// storeDir: `./bilibili/${getCurrentDate('/')}`,
storeDir,
});
// Close the page
page.close();
// Close the browser
browser.close();
}
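One workaround sketch for the ENOENT above, assuming the failure happens while the library creates the nested store directory: create the directory up front with a recursive mkdir, so that no intermediate path segment can be missing at download time (the date segments below are example values):

```javascript
import path from 'node:path';
import fs from 'node:fs/promises';

// Pre-create the nested store directory; recursive mkdir handles any missing
// intermediate segments regardless of OS path conventions.
const storeDir = path.join('bilibili', '2023', '04', '25'); // example date path
await fs.mkdir(storeDir, { recursive: true });
```

crawlFile can then be pointed at the already-existing `storeDir`.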
Error: ENOENT: no such file or directory, mkdir
x-crawl: ^6.0.1
Node.js: 18
Package manager: pnpm 7.13.2
On macOS, the following error occurs:
node_modules/.pnpm/[email protected]/node_modules/puppeteer: Running postinstall script, failed in 5.2s
.../node_modules/puppeteer postinstall$ node install.js
│ ERROR: Failed to set up Chromium r1108766! Set "PUPPETEER_SKIP_DOWNLOAD" env variable to skip download.
│ Error: read ECONNRESET
│ at TLSWrap.onStreamRead (node:internal/stream_base_commons:217:20) {
│ errno: -54,
│ code: 'ECONNRESET',
│ syscall: 'read'
│ }
└─ Failed in 5.2s
pnpm i xxx
x-crawl: latest
Node.js: v18.6.0
Package manager: pnpm 7.14.2
When I use crawlData to crawl an API and set the Content-Type header to application/x-www-form-urlencoded, it gets replaced with application/json.
const targets = {
url: "https://glxy.mot.gov.cn/company/getCompanyAptitude.do",
method: "POST",
headers: {
"Content-Type": "application/x-www-form-urlencoded;charset=UTF-8",
},
data: Qs.stringify({ type: 2 }),
};
const res = await myXCrawl.crawlData(targets);
No error is thrown.
x-crawl: 7.1.0
Node.js: 18.16.0
Package manager: npm 9.5.1
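The expected merge behavior can be sketched in one line: user-supplied headers should override the library's defaults, which a spread with defaults first and user headers last achieves. This is an illustration of the desired semantics, not x-crawl's current code:

```javascript
// User headers spread last win over library defaults.
const defaultHeaders = { 'Content-Type': 'application/json' };
const userHeaders = { 'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8' };
const merged = { ...defaultHeaders, ...userHeaders };
console.log(merged['Content-Type']); // application/x-www-form-urlencoded;charset=UTF-8
```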
npm i x-crawl
ERROR: Failed to set up Chrome r113.0.5672.63! Set "PUPPETEER_SKIP_DOWNLOAD" env variable to skip download. npm ERR! Error: Download failed: server returned code 404. URL: https://npm.taobao.org/mirrors/113.0.5672.63/linux64/chrome-linux64.zip
x-crawl: 7.0.1
Node.js: 16.18.1
Package manager: npm 8.19.2
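A workaround sketch for the 404 above, using Puppeteer's documented environment variables: skip the bundled browser download (the mirror no longer hosts that build) and point Puppeteer at an existing Chrome install instead. The binary path below is a placeholder for your actual installation:

```shell
# Skip Puppeteer's Chromium download and reuse a locally installed Chrome.
export PUPPETEER_SKIP_DOWNLOAD=true
export PUPPETEER_EXECUTABLE_PATH=/usr/bin/google-chrome  # placeholder path
# npm i x-crawl   # then install without the browser download step
```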
Expected: the call succeeds in the browser.
Actual: the API call fails because the URL cannot be recognized.
import { createCrawl } from 'x-crawl'
const crawlApp = createCrawl({ intervalTime: { max: 3000, min: 1000 } })
let aar =["https://suno-list.com/api/suno/validateLink?link=https%3A%2F%2Fsuno.com%2Fsong%2Fe8b369f2-e2fc-4152-b2aa-d29224d43041"]
crawlApp.crawlData({ aar }).then((twoRes) => {
// Process the API links
console.log(twoRes);
for(let one of twoRes){
// The error occurs here
}
console.log(downloadLinks);
});
TypeError: Invalid URL
x-crawl: 10.0.1
Node.js: v20.12.1
Package manager: pnpm 8.3.1
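A likely cause in the repro above: `crawlData({ aar })` passes an object whose only key is `aar`, so no recognizable URL reaches the crawler, which surfaces deeper in the library as "Invalid URL". The advanced config expects a `targets` array; `normalizeTargets` below is a hypothetical guard illustrating the shape:

```javascript
// Guard illustrating the expected advanced-config shape.
function normalizeTargets(config) {
  if (!Array.isArray(config.targets)) {
    throw new TypeError('crawlData advanced config requires a `targets` array');
  }
  return config;
}

const aar = ['https://suno-list.com/api/suno/validateLink?link=https%3A%2F%2Fsuno.com%2Fsong%2Fe8b369f2-e2fc-4152-b2aa-d29224d43041'];
console.log(normalizeTargets({ targets: aar }).targets.length); // 1
// crawlApp.crawlData({ targets: aar }) is the intended call, not { aar }.
```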
Having read through the docs, I don't see any option for controlling the log output; when x-crawl is used together with other tools, the console output becomes messy.
If logging could be configured manually, and a context were also provided, similar to https://listr2.kilic.dev/task/output.html, that would be even better.
I'm not very expert on this.
Hi author, thanks for providing this library; it's very handy! However, I feel the type declarations contain some redundancy and inflexibility that may limit and inconvenience users when calling the functions. Taking crawlFile as an example:
1. url is clearly a required parameter; making it a separate argument would be more intuitive and flexible.
2. Merge CrawlFileDetailTargetConfig and CrawlFileAdvancedConfig into a single, simpler CrawlFileConfig type to reduce duplicate definitions.
3. Support passing a single URL or multiple URLs, and allow a separate store directory, file name, and extension for each URL.
I've written out TypeScript types below as a reference, to express my suggestion clearly; I hope this library becomes more and more popular.
One more small suggestion: crawlFile could be renamed fetchFile, since the word "fetch" denotes the common operation of retrieving data over the network, and combined with "file" it expresses the purpose clearly.
export function crawlFile(
url: string | string[],
config?: CrawlFileConfig,
callback?: (result: CrawlFileSingleResult | CrawlFileSingleResult[]) => void
): Promise<CrawlFileSingleResult | CrawlFileSingleResult[]>
export interface CrawlFileConfig extends CrawlCommonConfig {
outputDir?: string | string[] | null
extension?: string | string[] | null
fileName?: string | string[] | null
intervalTime?: IntervalTime
fingerprints?: DetailTargetFingerprintCommon[]
headers?: AnyObject
onCrawlItemComplete?: (result: CrawlFileSingleResult) => void
onBeforeSaveItemFile?: (info: {
id: number
fileName: string
filePath: string
data: Buffer
}) => Promise<Buffer>
}
Thanks for providing such a useful tool; I'm still learning it. In practice, some sites do not provide a fixed file download address; instead a page redirects (via a 302 or a returned piece of JS) until it reaches the actual download address, which is also temporary (so it cannot be captured in advance). How should this situation be handled? Thanks!
Expected: data is returned normally.
A feature that works fine on Windows gets no result on Linux, with no error reported; according to the logs the crawl completes. I serve the API with fastify on Debian 11.
fastify.post('/api/screenshoot', async function handler(request, reply) {
const { url } = request.body
if (!url) {
return reply.send({ code: 0, msg: "url is required" })
}
try {
const buffer = await screenshoot({ url })
// When sending non-standard data, always be sure to specify the response content type
reply.type("image/jpeg")
return buffer
} catch (error) {
}
})
// screenshoot model
const res = await x.crawlPage({
url,
maxRetry: 10,
viewport: { width: 1920, height: 1080 },
// Set fingerprints uniformly for this target
fingerprints: [
// Device fingerprint 1
{
maxWidth: 1024,
maxHeight: 800,
platform: 'Windows',
mobile: 'random',
userAgent: {
value:
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36',
versions: [
{
name: 'Chrome',
// Browser version
maxMajorVersion: 112,
minMajorVersion: 100,
maxMinorVersion: 20,
maxPatchVersion: 5000
},
{
name: 'Safari',
maxMajorVersion: 537,
minMajorVersion: 500,
maxMinorVersion: 36,
maxPatchVersion: 5000
}
]
}
},
// Device fingerprint 2
{
platform: 'Windows',
mobile: 'random',
userAgent: {
value:
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edg/91.0.864.59',
versions: [
{
name: 'Chrome',
maxMajorVersion: 91,
minMajorVersion: 88,
maxMinorVersion: 10,
maxPatchVersion: 5615
},
{ name: 'Safari', maxMinorVersion: 36, maxPatchVersion: 2333 },
{ name: 'Edg', maxMinorVersion: 10, maxPatchVersion: 864 }
]
}
},
// Device fingerprint 3
{
platform: 'Windows',
mobile: 'random',
userAgent: {
value:
'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0',
versions: [
{
name: 'Firefox',
maxMajorVersion: 47,
minMajorVersion: 43,
maxMinorVersion: 10,
maxPatchVersion: 5000
}
]
}
}
]
})
const { browser, page } = res.data
// Get a screenshot of the rendered page
const buffer = await page.screenshot({ path: `../uploads/${host}_${Date.now()}.png` })
console.log('Screen capture is complete')
if (buffer) {
page.close()
}
// close browser
// browser.close()
return buffer
No error is thrown and no result is returned.
x-crawl: latest
Node.js: 20.9.0
Package manager: pnpm latest
This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.
These updates have all been created already. Click a checkbox below to force a retry/rebase of any.
@babel/core, @babel/preset-env
@vitest/coverage-v8, @vitest/ui, vitest
.github/workflows/dependency-review.yml
actions/checkout v4
actions/dependency-review-action v4
.github/workflows/deploy.yml
actions/checkout v4
pnpm/action-setup v3
actions/setup-node v4
actions/configure-pages v5
actions/upload-pages-artifact v3
actions/deploy-pages v4
.github/workflows/greetings.yml
actions/first-interaction v1
docs/package.json
vitepress ^1.0.2
package.json
chalk 5.3.0
https-proxy-agent ^7.0.4
openai ^4.33.0
ora ^8.0.1
puppeteer 22.5.0
@babel/core ^7.24.0
@babel/preset-env ^7.24.0
@rollup/plugin-babel ^6.0.4
@rollup/plugin-run ^3.0.2
@rollup/plugin-terser ^0.4.4
@types/node ^20.12.1
@typescript-eslint/eslint-plugin ^7.9.0
@typescript-eslint/parser ^7.9.0
@vitest/coverage-v8 ^1.4.0
@vitest/ui ^1.4.0
eslint ^9.0.0
prettier ^3.2.5
rollup ^4.13.0
rollup-plugin-typescript2 ^0.36.0
typescript 5.4.4
vitest ^1.4.0
node >=18.0.0
publish/package.json
chalk 5.3.0
https-proxy-agent ^7.0.4
openai ^4.33.0
ora ^8.0.1
puppeteer 22.5.0
node >=18.0.0
Requesting data with crawlData, using the following request config:
{
url: 'https://api.github.com/repos/xxx/xxx/releases',
method: 'GET',
headers: {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36',
'X-GitHub-Api-Version': '2022-11-28',
Accept: 'application/json',
Authorization: 'Bearer tokenxxx'
},
params: { per_page: 1, page: 1 },
data: {},
timeout: 60000,
maxRetry: 0,
proxy: ''
}
The error returned:
{
id: 1,
isSuccess: false,
maxRetry: 0,
retryCount: 0,
proxyDetails: [],
crawlErrorQueue: [
TypeError: The "chunk" argument must be of type string or an instance of Buffer or Uint8Array. Received an instance of Object
... {
code: 'ERR_INVALID_ARG_TYPE'
}
],
data: null
}
Looking at the source, the cause appears to be at:
Line 114 in 1d18de5
const crawler = Crawl();
const options = {
'url': 'https://api.github.com/repos/xxx/xxx/releases',
'method': 'GET',
'headers': {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36',
'X-GitHub-Api-Version': '2022-11-28',
'Accept': 'application/json',
'Authorization': 'Bearer tokenxxx'
},
'params': {
'per_page': 1,
'page': 1
},
'data': {},
'timeout': 30000,
'maxRetry': 0
};
const res = await crawler.crawlData(options);
TypeError: The "chunk" argument must be of type string or an instance of Buffer or Uint8Array. Received an instance of Object
x-crawl: 7.1.2
Node.js: v18.6.0
Package manager: pnpm 8.6.5
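A sketch of the root cause above: `http.ClientRequest#write` accepts only a string, Buffer, or Uint8Array, so passing `data: {}` hands a plain object to `write` and triggers this exact TypeError. The `writeBody` guard below is hypothetical, illustrating the needed serialization (and that a GET with no body should simply omit `data`):

```javascript
// Serialize plain-object bodies before writing to the request stream.
function writeBody(req, data) {
  if (data === undefined || data === null) return; // GET: nothing to write
  const chunk =
    typeof data === 'string' || Buffer.isBuffer(data) || data instanceof Uint8Array
      ? data
      : JSON.stringify(data); // plain objects such as {} become valid chunks
  req.write(chunk);
}
```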
As the title says: during development and debugging it would be convenient to be able to open the browser; otherwise one can only guess blindly.
To avoid crawling a page repeatedly, consider the following example:
const ress = await myXCrawl.crawlPage({
targets: ['https://docs.aave.com/portal/'],
viewport: { width: 1920, height: 1080 },
intervalTime: { max: 2000, min: 1000 },
maxRetry: 3,
});
for (const res of ress) {
const { browser, page } = res.data;
console.log(page.url());
await page.waitForNetworkIdle();
const content = await page.content();
console.log(page.url());
}
https://docs.aave.com/portal/ redirects to https://docs.aave.com/hub/
page.url() returns the post-redirect link, so I cannot use the link to determine whether a page has already been crawled.
Expected: the result also provides the original link.
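Until the result carries the original link, a workaround sketch is to zip the original targets with the results, assuming crawlPage returns results in target order (`zipWithTargets` is a hypothetical helper):

```javascript
// Pair each result with the pre-redirect target it came from.
function zipWithTargets(targets, results) {
  return results.map((res, i) => ({ originalUrl: targets[i], res }));
}

// Usage sketch:
// const targets = ['https://docs.aave.com/portal/'];
// const ress = await myXCrawl.crawlPage({ targets, maxRetry: 3 });
// for (const { originalUrl, res } of zipWithTargets(targets, ress)) {
//   if (seen.has(originalUrl)) continue; // dedup on the pre-redirect link
//   seen.add(originalUrl);
// }
```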