
coder-hxl / x-crawl


Flexible Node.js AI-assisted crawler library

Home Page: https://coder-hxl.github.io/x-crawl/

License: MIT License

Languages: JavaScript 1.89%, TypeScript 98.11%
Topics: crawl, crawler, nodejs, typescript, spider, flexible, puppeteer, javascript, multifunction, chromium

x-crawl's Issues

TypeError when creating an instance: (0 , x_crawl_1.default) is not a function

Bug expectation

When instantiating the crawler:

const myXCrawl = xCrawl();

the following error is thrown:

TypeError: (0 , x_crawl_1.default) is not a function

Minimal reproducible example

import xCrawl from 'x-crawl';

  async crawl(path: string) {
    // 2. Create a crawler instance
    const myXCrawl = xCrawl();
    myXCrawl.crawlPage(path).then(res => {
      const { browser, page } = res.data;
      console.log(page);
      // Close the browser
      browser.close();
    });
  }
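An editing note, hedged: the (0, x_crawl_1.default) call is exactly what tsc emits for a default import compiled to CommonJS, so this failure looks like an ESM/CJS interop mismatch rather than a missing export. A minimal workaround sketch, assuming x-crawl resolves to an ESM build while this project compiles to CommonJS (whether import() survives transpilation depends on the tsconfig "module" setting):

// Workaround sketch - assumption: x-crawl resolves to an ESM build while
// this project compiles to CommonJS. Dynamic import() keeps the module
// loading in ESM land and avoids the compiled `x_crawl_1.default` call.
async function createCrawler() {
  const { default: xCrawl } = await import('x-crawl');
  return xCrawl();
}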

Error message

ERROR 389 TypeError: (0 , x_crawl_1.default) is not a function

x-crawl version

8.3.1

Node version

16.19.1

Package manager

npm

Package manager version

9.6.1

crawlPage setting proxy option does not work

Bug expectation

Setting the proxy option on crawlPage has no effect.
Environment: Windows 10, Node 16, npm v8.19.
I use Clash as a proxy. With the system proxy turned off, crawlData works fine, but crawlPage's proxy configuration has no effect and requests only go through the system proxy.

The code is as follows:

import xCrawl from 'x-crawl'

// Application instance configuration
const testXCrawl = xCrawl({
  proxy: {
    urls: [
      'https://127.0.0.1:7890'
    ],
    switchByErrorCount: 1,
    switchByHttpStatus: [401, 403]
  }
})

// Advanced configuration
testXCrawl
  .crawlPage({
    targets: [
      'https://www.google.com',
      {
        url: 'https://www.google.com',
        proxy: { urls: ['http://127.0.0.1:7890'] }
      }
    ],
    maxRetry: 3,
    proxy: {
      urls: [
        'http://127.0.0.1:7890',
        'http://127.0.0.1:7890'
      ],
      switchByErrorCount: 1,
      switchByHttpStatus: [401, 403]
    }
  })
  .then((res) => {})

Suspecting that switchByHttpStatus might be interfering,
I modified line 270 of x-crawl's index.mjs and added the following code:
if (status === null) { result = true; }
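As a sanity check independent of x-crawl, the proxy itself can be exercised directly with https-proxy-agent (which the repository lists among its dependencies); a sketch, assuming the Clash HTTP endpoint at 127.0.0.1:7890:

import https from 'node:https'
import { HttpsProxyAgent } from 'https-proxy-agent'

// Send one request through the proxy; if this fails, the problem is the
// proxy endpoint itself rather than crawlPage's proxy handling.
const agent = new HttpsProxyAgent('http://127.0.0.1:7890')
https.get('https://www.google.com', { agent }, (res) => {
  console.log('status through proxy:', res.statusCode)
})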

Error string

no error

x-crawl version

7.0.1

Node version

16.18

Package manager

npm

Package manager version

8.19

crawlData does not pass request parameters correctly

Bug expectation

When requesting data with crawlData and passing query parameters via the params field of CrawlDataDetailTargetConfig, the actual request does not carry the parameters.

Looking at the source, the params are processed into a search string here, and then passed here to the https.request() method.

According to the Node.js documentation for http.request() (HTTP #httprequestoptions-callback | Node.js v18.16.0 Documentation), query parameters should instead be appended to the path option.
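To make the expected behavior concrete, a minimal sketch in plain Node (not x-crawl source) of how query parameters have to reach https.request(): there is no params or search option, so the caller must serialize them into path. The URL here is hypothetical:

import https from 'node:https'

const params = new URLSearchParams({ start: '0', sort: 'time', mode: 'list' })
const url = new URL('https://example.com/people/1/wish')

https
  .request(
    {
      hostname: url.hostname,
      // The query string must be appended to `path` by hand
      path: `${url.pathname}?${params.toString()}`,
      method: 'GET'
    },
    (res) => res.resume()
  )
  .end()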

Minimal reproducible example

const pageSize = 30;
const sort = 'time';
const rating = 'all';
const filter = 'all';
const mode = 'list';
const url = `https://${type}.${this.BASE_URL}/people/${uid}/${status}`;
const request: CrawlDataDetailTargetConfig = {
  url,
  method: "GET",
  params: {
    start: (page - 1) * pageSize,
    sort,
    rating,
    filter,
    mode,
  },
  data: {},
  timeout: 30000,
  maxRetry: 0
};

const crawler = Crawl({
  enableRandomFingerprint: true
});
const res = await crawler.crawlData(request);

Error message

No error

x-crawl version

7.0.0

Node version

18.15.0

Package manager

npm

Package manager version

8.3.1

How do I trigger click events?

What problem does this feature solve?

For example, I want to crawl a YouTube live stream, but the saved screenshot shows a button that I have no way to click. Is there a click operation available?

What would the proposed API look like?

A click function
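Other reports in this thread show that crawlPage exposes the underlying puppeteer browser and page through res.data, so clicking should already be achievable with puppeteer's own Page.click before taking the screenshot. A hedged sketch; the URL and selector are hypothetical:

import xCrawl from 'x-crawl'

const myXCrawl = xCrawl()

myXCrawl.crawlPage('https://www.youtube.com/watch?v=xxxx').then(async (res) => {
  const { browser, page } = res.data
  await page.click('#play-button') // puppeteer dispatches a real mouse click
  await page.screenshot({ path: './live.png' })
  await browser.close()
})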

crawlData: the data parameter passed as a string

Bug expectation

When the data I pass in is a string, the isDataUndefine check wraps data in another layer of quotes, so the request body no longer matches what the API expects.
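A one-line illustration of the suspected cause, assuming the extra quotes come from JSON-stringifying a body that is already a string:

JSON.stringify('page=1') // => '"page=1"' - a form body becomes JSON text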

Minimal reproducible example

const post = {
  url: "https://glxy.mot.gov.cn/company/getCompanyAptitude.do",
  method: "post",
  headers: {
    "Content-Type": "application/x-www-form-urlencoded",
  },
  data: "page=1",
};
let res = await myXCrawl.crawlData(post);

Error message

No error

x-crawl version

7.1.1

Node version

18.16.0

Package manager

npm

Package manager version

9.5.1

Suggestion: let crawlFile option parameters accept a string or an array

What problem does this feature solve?

Batch-downloading data from an API is a very common need. To make it easy to set per-download options, it would help if every crawlFile configuration parameter could also accept an array. When a string constant is passed, it applies uniformly to all downloads; when an array is passed, the value for each download is taken by index.

I suggest considering this pattern in the design of the other APIs as well, for more flexible usage.

What would the proposed API look like?

const CODES = ['NTES', 'BIDU', 'JD'];
await myXCrawl.crawlFile({
  url: CODES.map((d) => `https://query1.finance.yahoo.com/v7/finance/download/${d}`),
  storeDir: './public/data/financial', // a constant: every download is saved here, but an array should also be allowed
  fileName: CODES,  // array supported
  extension: '.csv',  // every file gets the .csv extension, but an array should also be allowed
});

xCrawl.crawlFile is not fully compatible with Linux

Bug expectation

https://github.com/dext7r/puppeteer/blob/master/src/utils/bilibili.ts#L45

(screenshot)

On Windows, this directory can be created and written to directly.
On Linux, however, it fails as follows:

Run pnpm task

> @dext7r/[email protected] task /home/runner/work/puppeteer/puppeteer
> npx ts-node ./src/serve

Start crawling - name: page, mode: async, total: 1
Id: 1 - Crawl does not need to sleep, send immediately
Crawl the final result:
  Success - total: 1, ids: [ 1 ]
  Error - total: 0, ids: [  ]
Start crawling - name: file, mode: async, total: 9
Id: 1 - Crawl does not need to sleep, send immediately
Id: 2 - Crawl does not need to sleep, send immediately
Id: 3 - Crawl does not need to sleep, send immediately
Id: 4 - Crawl does not need to sleep, send immediately
Id: 5 - Crawl does not need to sleep, send immediately
Id: 6 - Crawl does not need to sleep, send immediately
Id: 7 - Crawl does not need to sleep, send immediately
Id: 8 - Crawl does not need to sleep, send immediately
Id: 9 - Crawl does not need to sleep, send immediately
Error: ENOENT: no such file or directory, mkdir
    at Object.mkdirSync (node:fs:1396:3)
    at /home/runner/work/puppeteer/puppeteer/node_modules/.pnpm/[email protected][email protected]/node_modules/x-crawl/dist/index.js:46:10
    at Array.reduce (<anonymous>)
    at mkdirDirSync (/home/runner/work/puppeteer/puppeteer/node_modules/.pnpm/[email protected][email protected]/node_modules/x-crawl/dist/index.js:43:12)
    at fileSingleResultHandle (/home/runner/work/puppeteer/puppeteer/node_modules/.pnpm/[email protected][email protected]/node_modules/x-crawl/dist/index.js:951:7)
    at /home/runner/work/puppeteer/puppeteer/node_modules/.pnpm/[email protected][email protected]/node_modules/x-crawl/dist/index.js:135:21
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async Promise.all (index 3) {
  errno: -2,
  syscall: 'mkdir',
  code: 'ENOENT'
}
Error: The operation was canceled.

Minimal reproducible example

import xCrawl from 'x-crawl';
import path from 'path';
import fs from 'fs/promises';

const bilibiliXCrawl = xCrawl({ mode: 'async' });

export function getCurrentDate(separator: string = '-') {
  // (body elided in the report: formats the current date with the given separator)
}

// The enclosing async function, and the `urls`, `storeDir`, `page`, and
// `browser` bindings, are elided in the report.

  // Call the crawlFile API to crawl the images
  await bilibiliXCrawl.crawlFile({
    targets: [...urls],
    // storeDir: `./bilibili/${getCurrentDate('/')}`,
    storeDir,
  });
  // Close the page
  page.close();

  // Close the browser
  browser.close();
}
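A hedged workaround sketch, not the library's own fix: creating the store directory up front with Node's recursive mkdir means x-crawl's internal mkdirSync never meets a missing parent directory on Linux:

import fs from 'node:fs'

// Creates every missing parent in one call; a no-op if the path exists.
fs.mkdirSync(`./bilibili/${getCurrentDate('/')}`, { recursive: true })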

Error message

Error: ENOENT: no such file or directory, mkdir

x-crawl version

"x-crawl": "^6.0.1"

Node version

18

Package manager

pnpm

Package manager version

7.13.2

pnpm reports an error when installing dependencies

Bug expectation

macOS.
Error message:

node_modules/.pnpm/[email protected]/node_modules/puppeteer: Running postinstall script, failed in 5.2s
.../node_modules/puppeteer postinstall$ node install.js
│ ERROR: Failed to set up Chromium r1108766! Set "PUPPETEER_SKIP_DOWNLOAD" env variable to skip download.
│ Error: read ECONNRESET
│     at TLSWrap.onStreamRead (node:internal/stream_base_commons:217:20) {
│   errno: -54,
│   code: 'ECONNRESET',
│   syscall: 'read'
│ }
└─ Failed in 5.2s

Minimal reproducible example

pnpm i xxx

Error message

(Same log as above: ERROR: Failed to set up Chromium r1108766! Error: read ECONNRESET.)

x-crawl version

latest

Node version

v18.6.0

Package manager

pnpm

Package manager version

7.14.2

crawlData configuration issue

Bug expectation

When crawling an API with crawlData, I set the Content-Type header to application/x-www-form-urlencoded, but it gets replaced with application/json.

Minimal reproducible example

import Qs from 'qs';

const targets = {
  url: "https://glxy.mot.gov.cn/company/getCompanyAptitude.do",
  method: "POST",
  headers: {
    "Content-Type": "application/x-www-form-urlencoded;charset=UTF-8",
  },
  data: Qs.stringify({ type: 2 }),
};
const res = await myXCrawl.crawlData(targets);

Error message

No error

x-crawl version

7.1.0

Node version

18.16.0

Package manager

npm

Package manager version

9.5.1

Can this not be used on a CentOS server? The headless browser keeps failing to install with the dependencies

Bug expectation

Installing the dependencies produces the following error:
(screenshot)
After skipping the error, it does not run:
(screenshot)

Minimal reproducible example

npm i x-crawl

Error message

ERROR: Failed to set up Chrome r113.0.5672.63! Set "PUPPETEER_SKIP_DOWNLOAD" env variable to skip download. npm ERR! Error: Download failed: server returned code 404. URL: https://npm.taobao.org/mirrors/113.0.5672.63/linux64/chrome-linux64.zip

x-crawl version

7.0.1

Node version

16.18.1

Package manager

npm

Package manager version

8.19.2

A GET endpoint that the browser can call directly fails in the tool, possibly a URL-encoding issue

Bug expectation

Expected: the browser can call the endpoint directly and succeed.
Actual: the call fails because the URL cannot be recognized.

Minimal reproducible example

import { createCrawl } from 'x-crawl'

const crawlApp = createCrawl({ intervalTime: { max: 3000, min: 1000 } })
let aar = ["https://suno-list.com/api/suno/validateLink?link=https%3A%2F%2Fsuno.com%2Fsong%2Fe8b369f2-e2fc-4152-b2aa-d29224d43041"]
crawlApp.crawlData({ aar }).then((twoRes) => {
    // Process the API links
    console.log(twoRes);
    for (let one of twoRes) {
        // The error is thrown here
    }
    console.log(downloadLinks);
});
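An editing observation, hedged: the other examples in this thread pass targets to the crawl APIs, whereas { aar } produces a config whose only property is aar and which carries no target URL, which could by itself explain the Invalid URL error. A sketch of the shape the other examples use, assuming the v10 createCrawl API accepts the same targets field:

crawlApp.crawlData({ targets: aar }).then((twoRes) => {
  for (const one of twoRes) {
    console.log(one.isSuccess, one.data)
  }
})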

Error message

TypeError: Invalid URL

x-crawl version

x-crawl 10.0.1

Node version

v20.12.1

Package manager

pnpm

Package manager version

8.3.1

Thoughts and suggestions on the design of the `crawlFile` API

What problem does this feature solve?

Hello author, and thanks for providing such a library; it is very pleasant to use! However, I think the type declarations contain some redundancy and inflexibility that can constrain and inconvenience users when calling the functions. Taking crawlFile as an example:

1. url is clearly a required parameter; making it a standalone argument would be more intuitive and flexible.
2. Merge CrawlFileDetailTargetConfig and CrawlFileAdvancedConfig into a single, simpler CrawlFileConfig type, to reduce duplicated definitions.
3. Support passing a single URL or multiple URLs, and allow a separate storage directory, file name, and extension per URL.

I have tried to give TypeScript types below as a reference, to express my suggestion clearly. I hope this library keeps growing.

One more small suggestion: crawlFile could be renamed fetchFile, since "fetch" is the common term for retrieving data over the network, and combined with "file" it expresses the purpose clearly.

What would the proposed API look like?

export function crawlFile(
  url: string | string[],
  config?: CrawlFileConfig,
  callback?: (result: CrawlFileSingleResult | CrawlFileSingleResult[]) => void
): Promise<CrawlFileSingleResult | CrawlFileSingleResult[]>

export interface CrawlFileConfig extends CrawlCommonConfig {
  outputDir?: string | string[] | null
  extension?: string | string[] | null
  fileName?: string | string[] | null
  intervalTime?: IntervalTime
  fingerprints?: DetailTargetFingerprintCommon[]
  headers?: AnyObject
  onCrawlItemComplete?: (result: CrawlFileSingleResult) => void
  onBeforeSaveItemFile?: (info: {
    id: number
    fileName: string
    filePath: string
    data: Buffer
  }) => Promise<Buffer>
}
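For illustration, a hypothetical call against the proposed signature above (URLs and names are made up):

// Single URL, all options as constants:
await crawlFile('https://example.com/a.csv', { outputDir: './data' })

// Multiple URLs with per-URL file names and a shared extension:
await crawlFile(
  ['https://example.com/a.csv', 'https://example.com/b.csv'],
  { outputDir: './data', fileName: ['a', 'b'], extension: '.csv' }
)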

How do I handle file downloads that involve redirects?

Thanks for providing such a useful tool; I am still learning it. In practice, some sites do not provide a fixed file-download URL. Instead they serve a page that then redirects (via a 302, or a returned piece of JavaScript) until the actual download URL is reached, and that URL is temporary (so it cannot be captured in advance). How should this situation be handled? Thanks!

A feature that works on Windows returns no result and no error on Linux

Bug expectation

The call should return data normally.

Minimal reproducible example

A feature that works fine on Windows gets no result on Linux and raises no error; according to the log output, the crawl completes. I serve the API with fastify on Debian 11.

fastify.post('/api/screenshoot', async function handler(request, reply) {
    const { url } = request.body
    if (!url) {
        return reply.send({ code: 0, msg: "url is required" })
    }
    try {
        const buffer = await screenshoot({ url })
        // When sending non-standard data, always set the response content type
        reply.type("image/jpeg")
        return buffer
    } catch (error) {
        // Note: this catch block is empty, so any failure in screenshoot()
        // is silently swallowed - which matches the "no error, no result" symptom.
    }
})

// screenshoot model
const res = await x.crawlPage({
            url,
            maxRetry: 10,
            viewport: { width: 1920, height: 1080 },
            // Set fingerprints uniformly for this crawl's targets
            fingerprints: [
                // Device fingerprint 1
                {
                    maxWidth: 1024,
                    maxHeight: 800,
                    platform: 'Windows',
                    mobile: 'random',
                    userAgent: {
                        value:
                            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36',
                        versions: [
                            {
                                name: 'Chrome',
                                // Browser version range
                                maxMajorVersion: 112,
                                minMajorVersion: 100,
                                maxMinorVersion: 20,
                                maxPatchVersion: 5000
                            },
                            {
                                name: 'Safari',
                                maxMajorVersion: 537,
                                minMajorVersion: 500,
                                maxMinorVersion: 36,
                                maxPatchVersion: 5000
                            }
                        ]
                    }
                },
                // Device fingerprint 2
                {
                    platform: 'Windows',
                    mobile: 'random',
                    userAgent: {
                        value:
                            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edg/91.0.864.59',
                        versions: [
                            {
                                name: 'Chrome',
                                maxMajorVersion: 91,
                                minMajorVersion: 88,
                                maxMinorVersion: 10,
                                maxPatchVersion: 5615
                            },
                            { name: 'Safari', maxMinorVersion: 36, maxPatchVersion: 2333 },
                            { name: 'Edg', maxMinorVersion: 10, maxPatchVersion: 864 }
                        ]
                    }
                },
                // Device fingerprint 3
                {
                    platform: 'Windows',
                    mobile: 'random',
                    userAgent: {
                        value:
                            'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0',
                        versions: [
                            {
                                name: 'Firefox',
                                maxMajorVersion: 47,
                                minMajorVersion: 43,
                                maxMinorVersion: 10,
                                maxPatchVersion: 5000
                            }
                        ]
                    }
                }
            ]
        })

        const { browser, page } = res.data
        // Get a screenshot of the rendered page
        const buffer = await page.screenshot({ path: `../uploads/${host}_${Date.now()}.png` })
        console.log('Screen capture is complete')

        if (buffer) {
            page.close()
        }
        // Close the browser
        // browser.close()

        return buffer

Error message

No error, and no result is returned

x-crawl version

latest

Node version

20.9.0

Package manager

pnpm

Package manager version

latest

Dependency Dashboard

This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.

Open

These updates have all been created already.

Detected dependencies

github-actions
.github/workflows/dependency-review.yml
  • actions/checkout v4
  • actions/dependency-review-action v4
.github/workflows/deploy.yml
  • actions/checkout v4
  • pnpm/action-setup v3
  • actions/setup-node v4
  • actions/configure-pages v5
  • actions/upload-pages-artifact v3
  • actions/deploy-pages v4
.github/workflows/greetings.yml
  • actions/first-interaction v1
npm
docs/package.json
  • vitepress ^1.0.2
package.json
  • chalk 5.3.0
  • https-proxy-agent ^7.0.4
  • openai ^4.33.0
  • ora ^8.0.1
  • puppeteer 22.5.0
  • @babel/core ^7.24.0
  • @babel/preset-env ^7.24.0
  • @rollup/plugin-babel ^6.0.4
  • @rollup/plugin-run ^3.0.2
  • @rollup/plugin-terser ^0.4.4
  • @types/node ^20.12.1
  • @typescript-eslint/eslint-plugin ^7.9.0
  • @typescript-eslint/parser ^7.9.0
  • @vitest/coverage-v8 ^1.4.0
  • @vitest/ui ^1.4.0
  • eslint ^9.0.0
  • prettier ^3.2.5
  • rollup ^4.13.0
  • rollup-plugin-typescript2 ^0.36.0
  • typescript 5.4.4
  • vitest ^1.4.0
  • node >=18.0.0
publish/package.json
  • chalk 5.3.0
  • https-proxy-agent ^7.0.4
  • openai ^4.33.0
  • ora ^8.0.1
  • puppeteer 22.5.0
  • node >=18.0.0


crawlData request result issue

Bug expectation

Requesting data with crawlData, using the following request config:

 {
  url: 'https://api.github.com/repos/xxx/xxx/releases',
  method: 'GET',
  headers: {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36',
    'X-GitHub-Api-Version': '2022-11-28',
    Accept: 'application/json',
    Authorization: 'Bearer tokenxxx'
  },
  params: { per_page: 1, page: 1 },
  data: {},
  timeout: 60000,
  maxRetry: 0,
  proxy: ''
}

The following error is returned:

{
  id: 1,
  isSuccess: false,
  maxRetry: 0,
  retryCount: 0,
  proxyDetails: [],
  crawlErrorQueue: [
    TypeError: The "chunk" argument must be of type string or an instance of Buffer or Uint8Array. Received an instance of Object
        ... {
      code: 'ERR_INVALID_ARG_TYPE'
    }
  ],
  data: null
}

Looking at the source, I found:

res.on('data', (chunk) => container.push(chunk))

This handling of the response data does not work for string-format data; string chunks just need to be concatenated.
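A hedged sketch of response handling that accepts both Buffer and string chunks (for example after res.setEncoding has been called); this expresses the suggested fix as I read it, not x-crawl's actual source:

import https from 'node:https'

https.get('https://api.github.com/', { headers: { 'User-Agent': 'demo' } }, (res) => {
  const chunks: Buffer[] = []
  res.on('data', (chunk: Buffer | string) => {
    // Normalize each chunk to a Buffer so string chunks concatenate cleanly
    chunks.push(Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk))
  })
  res.on('end', () => {
    const body = Buffer.concat(chunks).toString('utf8')
    console.log(body.slice(0, 100))
  })
})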

Minimal reproducible example

const crawler = Crawl();
const options = {
  url: 'https://api.github.com/repos/xxx/xxx/releases',
  method: 'GET',
  headers: {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36',
    'X-GitHub-Api-Version': '2022-11-28',
    Accept: 'application/json',
    Authorization: 'Bearer tokenxxx'
  },
  params: {
    per_page: 1,
    page: 1
  },
  data: {},
  timeout: 30000,
  maxRetry: 0
};
const res = await crawler.crawlData(options);

Error message

TypeError: The "chunk" argument must be of type string or an instance of Buffer or Uint8Array. Received an instance of Object

x-crawl version

7.1.2

Node version

v18.6.0

Package manager

pnpm

Package manager version

8.6.5

When crawlPage crawls multiple links, the result is an array, but there is no way to tell which original URL each result corresponds to

What problem does this feature solve?

It avoids crawling the same page twice. Consider the following example:

const ress = await myXCrawl.crawlPage({
    targets: ['https://docs.aave.com/portal/'],
    viewport: { width: 1920, height: 1080 },
    intervalTime: { max: 2000, min: 1000 },
    maxRetry: 3,
});

for (const res of ress) {
    const { browser, page } = res.data;
    console.log(page.url());
    await page.waitForNetworkIdle();
    const content = await page.content();
    console.log(page.url());
}

https://docs.aave.com/portal/ redirects to https://docs.aave.com/hub/
page.url() returns the post-redirect link, so I cannot use the link to determine whether a page has already been crawled.
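In the meantime, a hedged workaround sketch: if crawlPage preserves the order of targets in its result array (an assumption), the original URL can be recovered by index:

const targets = ['https://docs.aave.com/portal/']
const ress = await myXCrawl.crawlPage({ targets })

ress.forEach((res, i) => {
  // targets[i] is the URL as requested; page.url() is the post-redirect URL
  console.log('requested:', targets[i], '-> landed on:', res.data.page.url())
})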

What would the proposed API look like?

Provide the original link in the result.
