
coder-hxl / x-crawl


Flexible Node.js AI-assisted crawler library

Home Page: https://coder-hxl.github.io/x-crawl/

License: MIT License

Languages: JavaScript 1.89%, TypeScript 98.11%
Topics: crawl, crawler, nodejs, typescript, spider, flexible, puppeteer, javascript, multifunction, chromium

x-crawl's Issues

TypeError when creating an instance: (0 , x_crawl_1.default) is not a function

Bug expectation

When instantiating the crawler:

const myXCrawl = xCrawl();

the following error is thrown:

TypeError: (0 , x_crawl_1.default) is not a function

Minimal reproducible example

import xCrawl from 'x-crawl';

  async crawl(path: string) {
    // 2. Create a crawler instance
    const myXCrawl = xCrawl();
    myXCrawl.crawlPage(path).then(res => {
      const { browser, page } = res.data;
      console.log(page);
      // Close the browser
      browser.close();
    });
  }
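An editing note, hedged: the (0, x_crawl_1.default) call is exactly what tsc emits for a default import compiled to CommonJS, so this failure looks like an ESM/CJS interop mismatch rather than a missing export. A minimal workaround sketch, assuming x-crawl resolves to an ESM build while this project compiles to CommonJS (whether import() survives transpilation depends on the tsconfig "module" setting):

// Workaround sketch - assumption: x-crawl resolves to an ESM build while
// this project compiles to CommonJS. Dynamic import() keeps the module
// loading in ESM land and avoids the compiled `x_crawl_1.default` call.
async function createCrawler() {
  const { default: xCrawl } = await import('x-crawl');
  return xCrawl();
}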

Error message

ERROR 389 TypeError: (0 , x_crawl_1.default) is not a function

x-crawl version

8.3.1

Node version

16.19.1

Package manager

npm

Package manager version

9.6.1

crawlPage setting proxy option does not work

Bug expectation

Setting the proxy option on crawlPage has no effect.
Environment: Windows 10, Node 16, npm v8.19.
I use Clash as a proxy. With the system proxy turned off, crawlData works fine, but crawlPage's proxy configuration has no effect and requests only go through the system proxy.

The code is as follows:

import xCrawl from 'x-crawl'

// Application instance configuration
const testXCrawl = xCrawl({
  proxy: {
    urls: [
      'https://127.0.0.1:7890'
    ],
    switchByErrorCount: 1,
    switchByHttpStatus: [401, 403]
  }
})

// Advanced configuration
testXCrawl
  .crawlPage({
    targets: [
      'https://www.google.com',
      {
        url: 'https://www.google.com',
        proxy: { urls: ['http://127.0.0.1:7890'] }
      }
    ],
    maxRetry: 3,
    proxy: {
      urls: [
        'http://127.0.0.1:7890',
        'http://127.0.0.1:7890'
      ],
      switchByErrorCount: 1,
      switchByHttpStatus: [401, 403]
    }
  })
  .then((res) => {})

Suspecting that switchByHttpStatus might be interfering,
I modified line 270 of x-crawl's index.mjs and added the following code:
if (status === null) { result = true; }
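As a sanity check independent of x-crawl, the proxy itself can be exercised directly with https-proxy-agent (which the repository lists among its dependencies); a sketch, assuming the Clash HTTP endpoint at 127.0.0.1:7890:

import https from 'node:https'
import { HttpsProxyAgent } from 'https-proxy-agent'

// Send one request through the proxy; if this fails, the problem is the
// proxy endpoint itself rather than crawlPage's proxy handling.
const agent = new HttpsProxyAgent('http://127.0.0.1:7890')
https.get('https://www.google.com', { agent }, (res) => {
  console.log('status through proxy:', res.statusCode)
})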

Error string

no error

x-crawl version

7.0.1

Node version

16.18

Package manager

npm

Package manager version

8.19

crawlData does not pass request parameters correctly

Bug expectation

When requesting data with crawlData and passing query parameters via the params field of CrawlDataDetailTargetConfig, the actual request does not carry the parameters.

Looking at the source, the params are processed into a search string here, and then passed here to the https.request() method.

According to the Node.js documentation for http.request() (HTTP #httprequestoptions-callback | Node.js v18.16.0 Documentation), query parameters should instead be appended to the path option.
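To make the expected behavior concrete, a minimal sketch in plain Node (not x-crawl source) of how query parameters have to reach https.request(): there is no params or search option, so the caller must serialize them into path. The URL here is hypothetical:

import https from 'node:https'

const params = new URLSearchParams({ start: '0', sort: 'time', mode: 'list' })
const url = new URL('https://example.com/people/1/wish')

https
  .request(
    {
      hostname: url.hostname,
      // The query string must be appended to `path` by hand
      path: `${url.pathname}?${params.toString()}`,
      method: 'GET'
    },
    (res) => res.resume()
  )
  .end()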

Minimal reproducible example

const pageSize = 30;
const sort = 'time';
const rating = 'all';
const filter = 'all';
const mode = 'list';
const url = `https://${type}.${this.BASE_URL}/people/${uid}/${status}`;
const request: CrawlDataDetailTargetConfig = {
  url,
  method: "GET",
  params: {
    start: (page - 1) * pageSize,
    sort,
    rating,
    filter,
    mode,
  },
  data: {},
  timeout: 30000,
  maxRetry: 0
};

const crawler = Crawl({
  enableRandomFingerprint: true
});
const res = await crawler.crawlData(request);

Error message

No error

x-crawl version

7.0.0

Node version

18.15.0

Package manager

npm

Package manager version

8.3.1

How do I trigger click events?

What problem does this feature solve?

For example, I want to crawl a YouTube live stream, but the saved screenshot shows a button that I have no way to click. Is there a click operation available?

What would the proposed API look like?

A click function
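Other reports in this thread show that crawlPage exposes the underlying puppeteer browser and page through res.data, so clicking should already be achievable with puppeteer's own Page.click before taking the screenshot. A hedged sketch; the URL and selector are hypothetical:

import xCrawl from 'x-crawl'

const myXCrawl = xCrawl()

myXCrawl.crawlPage('https://www.youtube.com/watch?v=xxxx').then(async (res) => {
  const { browser, page } = res.data
  await page.click('#play-button') // puppeteer dispatches a real mouse click
  await page.screenshot({ path: './live.png' })
  await browser.close()
})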

crawlData: the data parameter passed as a string

Bug expectation

When the data I pass in is a string, the isDataUndefine check wraps data in another layer of quotes, so the request body no longer matches what the API expects.
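A one-line illustration of the suspected cause, assuming the extra quotes come from JSON-stringifying a body that is already a string:

JSON.stringify('page=1') // => '"page=1"' - a form body becomes JSON text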

Minimal reproducible example

const post = {
  url: "https://glxy.mot.gov.cn/company/getCompanyAptitude.do",
  method: "post",
  headers: {
    "Content-Type": "application/x-www-form-urlencoded",
  },
  data: "page=1",
};
let res = await myXCrawl.crawlData(post);

Error message

No error

x-crawl version

7.1.1

Node version

18.16.0

Package manager

npm

Package manager version

9.5.1

Suggestion: let crawlFile option parameters accept a string or an array

What problem does this feature solve?

Batch-downloading data from an API is a very common need. To make it easy to set per-download options, it would help if every crawlFile configuration parameter could also accept an array. When a string constant is passed, it applies uniformly to all downloads; when an array is passed, the value for each download is taken by index.

I suggest considering this pattern in the design of the other APIs as well, for more flexible usage.

What would the proposed API look like?

const CODES = ['NTES', 'BIDU', 'JD'];
await myXCrawl.crawlFile({
  url: CODES.map((d) => `https://query1.finance.yahoo.com/v7/finance/download/${d}`),
  storeDir: './public/data/financial', // a constant: every download is saved here, but an array should also be allowed
  fileName: CODES,  // array supported
  extension: '.csv',  // every file gets the .csv extension, but an array should also be allowed
});

xCrawl.crawlFile is not fully compatible with Linux

Bug expectation

https://github.com/dext7r/puppeteer/blob/master/src/utils/bilibili.ts#L45

(screenshot)

On Windows, this directory can be created and written to directly.
On Linux, however, it fails as follows:

Run pnpm task

> @dext7r/[email protected] task /home/runner/work/puppeteer/puppeteer
> npx ts-node ./src/serve

Start crawling - name: page, mode: async, total: 1
Id: 1 - Crawl does not need to sleep, send immediately
Crawl the final result:
  Success - total: 1, ids: [ 1 ]
  Error - total: 0, ids: [  ]
Start crawling - name: file, mode: async, total: 9
Id: 1 - Crawl does not need to sleep, send immediately
Id: 2 - Crawl does not need to sleep, send immediately
Id: 3 - Crawl does not need to sleep, send immediately
Id: 4 - Crawl does not need to sleep, send immediately
Id: 5 - Crawl does not need to sleep, send immediately
Id: 6 - Crawl does not need to sleep, send immediately
Id: 7 - Crawl does not need to sleep, send immediately
Id: 8 - Crawl does not need to sleep, send immediately
Id: 9 - Crawl does not need to sleep, send immediately
Error: ENOENT: no such file or directory, mkdir
    at Object.mkdirSync (node:fs:1396:3)
    at /home/runner/work/puppeteer/puppeteer/node_modules/.pnpm/[email protected][email protected]/node_modules/x-crawl/dist/index.js:46:10
    at Array.reduce (<anonymous>)
    at mkdirDirSync (/home/runner/work/puppeteer/puppeteer/node_modules/.pnpm/[email protected][email protected]/node_modules/x-crawl/dist/index.js:43:12)
    at fileSingleResultHandle (/home/runner/work/puppeteer/puppeteer/node_modules/.pnpm/[email protected][email protected]/node_modules/x-crawl/dist/index.js:951:7)
    at /home/runner/work/puppeteer/puppeteer/node_modules/.pnpm/[email protected][email protected]/node_modules/x-crawl/dist/index.js:135:21
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async Promise.all (index 3) {
  errno: -2,
  syscall: 'mkdir',
  code: 'ENOENT'
}
Error: The operation was canceled.

Minimal reproducible example

import xCrawl from 'x-crawl';
import path from 'path';
import fs from 'fs/promises';

const bilibiliXCrawl = xCrawl({ mode: 'async' });

export function getCurrentDate(separator: string = '-') {
  // (body elided in the report: formats the current date with the given separator)
}

// The enclosing async function, and the `urls`, `storeDir`, `page`, and
// `browser` bindings, are elided in the report.

  // Call the crawlFile API to crawl the images
  await bilibiliXCrawl.crawlFile({
    targets: [...urls],
    // storeDir: `./bilibili/${getCurrentDate('/')}`,
    storeDir,
  });
  // Close the page
  page.close();

  // Close the browser
  browser.close();
}
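A hedged workaround sketch, not the library's own fix: creating the store directory up front with Node's recursive mkdir means x-crawl's internal mkdirSync never meets a missing parent directory on Linux:

import fs from 'node:fs'

// Creates every missing parent in one call; a no-op if the path exists.
fs.mkdirSync(`./bilibili/${getCurrentDate('/')}`, { recursive: true })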

Error message

Error: ENOENT: no such file or directory, mkdir

x-crawl version

"x-crawl": "^6.0.1"

Node version

18

Package manager

pnpm

Package manager version

7.13.2

pnpm reports an error when installing dependencies

Bug expectation

macOS.
Error message:

node_modules/.pnpm/[email protected]/node_modules/puppeteer: Running postinstall script, failed in 5.2s
.../node_modules/puppeteer postinstall$ node install.js
│ ERROR: Failed to set up Chromium r1108766! Set "PUPPETEER_SKIP_DOWNLOAD" env variable to skip download.
│ Error: read ECONNRESET
│     at TLSWrap.onStreamRead (node:internal/stream_base_commons:217:20) {
│   errno: -54,
│   code: 'ECONNRESET',
│   syscall: 'read'
│ }
└─ Failed in 5.2s

Minimal reproducible example

pnpm i xxx

Error message

(Same log as above: ERROR: Failed to set up Chromium r1108766! Error: read ECONNRESET.)

x-crawl version

latest

Node version

v18.6.0

Package manager

pnpm

Package manager version

7.14.2

crawlData configuration issue

Bug expectation

When crawling an API with crawlData, I set the Content-Type header to application/x-www-form-urlencoded, but it gets replaced with application/json.

Minimal reproducible example

import Qs from 'qs';

const targets = {
  url: "https://glxy.mot.gov.cn/company/getCompanyAptitude.do",
  method: "POST",
  headers: {
    "Content-Type": "application/x-www-form-urlencoded;charset=UTF-8",
  },
  data: Qs.stringify({ type: 2 }),
};
const res = await myXCrawl.crawlData(targets);

Error message

No error

x-crawl version

7.1.0

Node version

18.16.0

Package manager

npm

Package manager version

9.5.1

Can this not be used on a CentOS server? The headless browser keeps failing to install with the dependencies

Bug expectation

Installing the dependencies produces the following error:
(screenshot)
After skipping the error, it does not run:
(screenshot)

Minimal reproducible example

npm i x-crawl

Error message

ERROR: Failed to set up Chrome r113.0.5672.63! Set "PUPPETEER_SKIP_DOWNLOAD" env variable to skip download. npm ERR! Error: Download failed: server returned code 404. URL: https://npm.taobao.org/mirrors/113.0.5672.63/linux64/chrome-linux64.zip

x-crawl version

7.0.1

Node version

16.18.1

Package manager

npm

Package manager version

8.19.2

A GET endpoint that the browser can call directly fails in the tool, possibly a URL-encoding issue

Bug expectation

Expected: the browser can call the endpoint directly and succeed.
Actual: the call fails because the URL cannot be recognized.

Minimal reproducible example

import { createCrawl } from 'x-crawl'

const crawlApp = createCrawl({ intervalTime: { max: 3000, min: 1000 } })
let aar = ["https://suno-list.com/api/suno/validateLink?link=https%3A%2F%2Fsuno.com%2Fsong%2Fe8b369f2-e2fc-4152-b2aa-d29224d43041"]
crawlApp.crawlData({ aar }).then((twoRes) => {
    // Process the API links
    console.log(twoRes);
    for (let one of twoRes) {
        // The error is thrown here
    }
    console.log(downloadLinks);
});
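An editing observation, hedged: the other examples in this thread pass targets to the crawl APIs, whereas { aar } produces a config whose only property is aar and which carries no target URL, which could by itself explain the Invalid URL error. A sketch of the shape the other examples use, assuming the v10 createCrawl API accepts the same targets field:

crawlApp.crawlData({ targets: aar }).then((twoRes) => {
  for (const one of twoRes) {
    console.log(one.isSuccess, one.data)
  }
})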

Error message

TypeError: Invalid URL

x-crawl version

x-crawl 10.0.1

Node version

v20.12.1

Package manager

pnpm

Package manager version

8.3.1

Thoughts and suggestions on the design of the `crawlFile` API

What problem does this feature solve?

Hello author, and thanks for providing such a library; it is very pleasant to use! However, I think the type declarations contain some redundancy and inflexibility that can constrain and inconvenience users when calling the functions. Taking crawlFile as an example:

1. url is clearly a required parameter; making it a standalone argument would be more intuitive and flexible.
2. Merge CrawlFileDetailTargetConfig and CrawlFileAdvancedConfig into a single, simpler CrawlFileConfig type, to reduce duplicated definitions.
3. Support passing a single URL or multiple URLs, and allow a separate storage directory, file name, and extension per URL.

I have tried to give TypeScript types below as a reference, to express my suggestion clearly. I hope this library keeps growing.

One more small suggestion: crawlFile could be renamed fetchFile, since "fetch" is the common term for retrieving data over the network, and combined with "file" it expresses the purpose clearly.

What would the proposed API look like?

export function crawlFile(
  url: string | string[],
  config?: CrawlFileConfig,
  callback?: (result: CrawlFileSingleResult | CrawlFileSingleResult[]) => void
): Promise<CrawlFileSingleResult | CrawlFileSingleResult[]>

export interface CrawlFileConfig extends CrawlCommonConfig {
  outputDir?: string | string[] | null
  extension?: string | string[] | null
  fileName?: string | string[] | null
  intervalTime?: IntervalTime
  fingerprints?: DetailTargetFingerprintCommon[]
  headers?: AnyObject
  onCrawlItemComplete?: (result: CrawlFileSingleResult) => void
  onBeforeSaveItemFile?: (info: {
    id: number
    fileName: string
    filePath: string
    data: Buffer
  }) => Promise<Buffer>
}
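For illustration, a hypothetical call against the proposed signature above (URLs and names are made up):

// Single URL, all options as constants:
await crawlFile('https://example.com/a.csv', { outputDir: './data' })

// Multiple URLs with per-URL file names and a shared extension:
await crawlFile(
  ['https://example.com/a.csv', 'https://example.com/b.csv'],
  { outputDir: './data', fileName: ['a', 'b'], extension: '.csv' }
)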

How do I handle file downloads that involve redirects?

Thanks for providing such a useful tool; I am still learning it. In practice, some sites do not provide a fixed file-download URL. Instead they serve a page that then redirects (via a 302, or a returned piece of JavaScript) until the actual download URL is reached, and that URL is temporary (so it cannot be captured in advance). How should this situation be handled? Thanks!

A feature that works on Windows returns no result and no error on Linux

Bug expectation

The call should return data normally.

Minimal reproducible example

A feature that works fine on Windows gets no result on Linux and raises no error; according to the log output, the crawl completes. I serve the API with fastify on Debian 11.

fastify.post('/api/screenshoot', async function handler(request, reply) {
    const { url } = request.body
    if (!url) {
        return reply.send({ code: 0, msg: "url is required" })
    }
    try {
        const buffer = await screenshoot({ url })
        // When sending non-standard data, always set the response content type
        reply.type("image/jpeg")
        return buffer
    } catch (error) {
        // Note: this catch block is empty, so any failure in screenshoot()
        // is silently swallowed - which matches the "no error, no result" symptom.
    }
})

// screenshoot model
const res = await x.crawlPage({
            url,
            maxRetry: 10,
            viewport: { width: 1920, height: 1080 },
            // Set fingerprints uniformly for this crawl's targets
            fingerprints: [
                // Device fingerprint 1
                {
                    maxWidth: 1024,
                    maxHeight: 800,
                    platform: 'Windows',
                    mobile: 'random',
                    userAgent: {
                        value:
                            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36',
                        versions: [
                            {
                                name: 'Chrome',
                                // Browser version range
                                maxMajorVersion: 112,
                                minMajorVersion: 100,
                                maxMinorVersion: 20,
                                maxPatchVersion: 5000
                            },
                            {
                                name: 'Safari',
                                maxMajorVersion: 537,
                                minMajorVersion: 500,
                                maxMinorVersion: 36,
                                maxPatchVersion: 5000
                            }
                        ]
                    }
                },
                // Device fingerprint 2
                {
                    platform: 'Windows',
                    mobile: 'random',
                    userAgent: {
                        value:
                            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36 Edg/91.0.864.59',
                        versions: [
                            {
                                name: 'Chrome',
                                maxMajorVersion: 91,
                                minMajorVersion: 88,
                                maxMinorVersion: 10,
                                maxPatchVersion: 5615
                            },
                            { name: 'Safari', maxMinorVersion: 36, maxPatchVersion: 2333 },
                            { name: 'Edg', maxMinorVersion: 10, maxPatchVersion: 864 }
                        ]
                    }
                },
                // Device fingerprint 3
                {
                    platform: 'Windows',
                    mobile: 'random',
                    userAgent: {
                        value:
                            'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0',
                        versions: [
                            {
                                name: 'Firefox',
                                maxMajorVersion: 47,
                                minMajorVersion: 43,
                                maxMinorVersion: 10,
                                maxPatchVersion: 5000
                            }
                        ]
                    }
                }
            ]
        })

        const { browser, page } = res.data
        // Get a screenshot of the rendered page
        const buffer = await page.screenshot({ path: `../uploads/${host}_${Date.now()}.png` })
        console.log('Screen capture is complete')

        if (buffer) {
            page.close()
        }
        // Close the browser
        // browser.close()

        return buffer

Error message

No error, and no result is returned

x-crawl version

latest

Node version

20.9.0

Package manager

pnpm

Package manager version

latest

Dependency Dashboard

This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.

Open

These updates have all been created already.

Detected dependencies

github-actions
.github/workflows/dependency-review.yml
  • actions/checkout v4
  • actions/dependency-review-action v4
.github/workflows/deploy.yml
  • actions/checkout v4
  • pnpm/action-setup v3
  • actions/setup-node v4
  • actions/configure-pages v5
  • actions/upload-pages-artifact v3
  • actions/deploy-pages v4
.github/workflows/greetings.yml
  • actions/first-interaction v1
npm
docs/package.json
  • vitepress ^1.0.2
package.json
  • chalk 5.3.0
  • https-proxy-agent ^7.0.4
  • openai ^4.33.0
  • ora ^8.0.1
  • puppeteer 22.5.0
  • @babel/core ^7.24.0
  • @babel/preset-env ^7.24.0
  • @rollup/plugin-babel ^6.0.4
  • @rollup/plugin-run ^3.0.2
  • @rollup/plugin-terser ^0.4.4
  • @types/node ^20.12.1
  • @typescript-eslint/eslint-plugin ^7.9.0
  • @typescript-eslint/parser ^7.9.0
  • @vitest/coverage-v8 ^1.4.0
  • @vitest/ui ^1.4.0
  • eslint ^9.0.0
  • prettier ^3.2.5
  • rollup ^4.13.0
  • rollup-plugin-typescript2 ^0.36.0
  • typescript 5.4.4
  • vitest ^1.4.0
  • node >=18.0.0
publish/package.json
  • chalk 5.3.0
  • https-proxy-agent ^7.0.4
  • openai ^4.33.0
  • ora ^8.0.1
  • puppeteer 22.5.0
  • node >=18.0.0


crawlData request result issue

Bug expectation

Requesting data with crawlData, using the following request config:

 {
  url: 'https://api.github.com/repos/xxx/xxx/releases',
  method: 'GET',
  headers: {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36',
    'X-GitHub-Api-Version': '2022-11-28',
    Accept: 'application/json',
    Authorization: 'Bearer tokenxxx'
  },
  params: { per_page: 1, page: 1 },
  data: {},
  timeout: 60000,
  maxRetry: 0,
  proxy: ''
}

The following error is returned:

{
  id: 1,
  isSuccess: false,
  maxRetry: 0,
  retryCount: 0,
  proxyDetails: [],
  crawlErrorQueue: [
    TypeError: The "chunk" argument must be of type string or an instance of Buffer or Uint8Array. Received an instance of Object
        ... {
      code: 'ERR_INVALID_ARG_TYPE'
    }
  ],
  data: null
}

Looking at the source, I found:

res.on('data', (chunk) => container.push(chunk))

This handling of the response data does not work for string-format data; string chunks just need to be concatenated.
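A hedged sketch of response handling that accepts both Buffer and string chunks (for example after res.setEncoding has been called); this expresses the suggested fix as I read it, not x-crawl's actual source:

import https from 'node:https'

https.get('https://api.github.com/', { headers: { 'User-Agent': 'demo' } }, (res) => {
  const chunks: Buffer[] = []
  res.on('data', (chunk: Buffer | string) => {
    // Normalize each chunk to a Buffer so string chunks concatenate cleanly
    chunks.push(Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk))
  })
  res.on('end', () => {
    const body = Buffer.concat(chunks).toString('utf8')
    console.log(body.slice(0, 100))
  })
})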

Minimal reproducible example

const crawler = Crawl();
const options = {
  url: 'https://api.github.com/repos/xxx/xxx/releases',
  method: 'GET',
  headers: {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36',
    'X-GitHub-Api-Version': '2022-11-28',
    Accept: 'application/json',
    Authorization: 'Bearer tokenxxx'
  },
  params: {
    per_page: 1,
    page: 1
  },
  data: {},
  timeout: 30000,
  maxRetry: 0
};
const res = await crawler.crawlData(options);

Error message

TypeError: The "chunk" argument must be of type string or an instance of Buffer or Uint8Array. Received an instance of Object

x-crawl version

7.1.2

Node version

v18.6.0

Package manager

pnpm

Package manager version

8.6.5

When crawlPage crawls multiple links, the result is an array, but there is no way to tell which original URL each result corresponds to

What problem does this feature solve?

It avoids crawling the same page twice. Consider the following example:

const ress = await myXCrawl.crawlPage({
    targets: ['https://docs.aave.com/portal/'],
    viewport: { width: 1920, height: 1080 },
    intervalTime: { max: 2000, min: 1000 },
    maxRetry: 3,
});

for (const res of ress) {
    const { browser, page } = res.data;
    console.log(page.url());
    await page.waitForNetworkIdle();
    const content = await page.content();
    console.log(page.url());
}

https://docs.aave.com/portal/ redirects to https://docs.aave.com/hub/
page.url() returns the post-redirect link, so I cannot use the link to determine whether a page has already been crawled.
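In the meantime, a hedged workaround sketch: if crawlPage preserves the order of targets in its result array (an assumption), the original URL can be recovered by index:

const targets = ['https://docs.aave.com/portal/']
const ress = await myXCrawl.crawlPage({ targets })

ress.forEach((res, i) => {
  // targets[i] is the URL as requested; page.url() is the post-redirect URL
  console.log('requested:', targets[i], '-> landed on:', res.data.page.url())
})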

What would the proposed API look like?

Provide the original link in the result.
