Crawlee Sitemap Error

This repository reproduces an issue with Crawlee's RobotsFile utility. When decompressing a gzip-format sitemap file fails, an uncatchable error is thrown that crashes the entire application.

The error looks like this:

> [email protected] start
> node index.js

node:events:492
      throw er; // Unhandled 'error' event
      ^

Error: incorrect header check
    at Zlib.zlibOnError [as onerror] (node:zlib:189:17)
Emitted 'error' event on Gunzip instance at:
    at Gunzip.onerror (node:internal/streams/readable:1004:14)
    at Gunzip.emit (node:events:514:28)
    at emitErrorNT (node:internal/streams/destroy:151:8)
    at emitErrorCloseNT (node:internal/streams/destroy:116:3)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
  errno: -3,
  code: 'Z_DATA_ERROR'
}

Node.js v20.10.0
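
The trace shows an 'error' event emitted on a Gunzip instance that has no listener, which Node.js turns into a thrown exception outside any caller's try/catch. The following is a minimal sketch, independent of Crawlee, that feeds bytes to `createGunzip()` and observes that event. The helper name `observeGunzipError` is hypothetical and used only for illustration.

```javascript
const { createGunzip } = require('node:zlib');

// Hypothetical helper: writes `bytes` into a Gunzip stream and resolves with
// the payload of its 'error' event, or with null if decompression succeeds.
function observeGunzipError(bytes) {
  return new Promise((resolve) => {
    const gunzip = createGunzip();
    // Without this listener, the same event would crash the process
    // ("Unhandled 'error' event"), as in the trace above.
    gunzip.on('error', (err) => resolve(err));
    gunzip.on('end', () => resolve(null));
    gunzip.resume(); // discard output so 'end' can fire on valid input
    gunzip.end(bytes);
  });
}

// Usage: plain XML served where gzip data was expected.
observeGunzipError(Buffer.from('<urlset></urlset>')).then((err) => {
  console.log(err && err.code);
});
```

Because the failure arrives as a stream event rather than as a rejected promise, wrapping the calling code in try/catch does not help unless something listens for 'error' on the Gunzip stream itself.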

The crash can be avoided by changing the internals of Crawlee like this:

// ./node_modules/@crawlee/utils/internals/robots.js
- stream = stream.pipe(createGunzip());
+ stream = stream.pipe(createGunzip()).on('error', reject);

The sitemap is correctly identified as "malformed":

> [email protected] start
> node index.js

WARN  Malformed sitemap content: https://7even.de/web/sitemap/shop-1/sitemap-1.xml
WARN  Malformed sitemap content: https://7even.de/web/sitemap/shop-9/sitemap-1.xml
WARN  Malformed sitemap content: https://7even.de/web/sitemap/shop-8/sitemap-1.xml
WARN  Malformed sitemap content: https://7even.de/web/sitemap/shop-7/sitemap-1.xml
WARN  Malformed sitemap content: https://7even.de/web/sitemap/shop-6/sitemap-1.xml
WARN  Malformed sitemap content: https://7even.de/web/sitemap/shop-5/sitemap-1.xml
[]
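
The one-line change works because it forwards the Gunzip stream's 'error' event to a promise rejection, which callers can then catch. The sketch below shows that pattern in isolation. It assumes, as the patch implies, that the pipe call sits inside a Promise executor where `reject` is in scope. The helper `gunzipToString` is hypothetical and is not part of Crawlee's API.

```javascript
const { Readable } = require('node:stream');
const { createGunzip } = require('node:zlib');

// Hypothetical helper: decompresses gzip bytes to a UTF-8 string and rejects
// (instead of crashing the process) when the data is not valid gzip.
function gunzipToString(bytes) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    Readable.from([bytes])
      .pipe(createGunzip())
      .on('error', reject) // same idea as the patched line
      .on('data', (chunk) => chunks.push(chunk))
      .on('end', () => resolve(Buffer.concat(chunks).toString('utf8')));
  });
}

// Usage: the failure is now an ordinary rejection.
gunzipToString(Buffer.from('not gzip')).catch((err) => {
  console.warn('Malformed content:', err.code);
});
```

Note that `.pipe()` does not forward errors from the source stream to the destination, so a listener on the Gunzip stream only covers decompression failures. Where the whole chain should be covered, `pipeline` from `node:stream/promises` is one alternative worth considering.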
