spider-rs / case_insensitive_string

A case-insensitive string for Rust

Home Page: https://crates.io/crates/case_insensitive_string

License: MIT License

Languages: Rust 100.00%

Topics: case-insensitive, case-insensitive-strings, rust, string

case_insensitive_string's Introduction

spider-rs

The spider project ported to Node.js

Getting Started

  1. npm i @spider-rs/spider-rs --save

import { Website, pageTitle } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr')
  .withHeaders({
    authorization: 'somerandomjwt',
  })
  .withBudget({
    '*': 20, // crawl at most 20 pages across the whole website
    '/docs': 10, // crawl at most 10 pages under the `/docs` paths
  })
  .withBlacklistUrl(['/resume']) // regex or pattern matching to ignore paths
  .build()

// optional: page event handler
const onPageEvent = (_err, page) => {
  const title = pageTitle(page) // comment out to increase performance if title not needed
  console.info(`Title of ${page.url} is '${title}'`)
  website.pushData({
    status: page.statusCode,
    html: page.content,
    url: page.url,
    title,
  })
}

await website.crawl(onPageEvent)
await website.exportJsonlData('./storage/rsseau.jsonl')
console.log(website.getLinks())
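
The exported file can then be read back line by line. The following is a minimal sketch, assuming the export holds one JSON object per line with the same fields passed to pushData above (status, html, url, title). The PageRecord type and the parseJsonl helper are hypothetical names used for illustration and are not part of the library.

```typescript
// Hypothetical record shape, mirroring the object passed to pushData above.
interface PageRecord {
  status: number
  html: string
  url: string
  title: string
}

// Parse JSONL text: one JSON document per non-empty line.
function parseJsonl(text: string): PageRecord[] {
  return text
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0) // skip blank lines, including a trailing newline
    .map((line) => JSON.parse(line) as PageRecord)
}

// Sample input written inline so the sketch runs without touching the file system.
const sample =
  '{"status":200,"html":"<h1>Home</h1>","url":"https://rsseau.fr","title":"Home"}\n' +
  '{"status":404,"html":"","url":"https://rsseau.fr/missing","title":""}\n'

const records = parseJsonl(sample)
console.log(records.map((record) => `${record.status} ${record.url}`))
```

In a real script the text would come from reading ./storage/rsseau.jsonl instead of the inline sample.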

Collect the resources for a website.

import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr')
  .withBudget({
    '*': 20,
    '/docs': 10,
  })
  // you can use regex or string matches to ignore paths
  .withBlacklistUrl(['/resume'])
  .build()

await website.scrape()
console.log(website.getPages())
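
To picture how a budget and a blacklist interact, here is a small standalone sketch. It is not the library's implementation. It assumes that '*' counts every page, that any other budget key applies to paths starting with that key, and that blacklist entries are tested as regular expressions. The makeCrawlGate helper is a hypothetical name.

```typescript
type Budget = Record<string, number>

// Returns a function that decides whether a path may still be visited.
// Assumptions (not taken from the library): '*' matches every path, other keys
// match by prefix, and blacklist entries are regular expressions.
function makeCrawlGate(budget: Budget, blacklist: string[]): (path: string) => boolean {
  const used: Record<string, number> = {}
  const blocked = blacklist.map((pattern) => new RegExp(pattern))

  return (path: string): boolean => {
    // Blacklisted paths are skipped and never consume budget.
    if (blocked.some((re) => re.test(path))) return false

    // Collect every budget entry that applies to this path.
    const limits = Object.entries(budget).filter(
      ([key]) => key === '*' || path.startsWith(key),
    )

    // Refuse the path if any applicable budget is already exhausted.
    if (limits.some(([key, max]) => (used[key] ?? 0) >= max)) return false

    // Otherwise charge one page against each applicable budget.
    for (const [key] of limits) used[key] = (used[key] ?? 0) + 1
    return true
  }
}

const allow = makeCrawlGate({ '*': 3, '/docs': 1 }, ['/resume'])
const decisions = ['/resume', '/docs/a', '/docs/b', '/a', '/b', '/c'].map(
  (path) => [path, allow(path)] as const,
)
console.log(decisions)
```

With these sample limits, the second `/docs` page is refused by the `/docs` budget, and `/c` is refused once the site-wide budget of 3 is spent.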

Run the crawls in the background on another thread.

import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr')

const onPageEvent = (_err, page) => {
  console.log(page)
}

await website.crawl(onPageEvent, true)
// runs immediately

Use headless Chrome rendering for crawls.

import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://rsseau.fr').withChromeIntercept(true, true)

const onPageEvent = (_err, page) => {
  console.log(page)
}

// the third param determines headless chrome usage.
await website.crawl(onPageEvent, false, true)
console.log(website.getLinks())

Schedule recurring crawls with a cron expression.

import { Website } from '@spider-rs/spider-rs'

const website = new Website('https://choosealicense.com').withCron('1/5 * * * * *')
// sleep function to test cron
const stopCron = (time: number, handle) => {
  return new Promise((resolve) => {
    setTimeout(() => {
      resolve(handle.stop())
    }, time)
  })
}

const links = []

const onPageEvent = (err, value) => {
  links.push(value)
}

const handle = await website.runCron(onPageEvent)

// stop the cron in 4 seconds
await stopCron(4000, handle)
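
The cron expression above has six fields rather than the five used by classic crontab. The sketch below names each field, assuming the seconds-first layout that is common in Rust cron parsers. That layout is an assumption here and is not confirmed by this README. The splitCron helper is hypothetical and only splits the expression. It does not validate or schedule anything.

```typescript
// Assumed field order for a six-field cron expression (seconds first).
const FIELD_NAMES = ['second', 'minute', 'hour', 'dayOfMonth', 'month', 'dayOfWeek'] as const

type CronFields = Record<(typeof FIELD_NAMES)[number], string>

// Split a six-field cron expression into named fields.
function splitCron(expression: string): CronFields {
  const parts = expression.trim().split(/\s+/)
  if (parts.length !== FIELD_NAMES.length) {
    throw new Error(`expected ${FIELD_NAMES.length} fields, got ${parts.length}`)
  }
  const fields = {} as CronFields
  FIELD_NAMES.forEach((name, index) => {
    fields[name] = parts[index] ?? '*'
  })
  return fields
}

const cronFields = splitCron('1/5 * * * * *')
console.log(cronFields)
```

Under that assumption, `1/5` in the first position would mean every 5 seconds starting at second 1, which fits the 4-second test window used above.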

Use the crawl shortcut to get the page content and url.

import { crawl } from '@spider-rs/spider-rs'

const { links, pages } = await crawl('https://rsseau.fr')
console.log(pages)

Benchmarks

See the benchmarks for a breakdown across libraries and platforms.

Test url: https://espn.com

libraries                  pages      speed
spider(rust): crawl        150,387    1m
spider(nodejs): crawl      150,387    153s
spider(python): crawl      150,387    186s
scrapy(python): crawl      49,598     1h
crawlee(nodejs): crawl     18,779     30m

The benchmarks above were run on a Mac M1; spider on Linux ARM machines performs about 2-10x faster.
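
Because the table mixes units (s, m, h) and the libraries crawled different page counts, the rows are easier to compare as pages per second. The sketch below does that arithmetic, assuming "1m" means 60 seconds, "30m" means 1,800 seconds, and "1h" means 3,600 seconds. The helper names are hypothetical.

```typescript
// Convert a duration such as "153s", "30m", or "1h" into seconds.
function toSeconds(duration: string): number {
  const match = /^(\d+(?:\.\d+)?)(s|m|h)$/.exec(duration.trim())
  if (!match) throw new Error(`unrecognized duration: ${duration}`)
  const factors = { s: 1, m: 60, h: 3600 }
  return Number(match[1]) * factors[match[2] as 's' | 'm' | 'h']
}

// Rows copied from the benchmark table above.
const benchmarks = [
  { library: 'spider(rust)', pages: 150387, time: '1m' },
  { library: 'spider(nodejs)', pages: 150387, time: '153s' },
  { library: 'spider(python)', pages: 150387, time: '186s' },
  { library: 'scrapy(python)', pages: 49598, time: '1h' },
  { library: 'crawlee(nodejs)', pages: 18779, time: '30m' },
]

// Throughput in pages per second for each row.
const throughput = benchmarks.map((row) => ({
  library: row.library,
  pagesPerSecond: row.pages / toSeconds(row.time),
}))

for (const row of throughput) {
  console.log(`${row.library}: ${row.pagesPerSecond.toFixed(1)} pages/s`)
}
```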

Development

Install the napi CLI: npm i @napi-rs/cli --global

  1. yarn build:test

case_insensitive_string's People

Contributors

j-mendez