WBot - a web crawler

A configurable, thread-safe web crawler that provides a minimal interface for crawling and downloading web pages.

Features:

  • Clean, minimal API.
  • Configurable: MaxDepth, MaxBodySize, Rate Limit, Parallelism, User Agent & Proxy rotation.
  • Memory-efficient, thread-safe.
  • Provides built-in interfaces: Fetcher, Store, Queue & Logger.

WBot Specifications:

Interfaces

// Fetcher
type Fetcher interface {
	Fetch(req *Request) (*Response, error)
}

// Store
type Store interface {
	Visited(link string) bool
	Close()
}

// Queue
type Queue interface {
	Add(req *Request)
	Pop() *Request
	Next() bool
	Close()
}

// Logger
type Logger interface {
	Send(rep *Report)
}
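
The interfaces above are small enough to implement by hand. As a rough illustration, here is a minimal sketch of an in-memory Store. The names memStore and newMemStore are hypothetical and not part of wbot, and the Store interface is redeclared only so the snippet compiles on its own. The sketch assumes that Visited both checks a link and records it in a single call; the README does not spell out that contract, so verify it against the library before relying on it.

```go
package main

import (
	"fmt"
	"sync"
)

// Store mirrors the interface listed above; it is redeclared here only so
// that this sketch is self-contained.
type Store interface {
	Visited(link string) bool
	Close()
}

// memStore is a hypothetical in-memory Store guarded by a mutex.
// It is an illustration, not wbot's built-in implementation.
type memStore struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func newMemStore() *memStore {
	return &memStore{seen: make(map[string]struct{})}
}

// Visited reports whether link was seen before and records it if not.
// Assumption: the crawler expects check-and-mark in one call.
func (s *memStore) Visited(link string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.seen[link]; ok {
		return true
	}
	s.seen[link] = struct{}{}
	return false
}

// Close resets the set; an in-memory store has nothing else to release.
func (s *memStore) Close() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.seen = make(map[string]struct{})
}

func main() {
	var s Store = newMemStore()
	defer s.Close()

	fmt.Println(s.Visited("https://example.com/")) // false: first time seen
	fmt.Println(s.Visited("https://example.com/")) // true: already recorded
}
```

Holding the lock across both the lookup and the insert keeps the check-and-mark step atomic, so two goroutines cannot both be told that the same link is new.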

API

// NewWBot
func NewWBot(opts ...Option) (*WBot, error)

// Crawl
func (wb *WBot) Crawl(link string) error

// Close
func (wb *WBot) Close() 

Installation

Requires Go 1.18.

go get github.com/twiny/wbot

Example

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/twiny/wbot"
)

func main() {
	// options
	opts := []wbot.Option{
		wbot.SetMaxDepth(5),
		wbot.SetRateLimit(1, 2*time.Second),
		wbot.SetMaxBodySize(1024 * 1024),
		wbot.SetUserAgents([]string{"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"}),
	}

	// new bot
	bot, err := wbot.NewWBot(opts...)
	if err != nil {
		fmt.Println(err)
		return
	}
	defer bot.Close()

	// stream
	go func() {
		count := 0
		for resp := range bot.Stream() {
			count++
			fmt.Printf("num: %d - depth: %d - visited url:%s - status:%d - body len: %d\n", count, resp.Depth, resp.URL.String(), resp.Status, len(resp.Body))
		}
	}()

	site := "https://www.github.com"

	if err := bot.Crawl(site); err != nil {
		// Print and return instead of log.Fatal so the deferred bot.Close() still runs.
		log.Println(err)
		return
	}

	fmt.Println("i'm out :)")
}

TODO

  • Add support for robots.txt.
  • Add test cases.
  • Implement Fetch using Chromedp.
  • Add more examples.

Bugs

Bugs or suggestions? Please visit the issue tracker.
