scraper_service

Scraper is a web service that grabs the HTTP response status codes for a given URL.

Main features

  • Custom listen ports for scraper service and metrics service
  • Limit on the number of concurrent scrapes
  • Custom timeout
  • Settings via command line arguments or environment variables (SCRAPER_ prefix)

How it works

Flow Diagram

  1. A client makes an HTTP POST request to the Scraper Service on the Scraper listening port (default: 8080). The target URL is sent in the POST body (see the curl example after this list):

    {"url": "http://phaidra.ai"}
    
  2. The Scraper tries to fulfill the request by making an HTTP GET request to {target}

  3. If the {target} URL exists, Scraper receives the Status Code response from {target}

  4. The following Prometheus metrics are updated:

    • http_requests_total{code}
      • How many HTTP requests were received, partitioned by response code.
    • http_get{code}
      • How many HTTP GET requests were made, partitioned by URL and response code.
    • workerWait
      • Histogram. Time spent waiting for an available worker, in milliseconds.
  5. Client receives the response Status Code:

    • 200 OK: Target exists and replied with a Status Code
    • 408 RequestTimeout: No worker available in the timeout period
    • 501 NotImplemented: Scraper received an HTTP method other than POST
    • 400 BadRequest: Request POST body is malformed
    • 500 InternalServerError: Unexpected internal error
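
As a sketch, the request from step 1 can be reproduced with curl. The localhost:8080 address matches the default listen port; the root request path is an assumption, as the README does not name a specific path:

# Send a scrape request for http://phaidra.ai and print the Scraper's response Status Code.
# Assumes the service runs locally on the default port 8080 and accepts POSTs on the root path.
curl -s -o /dev/null -w "%{http_code}\n" \
  -X POST http://localhost:8080 \
  -d '{"url": "http://phaidra.ai"}'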

Metrics are exposed by a Prometheus (prometheus-client) service, accessible on port 9095 (by default) under the path /metrics.
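
The metrics endpoint can be inspected directly. Assuming the service is running locally with the default metrics port:

# Fetch the raw Prometheus metrics exposed by the Scraper (default port 9095).
curl -s http://localhost:9095/metrics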

Build and run Scraper

Requirements: a Go toolchain (to build from source) and, optionally, Docker (to build the container image).

Compile:

$ go build .

Run Unit Tests:

$ go test .

Start Scraper Service:

$ ./scraper_service

Scraper Service settings:

$ ./scraper_service --help

Usage: scraper_service [FLAG]...

Flags:
    --listen      Service listen address.                                 (type: string; env: SCRAPER_Listen; default: :8080)
    --workers     Number of serving workers.                              (type: uint8; env: SCRAPER_Workers; default: 2)
    --timeout     Maximum time (in milliseconds) to wait for a worker.    (type: uint64; env: SCRAPER_Timeout; default: 1000)
    --metrics     Metrics listen address.                                 (type: string; env: SCRAPER_MetricsListen; default: :9095)
    -h, --help    show help                                               (type: bool)
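
The same settings can be supplied through the SCRAPER_-prefixed environment variables listed above instead of flags, for example:

# Equivalent to --workers 4 --timeout 2000, using environment variables instead of flags.
SCRAPER_Workers=4 SCRAPER_Timeout=2000 ./scraper_service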

Build docker images

This builds the Docker image scraper_service/scraper:0.1.0 using a multi-stage build:

docker build -t scraper_service/scraper:0.1.0 .

Launching a container

docker run -p 8080:8080 -p 9095:9095 scraper_service/scraper:0.1.0
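
Settings can also be overridden inside the container through the SCRAPER_ environment variables; a sketch, keeping the default listen addresses:

# Run with 4 workers and a 2000 ms worker timeout.
docker run -p 8080:8080 -p 9095:9095 \
  -e SCRAPER_Workers=4 \
  -e SCRAPER_Timeout=2000 \
  scraper_service/scraper:0.1.0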

Simple request script

./tools/request.sh <address-to-scrape> <scraper-service-address> <count>

Examples:

./tools/request.sh https://google.com localhost:8080 1
./tools/request.sh https://phaidra.ai localhost:8080 1
./tools/request.sh https://phaidra.ai/trackrecord localhost:8080 1
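
The script itself is not reproduced here; a minimal hypothetical equivalent, assuming it simply POSTs the target URL to the Scraper <count> times and prints each response Status Code, could look like:

#!/usr/bin/env sh
# Hypothetical stand-in for tools/request.sh (the real script may differ):
# POST <address-to-scrape> to <scraper-service-address> <count> times.
TARGET="$1"
SCRAPER="$2"
COUNT="${3:-1}"

i=0
while [ "$i" -lt "$COUNT" ]; do
  curl -s -o /dev/null -w "%{http_code}\n" \
    -X POST "http://${SCRAPER}" \
    -d "{\"url\": \"${TARGET}\"}"
  i=$((i + 1))
done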

PromQL

Number of requests received per 10s:

sum(rate(http_requests_total[10s]))

Number of requests received per 10s, partitioned by Status Code:

sum by(code) (delta(http_requests_total[10s]))

Number of requests received per 1m, where Status Code was not 200 OK:

delta(http_requests_total{code!="200"}[1m])

Number of requests received that resulted in timeout, per 1m:

delta(http_requests_total{code="408"}[1m])

Number of requests that waited for a worker, per bucket of wait time, per 1m:

sum by (le) (rate(wait_available_worker_bucket[1m]))

Average wait time for a worker to process the request, per 1m:

sum(rate(wait_available_worker_sum[1m])) / sum(rate(wait_available_worker_count[1m]))

Kubernetes deployment

The image must be available in your cluster.

Kind example:

kind load docker-image scraper_service/scraper:0.1.0

Deploying the Scraper Service:

kubectl apply -f deployment/scraper.yaml

Deploying Prometheus with service discovery:

kubectl apply -f deployment/prometheus-rbac.yaml
kubectl apply -f deployment/prometheus.yaml
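
To verify the deployment, check the pods and forward the two service ports locally. The Deployment name scraper below is an assumption based on deployment/scraper.yaml; adjust it to match the actual manifest:

# Check that the Scraper pods are running, then forward the service and metrics ports.
# NOTE: the Deployment name "scraper" is assumed, not confirmed by the manifests shown here.
kubectl get pods
kubectl port-forward deploy/scraper 8080:8080 9095:9095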
