Scraply

A simple DOM scraper to fetch information from any HTML-based website. License: Apache License 2.0.

Scraply is a very simple HTML scraping tool: if you know CSS and jQuery, then you can use it! Scraply is meant to stay simple and tiny, so it can also be used as a component in a larger system.

Overview

You can use Scraply within your stack via the CLI or over HTTP.

# here is the CLI usage

# extracting the title and the description from scraply github repo page
$ scraply extract \
    -u "https://github.com/alash3al/scraply" \
    -x title='$("title").text()' \
    -x description='$("meta[name=description]").attr("content")'

# same thing, but with a custom user agent
$ scraply extract \
    -u "https://github.com/alash3al/scraply" \
    -ua "OptionalCustomUserAgent"\
    -x title='$("title").text()' \
    -x description='$("meta[name=description]").attr("content")'

# same thing, but asking scraply to return the response body for debugging purposes
$ scraply extract \
    --return-body \
    -u "https://github.com/alash3al/scraply" \
    -x title='$("title").text()' \
    -x description='$("meta[name=description]").attr("content")'

For HTTP usage, run the HTTP server and then interact with it using any HTTP client.

# running the http server
# by default it listens on address ":8010", which is equivalent to "0.0.0.0:8010"
# for more information execute `$ scraply help`
$ scraply serve

# then, in another shell, execute the following curl request
$ curl http://localhost:8010/extract \
    -H "Content-Type: application/json" \
    -s \
    -d '{"url": "https://github.com/alash3al/scraply", "extractors": {"title": "$(\"title\").text()"}, "return_body": false, "user_agent": "CustomeUserAgent"}'

For debugging, there is an interactive shell:

$ scraply shell -u https://github.com/alash3al/scraply
➜ (scraply) > $("title").text()
GitHub - alash3al/scraply: Scraply a simple dom scraper to fetch information from any html based website and convert that info to JSON APIs

➜ (scraply) > request.url
https://github.com/alash3al/scraply

➜ (scraply) > response.status_code
200

➜ (scraply) > response.url
https://github.com/alash3al/scraply

➜ (scraply) > response.body
<html>.....
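The extractor expressions are jQuery-style selectors evaluated against the fetched document. As a rough illustration of the same selector semantics in plain Go, here is a sketch using the goquery library; this is only an analogy, not Scraply's internal implementation.

package main

import (
    "fmt"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    resp, err := http.Get("https://github.com/alash3al/scraply")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        panic(err)
    }

    // Equivalent of: $("title").text()
    fmt.Println(doc.Find("title").Text())

    // Equivalent of: $("meta[name=description]").attr("content")
    desc, _ := doc.Find("meta[name=description]").Attr("content")
    fmt.Println(desc)
}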

Download?

You can go to the releases page and pick the latest version, or you can run: $ docker run --rm -it ghcr.io/alash3al/scraply scraply help

Contribution?

Of course you can contribute. How?

  • clone the repo
  • create your fix/feature branch
  • create a pull request

Nothing else, enjoy!

About

I'm Mohamed Al Ashaal, a software engineer :)


scraply's Issues

CLI Mode

It would be really nifty to be able to run this from the CLI as a command and have it write the JSON to stdout so it could be used as part of a larger tool. For example:

scraply --execute /my-macro

[feature request] log x-forwarded-for header

I'm using a reverse proxy in front of Scraply to serve it (and all my other web apps) over SSL. Please log the X-Forwarded-For header instead of the requesting IP when the header exists, or add an option to enable/disable this. Thank you.
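For illustration, the requested behavior could look like this sketch in Go, assuming a standard net/http handler (this is not Scraply's code): prefer X-Forwarded-For when a reverse proxy sets it, otherwise fall back to the remote address.

package main

import (
    "log"
    "net/http"
)

// clientIP prefers the X-Forwarded-For header (set by a reverse proxy)
// and falls back to the direct remote address.
func clientIP(r *http.Request) string {
    if xff := r.Header.Get("X-Forwarded-For"); xff != "" {
        return xff
    }
    return r.RemoteAddr
}

func main() {
    handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        log.Printf("%s %s from %s", r.Method, r.URL.Path, clientIP(r))
        w.Write([]byte("ok"))
    })
    log.Fatal(http.ListenAndServe(":8010", handler))
}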

Support params

Supporting params in macros would help make this tool really useful for fetching dynamic content from sites. For example, a vanilla macro could fetch a list of stocks from a website then a separate one could take a name as a parameter and look up that particular stock by constructing a URL.

I imagine constructing a URL could be done similarly to exec:

urlExec = <<JS
    url = {
        scheme: "https",
        host: "stock-lookup.com",
        params: [{"name": "symbol", "value": scraply.params["stock-symbol"]}]
    }
JS

And be called with URL query params:

/stock?stock-symbol=AAPL
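The proposed construction maps naturally onto Go's net/url; here is a sketch under the issue's assumptions (the hypothetical scraply.params becomes a plain map, and stock-lookup.com is the issue's example host):

package main

import (
    "fmt"
    "net/url"
)

func main() {
    // Hypothetical macro params, e.g. parsed from /stock?stock-symbol=AAPL
    params := map[string]string{"stock-symbol": "AAPL"}

    q := url.Values{}
    q.Set("symbol", params["stock-symbol"])

    u := url.URL{
        Scheme:   "https",
        Host:     "stock-lookup.com",
        RawQuery: q.Encode(),
    }

    fmt.Println(u.String()) // https://stock-lookup.com?symbol=AAPL
}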

Make registry public read-only?

Looks like an authz bit needs to be enabled:

$ docker run --rm -it ghcr.io/alash3al/scraply scraply help
Unable to find image 'ghcr.io/alash3al/scraply:latest' locally
docker: Error response from daemon: Head "https://ghcr.io/v2/alash3al/scraply/manifests/latest": unauthorized.
See 'docker run --help'.
