Coder Social home page Coder Social logo

extract's Introduction

Extract

Extract just the content from a web page.

Extract is a wrapper to turn the Mercury Parser into a web service.

Why?

Mercury already offers an API component, meant to be deployed to AWS Lambda. There are a few reasons why this exists as an alternative.

  1. Deploy elsewhere. Extract is a vanilla Node.js app, that is meant to run in a VM, and has no platform specific dependencies.

  2. Built-in authorization system.

  3. Performance. In my experience, running it on a VM has been faster than the lambda version.

Here's a graph where you can see a decrease in average response time around the 17. Feb mark. This is when Feedbin switched from the lambda hosted version, to extract running on a VPS.

Response Time

Installation

  1. Install Node.js and npm.

  2. Clone extract

    git clone https://github.com/feedbin/extract.git
  3. Install the dependencies.

    cd extract
    npm install
  4. Run the server

    node app/server.js

    Alternatively, extract includes an ecosystem.config.js to use with pm2. You could use this in production.

    npm install --global pm2
    pm2 start ecosystem.config.js

Usage

Extract has a simple, file-based system for creating users and secret keys. This allows users to be added/removed while the system is running. In the ./users directory, the filename is the username and the contents is the secret key. To make a new user, run the following:

cd extract
mkdir users

# use your own secret key and username
echo "SECRET_KEY" > users/USERNAME

Once a username and password has been created, you can make a request.

An example request looks like:

http://localhost:3000/parser/:username/:signature?base64_url=:base64_url

The parts that you need are:

  • username your username
  • signature the hexadecimal HMAC-SHA1 signature of the URL you want to parse
  • base64_url base64 encoded version of the URL you want to parse

The URL is base64-encoded to avoid any issues in the way different systems encode URLs. It must use the RFC 4648 url-safe variant with no newlines.

If your platform does not offer a URL safe base64 option, you can replicate it. First create the base64 encoded string. Then replace the following characters:

  • + => -
  • / => _
  • \n => ""

Here's a sample implementation in ruby. You can use this as a reference for matching your implementation.

require "uri"
require "openssl"
require "base64"

username = "username"
secret = "secret"
host = "localhost"
port = 3000
url = "https://feedbin.com/blog/2018/09/11/private-by-default/"

digest = OpenSSL::Digest.new("sha1")
signature = OpenSSL::HMAC.hexdigest(digest, secret, url)

base64_url = Base64.urlsafe_encode64(url).gsub("\n", "")

URI::HTTPS.build({
  host: host,
  port: port,
  path: "/parser/#{username}/#{signature}",
  query: "base64_url=#{base64_url}"
}).to_s

The above example would produce:

https://localhost:3000/parser/username/e4696f8630bb68c21d77a9629ce8d063d8e5f81c?base64_url=aHR0cHM6Ly9mZWVkYmluLmNvbS9ibG9nLzIwMTgvMDkvMTEvcHJpdmF0ZS1ieS1kZWZhdWx0Lw==

With the output:

{
    "title": "Private by Default",
    "author": null,
    "date_published": "2018-09-11T00:00:00.000Z",
    "dek": null,
    "lead_image_url": "https://assets.feedbin.com/assets-site/blog/2018-09-11/embed-3f43088538ae5ed7e585c00013adc13a915fd35de31990b3081a085b963ed7dd.png",
    "content": "<div>content</div>",
    "next_page_url": null,
    "url": "https://feedbin.com/blog/2018/09/11/private-by-default/",
    "domain": "feedbin.com",
    "excerpt": "September 11, 2018 by Ben Ubois I want Feedbin to be the opposite of Big Social. I think people should have the right not to be tracked on the Internet and Feedbin can help facilitate that. Since&hellip;",
    "word_count": 787,
    "direction": "ltr",
    "total_pages": 1,
    "rendered_pages": 1
}

extract's People

Contributors

benubois avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.