unfluff

An automatic web page content extractor for Node.js!

Automatically grab the main text out of a webpage like this:

const extractor = require('unfluff');
const data = extractor(my_html_data);
console.log(data.text);

In other words, it turns pretty webpages into boring plain text/JSON data.

This might be useful for:

  • Writing your own Instapaper clone
  • Easily building ML data sets from web pages
  • Reading your favorite articles from the console?

Please don't use this for:

  • Stealing other people's web pages
  • Making crappy spam sites with stolen content from other sites
  • Being a jerk

Credits / Thanks

This library is largely based on python-goose by Xavier Grangier, which is in turn based on goose by Gravity Labs. However, it's not an exact port, so it may behave differently on some pages, and the feature set is a little different. If you are looking for a Python or Scala/Java/JVM solution, check out those libraries!

Install

To install the command-line unfluff utility:

npm install -g unfluff

To install the unfluff module for use in your Node.js project:

npm install --save unfluff

Usage

You can use unfluff from Node.js or right on the command line!

Extracted data elements

This is what unfluff will try to grab from a web page:

  • title - The document's title (from the <title> tag)
  • text - The main text of the document with all the junk thrown away
  • image - The main image for the document (what's used by Facebook, etc.)
  • videos - An array of videos that were embedded in the article. Each video has src, width and height.
  • tags - Any tags or keywords that could be found by checking links marked rel="tag" or by looking at href URLs.
  • canonicalLink - The canonical url of the document, if given.
  • lang - The language of the document, either detected or supplied by you.
  • description - The description of the document, from <meta> tags
  • favicon - The url of the document's favicon.

This is returned as a simple JSON object.
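Because unfluff only tries to find each element, it is safest to read the result defensively. The following is a minimal sketch, assuming that any field may be missing or empty. The `summarize` helper and the sample values are hypothetical and not part of unfluff.

```javascript
// Sketch: read fields defensively from an unfluff-style result object.
// `summarize` is a hypothetical helper, not part of unfluff.
function summarize(data) {
  const text = typeof data.text === 'string' ? data.text : '';
  // Split on whitespace and drop empty strings to get a rough word count.
  const words = text.split(/\s+/).filter(Boolean);
  return {
    title: data.title || '(untitled)',
    lang: data.lang || 'unknown',
    wordCount: words.length,
    hasImage: Boolean(data.image),
    videoCount: Array.isArray(data.videos) ? data.videos.length : 0,
    tags: Array.isArray(data.tags) ? data.tags : [],
  };
}

// Sample object shaped like the fields listed above (values are made up).
const sample = {
  title: 'Example article',
  text: 'One two three.',
  image: null,
  videos: [],
  tags: ['demo'],
  lang: 'en',
};

console.log(summarize(sample));
```

In real use, `sample` would be replaced by the object returned from the extractor.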

Command line interface

You can pass a webpage to unfluff and it will try to parse out the interesting bits.

You can either pass in a file name:

unfluff my_file.html

Or you can pipe it in:

curl -s "http://somesite.com/page" | unfluff

You can easily chain this together with other Unix commands to do cool stuff. For example, you can download a web page, parse it and then use jq to print just the body text:

curl -s "http://www.polygon.com/2014/6/26/5842180/shovel-knight-review-pc-3ds-wii-u" | unfluff | jq -r .text

And here's how to find the top 10 most common words in an article:

curl -s "http://www.polygon.com/2014/6/26/5842180/shovel-knight-review-pc-3ds-wii-u" | unfluff | tr -c '[:alnum:]' '[\n*]' | sort | uniq -c | sort -nr | head -10
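If you are already working in Node.js, a similar word count can be done on the extracted text directly. This is a rough sketch rather than an exact port of the pipeline: `topWords` is a hypothetical helper, it splits on non-alphanumeric ASCII characters, and unlike the shell version it skips empty tokens.

```javascript
// Sketch: count word frequencies in extracted text.
// `topWords` is a hypothetical helper, not part of unfluff.
function topWords(text, limit = 10) {
  const counts = new Map();
  // Split on runs of characters that are not ASCII letters or digits.
  for (const word of text.split(/[^A-Za-z0-9]+/)) {
    if (word === '') continue; // skip empty tokens at the edges
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  // Sort by count (descending), then alphabetically to keep ties stable.
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1] || (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0))
    .slice(0, limit);
}

console.log(topWords('the cat and the hat and the bat', 2));
// → [['the', 3], ['and', 2]]
```

In practice you would pass `data.text` from the extractor instead of the literal string. Like the shell pipeline, the count is case-sensitive.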

Module Interface

extractor(html, language)

html: The html you want to parse

language (optional): The document's two-letter language code. This will be auto-detected as best as possible, but there might be cases where you want to override it.

The extraction algorithm depends heavily on the language, so it probably won't work if you have the language set incorrectly.

const extractor = require('unfluff');

const data = extractor(my_html_data);

Or supply the language code yourself:

const extractor = require('unfluff');

const data = extractor(my_html_data, 'en');

data will then be a JSON object that looks like this:

{
  "title": "Shovel Knight review: rewrite history",
  "text": "Shovel Knight is inspired by the past in all the right ways — but it's far from stuck in it. [.. snip ..]",
  "image": "http://cdn2.vox-cdn.com/uploads/chorus_image/image/34834129/jellyfish_hero.0_cinema_1280.0.png",
  "tags": [],
  "videos": [],
  "canonicalLink": "http://www.polygon.com/2014/6/26/5842180/shovel-knight-review-pc-3ds-wii-u",
  "lang": "en",
  "description": "Shovel Knight is inspired by the past in all the right ways — but it's far from stuck in it.",
  "favicon": "http://cdn1.vox-cdn.com/community_logos/42931/favicon.ico"
}
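To run the extractor on a saved page, something like the following might work. It is only a sketch: `extractFile` is a hypothetical wrapper, and the extractor is passed in as an argument so the snippet does not depend on how unfluff is installed. It relies only on the `extractor(html, language)` signature described above.

```javascript
const fs = require('fs');

// Sketch: read an HTML file and hand it to an unfluff-style extractor.
// `extractFile` is a hypothetical helper, not part of unfluff.
function extractFile(extractor, filePath, language) {
  const html = fs.readFileSync(filePath, 'utf8');
  // Only pass the language when the caller wants to override auto-detection.
  return language ? extractor(html, language) : extractor(html);
}

// Usage, assuming unfluff is installed and my_file.html exists:
// const data = extractFile(require('unfluff'), 'my_file.html', 'en');
// console.log(data.title);
```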

Demo

The easiest way to try out unfluff is to just install it:

$ npm install -g unfluff
$ curl -s "http://www.cnn.com/2014/07/07/world/americas/mexico-earthquake/index.html" | unfluff

But if you can't be bothered, you can check out fetch text. It's a site by Andy Jiang that uses unfluff. You send an email with a url and it emails back with the cleaned content of that url. It should give you a good idea of how unfluff handles different urls.

What is broken

  • Parsing web pages in languages other than English is poorly tested and is probably buggy right now.
  • This definitely won't work yet for languages like Chinese, Arabic, Korean, etc. that need smarter word tokenization.
  • This has only been tested on a limited set of web pages. There are probably lots of lurking bugs with web pages that haven't been tested yet.
