htmlparser2

A forgiving HTML/XML/RSS parser. The parser can handle streams and provides a callback interface.

Installation

npm install htmlparser2

A live demo of htmlparser2 is available here.

Usage

var htmlparser = require("htmlparser2");
var parser = new htmlparser.Parser({
	onopentag: function(name, attribs){
		if(name === "script" && attribs.type === "text/javascript"){
			console.log("JS! Hooray!");
		}
	},
	ontext: function(text){
		console.log("-->", text);
	},
	onclosetag: function(tagname){
		if(tagname === "script"){
			console.log("That's it?!");
		}
	}
}, {decodeEntities: true});
parser.write("Xyz <script type='text/javascript'>var foo = '<<bar>>';</ script>");
parser.end();

Output (simplified):

--> Xyz
JS! Hooray!
--> var foo = '<<bar>>';
That's it?!
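
Because write can be called more than once, the input does not have to arrive in one piece. The following is a minimal sketch, not taken from the project's documentation, that feeds a document to the parser in chunks and collects the text of the title element. The helper names createTitleCollector and titleFromChunks are hypothetical, and so is the state property on the handler object. The sketch assumes that a single run of text may be reported through several ontext calls, so it appends rather than assigns.

```javascript
// Hypothetical helper: builds a handler object that gathers the text
// found inside <title>. The `state` property exists only for this example.
function createTitleCollector(){
	var state = {inTitle: false, title: ""};
	return {
		state: state,
		onopentag: function(name){
			if(name === "title") state.inTitle = true;
		},
		ontext: function(text){
			// May fire more than once per text node, so append.
			if(state.inTitle) state.title += text;
		},
		onclosetag: function(name){
			if(name === "title") state.inTitle = false;
		}
	};
}

// Hypothetical helper: writes each chunk to the parser, then ends it.
function titleFromChunks(chunks){
	var htmlparser = require("htmlparser2"); // needs `npm install htmlparser2`
	var handler = createTitleCollector();
	var parser = new htmlparser.Parser(handler, {decodeEntities: true});
	chunks.forEach(function(chunk){
		parser.write(chunk);
	});
	parser.end();
	return handler.state.title;
}
```

Calling titleFromChunks with an array of string chunks is expected to yield the same title wherever the chunk boundaries fall, since the parser keeps its state between calls to write.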

Documentation

Read more about the parser and its options in the wiki.

Get a DOM

The DomHandler (known as DefaultHandler in the original htmlparser module) produces a DOM (document object model) that can be manipulated using the DomUtils helper.

The DomHandler, while still bundled with this module, was moved to its own module. Have a look at it for further information.
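
As a rough illustration of the DOM route, the sketch below parses a string with the DomHandler and then walks the resulting nodes by hand. It assumes the node shape the DomHandler produces (text nodes carry data, elements carry a children array). The helper names textOf and parseToDom are hypothetical and not part of htmlparser2.

```javascript
// Hypothetical helper: concatenates the text of a list of DOM nodes.
// Assumes text nodes look like {type: "text", data: "..."} and that
// elements keep their child nodes in a `children` array.
function textOf(nodes){
	return nodes.map(function(node){
		if(node.type === "text") return node.data;
		return node.children ? textOf(node.children) : "";
	}).join("");
}

// Hypothetical helper: parses `html` and passes (error, dom) to `callback`.
function parseToDom(html, callback){
	var htmlparser = require("htmlparser2"); // needs `npm install htmlparser2`
	var handler = new htmlparser.DomHandler(callback);
	var parser = new htmlparser.Parser(handler);
	parser.write(html);
	parser.end();
}
```

A call such as parseToDom("<p>Hello <b>world</b></p>", function(error, dom){ console.log(textOf(dom)); }) would print the text content of the fragment. The DomUtils helper ships ready-made functions for this kind of traversal, so hand-written walkers like textOf are mostly useful for seeing how the tree is laid out.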

Parsing RSS/RDF/Atom Feeds

new htmlparser.FeedHandler(function(error, feed){
    // error: an error if parsing failed; feed: an object describing the feed
    ...
});

Note: While the provided feed handler works for most feeds, you might want to use danmactough/node-feedparser, which is much better tested and actively maintained.
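
A minimal sketch of running a document through the feed handler might look like the following. The helper names summarizeFeed and parseFeed are hypothetical, and the title, items and link fields read from the feed object are assumptions about its shape. Check them against the handler's actual output before relying on them.

```javascript
// Hypothetical helper: renders one line per feed item.
// Assumes the feed object has `title` and `items`, and that each item
// has `title` and `link`; missing values fall back to placeholders.
function summarizeFeed(feed){
	var lines = [feed.title || "(untitled feed)"];
	(feed.items || []).forEach(function(item){
		lines.push("- " + (item.title || "(untitled)") + " <" + (item.link || "") + ">");
	});
	return lines.join("\n");
}

// Hypothetical helper: parses `xml` and reports a summary via `callback`.
function parseFeed(xml, callback){
	var htmlparser = require("htmlparser2"); // needs `npm install htmlparser2`
	var handler = new htmlparser.FeedHandler(function(error, feed){
		if(error) return callback(error);
		callback(null, summarizeFeed(feed));
	});
	// xmlMode makes the parser treat the input as XML rather than HTML.
	var parser = new htmlparser.Parser(handler, {xmlMode: true});
	parser.write(xml);
	parser.end();
}
```

Keeping the formatting in summarizeFeed separate from the parsing makes it possible to try the formatter on a hand-built object without installing anything.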

Performance

After having some artificial benchmarks for some time, @AndreasMadsen published his htmlparser-benchmark, which benchmarks HTML parsers based on real-world websites.

At the time of writing, the latest versions of all supported parsers show the following performance characteristics on Travis CI (please note that Travis doesn't guarantee equal conditions for all tests):

gumbo-parser   : 34.9208 ms/file ± 21.4238
html-parser    : 24.8224 ms/file ± 15.8703
html5          : 419.597 ms/file ± 264.265
htmlparser     : 60.0722 ms/file ± 384.844
htmlparser2-dom: 12.0749 ms/file ± 6.49474
htmlparser2    : 7.49130 ms/file ± 5.74368
hubbub         : 30.4980 ms/file ± 16.4682
libxmljs       : 14.1338 ms/file ± 18.6541
parse5         : 22.0439 ms/file ± 15.3743
sax            : 49.6513 ms/file ± 26.6032

How does this module differ from node-htmlparser?

This is a fork of the htmlparser module. The main difference is that this is intended to be used only with node (it runs on other platforms using browserify). htmlparser2 was rewritten multiple times and, while it maintains an API that's compatible with htmlparser in most cases, the projects don't share any code anymore.

The parser now provides a callback interface close to sax.js (originally targeted at readabilitySAX). As a result, old handlers won't work anymore.

The DefaultHandler and the RssHandler were renamed to DomHandler and FeedHandler to clarify their purpose. The old names are still available when requiring htmlparser2, so your code should work as expected.
