NOTE: This is a fork that aims to integrate CSS support into the existing parser, which was not written by me. This repo is a work in progress; it has no CSS-related functionality yet.

tl_yargl

tl_yargl is a fast HTML parser written in pure Rust.

This crate does not (currently) strictly follow the full HTML standard, but this is rarely a problem in practice: it attempts to support most "sane" HTML. Not being bound by the specification allows for more optimization opportunities. If you need a parser that can (very quickly) parse the typical HTML document, along with a simple API to work with the DOM, give this crate a try.

If you need a parser that closely follows the standard, consider using html5ever, lol-html, or html5gum.

Usage

Add tl_yargl to your dependencies.

[dependencies]
tl_yargl = "0.7.7"
# or, with explicit SIMD support
# (requires a nightly compiler!)
tl_yargl = { version = "0.7.7", features = ["simd"] }

The main function is tl_yargl::parse(). It accepts a string of HTML source code and parses it. Note that the parser currently silently ignores invalid tags, much like browsers do. As a result, large chunks of the HTML document are sometimes missing from the resulting tree.

let dom = tl_yargl::parse(r#"<p id="text">Hello</p>"#, tl_yargl::ParserOptions::default()).unwrap();
let parser = dom.parser();
let element = dom.get_element_by_id("text")
  .expect("Failed to find element")
  .get(parser)
  .unwrap();

assert_eq!(element.inner_text(parser), "Hello");

Examples

Finding a tag using the query selector API

let dom = tl_yargl::parse(r#"<div><img src="cool-image.png" /></div>"#, tl_yargl::ParserOptions::default()).unwrap();
let img = dom.query_selector("img[src]").unwrap().next();
    
assert!(img.is_some());

Iterating over the subnodes of an HTML document

let dom = tl_yargl::parse(r#"<div><img src="cool-image.png" /></div>"#, tl_yargl::ParserOptions::default()).unwrap();
let img = dom.nodes()
  .iter()
  .find(|node| {
    node.as_tag().map_or(false, |tag| tag.name() == "img")
  });
    
assert!(img.is_some());

Mutating the href attribute of an anchor tag

In a real-world scenario, you would want to handle errors properly instead of unwrapping.

let input = r#"<div><a href="/about">About</a></div>"#;
let mut dom = tl_yargl::parse(input, tl_yargl::ParserOptions::default())
  .expect("HTML string too long");
  
let anchor = dom.query_selector("a[href]")
  .expect("Failed to parse query selector")
  .next()
  .expect("Failed to find anchor tag");

let parser_mut = dom.parser_mut();

let anchor = anchor.get_mut(parser_mut)
  .expect("Failed to resolve node")
  .as_tag_mut()
  .expect("Failed to cast Node to HTMLTag");

let attributes = anchor.attributes_mut();

attributes.get_mut("href")
  .flatten()
  .expect("Attribute not found or malformed")
  .set("http://localhost/about");

assert_eq!(attributes.get("href").flatten(), Some(&"http://localhost/about".into()));

SIMD-accelerated parsing

This crate contains utility functions, used by the parser, that make use of SIMD (e.g. finding a specific byte by examining the next 16 bytes at once instead of walking the string one byte at a time). They are disabled by default and must be enabled explicitly with the simd feature flag, because they depend on the unstable portable_simd feature. This requires a nightly compiler!

If the simd feature is not enabled, the parser falls back to stable alternatives that don't explicitly use SIMD intrinsics but are still reasonably well optimized. They rely on techniques such as manual loop unrolling, which reduces bounds checks and other branches by a factor of 16. This also helps LLVM optimize the code further and potentially generate SIMD instructions on its own.
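To make the fallback idea concrete, here is a minimal sketch of a byte search that processes 16 bytes per step on stable Rust. It is an illustration only: find_byte_chunked is a hypothetical helper written for this README, not a function exported by tl_yargl, and the crate's actual implementation may differ.

```rust
/// Hypothetical helper (not part of tl_yargl's API): returns the index of the
/// first occurrence of `needle` in `haystack`, scanning 16 bytes per step.
fn find_byte_chunked(haystack: &[u8], needle: u8) -> Option<usize> {
    const LANES: usize = 16;

    let mut chunks = haystack.chunks_exact(LANES);
    let mut offset = 0;

    for chunk in chunks.by_ref() {
        // Branch-free inner loop over a fixed-size chunk: there is no early
        // exit, so the compiler is free to unroll it and may vectorize it.
        let mut hit = false;
        for &byte in chunk {
            hit |= byte == needle;
        }

        if hit {
            // Only now locate the exact position inside this chunk.
            return chunk
                .iter()
                .position(|&byte| byte == needle)
                .map(|i| offset + i);
        }
        offset += LANES;
    }

    // Fewer than 16 bytes remain: fall back to a plain scan.
    chunks
        .remainder()
        .iter()
        .position(|&byte| byte == needle)
        .map(|i| offset + i)
}

fn main() {
    let html = b"<div><p id=\"text\">Hello</p></div>";
    // The first '>' closes the opening <div> tag at index 4.
    assert_eq!(find_byte_chunked(html, b'>'), Some(4));
}
```

The per-chunk check has one branch for every 16 bytes rather than one per byte, which is the effect described above; whether the compiler actually emits SIMD instructions depends on the target and optimization level.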

Benchmarks

Results for parsing a ~320KB HTML document. Benchmarked using criterion.

Note: Some HTML parsers listed closely follow the specification while others don't, which greatly impacts performance as the specification limits what can and can't be done. Comparing the performance of a parser that doesn't follow the specification to one that does isn't fair and doesn't yield meaningful results, but it can be interesting to see what the theoretical difference is.

              time            thrpt             follows spec
tl¹           629.78 us       496.65 MiB/s      ❌
lol_html      788.91 us       396.47 MiB/s      ✅
htmlstream    2.2786 ms       137.48 MiB/s      ❌
html5ever     6.2233 ms       50.276 MiB/s      ✅

¹ - simd feature enabled
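
The throughput column follows from the input size and the measured time (throughput = size / time). The sketch below shows that arithmetic. The byte count is an assumption back-computed from the first table row (629.78 us at 496.65 MiB/s gives roughly 0.3128 MiB); the exact size of the benchmark document is not stated here, and throughput_mib_per_s is a hypothetical helper, not part of criterion or tl_yargl.

```rust
use std::time::Duration;

/// Hypothetical helper: throughput in MiB/s for `size_bytes` of input
/// processed in `elapsed`.
fn throughput_mib_per_s(size_bytes: u64, elapsed: Duration) -> f64 {
    const MIB: f64 = 1024.0 * 1024.0;
    (size_bytes as f64 / MIB) / elapsed.as_secs_f64()
}

fn main() {
    // ASSUMPTION: input size back-computed from the table, not a measured value.
    let size_bytes: u64 = 327_974;

    // 629.78 us, the time reported for tl in the table above.
    let tl_time = Duration::from_nanos(629_780);

    let thrpt = throughput_mib_per_s(size_bytes, tl_time);
    println!("{thrpt:.2} MiB/s");
}
```

Under that assumed size, the same formula applied to the other rows gives values close to their reported throughput, which suggests all parsers were measured on the same input.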

Contributors

bors[bot], dz1230, kelko, ldspits, mehmetcansahin, oovm, y21
