html-extract / hext

Domain-specific language for extracting structured data from HTML documents

Home Page: https://hext.thomastrapp.com

License: Apache License 2.0


hext's Introduction

Hext — Extract Data from HTML


Hext is a domain-specific language for extracting structured data from HTML documents.

Hext is written in C++, but language bindings are available for Python, Node.js, JavaScript, Ruby, and PHP.


See https://hext.thomastrapp.com for documentation, installation instructions and a live demo.

The Hext project is released under the terms of the Apache License v2.0.

Example

Suppose you want to extract all hyperlinks from a web page. Hyperlinks have an anchor tag <a>, an href attribute, and text that visitors can click. The following Hext template produces a dictionary for every matched element. Each dictionary contains the keys link and title, which refer to the href attribute and the text content of the matched <a>.

# Extract links and their text
<a href:link @text:title />

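For intuition, here is the same extraction sketched with Python's standard library instead of Hext — a plain-Python illustration of the output shape the template above produces, not the Hext API itself:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects {'link': href, 'title': text} for every <a> element,
    mirroring the output of the Hext template above."""
    def __init__(self):
        super().__init__()
        self.results = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        # Only collect text while inside an <a> that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.results.append({'link': self._href,
                                 'title': ''.join(self._text).strip()})
            self._href = None

parser = LinkExtractor()
parser.feed('<a href="https://example.com">Example</a>')
print(parser.results)  # [{'link': 'https://example.com', 'title': 'Example'}]
```

Hext does all of this (and much more, such as nesting and attribute matching) with the one-line template above.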

Visit Hext's project page to learn more about Hext. For examples that use the libhext C++ library, check out /libhext/examples and libhext's C++ library overview.

Components of this Project

  • htmlext: Command line utility that applies Hext templates to an HTML document and produces JSON.
  • libhext: C++ library that contains a Hext parser but also allows for customization.
  • libhext-test: Unit tests for libhext.
  • Hext bindings: Bindings for scripting languages. There are extensions for Node.js, Python, Ruby and PHP that are able to parse Hext and extract values from HTML.

Project layout

├── build             Build directory for htmlext
├── cmake             CMake modules used by the project
├── htmlext           Source for the htmlext command line tool
├── libhext           The libhext project
│   ├── bindings      Hext bindings for scripting languages
│   ├── build         Build directory for libhext
│   ├── doc           Doxygen documentation for libhext
│   ├── examples      Examples making use of libhext
│   ├── include       Public libhext API
│   ├── ragel         Ragel input files
│   ├── scripts       Helper scripts for libhext
│   ├── src           libhext implementation files
│   └── test          The libhext-test project
│       ├── build     Build directory for libhext-test
│       └── src       Source for libhext-test
├── man               Htmlext man page
├── scripts           Scripts for building and testing releases
├── syntaxhl          Syntax highlighters for Vim and ACE
└── test              Blackbox tests for htmlext

Dependencies for development

  • Ragel generates the state machine that is used to parse Hext
  • The unit tests for libhext are written with Google Test
  • libhext's public API documentation is generated by Doxygen
  • libhext's scripting language bindings are generated by Swig

Tests

There are unit tests for libhext and blackbox tests for Hext as a language; their main purpose is to detect unwanted changes in syntax or behavior.
The libhext-test project is located in /libhext/test and depends on Google Test. Nothing fancy, just build the project and run the executable libhext-test. How to write test cases with Google Test is described here.
The blackbox tests are located in /test. There you'll find a shell script called blackbox.sh. This script applies Hext templates to HTML documents and compares the result to a third file that contains the expected output. For example, there is a test case icase-quoted-regex that consists of three files: icase-quoted-regex.hext, icase-quoted-regex.html, and icase-quoted-regex.expected. To run this test case you would do the following:

$ ./blackbox.sh case/icase-quoted-regex.hext

blackbox.sh then looks for the corresponding .html and .expected files in the directory of icase-quoted-regex.hext, invokes htmlext with the given Hext template and HTML document, and compares the result to icase-quoted-regex.expected. To run all blackbox tests in succession:

$ ./blackbox.sh case/*.hext

By default blackbox.sh will look for the htmlext binary in $PATH. Failing that, it looks for the binary in the default build directory. You can tell blackbox.sh which command to use by setting HTMLEXT. For example, to run all tests through valgrind you'd run the following:

$ HTMLEXT="valgrind -q ../build/htmlext" ./blackbox.sh case/*.hext
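The core of each blackbox check can be sketched as a small shell function (a hypothetical simplification; the real blackbox.sh handles more details such as argument checking and reporting):

```shell
run_case() {
  # Derive the .html and .expected paths from the .hext template path,
  # run htmlext (or whatever $HTMLEXT is set to, left unquoted so that
  # multi-word commands like "valgrind -q htmlext" work), and diff the
  # output against the expected result.
  base="${1%.hext}"
  if ${HTMLEXT:-htmlext} "$1" "$base.html" | diff -q - "$base.expected" >/dev/null; then
    echo "PASS $1"
  else
    echo "FAIL $1"
  fi
}
```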

Acknowledgements

  • Gumbo: An HTML5 parsing library in pure C99
    Gumbo is used as the HTML parser behind hext::Html. It's fast, easy to integrate and even fixes invalid HTML.
  • Ragel: Ragel State Machine Compiler
    The state machine that is used to parse Hext templates is generated by Ragel. You can find the definition of this machine in /libhext/ragel/hext-machine.rl.
  • RapidJSON: A fast JSON parser/generator for C++
    RapidJSON powers the JSON output of the htmlext command line utility.
  • jq: A lightweight and flexible command-line JSON processor
    An indispensable tool when dealing with JSON in the shell. Piping the output of htmlext into jq lets you do all sorts of crazy things.
  • Ace: A Code Editor for the Web
    Used as the code editor in the "Try Hext in your Browser!" section and as a highlighter for all code examples. The highlighting rules for Hext are included in this project in /syntaxhl/ace. Also, there's a script in /libhext/scripts/syntax-hl-ace that uses Ace to transform a code template into highlighted HTML.
  • Boost.Beast: HTTP and WebSocket built on Boost.Asio in C++11
    The Websocket server behind the "Try Hext in your Browser!" section is built with Beast. See github.com/html-extract/hext-on-websockets for more.

hext's People

Contributors

brandonrobertz, thomastrapp


hext's Issues

hext python module leaks memory

This is just one example of a serious and common memory leak caused by hext. It is not intended to be the only example; any usage will probably leak memory.

import importlib.metadata
import os
import sys
import tracemalloc

import hext

tracemalloc.start(25)

print(f"Python version: {sys.version.replace(os.linesep, '')}")
print(f"hext version: {importlib.metadata.version('hext')}")

RULE = hext.Rule("<html @text:text />")

def work():
    RULE.extract(hext.Html(''))

for i in range(12345):
    work()
    if i % 1234 == 0:
        print(f"{i=:05} {tracemalloc.take_snapshot().statistics('filename')[0]}")

Output:

Python version: 3.8.3 (default, May 14 2020, 20:11:43) [GCC 7.5.0]
hext version: 0.2.4
i=00000 <frozen importlib._bootstrap_external>:0: size=655 KiB, count=7621, average=88 B
i=01234 <frozen importlib._bootstrap_external>:0: size=612 KiB, count=7137, average=88 B
i=02468 <frozen importlib._bootstrap_external>:0: size=609 KiB, count=7102, average=88 B
i=03702 /myenv/lib/python3.8/site-packages/hext/__init__.py:0: size=838 KiB, count=7403, average=116 B
i=04936 /myenv/lib/python3.8/site-packages/hext/__init__.py:0: size=1117 KiB, count=9869, average=116 B
i=06170 /myenv/lib/python3.8/site-packages/hext/__init__.py:0: size=1396 KiB, count=12335, average=116 B
i=07404 /myenv/lib/python3.8/site-packages/hext/__init__.py:0: size=1675 KiB, count=14801, average=116 B
i=08638 /myenv/lib/python3.8/site-packages/hext/__init__.py:0: size=1955 KiB, count=17267, average=116 B
i=09872 /myenv/lib/python3.8/site-packages/hext/__init__.py:0: size=2234 KiB, count=19733, average=116 B
i=11106 /myenv/lib/python3.8/site-packages/hext/__init__.py:0: size=2513 KiB, count=22199, average=116 B
i=12340 /myenv/lib/python3.8/site-packages/hext/__init__.py:0: size=2793 KiB, count=24667, average=116 B

Process finished with exit code 0

As shown above, the memory consumed by hext keeps increasing for no good reason. This doesn't necessarily mean the bug is in hext's own source code; for all I know, it could be in the layer used to interface Python with the underlying language.

Add native releases for Mac OS on M1/M2

Currently there are no releases for Mac OS on the M1/M2 architecture, i.e. npm install hext and pip install hext will find no suitable release.

Popularity of M1 and M2 will increase over time and therefore Hext should provide releases for the new Apple hardware.

Workarounds

Hext does support ARM64, but unfortunately it must be compiled from source.

Another alternative is Hext.js, a JavaScript/WebAssembly module that runs on Node on any architecture (Documentation).

Install hext.js:

$ npm install hext.js

Example application test.js:

const loadHext = require('hext.js');

loadHext().then(hext => {
  const html = new hext.Html("<ul><li>Hello</li><li>World</li></ul>");
  const rule = new hext.Rule("<li @text:my_text />");
  const result = rule.extract(html).map(x => x.my_text).join(", ");
  console.log(result); // "Hello, World"
});
$ node test.js
Hello, World

Build fails with Boost 1.70

Temporary workaround:
When calling cmake, add -DBoost_NO_BOOST_CMAKE=On.

Todo:

  • Remove old-style cmake idioms and replace them with modern cmake
  • Make sure that the build succeeds with either CMake's FindBoost or Boost's FindBoost

Fix clang warnings 'inconsistent-missing-destructor-override'

Example:

[...]/AttributeMatch.h:70:3: warning: '~AttributeMatch' overrides a destructor but is
      not marked 'override' [-Winconsistent-missing-destructor-override]
  ~AttributeMatch() noexcept = default;
  ^
[...]/Cloneable.h:36:19: note: overridden virtual function is here
class HEXT_PUBLIC Cloneable : public Base
                  ^

Help transforming result from fast.ai/topics

I can scrape the entries at https://www.fast.ai/topics/ using the hext:

<h2><span id="technical"/></h2>
<ul><li><a href:link @text:title/></li></ul>

I get the result as a Dict[str, List[str]]. However, I want the result as the typical List[Dict[str, str]] which I have come to expect. Is this feasible?

For example, instead of:

{
  "link": [
    "/2020/02/13/fastai-A-Layered-API-for-Deep-Learning/",
    "/2020/01/20/blog_overview/",
  ],
  "title": [
    "fastai—A Layered API for Deep Learning",
    "Your own blog with GitHub Pages and fast_template (4 part tutorial)",
  ]
}

I want:

[
    {"link": "/2020/02/13/fastai-A-Layered-API-for-Deep-Learning/", "title": "fastai—A Layered API for Deep Learning"},
    {"link":  "/2020/01/20/blog_overview/", "title": "Your own blog with GitHub Pages and fast_template (4 part tutorial)"}
]

Thanks.
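As a workaround until the template can be restructured, the flattened result can be reshaped in plain Python (a sketch that assumes all value lists have equal length, i.e. every key matched the same number of times):

```python
def rows(columns):
    """Convert a dict of parallel lists, e.g. {'link': [...], 'title': [...]},
    into a list of per-row dicts: [{'link': ..., 'title': ...}, ...]."""
    keys = list(columns)
    return [dict(zip(keys, values))
            for values in zip(*(columns[k] for k in keys))]

data = {"link": ["/a", "/b"], "title": ["A", "B"]}
print(rows(data))  # [{'link': '/a', 'title': 'A'}, {'link': '/b', 'title': 'B'}]
```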

Matching any of multiple tags

I currently have:

<body>
    { <p @text:content /> }
</body>

Obviously, this matches all p tags in body at any level. However, I want something like:

<body>
    { <p|h[1-6] @text:content /> }
</body>

or more explicitly:

<body>
    { <p|h1|h2|h3|h4|h5|h6 @text:content /> }
</body>

I mean I also want to match h1 through h6, not just p, but no other tags. This doesn't seem to be supported by hext at this time. It is an important and urgent use case for me: extracting text from HTML articles for machine learning purposes. Is there any way to do this?

Currently, to use hext for this purpose, I first have to replace all h1-h6 tags with p tags via string manipulation, which is hacky and error-prone.
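That string-replacement workaround can at least be made less fragile with a regular expression (a sketch; it can still touch text that merely looks like a heading tag, e.g. inside script blocks):

```python
import re

def headings_to_p(html):
    # Rewrite <h1>..</h6> open and close tags to <p>/</p> so that a
    # single <p @text:content /> rule also matches headings. Requires
    # whitespace or '>' after the tag name so <header> is untouched.
    return re.sub(r'(</?)[hH][1-6]([\s>])', r'\1p\2', html)

print(headings_to_p('<h2 class="t">Hi</h2><p>Body</p>'))
# <p class="t">Hi</p><p>Body</p>
```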

Suggestion of 10000 for max_searches is too low

The suggestion of 10_000 for max_searches looks to be too low, resulting in errors for extractions that used to work fine in the past. I increased it to 100_000 and have seen no errors yet.

I often use the website https://hext.thomastrapp.com/ for initial testing. I am seeing this error on it: "The allowed amount of searches was exhausted". Can its value of max_searches be increased?

Fwiw, I was trying it against the copy-pasted source of https://ai.facebook.com/blog/ with the following:

<div>
{
<div>
<h4 @text:title/>
</div>
<div>
<p @text:summary/>
</div>
}

Create a debian package

Users need a way to install hext without the hassle of compiling from source.

As per debian packaging rules the project needs to be split up into at least two packages:

  • libhext
  • htmlext

Language bindings also need a package of their own, e.g. libhextpy.

Ideally, the packages will be accepted into the Debian repository and then automatically be mirrored by Debian derivatives, such as Ubuntu.

Edit: See https://github.com/thomastrapp/hext/releases/tag/v0.7.0 for the newly created (binary) Debian packages.

"All siblings of type" issue

I've been playing with a bunch of extractors and I encountered an issue that has confused me a bit. I'm playing with this DOM:

<li class="item event" >
  <div class="col-12 col-sm-2 event-type" >
    <h5 >
      Special Event
    </h5>
  </div>
    <div class="col-12 col-sm-7 item-content event-content" >
      <h3 class="title item-title event-title" >
        <a href="/events-and-training/event/3433/4377/" >Conference registration (Wednesday)</a>
      </h3>
        <p >Wednesday is a registration day.</p>
        <p >No talks scheduled.</p>
        <p ></p>
    </div>
  <div class="col-12 col-sm-3 item-meta event-meta" >
    <h4 class="event-location" >
      Salon EF
    </h4>
      <p  class="">
      3:00 pm - 6:00 pm
      </p>
  </div>
</li>

<li class="item event" >
  <div class="col-12 col-sm-2 event-type" >
    <h5 >
      Special Event
    </h5>
  </div>
    <div class="col-12 col-sm-7 item-content event-content" >
      <h3 class="title item-title event-title" >
        <a href="/events-and-training/event/3433/4378/">Conference sales (Wednesday)</a>
      </h3>
      <p ></p>
      <p >Stop by the conference sales table and browse our merchandise.</p>
      <p ></p>
    </div>
  <div class="col-12 col-sm-3 item-meta event-meta" >
    <h4 class="event-location" >
      Salon EF
    </h4>
      <p >
      3:00 pm - 6:00 pm
      </p>
  </div>
</li>

From it, I am looking to get a JSON representation like this:

{
    "BODY": [
        "Wednesday is a registration day.",
        "No talks scheduled.",
        ""
    ],
    "TITLE": "Conference registration (Wednesday)"
}
{
    "BODY": [
        "",
        "Stop by the conference sales table and browse our merchandise.",
        ""
    ],
    "TITLE": "Conference sales (Wednesday)"
}

My first thought was this:

<DIV >
  <h3><a @text:TITLE /></h3>
  <p @text:BODY />
</DIV>

But I only get the first p tag; the others are ignored:

{
    "BODY": "Wednesday is a registration day.",
    "TITLE": "Conference registration (Wednesday)"
}
{
    "BODY": "",
    "TITLE": "Conference sales (Wednesday)"
}

I attempted to use CSS nth-child selectors, but those seem to allow only a single reference (ranges like n+2 will only grab the second child, ignoring the rest):

# nth-child(n+2) throws an error!

The only way I seem to be able to get all of the p tags under a div into the BODY array is by omitting the h3 tag:

<DIV ><p @text:BODY /></DIV>

Is this expected behavior? Is there a template I haven't thought of that can get both the h3 text and an array of the sibling p tags under a div?

Thanks a lot!

updated release

I saw you updated this for Python 3.9 in #18. Has that version been posted to PyPI?

GitHub latest release: v0.8.3 (2019)
PyPI latest release: v0.2.5 (2020)
npm latest release: v10.0.5

Python pip installation

It would really help if hext were pip-installable for Python; that would greatly simplify installation and use. I mean having it and its Python 3.6+ bindings published as a package on PyPI, covering both command-line and Python use.

This request is relevant because a lot of data analysis work these days is done via online notebooks, where pip installation is the only way to install packages.

Sync GitHub release version number with PyPI release version number

Congratulations on v1.0.0!

For future releases, is it possible for the GitHub release version number to match the one in PyPI? Currently the two seem very much out of sync, and this makes it confusing. They're v1.0.0 and 0.3.0 respectively.

This doesn't mean that the two always need to be released together in sequence. A patch increment doesn't strictly need to match. Only the major and minor increments ideally need to match.

Thanks.

Move Hext's website to its own git repository

Move the source of Hext's website to its own git repository to allow for rapid and painless changes to documentation and installation instructions.

Things to consider:

  • Improve automation (e.g. git clone hext-site && ./hext-site/build.sh)
  • Using Github Pages might be a good idea (CDN, uptime, SSL)
  • SSL certificate, e.g. from letsencrypt.org
  • Serve hextserver ("Try Hext in your browser") through its own domain + SSL certificate
  • Update dependencies (Jekyll, ACE, Semantic UI, Proxygen)
  • Cleanup

Edit: See https://github.com/thomastrapp/hext-website

Handling unknown elements?

First off: thank you for creating & maintaining this software!

I am having an issue with custom HTML elements.
My use case involves HTML like this:

<custom-tag>
   <div>Text</div>
</custom-tag>

I am unsure how to approach this!
I get an error when I try the extractor on your homepage:
Error: Unknown HTML tag 'custom-tag' at line 4, char 18:

Python: Improve error messages for argument type mismatch

For example, using Python 3.12, Hext 1.0.8:

import hext
rule = hext.Rule("<a href:link/>")
# Error, the argument for extract is of type string:
results = rule.extract("""<a href="b"></a>""")

Produces an unhelpful error message:

Traceback (most recent call last):
  File "/home/dev/issue-27/issue28-example.py", line 4, in <module>
    results = rule.extract("""<a href="b"></a>""")
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/issue-27/venv/lib/python3.12/site-packages/hext/__init__.py", line 139, in extract
    return _hext.Rule_extract(self, html, max_searches)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Wrong number or type of arguments for overloaded function 'Rule_extract'.
  Possible C/C++ prototypes are:
    Rule::extract(Html const &,std::uint64_t) const
    Rule::extract(Html const &) const

The error message should explicitly state that the wrong type of argument was passed, and that an hext.Html was expected.
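One possible shape for such a check, sketched here with a plain-Python stand-in rather than hext's actual code:

```python
class Html:
    """Stand-in for hext.Html, for illustration only."""

def extract(html):
    # Checking the argument type before dispatching into the SWIG layer
    # turns the opaque overload error into an actionable message.
    if not isinstance(html, Html):
        raise TypeError(
            "Rule.extract() expects an hext.Html argument, got "
            f"{type(html).__name__}; wrap your HTML in hext.Html(...) first")
    return []

try:
    extract('<a href="b"></a>')
except TypeError as error:
    print(error)
```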

TypeError: Wrong number or type of arguments for overloaded function 'Rule_extract'

With Python 3.12, hext.Rule('').extract('') gives the error:

  File "python3.12/site-packages/hext/__init__.py", line 139, in extract
    return _hext.Rule_extract(self, html, max_searches)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Wrong number or type of arguments for overloaded function 'Rule_extract'.
  Possible C/C++ prototypes are:
    Rule::extract(Html const &,std::uint64_t) const
    Rule::extract(Html const &) const

I am of course also getting this error with a more real-life example. At this time I cannot use hext for anything new.

Help scraping entries from Facebook AI Blog

I could use help scraping the links and titles for each result from the source of https://ai.facebook.com/blog/ .

The following gives me the first entry only:

        <a href:prepend("https://ai.facebook.com"):link><span><div><div>
            <h2 @text:title></h2>
        </div></div></span></a>

Example:

{
  "link": "https://ai.facebook.com/blog/powered-by-ai-turning-any-2d-photo-into-3d-using-convolutional-neural-nets/",
  "title": "Powered by AI: Turning any 2D photo into 3D using convolutional neural nets"
}

The following gives me the next two entries:

        <a href:link><div><div><div><div>
            <h4 @text:title></h4>
        </div></div></div></div></a>

Example:

{
  "link": "https://ai.facebook.com/blog/using-radioactive-data-to-detect-if-a-data-set-was-used-for-training/",
  "title": "Using ‘radioactive data’ to detect if a data set was used for training"
}

The following gives me all the correct links but not all the correct titles:

        <a href^="https://ai.facebook.com/blog/" href:link @text:title />

How do I get all results, or is this something that hext is not capable of? Thanks.

Node bindings and npm

Hi,

First off, thank you for this incredibly useful project. I was trying to use your Node bindings and ran into some trouble. It seems that the NaN node module isn't supported in node v10 (which is what I need for a project). The issue I kept running into was similar to this one. It worked fine on node v8.

I was just wondering if getting it to work in node v10 was something you were interested in or had time to work on. For now I will just install it as a CLI and spawn a process from Node. Additionally, are there any plans to put this up on an npm registry? I like cheerio, but the ability to have HTML templates that I can use across languages is incredibly useful.

Thanks again,

Surya

Arbitrary Nested Hext Templates

Hello from the west coast. I've been using hext a ton over the past couple of years, building scrapers for all kinds of journalistic work. For the most part it works great, especially combined with the easy-to-use web hext template builder.

One thing that's been coming up more frequently is sites that either try to frustrate web-scraping attempts with arbitrary obfuscation and garbage divs, or simply have really poor-quality, hand-crafted HTML from small government entities. DOMs that look like this:

<div class="list">
  <div class="item">Heading
    <p>Item one</p>
  </div>

  <div class="item"><div>Title 2</div>
    <p>Item two</p>
  </div>

  <div class="item">Title 3
    <div>
      <div>
        <p>Item Three</p>
      </div>
    </div>
  </div>
</div>

Let's say we want to get the contents of the <p> tag from every item in the list. A nested rule would be the simplest, most robust method to build an extractor for such DOMs, IMO.

I still think hext is the most change-resistant way to build extractors for websites, but the lack of a way to specify arbitrary-depth ancestor matching rules limits the ways hext can be used directly.

This has been discussed in other issues: #15 and #16

I really feel like this is the missing link for hext, so I wanted to ask about what would be required to get something like this working. I'd be willing to try and get some work done in this direction if I could get pointed in the right direction. The solution posed in #15 really seems like the ideal method:

<a href:link>
  # "inner" hext template (enclosed in braces) that is
  # contained anywhere inside an <a>:
  {
    <h1 @text:heading />
  }
</a>

Is this a grammar change? Or do we accomplish this via code, extracting bracketed hext templates and applying them recursively to the results? Some other way?

Here's to Hext in 2022! 🥂

Create a MacOS package

I am interested in using Hext, but there is no distribution for Mac OS X, so I can't install it using pip.
