html-extract / hext

Domain-specific language for extracting structured data from HTML documents

License: Apache License 2.0

CMake 4.14% C++ 71.05% HTML 5.83% Shell 8.65% JavaScript 1.83% C 0.68% Python 1.27% PHP 0.30% Ruby 0.24% Ragel 4.00% Dockerfile 0.01% SWIG 1.59% Vim Script 0.42%

cpp html-extraction scraping html dsl data-extraction python php ruby node

hext's Issues

Sync GitHub release version number with PyPI release version number

Congratulations on v1.0.0!

For future releases, is it possible for the GitHub release version number to match the one in PyPI? Currently the two seem very much out of sync, and this makes it confusing. They're v1.0.0 and 0.3.0 respectively.

This doesn't mean that the two always need to be released together in sequence. A patch increment doesn't strictly need to match. Only the major and minor increments ideally need to match.

Thanks.

Use GitHub organization for hext repos

Please consider using a free GitHub organization for the various public hext related repos that are currently under your account. Thanks.

hext python module leaks memory

This is just one example of a serious and common memory leak caused by hext. It is not intended to be the only example, although any example will probably leak memory.

import importlib.metadata
import os
import sys
import tracemalloc

import hext

tracemalloc.start(25)

print(f"Python version: {sys.version.replace(os.linesep, '')}")
print(f"hext version: {importlib.metadata.version('hext')}")

RULE = hext.Rule("<html @text:text />")

def work():
    RULE.extract(hext.Html(''))

for i in range(12345):
    work()
    if i % 1234 == 0:
        print(f"{i=:05} {tracemalloc.take_snapshot().statistics('filename')[0]}")

Output:

Python version: 3.8.3 (default, May 14 2020, 20:11:43) [GCC 7.5.0]
hext version: 0.2.4
i=00000 <frozen importlib._bootstrap_external>:0: size=655 KiB, count=7621, average=88 B
i=01234 <frozen importlib._bootstrap_external>:0: size=612 KiB, count=7137, average=88 B
i=02468 <frozen importlib._bootstrap_external>:0: size=609 KiB, count=7102, average=88 B
i=03702 /myenv/lib/python3.8/site-packages/hext/__init__.py:0: size=838 KiB, count=7403, average=116 B
i=04936 /myenv/lib/python3.8/site-packages/hext/__init__.py:0: size=1117 KiB, count=9869, average=116 B
i=06170 /myenv/lib/python3.8/site-packages/hext/__init__.py:0: size=1396 KiB, count=12335, average=116 B
i=07404 /myenv/lib/python3.8/site-packages/hext/__init__.py:0: size=1675 KiB, count=14801, average=116 B
i=08638 /myenv/lib/python3.8/site-packages/hext/__init__.py:0: size=1955 KiB, count=17267, average=116 B
i=09872 /myenv/lib/python3.8/site-packages/hext/__init__.py:0: size=2234 KiB, count=19733, average=116 B
i=11106 /myenv/lib/python3.8/site-packages/hext/__init__.py:0: size=2513 KiB, count=22199, average=116 B
i=12340 /myenv/lib/python3.8/site-packages/hext/__init__.py:0: size=2793 KiB, count=24667, average=116 B

Process finished with exit code 0

As shown above, the memory consumed by hext keeps increasing for no good reason. This doesn't necessarily have to be due to bug(s) in hext's source code. For all I know, the bug(s) could exist in what is used to interface Python with another language.

Consider switching gumbo-parser upstream to codeberg.org/gumbo-parser/gumbo-parser

https://github.com/google/gumbo-parser is no longer maintained.
Arch has already picked up https://codeberg.org/gumbo-parser/gumbo-parser as the new upstream source.

updated release

I saw you updated this for Python 3.9 (#18).

Has that version been posted to PyPI?

The latest GitHub release is v0.8.3, from 2019.
The latest PyPI release is v0.2.5, from 2020.
The latest npm release is v10.0.5.

Use Github Actions for automated Hext releases for Python on Mac OS

https://github.com/actions/virtual-environments/blob/main/images/macos/macos-10.15-Readme.md
https://github.com/actions/virtual-environments/blob/main/images/macos/macos-11-Readme.md

Move Hext's website to its own git repository

Move the source of Hext's website to its own git repository to allow for rapid and painless changes to documentation and installation instructions.

Things to consider:

Improve automation (e.g. git clone hext-site && ./hext-site/build.sh)
Using Github Pages might be a good idea (CDN, uptime, SSL)
SSL certificate, e.g. from letsencrypt.org
Serve hextserver ("Try Hext in your browser") through its own domain + SSL certificate
Update dependencies (Jekyll, ACE, Semantic UI, Proxygen)
Cleanup

Edit: See https://github.com/thomastrapp/hext-website

Node: Passing the wrong type to `rule.extract` causes a segmentation fault

var hext = require('hext');

var rule = new hext.Rule('<a href:link/>');

// rule.extract expects an object of type hext.Html
var result = rule.extract({}); // raises SIGSEGV

Python pip installation

It would really help if hext were pip-installable, which would simplify both installation and use. By that I mean publishing hext and its Python 3.6+ bindings as a package on PyPI. This applies to both command-line and Python use.

This request is relevant because a lot of data analysis work these days is done via online notebooks, where only pip installation of packages is an option.

Add native releases for Mac OS on M1/M2

Currently there are no releases for Mac OS on the M1/M2 architecture, i.e. npm install hext and pip install hext will find no suitable release.

Popularity of M1 and M2 will increase over time and therefore Hext should provide releases for the new Apple hardware.

Workarounds

Hext does support ARM64, but unfortunately must be compiled from source.

Another alternative is to use Hext.js.
Hext.js is a JavaScript/WebAssembly module that runs on Node on any architecture (Documentation)

Install hext.js:

$ npm install hext.js

Example application test.js:

const loadHext = require('hext.js');

loadHext().then(hext => {
  const html = new hext.Html("<ul><li>Hello</li><li>World</li></ul>");
  const rule = new hext.Rule("<li @text:my_text />");
  const result = rule.extract(html).map(x => x.my_text).join(", ");
  console.log(result); // "Hello, World"
});

$ node test.js
Hello, World

Build fails with Boost 1.70

Temporary workaround:
When calling cmake, add -DBoost_NO_BOOST_CMAKE=On.

Todo:

Remove old-style cmake idioms and replace them with modern cmake
Make sure that the build succeeds with either CMake's FindBoost or Boost's FindBoost

Add wheels for Python 3.9

https://www.python.org/downloads/release/python-390/

When done, remove CI skip for version 3.9 introduced in commit b858d58

Fix clang warnings 'inconsistent-missing-destructor-override'

Example:

[...]/AttributeMatch.h:70:3: warning: '~AttributeMatch' overrides a destructor but is
      not marked 'override' [-Winconsistent-missing-destructor-override]
  ~AttributeMatch() noexcept = default;
  ^
[...]/Cloneable.h:36:19: note: overridden virtual function is here
class HEXT_PUBLIC Cloneable : public Base
                  ^

Node bindings and npm

Hi,

First off, thank you for this incredibly useful project. I was trying to use your Node bindings and ran into some trouble. It seems that the NaN node module isn't supported in Node v10 (which is what I need for a project). The issue I kept running into was similar to this. It totally worked on Node v8.

I was just wondering if getting it to work in Node v10 was something you were interested in or had time to work on. For now I will just install it as a CLI and spawn a process in Node. Additionally, are there any plans to put this up on an npm registry? I like cheerio, but the ability to have HTML templates that I can use across languages is incredibly useful.

Thanks again,

Surya

Create a MacOS package

I am interested in using Hext, but there is no distribution for Mac OS X. As such, I can't install it using Pip.

Handling unknown elements?

First off: thank you for creating & maintaining this software!

I am having an issue with custom HTML elements.
My use case involves HTML like this:

<custom-tag>
   <div>Text</div>
</custom-tag>

I am unsure how to approach this!
I get an error when I try the extractor on your homepage:
Error: Unknown HTML tag 'custom-tag' at line 4, char 18:

Consider providing Nim package

I have lately been dabbling in Nim. It transpiles to C/C++, which is then compiled, and its syntax is partly Python-like.

If hext is implemented in C++, it should be possible to make it interoperable from Nim. A package which may significantly assist in doing so is https://github.com/nimterop/nimterop. Ideally a Nim package can then be published to https://nimble.directory/. Thanks.

Create a debian package

Users need a way to install hext without the hassle of compiling from source.

As per debian packaging rules the project needs to be split up into at least two packages:

libhext
htmlext

Language bindings also need a package of their own, e.g. libhextpy.

Ideally, the packages will be accepted into the Debian repository and then automatically be mirrored by Debian derivatives, such as Ubuntu.

Edit: See https://github.com/thomastrapp/hext/releases/tag/v0.7.0 for the newly created (binary) Debian packages.

Add pip package for Python v3.8

Python 3.8 was released in October 2019: https://www.python.org/downloads/release/python-380/

Match unknown tags in a case-insensitive manner

Currently <custom /> will not match <CUSTOM></CUSTOM>.

The comparison is done here:

hext/libhext/src/Rule.cpp

Lines 213 to 214 in e5d504d

const bool tagname_matches = std::equal(tag_begin, tag_end,
                                        node_tag_begin, node_tag_end);
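One way to make such a comparison case-insensitive is to fold ASCII letters to lowercase on both sides before comparing. The sketch below illustrates the idea in Python rather than in hext's C++ and is not hext's implementation; `ascii_lower` and `tag_names_match` are hypothetical helper names.

```python
def ascii_lower(name: str) -> str:
    """Lowercase only the ASCII letters A-Z; leave every other character untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in name)


def tag_names_match(rule_tag: str, node_tag: str) -> bool:
    """Compare two tag names while ignoring ASCII case (hypothetical helper)."""
    return ascii_lower(rule_tag) == ascii_lower(node_tag)


print(tag_names_match("custom", "CUSTOM"))      # True
print(tag_names_match("custom", "custom-tag"))  # False
```

Folding only A-Z avoids surprises from Unicode-aware lowercasing, which can map non-ASCII characters onto ASCII letters.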

Arbitrary Nested Hext Templates

Hello from the west coast. So I've been using hext a ton over the past couple years now, building scrapers for all kinds of journalistic work. For the most part it works great, especially combined with the easy-to-use web hext template builder.

One thing that's been coming up more frequently is sites that either try to frustrate web scraping attempts by arbitrary obfuscation/garbage divs, or sometimes just really poor quality/hand-crafted HTML from small government entities. DOMs that look like this:

<div class="list">
  <div class="item">Heading
    <p>Item one</p>
  </div>

  <div class="item"><div>Title 2</div>
    <p>Item two</p>
  </div>

  <div class="item">Title 3
    <div>
      <div>
        <p>Item Three</p>
      </div>
    </div>
  </div>
</div>

Let's say we want to get the contents of the <p> tag from every item in the list. A nested rule would be the simplest, most robust method to build an extractor for such DOMs, IMO.

I still think hext is the most change-resistant way to build extractors for websites, but the lack of a way to specify arbitrary-depth ancestor matching rules limits the ways hext can be used directly.

This has been discussed in other issues: #15 and #16

I really feel like this is the missing link for hext, so I wanted to ask about what would be required to get something like this working. I'd be willing to try and get some work done in this direction if I could get pointed in the right direction. The solution posed in #15 really seems like the ideal method:

<a href:link>
  # "inner" hext template (enclosed in braces) that is
  # contained anywhere inside an <a>:
  {
    <h1 @text:heading />
  }
</a>

Is this a grammar change? Or do we accomplish this via code, extracting bracketed hext templates and applying them recursively to the results? Some other way?

Here's to Hext in 2022! 🥂

Help transforming result from fast.ai/topics

I can scrape the entries at https://www.fast.ai/topics/ using the hext:

<h2><span id="technical"/></h2>
<ul><li><a href:link @text:title/></li></ul>

I get the result as a Dict[str, List[str]]. However, I want the result as a List[Dict[str, str]], which is the shape I have come to expect. Is this feasible?

For example, instead of:

{
  "link": [
    "/2020/02/13/fastai-A-Layered-API-for-Deep-Learning/",
    "/2020/01/20/blog_overview/",
  ],
  "title": [
    "fastai—A Layered API for Deep Learning",
    "Your own blog with GitHub Pages and fast_template (4 part tutorial)",
  ]
}

I want:

[
    {"link": "/2020/02/13/fastai-A-Layered-API-for-Deep-Learning/", "title": "fastai—A Layered API for Deep Learning"},
    {"link":  "/2020/01/20/blog_overview/", "title": "Your own blog with GitHub Pages and fast_template (4 part tutorial)"}
]

Thanks.
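Independently of hext, a result in the first shape can be converted to the second in plain Python. This is a minimal sketch that assumes all lists are aligned by position and have equal length; `columns_to_rows` is a hypothetical helper name, not a hext function.

```python
from typing import Dict, List


def columns_to_rows(columns: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Turn {"link": [...], "title": [...]} into [{"link": ..., "title": ...}, ...].

    zip() stops at the shortest list, so unequal lengths silently drop the
    trailing items of the longer lists.
    """
    keys = list(columns)
    return [dict(zip(keys, values)) for values in zip(*(columns[k] for k in keys))]


result = {
    "link": ["/2020/02/13/fastai-A-Layered-API-for-Deep-Learning/", "/2020/01/20/blog_overview/"],
    "title": ["fastai—A Layered API for Deep Learning", "Your own blog with GitHub Pages and fast_template (4 part tutorial)"],
}
rows = columns_to_rows(result)
print(rows[0]["title"])
```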

Help scraping entries from Facebook AI Blog

I could use help scraping the links and titles for each result from the source of https://ai.facebook.com/blog/ .

The following gives me the first entry only:

        <a href:prepend("https://ai.facebook.com"):link><span><div><div>
            <h2 @text:title></h2>
        </div></div></span></a>

Example:

{
  "link": "https://ai.facebook.com/blog/powered-by-ai-turning-any-2d-photo-into-3d-using-convolutional-neural-nets/",
  "title": "Powered by AI: Turning any 2D photo into 3D using convolutional neural nets"
}

The following gives me the next two entries:

        <a href:link><div><div><div><div>
            <h4 @text:title></h4>
        </div></div></div></div></a>

Example:

{
  "link": "https://ai.facebook.com/blog/using-radioactive-data-to-detect-if-a-data-set-was-used-for-training/",
  "title": "Using ‘radioactive data’ to detect if a data set was used for training"
}

The following gives me all the correct links but not all the correct titles:

        <a href^="https://ai.facebook.com/blog/" href:link @text:title />

How do I get all results, or is this something that hext is not capable of? Thanks.

Matching any of multiple tags

I currently have:

<body>
    { <p @text:content /> }
</body>

Obviously, this matches all p tags in body at any level. However, I want something like:

<body>
    { <p|h[1-6] @text:content /> }
</body>

or more explicitly:

<body>
    { <p|h1|h2|h3|h4|h5|h6 @text:content /> }
</body>

I mean I also want to match h1 through h6, not just p. This doesn't seem to be supported by hext at this time. This is an important and urgent use case for me for extracting text from an HTML article for machine learning purposes. I don't however want to match any other tags at this time. Is there any way to do this?

Currently, to use hext for this purpose, I have to first use a string replacement to replace all h1-h6 tags with p tags, which is a hacky thing to do via string manipulation, risking errors.
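For illustration, the string-replacement workaround described above might look roughly like this. It is a sketch of the hack, not a recommendation: `headings_to_paragraphs` is a hypothetical name, and a regular expression like this can still misfire on unusual markup, for example tag-like text inside comments or scripts.

```python
import re

# Matches the start of an opening or closing h1-h6 tag, e.g. "<h2" or "</H3".
# The \b keeps names such as "<h1x" from matching.
_HEADING_TAG = re.compile(r"<(/?)h[1-6]\b", flags=re.IGNORECASE)


def headings_to_paragraphs(html: str) -> str:
    """Rename h1-h6 tags to p tags, keeping any attributes (hypothetical helper)."""
    return _HEADING_TAG.sub(r"<\1p", html)


print(headings_to_paragraphs("<h2 class='x'>Title</h2><p>Body</p>"))
# <p class='x'>Title</p><p>Body</p>
```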

TypeError: Wrong number or type of arguments for overloaded function 'Rule_extract'

With Python 3.12, hext.Rule('').extract('') gives the error:

  File "python3.12/site-packages/hext/__init__.py", line 139, in extract
    return _hext.Rule_extract(self, html, max_searches)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Wrong number or type of arguments for overloaded function 'Rule_extract'.
  Possible C/C++ prototypes are:
    Rule::extract(Html const &,std::uint64_t) const
    Rule::extract(Html const &) const

I am of course also getting this error with a more real-life example. At this time I cannot use hext for anything new.

Suggestion of 10000 for max_searches is too low

The suggestion of 10_000 for max_searches looks to be too low, resulting in errors for extractions that used to work fine in the past. I increased it to 100_000, with no errors yet.

I often use the website https://hext.thomastrapp.com/ for initial testing. I am seeing this error on it: "The allowed amount of searches was exhausted". Can its value of max_searches be increased?

Fwiw, I was trying it against the copy-pasted source of https://ai.facebook.com/blog/ with the following:

<div>
{
<div>
<h4 @text:title/>
</div>
<div>
<p @text:summary/>
</div>
}
</div>

Python: Improve error messages for argument type mismatch

For example, using Python 3.12, Hext 1.0.8:

import hext
rule = hext.Rule("<a href:link/>")
# Error, the argument for extract is of type string:
results = rule.extract("""<a href="b"></a>""")

Produces an unhelpful error message:

Traceback (most recent call last):
  File "/home/dev/issue-27/issue28-example.py", line 4, in <module>
    results = rule.extract("""<a href="b"></a>""")
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/issue-27/venv/lib/python3.12/site-packages/hext/__init__.py", line 139, in extract
    return _hext.Rule_extract(self, html, max_searches)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Wrong number or type of arguments for overloaded function 'Rule_extract'.
  Possible C/C++ prototypes are:
    Rule::extract(Html const &,std::uint64_t) const
    Rule::extract(Html const &) const

The error message should explicitly state that the wrong type of argument was passed, and that an hext.Html was expected.
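Until the bindings report this themselves, a caller-side guard can produce such a message. The sketch below is not part of hext: `checked_extract` is a hypothetical helper, and the `FakeHtml` and `FakeRule` classes only stand in for `hext.Html` and `hext.Rule` so that the example runs without hext installed.

```python
def checked_extract(rule, html, html_type):
    """Call rule.extract(html), failing early with a readable message on a type mismatch."""
    if not isinstance(html, html_type):
        raise TypeError(
            f"extract() expected an argument of type {html_type.__name__}, "
            f"got {type(html).__name__}"
        )
    return rule.extract(html)


class FakeHtml:
    """Stand-in for hext.Html (assumption, for demonstration only)."""

    def __init__(self, source: str) -> None:
        self.source = source


class FakeRule:
    """Stand-in for hext.Rule (assumption, for demonstration only)."""

    def extract(self, html: FakeHtml) -> list:
        return [{"length": len(html.source)}]


try:
    checked_extract(FakeRule(), '<a href="b"></a>', FakeHtml)
except TypeError as error:
    print(error)  # extract() expected an argument of type FakeHtml, got str
```

With hext installed, the equivalent call would presumably be `checked_extract(rule, hext.Html(source), hext.Html)`.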

Delete

"All siblings of type" issue

I've been playing with a bunch of extractors and I encountered an issue that has confused me a bit. I'm playing with this DOM:

<li class="item event" >
  <div class="col-12 col-sm-2 event-type" >
    <h5 >
      Special Event
    </h5>
  </div>
    <div class="col-12 col-sm-7 item-content event-content" >
      <h3 class="title item-title event-title" >
        <a href="/events-and-training/event/3433/4377/" >Conference registration (Wednesday)</a>
      </h3>
        <p >Wednesday is a registration day.</p>
        <p >No talks scheduled.</p>
        <p ></p>
    </div>
  <div class="col-12 col-sm-3 item-meta event-meta" >
    <h4 class="event-location" >
      Salon EF
    </h4>
      <p  class="">
      3:00 pm - 6:00 pm
      </p>
  </div>
</li>

<li class="item event" >
  <div class="col-12 col-sm-2 event-type" >
    <h5 >
      Special Event
    </h5>
  </div>
    <div class="col-12 col-sm-7 item-content event-content" >
      <h3 class="title item-title event-title" >
        <a href="/events-and-training/event/3433/4378/">Conference sales (Wednesday)</a>
      </h3>
      <p ></p>
      <p >Stop by the conference sales table and browse our merchandise.</p>
      <p ></p>
    </div>
  <div class="col-12 col-sm-3 item-meta event-meta" >
    <h4 class="event-location" >
      Salon EF
    </h4>
      <p >
      3:00 pm - 6:00 pm
      </p>
  </div>
</li>

From it, I am looking to get a JSON representation like this:

{
    "BODY": [
        "Wednesday is a registration day.",
        "No talks scheduled.",
        ""
    ],
    "TITLE": "Conference registration (Wednesday)"
}
{
    "BODY": [
        "",
        "Stop by the conference sales table and browse our merchandise.",
        ""
    ],
    "TITLE": "Conference sales (Wednesday)"
}

My first thought was this:

<DIV >
  <h3><a @text:TITLE /></h3>
  <p @text:BODY />
</DIV>

But I get the first p tag, others ignored:

{
    "BODY": "Wednesday is a registration day.",
    "TITLE": "Conference registration (Wednesday)"
}
{
    "BODY": "",
    "TITLE": "Conference sales (Wednesday)"
}

I attempted to use CSS nth-child selectors, but they seem to allow only a single reference (ranges like n+2 grab only the second child, ignoring the rest):

# nth-child(n+2) throws an error!

The only way I can seem to get all of the p tags under div into BODY array is by omitting the h3 tag:

<DIV ><p @text:BODY /></DIV>

Is this expected behavior? Is there a template I haven't thought of that can get both the h3 text and an array of the sibling p tags under a div?

Thanks a lot!
