html-extract / hext
Domain-specific language for extracting structured data from HTML documents
Home Page: https://hext.thomastrapp.com
License: Apache License 2.0
Congratulations on v1.0.0!
For future releases, could the GitHub release version number match the one on PyPI? The two are currently far out of sync (v1.0.0 and 0.3.0 respectively), which is confusing.
This doesn't mean that the two always need to be released together in sequence. A patch increment doesn't strictly need to match. Ideally, only the major and minor increments need to match.
Thanks.
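The request above can be expressed as a small check. This is only a sketch: the helper name same_major_minor is hypothetical, and it assumes plain dotted version strings with an optional leading "v", as in the two versions quoted above.

```python
def same_major_minor(version_a: str, version_b: str) -> bool:
    """Return True when two version strings agree on major and minor."""
    def major_minor(version: str) -> tuple:
        # Drop an optional leading "v" (as in Git tags), then keep the
        # first two numeric fields; the patch field is ignored on purpose.
        parts = version.lstrip("vV").split(".")
        return int(parts[0]), int(parts[1])

    return major_minor(version_a) == major_minor(version_b)


print(same_major_minor("v1.0.0", "0.3.0"))  # False
print(same_major_minor("v1.0.3", "1.0.8"))  # True
```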
Please consider using a free GitHub organization for the various public hext related repos that are currently under your account. Thanks.
This is just one example of a serious and common memory leak caused by hext. It is not intended to be the only example, although any example will probably leak memory.
import importlib.metadata
import os
import sys
import tracemalloc
import hext
tracemalloc.start(25)
print(f"Python version: {sys.version.replace(os.linesep, '')}")
print(f"hext version: {importlib.metadata.version('hext')}")
RULE = hext.Rule("<html @text:text />")
def work():
    RULE.extract(hext.Html(''))

for i in range(12345):
    work()
    if i % 1234 == 0:
        print(f"{i=:05} {tracemalloc.take_snapshot().statistics('filename')[0]}")
Output:
Python version: 3.8.3 (default, May 14 2020, 20:11:43) [GCC 7.5.0]
hext version: 0.2.4
i=00000 <frozen importlib._bootstrap_external>:0: size=655 KiB, count=7621, average=88 B
i=01234 <frozen importlib._bootstrap_external>:0: size=612 KiB, count=7137, average=88 B
i=02468 <frozen importlib._bootstrap_external>:0: size=609 KiB, count=7102, average=88 B
i=03702 /myenv/lib/python3.8/site-packages/hext/__init__.py:0: size=838 KiB, count=7403, average=116 B
i=04936 /myenv/lib/python3.8/site-packages/hext/__init__.py:0: size=1117 KiB, count=9869, average=116 B
i=06170 /myenv/lib/python3.8/site-packages/hext/__init__.py:0: size=1396 KiB, count=12335, average=116 B
i=07404 /myenv/lib/python3.8/site-packages/hext/__init__.py:0: size=1675 KiB, count=14801, average=116 B
i=08638 /myenv/lib/python3.8/site-packages/hext/__init__.py:0: size=1955 KiB, count=17267, average=116 B
i=09872 /myenv/lib/python3.8/site-packages/hext/__init__.py:0: size=2234 KiB, count=19733, average=116 B
i=11106 /myenv/lib/python3.8/site-packages/hext/__init__.py:0: size=2513 KiB, count=22199, average=116 B
i=12340 /myenv/lib/python3.8/site-packages/hext/__init__.py:0: size=2793 KiB, count=24667, average=116 B
Process finished with exit code 0
As shown above, the memory consumed by hext keeps increasing for no good reason. This doesn't necessarily have to be due to bug(s) in hext's source code. For all I know, the bug(s) could exist in what is used to interface Python with another language.
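To put a single number on this kind of growth, a helper along the following lines may be useful. It is a minimal sketch using only the standard library; traced_growth, leaky, and steady are hypothetical names, and the two workloads are stand-ins so that the example runs without hext. In the report above, the function under test would be work. Note that tracemalloc only sees allocations made through Python's memory allocators, so memory held purely on the native side would not show up here.

```python
import tracemalloc


def traced_growth(func, iterations: int = 1000) -> int:
    """Return the change in traced Python memory (bytes) across repeated calls."""
    tracemalloc.start()
    try:
        func()  # warm-up call so one-time allocations are not counted
        before, _ = tracemalloc.get_traced_memory()
        for _ in range(iterations):
            func()
        after, _ = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return after - before


# Stand-in workloads (assumptions for illustration, not hext code).
_retained = []


def leaky():
    _retained.append(bytearray(1024))  # kept alive on every call


def steady():
    bytearray(1024)  # allocated and released immediately


print("leaky growth: ", traced_growth(leaky))
print("steady growth:", traced_growth(steady))
```

A growth value that scales with the iteration count, as it does for leaky, is the pattern visible in the output above.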
https://github.com/google/gumbo-parser is no longer maintained.
Arch has already picked up https://codeberg.org/gumbo-parser/gumbo-parser as the new upstream source.
I saw you updated this for py3.9 (#18). Has that version been posted to PyPI?
GitHub latest release is from 2019: v0.8.3
PyPI latest is from 2020: v0.2.5
npm latest is v10.0.5
Move the source of Hext's website to its own git repository to allow for rapid and painless changes to documentation and installation instructions.
Things to consider:
var hext = require('hext');
var rule = new hext.Rule('<a href:link/>');
// rule.extract expects an object of type hext.Html
var result = rule.extract({}); // raises SIGSEGV
It would really help if hext were pip-installable, that is, if hext and its Python 3.6+ bindings were published as a package on PyPI. This would greatly simplify installation and use, both on the command line and from Python.
This request is relevant because a lot of data analysis work these days is done via online notebooks, where only pip installation of packages is an option.
Currently there are no releases for Mac OS on the M1/M2 architecture, i.e. npm install hext and pip install hext will find no suitable release.
Popularity of M1 and M2 will increase over time and therefore Hext should provide releases for the new Apple hardware.
Hext does support ARM64, but unfortunately must be compiled from source.
Another alternative is to use Hext.js.
Hext.js is a JavaScript/WebAssembly module that runs on Node on any architecture (Documentation)
Install hext.js:
$ npm install hext.js
Example application test.js:
const loadHext = require('hext.js');
loadHext().then(hext => {
  const html = new hext.Html("<ul><li>Hello</li><li>World</li></ul>");
  const rule = new hext.Rule("<li @text:my_text />");
  const result = rule.extract(html).map(x => x.my_text).join(", ");
  console.log(result); // "Hello, World"
});
$ node test.js
Hello, World
Temporary workaround:
When calling cmake, add -DBoost_NO_BOOST_CMAKE=On.
Todo:
https://www.python.org/downloads/release/python-390/
When done, remove CI skip for version 3.9 introduced in commit b858d58
Example:
[...]/AttributeMatch.h:70:3: warning: '~AttributeMatch' overrides a destructor but is
not marked 'override' [-Winconsistent-missing-destructor-override]
~AttributeMatch() noexcept = default;
^
[...]/Cloneable.h:36:19: note: overridden virtual function is here
class HEXT_PUBLIC Cloneable : public Base
^
Hi,
First off, thank you for this incredibly useful project. I was trying to use your node bindings and ran into some trouble. It seems that the NaN node module isn't supported in node v10 (which is what I need for a project). The issue I kept running into was similar to this. It totally worked on node v8.
I was just wondering if getting it to work in node v10 is something you are interested in or have time to work on. For now I will just install it as a CLI and spawn a process in node. Additionally, are there any plans to put this up on the npm registry? I like cheerio, but the ability to have HTML templates that I can use across languages is incredibly useful.
Thanks again,
Surya
I am interested in using Hext, but there is no distribution for Mac OS X. As such, I can't install it using Pip.
First off: thank you for creating & maintaining this software!
I am having an issue with custom HTML elements.
My use case involves HTML like this:
<custom-tag>
<div>Text</div>
</custom-tag>
I am unsure how to approach this!
I get an error when I try the extractor on your homepage:
Error: Unknown HTML tag 'custom-tag' at line 4, char 18:
I have lately been dabbling in Nim, which transpiles to C/C++ before compiling and has a partly Python-like syntax.
If hext is implemented in C++, it should be possible to make it interoperable from Nim. A package which may significantly assist in doing so is https://github.com/nimterop/nimterop. Ideally a Nim package can then be published to https://nimble.directory/. Thanks.
Users need a way to install hext without the hassle of compiling from source.
As per Debian packaging rules, the project needs to be split up into at least two packages:
Language bindings also need a package of their own, e.g. libhextpy.
Ideally, the packages will be accepted into the Debian repository and then automatically be mirrored by Debian derivatives, such as Ubuntu.
Edit: See https://github.com/thomastrapp/hext/releases/tag/v0.7.0 for the newly created (binary) Debian packages.
Python 3.8 was released in October 2019: https://www.python.org/downloads/release/python-380/
Currently <custom /> will not match <CUSTOM></CUSTOM>.
The comparison is done here:
Lines 213 to 214 in e5d504d
Hello from the west coast. So I've been using hext a ton over the past couple years now, building scrapers for all kinds of journalistic work. For the most part it works great, especially combined with the easy-to-use web hext template builder.
One thing that's been coming up more frequently is sites that either try to frustrate web scraping attempts by arbitrary obfuscation/garbage divs, or sometimes just really poor quality/hand-crafted HTML from small government entities. DOMs that look like this:
<div class="list">
  <div class="item">Heading
    <p>Item one</p>
  </div>
  <div class="item"><div>Title 2</div>
    <p>Item two</p>
  </div>
  <div class="item">Title 3
    <div>
      <div>
        <p>Item Three</p>
      </div>
    </div>
  </div>
</div>
Let's say we want to get the contents of the <p> tag from every item in the list. A nested rule would be the simplest, most robust method to build an extractor for such DOMs, IMO.
I still think hext is the most change-resistant way to build extractors for websites, but the lack of a way to specify arbitrary-depth ancestor matching rules limits the ways hext can be used directly.
This has been discussed in other issues: #15 and #16
I really feel like this is the missing link for hext, so I wanted to ask about what would be required to get something like this working. I'd be willing to try and get some work done in this direction if I could get pointed in the right direction. The solution posed in #15 really seems like the ideal method:
<a href:link>
# "inner" hext template (enclosed in braces) that is
# contained anywhere inside an <a>:
{
<h1 @text:heading />
}
</a>
Is this a grammar change? Or do we accomplish this via code, extracting bracketed hext templates and applying them recursively to the results? Some other way?
Here's to Hext in 2022! 🥂
I can scrape the entries at https://www.fast.ai/topics/ using the hext:
<h2><span id="technical"/></h2>
<ul><li><a href:link @text:title/></li></ul>
I get the result as a Dict[str, List[str]]. However, I want the result as a typical List[Dict[str, str]], which is what I have come to expect. Is this feasible?
For example, instead of:
{
"link": [
"/2020/02/13/fastai-A-Layered-API-for-Deep-Learning/",
"/2020/01/20/blog_overview/",
],
"title": [
"fastai—A Layered API for Deep Learning",
"Your own blog with GitHub Pages and fast_template (4 part tutorial)",
]
}
I want:
[
{"link": "/2020/02/13/fastai-A-Layered-API-for-Deep-Learning/", "title": "fastai—A Layered API for Deep Learning"},
{"link": "/2020/01/20/blog_overview/", "title": "Your own blog with GitHub Pages and fast_template (4 part tutorial)"}
]
Thanks.
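Independently of what hext itself returns, the reshaping asked for above can be done as a post-processing step. The following is a minimal sketch, assuming all capture lists have the same length and are aligned by position; columns_to_rows is a hypothetical helper name, and the sample values are shortened placeholders rather than real scrape results.

```python
from typing import Dict, List


def columns_to_rows(columns: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Turn {"link": [...], "title": [...]} into [{"link": ..., "title": ...}, ...]."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        # Misaligned lists cannot be paired up reliably, so refuse to guess.
        raise ValueError("all capture lists must have the same length")
    keys = list(columns)
    # zip(*values) walks the lists in parallel, yielding one tuple per entry.
    return [dict(zip(keys, row)) for row in zip(*columns.values())]


result = {
    "link": ["/2020/02/13/first-post/", "/2020/01/20/blog_overview/"],
    "title": ["First post", "Blog overview"],
}
for row in columns_to_rows(result):
    print(row)
```

Raising on unequal lengths is a deliberate choice here: if one entry lacks a title, positional pairing would silently attach titles to the wrong links.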
I could use help scraping the links and titles for each result from the source of https://ai.facebook.com/blog/ .
The following gives me the first entry only:
<a href:prepend("https://ai.facebook.com"):link><span><div><div>
<h2 @text:title></h2>
</div></div></span></a>
Example:
{
"link": "https://ai.facebook.com/blog/powered-by-ai-turning-any-2d-photo-into-3d-using-convolutional-neural-nets/",
"title": "Powered by AI: Turning any 2D photo into 3D using convolutional neural nets"
}
The following gives me the next two entries:
<a href:link><div><div><div><div>
<h4 @text:title></h4>
</div></div></div></div></a>
Example:
{
"link": "https://ai.facebook.com/blog/using-radioactive-data-to-detect-if-a-data-set-was-used-for-training/",
"title": "Using ‘radioactive data’ to detect if a data set was used for training"
}
The following gives me all the correct links but not all the correct titles:
<a href^="https://ai.facebook.com/blog/" href:link @text:title />
How do I get all results, or is this something that hext is not capable of? Thanks.
I currently have:
<body>
{ <p @text:content /> }
</body>
Obviously, this matches all p tags in body at any level. However, I want something like:
<body>
{ <p|h[1-6] @text:content /> }
</body>
or more explicitly:
<body>
{ <p|h1|h2|h3|h4|h5|h6 @text:content /> }
</body>
I mean I also want to match h1 through h6, not just p. This doesn't seem to be supported by hext at this time. This is an important and urgent use case for me: extracting text from an HTML article for machine learning purposes. I don't want to match any other tags, however. Is there any way to do this?
Currently, to use hext for this purpose, I first have to replace all h1-h6 tags with p tags via string replacement, which is hacky and error-prone.
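As an alternative to rewriting tags with string replacement, the text of p and h1 through h6 elements can be collected with Python's standard html.parser module. This is a workaround outside hext, not a hext feature, and only a sketch: TextBlocks and extract_text_blocks are hypothetical names, and nested wanted elements (for example a p inside another p, which is invalid HTML anyway) are treated as separate blocks.

```python
from html.parser import HTMLParser

WANTED = {"p", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextBlocks(HTMLParser):
    """Collect the text of every <p> and <h1>..<h6> element, in document order."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished text blocks
        self._open = None  # tag currently being collected, if any
        self._parts = []   # text fragments of the open element

    def handle_starttag(self, tag, attrs):
        if tag in WANTED:
            self.finish_block()  # an unclosed block ends when the next one starts
            self._open = tag

    def handle_endtag(self, tag):
        if tag == self._open:
            self.finish_block()

    def handle_data(self, data):
        if self._open is not None:
            self._parts.append(data)

    def finish_block(self):
        if self._open is not None:
            # Join fragments and collapse runs of whitespace.
            self.blocks.append(" ".join("".join(self._parts).split()))
        self._open, self._parts = None, []


def extract_text_blocks(html: str) -> list:
    parser = TextBlocks()
    parser.feed(html)
    parser.close()
    parser.finish_block()  # flush a block left open at end of input
    return parser.blocks


sample = "<body><h1>Title</h1><div><p>First <b>para</b></p><span>skip</span><h3>Sub</h3></div></body>"
print(extract_text_blocks(sample))  # ['Title', 'First para', 'Sub']
```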
With Python 3.12, hext.Rule('').extract('') gives the error:
  File "python3.12/site-packages/hext/__init__.py", line 139, in extract
    return _hext.Rule_extract(self, html, max_searches)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Wrong number or type of arguments for overloaded function 'Rule_extract'.
  Possible C/C++ prototypes are:
    Rule::extract(Html const &,std::uint64_t) const
    Rule::extract(Html const &) const
I am of course also getting this error with a more real-life example. At this time I cannot use hext for anything new.
The suggestion of 10_000 for max_searches looks to be too low, resulting in errors for extractions that used to work fine in the past. I increased it to 100_000, with no errors yet.
I often use the website https://hext.thomastrapp.com/ for initial testing. I am seeing this error on it: "The allowed amount of searches was exhausted". Can its value of max_searches be increased?
Fwiw, I was trying it against the copy-pasted source of https://ai.facebook.com/blog/ with the following:
<div>
{
<div>
<h4 @text:title/>
</div>
<div>
<p @text:summary/>
</div>
}
For example, using Python 3.12, Hext 1.0.8:
import hext
rule = hext.Rule("<a href:link/>")
# Error, the argument for extract is of type string:
results = rule.extract("""<a href="b"></a>""")
Produces an unhelpful error message:
Traceback (most recent call last):
  File "/home/dev/issue-27/issue28-example.py", line 4, in <module>
    results = rule.extract("""<a href="b"></a>""")
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/issue-27/venv/lib/python3.12/site-packages/hext/__init__.py", line 139, in extract
    return _hext.Rule_extract(self, html, max_searches)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Wrong number or type of arguments for overloaded function 'Rule_extract'.
  Possible C/C++ prototypes are:
    Rule::extract(Html const &,std::uint64_t) const
    Rule::extract(Html const &) const
The error message should explicitly state that the wrong type of argument was passed, and that an hext.Html was expected.
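Until the bindings report this more clearly, a caller-side guard can produce the kind of message asked for above. The following is a minimal sketch, not part of hext: checked_extract is a hypothetical wrapper, and FakeHtml and FakeRule are stand-ins so that the example runs without hext installed. With hext, one would presumably pass a hext.Rule, the candidate input, and hext.Html as the expected type.

```python
def checked_extract(rule, html, html_type):
    """Call rule.extract(html) after verifying that html is an instance of html_type."""
    if not isinstance(html, html_type):
        raise TypeError(
            f"extract() expected an instance of {html_type.__name__}, "
            f"got {type(html).__name__}"
        )
    return rule.extract(html)


# Stand-ins (assumptions for illustration only).
class FakeHtml:
    def __init__(self, source):
        self.source = source


class FakeRule:
    def extract(self, html):
        return [{"link": "b"}]


try:
    checked_extract(FakeRule(), '<a href="b"></a>', FakeHtml)
except TypeError as error:
    print(error)  # extract() expected an instance of FakeHtml, got str
```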
I've been playing with a bunch of extractors and I encountered an issue that has confused me a bit. I'm playing with this DOM:
<li class="item event" >
<div class="col-12 col-sm-2 event-type" >
<h5 >
Special Event
</h5>
</div>
<div class="col-12 col-sm-7 item-content event-content" >
<h3 class="title item-title event-title" >
<a href="/events-and-training/event/3433/4377/" >Conference registration (Wednesday)</a>
</h3>
<p >Wednesday is a registration day.</p>
<p >No talks scheduled.</p>
<p ></p>
</div>
<div class="col-12 col-sm-3 item-meta event-meta" >
<h4 class="event-location" >
Salon EF
</h4>
<p class="">
3:00 pm - 6:00 pm
</p>
</div>
</li>
<li class="item event" >
<div class="col-12 col-sm-2 event-type" >
<h5 >
Special Event
</h5>
</div>
<div class="col-12 col-sm-7 item-content event-content" >
<h3 class="title item-title event-title" >
<a href="/events-and-training/event/3433/4378/">Conference sales (Wednesday)</a>
</h3>
<p ></p>
<p >Stop by the conference sales table and browse our merchandise.</p>
<p ></p>
</div>
<div class="col-12 col-sm-3 item-meta event-meta" >
<h4 class="event-location" >
Salon EF
</h4>
<p >
3:00 pm - 6:00 pm
</p>
</div>
</li>
From it, I am looking to get a JSON representation like this:
{
"BODY": [
"Wednesday is a registration day.",
"No talks scheduled.",
""
],
"TITLE": "Conference registration (Wednesday)"
}
{
"BODY": [
"",
"Stop by the conference sales table and browse our merchandise.",
""
],
"TITLE": "Conference sales (Wednesday)"
}
My first thought was this:
<DIV >
<h3><a @text:TITLE /></h3>
<p @text:BODY />
</DIV>
But I get only the first p tag; the others are ignored:
{
"BODY": "Wednesday is a registration day.",
"TITLE": "Conference registration (Wednesday)"
}
{
"BODY": "",
"TITLE": "Conference sales (Wednesday)"
}
I also tried CSS nth-child selectors, but those seem to allow only a single reference (a range like n+2 will only grab the second child, ignoring the rest).
The only way I can seem to get all of the p tags under div into the BODY array is by omitting the h3 tag:
<DIV ><p @text:BODY /></DIV>
Is this expected behavior? Is there a template I haven't thought of that can get both the h3 text and an array of the sibling p tags under a div?
Thanks a lot!