Nested Rules
<a>
{
<b/>
}
<c/>
</a>
Rules get a new member: Nested Rules. <b/> would be the one and only Nested Rule of <a> in the above template. Nested Rules are regular Rules, but are treated differently when matching rules against nodes.
A Rule matches a node if all its Nested Rules are successfully matched against the content of the node, i.e. node.innerHTML.
Nested Rules have no order: {<b/>} can be matched before, after or even inside nodes matching <c>.
In other words, the following two hext templates are equivalent with regard to matching:
<a>
<h1 @text:title />
{
<span @text:author />
}
</a>
<a>
{
<span @text:author />
}
<h1 @text:title />
</a>
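The matching rule described above can be sketched in a few lines of Python. This is a toy model, not hext's implementation: `Node`, `Rule`, `descendants` and `matches` are hypothetical names, only tag names are compared, and regular (ordered) child rules such as <c/> are left out so that the unordered behaviour of Nested Rules stands on its own.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)


@dataclass
class Rule:
    tag: str
    nested: list = field(default_factory=list)  # unordered Nested Rules


def descendants(node):
    """Yield every node below `node`, at any depth."""
    for child in node.children:
        yield child
        yield from descendants(child)


def matches(rule, node):
    """A rule matches if its tag fits and every Nested Rule matches some descendant."""
    if rule.tag not in ("*", node.tag):
        return False
    return all(
        any(matches(nested, d) for d in descendants(node))
        for nested in rule.nested
    )


rule = Rule("a", nested=[Rule("b")])  # models <a> { <b/> } </a>

b_before_c = Node("a", [Node("b"), Node("c")])
b_after_c = Node("a", [Node("c"), Node("b")])
b_inside_c = Node("a", [Node("c", [Node("b")])])
no_b = Node("a", [Node("c")])
```

In this sketch the rule matches `b_before_c`, `b_after_c` and `b_inside_c` alike, and rejects `no_b`, because the position of <b> relative to <c> is never consulted.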
Worst case performance
<*>
{
<*>
{
<*/>
}
</*>
}
</*>
This would result in roughly n^3 comparisons, where n is the number of nodes in a document. That's a problem.
It would also make denial-of-service attacks easier when accepting untrusted hext. This could be mitigated by a configuration option, maybe something like "abort after doing x amount of work". IIRC, regex engines have similar protections.
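To get a feeling for the growth, here is a small Python sketch that counts comparisons for three levels of nested wildcard rules and aborts once a work budget is used up. It is a model under assumptions, not hext code: `Budget`, `max_work` and `WorkLimitExceeded` are made-up names, every rule-versus-node comparison costs one unit, a limit of 0 means "no limit" in this sketch, and the document is a single chain of nodes, which is the deepest possible tree.

```python
class WorkLimitExceeded(Exception):
    """Raised when the (hypothetical) work budget is used up."""


class Node:
    def __init__(self):
        self.children = []


class Budget:
    def __init__(self, max_work=0):
        self.max_work = max_work  # 0 disables the limit in this sketch
        self.spent = 0

    def charge(self):
        self.spent += 1
        if self.max_work and self.spent > self.max_work:
            raise WorkLimitExceeded(f"aborted after {self.max_work} comparisons")


def descendants(node):
    for child in node.children:
        yield child
        yield from descendants(child)


def match_wildcards(node, levels, budget):
    """Compare `levels` nested <*> rules against `node`, visiting all candidates."""
    budget.charge()  # one comparison: wildcard rule against this node
    if levels > 1:
        for d in descendants(node):
            match_wildcards(d, levels - 1, budget)


def comparisons(root, levels, max_work=0):
    """Try the rule on every node of the document, as a full extraction would."""
    budget = Budget(max_work)
    for node in [root, *descendants(root)]:
        match_wildcards(node, levels, budget)
    return budget.spent


def chain(n):
    """Build a document of n nodes where each node has exactly one child."""
    root = node = Node()
    for _ in range(n - 1):
        child = Node()
        node.children.append(child)
        node = child
    return root


print(comparisons(chain(10), 3))  # 175
print(comparisons(chain(20), 3))  # 1350
```

Doubling n from 10 to 20 multiplies the count by almost 8, which is the cubic growth mentioned above. Calling `comparisons(chain(10), 3, max_work=100)` raises `WorkLimitExceeded` instead of finishing.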
Data extraction
Simple example
<div class:class>
{
<p @text:content />
}
</div>
<div class="list1">
<div><p>One</p> </div>
<div><p>Two</p> </div>
<div><p>Three</p></div>
</div>
<div class="list2">
<div><p>Four</p> </div>
<div><p>Five</p> </div>
<div><p>Six</p> </div>
</div>
{
"class": "list1",
"content": ["One","Two","Three"]
}
{
"class": "list2",
"content": ["Four","Five","Six"]
}
Edit: The example was wrong
<div class="list1">
<div class="item"><p>One</p> </div>
<div class="item"><p>Two</p> </div>
<div class="item"><p>Three</p></div>
</div>
<div class="list2">
<div class="item"><p>Four</p> </div>
<div class="item"><p>Five</p> </div>
<div class="item"><p>Six</p> </div>
</div>
This would match not only the outer <div> elements, but also the <div> elements on the inside:
<div class:class> { <p @text:content /> } </div>
The result would be:
// outer divs
{"class":"list1","content":["One","Two","Three"]}
{"class":"list2","content":["Four","Five","Six"]}
// inner divs
{"class":"item","content":"One"}
{"class":"item","content":"Two"}
{"class":"item","content":"Three"}
{"class":"item","content":"Four"}
{"class":"item","content":"Five"}
{"class":"item","content":"Six"}
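The double matching can be reproduced with a rough Python emulation of this one template, using only the standard library. It does not use hext: the `extract` function is a hypothetical stand-in that hard-codes "every <div> with a class attribute, plus the text of all <p> elements anywhere below it". The wrapping <root> element is only there so the snippet parses as XML, and collapsing a single capture to a plain string follows the result listing above.

```python
import xml.etree.ElementTree as ET

HTML = """
<root>
<div class="list1">
  <div class="item"><p>One</p></div>
  <div class="item"><p>Two</p></div>
  <div class="item"><p>Three</p></div>
</div>
<div class="list2">
  <div class="item"><p>Four</p></div>
  <div class="item"><p>Five</p></div>
  <div class="item"><p>Six</p></div>
</div>
</root>
"""


def extract(root):
    """Emulate <div class:class> { <p @text:content /> } </div> on an element tree."""
    results = []
    for div in root.iter("div"):  # every <div>, outer and inner
        if "class" not in div.attrib:
            continue
        texts = [p.text for p in div.iter("p")]  # <p> at any depth below the div
        if not texts:
            continue  # the nested rule did not match
        content = texts[0] if len(texts) == 1 else texts
        results.append({"class": div.get("class"), "content": content})
    return results


results = extract(ET.fromstring(HTML.strip()))
for result in results:
    print(result)
```

The sketch yields eight results, two for the outer lists and six for the items. It emits them in document order, so each outer <div> is followed by its own items rather than all outer results coming first.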
Multiple nested rules
<body>
{
<p @text:content />
}
{
<a href:href />
}
</body>
<body>
<div> <p>P1 <a href="p1-href"></a></p> </div>
<div> <p>P2 <a href="p2-href"></a></p> </div>
</body>
{
"content": ["P1", "P2"],
"href": ["p1-href", "p2-href"]
}
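As a plain-Python illustration of this example (again without hext), each nested rule can be thought of as its own independent search below <body>, with the captures merged into one result. The dictionary comprehension below is only an emulation of that idea using `xml.etree.ElementTree`; it assumes the input is well-formed enough to parse as XML.

```python
import xml.etree.ElementTree as ET

BODY = """
<body>
  <div> <p>P1 <a href="p1-href"></a></p> </div>
  <div> <p>P2 <a href="p2-href"></a></p> </div>
</body>
"""

body = ET.fromstring(BODY.strip())

result = {
    # first nested rule: { <p @text:content /> }
    "content": ["".join(p.itertext()).strip() for p in body.iter("p")],
    # second nested rule: { <a href:href /> }
    "href": [a.get("href") for a in body.iter("a") if a.get("href") is not None],
}
print(result)
```

Neither search depends on the other, which is why the <a> elements are found even though they sit inside the <p> elements.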
Scope
- Rule.cpp: New member nested_, ability to add nested rules and iterate over them
- RuleMatching.cpp: Match nested rules
- hext-machine.rl: Add syntax, Rule tree building
- Unit tests, blackbox tests
- Documentation
- Syntax highlighting for ACE and Vim
So far this looks like a non-breaking change (with regard to already written hext templates). I hope I didn't miss anything crucial.
This might actually work :) I need to sleep on this a bit.
from hext.
Wow this is incredible progress! If you want help testing or writing unit tests, or any other task, let me know.
Current master supports nested rules (63339e0).
I am going to add a way to limit hext's resource usage, specifically for hext-on-websockets.
For example, this consumes about 300MB of memory at peak:
\time --verbose htmlext \
<(echo '<* @inner-html:a>{<* @inner-html:b>{<* @inner-html:c>{<* @inner-html:d>{<* @inner-html:e>{<* @inner-html:f>{<*/>}</*>}</*>}</*>}</*>}</*>}</*>') \
<(curl https://news.ycombinator.com/) > /dev/null
I'm thinking about adding a parameter to Rule::extract(html), something like:
Rule::extract(html, int max_nested_recursion = 0)
This parameter would control how often nested rules can traverse the HTML tree. I am not sure about the details yet.
The default value of 0 disables throttling.
After solving this problem, I will publish new releases and update the documentation.
Nested rules are a great addition to hext. Thank you for the well written feature request.
Great work! I'm readying everything for updates to hext-emscripten and my hextractor. Will be testing!
The new pypi package v0.3.0 is built from hext v1.0.0 and currently available for:
- manylinux2014: Python 3.5 - 3.9 (3.10 will follow soon)
- macosx: Python 3.6 - 3.10
Todo:
- node/npm packages
- Support for "max_searches" parameter in language bindings
All done!
I will probably improve the documentation at a later date.
Thank you for using hext - Let me know if there is anything missing.
As always I'm impressed! I'm going to update hextractor and the other libraries and if I find any issues I'll open a new ticket. Thanks so much! 🦾