Comments (7)

thomastrapp commented on June 14, 2024

Nested Rules

<a>
  {
    <b/>
  }
  <c/>
</a>

Rules get a new member: Nested Rules. <b/> would be the one and only Nested Rule of <a> in the above template. Nested Rules are regular Rules, but are treated differently when matching rules against nodes.

A Rule matches a node if all its Nested Rules are successfully matched against the content of the node, i.e. node.innerHTML.

Nested Rules have no order: {<b/>} can be matched before, after or even inside nodes matching <c/>.
In other words, the following two hext templates are equivalent with regard to matching:

<a>
  <h1 @text:title />
  {
    <span @text:author />
  }
</a>

<a>
  {
    <span @text:author />
  }
  <h1 @text:title />
</a>
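
A minimal sketch of the matching semantics described above, written as a toy model in Python rather than hext's actual C++ code. The `Node` and `Rule` classes and the `matches` function are hypothetical names for illustration, and the handling of ordered child rules is deliberately simplified. It models the first template, <a>{<b/>}<c/></a>, and shows that <b/> may sit before, after, or inside the node matching <c/>:

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)

    def descendants(self):
        """Yield every node below this one, in document order."""
        for child in self.children:
            yield child
            yield from child.descendants()


@dataclass
class Rule:
    tag: str
    children: list = field(default_factory=list)  # ordered child rules
    nested: list = field(default_factory=list)    # unordered nested rules


def matches(rule, node):
    if rule.tag != "*" and rule.tag != node.tag:
        return False
    # Simplification: child rules must match direct children, in order.
    pos = 0
    for child_rule in rule.children:
        while pos < len(node.children) and not matches(child_rule, node.children[pos]):
            pos += 1
        if pos == len(node.children):
            return False
        pos += 1
    # Every nested rule must match at least one descendant, anywhere.
    return all(
        any(matches(nested_rule, d) for d in node.descendants())
        for nested_rule in rule.nested
    )


rule = Rule("a", children=[Rule("c")], nested=[Rule("b")])

b_before_c = Node("a", [Node("b"), Node("c")])
b_after_c = Node("a", [Node("c"), Node("b")])
b_inside_c = Node("a", [Node("c", [Node("b")])])
b_missing = Node("a", [Node("c")])

for doc in (b_before_c, b_after_c, b_inside_c, b_missing):
    print(matches(rule, doc))
```

The first three documents match and the last one does not, because the nested rule finds no <b> anywhere below <a>.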

Worst case performance

<*>
  {
    <*>
      {
        <*/>
      }
    </*>
  }
</*>

This would result in roughly n^3 comparisons, where n is the number of nodes in the document. That's a problem.
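
To get a feel for the growth, the following sketch counts comparisons for three levels of nested wildcard rules on a degenerate document in which every node has exactly one child. This is a toy model, not hext's implementation; `chain`, `descendants`, `all_nodes` and `count_comparisons` are hypothetical helpers, and the model assumes every candidate is visited without short-circuiting. On such a chain the count is n + C(n,2) + C(n,3), which grows like n^3/6:

```python
from math import comb


def chain(n):
    """Build a document of n nodes; a node is simply the list of its children."""
    node = []
    for _ in range(n - 1):
        node = [node]
    return node


def descendants(node):
    """Yield every node below the given node."""
    for child in node:
        yield child
        yield from descendants(child)


def all_nodes(root):
    yield root
    yield from descendants(root)


def count_comparisons(candidates, levels):
    """Count rule-vs-node comparisons for `levels` nested wildcard rules."""
    total = 0
    for node in candidates:
        total += 1  # compare the current wildcard rule against this node
        if levels > 1:
            # the nested rule is tried against every descendant of this node
            total += count_comparisons(descendants(node), levels - 1)
    return total


for n in (10, 20, 40):
    print(n, count_comparisons(all_nodes(chain(n)), 3))
```

Doubling n multiplies the count by a factor that approaches 8, which is the cubic behaviour described above. Real documents are bushier than a chain, so the actual count depends on the tree shape.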

Nested wildcard rules like this would also make denial-of-service attacks easier when accepting untrusted hext. This could be mitigated by a configuration option, maybe something like "abort after doing x amount of work". IIRC, regex engines have similar protections.

Data extraction

Simple example

Hext template:
<div class:class>
  {
    <p @text:content />
  }
</div>

HTML input:
<div class="list1">
  <div><p>One</p>  </div>
  <div><p>Two</p>  </div>
  <div><p>Three</p></div>
</div>
<div class="list2">
  <div><p>Four</p> </div>
  <div><p>Five</p> </div>
  <div><p>Six</p>  </div>
</div>

Result:
{
  "class": "list1",
  "content": ["One","Two","Three"]
}
{
  "class": "list2",
  "content": ["Four","Five","Six"]
}

Edit: The example was wrong
<div class="list1">
  <div class="item"><p>One</p>  </div>
  <div class="item"><p>Two</p>  </div>
  <div class="item"><p>Three</p></div>
</div>
<div class="list2">
  <div class="item"><p>Four</p> </div>
  <div class="item"><p>Five</p> </div>
  <div class="item"><p>Six</p>  </div>
</div>

This would match not only the outer <div> elements, but also the <div> elements on the inside:

<div class:class> { <p @text:content /> } </div>

The result would be:

// outer divs
{"class":"list1","content":["One","Two","Three"]}
{"class":"list2","content":["Four","Five","Six"]}
// inner divs
{"class":"item","content":"One"}
{"class":"item","content":"Two"}
{"class":"item","content":"Three"}
{"class":"item","content":"Four"}
{"class":"item","content":"Five"}
{"class":"item","content":"Six"}
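
The result above can be reproduced with a rough emulation built on Python's standard library. This is not hext: it hard-codes the template <div class:class>{<p @text:content/>}</div>, parses the corrected input as XML (wrapped in an assumed <root> element so that there is a single root), and `extract` is a hypothetical helper name. It emits results in document order rather than grouped as in the listing above:

```python
import json
import xml.etree.ElementTree as ET

HTML = """
<root>
<div class="list1">
  <div class="item"><p>One</p>  </div>
  <div class="item"><p>Two</p>  </div>
  <div class="item"><p>Three</p></div>
</div>
<div class="list2">
  <div class="item"><p>Four</p> </div>
  <div class="item"><p>Five</p> </div>
  <div class="item"><p>Six</p>  </div>
</div>
</root>
"""


def extract(root):
    """Emulate <div class:class>{<p @text:content/>}</div> on an element tree."""
    results = []
    for div in root.iter("div"):          # every <div>, outer and inner
        if "class" not in div.attrib:     # class:class needs the attribute
            continue
        # The nested rule searches all descendants of the div for <p>.
        texts = ["".join(p.itertext()).strip() for p in div.iter("p")]
        if not texts:                     # the nested rule must match at least once
            continue
        content = texts[0] if len(texts) == 1 else texts
        results.append({"class": div.get("class"), "content": content})
    return results


results = extract(ET.fromstring(HTML))
for result in results:
    print(json.dumps(result))
```

Each inner <div class="item"> matches on its own because it also carries a class attribute and contains a <p>, which is why the template yields eight results instead of two.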

Multiple nested rules

Hext template:
<body>
  {
    <p @text:content />
  }
  {
    <a href:href />
  }
</body>

HTML input:
<body>
  <div> <p>P1 <a href="p1-href"></a></p> </div>
  <div> <p>P2 <a href="p2-href"></a></p> </div>
</body>

Result:
{
  "content": ["P1", "P2"],
  "href": ["p1-href", "p2-href"]
}
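
As a hedged illustration of how several nested rules combine, here is a standard-library emulation of this one template. It is not hext itself; `extract_body` is a hypothetical helper, and trimming the captured text is an assumption made to reproduce the result shown above. Each nested rule searches the descendants of <body> independently, and the rule only yields a result if all of them match:

```python
import xml.etree.ElementTree as ET

HTML = """<body>
  <div> <p>P1 <a href="p1-href"></a></p> </div>
  <div> <p>P2 <a href="p2-href"></a></p> </div>
</body>"""


def extract_body(body):
    """Emulate <body>{<p @text:content/>}{<a href:href/>}</body>."""
    # First nested rule: every <p> below body, capturing its text.
    content = ["".join(p.itertext()).strip() for p in body.iter("p")]
    # Second nested rule: every <a> below body that has an href attribute.
    hrefs = [a.get("href") for a in body.iter("a") if a.get("href") is not None]
    # All nested rules must match, otherwise the whole rule fails.
    if not content or not hrefs:
        return None
    return {"content": content, "href": hrefs}


result = extract_body(ET.fromstring(HTML))
no_links = extract_body(ET.fromstring("<body><p>P1</p></body>"))
print(result)
print(no_links)
```

The second call returns None because the document contains no <a>, so the second nested rule cannot be satisfied.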

Scope

  • Rule.cpp: New member nested_, ability to add nested rules and iterate over them
  • RuleMatching.cpp: Match nested rules
  • hext-machine.rl: Add syntax, Rule tree building
  • Unit tests, blackbox tests
  • Documentation
  • Syntax highlighting for ACE and Vim

So far this looks like a non-breaking change (with regard to existing hext templates). I hope I didn't miss anything crucial.

This might actually work :) I need to sleep on this a bit.

from hext.

brandonrobertz commented on June 14, 2024

Wow this is incredible progress! If you want help testing or writing unit tests, or any other task, let me know.

thomastrapp commented on June 14, 2024

Current master supports nested rules (63339e0).

I am going to add a way to limit hext's resource usage, specifically for hext-on-websockets.

For example, this consumes about 300MB of memory at peak:

\time --verbose htmlext \
  <(echo '<* @inner-html:a>{<* @inner-html:b>{<* @inner-html:c>{<* @inner-html:d>{<* @inner-html:e>{<* @inner-html:f>{<*/>}</*>}</*>}</*>}</*>}</*>}</*>') \
  <(curl https://news.ycombinator.com/) > /dev/null

I'm thinking about adding a parameter to Rule::extract(html), something like:
Rule::extract(html, int max_nested_recursion = 0)
This parameter would control how often nested rules can traverse the HTML tree. I am not sure about the details yet.
The default value of 0 disables throttling.
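
The throttling idea can be sketched as a simple work counter. This is a toy model in Python, not the proposed C++ API: `Budget`, `BudgetExceeded`, `chain`, `descendants` and `search` are all hypothetical names, and charging one unit per visited node is an assumption. As in the proposal, a limit of 0 disables throttling:

```python
class BudgetExceeded(Exception):
    """Raised when matching performs more work than the caller allows."""


class Budget:
    def __init__(self, limit=0):
        self.limit = limit  # 0 means unlimited, like the proposed default
        self.used = 0

    def spend(self):
        self.used += 1
        if self.limit and self.used > self.limit:
            raise BudgetExceeded(f"gave up after {self.limit} units of work")


def chain(n):
    """Build a document of n nodes; a node is simply the list of its children."""
    node = []
    for _ in range(n - 1):
        node = [node]
    return node


def descendants(node):
    for child in node:
        yield child
        yield from descendants(child)


def search(node, levels, budget):
    """Visit node, then let each remaining nested level walk all descendants."""
    budget.spend()
    if levels > 1:
        for d in descendants(node):
            search(d, levels - 1, budget)


document = chain(30)

unlimited = Budget()             # default: no throttling
search(document, 3, unlimited)

limited = Budget(limit=100)
try:
    search(document, 3, limited)
    aborted = False
except BudgetExceeded:
    aborted = True

print(unlimited.used, aborted)
```

Counting work rather than recursion depth keeps the limit meaningful for shallow but very wide documents, at the cost of one extra check per visited node.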

After solving this problem, I will publish new releases and update the documentation.

Nested rules are a great addition to hext. Thank you for the well-written feature request.

brandonrobertz commented on June 14, 2024

Great work! I'm readying everything for updates to hext-emscripten and my hextractor. Will be testing!

thomastrapp commented on June 14, 2024

The new PyPI package v0.3.0 is built from hext v1.0.0 and is currently available for:

  • manylinux2014: Python 3.5 - 3.9 (3.10 will follow soon)
  • macosx: Python 3.6 - 3.10

Todo:

  • node/npm packages
  • Support for "max_searches" parameter in language bindings

thomastrapp commented on June 14, 2024

All done!
I will probably improve the documentation at a later date.

Thank you for using hext - Let me know if there is anything missing.

brandonrobertz commented on June 14, 2024

As always I'm impressed! I'm going to update hextractor and the other libraries and if I find any issues I'll open a new ticket. Thanks so much! 🦾
