
Comments (5)

mikhainin commented on June 14, 2024

Sure, I just filed #534

from trafilatura.

mikhainin commented on June 14, 2024

I was able to fix it this way:

diff --git a/trafilatura/core.py b/trafilatura/core.py
index 63699a4..1970c25 100644
--- a/trafilatura/core.py
+++ b/trafilatura/core.py
@@ -397,7 +397,7 @@ def handle_table(table_elem, potential_tags, options):
                     # add child element to processed_element
                     if processed_subchild is not None:
                         subchildelem = SubElement(newchildelem, processed_subchild.tag)
-                        subchildelem.text, subchildelem.tail = processed_subchild.text, processed_subchild.tail
+                        subchildelem.text, subchildelem.tail = ''.join(processed_subchild.itertext()), processed_subchild.tail
                     child.tag = 'done'
             # add to tree
             if newchildelem.text or len(newchildelem) > 0:

But I'm not sure whether this is the correct solution.
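For readers following along, here is a minimal sketch of why the patched line behaves differently. It is not trafilatura code: it uses the standard library's `xml.etree.ElementTree` as a stand-in (assuming its `.text` and `itertext()` semantics match those of the element objects used in `core.py`), and the `<cell>`/`<hi>` markup is a made-up example.

```python
import xml.etree.ElementTree as ET

# Hypothetical table cell whose text is split around a nested inline element.
cell = ET.fromstring("<cell>Total: <hi>42</hi> items</cell>")

# .text only holds the text that precedes the first child element,
# so everything inside and after <hi> is lost.
text_only = cell.text

# itertext() walks the element and all of its descendants in document
# order, yielding each text and tail fragment; joining them recovers
# the full cell content.
full_text = "".join(cell.itertext())

print(repr(text_only))  # 'Total: '
print(repr(full_text))  # 'Total: 42 items'
```

One trade-off to keep in mind: joining `itertext()` flattens everything into a single string, so any structure nested inside the cell (for example list items) is reduced to concatenated text rather than kept as separate elements.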


adbar commented on June 14, 2024

@mikhainin Thank you for reporting the bug and proposing a fix. Could you please draft a PR with your solution? If the tests pass, I would integrate it.


adbar commented on June 14, 2024

Note: the issue is now fixed if the recall option is on.


alroythalus commented on June 14, 2024

Try it on Spotify's privacy policy: https://www.spotify.com/in-en/legal/privacy-policy/
The lists in the tables aren't being captured yet.
@adbar @mikhainin

