adbar / trafilatura Goto Github PK

View Code? Open in Web Editor NEW

2.8K 28.0 214.0 23.6 MB

Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments

Home Page: https://trafilatura.readthedocs.io

License: Apache License 2.0

Python 100.00%

web-scraping text-extraction nlp html2text news text-mining crawler text-cleaning text-preprocessing article-extractor

trafilatura's Introduction

Hi there! 👋

Links

⚡ Web | ✍ Blog | 🐦 Twitter | 🎞 Youtube | ☕ Coffee

Activity

🔭 Currently working on gathering texts on the Web and detecting word trends

Programming experience

🖩 First programs written on a TI-83 Plus in TI-BASIC

trafilatura's People

Contributors

Stargazers

Watchers

Forkers

derkozmonaut vbarbaresi lukasbbaw przor3n massyah keishinkickback zanachka victor8733 vishalbelsare peeyush113 ellielockhart austinsmom gaplan finster1 dmoklaf yomguithereal seryum dabr01 andiksetyawan pgbrandao lukehassel khaledadrani dominhhai djd0723 ebaharilikult stacklikemind bhuvanesh15 amirmamo314 coding-to-music dirtybit aarek-eng muellermartin arnedefauw c00renut notentered wubabootr bad-d0nkey sourcery-ai-bot super-high mohsenfayyaz naftalibeder jvdbogae ricjhill fangod niallbrickell krenthor mivanovitch sobir-git immortal-autumn wikidepia glacierck dtdly92 swetepete samrath-sudesh-acharya moregatest kobanbanan minaoar renardudezert mthandazo42 amichae2 kukupigs ben2017a justin1904 knit-bee deedy5 k-sareen dsgibbons jake-yi mrienstra m15821832738 imzachjohnson yufish cjmcmurtrie anaptfox helehm getorca mdic ddong17 lightcax wu-seong ashispapu wilson1yan andremacola kinged007 ihfazhillah eudaimonia-tech mwbyeon edkrueger pingmuict jasondavis panji31 abhi12299 viviana31henao transybao1393 marcosfp97 tbmihailov sdondley aliarshaddev korben00 utkarshsingh77

trafilatura's Issues

Difference between this and using readability

Hi!
Great library, but I am trying to figure out if we should swap from using readability on which this builds.
Apart from manually checking the quality for our corpus there is no easy way for me to compare performance.
Would you be able to explain the differences between the actual text extraction approaches?
Thanks in advance!
D

Extraction with unexpected result

Case: In an HTML page, the main text has been placed in paragraphs contained in a DIV. Then after a few more tags outside the DIV, some h2 headings appear:

...
<div>
<p>Lorem ipusm dolor</p>
<p>sit amet</p>
</div>

<h2>Hello World</>
<h2>and a few other texts</>
...

Unexpected behaviour: In the response JSON, the fields "text" and "raw-text" are composed of the content of the h2 tag, followed by the content of the preceding paragraphs p in the HTML stream:
"raw-text" : "Hello World and a few other texts Lorem ipusm dolor sit amet"

No Formatting in Plain Text Output

When using include_formatting for plain text, I'm not seeing any formatting (bold, italics, etc..).
The term I'm using supports this. Is this by design or a bug?
I tried both the standalone version and using it as a library with trafilatura.extract(downloaded, include_formatting=True).

Expose find_date() parameters through extract_metadata()

The parameters for date extraction are hard coded within extract_metadata(). In order to get a different date format (ISO-8601 in my case) I need to call it directly, and then I call extract_metadata() regardless to capture the other bits. This mean I'm running the date parsing two times, which I would like to optimize out.

Here is what I am doing:

tree = load_html(unicode_body)
publish_date = find_date(tree, extensive_search=True, url=doc_url, outputformat='%Y-%m-%d %H:%M:%S')

meta = extract_metadata(unicode_body)

if meta.author:
    author = meta.author

It would be nicer to have this (or equivalent):

meta = extract_metadata(unicode_body, extensive_search=True, url=doc_url, outputformat='%Y-%m-%d %H:%M:%S')

if meta.author:
    author = meta.author

if meta.date:
    publish_date = meta.date

How to pass "you are not a robot" with trafilature

Hey, I am using trafilatura to get the content of newspapers. Is it possible to overpass captcha within trafilatura? For instance, Bloomberg is returning "To continue, please click the box below to let us know you're not a robot. Please make sure your browser supports JavaScript and cookies and that you are not blocking them from loading. For more information you can review..." when I try to access a news article.

Trafilatura is printing encoding error messages outside of logging perimeter

Specific encoding error messages from libxml2 are sometimes printed on stdout, independantly from the logging library.

After investigation, it seems to be due to a print here:

trafilatura/trafilatura/xml.py

Line 158 in bdde794

print(TEI_RELAXNG.error_log.last_error)

Which shall be replaced by a logging.warning call. I can provide a patch if this is useful.

Investigate potential speed-up with customized readability-lxml

The goal is to modify the internal subclass LXMLDocument so as to avoid converting back the output back from a string back to an LXML tree:

trafilatura/trafilatura/external.py

Line 56 in 3b4cb19

return html.fromstring(doc.summary(html_partial=True), parser=HTML_PARSER)

The function readibility.Document.get_clean_html could be of interest if added to the custom class LXMLDocument(Document), the question is how.

Feature request: keeping links

I can't find a way to extract text while preserving the <a href> tags in it (probably replacing those tags with link tags, output type would be xml). Would be great to have it. If I were to help add it, where should I look?

Importing only the extract utilities

Hello @adbar, thanks for your tremendous work on the library. Do you know if there is a way to install and then import the library so that you will only load the utilities related to raw content extraction from a html string? If not, is there anyway we can discuss this particular topic and see if I could help you implement this in any way? My use case is basically the following: I have a CLI tool that currently relies on dragnet and I would like to jump ship and adopt trafilatura. My issue is that I don't want to install the net-related dependencies you list in your setup.py (notably requests and tldextract) because they will clash with some of my dependencies and I have my own means of downloading things, dealing with urls etc.

Have a good day,

Fallback for parsing sitemap data

Hi, I just discovered this lib and it's quickly becoming one of my favorites due to how quick and clean it works!

I wanted to bring up an issue/potential fix for sitemap parsing, with the end goal of extracting all the urls from the sitemap.

Using the url https://hubspot.com as the example

Using the native method

from trafilatura import sitemaps, feeds

feed_url = "https://hubspot.com"

rss_list = feeds.find_feed_urls(feed_url, target_lang='en')
print(rss_list)

"""
returns

WARNING:root:not a valid XML sitemap: https://hubspot.com/sitemap.xml
WARNING:root:not a sitemap: https://hubspot.com/sitemap.xml
[] # returns empty list
"""

# then if searching by sitemap_search function
sitemap_list = sitemaps.sitemap_search(feed_url, target_lang='en')
print(sitemap_list)

"""
returns

ERROR:trafilatura.utils:not a 200 response: 404 for URL https://hubspot.com/sitemap_news.xml
WARNING:root:not a sitemap: https://hubspot.com/sitemap_news.xml
ERROR:trafilatura.utils:not a 200 response: 404 for URL https://hubspot.com/sitemap_index.xml
WARNING:root:not a sitemap: https://hubspot.com/sitemap_index.xml
WARNING:trafilatura.sitemaps:Diverging domain names: hubspot.com sitemaps.org
WARNING:trafilatura.sitemaps:Diverging domain names: hubspot.com google.com
WARNING:trafilatura.sitemaps:Diverging domain names: hubspot.com google.com
WARNING:trafilatura.sitemaps:Diverging domain names: hubspot.com hubspotusercontent00.net
WARNING:trafilatura.sitemaps:Diverging domain names: hubspot.com hubspotusercontent00.net
WARNING:trafilatura.sitemaps:Diverging domain names: hubspot.com hubspotusercontent00.net
WARNING:trafilatura.sitemaps:Diverging domain names: hubspot.com hubspotusercontent00.net
WARNING:trafilatura.sitemaps:Diverging domain names: hubspot.com hubspotusercontent00.net
WARNING:root:not a valid XML sitemap: https://hubspot.com/sitemap.xml.gz
WARNING:root:not a sitemap: https://hubspot.com/sitemap.xml.gz
['https://www.hubspot.com/careers/error</loc><lastmod>2019-11-06</lastmod></url><url><loc>', 
'https://www.hubspot.com/case-studies/incenteev</loc><lastmod>2021-03-05</lastmod></url><url><loc>', ...]
# returns unclean list of results
"""

However, the first sitemap found in find_feed_urls is valid, it's just missing the encoding that the function looks for.

So what I had done instead was

from trafilatura import bare_extraction

feed_url = 'https://hubspot.com/sitemap.xml'

downloaded = fetch_url(feed_url)
result = bare_extraction(downloaded, output_format='python', include_links=True)

# probably better with regex
links_list = result['text'].split('https://')
links_list = [('https://' + l).strip() for l in links_list if l]

This returns a list of > 1k urls. If it would be possible to return valid 200 found sitemaps from find_feed_urls function, regardless of whether the function is able to successfully parse it, it would be very useful (perhaps as an arg).

Thanks for this awesome lib!

Missing title crashes Trafilatura

In some web pages Trafilatura crashes with this error:

Traceback (most recent call last):
  File "/Users/raphaelgeronimi/Local/Sabrina/python_libraries/extractpost.py", line 99, in _extract_html_with
    include_comments=False)
  File "/usr/local/Caskroom/miniconda/base/envs/myenv/lib/python3.7/site-packages/trafilatura/core.py", line 649, in extract
    max_tree_size=max_tree_size, url_blacklist=url_blacklist
  File "/usr/local/Caskroom/miniconda/base/envs/myenv/lib/python3.7/site-packages/trafilatura/core.py", line 562, in bare_extraction
    docmeta = extract_metadata(tree, url, date_extraction_params)
  File "/usr/local/Caskroom/miniconda/base/envs/myenv/lib/python3.7/site-packages/trafilatura/metadata.py", line 371, in extract_metadata
    metadata['sitename'] = extract_sitename(tree)
  File "/usr/local/Caskroom/miniconda/base/envs/myenv/lib/python3.7/site-packages/trafilatura/metadata.py", line 302, in extract_sitename
    mymatch = re.search(r'^.*?[-|]\s+(.*)$', tree.find('.//head/title').text)
  File "/usr/local/Caskroom/miniconda/base/envs/myenv/lib/python3.7/re.py", line 185, in search
    return _compile(pattern, flags).search(string)
TypeError: expected string or bytes-like object

Reading the code, it seems that the reason is that in metadata.py line 302, the code in bold below returns None, which triggers an exception in the search method of the re library:
mymatch = re.search(r'^.?[-|]\s+(.)$', tree.find('.//head/title').text)

The solution would be to check for None (which could alter the logic afterward).

No metadata extraction

Hello,

Thanks for your beautiful and powerful project, I try to test some websites with trafilatura 0.6.0 in Python 3.8.

My test:

import trafilatura
from trafilatura.core import bare_extraction

downloaded = trafilatura.fetch_url('https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/')

result = bare_extraction(downloaded, include_formatting=False, with_metadata=True)

print(result)

The results:
({'title': None, 'author': None, 'url': None, 'hostname': None, 'description': None, 'sitename': None, 'date': None, 'categories': None, 'tags': None, 'fingerprint': None, 'id': None}, 'Leader spotlight: Erin Spiceland Every March we recognize the women who have shaped history—and now, we’re taking a look forward. From driving software development in large companies to maintaining thriving open source communities, we’re spending Women’s History Month with women leaders who are making history every day in the tech community. Erin Spiceland is a Software Engineer for SpaceX. Born and raised in rural south Georgia, she is a Choctaw and Chickasaw mother of two now living in downtown Los Angeles. Erin didn’t finish college—she’s a predominantly self-taught software engineer. In her spare time, she makes handmade Native American beadwork and regalia and attends powwows. How would you summarize your career (so far) in a single sentence? My career has been a winding road through periods of stimulation and health as well as periods of personal misery. During it all, I’ve learned a variety of programming languages and technologies while working on a diverse array of products and services. I’m a domestic abuse survivor and a Choctaw bisexual polyamorous woman. I’m so proud of myself that I made it this far considering where I came from. What was your first job in tech like? In 2007, I had a three-year-old daughter and I was trying to finish my computer science degree one class at a time, all while keeping my house and family running smoothly. I found the math classes exciting and quickly finished my math minor, leaving only computer science classes. I was looking at about five years before I would graduate. Then, my husband at the time recommended me for an entry software developer position at a telecom and digital communications company. When faced with the choice between an expensive computer science degree and getting paid to do what I loved, I dropped out of college and accepted the job. I was hired to work on internal tooling, and eventually, products. I did a lot of development on product front-ends, embedded network devices, and a distributed platform-as-a-service. I learned Java/JSP, Python, JavaScript/CSS, Node.js, as well as MySQL, PostgreSQL, and distributed systems architecture. It was an intense experience that required a lot of self-teaching, asking others for help, and daycare, but it set me up for my later successes. What does leadership mean to you in your current role? “Leadership is about enabling those below, above, and around you to be at their healthiest and most effective so that all of you can accurately understand your surroundings, make effective plans and goals for the future, and achieve those goals.” I appreciate and admire technical, effective leaders who care for their reports as humans, not as lines on a burndown chart, and forego heavy-handed direction in favor of communication and mutual dialogue. I think it’s as important for a leader to concern herself with her coworkers’ personal well-being as it is for her to direct their performance. What’s the biggest career risk you’ve ever taken? What did you learn from that experience? Last year I took a pay cut to move from a safe, easy job where I had security to work in a language I hadn’t seen in years and with systems more complicated than anything I’d worked with before. I moved from a place where I had a huge four bedroom house to a studio apartment that was twice the price. I moved away from my children, of who I share custody with my ex-husband. We fly across the U.S. to see each other now. I miss my children every day. However, I get to be a wonderful role model for them. “I get to show my children that a Native woman who grew up in poverty, lost her mother and her culture, and who didn’t finish college can learn, grow, and build whatever career and life she wants.” What are you looking forward to next? I can’t wait to wake up every day with my partner who loves me so much. I’m looking forward to showing my children exactly how far they can go. I’m excited to keep exploring Los Angeles. “I expect to learn so much more about software and about life, and I want to experience everything.” Want to know more about Erin Spiceland? Follow them on GitHub or Twitter. Want to learn more about featured leaders for Women’s History Month? Read about: Laura Frank Tacho, Director of Engineering at CloudBees Rachel White, Developer Experience Lead at American Express Kathy Pham, Computer Scientist and Product Leader at Mozilla and Harvard Heidy Khlaaf, Research Consultant at Adelard LLP Check back in soon—we’ll be adding new interviews weekly throughout March.', <Element body at 0x10680a280>, <Element body at 0x1067af080>)

So, no metadata return.

Also, I added a xpath in the metaxpaths.py and rebuild your code. I'm sure that //div[contains(@class, "post__categories")]//li//a will be match with a category in the url https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/. But no category is returned.

categories_xpaths = [
    """//div[starts-with(@class, 'post-info') or starts-with(@class, 'postinfo') or
    starts-with(@class, 'post-meta') or starts-with(@class, 'postmeta') or
    starts-with(@class, 'meta') or starts-with(@class, 'entry-meta') or starts-with(@class, 'entry-info') or
    starts-with(@class, 'entry-utility') or starts-with(@id, 'postpath')]//a""",
    "//p[starts-with(@class, 'postmeta') or starts-with(@class, 'entry-categories') or @class='postinfo' or @id='filedunder']//a",
    "//footer[starts-with(@class, 'entry-meta') or starts-with(@class, 'entry-footer') or starts-with(@class, 'post-info')]//a",
    '//*[(self::li or self::span)][@class="post-category" or starts-with(@class, "post__categories") or @class="postcategory" or @class="entry-category"]//a',
    '//header[@class="entry-header"]//a',
    '//div[@class="row" or @class="tags"]//a',
    '//div[contains(@class, "post__categories")]//li//a',
]

Another question is that could I get content of article including html format (no clean tags in content)?

Please help me, thanks for your support!

Unhandled TypeError — possible regression?

Hey,
I noticed that a fair amount of extractions are failing with a seemingly unnecessary TypeError.

steps to reproduce:

downloaded = trafilatura.fetch_url('https://fortelabs.co/blog/para/')
trafilatura.extract(downloaded, include_links=True)

gives:

TypeError                                 Traceback (most recent call last)
<ipython-input-28-33fce83a7b3f> in <module>
----> 1 trafilatura.extract(downloaded, include_formatting=False, include_links=True)

/usr/local/lib/python3.8/site-packages/trafilatura/core.py in extract(filecontent, url, record_id, no_fallback, include_comments, output_format, tei_validation, target_language, include_tables, include_images, include_formatting, include_links, deduplicate, date_extraction_params, with_metadata, max_tree_size, url_blacklist, settingsfile, config)
    776         url_blacklist = set()
    777     # extraction
--> 778     docmeta = bare_extraction(
    779         filecontent, url=url, no_fallback=no_fallback,
    780         include_comments=include_comments, output_format=output_format,

/usr/local/lib/python3.8/site-packages/trafilatura/core.py in bare_extraction(filecontent, url, no_fallback, include_comments, output_format, target_language, include_tables, include_images, include_formatting, include_links, deduplicate, date_extraction_params, with_metadata, max_tree_size, url_blacklist, config)
    682
    683         # extract content
--> 684         postbody, temp_text, len_text, sure_thing = extract_content(cleaned_tree, include_tables, include_images, include_links, deduplicate, config)
    685
    686         # compare if necessary

/usr/local/lib/python3.8/site-packages/trafilatura/core.py in extract_content(tree, include_tables, include_images, include_links, deduplicate, config)
    382         # list(filter(None.__ne__, processed_elems))
    383         result_body.extend([e for e in
--> 384                             [handle_textelem(e, potential_tags, deduplicate, config) for e in subtree.xpath('.//*')]
    385                             if e is not None])
    386         # remove trailing titles

/usr/local/lib/python3.8/site-packages/trafilatura/core.py in <listcomp>(.0)
    382         # list(filter(None.__ne__, processed_elems))
    383         result_body.extend([e for e in
--> 384                             [handle_textelem(e, potential_tags, deduplicate, config) for e in subtree.xpath('.//*')]
    385                             if e is not None])
    386         # remove trailing titles

/usr/local/lib/python3.8/site-packages/trafilatura/core.py in handle_textelem(element, potential_tags, dedupbool, config)
    287         new_element = handle_titles(element)
    288     elif element.tag == 'p':
--> 289         new_element = handle_paragraphs(element, potential_tags, dedupbool, config)
    290     elif element.tag == 'lb':
    291         if text_chars_test(element.tail) is True:

/usr/local/lib/python3.8/site-packages/trafilatura/core.py in handle_paragraphs(element, potential_tags, dedupbool, config)
    171                     newsub.set('rend', child.get('rend'))
    172                 elif child.tag == 'ref':
--> 173                     newsub.set('target', child.get('target'))
    174             # handle line breaks
    175             elif child.tag == 'lb':

src/lxml/etree.pyx in lxml.etree._Element.set()

src/lxml/apihelpers.pxi in lxml.etree._setAttributeValue()

src/lxml/apihelpers.pxi in lxml.etree._utf8()

TypeError: Argument must be bytes or unicode, got 'NoneType'

I believe this error could be handled in a meaningful way (perhaps excepted and passed?), but maybe I'm missing something?

Try to fetch dead links from archive.org

URL structure could be https://web.archive.org/web/20/original-URL

Git Tags for Releases

Will you please start tagging your master branch to match the releases published to pypi? This will help immensely when it comes to finding code to debug issues for a specific version.

Nice library by the way. I'm putting it to use.

Is there a way to extract a top image from an article?

article_trafilatura = trafilatura.bare_extraction(trafilatura.fetch_url(url), include_images=True)

When I set include_images=True, I get article src, but it seems to miss the top article image src. Is it meant to be this way or is it still not implemented or am I doing something wrong? I have tried passing different urls from different websites but still I cannot find a way how to extract that image.

Trafilatura is skipping ordered list elements (ol tag)

Trafilatura is not including the ol tag on the following page, e.g. where the first list element starts with "Der Arbeitgeber finanziert Ihre bAV allein":

https://www.finanztip.de/betriebliche-altersvorsorge/

It's also skipping the h3 titles on the aforementioned page.

PS: Not sure if you want issues with the result precision to be reported here?

Failing to use new version due to deps mismatch.

Hey @adbar

There are issues with the chardet dep, when using the new version. I run pip-compile with the following items:

beautifulsoup4
Flask
pytest
gunicorn
readability-lxml
trafilatura==0.8.0
backoff
requests
urllib3
google-cloud-storage
google-cloud-bigquery
google-cloud-firestore
google-cloud-error-reporting

The error:

Could not find a version that matches chardet<4,>=3.0.2,>=3.0.4,>=4.0.0 (from -r requirements.in (line 4))
Tried: 1.0, 1.0.1, 1.1, 2.1.1, 2.2.1, 2.2.1, 2.3.0, 2.3.0, 3.0.0, 3.0.0, 3.0.1, 3.0.1, 3.0.2, 3.0.2, 3.0.3, 3.0.3, 3.0.4, 3.0.4, 4.0.0, 4.0.0
There are incompatible versions in the resolved dependencies:
  chardet (from -r requirements.in (line 4))
  chardet>=4.0.0 (from htmldate==0.8.0->trafilatura==0.7.0->-r requirements.in (line 7))
  chardet>=3.0.4 (from trafilatura==0.7.0->-r requirements.in (line 7))
  chardet<4,>=3.0.2 (from requests==2.23.0->-r requirements.in (line 9))
  chardet (from readability-lxml==0.8.1->-r requirements.in (line 6))

If I do not pin trafilatura, then it runs fine but we get 0.6.0.

Thanks in advance!

Having difficulty extracting text from headings, images and unordered lists

Hi all I presume this is an easy fix. I'm loading a local copy of this file:

https://lisn-tests.netlify.app/rich-content.html
loaded using Path(…).read_text()

using bare_extraction

I get this output that appears to suggest that trafilatura with these options is skipping over headers, bullets, and discarding images. What have I done wrong to make it behave like this?


  "title": "This is heading 1",
  "author": null,
  "hostname": null,
  "date": null,
  "categories": "",
  "tags": "",
  "fingerprint": "bnP8wg6PD0dg9QA9O/mFqg7dSEk=",
  "id": null,
  "raw-text": "Paragraph 1. What does this make of a bulleted list? This is quoted textAn image: This is bolded text. This is bolded text. This is a paragraph with some bold text in the middle. Table heading 1 Table heading 2 Table heading 1 Table heading 2 Table heading 1 Table heading 2",
  "source": null,
  "source-hostname": null,
  "excerpt": null,
  "text": "Paragraph 1. What does this make of a bulleted list?\nThis is quoted textAn image:\nThis is bolded text.\nThis is bolded text.\nThis is a paragraph with some bold text in the middle.\n|Table heading 1||Table heading 2|\n|Table heading 1||Table heading 2|\n|Table heading 1||Table heading 2|",
  "comments": ""
}
===
[
  "Paragraph 1.",
  "What does this make of a bulleted list?",
  "This is quoted textAn image:",
  "This is bolded text.",
  "This is bolded text.",
  "This is a paragraph with some bold text in the middle.",
  "|Table heading 1||Table heading 2|",
  "|Table heading 1||Table heading 2|",
  "|Table heading 1||Table heading 2|"
]

Test program

Extracting Text from HTML: Unordered List Description\Header

I have been using trafilatura to extract text from HTML pages. I have noticed that sometimes the text following an unordered list is not extracted, the list items are extracted but not the text following the unordered list tag.

<ul>Description of the list:
	<li>List item 1</li>
	<li>List item 2</li>
	<li>List item 3</li>
</ul>

In the previous code example, the extracted text would be:

List item 1
List item 2
List item 3

"Description of the list" would not be extracted into the text file. This is probably due to incorrect HTML coding practices but I'm wondering if Trafilatura can capture that text.

IndexError: pop from empty list

Hi,

I tried to crawl texts from a list of 100 URLs experimentally. Every time it's stopped at the 40th with the error as follows:

$ trafilatura --inputfile linkliste-gefiltert.txt --outputdir ausgabe/ --backup-dir html-quellen/
Traceback (most recent call last):
  File "/home/anaconda3/bin/trafilatura", line 8, in <module>
    sys.exit(main())
  File "/home/anaconda3/lib/python3.7/site-packages/trafilatura/cli.py", line 151, in main
    process_args(args)
  File "/home/anaconda3/lib/python3.7/site-packages/trafilatura/cli.py", line 166, in process_args
    url_processing_pipeline(args, input_urls, SLEEP_TIME)
  File "/home/anaconda3/lib/python3.7/site-packages/trafilatura/cli_utils.py", line 306, in url_processing_pipeline
    multi_threaded_processing(domain_dict, args, sleeptime, counter)
  File "/home/anaconda3/lib/python3.7/site-packages/trafilatura/cli_utils.py", line 254, in multi_threaded_processing
    bufferlist.append(domain_dict[domain].pop())
IndexError: pop from empty list

Anyone also run into this problem?

Fails to parse on mangled DOCTYPE

This HTML document fails to extract any content using current trafilatura: https://firstmonday.org/ojs/index.php/fm/article/download/10274/9729?inline=1

From experimentation, I believe this is due to the mangled DOCTYPE line:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" 2012"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
[...]

Note the misplaced 2012.

I've contacted the webmaster, and this might be more of an issue with a parsing library that trafilatura uses, but I wonder if this pattern of mangled HTML might be popular and could be tolerated. Particularly in older documents (pre modern web).

As some context for prioritization, this is not an urgent or blocking concern for my use of trafilatura, and I am using a work around for this specific situation.

Getting "ERROR: file too small" when trying to use trafilatura

Hello everybody,

I'm fairly new to using trafilatura. When I try to run " trafilatura -u 'insert random news article' " and execute it in the shell, I always get "ERROR: file too small". I currently use zsh as a shell and also tried it on bash with no success. I checked, that the latest version is installed. I'm currently working on MAC OS Catalina (10.15.7). I've never worked with the command line/shell before so I would really appreciate if somebody could help me with this issue

Trafilatura can't be loaded after installing it to local folder

Hi,

I just tried to install your awsome project to an local folder (pip install --target {path}/package trafilatura). After installing it I cant load it with from package import trafilatura:

---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

~\anaconda3\lib\site-packages\pkg_resources\__init__.py in get_provider(moduleOrReq)
    359     try:
--> 360         module = sys.modules[moduleOrReq]
    361     except KeyError:

KeyError: 'trafilatura'


During handling of the above exception, another exception occurred:

ModuleNotFoundError                       Traceback (most recent call last)

<ipython-input-37-57dd4ca0a626> in <module>
      1 from pprint import pprint
----> 2 from package import trafilatura
      3 import time
      4 
      5 trafilatura

~{path}\package\trafilatura\__init__.py in <module>
     14 import logging
     15 
---> 16 from .core import extract, process_record
     17 from .utils import fetch_url
     18 

~{path}\package\trafilatura\core.py in <module>
     17 
     18 # own
---> 19 from .external import justext_rescue, sanitize_tree, SANITIZED_XPATH, try_readability
     20 from .filters import content_fingerprint, duplicate_test, language_filter, text_chars_test
     21 from .htmlprocessing import (convert_tags, discard_unwanted,

~{path}\package\trafilatura\external.py in <module>
     36 from .settings import JUSTEXT_LANGUAGES, MANUALLY_STRIPPED
     37 from .utils import trim, HTML_PARSER
---> 38 from .xml import TEI_VALID_TAGS
     39 
     40 

~{path}\package\trafilatura\xml.py in <module>
     21 LOGGER = logging.getLogger(__name__)
     22 # validation
---> 23 TEI_SCHEMA = pkg_resources.resource_filename('trafilatura', 'data/tei-schema.pickle')
     24 TEI_VALID_TAGS = {'body', 'cell', 'code', 'del', 'div', 'fw', 'head', 'hi', 'item', \
     25                   'lb', 'list', 'p', 'quote', 'row', 'table'}

~\anaconda3\lib\site-packages\pkg_resources\__init__.py in resource_filename(self, package_or_requirement, resource_name)
   1143     def resource_filename(self, package_or_requirement, resource_name):
   1144         """Return a true filesystem path for specified resource"""
-> 1145         return get_provider(package_or_requirement).get_resource_filename(
   1146             self, resource_name
   1147         )

~\anaconda3\lib\site-packages\pkg_resources\__init__.py in get_provider(moduleOrReq)
    360         module = sys.modules[moduleOrReq]
    361     except KeyError:
--> 362         __import__(moduleOrReq)
    363         module = sys.modules[moduleOrReq]
    364     loader = getattr(module, '__loader__', None)

ModuleNotFoundError: No module named 'trafilatura'

Python code used:

from package import trafilatura
trafilatura

Note:
I'm using Python 3.7.6 with pip 20.0.2.

Edit:
A quick fix (for me) is replacing TEI_SCHEMA = pkg_resources.resource_filename('trafilatura', 'data/tei-schema.pickle') in "xml.py" with TEI_SCHMEA = './data/tei-schema.pickle' and
change line 11 in settings.py from from trafilatura import __version__ to from ..trafilatura import __version__

Extracting with include_images option causes subsequent extractions to fail

There's a bug with extraction using include_images option, it only works once and then extractions start to fail. The bug is in htmlprocessing.py, lines 46 to 50:

  if include_images is True:
      # Many websites have <img> inside <figure> or <picture> or <source> tag
      for element in ['figure', 'picture', 'source']:
          MANUALLY_CLEANED.remove(element)
      MANUALLY_STRIPPED.remove('img')

This bit of code will be repeated for all extractions, but since the tags have been removed on the first extraction, a ValueError will be thrown for all other extractions. I tested that adding existence checks solves the issue:

if include_images is True:
        # Many websites have <img> inside <figure> or <picture> or <source> tag
        for element in ['figure', 'picture', 'source']:
            if element in MANUALLY_CLEANED:
                MANUALLY_CLEANED.remove(element)
        if 'img' in MANUALLY_STRIPPED:
            MANUALLY_STRIPPED.remove('img')

Check the language, clarity and consistency of documentation

A short version of the documentation is available straight from Github (README.rst) while a more exhaustive one is present in the docs folder and online on trafilatura.readthedocs.io

Several problems could arise:

Non-idiomatic use of English (not quite fluent or natural)
Unclear or incomplete descriptions
Code examples that don't work
Typos in explanations or code sections
Outdated sections

Feel free to help with any of these, thanks!

Refactor code to provide a "keep-tags" option

Feature requests like in #38 and #48 deal with inclusion of particular HTML elements in the output.

To allow for easier inclusion and less hacky code it would be best to

refactor the code along a list of kept tags
ensure that occurring tags remain untouched in stripping and conversion steps
pass/expose this option to Python and CLI

Bug with LXML parser

On some documents Trafilatura 0.6.0 fails with this error:

... File "/usr/local/Caskroom/miniconda/base/envs/myenv/lib/python3.7/site-packages/trafilatura/external.py", line 123, in sanitize_tree tree = prune_html(tree) File "/usr/local/Caskroom/miniconda/base/envs/myenv/lib/python3.7/site-packages/trafilatura/htmlprocessing.py", line 63, in prune_html element.drop_tree() AttributeError: 'lxml.etree._Element' object has no attribute 'drop_tree'

A git blame on this line reveals this is new code that has been made 21 days ago in this revision:
74444d2

Note: I am using the latest version of lxml (4.6.1)

TypeError: expected string or bytes-like object

Me again :)

I'm trying to extract metadata and encountered this error. Dug a bit deeper and it was caused by metadata extractor. Particularly in this case extracting title.
In html <title> tag is empty - <title></title> and inside <head></head>. There is plenty of meta tags and js garbage also and I couldn't make short example to reproduce it but I found solution.

In extract_sitename:

def extract_sitename(tree):
    '''Extract the name of a site from the main title (if it exists)'''
    title_elem = tree.find('.//head/title')
    print (title_elem.text_content())

    if title_elem is not None:

        try:
            mymatch = re.search(r'^.*?[-|]\s+(.*)$', title_elem.text)
            if mymatch:
                return mymatch.group(1)
        except AttributeError:
            pass
    return None

it checks if title_elem is not None. My title_elem passes this and returns True
But when passed to regex raises this error.

I checked title_elem.text_content() and it is empty
I also checked type(title_elem.text) and shows that it is NoneType and is cause for the error.

To fix this, you have to check if title_elem.text is not None instead of just title_elem

ld-json extraction cruft

Hello @adbar,

This is not an issue per-se but it seems that when you extract metadata from ld-json, such as the author, for instance, you don't take into account the fact that it is sometimes stringified with python's infamous ensure_ascii keyword to default True. If we take this html page for instance: https://www.bvoltaire.fr/jean-sevillia-letat-francais-et-letat-algerien-doivent-reconnaitre-les-crimes-commis-des-deux-cotes/ the name "Jean Sévillia" is stringified as "Jean S\u00e9villia". This can be a drag sometimes.

I am not saying this should be addressed by trafilatura directly and I personally use this hacky function to get the job done (a replacer might be dangerous because of some erroneous html text containing badly stringified emojis missing parts of a surrogate combination I keep stumbling upon these days...), but I just wanted to open a discussion.

Keep output file name same as input file name

First of all, let me thank you for working on this amazing tool!

One quick feature request: is it possible to keep the output file name same as the input file names, when running on the command line? Also, I see this error # ERROR: file too small in command line but while running the API programmatically, it doesn't provide any meaningful message when the extraction is not successful. Can we fix it?

KeyError in metadata.py extract_catstags function

I caught the following exception:

...
  File "/usr/local/Caskroom/miniconda/base/envs/myenv/lib/python3.7/site-packages/trafilatura/core.py", line 586, in bare_extraction
    docmeta = extract_metadata(tree, url, date_extraction_params)
  File "/usr/local/Caskroom/miniconda/base/envs/myenv/lib/python3.7/site-packages/trafilatura/metadata.py", line 393, in extract_metadata
    metadata['categories'] = extract_catstags('category', tree)
  File "/usr/local/Caskroom/miniconda/base/envs/myenv/lib/python3.7/site-packages/trafilatura/metadata.py", line 331, in extract_catstags
    results.append(element.attrib['content'])
  File "src/lxml/etree.pyx", line 2479, in lxml.etree._Attrib.__getitem__
KeyError: 'content'

I use Python 3.7.9 and Trafilatura 0.6.1 (from Pypi)

Read settings from user-provided file

The configparser library would do the trick in order to override the variables in trafilatura/settings.py on user's request:

implement settings from .ini file
write tests
write documentation

Bypass catchas/cookies/consent windows?

Are there Python libraries which allow to bypass diverse consent mechanisms put in place by news outlets for the readers to allow cookies? It would be too cumbersome to develop a headless browser exclusively for trafilatura.

A good example would be the newspaper zeit.de. Related to #18.

Potential solutions:

headless browser with automatic clickling mechanism
use AMP-links

The output could be piped directly to trafilatura (in a terminal or via Python).

Strip links error

Thank you for that best stuff!!

A bit problem with parse text from https://gametarget.ru/mmofps/

<doc sitename="GameTarget.Ru" title="Онлайн шутеры 2020 - бесплатные игры MMOFPS на ПК" source="https://gametarget.ru/mmofps/" hostname="gametarget.ru" excerpt="В этом разделе сайта собраны клиентские онлайн шутеры, которые помогут вам избавиться от накопившейся усталости и агрессии, выплеснув все негативные эмоции на виртуальных противников по ту сторону экрана." categories="" tags="">
  <main>
    <div><p>Среди жанров многопользовательских игр MMOFPS занимают одну из лидирующих позиций, наряду с</p>MMORPG<p>и</p>MOBA<p>. И это неудивительно, ведь онлайн шутеры не требуют к себе много внимания, при этом развлекают игрока буквально с первых же секунд – никаких долгих вступлений и обучающих уровней: вы просто заходите в лобби, выбираете интересующий режим и спустя несколько мгновений оказываетесь в самой гуще сражения.</p><p>Еще один фактор, влияющий на популярность жанра ммофпс – высокое качество продукта. Среди онлайн шутеров на ПК бушует нешуточная конкуренция, поэтому разработчикам приходится выкладываться на все сто процентов, чтобы обеспечить игроков максимально яркими впечатлениями. Лучшие игры завлекают аудиторию красивой графикой, разнообразием режимов, близким к идеальному балансом, тщательно выверенным геймплеем, а также следованием всем современным трендам – например, наличием элементов выживания, ежедневных заданий или ультрамодных режимов вроде «Королевской битвы».</p><p>Так как жанр существует уже несколько десятков лет, в его рамках сформировалось множество поджанров и категорий. Шутеры классифицируются по модели распространения – платные и бесплатные онлайн шутеры, по расположению виртуальной камеры – от первого лица и от третьего лица, по геймплейным особенностям – тактические, реалистичные, кооперативные, снайперские и так далее, по сеттингу – фантастические, современные, постапокалиптические, военные. Потеряться в этом многообразии очень легко, поэтому при выборе онлайн игры для многочасовых сетевых сражений стоит довериться команде, которая хорошо разбирается в теме.</p><p>В нашем каталоге вы найдете лучшие мультиплеерные шутеры на ПК с подробным описанием для каждого из них. Зайдя на страницу игры, вы ознакомитесь с детальной информацией касательно сеттинга, режимов, классов и других особенностей, оцените визуальную составляющую в трейлерах и скриншотах, а также узнаете, можно ли скачать понравившийся шутер бесплатно. Мы постоянно актуализируем данные, так что будьте уверены – здесь вы получите самые правдивые и честные обзоры, которые, мы надеемся, помогут вам принять верное решение.</p></div>
  </main>
  <comments/>
</doc>

with <p>Среди жанров многопользовательских игр MMOFPS занимают одну из лидирующих позиций, наряду с</p>MMORPG<p>и</p>MOBA<p>

http://prntscr.com/us0qm3
http://prntscr.com/us0r2o

Links from source code are incorrectly replaced with paragraphs, which breaks the entire document.

Thank you for help!!!!!

v 0.5.2

missing <p>TEXT</p> from HTML after extract

I have a HTML-fragement (nested divs, p, ul).

html_fragement = "<div class="l-main-column">\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\n\t<div class="text-image-container ">\n<div class="text-image">\n<p>Mit dem KfW-Unternehmerkredit fördern wir Unter\xadnehmen sowie Frei\xadberufler, die seit mindestens <span class="u-nobr">5 Jahren</span> am Markt aktiv sind.</p>\n</div>\n</div><div class="text-image-container ">\n<div class="text-image">\n<p><strong>Das Förderprodukt kommt nicht in Frage: </strong></p><ul class="list list--bullets"> <li class="list__item"> für Existenzgründer und junge Unternehmen bis 5 Jahre. Diese unterstützen wir mit anderen Förder\xadprodukten, zum Beispiel mit dem <a class="link link--underline" href="https://www.kfw.de/inlandsfoerderung/Privatpersonen/Gr%C3%BCnden-Erweitern/F%C3%B6rderprodukte/ERP-Gr%C3%BCnderkredit-Universell-(073_074_075_076)/" title="Zur Infoseite zum ERP-Gründerkredit - Universell (073, 074, 075, 076)" data-section="contentcolumn"><span class="link__name"><span class="link__name-inner"><span class="link__name-text">ERP-Gründerkredit – Universell</span></span></span></a>. </li><li class="list__item"> für Unternehmen, die zum 31.12.2019 in Schwierig\xadkeiten waren, also vor Beginn der Coronakrise. </li><li class="list__item"> wenn Sie während der Kredit\xadlaufzeit Gewinn oder Dividende ausschütten. Möglich sind aber markt\xadübliche Ausschüttungen oder Entnahmen für Geschäfts\xadinhaber (natürliche Personen). </li> </ul>\n</div>\n</div>\n\t\n\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t</div>"

I wanted trafilatura v0.6.0 to extract the "visible" text from the fragement by trafilatura.extract(html_fragment, target_language='de'). There are two paragraphs and an unordered list. After the extract, I receive the text from second paragraph and text of unordered list. But the first paragraph is lost. why?

the output of the extract is:

In [16]: trafilatura.extract(a[0])
2020-12-15 18:01:01 [trafilatura.core] DEBUG: Taking all p-elements
2020-12-15 18:01:01 [readability.readability] DEBUG: Branch 34.250 .text-image-container>.text-image link density 0.000 -> 34.250
2020-12-15 18:01:01 [readability.readability] DEBUG: Branch 32.125 .l-main-column>.text-image-container link density 0.000 -> 32.125
2020-12-15 18:01:01 [readability.readability] DEBUG: Branch 32.390 .text-image-container>.text-image link density 0.063 -> 30.344
2020-12-15 18:01:01 [readability.readability] DEBUG: Branch 31.195 .l-main-column>.text-image-container link density 0.063 -> 29.225
2020-12-15 18:01:01 [readability.readability] DEBUG: Top 5 : 34.250 .text-image-container>.text-image
2020-12-15 18:01:01 [readability.readability] DEBUG: Top 5 : 32.125 .l-main-column>.text-image-container
2020-12-15 18:01:01 [readability.readability] DEBUG: Top 5 : 30.344 .text-image-container>.text-image
2020-12-15 18:01:01 [readability.readability] DEBUG: Top 5 : 29.225 .l-main-column>.text-image-container
2020-12-15 18:01:01 [readability.readability] DEBUG: Not removing div{01}>.text-image of length 125:  Mit dem KfW-Unternehmerkredit fördern w...
2020-12-15 18:01:01 [readability.readability] DEBUG: Branch 32.390 .text-image-container>.text-image link density 0.063 -> 30.344
2020-12-15 18:01:01 [readability.readability] DEBUG: Branch 31.195 .l-main-column>.text-image-container link density 0.063 -> 29.225
2020-12-15 18:01:01 [readability.readability] DEBUG: Top 5 : 30.344 .text-image-container>.text-image
2020-12-15 18:01:01 [readability.readability] DEBUG: Top 5 : 29.225 .l-main-column>.text-image-container
2020-12-15 18:01:01 [readability.readability] DEBUG: Not removing .text-image>ul.list.list--bullets of length 435:  für Existenzgründer und junge Unternehm...
2020-12-15 18:01:01 [readability.readability] DEBUG: Not removing div{01}>.text-image of length 475:  Das Förderprodukt kommt nicht in Frage:...
2020-12-15 18:01:01 [trafilatura.core] DEBUG: extracted length: 476 (algorithm) 165 (extraction)
2020-12-15 18:01:01 [trafilatura.core] INFO: using generic algorithm: None
2020-12-15 18:01:01 [trafilatura.core] INFO: not enough comments None
Out[16]: 'Das Förderprodukt kommt nicht in Frage:\n- für Existenzgründer und junge Unternehmen bis 5 Jahre. Diese unterstützen wir mit anderen Förderprodukten, zum Beispiel mit dem ERP-Gründerkredit – Universell.\n- für Unternehmen, die zum 31.12.2019 in Schwierigkeiten waren, also vor Beginn der Coronakrise.\n- wenn Sie während der Kreditlaufzeit Gewinn oder Dividende ausschütten. Möglich sind aber marktübliche Ausschüttungen oder Entnahmen für Geschäftsinhaber (natürliche Personen).'

The mssing paragraph is not removed by readability.readabilty.
Not removing div{01}>.text-image of length 125: Mit dem KfW-Unternehmerkredit fördern w...

I expected:

Mit dem KfW-Unternehmerkredit fördern wir Unternehmen sowie Freiberufler, die seit mindestens 5 Jahren am Markt aktiv sind.\nDas Förderprodukt kommt nicht in Frage:\n- für Existenzgründer und junge Unternehmen bis 5 Jahre. Diese unterstützen wir mit anderen Förderprodukten, zum Beispiel mit dem ERP-Gründerkredit – Universell.\n- für Unternehmen, die zum 31.12.2019 in Schwierigkeiten waren, also vor Beginn der Coronakrise.\n- wenn Sie während der Kreditlaufzeit Gewinn oder Dividende ausschütten. Möglich sind aber marktübliche Ausschüttungen oder Entnahmen für Geschäftsinhaber (natürliche Personen).

Thanks for your support. Love your work.

Can't get content from http site

With cocon.se site fetch_urlr result is None.

Test code

import trafilatura

if __name__ == '__main__':
  downloaded = trafilatura.fetch_url('http://cocon.se/')
  if downloaded is None:
    print('Error downloaded content')
    exit(1)
  result = trafilatura.extract(downloaded)
  print(result)

Result
Error downloaded content

The reason for this error is that no UA is specified in the header of the request.
With this code, It's OK

  response = requests.Session().get('http://cocon.se/', timeout=30, verify=False, allow_redirects=True, headers={'User-Agent': 'Mozilla/5.0'})
  downloaded = response.text
  result = trafilatura.extract(downloaded)
  print(result)

Thoroughly implement and test duplicate detection

Least-recently-used (LRU) cache
Maximum number of occurrences allowed?
Line / sentence / paragraph / document level?
Concurrency: thread-safety / multiprocessing

Library is redirecting stderr to /dev/null upon every call

If readbility fallback is activated, the Trafilatura library redirects stderr to /dev/null upon every call:

trafilatura/trafilatura/external.py

Line 63 in a56fb3e

with open(os.devnull, 'w') as devnull:

Within programs involving other libraries, this causes a host of side effects. E.g., generating a chart with seaborn imports ipython (a dependency of seaborn) which pre-checks upon initialization stdin, stdout and stderr and crashes because stderr is /dev/null. I have other side effects as well in other libraries, including disappearing logs (eg when logs settings are modified after calls to Trafilatura).

This redirection seems to have been necessary to prevent the readibility library to print out messages to stderr. A cursory reading of the current version of readibility seems to indicate it doesn't do that, it only emits proper logs.

Consequently, this redirect may be removed (to be tested).

AttributeError: module 'regex' has no attribute 'compile'

When i'm trying to import trafilatura, it throws an error that regex has no compile attribute, I have installed version trafilatura-0.8.0. I'm currently running it in jupyter notebook.

Image URLs in Wikipedia texts

Investigate potential issue with missing protocol in URLs as mentioned in #48.

For Json_format=True, Failed when xmltotxt generate None

TypeError: unsupported operand type(s) for +: 'int' and 'str'

I have got an html with this kind of string in it:

'<a href="" class="post-meta-date sh-default-color">2019 28 meh</a>\n        \n    \t\t\t\t\t\t\t</div>\n\n\t\t\t\t\t\t\t<a href="some_link/" class="post-title">\n\t\t\t\t\t\t\t\t<h2 itemprop="headline">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tsome random text?\t\t\t\t\t\t\t\t</h2>\n\t\t\t\t\t\t\t</a>'

When I try to do bare_extraction it fails with:

---------------------------------------------------------------------------
IllegalMonthError                         Traceback (most recent call last)
.../python3.7/site-packages/dateutil/parser/_parser.py in parse(self, timestr, default, ignoretz, tzinfos, **kwargs)
    654         try:
--> 655             ret = self._build_naive(res, default)
    656         except ValueError as e:

.../python3.7/site-packages/dateutil/parser/_parser.py in _build_naive(self, res, default)
   1237 
-> 1238             if cday > monthrange(cyear, cmonth)[1]:
   1239                 repl['day'] = monthrange(cyear, cmonth)[1]

.../python3.7/calendar.py in monthrange(year, month)
    123     if not 1 <= month <= 12:
--> 124         raise IllegalMonthError(month)
    125     day1 = weekday(year, month, 1)

IllegalMonthError: bad month number 28; must be 1-12

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-89-26cbca439b34> in <module>
      1 # downloaded = trafilatura.fetch_url('https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/')
      2 res = bare_extraction(con3,
----> 3                       include_formatting = True,
      4 #                       output_format ="xml"
      5                      )

.../python3.7/site-packages/trafilatura/core.py in bare_extraction(filecontent, url, no_fallback, include_comments, output_format, target_language, include_tables, include_images, include_formatting, include_links, deduplicate, date_extraction_params, with_metadata, max_tree_size, url_blacklist, config)
    674         # extract metadata if necessary
    675         if output_format != 'txt':
--> 676             docmeta = extract_metadata(tree, url, date_extraction_params)
    677             # cut short if extracted URL in blacklist
    678             if docmeta['url'] in url_blacklist:

.../python3.7/site-packages/trafilatura/metadata.py in extract_metadata(filecontent, default_url, date_config)
    384     date_config['url'] = metadata['url']
    385     try:
--> 386         metadata['date'] = find_date(tree, **date_config)
    387     # temporary fix for htmldate bug
    388     except UnicodeError:

.../python3.7/site-packages/htmldate/core.py in find_date(htmlobject, extensive_search, original_date, outputformat, url, verbose, min_date, max_date)
    629     for expr in DATE_EXPRESSIONS:
    630         dateresult = examine_date_elements(
--> 631             search_tree, expr, outputformat, extensive_search, min_date, max_date
    632         )
    633         if dateresult is not None:

.../python3.7/site-packages/htmldate/core.py in examine_date_elements(tree, expression, outputformat, extensive_search, min_date, max_date)
     92                 {ord(c): None for c in '\n\t\r'}
     93             ).strip()[:100])
---> 94         attempt = try_ymd_date(toexamine, outputformat, extensive_search, min_date, max_date)
     95         if attempt is not None:
     96             return attempt

.../python3.7/site-packages/htmldate/extractors.py in try_ymd_date(string, outputformat, extensive_search, min_date, max_date)
    387         return None
    388     # faster
--> 389     customresult = custom_parse(string, outputformat, extensive_search, min_date, max_date)
    390     if customresult is not None:
    391         return customresult

.../python3.7/site-packages/htmldate/extractors.py in custom_parse(string, outputformat, extensive_search, min_date, max_date)
    304             # speed-up by ignoring time zone info if ciso8601 is installed
    305             else:
--> 306                 result = parse_datetime_as_naive(string)
    307             if date_validator(result, outputformat, earliest=min_date, latest=max_date) is True:
    308                 LOGGER.debug('parsing result: %s', result)

.../python3.7/site-packages/dateutil/parser/_parser.py in parse(timestr, parserinfo, **kwargs)
   1372         return parser(parserinfo).parse(timestr, **kwargs)
   1373     else:
-> 1374         return DEFAULTPARSER.parse(timestr, **kwargs)
   1375 
   1376 

.../python3.7/site-packages/dateutil/parser/_parser.py in parse(self, timestr, default, ignoretz, tzinfos, **kwargs)
    655             ret = self._build_naive(res, default)
    656         except ValueError as e:
--> 657             six.raise_from(ParserError(e.args[0] + ": %s", timestr), e)
    658 
    659         if not ignoretz:

TypeError: unsupported operand type(s) for +: 'int' and 'str'

Used the same html string as input for jusText and it worked. Not sure where exactly is bug.

Using headline instead of name in JSON-LD metadata

Trafilatura extracts some metadata from JSON-LD tag if it is available. In particular, it tries to search for the title in the "headline" property of the JSON-LD tag, but looks like the headline is not necessarily the title. For example, look at this wikipedia page: https://en.m.wikipedia.org/wiki/Semantic_satiation
The JSON-LD is:

{
"@context":"https:\/\/schema.org","@type":"Article",
"name":"Semantic satiation",
"url":"https:\/\/en.wikipedia.org\/wiki\/Semantic_satiation",
"sameAs":"http:\/\/www.wikidata.org\/entity\/Q226007",
"mainEntity":"http:\/\/www.wikidata.org\/entity\/Q226007",
"author":{"@type":"Organization","name":"Contributors to Wikimedia projects"},
"publisher":{"@type":"Organization","name":"Wikimedia Foundation, Inc.","logo":{"@type":"ImageObject","url":"https:\/\/www.wikimedia.org\/static\/images\/wmf-hor-googpub.png"}},
"datePublished":"2006-07-12T09:27:14Z",
"dateModified":"2020-08-31T23:55:26Z",
"headline":"psychological phenomenon in which repetition causes a word to temporarily lose meaning for the listener"
}

Most of the wikipedia pages are like this.

The title of the page is in the "name" property, and the "headline" property contains a short tagline instead. So trafilatura gives the tagline instead of the title as the title of the page. It probably makes sense to search for the "name" property first? Though it would be hard to extract with a regex: "name" also appears in subfields, like in the "author" property above, so would need to parse the json properly.

There was even a proposal to get rid of the headline property and replace it with "name" or with "title": schemaorg/schemaorg#205

List of smaller extraction bugs (text & metadata)

I have mostly tested trafilatura on a set of English, German and French web pages I had run into by surfing or during web crawls. There are definitely further web pages and cases in other languages for which the extraction doesn't work so far.

Corresponding bug reports can either be filed as a list in an issue like this one or in the code as XPath expressions in xpaths.py (see BODY_XPATH and COMMENTS_XPATH lists).

Thanks!

AttributeError: 'NoneType' object has no attribute 'tail'

extract API throws this error while trying to extract content:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-333-f77dadd7d5ab> in <module>
----> 1 a = trafilatura.extract(downloaded)
      2 a

~/opt/anaconda2/envs/permpressenv/lib/python3.6/site-packages/trafilatura/core.py in extract(filecontent, url, record_id, no_fallback, include_comments, output_format, csv_output, xml_output, tei_output, tei_validation, target_language, include_tables, include_formatting, date_extraction_params)
    618 
    619     # extract content
--> 620     postbody, temp_text, len_text, sure_thing = extract_content(cleaned_tree, include_tables)
    621 
    622     # compare if necessary

~/opt/anaconda2/envs/permpressenv/lib/python3.6/site-packages/trafilatura/core.py in extract_content(tree, include_tables)
    340         # print(html.tostring(subtree, pretty_print=True, encoding='unicode'))
    341         # extract content
--> 342         processed_elems = [handle_textelem(e, potential_tags) for e in subtree.xpath('.//*')]
    343         # list(filter(None.__ne__, processed_elems))
    344         result_body.extend([e for e in processed_elems if e is not None])

~/opt/anaconda2/envs/permpressenv/lib/python3.6/site-packages/trafilatura/core.py in <listcomp>(.0)
    340         # print(html.tostring(subtree, pretty_print=True, encoding='unicode'))
    341         # extract content
--> 342         processed_elems = [handle_textelem(e, potential_tags) for e in subtree.xpath('.//*')]
    343         # list(filter(None.__ne__, processed_elems))
    344         result_body.extend([e for e in processed_elems if e is not None])

~/opt/anaconda2/envs/permpressenv/lib/python3.6/site-packages/trafilatura/core.py in handle_textelem(element, potential_tags)
    293         if element.tail is not None and not element.tail.isspace():
    294             new_element = etree.Element('p')
--> 295             new_element.text = process_node(element).tail
    296             # new_element.text = handle_textnode(element, comments_fix=False).tail
    297     elif element.tag == 'hi':

AttributeError: 'NoneType' object has no attribute 'tail'

Interestingly, contents from the same HTML file can be extracted with html2text API. For reproducing this error, try this URL: http://sibenlab.blogspot.com/2018/06/sibenlab-privacy-policy.html

Only one author extracted, even when there are multiple

Example article: https://www.nytimes.com/2020/10/19/us/politics/trump-ads-biden-election.html

This is authored by Maggie Haberman, Shane Goldmacher and Michael Crowley, but trafilatura will only show the first one. They are all in the JSON-LD so I think they should all be extracted, and author should be an array.

Replace requests with bare urllib3

The current use of requests sessions in cli_utils.py doesn't appear to be thread-safe (psf/requests#2766).

The full functionality of the module isn't really needed here and a change would help reducing the total number of dependencies as mentioned in #41.

Different extraction result on the same input

I tried to run the following extraction snippet on the same HTML input multiple times, but every 4th extraction returned shorter extraction.

link = '...'
html = scrape(link)
prev_extraction = None

for x in range(10):
    extraction = trafilatura.extract(html, include_comments=False,
        include_tables=False, no_fallback=True, target_language='en')
    if prev_extraction:
        if prev_extraction != extraction:
            print('Extraction looks weird!')
    prev_extraction = extraction

Is there any parameter I should be using or is this a bug to be fixed?
Thanks!

OverflowError: signed integer is greater than maximum

Traceback (most recent call last):
File "indexer.py", line 53, in
content_trafilatura = trafilatura.extract(document, json_output=True, with_metadata=False, include_tables=False, deduplicate=True, include_comments=False)
File "/Users/luca/enviroments/3.7/lib/python3.7/site-packages/trafilatura/core.py", line 684, in extract
max_tree_size=max_tree_size, url_blacklist=url_blacklist
File "/Users/luca/enviroments/3.7/lib/python3.7/site-packages/trafilatura/core.py", line 586, in bare_extraction
docmeta = extract_metadata(tree, url, date_extraction_params)
File "/Users/luca/enviroments/3.7/lib/python3.7/site-packages/trafilatura/metadata.py", line 367, in extract_metadata
metadata['date'] = find_date(tree, **date_config)
File "/Users/luca/enviroments/3.7/lib/python3.7/site-packages/htmldate/core.py", line 605, in find_date
original_date, min_date, max_date)
File "/Users/luca/enviroments/3.7/lib/python3.7/site-packages/htmldate/core.py", line 124, in examine_header
headerdate = tryfunc(elem.get('content'))
File "/Users/luca/enviroments/3.7/lib/python3.7/site-packages/htmldate/extractors.py", line 385, in try_ymd_date
customresult = custom_parse(string, outputformat, extensive_search, min_date, max_date)
File "/Users/luca/enviroments/3.7/lib/python3.7/site-packages/htmldate/extractors.py", line 302, in custom_parse
result = parse_datetime_as_naive(string)
File "/Users/luca/enviroments/3.7/lib/python3.7/site-packages/dateutil/parser/_parser.py", line 1374, in parse
return DEFAULTPARSER.parse(timestr, **kwargs)
File "/Users/luca/enviroments/3.7/lib/python3.7/site-packages/dateutil/parser/_parser.py", line 655, in parse
ret = self._build_naive(res, default)
File "/Users/luca/enviroments/3.7/lib/python3.7/site-packages/dateutil/parser/_parser.py", line 1241, in _build_naive
naive = default.replace(**repl)
OverflowError: signed integer is greater than maximum