⚡ Web | ✍ Blog | 🐦 Twitter | 🎞 Youtube | ☕ Coffee
🔭 Currently working on gathering texts on the Web and detecting word trends
🖩 First programs written on a TI-83 Plus in TI-BASIC
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Home Page: https://trafilatura.readthedocs.io
License: Apache License 2.0
Hi!
Great library! However, I am trying to figure out whether we should switch away from readability, on which this builds.
Apart from manually checking the quality on our corpus, there is no easy way for me to compare performance.
Would you be able to explain the differences between the actual text-extraction approaches?
Thanks in advance!
D
Case: in an HTML page, the main text is placed in paragraphs contained in a DIV. A few tags further on, outside the DIV, some h2 headings appear:
...
<div>
<p>Lorem ipusm dolor</p>
<p>sit amet</p>
</div>
<h2>Hello World</h2>
<h2>and a few other texts</h2>
...
Unexpected behaviour: in the JSON response, the fields "text" and "raw-text" consist of the content of the h2 tags, followed by the content of the paragraphs p that precede them in the HTML stream:
"raw-text" : "Hello World and a few other texts Lorem ipusm dolor sit amet"
When using include_formatting for plain text, I'm not seeing any formatting (bold, italics, etc.). The terminal I'm using supports it. Is this by design, or a bug?
I tried both the standalone version and using it as a library with trafilatura.extract(downloaded, include_formatting=True).
The parameters for date extraction are hard-coded within extract_metadata(). In order to get a different date format (ISO 8601 in my case) I need to call find_date() directly, and then I call extract_metadata() anyway to capture the other bits. This means I'm running the date parsing twice, which I would like to optimize away.
Here is what I am doing:
from htmldate import find_date
from trafilatura.metadata import extract_metadata
from trafilatura.utils import load_html

tree = load_html(unicode_body)
publish_date = find_date(tree, extensive_search=True, url=doc_url, outputformat='%Y-%m-%d %H:%M:%S')
meta = extract_metadata(unicode_body)
if meta.author:
    author = meta.author
It would be nicer to have this (or equivalent):
meta = extract_metadata(unicode_body, extensive_search=True, url=doc_url, outputformat='%Y-%m-%d %H:%M:%S')
if meta.author:
    author = meta.author
if meta.date:
    publish_date = meta.date
Hey, I am using trafilatura to get the content of newspapers. Is it possible to bypass captchas within trafilatura? For instance, Bloomberg returns "To continue, please click the box below to let us know you're not a robot. Please make sure your browser supports JavaScript and cookies and that you are not blocking them from loading. For more information you can review..." when I try to access a news article.
Specific encoding error messages from libxml2 are sometimes printed to stdout, independently of the logging library.
After investigation, this seems to be due to a print call here:
trafilatura/trafilatura/xml.py
Line 158 in bdde794
It should be replaced with a logging.warning call. I can provide a patch if this is useful.
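The suggested change is straightforward; as a generic sketch (the function name is illustrative, not the library's actual code), the print call would become:

```python
import logging

LOGGER = logging.getLogger(__name__)

def report_encoding_issue(message):
    # previously: print(message); routing through logging lets callers
    # silence or redirect the message like any other library output
    LOGGER.warning('encoding issue: %s', message)
```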
The goal is to modify the internal subclass LXMLDocument so as to avoid converting the output back from a string to an LXML tree:
trafilatura/trafilatura/external.py
Line 56 in 3b4cb19
The function readability.Document.get_clean_html could be of interest if added to the custom class LXMLDocument(Document); the question is how.
I can't find a way to extract text while preserving the <a href> tags in it (probably replacing them with link tags; the output type would be XML). It would be great to have this. If I were to help add it, where should I look?
Hello @adbar, thanks for your tremendous work on the library. Do you know if there is a way to install and then import the library so that it only loads the utilities related to raw content extraction from an HTML string? If not, is there any way we can discuss this particular topic and see if I could help you implement it? My use case is basically the following: I have a CLI tool that currently relies on dragnet, and I would like to jump ship and adopt trafilatura. My issue is that I don't want to install the network-related dependencies listed in your setup.py (notably requests and tldextract), because they clash with some of my dependencies, and I have my own means of downloading things, dealing with URLs, etc.
Have a good day,
Hi, I just discovered this lib, and it's quickly becoming one of my favorites due to how quickly and cleanly it works!
I wanted to bring up an issue/potential fix for sitemap parsing, with the end goal of extracting all the urls from the sitemap.
Using the URL https://hubspot.com as the example.
Using the native method:
from trafilatura import sitemaps, feeds
feed_url = "https://hubspot.com"
rss_list = feeds.find_feed_urls(feed_url, target_lang='en')
print(rss_list)
"""
returns
WARNING:root:not a valid XML sitemap: https://hubspot.com/sitemap.xml
WARNING:root:not a sitemap: https://hubspot.com/sitemap.xml
[] # returns empty list
"""
# then if searching by sitemap_search function
sitemap_list = sitemaps.sitemap_search(feed_url, target_lang='en')
print(sitemap_list)
"""
returns
ERROR:trafilatura.utils:not a 200 response: 404 for URL https://hubspot.com/sitemap_news.xml
WARNING:root:not a sitemap: https://hubspot.com/sitemap_news.xml
ERROR:trafilatura.utils:not a 200 response: 404 for URL https://hubspot.com/sitemap_index.xml
WARNING:root:not a sitemap: https://hubspot.com/sitemap_index.xml
WARNING:trafilatura.sitemaps:Diverging domain names: hubspot.com sitemaps.org
WARNING:trafilatura.sitemaps:Diverging domain names: hubspot.com google.com
WARNING:trafilatura.sitemaps:Diverging domain names: hubspot.com google.com
WARNING:trafilatura.sitemaps:Diverging domain names: hubspot.com hubspotusercontent00.net
WARNING:trafilatura.sitemaps:Diverging domain names: hubspot.com hubspotusercontent00.net
WARNING:trafilatura.sitemaps:Diverging domain names: hubspot.com hubspotusercontent00.net
WARNING:trafilatura.sitemaps:Diverging domain names: hubspot.com hubspotusercontent00.net
WARNING:trafilatura.sitemaps:Diverging domain names: hubspot.com hubspotusercontent00.net
WARNING:root:not a valid XML sitemap: https://hubspot.com/sitemap.xml.gz
WARNING:root:not a sitemap: https://hubspot.com/sitemap.xml.gz
['https://www.hubspot.com/careers/error</loc><lastmod>2019-11-06</lastmod></url><url><loc>',
'https://www.hubspot.com/case-studies/incenteev</loc><lastmod>2021-03-05</lastmod></url><url><loc>', ...]
# returns unclean list of results
"""
However, the first sitemap found by find_feed_urls is valid; it's just missing the encoding declaration that the function looks for.
So what I had done instead was:
from trafilatura import bare_extraction, fetch_url
feed_url = 'https://hubspot.com/sitemap.xml'
downloaded = fetch_url(feed_url)
result = bare_extraction(downloaded, output_format='python', include_links=True)
# probably better with regex
links_list = result['text'].split('https://')
links_list = [('https://' + l).strip() for l in links_list if l]
This returns a list of over 1,000 URLs. It would be very useful if find_feed_urls could return sitemaps that respond with a 200, regardless of whether the function is able to parse them successfully (perhaps behind an argument).
Thanks for this awesome lib!
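The "probably better with regex" cleanup mentioned above could look like this (plain stdlib; the pattern is a rough heuristic and the sample text is made up):

```python
import re

# extracted sitemap text with URLs run together, as in the example above
text = (
    'https://www.hubspot.com/careers '
    'https://www.hubspot.com/case-studies/incenteev'
)

# pull out anything that looks like an http(s) URL, stopping at
# whitespace, quotes, or angle brackets
URL_RE = re.compile(r'https?://[^\s<>"\']+')
links_list = URL_RE.findall(text)
```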
On some web pages, trafilatura crashes with this error:
Traceback (most recent call last):
File "/Users/raphaelgeronimi/Local/Sabrina/python_libraries/extractpost.py", line 99, in _extract_html_with
include_comments=False)
File "/usr/local/Caskroom/miniconda/base/envs/myenv/lib/python3.7/site-packages/trafilatura/core.py", line 649, in extract
max_tree_size=max_tree_size, url_blacklist=url_blacklist
File "/usr/local/Caskroom/miniconda/base/envs/myenv/lib/python3.7/site-packages/trafilatura/core.py", line 562, in bare_extraction
docmeta = extract_metadata(tree, url, date_extraction_params)
File "/usr/local/Caskroom/miniconda/base/envs/myenv/lib/python3.7/site-packages/trafilatura/metadata.py", line 371, in extract_metadata
metadata['sitename'] = extract_sitename(tree)
File "/usr/local/Caskroom/miniconda/base/envs/myenv/lib/python3.7/site-packages/trafilatura/metadata.py", line 302, in extract_sitename
mymatch = re.search(r'^.*?[-|]\s+(.*)$', tree.find('.//head/title').text)
File "/usr/local/Caskroom/miniconda/base/envs/myenv/lib/python3.7/re.py", line 185, in search
return _compile(pattern, flags).search(string)
TypeError: expected string or bytes-like object
Reading the code, it seems that the reason is that in metadata.py line 302, tree.find('.//head/title').text can be None, which triggers an exception in the search method of the re library:
mymatch = re.search(r'^.*?[-|]\s+(.*)$', tree.find('.//head/title').text)
The solution would be to check for None (which could alter the logic afterwards).
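A defensive variant of the failing call, as a standalone sketch (a hypothetical helper, not the library's actual code):

```python
import re

def extract_sitename_from_title(title_text):
    # guard against a missing or empty <title> before calling re.search,
    # which raises TypeError on None
    if title_text is None:
        return None
    match = re.search(r'^.*?[-|]\s+(.*)$', title_text)
    return match.group(1) if match else None
```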
Hello,
Thanks for your beautiful and powerful project. I'm testing some websites with trafilatura 0.6.0 on Python 3.8.
My test:
import trafilatura
from trafilatura.core import bare_extraction
downloaded = trafilatura.fetch_url('https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/')
result = bare_extraction(downloaded, include_formatting=False, with_metadata=True)
print(result)
The results:
({'title': None, 'author': None, 'url': None, 'hostname': None, 'description': None, 'sitename': None, 'date': None, 'categories': None, 'tags': None, 'fingerprint': None, 'id': None}, 'Leader spotlight: Erin Spiceland Every March we recognize the women who have shaped history—and now, we’re taking a look forward. From driving software development in large companies to maintaining thriving open source communities, we’re spending Women’s History Month with women leaders who are making history every day in the tech community. Erin Spiceland is a Software Engineer for SpaceX. Born and raised in rural south Georgia, she is a Choctaw and Chickasaw mother of two now living in downtown Los Angeles. Erin didn’t finish college—she’s a predominantly self-taught software engineer. In her spare time, she makes handmade Native American beadwork and regalia and attends powwows. How would you summarize your career (so far) in a single sentence? My career has been a winding road through periods of stimulation and health as well as periods of personal misery. During it all, I’ve learned a variety of programming languages and technologies while working on a diverse array of products and services. I’m a domestic abuse survivor and a Choctaw bisexual polyamorous woman. I’m so proud of myself that I made it this far considering where I came from. What was your first job in tech like? In 2007, I had a three-year-old daughter and I was trying to finish my computer science degree one class at a time, all while keeping my house and family running smoothly. I found the math classes exciting and quickly finished my math minor, leaving only computer science classes. I was looking at about five years before I would graduate. Then, my husband at the time recommended me for an entry software developer position at a telecom and digital communications company. 
When faced with the choice between an expensive computer science degree and getting paid to do what I loved, I dropped out of college and accepted the job. I was hired to work on internal tooling, and eventually, products. I did a lot of development on product front-ends, embedded network devices, and a distributed platform-as-a-service. I learned Java/JSP, Python, JavaScript/CSS, Node.js, as well as MySQL, PostgreSQL, and distributed systems architecture. It was an intense experience that required a lot of self-teaching, asking others for help, and daycare, but it set me up for my later successes. What does leadership mean to you in your current role? “Leadership is about enabling those below, above, and around you to be at their healthiest and most effective so that all of you can accurately understand your surroundings, make effective plans and goals for the future, and achieve those goals.” I appreciate and admire technical, effective leaders who care for their reports as humans, not as lines on a burndown chart, and forego heavy-handed direction in favor of communication and mutual dialogue. I think it’s as important for a leader to concern herself with her coworkers’ personal well-being as it is for her to direct their performance. What’s the biggest career risk you’ve ever taken? What did you learn from that experience? Last year I took a pay cut to move from a safe, easy job where I had security to work in a language I hadn’t seen in years and with systems more complicated than anything I’d worked with before. I moved from a place where I had a huge four bedroom house to a studio apartment that was twice the price. I moved away from my children, of who I share custody with my ex-husband. We fly across the U.S. to see each other now. I miss my children every day. However, I get to be a wonderful role model for them. 
“I get to show my children that a Native woman who grew up in poverty, lost her mother and her culture, and who didn’t finish college can learn, grow, and build whatever career and life she wants.” What are you looking forward to next? I can’t wait to wake up every day with my partner who loves me so much. I’m looking forward to showing my children exactly how far they can go. I’m excited to keep exploring Los Angeles. “I expect to learn so much more about software and about life, and I want to experience everything.” Want to know more about Erin Spiceland? Follow them on GitHub or Twitter. Want to learn more about featured leaders for Women’s History Month? Read about: Laura Frank Tacho, Director of Engineering at CloudBees Rachel White, Developer Experience Lead at American Express Kathy Pham, Computer Scientist and Product Leader at Mozilla and Harvard Heidy Khlaaf, Research Consultant at Adelard LLP Check back in soon—we’ll be adding new interviews weekly throughout March.', <Element body at 0x10680a280>, <Element body at 0x1067af080>)
So, no metadata is returned.
Also, I added an XPath expression to metaxpaths.py and rebuilt your code. I'm sure that //div[contains(@class, "post__categories")]//li//a matches a category on the page https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/, but no category is returned.
categories_xpaths = [
"""//div[starts-with(@class, 'post-info') or starts-with(@class, 'postinfo') or
starts-with(@class, 'post-meta') or starts-with(@class, 'postmeta') or
starts-with(@class, 'meta') or starts-with(@class, 'entry-meta') or starts-with(@class, 'entry-info') or
starts-with(@class, 'entry-utility') or starts-with(@id, 'postpath')]//a""",
"//p[starts-with(@class, 'postmeta') or starts-with(@class, 'entry-categories') or @class='postinfo' or @id='filedunder']//a",
"//footer[starts-with(@class, 'entry-meta') or starts-with(@class, 'entry-footer') or starts-with(@class, 'post-info')]//a",
'//*[(self::li or self::span)][@class="post-category" or starts-with(@class, "post__categories") or @class="postcategory" or @class="entry-category"]//a',
'//header[@class="entry-header"]//a',
'//div[@class="row" or @class="tags"]//a',
'//div[contains(@class, "post__categories")]//li//a',
]
Another question: could I get the content of the article with its HTML formatting (i.e., without the tags being cleaned from the content)?
Please help me, and thanks for your support!
Hey,
I noticed that a fair number of extractions are failing with a seemingly unnecessary TypeError.
Steps to reproduce:
downloaded = trafilatura.fetch_url('https://fortelabs.co/blog/para/')
trafilatura.extract(downloaded, include_links=True)
gives:
TypeError Traceback (most recent call last)
<ipython-input-28-33fce83a7b3f> in <module>
----> 1 trafilatura.extract(downloaded, include_formatting=False, include_links=True)
/usr/local/lib/python3.8/site-packages/trafilatura/core.py in extract(filecontent, url, record_id, no_fallback, include_comments, output_format, tei_validation, target_language, include_tables, include_images, include_formatting, include_links, deduplicate, date_extraction_params, with_metadata, max_tree_size, url_blacklist, settingsfile, config)
776 url_blacklist = set()
777 # extraction
--> 778 docmeta = bare_extraction(
779 filecontent, url=url, no_fallback=no_fallback,
780 include_comments=include_comments, output_format=output_format,
/usr/local/lib/python3.8/site-packages/trafilatura/core.py in bare_extraction(filecontent, url, no_fallback, include_comments, output_format, target_language, include_tables, include_images, include_formatting, include_links, deduplicate, date_extraction_params, with_metadata, max_tree_size, url_blacklist, config)
682
683 # extract content
--> 684 postbody, temp_text, len_text, sure_thing = extract_content(cleaned_tree, include_tables, include_images, include_links, deduplicate, config)
685
686 # compare if necessary
/usr/local/lib/python3.8/site-packages/trafilatura/core.py in extract_content(tree, include_tables, include_images, include_links, deduplicate, config)
382 # list(filter(None.__ne__, processed_elems))
383 result_body.extend([e for e in
--> 384 [handle_textelem(e, potential_tags, deduplicate, config) for e in subtree.xpath('.//*')]
385 if e is not None])
386 # remove trailing titles
/usr/local/lib/python3.8/site-packages/trafilatura/core.py in <listcomp>(.0)
382 # list(filter(None.__ne__, processed_elems))
383 result_body.extend([e for e in
--> 384 [handle_textelem(e, potential_tags, deduplicate, config) for e in subtree.xpath('.//*')]
385 if e is not None])
386 # remove trailing titles
/usr/local/lib/python3.8/site-packages/trafilatura/core.py in handle_textelem(element, potential_tags, dedupbool, config)
287 new_element = handle_titles(element)
288 elif element.tag == 'p':
--> 289 new_element = handle_paragraphs(element, potential_tags, dedupbool, config)
290 elif element.tag == 'lb':
291 if text_chars_test(element.tail) is True:
/usr/local/lib/python3.8/site-packages/trafilatura/core.py in handle_paragraphs(element, potential_tags, dedupbool, config)
171 newsub.set('rend', child.get('rend'))
172 elif child.tag == 'ref':
--> 173 newsub.set('target', child.get('target'))
174 # handle line breaks
175 elif child.tag == 'lb':
src/lxml/etree.pyx in lxml.etree._Element.set()
src/lxml/apihelpers.pxi in lxml.etree._setAttributeValue()
src/lxml/apihelpers.pxi in lxml.etree._utf8()
TypeError: Argument must be bytes or unicode, got 'NoneType'
I believe this error could be handled in a meaningful way (perhaps caught and skipped?), but maybe I'm missing something?
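A possible guard at the failing line, shown here as a self-contained lxml snippet (the element names mirror the traceback; this is a sketch, not the actual patch):

```python
from lxml import etree

newsub = etree.Element('p')
child = etree.Element('ref')  # a link element that happens to lack a target

# lxml's set() raises TypeError when given None, so only copy the
# attribute when it is actually present
target = child.get('target')
if target is not None:
    newsub.set('target', target)
```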
URL structure could be https://web.archive.org/web/20/original-URL
Will you please start tagging your master branch to match the releases published to PyPI? This would help immensely when it comes to finding the right code for debugging issues in a specific version.
Nice library by the way. I'm putting it to use.
article_trafilatura = trafilatura.bare_extraction(trafilatura.fetch_url(url), include_images=True)
When I set include_images=True, I get image src values from the article, but it seems to miss the top article image src. Is it meant to be this way, is it not implemented yet, or am I doing something wrong? I have tried passing different URLs from different websites, but I still cannot find a way to extract that image.
Trafilatura is not including the ol tag on the following page, e.g. where the first list element starts with "Der Arbeitgeber finanziert Ihre bAV allein":
https://www.finanztip.de/betriebliche-altersvorsorge/
It's also skipping the h3 titles on the aforementioned page.
PS: Not sure if you want issues with the result precision to be reported here?
Hey @adbar
There are issues with the chardet dependency when using the new version. I run pip-compile with the following items:
beautifulsoup4
Flask
pytest
gunicorn
readability-lxml
trafilatura==0.8.0
backoff
requests
urllib3
google-cloud-storage
google-cloud-bigquery
google-cloud-firestore
google-cloud-error-reporting
The error:
Could not find a version that matches chardet<4,>=3.0.2,>=3.0.4,>=4.0.0 (from -r requirements.in (line 4))
Tried: 1.0, 1.0.1, 1.1, 2.1.1, 2.2.1, 2.2.1, 2.3.0, 2.3.0, 3.0.0, 3.0.0, 3.0.1, 3.0.1, 3.0.2, 3.0.2, 3.0.3, 3.0.3, 3.0.4, 3.0.4, 4.0.0, 4.0.0
There are incompatible versions in the resolved dependencies:
chardet (from -r requirements.in (line 4))
chardet>=4.0.0 (from htmldate==0.8.0->trafilatura==0.7.0->-r requirements.in (line 7))
chardet>=3.0.4 (from trafilatura==0.7.0->-r requirements.in (line 7))
chardet<4,>=3.0.2 (from requests==2.23.0->-r requirements.in (line 9))
chardet (from readability-lxml==0.8.1->-r requirements.in (line 6))
If I do not pin trafilatura, then it resolves fine, but we get 0.6.0.
Thanks in advance!
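One possible workaround, assuming requests is the source of the chardet<4 pin: requests 2.26+ switched to charset-normalizer on Python 3 and no longer pins chardet there, so bumping it in requirements.in may let the resolver succeed (an untested sketch, to be verified against your dependency set):

```
# requirements.in (excerpt)
trafilatura==0.8.0
requests>=2.26  # no longer pins chardet<4 on Python 3
```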
Hi all, I presume this is an easy fix. I'm loading a local copy of this file:
https://lisn-tests.netlify.app/rich-content.html
loaded using Path(…).read_text().
Using bare_extraction, I get the output below, which appears to suggest that trafilatura with these options is skipping headers and bullets and discarding images. What have I done wrong to make it behave like this?
"title": "This is heading 1",
"author": null,
"hostname": null,
"date": null,
"categories": "",
"tags": "",
"fingerprint": "bnP8wg6PD0dg9QA9O/mFqg7dSEk=",
"id": null,
"raw-text": "Paragraph 1. What does this make of a bulleted list? This is quoted textAn image: This is bolded text. This is bolded text. This is a paragraph with some bold text in the middle. Table heading 1 Table heading 2 Table heading 1 Table heading 2 Table heading 1 Table heading 2",
"source": null,
"source-hostname": null,
"excerpt": null,
"text": "Paragraph 1. What does this make of a bulleted list?\nThis is quoted textAn image:\nThis is bolded text.\nThis is bolded text.\nThis is a paragraph with some bold text in the middle.\n|Table heading 1||Table heading 2|\n|Table heading 1||Table heading 2|\n|Table heading 1||Table heading 2|",
"comments": ""
}
===
[
"Paragraph 1.",
"What does this make of a bulleted list?",
"This is quoted textAn image:",
"This is bolded text.",
"This is bolded text.",
"This is a paragraph with some bold text in the middle.",
"|Table heading 1||Table heading 2|",
"|Table heading 1||Table heading 2|",
"|Table heading 1||Table heading 2|"
]
I have been using trafilatura to extract text from HTML pages. I have noticed that the text directly following an opening unordered-list tag is sometimes not extracted: the list items are extracted, but not that surrounding text.
<ul>Description of the list:
<li>List item 1</li>
<li>List item 2</li>
<li>List item 3</li>
</ul>
In the previous code example, the list items would be extracted, but "Description of the list:" would not make it into the text file. This is probably due to incorrect HTML coding practices, but I'm wondering whether trafilatura can capture that text.
Hi,
I tried to crawl texts from a list of 100 URLs experimentally. Every time, it stops at the 40th with the following error:
$ trafilatura --inputfile linkliste-gefiltert.txt --outputdir ausgabe/ --backup-dir html-quellen/
Traceback (most recent call last):
File "/home/anaconda3/bin/trafilatura", line 8, in <module>
sys.exit(main())
File "/home/anaconda3/lib/python3.7/site-packages/trafilatura/cli.py", line 151, in main
process_args(args)
File "/home/anaconda3/lib/python3.7/site-packages/trafilatura/cli.py", line 166, in process_args
url_processing_pipeline(args, input_urls, SLEEP_TIME)
File "/home/anaconda3/lib/python3.7/site-packages/trafilatura/cli_utils.py", line 306, in url_processing_pipeline
multi_threaded_processing(domain_dict, args, sleeptime, counter)
File "/home/anaconda3/lib/python3.7/site-packages/trafilatura/cli_utils.py", line 254, in multi_threaded_processing
bufferlist.append(domain_dict[domain].pop())
IndexError: pop from empty list
Has anyone else run into this problem?
My list looks like this (a subset of the corona corpus at DWDS):
http://schmid.welt.de/2020/03/18/corona-gesammelte-apercus-3
http://schmid.welt.de/2020/04/04/krieg-und-virus-corona-apercus-5
http://www.bmz.de/de/themen/corona/index.html
http://www.chiemgauseiten.de/2020/04/15/zirkus-corona-tag-33-wann-wird-s-mal-wieder-richtig-schule
http://www.chiemgauseiten.de/2020/04/19/zirkus-corona-tag-38-der-hasenpalast
http://www.chiemgauseiten.de/2020/04/23/zirkus-corona-tag-42-die-hoffnungs-profis
http://www.chiemgauseiten.de/2020/04/27/zirkus-corona-tag-45-leo-allein-zu-haus
http://www.chiemgauseiten.de/2020/05/01/zirkus-corona-tag-50-das-gro%C3%9Fe-wiedersehen
http://www.chiemgauseiten.de/2020/05/04/zirkus-corona-tag-53-das-corona-paradox
http://www.chiemgauseiten.de/2020/05/08/zirkus-corona-tag-57-tage-der-befreiung
http://www.chiemgauseiten.de/2020/05/11/zirkus-corona-tag-60-zwischenzeit
http://www.klimareporter.de
http://www.klimareporter.de/deutschland/der-corona-rollback
http://www.klimareporter.de/finanzen-wirtschaft/finanzhilfen-nur-mit-1-5-grad-standards
http://www.klimareporter.de/gesellschaft/der-klimawandel-ein-planetarisches-virus
http://www.klimareporter.de/protest/das-klima-in-der-hand-der-aktionaer-innen
http://www.klimareporter.de/protest/die-vermeintlich-unpolitischen-krisen
http://www.klimareporter.de/verkehr/corona-zwingt-luftfahrt-zu-mehr-klimaschutz
http://www.liebesleben.de/corona/corona-hiv-und-andere-sti
http://www.liebesleben.de/corona/corona-sexualitaet-und-wohlbefinden
http://www.liebesleben.de/corona/corona-und-beziehungen
http://www.liebesleben.de/corona/corona-und-dating
http://www.liebesleben.de/corona/corona-und-fragen-zu-sexualitaet
http://www.liebesleben.de/corona/corona-und-sex
http://www.literaturhaus-graz.at/die-corona-tagebuecher-1
http://www.literaturhaus-graz.at/die-corona-tagebuecher-teil-2
http://www.literaturhaus-graz.at/die-corona-tagebuecher-teil-3
http://www.literaturhaus-graz.at/die-corona-tagebuecher-teil-5
http://www.literaturhaus-graz.at/die-corona-tagebuecher-teil-6
http://www.literaturhaus-graz.at/ie-corona-tagebuecher-teil-8
http://www.marlenestreeruwitz.at/werk/so-ist-die-welt-geworden
http://www.ortheil-blog.de/2020/04/15/exit-und-zwar-sofort
http://www.ortheil-blog.de/2020/04/20/stationen-eines-corona-tages-1
http://www.ortheil-blog.de/2020/04/21/stationen-eines-coronatages-fuer-kinder
http://www.ortheil-blog.de/2020/05/02/nachrichtenueberdruss
http://www.tichyseinblick.de/daili-es-sentials/aemter-ignorierten-empfehlungen-der-who-und-einer-hygiene-kommission
http://www.tichyseinblick.de/daili-es-sentials/agitation-oder-trost-muezzine-rufen-zum-gebet
http://www.tichyseinblick.de/daili-es-sentials/altmaier-sollte-etwas-tun-oder-den-mund-halten
http://www.tichyseinblick.de/daili-es-sentials/ausstieg-aus-der-quarantaene-nach-ostern-neue-daten-stuetzen-die-forderung
http://www.tichyseinblick.de/daili-es-sentials/berlin-kirche-nein-revolutionaere-1-mai-demo-ja
http://www.tichyseinblick.de/daili-es-sentials/bund-arbeitet-an-impfpflicht-durch-die-hintertuer
http://www.tichyseinblick.de/daili-es-sentials/cdu-verschiebt-parteitag-auf-unbestimmte-zeit
http://www.tichyseinblick.de/daili-es-sentials/china-coronavirus-abschottung
http://www.tichyseinblick.de/daili-es-sentials/corona-abstimmung-buerger-protestieren-trotz-demonstrationsverbot
http://www.tichyseinblick.de/daili-es-sentials/corona-bringts-an-den-tag-deutschland-ohne-mass-und-ordnung
http://www.tichyseinblick.de/daili-es-sentials/corona-ein-kurzer-laendervergleich
http://www.tichyseinblick.de/daili-es-sentials/corona-hilfe-fuer-nachbarn-solidaritaet-ist-eine-einbahnstrasse
http://www.tichyseinblick.de/daili-es-sentials/corona-in-suedostasien-wie-thailand-auf-die-pandemie-reagiert
http://www.tichyseinblick.de/daili-es-sentials/corona-kommunikation-schlechte-noten-fuer-bundesministerien
http://www.tichyseinblick.de/daili-es-sentials/corona-schattengesetze
http://www.tichyseinblick.de/daili-es-sentials/corona-und-co2-jetzt-kommen-die-engfuehrer
http://www.tichyseinblick.de/daili-es-sentials/corona-update-3-mai
http://www.tichyseinblick.de/daili-es-sentials/corona-update-um-18-april-die-maskenpflicht-kommt
http://www.tichyseinblick.de/daili-es-sentials/corona-update-zum-1-mai-kein-tanz-dafuer-presseerklaerungen
http://www.tichyseinblick.de/daili-es-sentials/corona-update-zum-10-april-eine-studie-aus-dem-kreis-heinsberg
http://www.tichyseinblick.de/daili-es-sentials/corona-update-zum-12-april-die-gier-der-vermoegenden
http://www.tichyseinblick.de/daili-es-sentials/corona-update-zum-14-april-anwaeltin-bahner-in-psychiatrie
http://www.tichyseinblick.de/daili-es-sentials/corona-update-zum-15-april-die-leopoldina-und-die-wirtschaft
http://www.tichyseinblick.de/daili-es-sentials/corona-update-zum-16-april-die-bundesregierung-fordert-zum-maskentragen-auf
http://www.tichyseinblick.de/daili-es-sentials/corona-update-zum-2-mai-gerichte-werden-aktiv
http://www.tichyseinblick.de/daili-es-sentials/corona-update-zum-20-april-die-lockerungen-im-ueberblick
http://www.tichyseinblick.de/daili-es-sentials/corona-update-zum-22-april-spahn-fordert-toleranz-fuer-fehler-der-politik
http://www.tichyseinblick.de/daili-es-sentials/corona-update-zum-24-april-eine-merkelsche-regierungserklaerung
http://www.tichyseinblick.de/daili-es-sentials/corona-update-zum-25-april
http://www.tichyseinblick.de/daili-es-sentials/corona-update-zum-26-april-china-uebt-druck-auf-die-eu-aus
http://www.tichyseinblick.de/daili-es-sentials/corona-update-zum-27-april-spd-will-recht-auf-homeoffice-und-strichweibchen
http://www.tichyseinblick.de/daili-es-sentials/corona-update-zum-27-oktober-das-parlament-stiehlt-sich-aus-der-verantwortung
http://www.tichyseinblick.de/daili-es-sentials/corona-update-zum-29-april-die-bundesregierung-bekaempft-ihre-eigenen-fake-news
http://www.tichyseinblick.de/daili-es-sentials/corona-update-zum-30-maerz-soziale-folgen-draengen-nach-vorne
http://www.tichyseinblick.de/daili-es-sentials/corona-update-zum-4-april-wie-raus-aus-dem-kontaktverbot
http://www.tichyseinblick.de/daili-es-sentials/corona-update-zum-8-april-mundschutzmasken-aber-woher
http://www.tichyseinblick.de/daili-es-sentials/corona-update-zum-morgen-des-1-april-flacht-die-kurve-ab
http://www.tichyseinblick.de/daili-es-sentials/corona-update-zum-morgen-des-19-maerz
http://www.tichyseinblick.de/daili-es-sentials/corona-update-zum-morgen-des-20-maerz-anstieg-der-infektionen
http://www.tichyseinblick.de/daili-es-sentials/corona-update-zum-morgen-des-23-maerz-deutsche-amtsstuben-und-italienische-aerzte
http://www.tichyseinblick.de/daili-es-sentials/corona-update-zum-morgen-des-26-maerz-wer-wird-behandelt-wer-nicht
http://www.tichyseinblick.de/daili-es-sentials/corona-update-zum-morgen-des-27-maerz
http://www.tichyseinblick.de/daili-es-sentials/corona-update-zum-morgen-des-28-maerz-das-rki-taucht-ab
http://www.tichyseinblick.de/daili-es-sentials/corona-update-zum-morgen-des-29-maerz-ein-vertrauliches-strategiepapier
http://www.tichyseinblick.de/daili-es-sentials/corona-update-zum-morgen-des-4-april-beschlagnahmte-schutzmasken-und-hoffnung-fuer-italien
http://www.tichyseinblick.de/daili-es-sentials/corona-vermoegensteuer-zwei-reichen-solis-und-eine-mauer-fuer-reiche
http://www.tichyseinblick.de/daili-es-sentials/corona-verschleiert-spanien-das-ausmass-der-pandemie
http://www.tichyseinblick.de/daili-es-sentials/corona-virus-ueberlebensplan-bis-zum-impfstoff
http://www.tichyseinblick.de/daili-es-sentials/coronavirus-in-deutschland
http://www.tichyseinblick.de/daili-es-sentials/coronavirus-in-europa-angekommen
http://www.tichyseinblick.de/daili-es-sentials/coronavirus-oesterreich-ist-im-krisenmanagement-deutschland-weit-voraus
http://www.tichyseinblick.de/daili-es-sentials/coronavirus-was-wir-von-der-spanischen-grippe-lernen-koennen
http://www.tichyseinblick.de/daili-es-sentials/covid-19-warum-die-aktuelle-sterberate-wenig-aussagt
http://www.tichyseinblick.de/daili-es-sentials/der-corona-wortschatz
http://www.tichyseinblick.de/daili-es-sentials/deutschland-war-auf-corona-vorbereitet-aber-nur-auf-dem-papier
http://www.tichyseinblick.de/daili-es-sentials/die-hoffnung-stirbt-zuerst-merkel-zieht-neuen-corona-gipfel-vor
http://www.tichyseinblick.de/daili-es-sentials/dritter-fdp-bundestagsabgeordneter-positiv-auf-covid-19-getestet
http://www.tichyseinblick.de/daili-es-sentials/ein-nachtrag-zu-systemtest-coronavirus-berlin-redet-wien-und-andere-handeln-un-und-eu-finden-nicht-statt
http://www.tichyseinblick.de/daili-es-sentials/endlich-die-fdp-wacht-aus-ihrem-tiefschlaf-auf
http://www.tichyseinblick.de/daili-es-sentials/engtanz-in-zeiten-von-corona
http://www.tichyseinblick.de/daili-es-sentials/erste-ergebnisse-der-heinsberg-studie-koennen-die-einschraenkungen-gelockert-werden
http://www.tichyseinblick.de/daili-es-sentials/fdp-abgeordneter-marcel-luthe-wir-oeffnen-die-buechse-der-pandora
http://www.tichyseinblick.de/daili-es-sentials/fluege-aus-dem-iran-kommen-weiter-in-deutschland-an
http://www.tichyseinblick.de/daili-es-sentials/hamburg-party-senator-bricht-corona-regeln-der-eigenen-behoerde
http://www.tichyseinblick.de/daili-es-sentials/hamsterkaeufe-sorge-vor-coronavirus-und-leere-supermarktregale
This HTML document fails to extract any content using current trafilatura: https://firstmonday.org/ojs/index.php/fm/article/download/10274/9729?inline=1
From experimentation, I believe this is due to the mangled DOCTYPE line:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" 2012"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
[...]
Note the misplaced 2012.
I've contacted the webmaster, and this might be more of an issue with a parsing library that trafilatura uses, but I wonder whether this pattern of mangled HTML might be common enough to be tolerated, particularly in older (pre-modern-web) documents.
As some context for prioritization: this is not an urgent or blocking concern for my use of trafilatura, and I am using a workaround for this specific situation.
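For anyone hitting the same wall, one possible pre-processing workaround (my own sketch, not part of trafilatura) is to normalize any leading DOCTYPE declaration before handing the HTML to the parser:

```python
import re

def normalize_doctype(html_text):
    """Replace a possibly mangled leading <!DOCTYPE ...> with a plain HTML5 doctype."""
    return re.sub(r'(?is)^\s*<!DOCTYPE[^>]*>', '<!DOCTYPE html>', html_text, count=1)
```

Running this on the raw string before calling the extraction function sidesteps the parser choking on the declaration, while leaving documents without a DOCTYPE untouched.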
Hello everybody,
I'm fairly new to using trafilatura. When I run trafilatura -u 'insert random news article' in the shell, I always get "ERROR: file too small". I currently use zsh as my shell and also tried bash, with no success. I checked that the latest version is installed. I'm working on macOS Catalina (10.15.7). I've never worked with the command line/shell before, so I would really appreciate it if somebody could help me with this issue.
Hi,
I just tried to install your awesome project to a local folder (pip install --target {path}/package trafilatura). After installing it I can't load it with from package import trafilatura:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~\anaconda3\lib\site-packages\pkg_resources\__init__.py in get_provider(moduleOrReq)
359 try:
--> 360 module = sys.modules[moduleOrReq]
361 except KeyError:
KeyError: 'trafilatura'
During handling of the above exception, another exception occurred:
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-37-57dd4ca0a626> in <module>
1 from pprint import pprint
----> 2 from package import trafilatura
3 import time
4
5 trafilatura
~{path}\package\trafilatura\__init__.py in <module>
14 import logging
15
---> 16 from .core import extract, process_record
17 from .utils import fetch_url
18
~{path}\package\trafilatura\core.py in <module>
17
18 # own
---> 19 from .external import justext_rescue, sanitize_tree, SANITIZED_XPATH, try_readability
20 from .filters import content_fingerprint, duplicate_test, language_filter, text_chars_test
21 from .htmlprocessing import (convert_tags, discard_unwanted,
~{path}\package\trafilatura\external.py in <module>
36 from .settings import JUSTEXT_LANGUAGES, MANUALLY_STRIPPED
37 from .utils import trim, HTML_PARSER
---> 38 from .xml import TEI_VALID_TAGS
39
40
~{path}\package\trafilatura\xml.py in <module>
21 LOGGER = logging.getLogger(__name__)
22 # validation
---> 23 TEI_SCHEMA = pkg_resources.resource_filename('trafilatura', 'data/tei-schema.pickle')
24 TEI_VALID_TAGS = {'body', 'cell', 'code', 'del', 'div', 'fw', 'head', 'hi', 'item', \
25 'lb', 'list', 'p', 'quote', 'row', 'table'}
~\anaconda3\lib\site-packages\pkg_resources\__init__.py in resource_filename(self, package_or_requirement, resource_name)
1143 def resource_filename(self, package_or_requirement, resource_name):
1144 """Return a true filesystem path for specified resource"""
-> 1145 return get_provider(package_or_requirement).get_resource_filename(
1146 self, resource_name
1147 )
~\anaconda3\lib\site-packages\pkg_resources\__init__.py in get_provider(moduleOrReq)
360 module = sys.modules[moduleOrReq]
361 except KeyError:
--> 362 __import__(moduleOrReq)
363 module = sys.modules[moduleOrReq]
364 loader = getattr(module, '__loader__', None)
ModuleNotFoundError: No module named 'trafilatura'
Python code used:
from package import trafilatura
trafilatura
Note:
I'm using Python 3.7.6 with pip 20.0.2.
Edit:
A quick fix (for me) is replacing TEI_SCHEMA = pkg_resources.resource_filename('trafilatura', 'data/tei-schema.pickle')
in "xml.py" with TEI_SCHEMA = './data/tei-schema.pickle'
and changing line 11 in settings.py from from trafilatura import __version__
to from ..trafilatura import __version__
There's a bug with extraction using the include_images option: it only works once and then extractions start to fail. The bug is in htmlprocessing.py, lines 46 to 50:
if include_images is True:
# Many websites have <img> inside <figure> or <picture> or <source> tag
for element in ['figure', 'picture', 'source']:
MANUALLY_CLEANED.remove(element)
MANUALLY_STRIPPED.remove('img')
This bit of code runs on every extraction, but since the tags have already been removed during the first extraction, a ValueError is thrown for all subsequent extractions. I tested that adding existence checks solves the issue:
if include_images is True:
# Many websites have <img> inside <figure> or <picture> or <source> tag
for element in ['figure', 'picture', 'source']:
if element in MANUALLY_CLEANED:
MANUALLY_CLEANED.remove(element)
if 'img' in MANUALLY_STRIPPED:
MANUALLY_STRIPPED.remove('img')
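More generally, the underlying problem is that module-level lists are mutated in place. A sketch of the same idea with hypothetical stand-in names, showing how per-call copies avoid the repeated-removal failure entirely:

```python
# Hypothetical stand-ins for trafilatura's module-level settings lists.
MANUALLY_CLEANED = ['figure', 'picture', 'source', 'aside']
MANUALLY_STRIPPED = ['img', 'span']

def cleaning_lists(include_images=False):
    """Build per-call copies instead of mutating the shared defaults."""
    cleaned, stripped = list(MANUALLY_CLEANED), list(MANUALLY_STRIPPED)
    if include_images:
        for element in ('figure', 'picture', 'source'):
            cleaned.remove(element)  # safe: operates on this call's copy
        stripped.remove('img')
    return cleaned, stripped
```

With this pattern, calling the function any number of times with include_images=True never raises, because the shared defaults stay intact between extractions.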
A short version of the documentation is available straight from GitHub (README.rst), while a more exhaustive one is present in the docs folder and online at trafilatura.readthedocs.io.
Several problems could arise:
Feel free to help with any of these, thanks!
Feature requests like in #38 and #48 deal with inclusion of particular HTML elements in the output.
To allow for easier inclusion and less hacky code it would be best to
On some documents Trafilatura 0.6.0 fails with this error:
...
File "/usr/local/Caskroom/miniconda/base/envs/myenv/lib/python3.7/site-packages/trafilatura/external.py", line 123, in sanitize_tree
tree = prune_html(tree)
File "/usr/local/Caskroom/miniconda/base/envs/myenv/lib/python3.7/site-packages/trafilatura/htmlprocessing.py", line 63, in prune_html
element.drop_tree()
AttributeError: 'lxml.etree._Element' object has no attribute 'drop_tree'
A git blame on this line reveals that this is new code introduced 21 days ago in this revision:
74444d2
Note: I am using the latest version of lxml (4.6.1)
Me again :)
I'm trying to extract metadata and encountered this error. I dug a bit deeper and found it was caused by the metadata extractor, in this case while extracting the title.
In the HTML, the <title> tag inside <head></head> is empty: <title></title>. There are plenty of meta tags and JS garbage as well, so I couldn't build a short example to reproduce it, but I found a solution.
In extract_sitename:
def extract_sitename(tree):
'''Extract the name of a site from the main title (if it exists)'''
title_elem = tree.find('.//head/title')
print (title_elem.text_content())
if title_elem is not None:
try:
mymatch = re.search(r'^.*?[-|]\s+(.*)$', title_elem.text)
if mymatch:
return mymatch.group(1)
except AttributeError:
pass
return None
it checks whether title_elem is not None. My title_elem passes this check and the condition evaluates to True, but when passed to the regex it raises the error above.
I checked title_elem.text_content() and it is empty. I also checked type(title_elem.text): it is NoneType, which is the cause of the error.
To fix this, you have to check that title_elem.text is not None instead of just title_elem.
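A minimal sketch of the corrected logic, operating directly on the title string so the None guard sits on the text rather than the element (my own rewrite, not the library's actual code):

```python
import re

def sitename_from_title(title_text):
    """Return the site name after a '-' or '|' separator, or None."""
    if title_text is None:  # an empty <title></title> yields .text == None
        return None
    mymatch = re.search(r'^.*?[-|]\s+(.*)$', title_text)
    return mymatch.group(1) if mymatch else None
```

The caller would pass title_elem.text; a None value then short-circuits before the regex ever runs.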
Hello @adbar,
This is not an issue per se, but it seems that when you extract metadata from ld+json, such as the author, for instance, you don't take into account the fact that it is sometimes stringified with Python's infamous ensure_ascii keyword defaulting to True. If we take this HTML page, for instance: https://www.bvoltaire.fr/jean-sevillia-letat-francais-et-letat-algerien-doivent-reconnaitre-les-crimes-commis-des-deux-cotes/ the name "Jean Sévillia" is stringified as "Jean S\u00e9villia". This can be a drag sometimes.
I am not saying this should be addressed by trafilatura directly, and I personally use a hacky function to get the job done (a replacer might be dangerous because of erroneous HTML text containing badly stringified emojis missing parts of a surrogate pair, which I keep stumbling upon these days...), but I just wanted to open a discussion.
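For reference, the escapes disappear by themselves once the ld+json payload is parsed as JSON rather than scanned as raw text, since json.loads decodes \uXXXX sequences:

```python
import json

# Raw ld+json snippet as it appears in the page source, with ensure_ascii escapes.
raw = '{"author": {"@type": "Person", "name": "Jean S\\u00e9villia"}}'
data = json.loads(raw)
print(data["author"]["name"])  # Jean Sévillia
```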
First of all, let me thank you for working on this amazing tool!
One quick feature request: is it possible to keep the output file name the same as the input file name when running on the command line? Also, I see the error # ERROR: file too small on the command line, but when running the API programmatically there is no meaningful message when the extraction is not successful. Can we fix that?
I caught the following exception:
...
File "/usr/local/Caskroom/miniconda/base/envs/myenv/lib/python3.7/site-packages/trafilatura/core.py", line 586, in bare_extraction
docmeta = extract_metadata(tree, url, date_extraction_params)
File "/usr/local/Caskroom/miniconda/base/envs/myenv/lib/python3.7/site-packages/trafilatura/metadata.py", line 393, in extract_metadata
metadata['categories'] = extract_catstags('category', tree)
File "/usr/local/Caskroom/miniconda/base/envs/myenv/lib/python3.7/site-packages/trafilatura/metadata.py", line 331, in extract_catstags
results.append(element.attrib['content'])
File "src/lxml/etree.pyx", line 2479, in lxml.etree._Attrib.__getitem__
KeyError: 'content'
I use Python 3.7.9 and Trafilatura 0.6.1 (from PyPI)
The configparser library would do the trick in order to override the variables in trafilatura/settings.py at the user's request via an .ini file.
Are there Python libraries which allow bypassing the diverse consent mechanisms put in place by news outlets to make readers accept cookies? It would be too cumbersome to develop a headless browser exclusively for trafilatura.
A good example would be the newspaper zeit.de. Related to #18.
Potential solutions:
The output could be piped directly to trafilatura (in a terminal or via Python).
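The configparser idea mentioned above could look roughly like this; note that the section and option names here are hypothetical examples, not trafilatura's actual settings schema:

```python
from configparser import ConfigParser

# Hypothetical user-provided settings file content, e.g. from settings.ini
USER_INI = """
[extraction]
min_extracted_size = 200
min_output_size = 10
"""

config = ConfigParser()
config.read_string(USER_INI)  # or config.read('settings.ini') for a real file
min_extracted = config.getint('extraction', 'min_extracted_size')
```

Values read this way could then override the module-level defaults in settings.py at startup.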
Thank you for this great stuff!!
A bit of a problem with parsing text from https://gametarget.ru/mmofps/
<doc sitename="GameTarget.Ru" title="Онлайн шутеры 2020 - бесплатные игры MMOFPS на ПК" source="https://gametarget.ru/mmofps/" hostname="gametarget.ru" excerpt="В этом разделе сайта собраны клиентские онлайн шутеры, которые помогут вам избавиться от накопившейся усталости и агрессии, выплеснув все негативные эмоции на виртуальных противников по ту сторону экрана." categories="" tags="">
<main>
<div><p>Среди жанров многопользовательских игр MMOFPS занимают одну из лидирующих позиций, наряду с</p>MMORPG<p>и</p>MOBA<p>. И это неудивительно, ведь онлайн шутеры не требуют к себе много внимания, при этом развлекают игрока буквально с первых же секунд – никаких долгих вступлений и обучающих уровней: вы просто заходите в лобби, выбираете интересующий режим и спустя несколько мгновений оказываетесь в самой гуще сражения.</p><p>Еще один фактор, влияющий на популярность жанра ммофпс – высокое качество продукта. Среди онлайн шутеров на ПК бушует нешуточная конкуренция, поэтому разработчикам приходится выкладываться на все сто процентов, чтобы обеспечить игроков максимально яркими впечатлениями. Лучшие игры завлекают аудиторию красивой графикой, разнообразием режимов, близким к идеальному балансом, тщательно выверенным геймплеем, а также следованием всем современным трендам – например, наличием элементов выживания, ежедневных заданий или ультрамодных режимов вроде «Королевской битвы».</p><p>Так как жанр существует уже несколько десятков лет, в его рамках сформировалось множество поджанров и категорий. Шутеры классифицируются по модели распространения – платные и бесплатные онлайн шутеры, по расположению виртуальной камеры – от первого лица и от третьего лица, по геймплейным особенностям – тактические, реалистичные, кооперативные, снайперские и так далее, по сеттингу – фантастические, современные, постапокалиптические, военные. Потеряться в этом многообразии очень легко, поэтому при выборе онлайн игры для многочасовых сетевых сражений стоит довериться команде, которая хорошо разбирается в теме.</p><p>В нашем каталоге вы найдете лучшие мультиплеерные шутеры на ПК с подробным описанием для каждого из них. Зайдя на страницу игры, вы ознакомитесь с детальной информацией касательно сеттинга, режимов, классов и других особенностей, оцените визуальную составляющую в трейлерах и скриншотах, а также узнаете, можно ли скачать понравившийся шутер бесплатно. 
Мы постоянно актуализируем данные, так что будьте уверены – здесь вы получите самые правдивые и честные обзоры, которые, мы надеемся, помогут вам принять верное решение.</p></div>
</main>
<comments/>
</doc>
The problem lies in this fragment: <p>Среди жанров многопользовательских игр MMOFPS занимают одну из лидирующих позиций, наряду с</p>MMORPG<p>и</p>MOBA<p>
http://prntscr.com/us0qm3
http://prntscr.com/us0r2o
Links in the HTML source are incorrectly replaced with paragraphs, which breaks the entire document.
Thank you for your help!!!!
v 0.5.2
I have an HTML fragment (nested divs, p, ul).
html_fragment = '<div class="l-main-column">\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\n\t\n\t<div class="text-image-container ">\n<div class="text-image">\n<p>Mit dem KfW-Unternehmerkredit fördern wir Unter\xadnehmen sowie Frei\xadberufler, die seit mindestens <span class="u-nobr">5 Jahren</span> am Markt aktiv sind.</p>\n</div>\n</div><div class="text-image-container ">\n<div class="text-image">\n<p><strong>Das Förderprodukt kommt nicht in Frage: </strong></p><ul class="list list--bullets"> <li class="list__item"> für Existenzgründer und junge Unternehmen bis 5 Jahre. Diese unterstützen wir mit anderen Förder\xadprodukten, zum Beispiel mit dem <a class="link link--underline" href="https://www.kfw.de/inlandsfoerderung/Privatpersonen/Gr%C3%BCnden-Erweitern/F%C3%B6rderprodukte/ERP-Gr%C3%BCnderkredit-Universell-(073_074_075_076)/" title="Zur Infoseite zum ERP-Gründerkredit - Universell (073, 074, 075, 076)" data-section="contentcolumn"><span class="link__name"><span class="link__name-inner"><span class="link__name-text">ERP-Gründerkredit – Universell</span></span></span></a>. </li><li class="list__item"> für Unternehmen, die zum 31.12.2019 in Schwierig\xadkeiten waren, also vor Beginn der Coronakrise. </li><li class="list__item"> wenn Sie während der Kredit\xadlaufzeit Gewinn oder Dividende ausschütten. Möglich sind aber markt\xadübliche Ausschüttungen oder Entnahmen für Geschäfts\xadinhaber (natürliche Personen). </li> </ul>\n</div>\n</div>\n\t\n\n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t</div>'
I wanted trafilatura v0.6.0 to extract the "visible" text from the fragment via trafilatura.extract(html_fragment, target_language='de'). There are two paragraphs and an unordered list. After the extraction, I receive the text of the second paragraph and the text of the unordered list, but the first paragraph is lost. Why?
the output of the extract is:
In [16]: trafilatura.extract(a[0])
2020-12-15 18:01:01 [trafilatura.core] DEBUG: Taking all p-elements
2020-12-15 18:01:01 [readability.readability] DEBUG: Branch 34.250 .text-image-container>.text-image link density 0.000 -> 34.250
2020-12-15 18:01:01 [readability.readability] DEBUG: Branch 32.125 .l-main-column>.text-image-container link density 0.000 -> 32.125
2020-12-15 18:01:01 [readability.readability] DEBUG: Branch 32.390 .text-image-container>.text-image link density 0.063 -> 30.344
2020-12-15 18:01:01 [readability.readability] DEBUG: Branch 31.195 .l-main-column>.text-image-container link density 0.063 -> 29.225
2020-12-15 18:01:01 [readability.readability] DEBUG: Top 5 : 34.250 .text-image-container>.text-image
2020-12-15 18:01:01 [readability.readability] DEBUG: Top 5 : 32.125 .l-main-column>.text-image-container
2020-12-15 18:01:01 [readability.readability] DEBUG: Top 5 : 30.344 .text-image-container>.text-image
2020-12-15 18:01:01 [readability.readability] DEBUG: Top 5 : 29.225 .l-main-column>.text-image-container
2020-12-15 18:01:01 [readability.readability] DEBUG: Not removing div{01}>.text-image of length 125: Mit dem KfW-Unternehmerkredit fördern w...
2020-12-15 18:01:01 [readability.readability] DEBUG: Branch 32.390 .text-image-container>.text-image link density 0.063 -> 30.344
2020-12-15 18:01:01 [readability.readability] DEBUG: Branch 31.195 .l-main-column>.text-image-container link density 0.063 -> 29.225
2020-12-15 18:01:01 [readability.readability] DEBUG: Top 5 : 30.344 .text-image-container>.text-image
2020-12-15 18:01:01 [readability.readability] DEBUG: Top 5 : 29.225 .l-main-column>.text-image-container
2020-12-15 18:01:01 [readability.readability] DEBUG: Not removing .text-image>ul.list.list--bullets of length 435: für Existenzgründer und junge Unternehm...
2020-12-15 18:01:01 [readability.readability] DEBUG: Not removing div{01}>.text-image of length 475: Das Förderprodukt kommt nicht in Frage:...
2020-12-15 18:01:01 [trafilatura.core] DEBUG: extracted length: 476 (algorithm) 165 (extraction)
2020-12-15 18:01:01 [trafilatura.core] INFO: using generic algorithm: None
2020-12-15 18:01:01 [trafilatura.core] INFO: not enough comments None
Out[16]: 'Das Förderprodukt kommt nicht in Frage:\n- für Existenzgründer und junge Unternehmen bis 5 Jahre. Diese unterstützen wir mit anderen Förderprodukten, zum Beispiel mit dem ERP-Gründerkredit – Universell.\n- für Unternehmen, die zum 31.12.2019 in Schwierigkeiten waren, also vor Beginn der Coronakrise.\n- wenn Sie während der Kreditlaufzeit Gewinn oder Dividende ausschütten. Möglich sind aber marktübliche Ausschüttungen oder Entnahmen für Geschäftsinhaber (natürliche Personen).'
The missing paragraph is not removed by readability.readability:
Not removing div{01}>.text-image of length 125: Mit dem KfW-Unternehmerkredit fördern w...
I expected:
Mit dem KfW-Unternehmerkredit fördern wir Unternehmen sowie Freiberufler, die seit mindestens 5 Jahren am Markt aktiv sind.\nDas Förderprodukt kommt nicht in Frage:\n- für Existenzgründer und junge Unternehmen bis 5 Jahre. Diese unterstützen wir mit anderen Förderprodukten, zum Beispiel mit dem ERP-Gründerkredit – Universell.\n- für Unternehmen, die zum 31.12.2019 in Schwierigkeiten waren, also vor Beginn der Coronakrise.\n- wenn Sie während der Kreditlaufzeit Gewinn oder Dividende ausschütten. Möglich sind aber marktübliche Ausschüttungen oder Entnahmen für Geschäftsinhaber (natürliche Personen).
Thanks for your support. Love your work.
With the cocon.se site, the fetch_url result is None.
Test code
import trafilatura
if __name__ == '__main__':
downloaded = trafilatura.fetch_url('http://cocon.se/')
if downloaded is None:
print('Error downloaded content')
exit(1)
result = trafilatura.extract(downloaded)
print(result)
Result
Error downloaded content
The reason for this error is that no User-Agent is specified in the request headers.
With this code, it's OK:
response = requests.Session().get('http://cocon.se/', timeout=30, verify=False, allow_redirects=True, headers={'User-Agent': 'Mozilla/5.0'})
downloaded = response.text
result = trafilatura.extract(downloaded)
print(result)
If the readability fallback is activated, the trafilatura library redirects stderr to /dev/null upon every call:
trafilatura/trafilatura/external.py
Line 63 in a56fb3e
Within programs involving other libraries, this causes a host of side effects. E.g., generating a chart with seaborn imports IPython (a dependency of seaborn), which pre-checks stdin, stdout and stderr upon initialization and crashes because stderr is /dev/null. I see other side effects in other libraries as well, including disappearing logs (e.g. when log settings are modified after calls to trafilatura).
This redirection seems to have been necessary to prevent the readability library from printing messages to stderr. A cursory reading of the current version of readability suggests it no longer does that; it only emits proper logs.
Consequently, this redirect may be removed (to be tested).
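If silencing is still needed at some call site, a scoped alternative is contextlib.redirect_stderr, which restores stderr when the block exits instead of leaving it pointing at /dev/null for the rest of the program:

```python
import io
import sys
from contextlib import redirect_stderr

buf = io.StringIO()
with redirect_stderr(buf):
    # Anything a noisy library writes to stderr inside this block is captured.
    print('noisy library output', file=sys.stderr)
# sys.stderr is restored here, so other libraries are unaffected.
```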
When I'm trying to import trafilatura, it throws an error saying that regex has no compile attribute. I have installed trafilatura 0.8.0 and I'm currently running it in a Jupyter notebook.
Investigate potential issue with missing protocol in URLs as mentioned in #48.
I have got an HTML document with this kind of string in it:
'<a href="" class="post-meta-date sh-default-color">2019 28 meh</a>\n \n \t\t\t\t\t\t\t</div>\n\n\t\t\t\t\t\t\t<a href="some_link/" class="post-title">\n\t\t\t\t\t\t\t\t<h2 itemprop="headline">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tsome random text?\t\t\t\t\t\t\t\t</h2>\n\t\t\t\t\t\t\t</a>'
When I try to do bare_extraction
it fails with:
---------------------------------------------------------------------------
IllegalMonthError Traceback (most recent call last)
.../python3.7/site-packages/dateutil/parser/_parser.py in parse(self, timestr, default, ignoretz, tzinfos, **kwargs)
654 try:
--> 655 ret = self._build_naive(res, default)
656 except ValueError as e:
.../python3.7/site-packages/dateutil/parser/_parser.py in _build_naive(self, res, default)
1237
-> 1238 if cday > monthrange(cyear, cmonth)[1]:
1239 repl['day'] = monthrange(cyear, cmonth)[1]
.../python3.7/calendar.py in monthrange(year, month)
123 if not 1 <= month <= 12:
--> 124 raise IllegalMonthError(month)
125 day1 = weekday(year, month, 1)
IllegalMonthError: bad month number 28; must be 1-12
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
<ipython-input-89-26cbca439b34> in <module>
1 # downloaded = trafilatura.fetch_url('https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/')
2 res = bare_extraction(con3,
----> 3 include_formatting = True,
4 # output_format ="xml"
5 )
.../python3.7/site-packages/trafilatura/core.py in bare_extraction(filecontent, url, no_fallback, include_comments, output_format, target_language, include_tables, include_images, include_formatting, include_links, deduplicate, date_extraction_params, with_metadata, max_tree_size, url_blacklist, config)
674 # extract metadata if necessary
675 if output_format != 'txt':
--> 676 docmeta = extract_metadata(tree, url, date_extraction_params)
677 # cut short if extracted URL in blacklist
678 if docmeta['url'] in url_blacklist:
.../python3.7/site-packages/trafilatura/metadata.py in extract_metadata(filecontent, default_url, date_config)
384 date_config['url'] = metadata['url']
385 try:
--> 386 metadata['date'] = find_date(tree, **date_config)
387 # temporary fix for htmldate bug
388 except UnicodeError:
.../python3.7/site-packages/htmldate/core.py in find_date(htmlobject, extensive_search, original_date, outputformat, url, verbose, min_date, max_date)
629 for expr in DATE_EXPRESSIONS:
630 dateresult = examine_date_elements(
--> 631 search_tree, expr, outputformat, extensive_search, min_date, max_date
632 )
633 if dateresult is not None:
.../python3.7/site-packages/htmldate/core.py in examine_date_elements(tree, expression, outputformat, extensive_search, min_date, max_date)
92 {ord(c): None for c in '\n\t\r'}
93 ).strip()[:100])
---> 94 attempt = try_ymd_date(toexamine, outputformat, extensive_search, min_date, max_date)
95 if attempt is not None:
96 return attempt
.../python3.7/site-packages/htmldate/extractors.py in try_ymd_date(string, outputformat, extensive_search, min_date, max_date)
387 return None
388 # faster
--> 389 customresult = custom_parse(string, outputformat, extensive_search, min_date, max_date)
390 if customresult is not None:
391 return customresult
.../python3.7/site-packages/htmldate/extractors.py in custom_parse(string, outputformat, extensive_search, min_date, max_date)
304 # speed-up by ignoring time zone info if ciso8601 is installed
305 else:
--> 306 result = parse_datetime_as_naive(string)
307 if date_validator(result, outputformat, earliest=min_date, latest=max_date) is True:
308 LOGGER.debug('parsing result: %s', result)
.../python3.7/site-packages/dateutil/parser/_parser.py in parse(timestr, parserinfo, **kwargs)
1372 return parser(parserinfo).parse(timestr, **kwargs)
1373 else:
-> 1374 return DEFAULTPARSER.parse(timestr, **kwargs)
1375
1376
.../python3.7/site-packages/dateutil/parser/_parser.py in parse(self, timestr, default, ignoretz, tzinfos, **kwargs)
655 ret = self._build_naive(res, default)
656 except ValueError as e:
--> 657 six.raise_from(ParserError(e.args[0] + ": %s", timestr), e)
658
659 if not ignoretz:
TypeError: unsupported operand type(s) for +: 'int' and 'str'
I used the same HTML string as input for jusText and it worked. Not sure where exactly the bug is.
Trafilatura extracts some metadata from the JSON-LD tag if it is available. In particular, it tries to find the title in the "headline" property of the JSON-LD tag, but it looks like the headline is not necessarily the title. For example, look at this Wikipedia page: https://en.m.wikipedia.org/wiki/Semantic_satiation
The JSON-LD is:
{
"@context":"https:\/\/schema.org","@type":"Article",
"name":"Semantic satiation",
"url":"https:\/\/en.wikipedia.org\/wiki\/Semantic_satiation",
"sameAs":"http:\/\/www.wikidata.org\/entity\/Q226007",
"mainEntity":"http:\/\/www.wikidata.org\/entity\/Q226007",
"author":{"@type":"Organization","name":"Contributors to Wikimedia projects"},
"publisher":{"@type":"Organization","name":"Wikimedia Foundation, Inc.","logo":{"@type":"ImageObject","url":"https:\/\/www.wikimedia.org\/static\/images\/wmf-hor-googpub.png"}},
"datePublished":"2006-07-12T09:27:14Z",
"dateModified":"2020-08-31T23:55:26Z",
"headline":"psychological phenomenon in which repetition causes a word to temporarily lose meaning for the listener"
}
Most Wikipedia pages are like this.
The title of the page is in the "name" property, while the "headline" property contains a short tagline instead, so trafilatura gives the tagline as the title of the page. It probably makes sense to search for the "name" property first. Though it would be hard to extract with a regex: "name" also appears in subfields, like in the "author" property above, so the JSON would need to be parsed properly.
There was even a proposal to get rid of the headline property and replace it with "name" or with "title": schemaorg/schemaorg#205
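A sketch of the proposed order of preference, assuming the JSON-LD block has been parsed with the json module (the function name is mine, not trafilatura's):

```python
import json

def jsonld_title(jsonld_text):
    """Prefer the top-level "name" property, falling back to "headline"."""
    data = json.loads(jsonld_text)
    return data.get('name') or data.get('headline')

wikipedia_like = '{"@type": "Article", "name": "Semantic satiation", "headline": "psychological phenomenon in which repetition causes a word to temporarily lose meaning"}'
print(jsonld_title(wikipedia_like))  # Semantic satiation
```

Because only the top-level dict is inspected, a "name" nested inside "author" can no longer be picked up by accident, which is the failure mode a regex would have.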
I have mostly tested trafilatura on a set of English, German and French web pages I ran into while surfing or during web crawls. There are definitely further web pages and cases in other languages for which the extraction doesn't work so far.
Corresponding bug reports can either be filed as a list in an issue like this one or in the code as XPath expressions in xpaths.py (see the BODY_XPATH and COMMENTS_XPATH lists).
Thanks!
The extract API throws this error while trying to extract content:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-333-f77dadd7d5ab> in <module>
----> 1 a = trafilatura.extract(downloaded)
2 a
~/opt/anaconda2/envs/permpressenv/lib/python3.6/site-packages/trafilatura/core.py in extract(filecontent, url, record_id, no_fallback, include_comments, output_format, csv_output, xml_output, tei_output, tei_validation, target_language, include_tables, include_formatting, date_extraction_params)
618
619 # extract content
--> 620 postbody, temp_text, len_text, sure_thing = extract_content(cleaned_tree, include_tables)
621
622 # compare if necessary
~/opt/anaconda2/envs/permpressenv/lib/python3.6/site-packages/trafilatura/core.py in extract_content(tree, include_tables)
340 # print(html.tostring(subtree, pretty_print=True, encoding='unicode'))
341 # extract content
--> 342 processed_elems = [handle_textelem(e, potential_tags) for e in subtree.xpath('.//*')]
343 # list(filter(None.__ne__, processed_elems))
344 result_body.extend([e for e in processed_elems if e is not None])
~/opt/anaconda2/envs/permpressenv/lib/python3.6/site-packages/trafilatura/core.py in <listcomp>(.0)
340 # print(html.tostring(subtree, pretty_print=True, encoding='unicode'))
341 # extract content
--> 342 processed_elems = [handle_textelem(e, potential_tags) for e in subtree.xpath('.//*')]
343 # list(filter(None.__ne__, processed_elems))
344 result_body.extend([e for e in processed_elems if e is not None])
~/opt/anaconda2/envs/permpressenv/lib/python3.6/site-packages/trafilatura/core.py in handle_textelem(element, potential_tags)
293 if element.tail is not None and not element.tail.isspace():
294 new_element = etree.Element('p')
--> 295 new_element.text = process_node(element).tail
296 # new_element.text = handle_textnode(element, comments_fix=False).tail
297 elif element.tag == 'hi':
AttributeError: 'NoneType' object has no attribute 'tail'
Interestingly, content from the same HTML file can be extracted with the html2text API. To reproduce this error, try this URL: http://sibenlab.blogspot.com/2018/06/sibenlab-privacy-policy.html
Example article: https://www.nytimes.com/2020/10/19/us/politics/trump-ads-biden-election.html
This is authored by Maggie Haberman, Shane Goldmacher and Michael Crowley, but trafilatura will only show the first one. They are all in the JSON-LD so I think they should all be extracted, and author should be an array.
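A sketch of what multi-author handling could look like, assuming the JSON-LD "author" field is either a single object or a list of objects (the function name is mine):

```python
def jsonld_authors(data):
    """Return all author names from a parsed JSON-LD dict as a list."""
    author = data.get('author')
    if author is None:
        return []
    if isinstance(author, dict):  # normalize a single author object to a list
        author = [author]
    return [a['name'] for a in author if isinstance(a, dict) and 'name' in a]
```

Returning a list in both cases would let downstream code treat the author field uniformly instead of silently dropping everyone after the first.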
The current use of requests sessions in cli_utils.py doesn't appear to be thread-safe (psf/requests#2766). The full functionality of the module isn't really needed here, and a change would help reduce the total number of dependencies, as mentioned in #41.
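Since only simple GETs are needed, the stdlib could replace requests entirely; a minimal sketch with urllib.request (the User-Agent string is just an example, and the actual download call is commented out to avoid network access):

```python
import urllib.request

def build_request(url):
    """Prepare a GET request with a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})

# response = urllib.request.urlopen(build_request('https://example.org'), timeout=30)
# html = response.read().decode(response.headers.get_content_charset() or 'utf-8')
```

urlopen creates no shared session state, so concurrent threads can each call it independently without the thread-safety concerns of a shared requests.Session.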
I tried to run the following extraction snippet on the same HTML input multiple times, but every 4th extraction returned a shorter result.
link = '...'
html = scrape(link)
prev_extraction = None
for x in range(10):
extraction = trafilatura.extract(html, include_comments=False,
include_tables=False, no_fallback=True, target_language='en')
if prev_extraction:
if prev_extraction != extraction:
print('Extraction looks weird!')
prev_extraction = extraction
Is there a parameter I should be using, or is this a bug to be fixed?
Thanks!
Traceback (most recent call last):
File "indexer.py", line 53, in <module>
content_trafilatura = trafilatura.extract(document, json_output=True, with_metadata=False, include_tables=False, deduplicate=True, include_comments=False)
File "/Users/luca/enviroments/3.7/lib/python3.7/site-packages/trafilatura/core.py", line 684, in extract
max_tree_size=max_tree_size, url_blacklist=url_blacklist
File "/Users/luca/enviroments/3.7/lib/python3.7/site-packages/trafilatura/core.py", line 586, in bare_extraction
docmeta = extract_metadata(tree, url, date_extraction_params)
File "/Users/luca/enviroments/3.7/lib/python3.7/site-packages/trafilatura/metadata.py", line 367, in extract_metadata
metadata['date'] = find_date(tree, **date_config)
File "/Users/luca/enviroments/3.7/lib/python3.7/site-packages/htmldate/core.py", line 605, in find_date
original_date, min_date, max_date)
File "/Users/luca/enviroments/3.7/lib/python3.7/site-packages/htmldate/core.py", line 124, in examine_header
headerdate = tryfunc(elem.get('content'))
File "/Users/luca/enviroments/3.7/lib/python3.7/site-packages/htmldate/extractors.py", line 385, in try_ymd_date
customresult = custom_parse(string, outputformat, extensive_search, min_date, max_date)
File "/Users/luca/enviroments/3.7/lib/python3.7/site-packages/htmldate/extractors.py", line 302, in custom_parse
result = parse_datetime_as_naive(string)
File "/Users/luca/enviroments/3.7/lib/python3.7/site-packages/dateutil/parser/_parser.py", line 1374, in parse
return DEFAULTPARSER.parse(timestr, **kwargs)
File "/Users/luca/enviroments/3.7/lib/python3.7/site-packages/dateutil/parser/_parser.py", line 655, in parse
ret = self._build_naive(res, default)
File "/Users/luca/enviroments/3.7/lib/python3.7/site-packages/dateutil/parser/_parser.py", line 1241, in _build_naive
naive = default.replace(**repl)
OverflowError: signed integer is greater than maximum