bookieio / breadability

A reworked version of the https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)

Home Page: https://bookieio.github.io/breadability/

License: BSD 2-Clause "Simplified" License

Python 5.39% Makefile 0.08% Roff 0.06% HTML 94.47%
html-extraction html-extractor html-parsing python text-extraction text-mining

breadability's People

Contributors

craigmaloney, gjastrab, jelmer, macmenot, miso-belica, mitechie

breadability's Issues

None type error in readable parsing

[D 120827 20:17:42 existing:67] Q1 getting content for a6bb837b68038a http://themeforest.net/item/themeology-portfolio-and-blog-theme/full_screen_preview/127873

Exception in thread Thread-1325:
Traceback (most recent call last):
  File "/usr/lib/python2.6/threading.py", line 532, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.6/threading.py", line 484, in run
    self.__target(*self.__args, **self.__kwargs)
  File "scripts/readability/existing.py", line 68, in fetch_content
    read = ReadUrl.parse(url)
  File "/home/bmark.us/0.5/bookie/lib/readable.py", line 176, in parse
    if not document.readable:
  File "/home/bmark.us/0.5/lib/python2.6/site-packages/breadability/utils.py", line 55, in __get__
    value = self.fget(inst)
  File "/home/bmark.us/0.5/lib/python2.6/site-packages/breadability/readable.py", line 457, in readable
    return tounicode(self._readable)
  File "/home/bmark.us/0.5/lib/python2.6/site-packages/breadability/utils.py", line 55, in __get__
    value = self.fget(inst)
  File "/home/bmark.us/0.5/lib/python2.6/site-packages/breadability/readable.py", line 482, in _readable
    doc = build_base_document(updated_winner.node, self.fragment)
  File "/home/bmark.us/0.5/lib/python2.6/site-packages/breadability/readable.py", line 93, in build_base_document
    if html.tag == 'body':
AttributeError: 'NoneType' object has no attribute 'tag'

error dropping node

2012-06-29T16:55:50+00:00 app[web.1]: File "/app/bookie_parser/handlers/__init__.py", line 131, in _readable_content
2012-06-29T16:55:50+00:00 app[web.1]: value = self.fget(inst)
2012-06-29T16:55:50+00:00 app[web.1]: File "/app/.heroku/venv/lib/python2.7/site-packages/breadability/readable.py", line 426, in readable
2012-06-29T16:55:50+00:00 app[web.1]: return tounicode(self._readable)
2012-06-29T16:55:50+00:00 app[web.1]: [n.drop_tree() for n in self._should_drop]
2012-06-29T16:55:50+00:00 app[web.1]: value = self.fget(inst)
2012-06-29T16:55:50+00:00 app[web.1]: File "/app/.heroku/venv/lib/python2.7/site-packages/breadability/utils.py", line 55, in __get__
2012-06-29T16:55:50+00:00 app[web.1]: File "/app/.heroku/venv/lib/python2.7/site-packages/breadability/readable.py", line 436, in _readable
2012-06-29T16:55:50+00:00 app[web.1]: File "/app/.heroku/venv/lib/python2.7/site-packages/lxml/html/__init__.py", line 169, in drop_tree
2012-06-29T16:55:50+00:00 app[web.1]: assert parent is not None

Python 3.7 compatible release?

Are you planning to issue a Python 3.7 compatible release in the near future?

I am the Debian maintainer of breadability. For now we are based on 0.1.20, and I'm not sure how compatible this version is with recent Python 3 releases.

Thanks for your work.

Single invalid character results in a failed parse.

While attempting to use breadability to parse this page: http://bgr.com/2013/10/08/iphone-6-specs-display-resolution-size/

With something like:

import requests
from breadability import readable

url = 'http://bgr.com/2013/10/08/iphone-6-specs-display-resolution-size/'
content = requests.get(url).content
post = readable.Article(content, url)

I end up with post.dom == None.

I traced this down to OriginalDocument.dom, where the character encoding is strictly enforced, so decoding fails if there is even a single invalid character.

I propose that either the decoding be switched to errors='ignore' in this block:

    @cached_property
    def dom(self):
        """Parsed HTML document from the input."""
        html = self._html
        if not isinstance(html, unicode):
            encoding = determine_encoding(html)
            html = html.decode(encoding)

        html = convert_breaks_to_paragraphs(html)
        document = build_document(html, self._url)

        return document

Or that a parameter be added to Article and OriginalDocument to relax the requirement.

I set up an example on runnable: http://runnable.com/Uw-D11n75cQLKkI_/breadability-parse-failure-for-python

I'll gladly do this and submit a pull request - let me know how you would prefer it to be implemented.
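The first option can be illustrated outside of breadability. This is only a sketch in Python 3, assuming the encoding has already been detected; `decode_leniently` is a hypothetical helper, not part of the library, and it relies only on the standard `errors` argument of `bytes.decode`.

```python
def decode_leniently(raw, encoding="utf-8", errors="replace"):
    """Decode bytes without failing on invalid sequences.

    Hypothetical helper for illustration; not part of breadability.
    """
    if isinstance(raw, str):
        return raw  # already text, nothing to decode
    return raw.decode(encoding, errors)


page = b"<p>caf\xc3\xa9 \xff menu</p>"  # \xff is not valid UTF-8

# "ignore" silently drops the bad byte; "replace" marks it with U+FFFD.
ignored = decode_leniently(page, errors="ignore")
replaced = decode_leniently(page, errors="replace")
```

With `errors="strict"` (the default of `bytes.decode`) the same input raises `UnicodeDecodeError`, which matches the failure described above. Whether `ignore` or `replace` is the better default is a design choice: `replace` keeps a visible marker where data was lost.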

Missing argparse in requirements

I installed it successfully with pip in a clean virtualenv (created with --no-site-packages).
It would not run, though, because it uses argparse, and argparse isn't listed in requirements.txt. Please add argparse to the requirements.
Environment: Python 2.6, Debian squeeze.
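One way to handle this in a setup script is to add the argparse backport only on interpreters whose standard library lacks it (argparse was added to the standard library in Python 2.7 and 3.2). The sketch below is an assumption about how that could look, not the project's actual setup code; `with_argparse_backport` is a hypothetical helper name.

```python
import sys


def with_argparse_backport(requires, version_info=sys.version_info):
    """Return a copy of requires, adding argparse where the stdlib lacks it.

    Hypothetical helper for illustration; not taken from breadability.
    """
    result = list(requires)
    version = tuple(version_info[:2])
    # argparse is in the standard library from Python 2.7 and from 3.2.
    if version < (2, 7) or (3, 0) <= version < (3, 2):
        result.append("argparse")
    return result
```

The returned list could then be passed as `install_requires` to `setup()`.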

Bookie tests failing; need to fix installation for Travis CI

@miso-belica pointed out that the tests for Bookie are failing. That's part of the issue. The other part is that Bookie currently cannot be built because certain packages have aged out and are no longer available. I don't remember the specifics at the moment, so I can't elaborate without retrying the build process. What I do remember is that I got stuck while trying to update the packages to later versions.

Article fails to pick up some content

html_content = """<P>&nbsp;&nbsp;&nbsp;&nbsp;最近一个月内,三圣股份共计登上龙虎榜1次,表明三圣股份股性一般。 (<A href='http://stock.jrj.com.cn/share,002742,lhb.shtml' target=_blank>更多龙虎榜查询请点击</A>)</P><P>&nbsp;&nbsp;&nbsp;&nbsp;公司主要从事    建材化工、医药。</P>"""
from lxml import etree
from breadability.readable import Article

article = Article(html_content)
print(article.main_text)  # prints nothing: the content is not picked up

html = etree.HTML(html_content)
print(html.xpath('//text()'))  # lxml alone extracts the target text

Includes possible non-free content

breadability ships a number of documents for testing that are possibly restricted because they are copyrighted and not BSD-licensed.

Would it be possible to use alternatives, or possibly fetch them during the test process so they don't have to be distributed?

This makes it harder to package breadability for Debian.
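Fetching the documents during the test run could look roughly like the sketch below. It is an illustration under assumptions, not code from breadability: `load_fixture`, the cache directory layout, and the example URL are all hypothetical, and the demo uses a stand-in fetcher so that it runs offline.

```python
import os
import tempfile
from urllib.request import urlopen


def _download(url):
    """Default fetcher: return the raw bytes served at url."""
    with urlopen(url) as response:
        return response.read()


def load_fixture(name, url, cache_dir, fetch=_download):
    """Return fixture bytes, downloading into cache_dir on first use only."""
    path = os.path.join(cache_dir, name)
    if not os.path.exists(path):
        os.makedirs(cache_dir, exist_ok=True)
        with open(path, "wb") as handle:
            handle.write(fetch(url))
    with open(path, "rb") as handle:
        return handle.read()


# Demo with a stand-in fetcher (hypothetical URL, no network access needed).
calls = []


def fake_fetch(url):
    calls.append(url)
    return b"<html><body><p>sample</p></body></html>"


cache = tempfile.mkdtemp()
first = load_fixture("sample.html", "https://example.invalid/sample.html", cache, fake_fetch)
second = load_fixture("sample.html", "https://example.invalid/sample.html", cache, fake_fetch)
```

The trade-off is that the tests then depend on the remote pages staying available and unchanged, so a packager may still prefer freely licensed replacements.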

not all nodes are scored/removed in the prep_article phase.

There are a lot of nodes that should be removed during the prep_article phase. It seems only a few nodes are getting hit in the iter() loop there while it should be each node from the winning dom element down.

Not sure why this is, but it's preventing us from getting really clean docs.
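One possible cause of this symptom, not confirmed for breadability, is removing nodes from a tree while still iterating over it, which can make the walk skip elements. A common way to avoid that is to collect the nodes first and remove them afterwards. The sketch below uses the standard library's `xml.etree.ElementTree` rather than lxml so that it is self-contained; `drop_matching` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def drop_matching(root, should_drop):
    """Remove every descendant of root for which should_drop(el) is true.

    The (parent, child) pairs are collected before anything is removed,
    so the removals cannot disturb the traversal. Returns the number of
    removed elements. Note that ElementTree's remove() also discards the
    element's tail text.
    """
    doomed = [
        (parent, child)
        for parent in root.iter()
        for child in parent
        if should_drop(child)
    ]
    for parent, child in doomed:
        parent.remove(child)
    return len(doomed)


root = ET.fromstring(
    '<div><p class="ad">a</p><p class="ad">b</p><p>keep</p>'
    '<span class="ad"><p class="ad">c</p></span></div>'
)
removed = drop_matching(root, lambda el: el.get("class") == "ad")
remaining = [child.tag for child in root]
```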

exceptions with bad parsing

Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 504, in run
    self.__target(*self.__args, **self.__kwargs)
  File "scripts/readability/existing.py", line 65, in fetch_content
    read = ReadUrl.parse(url)
  File "/home/rharding/src/bookie/bookie/lib/readable.py", line 171, in parse
    if not document.readable:
  File "/home/rharding/src/bookie/local/lib/python2.7/site-packages/breadability/utils.py", line 55, in __get__
    value = self.fget(inst)
  File "/home/rharding/src/bookie/local/lib/python2.7/site-packages/breadability/readable.py", line 426, in readable
    return tounicode(self._readable)
  File "/home/rharding/src/bookie/local/lib/python2.7/site-packages/breadability/utils.py", line 55, in __get__
    value = self.fget(inst)
  File "/home/rharding/src/bookie/local/lib/python2.7/site-packages/breadability/readable.py", line 431, in _readable
    if self.candidates:
  File "/home/rharding/src/bookie/local/lib/python2.7/site-packages/breadability/utils.py", line 55, in __get__
    value = self.fget(inst)
  File "/home/rharding/src/bookie/local/lib/python2.7/site-packages/breadability/readable.py", line 419, in candidates
    doc = self.doc
  File "/home/rharding/src/bookie/local/lib/python2.7/site-packages/breadability/utils.py", line 55, in __get__
    value = self.fget(inst)
  File "/home/rharding/src/bookie/local/lib/python2.7/site-packages/breadability/readable.py", line 409, in doc
    doc = self.orig.html
  File "/home/rharding/src/bookie/local/lib/python2.7/site-packages/breadability/utils.py", line 55, in __get__
    value = self.fget(inst)
  File "/home/rharding/src/bookie/local/lib/python2.7/site-packages/breadability/document.py", line 93, in html
    return self._parse(self.orig_html)
  File "/home/rharding/src/bookie/local/lib/python2.7/site-packages/breadability/document.py", line 80, in _parse
    doc = build_doc(html)
  File "/home/rharding/src/bookie/local/lib/python2.7/site-packages/breadability/document.py", line 54, in build_doc
    page_unicode = page.decode(enc, 'replace')
TypeError: decode() argument 1 must be string, not None
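The final frame shows `decode()` being called with an encoding of `None`, which suggests that encoding detection found nothing for this page. A defensive fallback might look like the sketch below. It is written for Python 3 and is only an assumption about a possible fix; `decode_with_fallback` is a hypothetical helper, not breadability's code.

```python
def decode_with_fallback(page, detected_encoding, default="utf-8"):
    """Decode page bytes, using a default when detection yields nothing.

    Hypothetical helper for illustration; not part of breadability.
    """
    encoding = detected_encoding or default
    try:
        return page.decode(encoding, "replace")
    except LookupError:
        # The detector named a codec Python does not know; use the default.
        return page.decode(default, "replace")


text_from_none = decode_with_fallback(b"caf\xc3\xa9", None)
text_from_unknown = decode_with_fallback(b"caf\xc3\xa9", "x-not-a-codec")
```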
