
ruby-readability's People

Contributors

andykent, andyw8, austinrfnd, bitdeli-chef, cantino, eguitarz, gioele, greatghoul, josip, libc, louismullie, magic003, marcosinger, mraaroncruz, olleolleolle, pagojo, peterc, samsaffron, tobym, tuzz, vinc

ruby-readability's Issues

TheGuardian.com image galleries are not picked up

Hello,

I tried to apply readability on a specific layout of The Guardian, which heavily relies on JavaScript but still has most of the text available in the HTML source code:

http://www.theguardian.com/football/gallery/2014/sep/10/memory-lane-1980s-footballers-at-home-in-pictures

Readability returned this chunk of HTML:

<div><div> comments <p>Sign in or create your Guardian account to join the discussion. </p> <p>This discussion is closed for comments.</p> <p> We’re doing some maintenance right now. You can still read comments, but please come back later to add your own. </p> <p> Commenting has been disabled for this account (why?) </p> </div></div>

Do you know why the main content is not extracted properly, and whether it can be fixed?

Encoding guessing of re-encoded string goes really wrong

When a re-encoded string is processed by Readability, both the newly created object and the input object are assigned the wrong encoding.

Consider the following test case, where an ISO-8859-7 (Greek) string is re-encoded to UTF-8:

kbody = open("http://www.kathimerini.gr").read.encode("utf-8", undef: :replace, invalid: :replace, replace: " " )

Checking kbody confirms the encoding:

kbody.encoding
=> #<Encoding:UTF-8> 

But when it is parsed as UTF-8 by Readability, both the old object and the new one have the wrong encoding assigned:

rbody = Readability::Document.new(kbody).content
rbody.encoding
=> #<Encoding:ISO-8859-7> 
kbody.encoding
=> #<Encoding:ISO-8859-7> 

This seems like a dual bug.

Although one can argue that the re-encoding could go wrong, I see no reason why the original string should have its encoding changed. I would expect the input to be treated as immutable, or at least for the documentation to state that the input string's encoding may change. Perhaps a dup of the string should be manipulated instead of the original input.
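
Until the library stops touching its input, one defensive workaround is to hand it a copy. The sketch below is plain Ruby with no Readability calls; `protect_input` is a hypothetical helper name, and the `force_encoding` call inside the block merely stands in for whatever in-place change the library makes.

```ruby
# Hypothetical helper: yields a duplicate of +html+ so that any in-place
# change made by the block (for example force_encoding) cannot reach the
# caller's original string.
def protect_input(html)
  yield html.dup
end

original = String.new("<p>Kalimera</p>", encoding: "UTF-8")

result = protect_input(original) do |copy|
  # Stand-in for a library that relabels its input in place.
  copy.force_encoding("ISO-8859-7")
  copy.encoding
end

puts result            # encoding seen inside the block
puts original.encoding # the caller's string keeps its own encoding
```

With a real document, the block body would be the call that builds the Readability document from `copy` instead of from the original string.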

content(..) creates side-effects by altering the state of this

When the content method is used, its sanitize step alters the state of the object so that subsequent calls return different results. This can be illustrated with images. If content is intended to have side effects on the state of the object, it should perhaps be named content! instead.

e.g.,

body = open("http://www.pagonis.org").read.encode("utf-8", undef: :replace, invalid: :replace, replace: " " )

rbody = Readability::Document.new(body, :tags => %w[div p img a], :attributes => %w[src href], :remove_empty_nodes => false, :do_not_guess_encoding => true)

rbody.images
=> ["images/pagonis.jpg"] 

rbody.content
rbody.images
=> [] 

It looks like this happens because sanitize operates on the article nodes directly and not on a copy of the article in processing.
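
The difference between sanitizing the stored nodes and sanitizing a copy can be shown with a toy model. `ToyDocument` below is hypothetical and is not the gem's code; it only mimics the shape of the problem (a content call that strips some nodes) and the suggested remedy of working on a deep copy.

```ruby
# Toy model, not ruby-readability's implementation: nodes are plain hashes.
class ToyDocument
  def initialize(nodes)
    @nodes = nodes
  end

  def images
    @nodes.select { |n| n[:tag] == "img" }.map { |n| n[:src] }
  end

  # Destructive variant: filters the stored nodes in place.
  def content!
    @nodes.reject! { |n| n[:tag] == "img" }
    render(@nodes)
  end

  # Non-destructive variant: filters a deep copy, leaving @nodes intact.
  def content
    copy = Marshal.load(Marshal.dump(@nodes))
    copy.reject! { |n| n[:tag] == "img" }
    render(copy)
  end

  private

  def render(nodes)
    nodes.map { |n| "<#{n[:tag]}>" }.join
  end
end

doc = ToyDocument.new([{ tag: "p" }, { tag: "img", src: "images/pagonis.jpg" }])
doc.content
after_safe = doc.images        # still lists the image
doc.content!
after_destructive = doc.images # the image is gone
```

The bang-suffixed name follows the Ruby convention mentioned above for methods that modify their receiver.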

Multi-language web pages not encoded in UTF-8 can't be parsed by Nokogiri

I found the reason why ruby-readability doesn't work well on Chinese websites.
Nokogiri can't correctly parse pages in non-UTF-8 encodings because of a bug in libxml2.
See Nokogiri issue: sparklemotion/nokogiri#215

If we pass this page ( http://www.nownews.com/2012/09/12/339-2853197.htm ) to ruby-readability,
we get nothing!
But if we convert the page to UTF-8 before passing it to ruby-readability, the output is good.

So, would it be better for the gem itself to convert the input to UTF-8?
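
The workaround described above, converting to UTF-8 before handing the page to the gem, might look like the following when the source encoding is already known. `to_utf8` is a hypothetical helper name, and the sample uses ISO-8859-1 bytes only to keep the example short; detecting the encoding is a separate problem that this sketch does not address.

```ruby
# Hypothetical helper: transcodes raw bytes from a known source encoding
# to UTF-8, replacing anything that cannot be converted.
def to_utf8(raw, source_encoding)
  raw.encode("UTF-8", source_encoding,
             invalid: :replace, undef: :replace, replace: "?")
end

raw  = "caf\xE9".b                 # "café" as ISO-8859-1 bytes
html = to_utf8(raw, "ISO-8859-1")

puts html.encoding
puts html.valid_encoding?
```

For a Chinese page the second argument would be the page's actual encoding (for example the one declared in its meta tag or HTTP headers).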

Validity Error

Hi,

I'm fetching HTML pages like the example:

  source = open(url).read
  puts Readability::Document.new(source, tags: %w[div p img a], attributes: %w[src href]).content

But for every page I try, I get a validation error similar to:

  element p: validity error : ID xxxxx already defined

Any ideas?

Thanks
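
libxml2 prints this message when the same id value appears on more than one element of the parsed page. A quick way to confirm that for a given page is to list the repeated ids. The sketch below uses a regular expression rather than a parser, so it is only a rough diagnostic; `duplicate_ids` is a hypothetical helper.

```ruby
# Hypothetical diagnostic: returns id values that occur more than once.
# Regex-based, so it can be fooled by ids inside scripts or comments.
def duplicate_ids(html)
  ids = html.scan(/\sid\s*=\s*["']([^"']+)["']/i).flatten
  ids.group_by { |id| id }.select { |_, hits| hits.size > 1 }.keys
end

sample = '<p id="intro">a</p><p id="intro">b</p><div id="main"></div>'
p duplicate_ids(sample)
```

If the list is non-empty for the fetched source, the message comes from the page's markup rather than from the options passed to the gem.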

noscript tags which include images are ignored?

The JS Readability seems to work fine with this kind of code, whereas the Ruby version doesn't.

The following snippet was lifted from AppleInsider

<div class="article-img">
<img src="http://photos.appleinsider.com/v9/images/1x1-white.jpg" width="660" height="317" alt="iPhone Plus" class="lazy" data-original="http://cdn1.appleinsider.com/iphoneplus-130205.jpg"><noscript><img src="http://cdn1.appleinsider.com/iphoneplus-130205.jpg"></noscript>
</div>

It seems to me that Ruby Readability should detect the noscript images and use them in its heuristics (modulo issue #51). Currently it doesn't.

New Rubygems release?

👋 Hi @cantino @olleolleolle

I realise this is an old gem, but I see you've made commits or merged PRs within the past year or so.

The last Rubygems release was in 2014 – do you think it would be feasible to make a new release any time soon?

Thanks

UTF-8 characters again

It seems that pages that contain UTF-8 characters still cannot be processed.

For example, using /bin/readability on a popular French website:
readability http://www.developpez.com/actu/35379/Novell-cede-Mono-a-Xamarin-une-mise-a-jour-de-la-plateforme-est-annoncee-pour-l-automne/

It crashes at line 216:
lib/readability.rb:216:in `=~': invalid byte sequence in UTF-8 (ArgumentError)
    from /Users/kimious/.rvm/gems/ruby-1.9.2-p180/gems/ruby-readability-0.2.3/lib/readability.rb:216:in `!~'

I can reproduce the bug on a lot of French web pages.
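
A caller-side mitigation is to repair the string before passing it on. `String#scrub` (available from Ruby 2.1, so newer than the 1.9.2 shown in the trace) replaces invalid byte sequences. The sample bytes below are made up for illustration and are not taken from the page above.

```ruby
# A string tagged as UTF-8 that contains a stray 0xE9 byte
# (Latin-1 "é"), which is not valid UTF-8 on its own.
page = "Novell c\xE9de Mono"

puts page.valid_encoding?   # the raw string is not valid UTF-8

clean = page.scrub("?")     # replace each invalid sequence with "?"
puts clean.valid_encoding?
puts clean
```

Scrubbing loses the original character; when the real source encoding is known, transcoding from it preserves the text instead.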

Extra div added (is it expected?)

First, thanks for your work on readability :-)

Just some quick feedback (I'm not a heavy user myself): while upgrading an old setup today, I noticed that raw content is now wrapped in two levels of divs:

1.9.3-p484 :003 > Readability::Document.new("My content").content
 => "<div><div><p>My content</p></div></div>" 

while previously (with a two-year-old version) it was returned as:

 => "<div><p>My content</p></div>" 

Is this expected? I understand that this specific test case is a bit unrealistic (no tags at all), but I wondered whether there could be similar issues with properly formatted HTML.

Images documentation

The documentation on retrieving images should perhaps state all the steps needed to retrieve image links.

e.g.,

You can get a list of images in the content area with .images. This feature requires that the mini_magick gem be installed.

rbody = Readability::Document.new(body, :tags => %w[div p img a], :attributes => %w[src href], :remove_empty_nodes => false)
rbody.images

Exception on Readability

[ERROR] Readability HTMLClean failed: undefined method `name' for nil:NilClass
/usr/local/lib/ruby/gems/1.9.1/gems/activesupport-3.0.5/lib/active_support/whiny_nil.rb:48:in `method_missing'
/usr/local/lib/ruby/gems/1.9.1/gems/ruby-readability-0.2.3/lib/readability.rb:115:in `select_best_candidate'
/usr/local/lib/ruby/gems/1.9.1/gems/ruby-readability-0.2.3/lib/readability.rb:51:in `content'

This happens in the example below when calling readability.content, on about 5% of the HTML pages we process, which are selected at random from the web.

It is triggered by debug output whose arguments are evaluated before the debug method is entered, even though our options[:debug] is nil.

html = html.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => " ")
begin
  readability = Readability::Document.new(html)
  return nil unless readability && readability.html
  content = readability.content
  content = content.gsub(/<.*?>/, ' ') if content # Remove any remaining HTML tags
rescue Exception => ex
  STDERR.puts "[ERROR] Readability HTMLClean failed: #{ex}" if @verbose
  ex.backtrace.each { |line| STDERR.puts(line) }
end

Can't handle headers as a hash when guessing encoding

I'm using readability with various HTML responses returned from Faraday.

I've just discovered the awesomeness of the 'guess_html_encoding' gem and am trying to get it to work with readability.

This code snippet works fine:
require 'readability'
require 'guess_html_encoding'
require 'faraday'

url = "http://inews.mingpao.com/htm/INews/20130319/gb21451k.htm"
response = Faraday.get(url)
headers = response.headers
html = response.body
guess = GuessHtmlEncoding.guess(html, headers)
puts "Encoding is #{guess}"
doc = Readability::Document.new(html, :encoding => guess, :remove_empty_nodes => true)

However, shortcutting by passing the headers directly to readability fails:
require 'readability'
require 'faraday'

url = "http://inews.mingpao.com/htm/INews/20130319/gb21451k.htm"
response = Faraday.get(url)
headers = response.headers
html = response.body
doc = Readability::Document.new(html, :html_headers => headers, :remove_empty_nodes => true)

I see this error: NoMethodError: undefined method 'gsub' for #<Faraday::Utils::Headers:0x007fb2fc155de0>
Faraday::Utils::Headers subclasses Hash by the way.

It looks like readability assumes the headers are a string. The README is unclear on what type they should be, but since guess_html_encoding takes a hash, I think readability should too.

Thanks, Darren.
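
Until a hash is accepted, one workaround is to flatten the headers into a string before passing them in. The error message shows gsub being called on the value, which suggests a string is expected; the exact "Name: value" line format used below is an assumption, and `headers_to_string` is a hypothetical helper.

```ruby
# Hypothetical helper: turns a Hash-like headers object into a
# "Name: value" string, one header per line. Strings pass through as-is.
def headers_to_string(headers)
  return headers if headers.is_a?(String)

  headers.map { |name, value| "#{name}: #{value}" }.join("\n")
end

headers = { "Content-Type" => "text/html; charset=big5", "Server" => "nginx" }
puts headers_to_string(headers)
```

Because Faraday::Utils::Headers subclasses Hash, as noted above, the same call should apply to response.headers.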

H1 gets lost

I have been experimenting with the gem to retrieve content from Wikipedia pages, but it seems that the H1 tags get lost during the process of text extraction:

source = open('http://en.wikipedia.org/wiki/Frimley_Green_Windmill').read
puts Readability::Document.new(source, tags: ['h1', 'p', 'div']).content

Output:

<div><div>
<p>Frimley Green Windmill is a Grade II listed[1]tower mill at Frimley Green, Surrey, England which has been converted to residential use.</p>
 [edit] History 
<p>Frimley Green Windmill was first mentioned in 1784 in the ownership of a Mr Terry. It passed to Thomas Lilley in 1792 and then William Collins in 1801. In 1803, the mill passed into the ownership of the Royal Military College, Sandhurst, remaining in the hands of the military until at least 1832 and probably much later than that. The mill was disused by 1870, and the derelict shell was converted to residential use in 1914. [2]</p>
 [edit] Description 
<div>For an explanation of the various pieces of machinery, see Mill machinery.</div>
<p>Frimley Green Windmill is a four storey brick tower mill. Little is known of the mill, although it had at least one pair of Spring or Patent sails.[2]</p>
 [edit] Millers 
 George Marshall 1792
John Banks 1801
 <p>Reference for above:-[2]</p>
 [edit] External links 
 [edit] References 

</div></div>

This is missing the only h1 tag on the page:

<h1 id="firstHeading" class="firstHeading">Frimley Green Windmill</h1>

I have experienced the same quirk with all Wikipedia pages. Any idea what could be causing this?

retrieve document title

It might be nice if there was a method to retrieve the document title.

Of course this could be easily implemented externally:

doc = Readability::Document.new(html)
title = doc.html.css("title").first
title = title ? title.text : nil

... but it seems like a fairly common thing to do, so perhaps it's worth including?

Segmentation fault ruby 1.9.3-p0

Hi,
I just got this segfault when using the provided example to scrape this page: http://git.or.cz/course/svn.html#merge. Another page worked fine. This is running on Lion.
I was using the version installed by gem install ruby-readability. I then tried again with --pre, and it also segfaulted.

irb(main):011:0> puts Readability::Document.new(source).content
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/ruby-readability-0.5.0/lib/readability.rb:136: [BUG] Segmentation fault
ruby 1.9.3p0 (2011-10-30 revision 33570) [x86_64-darwin11.2.0]

-- Control frame information -----------------------------------------------
c:0030 p:---- s:0119 b:0119 l:000118 d:000118 CFUNC  :content
c:0029 p:0044 s:0116 b:0116 l:000115 d:000115 METHOD /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/ruby-readability-0.5.0/lib/readability.rb:136
c:0028 p:0029 s:0110 b:0105 l:001db8 d:000104 BLOCK  /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/ruby-readability-0.5.0/lib/readability.rb:164
c:0027 p:---- s:0101 b:0101 l:000100 d:000100 FINISH
c:0026 p:---- s:0099 b:0099 l:000098 d:000098 CFUNC  :each
c:0025 p:0038 s:0096 b:0096 l:001db8 d:001db8 METHOD /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/ruby-readability-0.5.0/lib/readability.rb:163
c:0024 p:0088 s:0091 b:0091 l:000090 d:000090 METHOD /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/ruby-readability-0.5.0/lib/readability.rb:64
c:0023 p:0027 s:0083 b:0082 l:000f58 d:000081 EVAL   (irb):11
c:0022 p:---- s:0080 b:0080 l:000079 d:000079 FINISH
c:0021 p:---- s:0078 b:0078 l:000077 d:000077 CFUNC  :eval
c:0020 p:0028 s:0071 b:0071 l:000070 d:000070 METHOD /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/workspace.rb:80
c:0019 p:0033 s:0064 b:0063 l:000062 d:000062 METHOD /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/context.rb:254
c:0018 p:0031 s:0058 b:0058 l:0018d8 d:000057 BLOCK  /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb.rb:159
c:0017 p:0042 s:0050 b:0050 l:000049 d:000049 METHOD /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb.rb:273
c:0016 p:0011 s:0045 b:0045 l:0018d8 d:000044 BLOCK  /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb.rb:156
c:0015 p:0144 s:0041 b:0041 l:000024 d:000040 BLOCK  /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/ruby-lex.rb:243
c:0014 p:---- s:0038 b:0038 l:000037 d:000037 FINISH
c:0013 p:---- s:0036 b:0036 l:000035 d:000035 CFUNC  :loop
c:0012 p:0009 s:0033 b:0033 l:000024 d:000032 BLOCK  /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/ruby-lex.rb:229
c:0011 p:---- s:0031 b:0031 l:000030 d:000030 FINISH
c:0010 p:---- s:0029 b:0029 l:000028 d:000028 CFUNC  :catch
c:0009 p:0023 s:0025 b:0025 l:000024 d:000024 METHOD /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/ruby-lex.rb:228
c:0008 p:0046 s:0022 b:0022 l:0018d8 d:0018d8 METHOD /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb.rb:155
c:0007 p:0011 s:0019 b:0019 l:0018b8 d:000018 BLOCK  /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb.rb:70
c:0006 p:---- s:0017 b:0017 l:000016 d:000016 FINISH
c:0005 p:---- s:0015 b:0015 l:000014 d:000014 CFUNC  :catch
c:0004 p:0183 s:0011 b:0011 l:0018b8 d:0018b8 METHOD /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb.rb:69
c:0003 p:0039 s:0006 b:0006 l:001b68 d:002078 EVAL   /Users/aaron/.rbenv/versions/1.9.3-p0/bin/irb:12
c:0002 p:---- s:0004 b:0004 l:000003 d:000003 FINISH
c:0001 p:0000 s:0002 b:0002 l:001b68 d:001b68 TOP   

-- Ruby level backtrace information ----------------------------------------
/Users/aaron/.rbenv/versions/1.9.3-p0/bin/irb:12:in `<main>'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb.rb:69:in `start'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb.rb:69:in `catch'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb.rb:70:in `block in start'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb.rb:155:in `eval_input'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/ruby-lex.rb:228:in `each_top_level_statement'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/ruby-lex.rb:228:in `catch'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/ruby-lex.rb:229:in `block in each_top_level_statement'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/ruby-lex.rb:229:in `loop'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/ruby-lex.rb:243:in `block (2 levels) in each_top_level_statement'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb.rb:156:in `block in eval_input'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb.rb:273:in `signal_status'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb.rb:159:in `block (2 levels) in eval_input'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/context.rb:254:in `evaluate'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/workspace.rb:80:in `evaluate'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/workspace.rb:80:in `eval'
(irb):11:in `irb_binding'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/ruby-readability-0.5.0/lib/readability.rb:64:in `content'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/ruby-readability-0.5.0/lib/readability.rb:163:in `score_paragraphs'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/ruby-readability-0.5.0/lib/readability.rb:163:in `each'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/ruby-readability-0.5.0/lib/readability.rb:164:in `block in score_paragraphs'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/ruby-readability-0.5.0/lib/readability.rb:136:in `get_link_density'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/ruby-readability-0.5.0/lib/readability.rb:136:in `content'

-- C level backtrace information -------------------------------------------

   See Crash Report log file under ~/Library/Logs/CrashReporter or
   /Library/Logs/CrashReporter, for the more detail of.

-- Other runtime information -----------------------------------------------

* Loaded script: irb

* Loaded features:

    0 enumerator.so
    1 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/x86_64-darwin11.2.0/enc/encdb.bundle
    2 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/x86_64-darwin11.2.0/enc/trans/transdb.bundle
    3 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/rubygems/defaults.rb
    4 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/x86_64-darwin11.2.0/rbconfig.rb
    5 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/rubygems/deprecate.rb
    6 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/rubygems/exceptions.rb
    7 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/rubygems/custom_require.rb
    8 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/rubygems.rb
    9 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/e2mmap.rb
   10 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/init.rb
   11 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/workspace.rb
   12 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/inspector.rb
   13 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/context.rb
   14 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/extend-command.rb
   15 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/output-method.rb
   16 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/notifier.rb
   17 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/slex.rb
   18 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/ruby-token.rb
   19 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/ruby-lex.rb
   20 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/src_encoding.rb
   21 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/magic-file.rb
   22 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/x86_64-darwin11.2.0/readline.bundle
   23 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/input-method.rb
   24 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/locale.rb
   25 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb.rb
   26 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/rubygems/version.rb
   27 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/rubygems/requirement.rb
   28 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/rubygems/platform.rb
   29 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/rubygems/specification.rb
   30 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/rubygems/path_support.rb
   31 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/rubygems/dependency.rb
   32 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/nokogiri.bundle
   33 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/version.rb
   34 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/syntax_error.rb
   35 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/pp/node.rb
   36 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/pp/character_data.rb
   37 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/pp.rb
   38 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/parse_options.rb
   39 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/sax/document.rb
   40 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/sax/parser_context.rb
   41 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/sax/parser.rb
   42 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/sax/push_parser.rb
   43 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/sax.rb
   44 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/x86_64-darwin11.2.0/stringio.bundle
   45 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/node/save_options.rb
   46 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/node.rb
   47 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/attribute_decl.rb
   48 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/element_decl.rb
   49 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/element_content.rb
   50 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/character_data.rb
   51 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/namespace.rb
   52 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/attr.rb
   53 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/dtd.rb
   54 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/cdata.rb
   55 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/text.rb
   56 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/document.rb
   57 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/document_fragment.rb
   58 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/processing_instruction.rb
   59 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/node_set.rb
   60 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/syntax_error.rb
   61 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/xpath/syntax_error.rb
   62 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/xpath.rb
   63 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/xpath_context.rb
   64 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/builder.rb
   65 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/reader.rb
   66 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/notation.rb
   67 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/entity_decl.rb
   68 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/schema.rb
   69 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/relax_ng.rb
   70 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml.rb
   71 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xslt/stylesheet.rb
   72 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xslt.rb
   73 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/html/entity_lookup.rb
   74 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/html/document.rb
   75 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/html/document_fragment.rb
   76 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/html/sax/parser_context.rb
   77 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/html/sax/parser.rb
   78 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/html/element_description.rb
   79 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/html/element_description_defaults.rb
   80 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/html.rb
   81 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/decorators/slop.rb
   82 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/css/node.rb
   83 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/css/xpath_visitor.rb
   84 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/x86_64-darwin11.2.0/racc/cparse.bundle
   85 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/racc/parser.rb
   86 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/thread.rb
   87 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/css/parser_extras.rb
   88 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/css/parser.rb
   89 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/x86_64-darwin11.2.0/strscan.bundle
   90 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/css/tokenizer.rb
   91 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/css/syntax_error.rb
   92 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/css.rb
   93 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/html/builder.rb
   94 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri.rb
   95 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/guess_html_encoding-0.0.2/lib/guess_html_encoding/version.rb
   96 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/guess_html_encoding-0.0.2/lib/guess_html_encoding.rb
   97 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/ruby-readability-0.5.0/lib/readability.rb
   98 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/ruby-readability-0.5.0/lib/ruby-readability.rb
   99 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/uri/common.rb
  100 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/uri/generic.rb
  101 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/uri/ftp.rb
  102 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/uri/http.rb
  103 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/uri/https.rb
  104 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/uri/ldap.rb
  105 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/uri/ldaps.rb
  106 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/uri/mailto.rb
  107 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/uri.rb
  108 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/x86_64-darwin11.2.0/date_core.bundle
  109 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/date/format.rb
  110 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/date.rb
  111 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/time.rb
  112 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/open-uri.rb
  113 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/x86_64-darwin11.2.0/socket.bundle
  114 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/socket.rb
  115 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/timeout.rb
  116 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/net/protocol.rb
  117 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/x86_64-darwin11.2.0/zlib.bundle
  118 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/net/http.rb
  119 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/x86_64-darwin11.2.0/enc/trans/single_byte.bundle
  120 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/delegate.rb
  121 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/x86_64-darwin11.2.0/etc.bundle
  122 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/fileutils.rb
  123 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/tmpdir.rb
  124 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/tempfile.rb
  125 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/x86_64-darwin11.2.0/enc/iso_8859_1.bundle

[NOTE]
You may have encountered a bug in the Ruby interpreter or extension libraries.
Bug reports are welcome.
For details: http://www.ruby-lang.org/bugreport.html

[1]    29286 abort      irb

Retrieve date, created_on and updated_on

How can I extract the date on which the page being retrieved was 'created' and 'updated'?
I tried using the method 'date_published', which is the JSON element exposed by the Readability Parser API, but of course that did not work.

I am not exactly sure if there is already a way to do it, but if there isn't, it would be great if we can have a method that does this. However, if there is, this is not exactly an Issue.

Readability's title

Readability pulls its article title from the title tag, right? Well, more often than not, the title tag has a whole lot of other information besides the title of the article. It usually includes the title of the site itself and sometimes a category.

I know the original readability script just grabbed the title, but I'm wondering if this version of the script can be modified to grab the actual title of the article from the markup. It seems as though the scoring system is set up to exclude the header tag that contains the article title.

Example:

<article>
  <div class="article-title">
    <h1>Article title</h1>
  </div>
  <div class="article-content">
    <p>
      Claritatem insitam; est usus legentis in iis qui facit eorum claritatem.
      Investigationes demonstraverunt lectores legere me lius quod ii legunt
      saepius. Claritas est etiam processus dynamicus, qui sequitur mutationem
      consuetudium lectorum. Mirum est notare quam littera gothica, quam nunc
      putamus parum claram, anteposuerit litterarum formas humanitatis per seacula
      quarta decima et quinta decima. Eodem modo typi, qui nunc nobis videntur
      parum clari, fiant sollemnes in futurum.
    </p>
    <p>
      Nunc varius risus quis nulla. Vivamus vel magna. Ut rutrum. Aenean
      dignissim, leo quis faucibus semper, massa est faucibus massa, sit amet
      pharetra arcu nunc et sem. Aliquam tempor. Nam lobortis sem non urna.
      Pellentesque et urna sit amet leo accumsan volutpat. Nam molestie lobortis
      lorem. Quisque eu nulla. Donec id orci in ligula dapibus egestas. Donec sed
      velit ac lectus mattis sagittis.
    </p>
  </div>
</article>

In the above example, readability will always grab the content from .article-content and not the <article> tag itself. What can I do to modify the script to grab the whole article, title and all?
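
As a stopgap for the title half of this question, the text of the title tag can be trimmed outside the gem. This is a minimal sketch, assuming the site name is separated from the article title by a common delimiter such as " | " or " - " and that the article title is the longest segment. The helper name clean_title and the separator list are hypothetical and not part of ruby-readability.

```ruby
# Hypothetical helper, not part of ruby-readability.
# Separators must be surrounded by whitespace, so hyphenated words survive.
TITLE_SEPARATORS = /\s+(?:\||-|::|»)\s+/

def clean_title(raw_title)
  segments = raw_title.to_s.strip.split(TITLE_SEPARATORS)
  return "" if segments.empty?

  # Assumption: the longest segment is the article title; ties go to the earliest.
  segments.each_with_index.max_by { |segment, index| [segment.length, -index] }.first
end

puts clean_title("Article title that is fairly long | Example Site")
```

The heuristic will pick the wrong segment when the site name is longer than the headline, so it is worth checking against a handful of real pages first.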

Select_best_candidate fails on Lifehacker page

Sorry @iterationlabs this is not my week it seems :-/

I just observed the following while retrieving a Lifehacker page:

require 'open-uri'
require 'readability'
page = open("http://lifehacker.com/5866449/lifehacker-faceoff-the-best-digital-digests-on-ipad-and-iphone").read
rbody = Readability::Document.new(page).content

as well as when I do

rbody = Readability::Document.new(page, :do_not_guess_encoding => true).content

The result is the following exception due to a nil element.

#<NoMethodError: undefined method `name' for nil:NilClass>
NoMethodError: undefined method `name' for nil:NilClass
    from /home/user/.rvm/gems/ruby-1.9.2-p290@classic/gems/ruby-readability-0.5.2/lib/readability.rb:194:in `select_best_candidate'
    from /home/user/.rvm/gems/ruby-1.9.2-p290@classic/gems/ruby-readability-0.5.2/lib/readability.rb:44:in `prepare_candidates'
    from /home/user/.rvm/gems/ruby-1.9.2-p290@classic/gems/ruby-readability-0.5.2/lib/readability.rb:130:in `content'

Any ideas if this is a bug or am I doing something wrong?

Images not detected when the height and width are not inline

I think I've identified what is causing some images to come in and others not. When the image width and height are not inline in the source content, the image gets left behind. I think it's coming from this block:

readability.rb

elements.each do |element|
          next unless element["src"]

          url     = element["src"].value
          height  = element["height"].nil?  ? 0 : element["height"].value.to_i
          width   = element["width"].nil?   ? 0 : element["width"].value.to_i

          if url =~ /\Ahttps?:\/\//i && (height.zero? || width.zero?)
            image   = get_image_size(url)
            next unless image
          else
            image = {:width => width, :height => height}
          end

          image[:format] = File.extname(url).gsub(".", "")

          if tested_images.include?(url)
            debug("Image was tested: #{url}")
            next
          end

          tested_images.push(url)
          if image_meets_criteria?(image)
            list_images << url
          else
            debug("Image discarded: #{url} - height: #{image[:height]} - width: #{image[:width]} - format: #{image[:format]}")
          end
        end

      (list_images.empty? and content != @html) ? images(@html, true) : list_images
    end
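
The behaviour described above follows from how the excerpt decides where the dimensions come from. Here is a minimal sketch of that decision in isolation, treating a missing attribute as 0 and assuming attributes arrive as strings or nil. The name dimension_source is hypothetical and used only for illustration.

```ruby
# Hypothetical illustration of the branch in the excerpt above.
# A remote size lookup only happens for absolute http(s) URLs that lack
# an inline width or height; everything else uses the inline values.
def dimension_source(url, width_attr, height_attr)
  width  = width_attr.to_i   # nil.to_i is 0
  height = height_attr.to_i
  absolute = url.match?(/\Ahttps?:\/\//i)

  if absolute && (width.zero? || height.zero?)
    :remote_lookup
  else
    :inline_attributes
  end
end

puts dimension_source("http://example.com/a.jpg", nil, nil)
puts dimension_source("/a.jpg", nil, nil)
```

Under these assumptions a relative URL with no inline size never reaches the remote lookup, so it is judged on a width and height of 0.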

Help troubleshooting what's stripped out

Thanks for your work on this neat gem.

Running readability on the HTML from https://100wordstory.org/submit/, I expected more markup to remain than readability leaves intact.

Expected

image

Observed

In the screenshot above, the following content is stripped out:

  1. the red "Submit" heading:
<h1 class="titles">
  <a href="https://100wordstory.org/submit/" rel="bookmark" title="SubmitPermanent Link to ">Submit</a>
</h1>
  2. the red "Submissions are now open through January 9, 2024" and "Submit!" headings and links:
<h2 style="text-align: center;"><a href="https://100wordstory.submittable.com/submit">Submissions are now open through January 9, 2024!</a></h2>
<h2 style="text-align: center;"><a href="https://100wordstory.submittable.com/submit">Submit!</a></h2>

Turning on debug: true doesn't seem to cite why these items are missing:

% readability -d https://100wordstory.org/submit/
/Users/avk/.rvm/gems/ruby-2.7.8@wbm/gems/ruby-readability-0.7.0/bin/readability:31: warning: calling URI.open via Kernel#open is deprecated, call URI.open directly or use URI#open
Removing unlikely candidate - magnific_popup-css
Removing unlikely candidate - nav superfishmenu-100-word-story-menu
Removing unlikely candidate - menu-item menu-item-type-post_type menu-item-object-page menu-item-73menu-item-73
Removing unlikely candidate - menu-item menu-item-type-post_type menu-item-object-page current-menu-item page_item page-item-6 current_page_item menu-item-72menu-item-72
Removing unlikely candidate - menu-item menu-item-type-post_type menu-item-object-page menu-item-83menu-item-83
Removing unlikely candidate - menu-item menu-item-type-post_type menu-item-object-page menu-item-189menu-item-189
Removing unlikely candidate - menu-item menu-item-type-post_type menu-item-object-page menu-item-70menu-item-70
Removing unlikely candidate - header
Removing unlikely candidate - comments
Removing unlikely candidate - commentlist clearfix
Removing unlikely candidate - comment even thread-even depth-1 parentcomment-65
Removing unlikely candidate - comment-author vcard
Removing unlikely candidate - comment-meta commentmetadata
Removing unlikely candidate - comment byuser comment-author-100words bypostauthor odd alt depth-2comment-66
Removing unlikely candidate - comment-author vcard
Removing unlikely candidate - comment-meta commentmetadata
Removing unlikely candidate - comment byuser comment-author-100words bypostauthor even thread-odd thread-alt depth-1comment-57
Removing unlikely candidate - comment-author vcard
Removing unlikely candidate - comment-meta commentmetadata
Removing unlikely candidate - comment odd alt thread-even depth-1comment-56
Removing unlikely candidate - comment-author vcard
Removing unlikely candidate - comment-meta commentmetadata
Removing unlikely candidate - comment even thread-odd thread-alt depth-1comment-52
Removing unlikely candidate - comment-author vcard
Removing unlikely candidate - comment-meta commentmetadata
Removing unlikely candidate - sidebar-wrapper
Removing unlikely candidate - sidebar
Removing unlikely candidate - sidebar-box widget_blockblock-3
Removing unlikely candidate - widget_text sidebar-box widget_custom_htmlcustom_html-2
Removing unlikely candidate - sidebar-box widget_texttext-3
Removing unlikely candidate - sidebar-box widget_texttext-4
Removing unlikely candidate - sidebar-box widget_texttext-7
Removing unlikely candidate - sidebar-box widget_linkslinkcat-10
Removing unlikely candidate - footer
Altering div(#pages.) to p
Altering div(#.) to p
Altering div(#.) to p
Altering div(#.) to p
Altering div(#.) to p
Altering div(#.) to p
Altering div(#.) to p
Altering div(#.) to p
Top 5 candidates:
Candidate div#.post-wrapper with score 51.935052531041066
Candidate div#left-div. with score 16.71186440677966
Best candidate div#.post-wrapper with score 51.935052531041066
Conditionally cleaned div#.addtoany_share_save_container addtoany_content addtoany_content_bottom with weight 25 and content score 0 because it has too short a content length without a single image.
Conditionally cleaned div#.a2a_kit a2a_kit_size_24 addtoany_list with weight 0 and content score 0 because it has too short a content length without a single image.
Conditionally cleaned div#.recentposts with weight 25 and content score 0 because it has too short a content length without a single image.
<div><div>
					
					
					
					<p>100 words for your story … no more or no less. Tell a story, pen a slice of your memoir, or try your hand at an essay.</p>
<p>You get 100 words—exactly 100 words—which is both the pain and the pleasure here. It’s short, you tell yourself. You could write 100 words at a bus stop, on your lunch break, in your sleep. But with 100 words you must tell the whole story in its entirety, so it holds together like a perfect little doll house. (Your title is not part of the 100 words.)</p>
<p>Please include a short bio (25 words, max!) with your submission. Also, did we say exactly 100 words? We weren’t kidding! We count words according to Microsoft Word’s word-count tally. Also, make friends with your spell-check, or have a friend proofread your story.</p>
<p>We currently charge a $2 submission fee, the minimum in order to cover the costs of the submission system.</p>
<p> </p>
<p> </p>
					
										
											
						
						
									</div></div>

Any ideas on how to broaden or include this content?

Extracting the HTML of main content?

Is it possible to get the HTML of the main content area? I would like to preserve the tags present in the main content area (whitelisted tags, mainly the divs, p tags)

Bundle error (Johnson gem related)

I added readability to Gemfile:

gem 'readability'

and tried to bundle. But bundling failed due to the following reason:

An error occured while installing johnson (2.0.0.pre3), and Bundler cannot continue.

After some Googling I learned that this is because johnson doesn't work with Ruby 1.9. But since readability requires johnson to work in Rails, this makes it impossible for me to use readability in Rails projects.

Am I wrong, or is there any way to fix the issue? Thank you!

Java::JavaLang::NullPointerException on JRuby

When you don't put 'div' as a tag in the initializer like:

require 'rubygems'
require 'readability'
require 'open-uri'
    
source = open('https://developers.google.com/custom-search/docs/tutorial/creatingcse').read
puts Readability::Document.new(source, tags: []).content

it throws the error:
NullPointerException:
from nokogiri.internals.SaveContextVisitor.isHtmlScript(SaveContextVisitor.java:741)

You can pass any tags you want; if 'div' is not among them, the error happens.

using jruby-9.1.6.0 [ x86_64 ]

HTML of the content extracted

The main content of the article is extracted using .content. But how can the main content be extracted with the same CSS formatting?

Possible Memory Leak

There is a possible memory leak with Readability. I have a process with 40 threads, which calls a method which runs Readability on HTML documents. The method is rather simple:

def stripTags(source)
  content = Readability::Document.new(source).content
  # strip_tags is the Rails helper method for stripping tags
  content = strip_tags(content).gsub("\n", " ").squeeze(" ").strip
  return content
end

I saw my memory usage increase gradually, from < 5% all the way to 80% after a day or so. What I did was try to narrow down the cause, so I commented out the Readability logic/calls, and that resolved the issue: no memory leaks. As soon as I put back the Readability call, the memory leak started again.

To temporarily fix this, I simply monitored my process with God, and had it restart if memory usage got too high, but I'm fairly certain there's a memory leak with the Ruby port of Readability.
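
To narrow this down further, it may help to check whether Ruby-level objects accumulate across repeated calls. This is a minimal sketch using only GC.stat from the standard runtime; heap_growth is a hypothetical helper, and the block passed to it here is a stand-in for the real Readability call. Memory held by native extensions would not show up in this number, so a flat result does not rule out a leak there.

```ruby
# Hypothetical helper: reports the change in live heap slots after
# running the given block a number of times, with a GC pass on each side.
def heap_growth(iterations)
  GC.start
  before = GC.stat(:heap_live_slots)
  iterations.times { yield }
  GC.start
  GC.stat(:heap_live_slots) - before
end

# Stand-in workload; replace the block body with the call under suspicion.
growth = heap_growth(1_000) { "x" * 100 }
puts "live slots changed by #{growth}"
```

A figure that keeps climbing as the iteration count is raised would point at retained Ruby objects rather than at native memory.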

Loses links in h-titles, even when asked

Ran into a new issue. I am trying to get it to give me the content while keeping a and h1 tags in place. It works fine for something like <h1>foo</h1> or <a href='#'>foo</a>, but the following example just gets lost: <h1><a href='#'>foo</a></h1>

Image URLs which are not FQDN and have no size are missed by images()

e.g.,

...
<div id="page">

<div class="article-img">
<img src="/v9/images/1x1-white.jpg" width="660" height="317" alt="iPhone Plus" class="lazy" data-original="/iphoneplus-130205.jpg"><noscript><img src="/iphoneplus-130205.jpg"></noscript>
</div>
<div class="article-img">
<img src="/plus-130906.jpg">
</div>
...

will only return /v9/images/1x1-white.jpg when images() is called.

However, if images_with_fqdn_uris!("http://bla.com") is called, then subsequent calls to images() will return an array with all image URLs (fully qualified with http://bla.com).

Now the question is: what should the desired behaviour be?
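
Whichever behaviour is chosen, resolving relative src values against the page URL is straightforward with the standard library. Below is a minimal sketch, assuming the caller knows the page's base URL. The helper absolutize is hypothetical and is not the gem's images_with_fqdn_uris! implementation.

```ruby
require "uri"

# Hypothetical helper: resolves a possibly relative image src against
# the URL of the page it came from. Returns nil for unparseable input.
def absolutize(src, base_url)
  URI.join(base_url, src).to_s
rescue URI::InvalidURIError
  nil
end

puts absolutize("/plus-130906.jpg", "http://bla.com/articles/1")
puts absolutize("http://cdn.example.com/a.png", "http://bla.com/")
```

URLs that are already absolute pass through unchanged, so the same call can be applied to every src on the page.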

'transform_misused_divs_into_paragraphs!' causes segfault

I originally thought this was a Nokogiri issue (which I documented in detail here), but further testing leads me to believe that it's a ruby-readability issue triggered when anything hits transform_misused_divs_into_paragraphs! in /lib/readability.rb. This doesn't happen with every URL we test, which leads me to believe that something is borking on a non-standard encoding or a weird element in the example page below.

Error

# TO REPRODUCE
url = "http://www.eweek.com/c/a/Apple/Look-Out-Enterprise-Mac-OS-X-to-Get-Journaling/" 
text = (Faraday.get url).body #this is a successful request

Readability::Document.new(text).content

/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node.rb:830: [BUG] Segmentation fault
ruby 1.9.3p125 (2012-02-16 revision 34643) [x86_64-darwin11.2.0]

-- Control frame information -----------------------------------------------
c:0050 p:---- s:0189 b:0189 l:000188 d:000188 CFUNC  :native_write_to
c:0049 p:0250 s:0182 b:0182 l:000181 d:000181 METHOD /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node.rb:830
c:0048 p:0183 s:0172 b:0172 l:000171 d:000171 METHOD /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node.rb:752
c:0047 p:0149 s:0163 b:0163 l:000162 d:000162 METHOD /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node.rb:769
c:0046 p:0014 s:0159 b:0159 l:000410 d:000158 BLOCK  /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node.rb:640
c:0045 p:---- s:0156 b:0156 l:000155 d:000155 FINISH
c:0044 p:---- s:0154 b:0154 l:0014c0 d:000153 IFUNC 
c:0043 p:0015 s:0152 b:0151 l:000141 d:000150 BLOCK  /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node_set.rb:239
c:0042 p:---- s:0148 b:0148 l:000147 d:000147 FINISH
c:0041 p:---- s:0146 b:0146 l:000145 d:000145 CFUNC  :upto
c:0040 p:0023 s:0142 b:0142 l:000141 d:000141 METHOD /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node_set.rb:238
c:0039 p:---- s:0138 b:0138 l:000137 d:000137 FINISH
c:0038 p:---- s:0136 b:0136 l:0014c0 d:0014c0 CFUNC  :map
c:0037 p:0017 s:0133 b:0133 l:000410 d:000410 METHOD /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node.rb:640
c:0036 p:0034 s:0129 b:0129 l:000468 d:000128 BLOCK  /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/bundler/gems/ruby-readability-574c54e8c02c/lib/readability.rb:303
c:0035 p:0015 s:0126 b:0126 l:000116 d:000125 BLOCK  /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node_set.rb:239
c:0034 p:---- s:0123 b:0123 l:000122 d:000122 FINISH
c:0033 p:---- s:0121 b:0121 l:000120 d:000120 CFUNC  :upto
c:0032 p:0023 s:0117 b:0117 l:000116 d:000116 METHOD /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node_set.rb:238
c:0031 p:0021 s:0113 b:0113 l:000468 d:000468 METHOD /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/bundler/gems/ruby-readability-574c54e8c02c/lib/readability.rb:300
c:0030 p:0046 s:0110 b:0110 l:000ba8 d:000ba8 METHOD /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/bundler/gems/ruby-readability-574c54e8c02c/lib/readability.rb:41
c:0029 p:0030 s:0107 b:0107 l:000106 d:000106 METHOD /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/bundler/gems/ruby-readability-574c54e8c02c/lib/readability.rb:137
c:0028 p:0026 s:0101 b:0101 l:000c68 d:000100 EVAL   (irb):5
c:0027 p:---- s:0099 b:0099 l:000098 d:000098 FINISH
c:0026 p:---- s:0097 b:0097 l:000096 d:000096 CFUNC  :eval
c:0025 p:0028 s:0090 b:0090 l:000089 d:000089 METHOD /Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb/workspace.rb:80
c:0024 p:0033 s:0083 b:0082 l:000081 d:000081 METHOD /Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb/context.rb:254
c:0023 p:0031 s:0077 b:0077 l:000a98 d:000076 BLOCK  /Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb.rb:159
c:0022 p:0042 s:0069 b:0069 l:000068 d:000068 METHOD /Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb.rb:273
c:0021 p:0011 s:0064 b:0064 l:000a98 d:000063 BLOCK  /Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb.rb:156
c:0020 p:0144 s:0060 b:0060 l:000043 d:000059 BLOCK  /Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb/ruby-lex.rb:243
c:0019 p:---- s:0057 b:0057 l:000056 d:000056 FINISH
c:0018 p:---- s:0055 b:0055 l:000054 d:000054 CFUNC  :loop
c:0017 p:0009 s:0052 b:0052 l:000043 d:000051 BLOCK  /Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb/ruby-lex.rb:229
c:0016 p:---- s:0050 b:0050 l:000049 d:000049 FINISH
c:0015 p:---- s:0048 b:0048 l:000047 d:000047 CFUNC  :catch
c:0014 p:0023 s:0044 b:0044 l:000043 d:000043 METHOD /Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb/ruby-lex.rb:228
c:0013 p:0046 s:0041 b:0041 l:000a98 d:000a98 METHOD /Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb.rb:155
c:0012 p:0011 s:0038 b:0038 l:002018 d:000037 BLOCK  /Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb.rb:70
c:0011 p:---- s:0036 b:0036 l:000035 d:000035 FINISH
c:0010 p:---- s:0034 b:0034 l:000033 d:000033 CFUNC  :catch
c:0009 p:0183 s:0030 b:0030 l:002018 d:002018 METHOD /Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb.rb:69
c:0008 p:0222 s:0025 b:0025 l:002040 d:002040 METHOD /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/railties-3.1.4/lib/rails/commands/console.rb:45
c:0007 p:0019 s:0021 b:0021 l:000020 d:000020 METHOD /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/railties-3.1.4/lib/rails/commands/console.rb:8
c:0006 p:0615 s:0017 b:0017 l:000016 d:000016 TOP    /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/railties-3.1.4/lib/rails/commands.rb:40
c:0005 p:---- s:0012 b:0012 l:000011 d:000011 FINISH
c:0004 p:---- s:0010 b:0010 l:000009 d:000009 CFUNC  :require
c:0003 p:0061 s:0006 b:0006 l:001228 d:001458 EVAL   script/rails:6
c:0002 p:---- s:0004 b:0004 l:000003 d:000003 FINISH
c:0001 p:0000 s:0002 b:0002 l:001228 d:001228 TOP   

-- Ruby level backtrace information ----------------------------------------
script/rails:6:in `<main>'
script/rails:6:in `require'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/railties-3.1.4/lib/rails/commands.rb:40:in `<top (required)>'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/railties-3.1.4/lib/rails/commands/console.rb:8:in `start'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/railties-3.1.4/lib/rails/commands/console.rb:45:in `start'
/Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb.rb:69:in `start'
/Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb.rb:69:in `catch'
/Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb.rb:70:in `block in start'
/Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb.rb:155:in `eval_input'
/Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb/ruby-lex.rb:228:in `each_top_level_statement'
/Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb/ruby-lex.rb:228:in `catch'
/Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb/ruby-lex.rb:229:in `block in each_top_level_statement'
/Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb/ruby-lex.rb:229:in `loop'
/Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb/ruby-lex.rb:243:in `block (2 levels) in each_top_level_statement'
/Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb.rb:156:in `block in eval_input'
/Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb.rb:273:in `signal_status'
/Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb.rb:159:in `block (2 levels) in eval_input'
/Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb/context.rb:254:in `evaluate'
/Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb/workspace.rb:80:in `evaluate'
/Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb/workspace.rb:80:in `eval'
(irb):5:in `irb_binding'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/bundler/gems/ruby-readability-574c54e8c02c/lib/readability.rb:137:in `content'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/bundler/gems/ruby-readability-574c54e8c02c/lib/readability.rb:41:in `prepare_candidates'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/bundler/gems/ruby-readability-574c54e8c02c/lib/readability.rb:300:in `transform_misused_divs_into_paragraphs!'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node_set.rb:238:in `each'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node_set.rb:238:in `upto'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node_set.rb:239:in `block in each'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/bundler/gems/ruby-readability-574c54e8c02c/lib/readability.rb:303:in `block in transform_misused_divs_into_paragraphs!'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node.rb:640:in `inner_html'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node.rb:640:in `map'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node_set.rb:238:in `each'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node_set.rb:238:in `upto'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node_set.rb:239:in `block in each'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node.rb:640:in `block in inner_html'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node.rb:769:in `to_html'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node.rb:752:in `serialize'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node.rb:830:in `write_to'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node.rb:830:in `native_write_to'

Environment

ruby 1.9.3p125 (2012-02-16 revision 34643) [x86_64-darwin11.2.0]

ruby-readability (0.5.3)

nokogiri (1.5.5, 1.5.2)

RubyGems Environment:
  - RUBYGEMS VERSION: 1.8.19
  - RUBY VERSION: 1.9.3 (2012-02-16 patchlevel 125) [x86_64-darwin11.2.0]
  - INSTALLATION DIRECTORY: /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125
  - RUBY EXECUTABLE: /Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/bin/ruby
  - EXECUTABLE DIRECTORY: /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/bin
  - RUBYGEMS PLATFORMS:
    - ruby
    - x86_64-darwin-11
  - GEM PATHS:
     - /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125
     - /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125@global
  - GEM CONFIGURATION:
     - :update_sources => true
     - :verbose => true
     - :benchmark => false
     - :backtrace => false
     - :bulk_threshold => 1000
  - REMOTE SOURCES:
     - http://rubygems.org/

RuntimeError: Could not reparent node

Hi,

I stumbled upon an interesting problem.
I think this shouldn't really happen.

source = open("http://www.dnevnik.bg/izbori2013/2013/05/20/2064070_cvetan_cvetanov_predvijda_novi_predsrochni_izbori_sled/").read
Readability::Document.new(source, :tags => %w[p img a strong em], :attributes => %w[src href], :remove_empty_nodes => true).content
RuntimeError: Could not reparent node

Running with DIV tags produces the expected result though.

Readability::Document.new(source, :tags => %w[div p img a strong em], :attributes => %w[src href], :remove_empty_nodes => true).content

In case the source of the page changes, here is a preview:

puts Readability::Document.new(source).content
<div><div>


 Заместник-председателят на ГЕРБ Цветан Цветанов изрази увереност, че след няколко месеца ще има нови предсрочни парламентарни избори. В сутрешния блок на БНТ бившият вицепремиер и вътрешен министър заяви, че е поискал изрично от лидера на партията Бойко Борисов да не бъде включван в проектокабинет, докато не бъде изчистено името му.<p>Цветанов подчерта, че оттук нататък ще има много спекулации за това, че депутати от ГЕРБ ще подкрепят правителство на Пламен Орешарски. "Към настоящия момент всички избрани народни представители от ГЕРБ ще бъдат единни", допълни обаче той.</p>
<p>"Важно е да видим тази тройна коалиция, която ще бъде прикрита пред едно програмно правителство. Много е важно за българските граждани да чуят лъжите, които се казаха по време на предизборната кампания от трите други партии в новия парламент – БСП, ДПС и "Атака", каза още зам.-лидерът на ГЕРБ, като предрече, че след няколко месеца ще има нови предсрочни избори.</p>
<p>По думите му БСП, ДПС и "Атака" имат много различия, като например БСП подкрепя изграждането на АЕЦ "Белене", а според ДПС това не може да се случи. ДПС пък настоява за плосък данък, "Атака" иска сега и веднага 1000 лева заплата, а 500 лева пенсия.</p>
<p>От друга страна, ГЕРБ е показала какво е свършила през своя мандат и постави основните акценти на партията като настояване за плосък данък от 10%, бързо възстановяване на ДДС, законодателни промени в ДКЕВР.</p>
<p>Цветан Цветанов заяви, че е напълно закономерно ГЕРБ да предлага и касиране на изборите и да подготвя кабинет. По думите му конституирането на 42-ото народно събрание ще е факт, докато се произнесе Конституционният съд. Партията вече е решила да предложи Цецка Цачева отново за председател на парламента, защото "тя се е доказала като знаеща и можеща".</p>
<p>Когато сме получили подкрепата на над един милион български граждани, най-естественото е да предложим правителство на обществото, защото е важно в период на финансова и политическа криза да поемеш отговорността, подчерта бившият вицепремиер.</p>
<p>Самият той няма да бъде включен в този проектокабинет, като самият той го е поискал категорично пред лидера Бойко Борисов. Цветанов подчерта, че не иска на всяка цена да е на власт. Той отново заяви, че е невинен и сега трябва да изчисти името си от обвиненията и спекулациите, изречени в условията на предизборна кампания</p>
<p>За бившия си колега Симеон Дянков, към когото бяха отправени сериозни критики за финансовото и икономическо състояние на държавата и социалното недоволство на хората преди кабинетът "Борисов" да подаде оставка, Цветанов каза, че никога не е имало остра реакция. "Критиката беше, че на моменти не е имало гъвкавост", обясни той. Отново за бюлетините в Костинброд</p>
<p>Това са толкова несериозни неща, че не знам как е възможно да ги коментираме, каза Цветанов на въпрос за бюлетините в печатницата "Мултипринт" в Костинброд, които бяха открити в деня за размисъл. "Как можеш да излезеш да твърдиш колко са бюлетините, като никой не ги е броил, а и никой не е доказал, че те са годни за експедиция", подчерта той. По думите му се вижда от снимките на прокуратурата, че, например, има бюлетини за външното министерство, а тези бюлетини не могат да стигнат за толкова кратко време до секциите в чужбина.</p>
<p>Зам.-председателят на ГЕРБ каза, че към настоящия момент не може да вини службите за изтеклата информация, тъй като кабинетът "Борисов" и предишният парламент са въвели контрол за специалните разузнавателни средства и службите, така че да няма течове. Все пак той заяви, че не обвинява и прокуратурата за това. "Вярвам в главния прокурор за проверката в Костинброд", каза Цветанов и призова да се провери кой е подал сигнала до ТВ7, така че медията да знае за извършваната акция.</p>
<p>"И трябва да ви кажа, че когато бъдат оповестени тези данни за тази проверка, защото аз съм убеден, че българската прокуратура в лицето на главния прокурор ще направи тази проверка, ако не е направена вече, за да се оповести кой е източникът, който подаде тази информация към тази медия. Защото всички знаем тази медия с кого се свързва, как се свързва и какво изпълнява в цялата тази предизборна кампания", заяви бившият вицепремиер.</p>
<p>Не се е стигало до такова брутално нарушаване на свещеното право на всеки гражданин да има спокойствие в предизборния ден, заяви Цветан Цветанов и допълни, че изтичането на информация за акцията в печатницата в Костинброд е най-голямото престъпление в предизборната кампания. Според него за следващите избори могат да бъдат заложени наказателни промени за много тежки наказания срещу всяка медия, която ги наруши, и да се стига дори до отнемане на лицензи.</p>
<p>Трябва да бъдем истинска правова държава – ако ние самите не направим така, че всичко да бъде законно, това означава, че денят за размисъл и занапред ще бъде много агресивен, подчерта бившият министър на вътрешните работи. Никой не може да твърди какви доказателства ще бъдат събрани само няколко часа след започването на акцията, допълни той за акцията в Костинброд. </p>



 </div></div>
 => nil 

Specs:
ruby-readability (0.5.7)
rails (4.0.0.rc1)
ruby 2.0.0p0 (2013-02-24 revision 39474) [x86_64-darwin12.3.0]

NoMethodError: images method missing

Trying to get this working in a Rails console with Ruby 1.9.3, I keep running into this error when I run:

source = open('http://www.economist.com/blogs/schumpeter/2012/02/greek-exit').read
Readability::Document.new(source).images

NoMethodError: undefined method `images' for Readability::Document

installed gems:
ruby-readability (0.5.0)
mini_magick (3.4)

When I run the readability specs w/ the same gemset, everything passes fine.

Failing tests on Ruby 3.0 (invalid byte sequence in UTF-8)

Failures:

  1) Readability images should show one image, but outside of the best candidate
     Failure/Error: @input = @input.gsub(REGEXES[:replaceBrsRe], '</p><p>').gsub(REGEXES[:replaceFontsRe], '<\1span>')
     
     ArgumentError:
       invalid byte sequence in UTF-8
     # ./lib/readability.rb:48:in `gsub'
     # ./lib/readability.rb:48:in `initialize'
     # ./spec/readability_spec.rb:79:in `new'
     # ./spec/readability_spec.rb:79:in `block (3 levels) in <top (required)>'

  2) Readability the cant_read.html fixture should work on the cant_read.html fixture with some allowed tags
     Failure/Error: @input = @input.gsub(REGEXES[:replaceBrsRe], '</p><p>').gsub(REGEXES[:replaceFontsRe], '<\1span>')
     
     ArgumentError:
       invalid byte sequence in UTF-8
     # ./lib/readability.rb:48:in `gsub'
     # ./lib/readability.rb:48:in `initialize'
     # ./spec/readability_spec.rb:386:in `new'
     # ./spec/readability_spec.rb:386:in `block (3 levels) in <top (required)>'

Finished in 0.87143 seconds (files took 0.28091 seconds to load)
49 examples, 2 failures
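
The failure comes from calling gsub with a regular expression on a string that is tagged as UTF-8 but contains bytes that are not valid UTF-8. Here is a minimal sketch of one way to guard against that, assuming it is acceptable to replace the offending bytes. The helper sanitize_input is hypothetical, and this is not necessarily how the gem resolved the issue.

```ruby
# Hypothetical helper: returns a string that is safe to run regexes on.
# Invalid byte sequences are replaced; valid input is returned untouched.
def sanitize_input(html, replacement = "?")
  return html if html.valid_encoding?

  html.scrub(replacement)
end

broken = "caf\xE9 <br/><br/> ok"   # \xE9 alone is not valid UTF-8
clean  = sanitize_input(broken)
puts clean.gsub(/<br\/?>\s*<br\/?>/, "</p><p>")
```

If the fixture is really in another encoding, such as ISO-8859-1, transcoding it with String#encode would preserve the characters instead of replacing them.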

CI build failures

One of the failures is about do_not_allow not being accessible in an RSpec ExampleGroup. Is that outside an "it"? Well, it fails.

Perhaps, as an update, the matrix of Ruby versions to run in CI could be updated, too!

Problem parsing content from medium.com

I've found that ruby-readability seems to have problems with a couple of blog posts on Medium (there may be more; I've only tested two).

https://medium.com/our-addictions/ae81e19b0289 & https://medium.com/on-product-management/926ab5c39156.

ruby-readability seems to see the various paragraphs (divided by <hr class="section-divider">) as separate sections and then it picks the one with the highest score.

readability --debug https://medium.com/our-addictions/ae81e19b0289
Top 5 candidates:
 Candidate div#.section-inner layout-single-column with score 45.0
 Candidate div#.section-inner layout-single-column with score 42.0
 Candidate div#.section-inner layout-single-column with score 42.0
 Candidate div#.section-inner layout-single-column with score 40.0
 Candidate div#.section-inner layout-single-column with score 37.0
 Best candidate div#.section-inner layout-single-column with score 45.0

It then just shows the text from what looks like the longest paragraph.

Is there likely to be an easy fix for things like this - or some way of working around it for specific sites?
Instapaper used to have a custom parser with user-contributed rules (although it seems to have gone away since it was sold). Have there been any thoughts about doing that sort of thing for ruby-readability at all?

Thanks, Darren.
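
One workaround idea for pages split into several sibling sections: instead of keeping only the top-scoring candidate, keep every candidate whose score is close to the best one and join them in document order. This is a minimal sketch on plain hashes, assuming each candidate carries a positive score, its text and its position in the document. The name merge_close_candidates, the hash keys and the 0.75 ratio are all hypothetical and not part of ruby-readability.

```ruby
# Hypothetical sketch, not part of ruby-readability.
# Keeps every candidate scoring at least `ratio` times the best score
# and joins their text in document order. Assumes positive scores.
def merge_close_candidates(candidates, ratio = 0.75)
  return "" if candidates.empty?

  best = candidates.map { |candidate| candidate[:score] }.max
  candidates
    .select  { |candidate| candidate[:score] >= best * ratio }
    .sort_by { |candidate| candidate[:position] }
    .map     { |candidate| candidate[:text] }
    .join("\n\n")
end

sections = [
  { position: 1, score: 42.0, text: "Second section" },
  { position: 0, score: 45.0, text: "First section" },
  { position: 2, score: 10.0, text: "Footer links" }
]
puts merge_close_candidates(sections)
```

The ratio would need tuning per site; too low a value pulls navigation and footer blocks back into the result.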
