cantino / ruby-readability
Port of arc90's readability project to Ruby
License: Apache License 2.0
Hello,
I tried to apply readability on a specific layout of The Guardian, which heavily relies on JavaScript but still has most of the text available in the HTML source code:
Readability returned this chunk of HTML:
<div><div> comments <p>Sign in or create your Guardian account to join the discussion. </p> <p>This discussion is closed for comments.</p> <p> We’re doing some maintenance right now. You can still read comments, but please come back later to add your own. </p> <p> Commenting has been disabled for this account (why?) </p> </div></div>
Do you know why the main content is not properly extracted, and whether it is fixable?
When a re-encoded string is processed by Readability, both the newly created object and the object being processed get assigned the wrong encoding.
Consider the following test case, where an ISO-8859-7 (Greek) string is re-encoded to UTF-8:
kbody = open("http://www.kathimerini.gr").read.encode("utf-8", undef: :replace, invalid: :replace, replace: " " )
checking kbody with
kbody.encoding
=> #<Encoding:UTF-8>
but when it is parsed as UTF-8 by Readability, both the old object and the new one have the wrong encoding assigned:
rbody = Readability::Document.new(kbody).content
rbody.encoding
=> #<Encoding:ISO-8859-7>
kbody.encoding
=> #<Encoding:ISO-8859-7>
This seems like a dual bug.
Although one can argue that the re-encoding could go wrong, I see no reason why the original string should get its encoding changed. I would expect it to be treated as immutable, or at least it should be documented that the input string's encoding may change. Perhaps a dup of the string should be manipulated instead of operating on the original input content.
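Until the library copies its input, one caller-side workaround is to hand it a duplicate. The sketch below uses a hypothetical relabel! method as a stand-in for any call that changes a string's encoding in place. It is not ruby-readability code; it only illustrates that a dup keeps the original's encoding label intact.

```ruby
# Hypothetical stand-in for a library call that relabels its input in place.
# (Not ruby-readability code; used only to illustrate the effect of dup.)
def relabel!(str)
  str.force_encoding("ISO-8859-7")
end

original = "\u03BA\u03B1\u03BB\u03B7" # Greek text, encoded as UTF-8
copy = original.dup                   # independent String object

relabel!(copy)

puts original.encoding # the original keeps its UTF-8 label
puts copy.encoding     # only the copy was relabeled
```

With the gem, the same idea would presumably be Readability::Document.new(kbody.dup), leaving kbody untouched.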
When the content method is used, its sanitize step alters the state of the object, so that subsequent calls return different results. This is illustrated by using images. If the intention of content is to have side effects on the state of the object, it should perhaps be named content! instead.
e.g.,
body = open("http://www.pagonis.org").read.encode("utf-8", undef: :replace, invalid: :replace, replace: " " )
rbody = Readability::Document.new(body, :tags => %w[div p img a], :attributes => %w[src href], :remove_empty_nodes => false, :do_not_guess_encoding => true)
rbody.images
=> ["images/pagonis.jpg"]
rbody.content
rbody.images
=> []
It looks like this happens because sanitize operates on the article nodes directly and not on a copy of the article being processed.
I found the reason why ruby-readability does not work well on Chinese websites.
Nokogiri can't parse non-UTF-8 encodings correctly because of a bug in libxml2.
See Nokogiri issue: sparklemotion/nokogiri#215
If we pass this page ( http://www.nownews.com/2012/09/12/339-2853197.htm ) to ruby-readability, we get nothing!
But if we change the encoding to UTF-8 before passing it to ruby-readability, the output is good.
So I am wondering: would it be better to convert the encoding to UTF-8 inside this gem?
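The workaround described above, converting to UTF-8 before parsing, can be sketched with the standard String#encode. The to_utf8 helper and its source_encoding parameter are assumptions for illustration. In practice the source encoding would have to come from the HTTP headers, a meta tag, or a guess. The sample uses ISO-8859-1 bytes only because they are easy to verify by eye.

```ruby
# Relabel raw bytes with their real encoding, then transcode to UTF-8.
# Characters that cannot be converted are replaced instead of raising.
def to_utf8(raw_bytes, source_encoding)
  raw_bytes.dup
           .force_encoding(source_encoding)
           .encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
end

latin1_bytes = "caf\xE9".b # "cafe" with an accented e, as ISO-8859-1 bytes
utf8_text = to_utf8(latin1_bytes, "ISO-8859-1")

puts utf8_text.encoding
puts utf8_text
```

The resulting string could then be handed to Readability::Document.new in place of the raw response body.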
Hi,
I'm fetching HTML pages like the example:
source = open(url).read
puts Readability::Document.new(source, tags: %w[div p img a], attributes: %w[src href]).content
But for every page I try, I get a validity error similar to:
element p: validity error : ID xxxxx already defined
Any ideas?
Thanks
The JS Readability seems to work fine with this kind of code, whereas the Ruby version doesn't.
The following snippet was lifted from AppleInsider
<div class="article-img">
<img src="http://photos.appleinsider.com/v9/images/1x1-white.jpg" width="660" height="317" alt="iPhone Plus" class="lazy" data-original="http://cdn1.appleinsider.com/iphoneplus-130205.jpg"><noscript><img src="http://cdn1.appleinsider.com/iphoneplus-130205.jpg"></noscript>
</div>
To me it seems that Ruby Readability should detect the noscript images and use them in its heuristics (modulo issue #51). Currently it doesn't.
👋 Hi @cantino @olleolleolle
I realise this is an old gem, but I see you've made commits or merged PRs within the past year or so.
The last Rubygems release was in 2014 – do you think it would be feasible to make a new release any time soon?
Thanks
It seems that pages that contain UTF-8 characters still cannot be processed.
For example, using /bin/readability on a popular french website:
readability http://www.developpez.com/actu/35379/Novell-cede-Mono-a-Xamarin-une-mise-a-jour-de-la-plateforme-est-annoncee-pour-l-automne/
It crashes at line 216:
lib/readability.rb:216:in `=~': invalid byte sequence in UTF-8 (ArgumentError)
from /Users/kimious/.rvm/gems/ruby-1.9.2-p180/gems/ruby-readability-0.2.3/lib/readability.rb:216:in `!~'
I can reproduce the bug for a lot of french webpages.
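An "invalid byte sequence in UTF-8" error typically means the string is labeled UTF-8 but contains bytes that are not valid UTF-8. One caller-side mitigation, assuming Ruby 2.1 or later (newer than the 1.9.2 in the report), is String#scrub, which replaces the offending bytes before any regular expression touches the string. This is a sketch of a workaround, not a fix inside the gem.

```ruby
# A string labeled UTF-8 that contains a stray byte (0xE9 is an accented e
# in ISO-8859-1, but on its own it is not valid UTF-8).
broken = "d\xE9veloppez".dup.force_encoding("UTF-8")

puts broken.valid_encoding? # false: regexp matching on it can raise ArgumentError

# String#scrub (Ruby >= 2.1) swaps invalid bytes for a replacement string.
clean = broken.scrub("?")

puts clean.valid_encoding?
puts clean =~ /veloppez/ # regexp matching is safe on the scrubbed copy
```

If the page is really in another encoding, transcoding from that encoding is preferable, since scrub discards the bytes it cannot interpret.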
First, thanks for your work on readability :-)
Just some quick feedback (I'm not a heavy user myself): while upgrading an old setup today, I noticed that raw content is now wrapped in two levels of divs:
1.9.3-p484 :003 > Readability::Document.new("My content").content
=> "<div><div><p>My content</p></div></div>"
while previously (with a two-year-old version) it was returned as:
=> "<div><p>My content</p></div>"
Is this expected? I understand that this specific test case is a bit unrealistic (no tags at all), but I wondered if there could be similar issues with properly formatted HTML.
This url: http://www.businessinsider.com/anonymous-facebook-2011-8
Causes this error:
undefined method `name' for nil:NilClass
/home/howard/.rvm/gems/ruby-1.9.2-p290@global/gems/ruby-readability-0.2.3/lib/readability.rb:115:in `select_best_candidate'
/home/howard/.rvm/gems/ruby-1.9.2-p290@global/gems/ruby-readability-0.2.3/lib/readability.rb:51:in `content'
An img tag inside a div is lost if the div contains no text.
Example: http://railscasts.com/episodes/417-foundation?view=asciicast
I'm having an odd issue where I have some raw HTML from http://missgeeky.com/2011/12/13/christmas-giveaway-threadless-vouchers/ that I'm trying to extract the content from including images. I tried running
Readability::Document.new(source, :tags => %w[img], :attributes => %w[src], :remove_empty_nodes => false).content
But with no luck.
I'm not sure if it's because the img tag autocloses, or because the img is wrapped in "empty" p and a tags.
The documentation on retrieving images should perhaps state all the steps needed to retrieve image links, e.g.:
You can get a list of images in the content area with .images. This feature requires that the mini_magick gem be installed.
rbody = Readability::Document.new(body, :tags => %w[div p img a], :attributes => %w[src href], :remove_empty_nodes => false)
rbody.images
[ERROR] Readability HTMLClean failed: undefined method `name' for nil:NilClass
/usr/local/lib/ruby/gems/1.9.1/gems/activesupport-3.0.5/lib/active_support/whiny_nil.rb:48:in `method_missing'
/usr/local/lib/ruby/gems/1.9.1/gems/ruby-readability-0.2.3/lib/readability.rb:115:in `select_best_candidate'
/usr/local/lib/ruby/gems/1.9.1/gems/ruby-readability-0.2.3/lib/readability.rb:51:in `content'
This happens on readability.content in the example below. It affects about 5% of the HTML we process, which is randomly selected from the web.
It happens in some debug output that is evaluated before the debug method is entered, even though our options[:debug] is nil.
html = html.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => " ")
begin
  readability = Readability::Document.new(html)
  return nil unless readability && readability.html
  content = readability.content
  content = content.gsub(/<.*?>/, ' ') if content # Remove any remaining html < > tags
rescue Exception => ex
  STDERR.puts "[ERROR] Readability HTMLClean failed: #{ex}" if @verbose
  ex.backtrace.each { |line| STDERR.puts(line) }
end
I'm using readability with various HTML responses returned from Faraday.
I've just discovered the awesomeness of the 'guess_html_encoding' gem and am trying to get it to work with readability.
This code snippet works fine:
require 'readability'
require 'guess_html_encoding'
require 'faraday'
url = "http://inews.mingpao.com/htm/INews/20130319/gb21451k.htm"
response = Faraday.get(url)
headers = response.headers
html = response.body
guess = GuessHtmlEncoding.guess(html, headers)
puts "Encoding is #{guess}"
doc = Readability::Document.new(html, :encoding => guess, :remove_empty_nodes => true)
However, shortcutting by passing the headers directly to readability fails:
require 'readability'
require 'faraday'
url = "http://inews.mingpao.com/htm/INews/20130319/gb21451k.htm"
response = Faraday.get(url)
headers = response.headers
html = response.body
doc = Readability::Document.new(html, :html_headers => headers, :remove_empty_nodes => true)
I see this error: NoMethodError: undefined method 'gsub' for #<Faraday::Utils::Headers:0x007fb2fc155de0>
Faraday::Utils::Headers subclasses Hash by the way.
It looks like readability assumes the headers are a string. The README is unclear on what type they should be, but since guess_html_encoding takes a hash, I think readability should too.
Thanks, Darren.
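Until the option accepts a Hash, a caller could flatten the headers into text before passing them on. The headers_to_string helper below is hypothetical, and the "Name: value" line format is an assumption inferred from the gsub error above, not from documented behavior of :html_headers.

```ruby
# Hypothetical helper: flatten a Hash-like headers object into
# "Name: value" text, one header per line. Strings pass through unchanged.
def headers_to_string(headers)
  return headers if headers.is_a?(String)
  headers.map { |name, value| "#{name}: #{value}" }.join("\n")
end

headers = { "content-type" => "text/html; charset=big5", "server" => "nginx" }
puts headers_to_string(headers)
```

The result could then be tried as :html_headers => headers_to_string(response.headers). Whether the gem parses that format correctly would need to be verified.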
Hello,
I tried readability on this page: https://twitter.com/jamescridland/status/555108097803694080 – which returned this HTML content:
<div><div> <p> Reply </p> <p> Retweet Retweeted </p> <p> Favourite 1 Favourited 1 </p> </div></div>
Tweets are tricky, as they might not be long enough to be picked up – is the markup good enough to refine the main content though?
Thanks :-)
I have been experimenting with the gem to retrieve content from Wikipedia pages, but it seems that the H1 tags get lost during the process of text extraction:
source = open('http://en.wikipedia.org/wiki/Frimley_Green_Windmill').read
puts Readability::Document.new(source, tags: ['h1', 'p', 'div']).content
Output:
<div><div>
<p>Frimley Green Windmill is a Grade II listed[1]tower mill at Frimley Green, Surrey, England which has been converted to residential use.</p>
[edit] History
<p>Frimley Green Windmill was first mentioned in 1784 in the ownership of a Mr Terry. It passed to Thomas Lilley in 1792 and then William Collins in 1801. In 1803, the mill passed into the ownership of the Royal Military College, Sandhurst, remaining in the hands of the military until at least 1832 and probably much later than that. The mill was disused by 1870, and the derelict shell was converted to residential use in 1914. [2]</p>
[edit] Description
<div>For an explanation of the various pieces of machinery, see Mill machinery.</div>
<p>Frimley Green Windmill is a four storey brick tower mill. Little is known of the mill, although it had at least one pair of Spring or Patent sails.[2]</p>
[edit] Millers
George Marshall 1792
John Banks 1801
<p>Reference for above:-[2]</p>
[edit] External links
[edit] References
</div></div>
This is missing the only h1 tag on the page,
<h1 id="firstHeading" class="firstHeading">Frimley Green Windmill</h1>
I have experienced the same quirk with all Wikipedia pages. Any idea what could be causing this?
It might be nice if there was a method to retrieve the document title.
Of course this could be easily implemented externally:
doc = Readability::Document.new(html)
title = doc.html.css("title").first
title = title ? title.text : nil
... but it seems like a fairly common thing to do, so perhaps it's worth including?
Hi,
I just got this segfault using the provided example scraping this page http://git.or.cz/course/svn.html#merge. I had success with another page. This is running on Lion.
I was using the version you get when you run gem install ruby-readability. I will try it next with --pre.
I just tried it with --pre and it also segfaulted.
irb(main):011:0> puts Readability::Document.new(source).content
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/ruby-readability-0.5.0/lib/readability.rb:136: [BUG] Segmentation fault
ruby 1.9.3p0 (2011-10-30 revision 33570) [x86_64-darwin11.2.0]
-- Control frame information -----------------------------------------------
c:0030 p:---- s:0119 b:0119 l:000118 d:000118 CFUNC :content
c:0029 p:0044 s:0116 b:0116 l:000115 d:000115 METHOD /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/ruby-readability-0.5.0/lib/readability.rb:136
c:0028 p:0029 s:0110 b:0105 l:001db8 d:000104 BLOCK /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/ruby-readability-0.5.0/lib/readability.rb:164
c:0027 p:---- s:0101 b:0101 l:000100 d:000100 FINISH
c:0026 p:---- s:0099 b:0099 l:000098 d:000098 CFUNC :each
c:0025 p:0038 s:0096 b:0096 l:001db8 d:001db8 METHOD /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/ruby-readability-0.5.0/lib/readability.rb:163
c:0024 p:0088 s:0091 b:0091 l:000090 d:000090 METHOD /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/ruby-readability-0.5.0/lib/readability.rb:64
c:0023 p:0027 s:0083 b:0082 l:000f58 d:000081 EVAL (irb):11
c:0022 p:---- s:0080 b:0080 l:000079 d:000079 FINISH
c:0021 p:---- s:0078 b:0078 l:000077 d:000077 CFUNC :eval
c:0020 p:0028 s:0071 b:0071 l:000070 d:000070 METHOD /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/workspace.rb:80
c:0019 p:0033 s:0064 b:0063 l:000062 d:000062 METHOD /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/context.rb:254
c:0018 p:0031 s:0058 b:0058 l:0018d8 d:000057 BLOCK /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb.rb:159
c:0017 p:0042 s:0050 b:0050 l:000049 d:000049 METHOD /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb.rb:273
c:0016 p:0011 s:0045 b:0045 l:0018d8 d:000044 BLOCK /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb.rb:156
c:0015 p:0144 s:0041 b:0041 l:000024 d:000040 BLOCK /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/ruby-lex.rb:243
c:0014 p:---- s:0038 b:0038 l:000037 d:000037 FINISH
c:0013 p:---- s:0036 b:0036 l:000035 d:000035 CFUNC :loop
c:0012 p:0009 s:0033 b:0033 l:000024 d:000032 BLOCK /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/ruby-lex.rb:229
c:0011 p:---- s:0031 b:0031 l:000030 d:000030 FINISH
c:0010 p:---- s:0029 b:0029 l:000028 d:000028 CFUNC :catch
c:0009 p:0023 s:0025 b:0025 l:000024 d:000024 METHOD /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/ruby-lex.rb:228
c:0008 p:0046 s:0022 b:0022 l:0018d8 d:0018d8 METHOD /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb.rb:155
c:0007 p:0011 s:0019 b:0019 l:0018b8 d:000018 BLOCK /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb.rb:70
c:0006 p:---- s:0017 b:0017 l:000016 d:000016 FINISH
c:0005 p:---- s:0015 b:0015 l:000014 d:000014 CFUNC :catch
c:0004 p:0183 s:0011 b:0011 l:0018b8 d:0018b8 METHOD /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb.rb:69
c:0003 p:0039 s:0006 b:0006 l:001b68 d:002078 EVAL /Users/aaron/.rbenv/versions/1.9.3-p0/bin/irb:12
c:0002 p:---- s:0004 b:0004 l:000003 d:000003 FINISH
c:0001 p:0000 s:0002 b:0002 l:001b68 d:001b68 TOP
-- Ruby level backtrace information ----------------------------------------
/Users/aaron/.rbenv/versions/1.9.3-p0/bin/irb:12:in `<main>'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb.rb:69:in `start'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb.rb:69:in `catch'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb.rb:70:in `block in start'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb.rb:155:in `eval_input'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/ruby-lex.rb:228:in `each_top_level_statement'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/ruby-lex.rb:228:in `catch'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/ruby-lex.rb:229:in `block in each_top_level_statement'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/ruby-lex.rb:229:in `loop'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/ruby-lex.rb:243:in `block (2 levels) in each_top_level_statement'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb.rb:156:in `block in eval_input'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb.rb:273:in `signal_status'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb.rb:159:in `block (2 levels) in eval_input'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/context.rb:254:in `evaluate'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/workspace.rb:80:in `evaluate'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/workspace.rb:80:in `eval'
(irb):11:in `irb_binding'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/ruby-readability-0.5.0/lib/readability.rb:64:in `content'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/ruby-readability-0.5.0/lib/readability.rb:163:in `score_paragraphs'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/ruby-readability-0.5.0/lib/readability.rb:163:in `each'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/ruby-readability-0.5.0/lib/readability.rb:164:in `block in score_paragraphs'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/ruby-readability-0.5.0/lib/readability.rb:136:in `get_link_density'
/Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/ruby-readability-0.5.0/lib/readability.rb:136:in `content'
-- C level backtrace information -------------------------------------------
See Crash Report log file under ~/Library/Logs/CrashReporter or
/Library/Logs/CrashReporter, for the more detail of.
-- Other runtime information -----------------------------------------------
* Loaded script: irb
* Loaded features:
0 enumerator.so
1 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/x86_64-darwin11.2.0/enc/encdb.bundle
2 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/x86_64-darwin11.2.0/enc/trans/transdb.bundle
3 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/rubygems/defaults.rb
4 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/x86_64-darwin11.2.0/rbconfig.rb
5 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/rubygems/deprecate.rb
6 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/rubygems/exceptions.rb
7 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/rubygems/custom_require.rb
8 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/rubygems.rb
9 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/e2mmap.rb
10 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/init.rb
11 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/workspace.rb
12 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/inspector.rb
13 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/context.rb
14 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/extend-command.rb
15 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/output-method.rb
16 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/notifier.rb
17 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/slex.rb
18 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/ruby-token.rb
19 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/ruby-lex.rb
20 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/src_encoding.rb
21 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/magic-file.rb
22 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/x86_64-darwin11.2.0/readline.bundle
23 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/input-method.rb
24 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb/locale.rb
25 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/irb.rb
26 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/rubygems/version.rb
27 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/rubygems/requirement.rb
28 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/rubygems/platform.rb
29 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/rubygems/specification.rb
30 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/rubygems/path_support.rb
31 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/rubygems/dependency.rb
32 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/nokogiri.bundle
33 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/version.rb
34 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/syntax_error.rb
35 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/pp/node.rb
36 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/pp/character_data.rb
37 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/pp.rb
38 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/parse_options.rb
39 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/sax/document.rb
40 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/sax/parser_context.rb
41 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/sax/parser.rb
42 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/sax/push_parser.rb
43 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/sax.rb
44 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/x86_64-darwin11.2.0/stringio.bundle
45 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/node/save_options.rb
46 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/node.rb
47 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/attribute_decl.rb
48 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/element_decl.rb
49 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/element_content.rb
50 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/character_data.rb
51 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/namespace.rb
52 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/attr.rb
53 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/dtd.rb
54 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/cdata.rb
55 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/text.rb
56 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/document.rb
57 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/document_fragment.rb
58 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/processing_instruction.rb
59 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/node_set.rb
60 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/syntax_error.rb
61 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/xpath/syntax_error.rb
62 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/xpath.rb
63 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/xpath_context.rb
64 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/builder.rb
65 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/reader.rb
66 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/notation.rb
67 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/entity_decl.rb
68 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/schema.rb
69 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml/relax_ng.rb
70 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xml.rb
71 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xslt/stylesheet.rb
72 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/xslt.rb
73 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/html/entity_lookup.rb
74 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/html/document.rb
75 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/html/document_fragment.rb
76 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/html/sax/parser_context.rb
77 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/html/sax/parser.rb
78 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/html/element_description.rb
79 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/html/element_description_defaults.rb
80 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/html.rb
81 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/decorators/slop.rb
82 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/css/node.rb
83 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/css/xpath_visitor.rb
84 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/x86_64-darwin11.2.0/racc/cparse.bundle
85 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/racc/parser.rb
86 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/thread.rb
87 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/css/parser_extras.rb
88 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/css/parser.rb
89 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/x86_64-darwin11.2.0/strscan.bundle
90 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/css/tokenizer.rb
91 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/css/syntax_error.rb
92 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/css.rb
93 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri/html/builder.rb
94 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/nokogiri-1.5.0/lib/nokogiri.rb
95 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/guess_html_encoding-0.0.2/lib/guess_html_encoding/version.rb
96 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/guess_html_encoding-0.0.2/lib/guess_html_encoding.rb
97 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/ruby-readability-0.5.0/lib/readability.rb
98 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/gems/1.9.1/gems/ruby-readability-0.5.0/lib/ruby-readability.rb
99 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/uri/common.rb
100 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/uri/generic.rb
101 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/uri/ftp.rb
102 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/uri/http.rb
103 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/uri/https.rb
104 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/uri/ldap.rb
105 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/uri/ldaps.rb
106 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/uri/mailto.rb
107 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/uri.rb
108 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/x86_64-darwin11.2.0/date_core.bundle
109 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/date/format.rb
110 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/date.rb
111 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/time.rb
112 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/open-uri.rb
113 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/x86_64-darwin11.2.0/socket.bundle
114 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/socket.rb
115 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/timeout.rb
116 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/net/protocol.rb
117 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/x86_64-darwin11.2.0/zlib.bundle
118 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/net/http.rb
119 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/x86_64-darwin11.2.0/enc/trans/single_byte.bundle
120 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/delegate.rb
121 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/x86_64-darwin11.2.0/etc.bundle
122 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/fileutils.rb
123 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/tmpdir.rb
124 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/tempfile.rb
125 /Users/aaron/.rbenv/versions/1.9.3-p0/lib/ruby/1.9.1/x86_64-darwin11.2.0/enc/iso_8859_1.bundle
[NOTE]
You may have encountered a bug in the Ruby interpreter or extension libraries.
Bug reports are welcome.
For details: http://www.ruby-lang.org/bugreport.html
[1] 29286 abort irb
How can I extract the date on which the page being retrieved was 'created' and 'updated'?
I tried using the method 'date_published' which is the JSON element that is exposed by the Readability Parser API, but that did not of course work.
I am not exactly sure if there is already a way to do it, but if there isn't, it would be great if we can have a method that does this. However, if there is, this is not exactly an Issue.
What version of readability is this port based on? Is it the last released version (0.5.1) or is it the last trunk version (1.7.1) or is it something else?
Readability pulls its article title from the title tag, right? Well, more often than not, the title tag has a whole lot of other information besides just the title of the article. It usually includes the title of the site itself and sometimes a category.
I know the original readability script just grabbed the title, but I'm wondering if this version of the script can be modified to grab the actual title of the article from the markup. It seems as though the scoring system is set up to exclude the header tag that contains the article title.
Example:
<article>
<div class="article-title">
<h1>Article title</h1>
</div>
<div class="article-content">
<p>
Claritatem insitam; est usus legentis in iis qui facit eorum claritatem.
Investigationes demonstraverunt lectores legere me lius quod ii legunt
saepius. Claritas est etiam processus dynamicus, qui sequitur mutationem
consuetudium lectorum. Mirum est notare quam littera gothica, quam nunc
putamus parum claram, anteposuerit litterarum formas humanitatis per seacula
quarta decima et quinta decima. Eodem modo typi, qui nunc nobis videntur
parum clari, fiant sollemnes in futurum.
</p>
<p>
Nunc varius risus quis nulla. Vivamus vel magna. Ut rutrum. Aenean
dignissim, leo quis faucibus semper, massa est faucibus massa, sit amet
pharetra arcu nunc et sem. Aliquam tempor. Nam lobortis sem non urna.
Pellentesque et urna sit amet leo accumsan volutpat. Nam molestie lobortis
lorem. Quisque eu nulla. Donec id orci in ligula dapibus egestas. Donec sed
velit ac lectus mattis sagittis.
</p>
</div>
</article>
In the above example, readability will always grab the content from .article-content and not the <article> tag itself. What can I do to modify the script to grab the whole article, title and all?
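For the narrower problem of a title tag that also carries the site name or a category, a caller could post-process the title string. The sketch below is a hypothetical heuristic, not something the gem does: it splits on a few common separators and assumes the longest segment is the article title. It does not address pulling the h1 out of the markup.

```ruby
# Hypothetical heuristic: split a <title> string on common separators
# (" | ", " - ", " : ") and keep the longest segment, on the assumption
# that the article title is longer than the site name or category.
TITLE_SEPARATORS = /\s+[|:\-]\s+/

def guess_article_title(page_title)
  segments = page_title.split(TITLE_SEPARATORS).map(&:strip).reject(&:empty?)
  segments.max_by(&:length) # nil when the title is empty
end

puts guess_article_title("Article title | Example Site")
puts guess_article_title("News - Article title goes here - Example")
```

The assumption fails for sites whose name is longer than the headline, so the result is best treated as a guess.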
Sorry @iterationlabs this is not my week it seems :-/
I just observed the following while retrieving a Lifehacker page:
require 'open-uri'
require 'readability'
page = open("http://lifehacker.com/5866449/lifehacker-faceoff-the-best-digital-digests-on-ipad-and-iphone").read
rbody = Readability::Document.new(page).content
as well as when I do
rbody = Readability::Document.new(page, :do_not_guess_encoding => true).content
The result is the following exception due to a nil element.
#<NoMethodError: undefined method `name' for nil:NilClass>
NoMethodError: undefined method `name' for nil:NilClass
from /home/user/.rvm/gems/ruby-1.9.2-p290@classic/gems/ruby-readability-0.5.2/lib/readability.rb:194:in `select_best_candidate'
from /home/user/.rvm/gems/ruby-1.9.2-p290@classic/gems/ruby-readability-0.5.2/lib/readability.rb:44:in `prepare_candidates'
from /home/user/.rvm/gems/ruby-1.9.2-p290@classic/gems/ruby-readability-0.5.2/lib/readability.rb:130:in `content'
Any ideas if this is a bug or am I doing something wrong?
I think I've identified what is causing some images to come through and others not. When the image width and height are not set inline in the source content, the image gets left behind. I think it's coming from this block:
  elements.each do |element|
    next unless element["src"]
    url = element["src"].value
    height = element["height"].nil? ? 0 : element["height"].value.to_i
    width = element["width"].nil? ? 50 : element["width"].value.to_i
    if url =~ /\Ahttps?:\/\//i && (height.zero? || width.zero?)
      image = get_image_size(url)
      next unless image
    else
      image = {:width => width, :height => height}
    end
    image[:format] = File.extname(url).gsub(".", "")
    if tested_images.include?(url)
      debug("Image was tested: #{url}")
      next
    end
    tested_images.push(url)
    if image_meets_criteria?(image)
      list_images << url
    else
      debug("Image discarded: #{url} - height: #{image[:height]} - width: #{image[:width]} - format: #{image[:format]}")
    end
  end
  (list_images.empty? and content != @html) ? images(@html, true) : list_images
end
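To make the suspected behaviour easier to see, here is a small sketch that mirrors only the width/height branch of the block above using plain values. `plan_for_image` is a hypothetical name for illustration and is not part of the gem; the defaults (0 for height, 50 for width) are copied from the quoted code.

```ruby
# Hypothetical helper that reproduces the branch from the quoted block.
# Attribute values are passed as strings or nil, as they would be read
# from the markup.
def plan_for_image(url, width_attr, height_attr)
  height = height_attr.nil? ? 0 : height_attr.to_i
  width  = width_attr.nil? ? 50 : width_attr.to_i
  if url =~ /\Ahttps?:\/\//i && (height.zero? || width.zero?)
    # Absolute URL with an unknown dimension: the real size would be fetched.
    { action: :measure_remote }
  else
    # Otherwise the declared (or default) values are used as-is.
    { action: :use_declared, width: width, height: height }
  end
end

p plan_for_image("http://example.com/a.jpg", nil, nil)
p plan_for_image("/a.jpg", nil, nil)
```

Under this reading, an image with no inline dimensions is only measured when its src is an absolute http(s) URL. A relative src falls through with a height of 0, which would plausibly fail whatever minimum size image_meets_criteria? checks for.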
I wanna get the content of web page 'http://blog.csdn.net/luoshengyang/article/details/17131835', but '403 forbidden' is returned
1.9.3p374 :010 > source = open('http://blog.csdn.net/luoshengyang/article/details/17131835').read
OpenURI::HTTPError: 403 Forbidden
false positive, sorry
Hi;
I have tried this gem with this link; "http://gigaom.com/2011/03/16/are-apis-the-new-black/".
At the beginning of the page there is an image. When I use the Firefox Readability plugin, it shows the image, but from my Rails project with this gem, the image is not shown. It also strips out the links. I don't know if there is an option for this.
Thanks for your work on this neat gem.
Running readability on the HTML from https://100wordstory.org/submit/, I expected more markup to remain than readability leaves intact.
In the screenshot above, the following content is stripped out:
<h1 class="titles">
<a href="https://100wordstory.org/submit/" rel="bookmark" title="SubmitPermanent Link to ">Submit</a>
</h1>
<h2 style="text-align: center;"><a href="https://100wordstory.submittable.com/submit">Submissions are now open through January 9, 2024!</a></h2>
<h2 style="text-align: center;"><a href="https://100wordstory.submittable.com/submit">Submit!</a></h2>
Turning on debug: true doesn't seem to cite why these items are missing:
% readability -d https://100wordstory.org/submit/
/Users/avk/.rvm/gems/ruby-2.7.8@wbm/gems/ruby-readability-0.7.0/bin/readability:31: warning: calling URI.open via Kernel#open is deprecated, call URI.open directly or use URI#open
Removing unlikely candidate - magnific_popup-css
Removing unlikely candidate - nav superfishmenu-100-word-story-menu
Removing unlikely candidate - menu-item menu-item-type-post_type menu-item-object-page menu-item-73menu-item-73
Removing unlikely candidate - menu-item menu-item-type-post_type menu-item-object-page current-menu-item page_item page-item-6 current_page_item menu-item-72menu-item-72
Removing unlikely candidate - menu-item menu-item-type-post_type menu-item-object-page menu-item-83menu-item-83
Removing unlikely candidate - menu-item menu-item-type-post_type menu-item-object-page menu-item-189menu-item-189
Removing unlikely candidate - menu-item menu-item-type-post_type menu-item-object-page menu-item-70menu-item-70
Removing unlikely candidate - header
Removing unlikely candidate - comments
Removing unlikely candidate - commentlist clearfix
Removing unlikely candidate - comment even thread-even depth-1 parentcomment-65
Removing unlikely candidate - comment-author vcard
Removing unlikely candidate - comment-meta commentmetadata
Removing unlikely candidate - comment byuser comment-author-100words bypostauthor odd alt depth-2comment-66
Removing unlikely candidate - comment-author vcard
Removing unlikely candidate - comment-meta commentmetadata
Removing unlikely candidate - comment byuser comment-author-100words bypostauthor even thread-odd thread-alt depth-1comment-57
Removing unlikely candidate - comment-author vcard
Removing unlikely candidate - comment-meta commentmetadata
Removing unlikely candidate - comment odd alt thread-even depth-1comment-56
Removing unlikely candidate - comment-author vcard
Removing unlikely candidate - comment-meta commentmetadata
Removing unlikely candidate - comment even thread-odd thread-alt depth-1comment-52
Removing unlikely candidate - comment-author vcard
Removing unlikely candidate - comment-meta commentmetadata
Removing unlikely candidate - sidebar-wrapper
Removing unlikely candidate - sidebar
Removing unlikely candidate - sidebar-box widget_blockblock-3
Removing unlikely candidate - widget_text sidebar-box widget_custom_htmlcustom_html-2
Removing unlikely candidate - sidebar-box widget_texttext-3
Removing unlikely candidate - sidebar-box widget_texttext-4
Removing unlikely candidate - sidebar-box widget_texttext-7
Removing unlikely candidate - sidebar-box widget_linkslinkcat-10
Removing unlikely candidate - footer
Altering div(#pages.) to p
Altering div(#.) to p
Altering div(#.) to p
Altering div(#.) to p
Altering div(#.) to p
Altering div(#.) to p
Altering div(#.) to p
Altering div(#.) to p
Top 5 candidates:
Candidate div#.post-wrapper with score 51.935052531041066
Candidate div#left-div. with score 16.71186440677966
Best candidate div#.post-wrapper with score 51.935052531041066
Conditionally cleaned div#.addtoany_share_save_container addtoany_content addtoany_content_bottom with weight 25 and content score 0 because it has too short a content length without a single image.
Conditionally cleaned div#.a2a_kit a2a_kit_size_24 addtoany_list with weight 0 and content score 0 because it has too short a content length without a single image.
Conditionally cleaned div#.recentposts with weight 25 and content score 0 because it has too short a content length without a single image.
<div><div>
<p>100 words for your story … no more or no less. Tell a story, pen a slice of your memoir, or try your hand at an essay.</p>
<p>You get 100 words—exactly 100 words—which is both the pain and the pleasure here. It’s short, you tell yourself. You could write 100 words at a bus stop, on your lunch break, in your sleep. But with 100 words you must tell the whole story in its entirety, so it holds together like a perfect little doll house. (Your title is not part of the 100 words.)</p>
<p>Please include a short bio (25 words, max!) with your submission. Also, did we say exactly 100 words? We weren’t kidding! We count words according to Microsoft Word’s word-count tally. Also, make friends with your spell-check, or have a friend proofread your story.</p>
<p>We currently charge a $2 submission fee, the minimum in order to cover the costs of the submission system.</p>
<p> </p>
<p> </p>
</div></div>
Any ideas on how to broaden or include this content?
Multi-page articles are not supported, for example: http://www.extremetech.com/computing/101577-what-happened-to-the-ipad-killers-of-2010
The JavaScript implementation does support them.
Is it possible to get the HTML of the main content area? I would like to preserve the tags present in the main content area (whitelisted tags, mainly the divs, p tags)
I added readability to Gemfile:
gem 'readability'
and tried to bundle. But bundling failed due to the following reason:
An error occured while installing johnson (2.0.0.pre3), and Bundler cannot continue.
After Googling, I learned that this is because johnson doesn't work with Ruby 1.9. But since readability requires johnson to work in Rails, the problem makes it impossible for me to use readability in Rails projects.
Am I wrong, or is there any way to fix the issue? Thank you!
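One thing worth checking here is the gem name. This project is published as ruby-readability, whereas a Gemfile entry for readability pulls in a different gem, which appears to be where the johnson dependency comes from. A Gemfile entry along these lines is a sketch, assuming the default RubyGems source:

```ruby
# Gemfile (fragment)
# The gem is named ruby-readability, but the file to require is
# lib/readability.rb, hence the explicit :require option.
gem 'ruby-readability', :require => 'readability'
```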
When you don't put 'div' as a tag in the initializer like:
require 'rubygems'
require 'readability'
require 'open-uri'
source = open('https://developers.google.com/custom-search/docs/tutorial/creatingcse').read
puts Readability::Document.new(source, tags: []).content
it throws the error:
NullPointerException:
from nokogiri.internals.SaveContextVisitor.isHtmlScript(SaveContextVisitor.java:741)
you can put any tags you want, if you don't add 'div' the error happens
using jruby-9.1.6.0 [ x86_64 ]
Hi,
Any chance of releasing a new version with the commit 8c26e94 that stops crashes when the src element is missing?
The main content of the article is extracted using .content. But how can the main content be extracted with its original CSS formatting preserved?
When using the gem, it will take — from the original source and turn it into —. Is there some way of avoiding this? It makes the output display improperly in web browsers.
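If the symptom is that a non-ASCII character (an em dash, for instance) comes out garbled in the browser, one possible workaround is to re-encode such characters as numeric character references after extraction. This is a sketch only: `escape_non_ascii` is a hypothetical helper, and it assumes the extracted string is valid UTF-8. Serving the page with a correct charset declaration may be the simpler fix.

```ruby
# Hypothetical post-processing step, not part of ruby-readability.
# Replaces every non-ASCII character with a decimal numeric character
# reference so the output no longer depends on the page's charset.
def escape_non_ascii(html)
  html.gsub(/[^[:ascii:]]/) { |char| "&##{char.ord};" }
end

puts escape_non_ascii("before \u2014 after")
```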
There is a possible memory leak with Readability. I have a process with 40 threads, which calls a method which runs Readability on HTML documents. The method is rather simple:
def stripTags(source)
  # Extract the main article HTML first
  content = Readability::Document.new(source).content
  # strip_tags is the Rails helper method for stripping tags
  content = strip_tags(content).gsub("\n", " ").squeeze(" ").strip
  return content
end
I saw my memory usage increase gradually, from < 5% all the way to 80% after a day or so. What I did was try to narrow down the cause, so I commented out the Readability logic/calls, and that resolved the issue: no memory leaks. As soon as I put back the Readability call, the memory leak started again.
To temporarily fix this, I simply monitored my process with God, and had it restart if memory usage got too high, but I'm fairly certain there's a memory leak with the Ruby port of Readability.
Ran into a new issue. I am trying to get it to give me the content but keep a and h1 tags in place. It works fine for something like <h1>foo</h1> or <a href='#'>foo</a>, but the following example just gets lost: <h1><a href='#'>foo</a></h1>
eg.,
...
<div id="page">
<div class="article-img">
<img src="/v9/images/1x1-white.jpg" width="660" height="317" alt="iPhone Plus" class="lazy" data-original="/iphoneplus-130205.jpg"><noscript><img src="/iphoneplus-130205.jpg"></noscript>
</div>
<div class="article-img">
<img src="/plus-130906.jpg">
</div>
...
will only return /v9/images/1x1-white.jpg when images() is called.
However, if images_with_fqdn_uris!("http://bla.com") is called, then subsequent calls to images() will return an array with all image URLs (fully qualified with http://bla.com).
Now the question is, what should the desired behaviour be?
Image detection works properly when the src is a fully qualified URL including the host part. But it does not work when the src contains no host. For example, the images on this page: http://www.heise.de/developer/artikel/Herausforderung-Brownfield-Teil-7-Veraenderungen-iterativ-angehen-1392406.html
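A host-less src can be resolved against the page URL before any image checks run. The sketch below uses only the standard library's URI.join; `absolute_image_url` is a hypothetical helper name, and the example URLs are placeholders rather than paths from the page above.

```ruby
require "uri"

# Hypothetical helper: resolve an <img> src against the URL of the page
# it came from. Already-absolute URLs are returned unchanged; input
# that cannot be parsed yields nil.
def absolute_image_url(page_url, src)
  URI.join(page_url, src).to_s
rescue URI::InvalidURIError
  nil
end

puts absolute_image_url("http://example.com/articles/post.html", "/images/a.png")
puts absolute_image_url("http://example.com/articles/post.html", "b.png")
```

Doing this up front would also let a size check that only handles http(s) URLs see every image, at the cost of needing the page URL to be passed in alongside the HTML.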
I originally thought this was a Nokogiri issue (which I documented with lots of details here), but further testing leads me to believe that it's a ruby-readability issue triggered when anything hits transform_misused_divs_into_paragraphs! in /lib/readability.rb. This doesn't seem to happen with every URL we test, which leads me to believe that something is borking on a non-standard encoding or a weird element in the example page below.
Error
# TO REPRODUCE
url = "http://www.eweek.com/c/a/Apple/Look-Out-Enterprise-Mac-OS-X-to-Get-Journaling/"
text = (Faraday.get url).body #this is a successful request
Readability::Document.new(text).content
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node.rb:830: [BUG] Segmentation fault
ruby 1.9.3p125 (2012-02-16 revision 34643) [x86_64-darwin11.2.0]
-- Control frame information -----------------------------------------------
c:0050 p:---- s:0189 b:0189 l:000188 d:000188 CFUNC :native_write_to
c:0049 p:0250 s:0182 b:0182 l:000181 d:000181 METHOD /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node.rb:830
c:0048 p:0183 s:0172 b:0172 l:000171 d:000171 METHOD /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node.rb:752
c:0047 p:0149 s:0163 b:0163 l:000162 d:000162 METHOD /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node.rb:769
c:0046 p:0014 s:0159 b:0159 l:000410 d:000158 BLOCK /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node.rb:640
c:0045 p:---- s:0156 b:0156 l:000155 d:000155 FINISH
c:0044 p:---- s:0154 b:0154 l:0014c0 d:000153 IFUNC
c:0043 p:0015 s:0152 b:0151 l:000141 d:000150 BLOCK /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node_set.rb:239
c:0042 p:---- s:0148 b:0148 l:000147 d:000147 FINISH
c:0041 p:---- s:0146 b:0146 l:000145 d:000145 CFUNC :upto
c:0040 p:0023 s:0142 b:0142 l:000141 d:000141 METHOD /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node_set.rb:238
c:0039 p:---- s:0138 b:0138 l:000137 d:000137 FINISH
c:0038 p:---- s:0136 b:0136 l:0014c0 d:0014c0 CFUNC :map
c:0037 p:0017 s:0133 b:0133 l:000410 d:000410 METHOD /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node.rb:640
c:0036 p:0034 s:0129 b:0129 l:000468 d:000128 BLOCK /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/bundler/gems/ruby-readability-574c54e8c02c/lib/readability.rb:303
c:0035 p:0015 s:0126 b:0126 l:000116 d:000125 BLOCK /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node_set.rb:239
c:0034 p:---- s:0123 b:0123 l:000122 d:000122 FINISH
c:0033 p:---- s:0121 b:0121 l:000120 d:000120 CFUNC :upto
c:0032 p:0023 s:0117 b:0117 l:000116 d:000116 METHOD /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node_set.rb:238
c:0031 p:0021 s:0113 b:0113 l:000468 d:000468 METHOD /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/bundler/gems/ruby-readability-574c54e8c02c/lib/readability.rb:300
c:0030 p:0046 s:0110 b:0110 l:000ba8 d:000ba8 METHOD /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/bundler/gems/ruby-readability-574c54e8c02c/lib/readability.rb:41
c:0029 p:0030 s:0107 b:0107 l:000106 d:000106 METHOD /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/bundler/gems/ruby-readability-574c54e8c02c/lib/readability.rb:137
c:0028 p:0026 s:0101 b:0101 l:000c68 d:000100 EVAL (irb):5
c:0027 p:---- s:0099 b:0099 l:000098 d:000098 FINISH
c:0026 p:---- s:0097 b:0097 l:000096 d:000096 CFUNC :eval
c:0025 p:0028 s:0090 b:0090 l:000089 d:000089 METHOD /Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb/workspace.rb:80
c:0024 p:0033 s:0083 b:0082 l:000081 d:000081 METHOD /Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb/context.rb:254
c:0023 p:0031 s:0077 b:0077 l:000a98 d:000076 BLOCK /Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb.rb:159
c:0022 p:0042 s:0069 b:0069 l:000068 d:000068 METHOD /Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb.rb:273
c:0021 p:0011 s:0064 b:0064 l:000a98 d:000063 BLOCK /Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb.rb:156
c:0020 p:0144 s:0060 b:0060 l:000043 d:000059 BLOCK /Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb/ruby-lex.rb:243
c:0019 p:---- s:0057 b:0057 l:000056 d:000056 FINISH
c:0018 p:---- s:0055 b:0055 l:000054 d:000054 CFUNC :loop
c:0017 p:0009 s:0052 b:0052 l:000043 d:000051 BLOCK /Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb/ruby-lex.rb:229
c:0016 p:---- s:0050 b:0050 l:000049 d:000049 FINISH
c:0015 p:---- s:0048 b:0048 l:000047 d:000047 CFUNC :catch
c:0014 p:0023 s:0044 b:0044 l:000043 d:000043 METHOD /Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb/ruby-lex.rb:228
c:0013 p:0046 s:0041 b:0041 l:000a98 d:000a98 METHOD /Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb.rb:155
c:0012 p:0011 s:0038 b:0038 l:002018 d:000037 BLOCK /Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb.rb:70
c:0011 p:---- s:0036 b:0036 l:000035 d:000035 FINISH
c:0010 p:---- s:0034 b:0034 l:000033 d:000033 CFUNC :catch
c:0009 p:0183 s:0030 b:0030 l:002018 d:002018 METHOD /Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb.rb:69
c:0008 p:0222 s:0025 b:0025 l:002040 d:002040 METHOD /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/railties-3.1.4/lib/rails/commands/console.rb:45
c:0007 p:0019 s:0021 b:0021 l:000020 d:000020 METHOD /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/railties-3.1.4/lib/rails/commands/console.rb:8
c:0006 p:0615 s:0017 b:0017 l:000016 d:000016 TOP /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/railties-3.1.4/lib/rails/commands.rb:40
c:0005 p:---- s:0012 b:0012 l:000011 d:000011 FINISH
c:0004 p:---- s:0010 b:0010 l:000009 d:000009 CFUNC :require
c:0003 p:0061 s:0006 b:0006 l:001228 d:001458 EVAL script/rails:6
c:0002 p:---- s:0004 b:0004 l:000003 d:000003 FINISH
c:0001 p:0000 s:0002 b:0002 l:001228 d:001228 TOP
-- Ruby level backtrace information ----------------------------------------
script/rails:6:in `<main>'
script/rails:6:in `require'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/railties-3.1.4/lib/rails/commands.rb:40:in `<top (required)>'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/railties-3.1.4/lib/rails/commands/console.rb:8:in `start'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/railties-3.1.4/lib/rails/commands/console.rb:45:in `start'
/Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb.rb:69:in `start'
/Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb.rb:69:in `catch'
/Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb.rb:70:in `block in start'
/Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb.rb:155:in `eval_input'
/Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb/ruby-lex.rb:228:in `each_top_level_statement'
/Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb/ruby-lex.rb:228:in `catch'
/Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb/ruby-lex.rb:229:in `block in each_top_level_statement'
/Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb/ruby-lex.rb:229:in `loop'
/Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb/ruby-lex.rb:243:in `block (2 levels) in each_top_level_statement'
/Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb.rb:156:in `block in eval_input'
/Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb.rb:273:in `signal_status'
/Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb.rb:159:in `block (2 levels) in eval_input'
/Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb/context.rb:254:in `evaluate'
/Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb/workspace.rb:80:in `evaluate'
/Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/irb/workspace.rb:80:in `eval'
(irb):5:in `irb_binding'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/bundler/gems/ruby-readability-574c54e8c02c/lib/readability.rb:137:in `content'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/bundler/gems/ruby-readability-574c54e8c02c/lib/readability.rb:41:in `prepare_candidates'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/bundler/gems/ruby-readability-574c54e8c02c/lib/readability.rb:300:in `transform_misused_divs_into_paragraphs!'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node_set.rb:238:in `each'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node_set.rb:238:in `upto'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node_set.rb:239:in `block in each'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/bundler/gems/ruby-readability-574c54e8c02c/lib/readability.rb:303:in `block in transform_misused_divs_into_paragraphs!'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node.rb:640:in `inner_html'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node.rb:640:in `map'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node_set.rb:238:in `each'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node_set.rb:238:in `upto'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node_set.rb:239:in `block in each'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node.rb:640:in `block in inner_html'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node.rb:769:in `to_html'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node.rb:752:in `serialize'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node.rb:830:in `write_to'
/Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/gems/nokogiri-1.5.5/lib/nokogiri/xml/node.rb:830:in `native_write_to'
Environment
ruby 1.9.3p125 (2012-02-16 revision 34643) [x86_64-darwin11.2.0]
ruby-readability (0.5.3)
nokogiri (1.5.5, 1.5.2)
RubyGems Environment:
- RUBYGEMS VERSION: 1.8.19
- RUBY VERSION: 1.9.3 (2012-02-16 patchlevel 125) [x86_64-darwin11.2.0]
- INSTALLATION DIRECTORY: /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125
- RUBY EXECUTABLE: /Users/danbarrett/.rvm/rubies/ruby-1.9.3-p125/bin/ruby
- EXECUTABLE DIRECTORY: /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125/bin
- RUBYGEMS PLATFORMS:
- ruby
- x86_64-darwin-11
- GEM PATHS:
- /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125
- /Users/danbarrett/.rvm/gems/ruby-1.9.3-p125@global
- GEM CONFIGURATION:
- :update_sources => true
- :verbose => true
- :benchmark => false
- :backtrace => false
- :bulk_threshold => 1000
- REMOTE SOURCES:
- http://rubygems.org/
Hi,
I stumbled upon an interesting problem.
I think this shouldn't really happen.
source = open("http://www.dnevnik.bg/izbori2013/2013/05/20/2064070_cvetan_cvetanov_predvijda_novi_predsrochni_izbori_sled/").read
Readability::Document.new(source, :tags => %w[p img a strong em], :attributes => %w[src href], :remove_empty_nodes => true).content
RuntimeError: Could not reparent node
Running with DIV tags produces the expected result though.
Readability::Document.new(source, :tags => %w[div p img a strong em], :attributes => %w[src href], :remove_empty_nodes => true).content
In case the source of the page changes, here is a preview of the output:
puts Readability::Document.new(source).content
<div><div>
Заместник-председателят на ГЕРБ Цветан Цветанов изрази увереност, че след няколко месеца ще има нови предсрочни парламентарни избори. В сутрешния блок на БНТ бившият вицепремиер и вътрешен министър заяви, че е поискал изрично от лидера на партията Бойко Борисов да не бъде включван в проектокабинет, докато не бъде изчистено името му.<p>Цветанов подчерта, че оттук нататък ще има много спекулации за това, че депутати от ГЕРБ ще подкрепят правителство на Пламен Орешарски. "Към настоящия момент всички избрани народни представители от ГЕРБ ще бъдат единни", допълни обаче той.</p>
<p>"Важно е да видим тази тройна коалиция, която ще бъде прикрита пред едно програмно правителство. Много е важно за българските граждани да чуят лъжите, които се казаха по време на предизборната кампания от трите други партии в новия парламент – БСП, ДПС и "Атака", каза още зам.-лидерът на ГЕРБ, като предрече, че след няколко месеца ще има нови предсрочни избори.</p>
<p>По думите му БСП, ДПС и "Атака" имат много различия, като например БСП подкрепя изграждането на АЕЦ "Белене", а според ДПС това не може да се случи. ДПС пък настоява за плосък данък, "Атака" иска сега и веднага 1000 лева заплата, а 500 лева пенсия.</p>
<p>От друга страна, ГЕРБ е показала какво е свършила през своя мандат и постави основните акценти на партията като настояване за плосък данък от 10%, бързо възстановяване на ДДС, законодателни промени в ДКЕВР.</p>
<p>Цветан Цветанов заяви, че е напълно закономерно ГЕРБ да предлага и касиране на изборите и да подготвя кабинет. По думите му конституирането на 42-ото народно събрание ще е факт, докато се произнесе Конституционният съд. Партията вече е решила да предложи Цецка Цачева отново за председател на парламента, защото "тя се е доказала като знаеща и можеща".</p>
<p>Когато сме получили подкрепата на над един милион български граждани, най-естественото е да предложим правителство на обществото, защото е важно в период на финансова и политическа криза да поемеш отговорността, подчерта бившият вицепремиер.</p>
<p>Самият той няма да бъде включен в този проектокабинет, като самият той го е поискал категорично пред лидера Бойко Борисов. Цветанов подчерта, че не иска на всяка цена да е на власт. Той отново заяви, че е невинен и сега трябва да изчисти името си от обвиненията и спекулациите, изречени в условията на предизборна кампания</p>
<p>За бившия си колега Симеон Дянков, към когото бяха отправени сериозни критики за финансовото и икономическо състояние на държавата и социалното недоволство на хората преди кабинетът "Борисов" да подаде оставка, Цветанов каза, че никога не е имало остра реакция. "Критиката беше, че на моменти не е имало гъвкавост", обясни той. Отново за бюлетините в Костинброд</p>
<p>Това са толкова несериозни неща, че не знам как е възможно да ги коментираме, каза Цветанов на въпрос за бюлетините в печатницата "Мултипринт" в Костинброд, които бяха открити в деня за размисъл. "Как можеш да излезеш да твърдиш колко са бюлетините, като никой не ги е броил, а и никой не е доказал, че те са годни за експедиция", подчерта той. По думите му се вижда от снимките на прокуратурата, че, например, има бюлетини за външното министерство, а тези бюлетини не могат да стигнат за толкова кратко време до секциите в чужбина.</p>
<p>Зам.-председателят на ГЕРБ каза, че към настоящия момент не може да вини службите за изтеклата информация, тъй като кабинетът "Борисов" и предишният парламент са въвели контрол за специалните разузнавателни средства и службите, така че да няма течове. Все пак той заяви, че не обвинява и прокуратурата за това. "Вярвам в главния прокурор за проверката в Костинброд", каза Цветанов и призова да се провери кой е подал сигнала до ТВ7, така че медията да знае за извършваната акция.</p>
<p>"И трябва да ви кажа, че когато бъдат оповестени тези данни за тази проверка, защото аз съм убеден, че българската прокуратура в лицето на главния прокурор ще направи тази проверка, ако не е направена вече, за да се оповести кой е източникът, който подаде тази информация към тази медия. Защото всички знаем тази медия с кого се свързва, как се свързва и какво изпълнява в цялата тази предизборна кампания", заяви бившият вицепремиер.</p>
<p>Не се е стигало до такова брутално нарушаване на свещеното право на всеки гражданин да има спокойствие в предизборния ден, заяви Цветан Цветанов и допълни, че изтичането на информация за акцията в печатницата в Костинброд е най-голямото престъпление в предизборната кампания. Според него за следващите избори могат да бъдат заложени наказателни промени за много тежки наказания срещу всяка медия, която ги наруши, и да се стига дори до отнемане на лицензи.</p>
<p>Трябва да бъдем истинска правова държава – ако ние самите не направим така, че всичко да бъде законно, това означава, че денят за размисъл и занапред ще бъде много агресивен, подчерта бившият министър на вътрешните работи. Никой не може да твърди какви доказателства ще бъдат събрани само няколко часа след започването на акцията, допълни той за акцията в Костинброд. </p>
</div></div>
=> nil
Specs:
ruby-readability (0.5.7)
rails (4.0.0.rc1)
ruby 2.0.0p0 (2013-02-24 revision 39474) [x86_64-darwin12.3.0]
I wanted to display the content of the article as it is. That is, with image. Can this be implemented?
Trying to get this working in a Rails console with Ruby 1.9.3, and I keep running into this error when I run:
source = open('http://www.economist.com/blogs/schumpeter/2012/02/greek-exit').read
Readability::Document.new(source).images
NoMethodError: undefined method `images' for Readability::Document
installed gems:
ruby-readability (0.5.0)
mini_magick (3.4)
When I run the readability specs w/ the same gemset, everything passes fine.
Failures:
1) Readability images should show one image, but outside of the best candidate
Failure/Error: @input = @input.gsub(REGEXES[:replaceBrsRe], '</p><p>').gsub(REGEXES[:replaceFontsRe], '<\1span>')
ArgumentError:
invalid byte sequence in UTF-8
# ./lib/readability.rb:48:in `gsub'
# ./lib/readability.rb:48:in `initialize'
# ./spec/readability_spec.rb:79:in `new'
# ./spec/readability_spec.rb:79:in `block (3 levels) in <top (required)>'
2) Readability the cant_read.html fixture should work on the cant_read.html fixture with some allowed tags
Failure/Error: @input = @input.gsub(REGEXES[:replaceBrsRe], '</p><p>').gsub(REGEXES[:replaceFontsRe], '<\1span>')
ArgumentError:
invalid byte sequence in UTF-8
# ./lib/readability.rb:48:in `gsub'
# ./lib/readability.rb:48:in `initialize'
# ./spec/readability_spec.rb:386:in `new'
# ./spec/readability_spec.rb:386:in `block (3 levels) in <top (required)>'
Finished in 0.87143 seconds (files took 0.28091 seconds to load)
49 examples, 2 failures
One of the failures is about do_not_allow not being accessible in an RSpec ExampleGroup. Is that outside an "it"? Well, it fails.
Perhaps, as an update, the matrix of Ruby versions to run in CI could be updated, too!
In clean_conditionally(node, candidates, selector) the constant TEXT_LENGTH_THRESHOLD is used without ever being declared.
Do we know what the intention was here?
I've found that ruby-readability seems to have problems with a couple of blog posts on Medium (it may be more, I've only tested two).
https://medium.com/our-addictions/ae81e19b0289 & https://medium.com/on-product-management/926ab5c39156.
ruby-readability seems to see the various paragraphs (divided by <hr class="section-divider">) as separate sections and then it picks the one with the highest score.
readability --debug https://medium.com/our-addictions/ae81e19b0289
Top 5 candidates:
Candidate div#.section-inner layout-single-column with score 45.0
Candidate div#.section-inner layout-single-column with score 42.0
Candidate div#.section-inner layout-single-column with score 42.0
Candidate div#.section-inner layout-single-column with score 40.0
Candidate div#.section-inner layout-single-column with score 37.0
Best candidate div#.section-inner layout-single-column with score 45.0
It then just shows the text from what looks like the longest paragraph.
Is there likely to be an easy fix for things like this - or some way of working around it for specific sites?
Instapaper used to have a custom parser with user-contributed rules (although it seems to have gone away since it was sold). Have there been any thoughts about doing that sort of thing for ruby-readability at all?
Thanks, Darren.
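One way to think about a workaround is to keep every candidate whose score is close to the best one, instead of only the single winner. The sketch below works on a plain list of scores like those in the debug output above. `candidates_to_merge` is a hypothetical helper, and the ratio of 0.2 and floor of 10 are assumptions chosen for illustration, not values confirmed from the gem.

```ruby
# Hypothetical selection rule, operating on bare scores rather than on
# the gem's internal candidate objects.
# Keeps every score at or above max(floor, best * ratio).
def candidates_to_merge(scores, ratio: 0.2, floor: 10)
  return [] if scores.empty?

  threshold = [floor, scores.max * ratio].max
  scores.select { |score| score >= threshold }
end

p candidates_to_merge([45.0, 42.0, 42.0, 40.0, 37.0])
```

With the five Medium sections listed above, every section clears the threshold, so all of them would be kept and could be concatenated in document order.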