
scrubyt's Introduction

scRUBYt! - Hpricot and Mechanize (or FireWatir) on steroids

A simple to learn and use, yet very powerful web extraction framework written in Ruby. Navigate through the Web, then extract, query, transform and save relevant data from the Web pages of your interest with a concise and easy-to-use DSL.

Do you think that Mechanize and Hpricot are powerful libraries? You’re right, they are, indeed - hats off to their authors: without these libs scRUBYt! could not exist now! I have been wondering whether their functionality could be enhanced still further - so I took these two powerful ingredients, threw in a handful of smart heuristics, wrapped them in a chunky DSL coating and sprinkled the whole thing with a lot of convention over configuration(tm) goodies and … enter scRUBYt! Decide for yourself.

Wait… why do we need one more web-scraping toolkit?

After all, we have Hpricot, and Rubyful-soup, and Mechanize, and scrAPI, and ARIEL and scrapes and … Well, because scRUBYt! is different. It has an entirely different philosophy, underlying techniques, theoretical background, use cases, todo list, real-life scenarios etc. - in short, it should be used in different situations, with different requirements, than the previously mentioned tools.

If you need something quick and/or would like to have maximal control over the scraping process, I recommend Hpricot. Mechanize shines when it comes to interaction with Web pages. Since scRUBYt! operates on XPaths, sometimes you will choose scrAPI instead, because CSS selectors better suit your needs. The list goes on and on, boiling down to the good old mantra: use the right tool for the right job!

I hope there will also be times when you will want to experiment with Pandora’s box and reach for the power of scRUBYt! :-)

Sounds fine - show me an example!

Let’s apply the “show don’t tell” principle. Okay, here we go:

ebay_data = Scrubyt::Extractor.define do
  fetch          'http://www.ebay.com/'
  fill_textfield 'satitle', 'ipod'
  submit
  click_link     'Apple iPod'

  record do
    item_name 'APPLE NEW IPOD MINI 6GB MP3 PLAYER SILVER'
    price     '$71.99'
  end
  next_page 'Next >', :limit => 5
end
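The extractor returns its result as an XML document; a minimal way to dump it (a sketch assuming the scRUBYt! 0.x API, where the result responds to to_xml and yields a REXML element) is:

ebay_data.to_xml.write($stdout, 1)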

output:

<root>

<record>
  <item_name>APPLE IPOD NANO 4GB - PINK - MP3 PLAYER</item_name>
  <price>$149.95</price>
</record>
<record>
  <item_name>APPLE IPOD 30GB BLACK VIDEO/PHOTO/MP3 PLAYER</item_name>
  <price>$172.50</price>
</record>
<record>
  <item_name>NEW APPLE IPOD NANO 4GB PINK MP3 PLAYER</item_name>
  <price>$171.06</price>
</record>
<!-- another 200+ results -->

</root>

This was a relatively beginner-level example (scRUBYt! knows a lot more than this, and there are much more complicated extractors than the one above) - yet it did a lot of things automagically. First of all, it automatically loaded the page of interest (by going to ebay.com, searching for ipods and narrowing down the results by clicking on ‘Apple iPod’), then it extracted all the items that looked like the specified example (which, by the way, also described what the output structure should look like) - on the first 5 result pages. Not so bad for about 10 lines of code, eh?

OK, OK, I believe you, what should I do?

You can find everything you will need at these addresses (or if not, I doubt you will find it elsewhere…). See the next section about installation, and after installing be sure to check out these URLs:

  • www.rubyrailways.com - for some theory; if you would like to take a sneak peek at web scraping in general and/or would like to understand what’s going on under the hood, check out this article about web scraping: www.rubyrailways.com/data-extraction-for-web-20-screen-scraping-in-rubyrails

  • open and closed bugs, files etc.

  • projects.rubyforge.org/scrubyt/files… - a fair amount (and still growing with every release) of examples, showcasing the features of scRUBYt!

  • planned: public extractor repository - hopefully (after people realize how great this package is :-)) scRUBYt! will have a community, and people will upload their extractors for whatever reason

If you still can’t find something here, drop a mail to the guys at scrubyt@/NO-SPAM/scrubyt.org!

How to install

scRUBYt! requires these packages to be installed:

  • Ruby 1.8.4

  • Hpricot 0.5

  • Mechanize 0.6.3

I assume you have Ruby and RubyGems installed. To install WWW::Mechanize 0.6.3 or higher, just run

sudo gem install mechanize

Hpricot 0.5 is just hot off the frying pan - perfect timing, _why! - install it with

sudo gem install hpricot

Once all the dependencies (Mechanize and Hpricot) are up and running, you can install scrubyt with

sudo gem install scrubyt
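A quick way to check that the gem and its dependencies load cleanly (a minimal sketch, run it in irb or as a script):

require 'rubygems'
require 'scrubyt'

puts 'scRUBYt! loaded fine'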

If you encounter any problems, drop a mail to the guys at scrubyt@/NO-SPAM/scrubyt.org!

Author

Copyright © 2006 by Peter Szinek (peter@/NO-SPAM/rubyrailways.com)

Copyright

This library is distributed under the GPL. Please see the LICENSE file.

scrubyt's People

Contributors

m3talsmith, scrubber, sutch


scrubyt's Issues

Gem Requirements: firewatir

I had the understanding that firewatir was not a requirement in order to use scrubyt for normal scrapes - that firewatir was used for ajax scrapes. Yet I get this when I try requiring scrubyt in irb:

irb(main):001:0> require 'rubygems'
=> true
irb(main):002:0> require 'scrubyt'
LoadError: no such file to load -- firewatir
    from /usr/local/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:31:in `gem_original_require'
    from /usr/local/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:31:in `require'
    from /usr/local/lib/ruby/gems/1.8/gems/scrubyt-0.4.06/lib/scrubyt/core/navigation/agents/firewatir.rb:2
    from /usr/local/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:31:in `gem_original_require'
    from /usr/local/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:31:in `require'
    from /usr/local/lib/ruby/gems/1.8/gems/scrubyt-0.4.06/lib/scrubyt.rb:29
    from /usr/local/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:36:in `gem_original_require'
    from /usr/local/lib/ruby/site_ruby/1.8/rubygems/custom_require.rb:36:in `require'
    from (irb):2
irb(main):003:0>

I actually ran into this with the latest master, uninstalled that and installed 0.4.06, and still got the above error. Normally I would go ahead and install the required gem, but I spent an hour last night trying to get firewatir installed correctly on OS X 10.5.6 to no avail. I'd rather just sidestep firewatir for now, since the majority of things I'm scraping are actually XHTML docs anyway.

Edit: This is on Mac OS X 10.6 btw.
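Until this is fixed in the gem, one workaround is simply installing the firewatir gem; another is patching the top-level require so the FireWatir agent is optional. A hypothetical sketch of such a patch to lib/scrubyt.rb (the require path is taken from the traceback above; the surrounding code may differ):

# Hypothetical sketch: load the FireWatir agent only when the gem is present,
# so plain Mechanize-based scrapes work without firewatir installed.
begin
  require 'scrubyt/core/navigation/agents/firewatir'
rescue LoadError
  # firewatir not installed - only the Mechanize agent will be available
end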

What are the dependencies

I just installed the latest versions of hpricot and mechanize and am getting all kinds of undefined method errors. What versions of the gems should I be including? Shouldn't this be specified in the gemspec? You should consider tightening that up.
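For reference, a gemspec sketch that declares the versions named in the "How to install" section above (the version numbers come from that section and are not otherwise verified):

# Sketch of pinning the documented dependencies in scrubyt.gemspec
Gem::Specification.new do |s|
  s.name    = 'scrubyt'
  s.version = '0.4.06'
  s.add_dependency 'hpricot',   '>= 0.5'
  s.add_dependency 'mechanize', '>= 0.6.3'
end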

readme.rdoc links

Links markup is incorrect in the section
"OK, OK, I believe you, what should I do?"

Hi.

I got this error while running my .rb file when scraping the data:

generate_XPath_for_example': undefined method `parent' for nil:NilClass (NoMethodError)

Please help me...
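This error likely means the example string passed to the extractor could not be found on the fetched page, so the XPath generator ends up calling parent on nil. A quick diagnostic sketch (not part of scRUBYt!; the URL and example text are placeholders):

require 'rubygems'
require 'open-uri'
require 'hpricot'

# Check that the example text really occurs in the page you are scraping
doc = Hpricot(open('http://www.example.com/'))          # placeholder URL
puts doc.inner_text.include?('your example text here')  # should print true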

scrubyt on 1.9.1

var/lib/gems/1.9.1/gems/scrubyt-0.4.06/lib/scrubyt.rb:1: warning: variable $KCODE is no longer effective; ignored
no such file to load -- jcode
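Both messages come from 1.8-era setup at the top of lib/scrubyt.rb. A hypothetical compatibility sketch (the actual file differs) would guard that setup by Ruby version:

# jcode was removed and $KCODE became a no-op in Ruby 1.9, so only do this on 1.8
if RUBY_VERSION < '1.9'
  $KCODE = 'u'
  require 'jcode'
end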

Feature Request: UITableView Adoption?

This is really cool. I've been hacking with it and have found nothing else like this in terms of working functionality. One thing I'd recommend, however, is minimizing animation state changes as much as possible, because they can drain the device's battery.

Will this support UITableView adoption in the future? If you're able to document the code better, I can help out if you'd like.

skimr: Confusing error message when using an improperly named block

Trying out the new skimr rewrite, I tried to use one of the old scrubyt examples, not knowing that things aren't done the same way anymore. I spent a long time debugging, just to find out that the nil object error was insignificant and that I was passing a block to a nonexistent function. The traceback and error did not show the name of the function I was calling.

The following code generated the following traceback:

@extractor = Scrubyt::Extractor.new do
  fetch 'http://theninthcut.com/anime'
  record do # Could also be "doesntExist do", for the same results.
    year '2010'
    item_name 'Okamikakushi'
  end
  next_page 'next >>'
end

NoMethodError: You have a nil object when you didn't expect it!
The error occurred while evaluating nil.sub
    from /home/nilbus/git/theninthcut/vendor/plugins/scrubyt-skimr/lib/scrubyt/results_extraction.rb:292:in `clean_xpath'
    from /home/nilbus/git/theninthcut/vendor/plugins/scrubyt-skimr/lib/scrubyt/results_extraction.rb:122:in `evaluate_xpath'
    from /home/nilbus/git/theninthcut/vendor/plugins/scrubyt-skimr/lib/scrubyt/results_extraction.rb:160:in `extract_detail'
    from /home/nilbus/git/theninthcut/vendor/plugins/scrubyt-skimr/lib/scrubyt/results_extraction.rb:159:in `map'
    from /home/nilbus/git/theninthcut/vendor/plugins/scrubyt-skimr/lib/scrubyt/results_extraction.rb:159:in `extract_detail'
    from /home/nilbus/git/theninthcut/vendor/plugins/scrubyt-skimr/lib/scrubyt/extractor.rb:60:in `method_missing'
    from (irb):5:in `boom'
    from /home/nilbus/git/theninthcut/vendor/plugins/scrubyt-skimr/lib/scrubyt/extractor.rb:41:in `instance_eval'
    from /home/nilbus/git/theninthcut/vendor/plugins/scrubyt-skimr/lib/scrubyt/extractor.rb:41:in `initialize'
    from (irb):3:in `new'
    from (irb):3:in `boom'
    from (irb):12
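One possible way to make the failure name the offending block is to have the extractor's method_missing report which DSL call it was dispatching. A hypothetical sketch (extract_detail appears in the traceback above; the real extractor.rb differs):

def method_missing(method_name, *args, &block)
  extract_detail(method_name, *args, &block)
rescue StandardError => e
  # Re-raise with the DSL call name so typos like "doesntExist do ... end" fail loudly
  raise e.class, "while evaluating extractor step `#{method_name}': #{e.message}"
end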

Fill_textfield: undefined method

I get this error when I try to run any example..

c:/ruby/lib/ruby/gems/1.8/gems/scrubyt-0.4.06/lib/scrubyt/core/navigation/agents/mechanize.rb:178:in `fill_textfield': undefined method `[]=' for nil:NilClass (NoMethodError)
    from c:/ruby/lib/ruby/gems/1.8/gems/scrubyt-0.4.06/lib/scrubyt/core/navigation/navigation_actions.rb:27:in `eval'
    from c:/ruby/lib/ruby/gems/1.8/gems/scrubyt-0.4.06/lib/scrubyt/core/navigation/agents/mechanize.rb:178:in `fill_textfield'
    from c:/ruby/lib/ruby/gems/1.8/gems/scrubyt-0.4.06/lib/scrubyt/core/navigation/navigation_actions.rb:27:in `fill_textfield'
    from C:/Documents and Settings/JB/Mina dokument/NetBeansProjects/SBBotbeta1/lib/test1.rb:6
    from c:/ruby/lib/ruby/gems/1.8/gems/scrubyt-0.4.06/lib/scrubyt/core/shared/extractor.rb:75:in `instance_eval'
    from c:/ruby/lib/ruby/gems/1.8/gems/scrubyt-0.4.06/lib/scrubyt/core/shared/extractor.rb:75:in `initialize'
    from c:/ruby/lib/ruby/gems/1.8/gems/scrubyt-0.4.06/lib/scrubyt/core/shared/extractor.rb:32:in `new'
    from c:/ruby/lib/ruby/gems/1.8/gems/scrubyt-0.4.06/lib/scrubyt/core/shared/extractor.rb:32:in `define'

I have all the dependencies installed..
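The nil error suggests that the form field (or the form itself) named in fill_textfield could not be found on the fetched page. A diagnostic sketch (not part of scRUBYt!; the URL is a placeholder) that lists the field names Mechanize actually sees:

require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new                    # Mechanize 0.6.x-era namespace
page  = agent.get('http://www.example.com/')  # placeholder URL
page.forms.each do |form|
  puts form.fields.map { |field| field.name }.inspect
end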

shared_utils.rb:42:in `traverse_for_match': undefined method `each' for nil:NilClass (NoMethodError)

Here is the fix: http://www.hindoogle.com/blog/2009/09/scrubyt-traverse_for_match-nilclass/

Here is the error:

/opt/local/lib/ruby/gems/1.8/gems/scrubyt-0.4.06/lib/scrubyt/utils/shared_utils.rb:42:in `traverse_for_match': undefined method `each' for nil:NilClass (NoMethodError)
    from /opt/local/lib/ruby/gems/1.8/gems/scrubyt-0.4.06/lib/scrubyt/utils/shared_utils.rb:42:in `call'
    from /opt/local/lib/ruby/gems/1.8/gems/scrubyt-0.4.06/lib/scrubyt/utils/shared_utils.rb:42:in `traverse_for_match'
    from /opt/local/lib/ruby/gems/1.8/gems/scrubyt-0.4.06/lib/scrubyt/utils/shared_utils.rb:42:in `each'
    from /opt/local/lib/ruby/gems/1.8/gems/scrubyt-0.4.06/lib/scrubyt/utils/shared_utils.rb:42:in `traverse_for_match'
    from /opt/local/lib/ruby/gems/1.8/gems/scrubyt-0.4.06/lib/scrubyt/utils/shared_utils.rb:44:in `call'
    from /opt/local/lib/ruby/gems/1.8/gems/scrubyt-0.4.06/lib/scrubyt/utils/shared_utils.rb:44:in `traverse_for_match'
    from /opt/local/lib/ruby/gems/1.8/gems/scrubyt-0.4.06/lib/scrubyt/utils/simple_example_lookup.rb:36:in `find_node_from_text'
    from /opt/local/lib/ruby/gems/1.8/gems/scrubyt-0.4.06/lib/scrubyt/core/scraping/filters/tree_filter.rb:67:in `generate_XPath_for_example'
    ... 7 levels...
    from /opt/local/lib/ruby/gems/1.8/gems/scrubyt-0.4.06/lib/scrubyt/core/shared/extractor.rb:75:in `initialize'
    from /opt/local/lib/ruby/gems/1.8/gems/scrubyt-0.4.06/lib/scrubyt/core/shared/extractor.rb:32:in `new'
    from /opt/local/lib/ruby/gems/1.8/gems/scrubyt-0.4.06/lib/scrubyt/core/shared/extractor.rb:32:in `define'
    from my-scraper.rb:11
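The linked post presumably adds a nil guard around the collection that traverse_for_match iterates. A minimal sketch of that idea (hypothetical; the real method in shared_utils.rb has a different signature):

def self.traverse_for_match(node, &block)
  return [] if node.nil?                       # guard: nothing to traverse
  results = block.call(node) ? [node] : []
  (node.children || []).each do |child|        # text nodes may have no children
    results.concat(traverse_for_match(child, &block))
  end
  results
end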
