yomu's Introduction

Yomu 読む

Yomu is a library for extracting text and metadata from files and documents using the Apache Tika content analysis toolkit.

Here are some of the formats supported:

  • Microsoft Office OLE 2 and Office Open XML Formats (.doc, .docx, .xls, .xlsx, .ppt, .pptx)
  • OpenOffice.org OpenDocument Formats (.odt, .ods, .odp)
  • Apple iWork Formats
  • Rich Text Format (.rtf)
  • Portable Document Format (.pdf)

For the complete list of supported formats, please visit the Apache Tika Supported Document Formats page.

Usage

Text, metadata and MIME type information can be extracted by calling Yomu.read directly:

require 'yomu'

data = File.read 'sample.pages'
text = Yomu.read :text, data
metadata = Yomu.read :metadata, data
mimetype = Yomu.read :mimetype, data

Reading text from a given filename

Create a new instance of Yomu and pass a filename.

yomu = Yomu.new 'sample.pages'
text = yomu.text

Reading text from a given URL

This is useful for reading remote files, like documents hosted on Amazon S3.

yomu = Yomu.new 'http://svn.apache.org/repos/asf/poi/trunk/test-data/document/sample.docx'
text = yomu.text

Reading text from a stream

Yomu can also read from a stream or any object that responds to read, including file uploads from Ruby on Rails or Sinatra.

post '/:name/:filename' do
  yomu = Yomu.new params[:data][:tempfile]
  yomu.text
end

Reading metadata

Metadata is returned as a hash.

yomu = Yomu.new 'sample.pages'
yomu.metadata['Content-Type'] #=> "application/vnd.apple.pages"

Reading MIME types

MIME type is returned as a MIME::Type object.

yomu = Yomu.new 'sample.docx'
yomu.mimetype.content_type #=> "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
yomu.mimetype.extensions #=> ['docx']

Installation and Dependencies

Java Runtime

Yomu packages the Apache Tika application jar and requires a working JRE to run it.

Gem

Add this line to your application's Gemfile:

gem 'yomu'

And then execute:

$ bundle

Or install it yourself as:

$ gem install yomu

Contributing

  1. Fork it
  2. Create your feature branch ( git checkout -b my-new-feature )
  3. Create tests and make them pass ( rake test )
  4. Commit your changes ( git commit -am 'Added some feature' )
  5. Push to the branch ( git push origin my-new-feature )
  6. Create a new Pull Request

yomu's People

Contributors

antonpaisov, cernyjakub, erol, fliiiix, jipis, rogeriochaves, stephan-nordnes-eriksen

yomu's Issues

The supplied password does not match either the owner or user password in the document.

How do I supply a password for a password protected PDF?

I am getting this error:

*** Reading ./data/attachments/dasd.pdf
INFO - Document is encrypted
[Fatal Error] :1:1: Content is not allowed in prolog.
Exception in thread "main" org.apache.tika.exception.TikaException: Unable to extract PDF content
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:146)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:159)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:121)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:143)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:407)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:116)
Caused by: org.apache.pdfbox.exceptions.WrappedIOException: Error decrypting document, details: 
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:341)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:130)
    ... 7 more
Caused by: org.apache.pdfbox.exceptions.CryptographyException: Error: The supplied password does not match either the owner or user password in the document.
    at org.apache.pdfbox.pdmodel.encryption.StandardSecurityHandler.prepareForDecryption(StandardSecurityHandler.java:264)
    at org.apache.pdfbox.pdmodel.encryption.StandardSecurityHandler.decryptDocument(StandardSecurityHandler.java:156)
    at org.apache.pdfbox.pdmodel.PDDocument.openProtection(PDDocument.java:1597)
    at org.apache.pdfbox.pdmodel.PDDocument.decrypt(PDDocument.java:943)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:337)
    ... 8 more

Unable to gem install yomu

When I run gem install yomu, I get the following error.

ERROR:  While executing gem ... (NoMethodError)
    undefined method `size' for nil:NilClass

I'm able to install yomu from source without any issues.

I'm running OSX 10.9.3 with the following

ruby --version
ruby 2.0.0p451 (2014-02-24 revision 45167) [universal.x86_64-darwin13]

gem --version 
2.0.14

bundle --version
Bundler version 1.5.2

Thread crash when parsing some special file

This file has a .doc extension but is actually an HTML file. When processing it, Yomu runs for a very long time without any response until I kill the thread.
Renaming the file to *.html makes no difference, so the file itself may be unusual.
When I parse it with Tika directly, the text is extracted correctly.

fake_doc_but_htm.doc.zip

undefined method `[]' for false:FalseClass in yomu.rb:119

I am getting this error from time to time (running many files through this). Have not caught the guilty file yet.

"path_to_gem/yomu-0.1.10/lib/yomu.rb:119:in `method_missing'", 
"path_to_gem/yomu-0.1.10/lib/yomu.rb:119:in `mimetype'"

The line in my code which causes this:

#running many files through this procedure.
yomu = Yomu.new(file_path)
....
case yomu.mimetype

Using Ruby 1.8.7, if it matters.
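
One plausible explanation, offered as an assumption rather than a diagnosis: if the metadata text coming back from Tika is empty and is then parsed as YAML, the parser returns false (or nil on newer Psych versions) instead of a hash, and the next ['Content-Type'] lookup fails in exactly this way. A minimal defensive sketch follows; parse_metadata is a hypothetical helper, not part of Yomu, and the gem's real parsing code may differ.

```ruby
require 'yaml'

# Hypothetical helper (not part of Yomu): parse metadata text and always
# hand back a Hash, even when the input is empty or is not a mapping.
def parse_metadata(raw)
  # For an empty string YAML.load returns false or nil, depending on the
  # Psych version, so the result is checked before it is used.
  parsed = YAML.load(raw.to_s)
  parsed.is_a?(Hash) ? parsed : {}
end

metadata = parse_metadata('')
content_type = metadata['Content-Type'] # nil instead of a NoMethodError
```

With a guard like this, a file that produces no metadata yields a nil content type that the caller can test for, rather than an exception deep inside the library.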

Making it work on Heroku

First of all: Nice Gem! (although it seems like it isn't maintained anymore?)
This is just in case someone wants to use it on Heroku (e.g. in a Rails project).

You have to add the JVM Buildpack:
heroku buildpacks:add heroku/jvm --index 1 -a YOUR_APP_NAME

Otherwise you will get errors like: "No such file or directory - java".

Hope that helps ;)
Best Regards

No popen Directory

Hi, I am getting the following error while trying to read a file after installing the gem.

C:/Ruby23-x64/lib/ruby/gems/2.3.0/gems/yomu-0.2.4/lib/yomu.rb:51:in `popen': No such file or directory - java -Djava.awt.headless=true -jar C:/Ruby23-x64/lib/ruby/gems/2.3.0/gems/yomu-0.2.4/jar/tika-app-1.11.jar -t (Errno::ENOENT)

Failed to open TCP connection to :80

While parsing docx, pptx, xlsx, and pdf files, the following error is displayed:
[image]

The code is pretty standard and shouldn't be the source of the issue:
[image]

I even have a working JRE:
[image]

Any ideas on what might be causing this issue, or how to fix?

Outdated mime-types dependency

mime-types dependency was defined in 26686f8 using the ~>.

However, the version hasn't been updated since. This causes Bundler to complain if you have gems installed that depend on more recent versions.

This is the case with Mechanize, for instance, which depends on 'mime-types ~> 2.0' and is a perfect companion to Yomu 😄

RubyNLP

Dear Erol,

I've recently added your project to our RubyNLP list: https://github.com/arbox/nlp-with-ruby

I wonder if you want to participate in the Ruby for NLP network. You could do this in one simple step by adding the rubynlp topic to your GitHub repository. You may also want to spread the word on Twitter or other media :)

Thank you for the project!

Errno::EPIPE: Broken pipe

Sometimes I get a broken pipe error while trying to fetch data from a file. Do you know why I am getting this?

[GEM_ROOT]/gems/yomu-0.1.9/lib/yomu.rb, line 28

NameError (uninitialized constant #<Class:0x0000000804d528>::Yomu):

Hi, I need some help here. I got Yomu working on my development environment without any issue, but once I deployed my code to production, I'm getting the following error:

NameError (uninitialized constant #<Class:0x0000000804d528>::Yomu):

app/models/candidate.rb:136:in `block (2 levels) in <class:Candidate>'

Line 136 is: Yomu.read :text, latest_resume.data

I'm pretty clueless what that error means. The production server has JRE installed:

$ java -version
java version "1.8.0_65"
Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.65-b01, mixed mode)

And I thought Yomu is shipping Tika. Or do I need to install it manually on production?

Thanks

Could this gem be used under Linux OS?

I used the example you posted under Linux, but it failed to read the file.

code:
require 'yomu'
yomu = Yomu.new '111.doc'
text = yomu.text

error info:
/var/lib/gems/1.9.1/gems/yomu-0.2.0/lib/yomu.rb:29:in `popen': No such file or directory - java -Djava.awt.headless=true -jar /var/lib/gems/1.9.1/gems/yomu-0.2.0/jar/tika-app-1.5.jar -t (Errno::ENOENT)
    from /var/lib/gems/1.9.1/gems/yomu-0.2.0/lib/yomu.rb:29:in `read'
    from /var/lib/gems/1.9.1/gems/yomu-0.2.0/lib/yomu.rb:85:in `text'
    from test.rb:4:in `<main>'

The code and file '111.doc' are in same folder.
Please tell me how to solve this problem, thank you in advance.
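
The "No such file or directory - java" part of the message says that the java executable could not be started, which matches the README's requirement of a working JRE. A small pre-flight check can surface this before any document is processed. This is only a sketch: executable_on_path? is a hypothetical helper, not something Yomu provides.

```ruby
# Hypothetical helper (not part of Yomu): returns true when an executable
# called `name` exists in one of the directories listed in `path_env`.
def executable_on_path?(name, path_env = ENV.fetch('PATH', ''))
  # On Windows, PATHEXT lists executable suffixes such as .EXE; the bare
  # name is always tried as well.
  suffixes = [''] + ENV.fetch('PATHEXT', '').split(';')
  path_env.split(File::PATH_SEPARATOR).any? do |dir|
    suffixes.any? do |suffix|
      candidate = File.join(dir, name + suffix)
      File.file?(candidate) && File.executable?(candidate)
    end
  end
end

unless executable_on_path?('java')
  warn 'java was not found on PATH; install a JRE or adjust PATH first'
end
```

If the check fails, installing a JRE (or adding its bin directory to PATH) is the thing to try before looking at the Ruby code or the document itself.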

Error opening iWorks files on OS X

I am getting an error when attempting to access the iWorks files. I realize this is because OS X stores the file group extracted into a folder even though it appears as a single file. How can this be overcome?

yomu = Yomu.new 'sample.pages'
yomu.text

generates this error

Errno::EISDIR: Is a directory @ io_fread - /path/to/test.pages
    from /gems/ruby-2.1.2/gems/yomu-0.2.0/lib/yomu.rb:173:in `read'
    from /gems/ruby-2.1.2/gems/yomu-0.2.0/lib/yomu.rb:173:in `data'
    from /gems/ruby-2.1.2/gems/yomu-0.2.0/lib/yomu.rb:85:in `text' from (irb):15
    ...

No Java runtime present, requesting install.

I am trying to use Yomu, but it shows "No Java runtime present, requesting install." It then opens a website and lets me download the Java JRE. After installing the JRE and restarting the terminal, I am still not able to run Yomu. Any ideas?

I'm using Yomu 0.2.2 on Mac Yosemite.

I have this JRE package installed: jre-8u31-macosx-x64.dmg

Use of defined? returns truthy value on nil

For the methods whose names end with ?, do you maybe want to test for nil rather than test to see if the variable is defined? That is, you want the method to return false if it is defined as nil, right?

For example:

def stream?
  defined? @stream
end

Would this be better off as the following?:

def stream?
  !! @stream
end

I use !! because I find it helpful for methods ending with ? to return true or false, rather than truthy or falsy values. I feel it is more precise, and produces clearer and more concise output when logging/debugging using expressions like puts "stream? == #{foo.stream?}".
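
The difference can be reproduced in isolation. The class below is a hypothetical stand-in written for illustration, not Yomu's own code; it only shows how defined? and !! behave once an instance variable has been assigned nil.

```ruby
# Hypothetical stand-in class, used only to compare the two predicates.
class StreamHolder
  def assign(stream)
    @stream = stream
    self
  end

  # Truthy as soon as @stream has been assigned, even when the value is nil.
  def stream_defined?
    defined?(@stream)
  end

  # True only when @stream holds a non-nil, non-false value.
  def stream_present?
    !!@stream
  end
end

holder = StreamHolder.new.assign(nil)
puts holder.stream_defined?.inspect # "instance-variable"
puts holder.stream_present?.inspect # false
```

An object whose @stream was never assigned returns nil from the defined? version, so the two forms only disagree after an explicit assignment of nil or false.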

Add other type of extraction

Hi, it's a very useful gem, but I would like to use the html format, I saw in the code that it doesn't handle it.

Options:
-? or --help Print this usage message
-v or --verbose Print debug level messages
-g or --gui Start the Apache Tika GUI
-x or --xml Output XHTML content (default)
-h or --html Output HTML content
-t or --text Output plain text content
-m or --metadata Output only metadata
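
Going by the option list above, supporting another output type mostly means choosing a different command-line switch. A sketch of such a mapping follows; TIKA_SWITCHES and tika_switch are hypothetical names for illustration, not Yomu's actual internals.

```ruby
# Hypothetical mapping from extraction type to the Tika CLI switch quoted
# in the option list above.
TIKA_SWITCHES = {
  xml: '-x',
  html: '-h',
  text: '-t',
  metadata: '-m'
}.freeze

# Looks up the switch for a type and fails loudly for unknown types.
def tika_switch(type)
  TIKA_SWITCHES.fetch(type) do
    raise ArgumentError, "unsupported extraction type: #{type.inspect}"
  end
end

puts tika_switch(:html)
```

Failing with ArgumentError for unknown types keeps a typo such as :htm from silently falling back to another format.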

License missing from gemspec

Some companies will only use gems with a certain license.
The canonical and easy way to check is via the gemspec, e.g.

spec.license = 'MIT'
# or
spec.licenses = ['MIT', 'GPL-2']

There is even a License Finder to help companies ensure all gems they use
meet their licensing needs. This tool depends on license information being available in the gemspec.
Including a license in your gemspec is a good practice, in any case.
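
To show where the setting lives and how a tool can read it back, here is a minimal sketch using RubyGems' own Gem::Specification. The gem name, version, summary, and author are placeholders invented for the example, not Yomu's real gemspec.

```ruby
require 'rubygems'

# Placeholder specification; every value below is made up for illustration.
spec = Gem::Specification.new do |s|
  s.name    = 'example_gem'
  s.version = '0.1.0'
  s.summary = 'Placeholder gem used to demonstrate the license field'
  s.authors = ['Example Author']
  s.license = 'MIT'
end

# Tools can read the declared licenses straight from the specification.
puts spec.licenses.inspect
```

Setting license stores a single-element list, so code that reads licenses works whether the gemspec used the singular or the plural form.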

How did I find you?

I'm using a script to collect stats on gems, originally looking for download data, but decided to collect licenses too,
and make issues for missing ones as a public service :)
https://gist.github.com/bf4/5952053#file-license_issue-rb-L13 So far it's going pretty well

Many PDF documents don't parse correctly

Not sure if other people have this problem, but half of the PDFs I throw at this thing return gobbledygook for text. The other half are fine. Incidentally, pdf-reader processes those same docs with no problem.

Reading from remote files with https:// addresses causes crash

Full disclosure, I'm not even close to an expert with this, so I can only tell you what I'm experiencing and how I managed to fix it.

I noticed that after updating my CarrierWave initialization file to connect to Amazon AWS through fog instead of S3, my calls to Yomu were no longer working. After a while I figured out that the problem was this line in particular:

@data = Net::HTTP.get @uri

which was causing the following crash:

Errno::ECONNRESET: Connection reset by peer
    from /app/vendor/ruby-1.9.3/lib/ruby/1.9.1/net/protocol.rb:141:in `read_nonblock'
    from /app/vendor/ruby-1.9.3/lib/ruby/1.9.1/net/protocol.rb:141:in `rbuf_fill'
    from /app/vendor/ruby-1.9.3/lib/ruby/1.9.1/net/protocol.rb:122:in `readuntil'
    from /app/vendor/ruby-1.9.3/lib/ruby/1.9.1/net/protocol.rb:132:in `readline'
    from /app/vendor/ruby-1.9.3/lib/ruby/1.9.1/net/http.rb:2563:in `read_status_line'
    from /app/vendor/ruby-1.9.3/lib/ruby/1.9.1/net/http.rb:2552:in `read_new'
    from /app/vendor/ruby-1.9.3/lib/ruby/1.9.1/net/http.rb:1320:in `block in transport_request'
    from /app/vendor/ruby-1.9.3/lib/ruby/1.9.1/net/http.rb:1317:in `catch'
    from /app/vendor/ruby-1.9.3/lib/ruby/1.9.1/net/http.rb:1317:in `transport_request'
    from /app/vendor/ruby-1.9.3/lib/ruby/1.9.1/net/http.rb:1294:in `request'
    from /app/vendor/ruby-1.9.3/lib/ruby/1.9.1/net/http.rb:1196:in `request_get'
    from /app/vendor/ruby-1.9.3/lib/ruby/1.9.1/net/http.rb:455:in `block in get_response'
    from /app/vendor/ruby-1.9.3/lib/ruby/1.9.1/net/http.rb:746:in `start'
    from /app/vendor/ruby-1.9.3/lib/ruby/1.9.1/net/http.rb:454:in `get_response'
    from /app/vendor/ruby-1.9.3/lib/ruby/1.9.1/net/http.rb:431:in `get'
    from (irb):41

When I switched my connection to AWS from S3 to Fog, CarrierWave started uploading the files that would eventually be read by Yomu to SSL server addresses. From what I've read online, Net::HTTP requests need to be told explicitly that SSL is being used, hence the sudden crashes from Yomu.

Instead of explicitly setting up the connection, someone suggested the following change to that same line (yomu.rb line 177):

@data = @uri.read

Alternatively - what I did as a temporary solution until I see how this pans out is to set my carrierwave config to disable SSL:

config.fog_use_ssl_for_aws = false

Which seems to work for me. Just in case, this is on Rails 4.1.6, ruby 1.9.3p547 and Yomu 0.2.0.
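
For reference, setting up the connection explicitly, the option mentioned above, could look roughly like the following. It is a sketch using only Ruby's standard net/http; http_for and fetch_body are hypothetical helper names and this is not the code in yomu.rb.

```ruby
require 'net/http'
require 'uri'

# Builds (but does not open) a connection whose SSL setting follows the
# URI scheme. Hypothetical helper, not part of Yomu.
def http_for(uri)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == 'https')
  http
end

# Fetches the body of a remote document over HTTP or HTTPS.
def fetch_body(url)
  uri = URI.parse(url)
  http_for(uri).start { |http| http.get(uri.request_uri).body }
end
```

Deriving use_ssl from the scheme means the same call works for both http:// and https:// addresses, so SSL does not have to be disabled on the storage side.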

Exception in Tika 1.5 (when reading "values" from metadata)

If you parse the file at https://github.com/roo-rb/roo/blob/master/test/files/Pfand_from_windows_phone.xlsx with yomu and get the metadata, the resulting object is "False" -- which leads to all sorts of fun. Even reading the text from the file generates a Java exception (though the read method returns "").

I've found that Tika 1.5 seems to be the issue; Tika 1.7 (and there are even-newer versions) successfully reads this file. (I saw someone forked and upgraded to 1.8 but the pull request didn't go through due to CI tests not passing.)

Speed up bulk processing with Tika

Yomu is great. I'm currently using it to process thousands of documents. Unfortunately, this is very slow, because, right now, Yomu starts the JVM for each document. This takes about 2 seconds per document -- which significantly slows me down.

Tika has thought of this and included "server" mode, where Tika starts as a server and processes whatever documents are thrown at it over a socket. Starting Java in server mode takes a little longer, but only has to happen once.

I've modified Yomu to support server mode. The API is the same, but if you want server mode, put this

Yomu.server(:text)

before your code and

Yomu.kill_server!

after it.

For processing even only 6 documents, the speed-up is noticeable: 12ish seconds with the current version of Yomu and 4ish with my server version.

In order to preserve the API as-is (tests pass on my branch with no changes), my method isn't terribly elegant (e.g. class variables) and requires the target extraction type (text/html/metadata) to be selected when the server is inited (this is a Tika constraint). A more elegant and Rubyish way would be to do all the server-based extraction in a block. But this would require changing the API.

If you'd be amenable to this as a patch, @yomu, I'll write tests, edit the docs and submit a PR. I'm happy, too, to submit as is or with the block-based method I mention above, based on what you think is best for the library. Until then, my version is at https://github.com/jeremybmerrill/yomu/tree/feature/servermode
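
The block-based variant mentioned above could look roughly like this. Everything in the sketch is hypothetical: FakeTikaServer stands in for a real long-running Tika process, and with_extraction_server is not part of Yomu or of the linked branch.

```ruby
# Hypothetical stand-in for a long-running Tika server process.
class FakeTikaServer
  attr_reader :type, :running

  def initialize(type)
    @type = type # extraction type is fixed when the server starts
    @running = true
  end

  # Pretends to extract content; a real server would talk over a socket.
  def extract(data)
    raise 'server stopped' unless @running
    "#{type}: #{data.bytesize} bytes"
  end

  def stop
    @running = false
  end
end

# Starts the server once, yields it, and always stops it afterwards,
# even when the block raises.
def with_extraction_server(type)
  server = FakeTikaServer.new(type)
  yield server
ensure
  server.stop if server
end

results = with_extraction_server(:text) do |server|
  %w[first second].map { |doc| server.extract(doc) }
end
puts results.inspect
```

The ensure clause is the main point of the design: the server is shut down without a separate kill call, which is what the block form would add over the current start and kill pair.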
