factful / ocr_testing

Scripts and results from our OCR roundup, available on Source

Home Page: https://source.opennews.org/articles/so-many-ocr-options/

Ruby 73.42% Python 26.58%

ocr_testing's Introduction

Welcome

This repository contains the scripts and outputs from our OCR comparison tests.

We identified a few sample documents to run through OCR systems so we could compare the results. The documents we used in our final write-up are these:

A receipt -- This receipt from the Rikers commissary was included in States of Incarceration, a collaborative storytelling project and traveling exhibition about incarceration in America.
A heavily redacted document -- Carter Page's FISA warrant is a legal filing with a lot of redacted portions, just the kind of exasperating thing reporters deal with all the time.
Something historical -- Executive Order 9066 authorized the internment of Japanese Americans in 1942. The scanned image available in the National Archives is fairly high quality, but it is still an old, typewritten document.
A form -- This Texas campaign finance report, from a Texas Tribune story about abuses in the juvenile justice system, has very clean text, but the formatting is important to understanding the document.
Something wrinkled -- In early 2014 a group of divers retrieved hundreds of pages of documents from a lake at Ukrainian President Viktor Yanukovych's vast country estate. The former president or his staff had dumped the records there in the hopes of destroying them, but many pages were still at least somewhat legible. Reporters laid them out to dry and began the process of transcribing the waterlogged papers. We selected a page that is more or less readable to the human eye but definitely warped with water damage.

We also tested the OCR engines against a handful of alternate documents. We've preserved two of those documents here so that you can look them over, too. Both are relatively easy to read and all the OCR engines we tested handled them well.

The first, cepr_oversight_order, is an order giving the Puerto Rico Energy Commission oversight powers over the Puerto Rico Electric Power Authority, after the latter's highly unusual $300 million contract with Whitefish Energy came under scrutiny.

The second, whitefish_energy_vs_commonwealth_puerto_rico, is the full text of a legal filing in the protracted fight over who is responsible for delays in rebuilding Puerto Rico's electric grid. These two articles are a great place to get more context:

Puerto Rico moves to cancel contract with Whitefish Energy to repair electric grid, The Washington Post, Oct 29, 2017; and
Puerto Rico Grid Contractor Dispute Devolves Into Litigation, The Wall Street Journal, Nov 22, 2017

Using this Repository

The /lib/ directory includes the scripts that we used to test each OCR client. Each tool requires some setup, but once you've got a tool installed, you can invoke it with:

ruby ./lib/ocr.rb {command}

For example, once you have installed Tesseract, ruby ./lib/ocr.rb tesseract documents will use Tesseract to OCR all the images in the "documents" directory.

Once you have set up Google Cloud services and stored your credentials, ruby ./lib/ocr.rb google google_cloud_vision/credentials.json documents/historical-executive_order_9066-japanese_internment.jpg will use Google Cloud Vision to OCR a single image.
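To run one engine over several files in turn, the invocation above can be wrapped in a few lines of Ruby. This is a minimal sketch and not part of the repository: ocr_command and ocr_each are hypothetical helpers, and the sketch assumes the engine accepts a single file path as its only argument, as the Calamari example later in this README does.

```ruby
# Hypothetical helper (not part of this repository): builds the argument
# list for one ./lib/ocr.rb run, mirroring the examples above.
def ocr_command(engine, *args)
  ['ruby', './lib/ocr.rb', engine, *args]
end

# Runs one engine over each file in turn and returns the files whose run
# failed. Pass a different runner (any callable) to dry-run the commands
# without executing anything.
def ocr_each(engine, files, runner: ->(cmd) { system(*cmd) })
  files.reject { |file| runner.call(ocr_command(engine, file)) }
end

# Example:
#   failed = ocr_each('calamari', Dir.glob('documents/*.jpg'))
#   warn "Failed: #{failed.join(', ')}" unless failed.empty?
```

Passing each argument separately to system avoids shell quoting problems with file names that contain spaces.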

Installation

The scripts in this repository depend on a few Ruby gems. Install them with:

Install Bundler first: gem install bundler
Then install gems in the Gemfile: bundle install

This script uses mutool, a PDF processing tool included in mupdf, to convert multi-page PDFs into images. Install with:

Mac/Homebrew brew install mupdf
Ubuntu apt install mupdf-tools
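The scripts run this conversion for you. If you want to convert a PDF by hand, the sketch below shows one way to build the command. It relies on mutool draw with its -o (output pattern) and -r (resolution) options; the 300 dpi resolution, the PNG output, and the helper name mutool_draw_command are illustrative assumptions, not necessarily what the repository's scripts use.

```ruby
# Hypothetical helper: builds a `mutool draw` command that renders every
# page of a PDF to its own PNG. mutool replaces %d in the output pattern
# with the page number. 300 dpi is an assumed, illustrative resolution.
def mutool_draw_command(pdf_path, out_dir, dpi: 300)
  base = File.basename(pdf_path, '.pdf')
  pattern = File.join(out_dir, "#{base}-%d.png")
  ['mutool', 'draw', '-r', dpi.to_s, '-o', pattern, pdf_path]
end

# Example (requires mutool on your PATH and an existing output directory):
#   system(*mutool_draw_command('documents/filing.pdf', 'images'))
```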

Cloud Services

Each of the cloud services we tested requires you to authenticate your account. Our scripts look for those credentials in the credentials.json file in each directory.
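Before sending documents to a paid service, it can help to confirm that the credentials file exists and parses. The following is a minimal sketch of such a check, not code from this repository: load_credentials is a hypothetical helper, and it deliberately makes no assumption about which keys each service expects.

```ruby
require 'json'

# Hypothetical pre-flight check: confirm a credentials file exists and
# contains valid JSON before invoking a cloud OCR service. Returns the
# parsed contents; raises ArgumentError with a readable message otherwise.
def load_credentials(path)
  raise ArgumentError, "No credentials file at #{path}" unless File.file?(path)

  JSON.parse(File.read(path))
rescue JSON::ParserError => e
  raise ArgumentError, "#{path} is not valid JSON: #{e.message}"
end

# Example:
#   load_credentials('google_cloud_vision/credentials.json')
```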

Google Cloud Vision

The Ruby Gems that Google Cloud Vision requires are included in the bundle install for this repository.

Google Cloud Vision requires authentication credentials. Use the example in google_cloud_vision/credentials.sample.json to create your own credentials.json and make sure to point to it when you invoke ./lib/ocr.rb, e.g.:

ruby ./lib/ocr.rb google google_cloud_vision/credentials.json documents/document.jpg

Microsoft Azure Computer Vision

The Ruby Gems that Microsoft Azure requires are included in the bundle install for this repository.

Use the example in azure/credentials.sample.json to create your own credentials.json and make sure to point to it when you invoke ./lib/ocr.rb.

Abbyy

Abbyy provides a Python script, which is what we used to test documents in Abbyy. You can pass your application ID and password to the script as environment variables when you run it:

ABBYY_APPID="{YOUR APPID}" ABBYY_PWD="{YOUR PASSWORD}" python process.py {PATH TO IMAGE} {PATH TO OUTPUT}
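If you would rather drive the Abbyy script from Ruby, as the other engines are, the same environment variables can be supplied to a single child process. This is a sketch under assumptions, not part of the repository: abbyy_invocation is a hypothetical helper, and it assumes process.py sits in the current directory and that python resolves to an interpreter the script supports.

```ruby
# Hypothetical wrapper: prepares a call to Abbyy's process.py with the
# credentials supplied as environment variables for that one child
# process only, so they are not exported to the rest of your shell.
def abbyy_invocation(app_id, password, image_path, output_path)
  env = { 'ABBYY_APPID' => app_id, 'ABBYY_PWD' => password }
  [env, 'python', 'process.py', image_path, output_path]
end

# Example (Kernel#system accepts the environment hash as its first argument):
#   system(*abbyy_invocation('my-app-id', 'my-password', 'in.jpg', 'out.txt'))
```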

Command Line Tools

The free and open source tools that we tested are all command line applications that you'll run locally.

Tesseract

tesseract is far and away the best maintained and easiest to use of the command line tools we tested. You should be able to install it with a package manager.

macOS: brew install tesseract tesseract-lang

Ubuntu/Debian: apt install tesseract-ocr 'tesseract-ocr-*'

Calamari

Calamari depends on OCRopus's tools to improve contrast, and to deskew and split images. Unfortunately, Calamari requires Python 3.x, and OCRopus requires Python 2.x. Because TensorFlow has issues with Python 3.7, we used Python 3.6. In retrospect, using kraken might have been much smoother, but here's what we actually did:

We used pyenv and virtualenv to manage multiple Python instances. (If you're using pyenv, please also note their installation instructions.)

We installed Python 3.6 with pyenv, and then used virtualenv to create a space to install Calamari and its dependencies.

# from the root of this directory first install Python 3.6 and create a virtual env.
mkdir -p venv
pyenv install 3.6.8
virtualenv -p ~/.pyenv/versions/3.6.8/bin/python venv/calamari
# activate the virtualenv
source venv/calamari/bin/activate

# Clone the calamari code
cd ..
git clone https://github.com/Calamari-OCR/calamari.git
cd calamari
# then install the dependencies and library.
pip install -r requirements.txt
python setup.py install

Calamari provides some pre-trained data models to power its recognizer. You should download them into a models directory in the Calamari directory.

git clone https://github.com/Calamari-OCR/calamari_models.git models

If your installation was successful, calamari-predict will be available at the command line, and you can run ruby ./lib/ocr.rb calamari {filename} to OCR files with Calamari.

OCRopus

OCRopus requires Python 2.7, so it's helpful to use pyenv to manage instances.

mkdir -p venv
pyenv install 2.7
virtualenv -p ~/.pyenv/versions/2.7/bin/python venv/ocropus

Clone OCRopus with:

git clone https://github.com/tmbdev/ocropy.git

# activate the ocropus virtualenv
source venv/ocropus/bin/activate
# find the ocropus source directory
cd ../ocropy
# and install the dependencies
pip install -r requirements.txt
python setup.py install

To get OCRopus working you'll also need to download trained models. Prebuilt models for OCRopus can be found on the OCRopus wiki. You should download the English model into a models directory in the OCRopus directory.

If your installation was successful, ocropus-rpred will be available at the command line, and you can run ruby ./lib/ocr.rb ocropus {filename} to OCR files with OCRopus.
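Both the Calamari and OCRopus instructions end by saying that a command (calamari-predict or ocropus-rpred) should be available at the command line. A quick way to confirm that from Ruby is sketched below. find_executable is a hypothetical helper, not part of this repository. Because the tools were installed into virtualenvs, they will most likely be found only while the matching virtualenv is active.

```ruby
# Hypothetical helper: returns the full path of an executable found on
# PATH, or nil if no directory on PATH contains it.
def find_executable(name, path_env = ENV.fetch('PATH', ''))
  path_env.split(File::PATH_SEPARATOR).reject(&:empty?).each do |dir|
    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# Report whether each command-line OCR tool is reachable right now.
%w[calamari-predict ocropus-rpred].each do |tool|
  location = find_executable(tool)
  puts(location ? "#{tool}: #{location}" : "#{tool}: not found on PATH")
end
```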

ocr_testing's Issues

/usr/lib/ruby/2.5.0/rubygems/core_ext/kernel_require.rb:59:in `require': cannot load such file -- fastimage (LoadError)

I have tried to run Calamari (with Python 3) and Tesseract, but I am getting the above error.
For calamari I am using this command:

ruby ./lib/ocr.rb calamari ../../Pictures/test/test.png

What am I doing wrong here?

Complete Attention OCR Testing

If we do complete testing on Attention OCR, these are the notes we had in the post draft on things we wanted to flag.

THINGS TO BE ALERT TO WHEN SETTING UP

Gotta make certain you’ve installed the right dependencies (tensorflow version).
Training requires specifying what the longest possible word it should expect is.

HOW DO THE TWO REPOSITORIES RELATE? WE AT LEAST JUST ACKNOWLEDGE THAT THEY BOTH EXIST.
The original authors of Attention-OCR and the paper that introduced it placed their code in https://github.com/da03/Attention-OCR . They published a paper on Attention-OCR as well.
Ed Medvedev (who appears to be just some random guy?) created a modified fork which is stylistically closer to TensorFlow’s recommended usage: https://github.com/emedvedev/attention-ocr . Medvedev’s branch includes the instructions necessary to use

WHAT FORMATS WILL IT RETURN? HOCR? JSON? PDF?
The input into Attention-OCR is a single line of text at a time, so it’s just providing text output.

ANY OBSERVATIONS ABOUT THE RESULTS/THINGS TO FLAG?

Where are ocropus models?

The README says you need to download models to run OCRopus, but it wasn't obvious at a glance from the repository how I would do that. Can you give me a URL that we should point to for that piece of the instructions?

Review the Tesseract, OCRopus and Calamari instructions

Can you set aside a few focused minutes to walk through the install instructions for the command line tools and confirm that they seem to make sense to you? I'm still fixing a lot of little things that seem obvious (if you're already in /calamari/ you shouldn't need to cd into that directory to clone the models).

Clarify setting ENV

@knowtheory

At the top of the azure.rb we need to clarify setting the ENV variable -- if you set it in a virtualenv will that stick? Are you assuming a virtualenv? Or will that AZURE_KEY be available globally?

Clarify Calamari instructions

The Calamari setup instructions end with:

"calamari-predict should now be available on the commandline"

Am I correct that once that is running you would actually invoke Calamari with lib/ocr.rb calamari {filename} ? If so, we should say that explicitly.

Add Attention OCR to the Viewer

We need to add Attention OCR results to the ocr-viewer.

Rotate Yanukovych 90° and re-run Calamari and Ocropus

Looks like the Yanukovych document wasn't rotated when it was run through Ocropus and Calamari. Need to rerun with the original oriented the right direction.

Carter Page FISA in Abbyy

I don't need the whole Carter Page warrant but I do need page 1 run through Abbyy.

Add instructions on how to use the scripts to OCR documents

explain the interface and how to invoke it
explain installation dependencies for each of the tools as needed
explain credentials files and where they should be placed for cloud services.

No Yanukovych document in Abbyy, Acrobat, Tesseract?

I don't see any Yanukovych documents in the Abbyy directory or in the Acrobat directory or in the Tesseract directory.

Where do I put my Abbyy credentials?

I don't see a credentials.json file, and it looks like the Python script is expecting an environment variable: os.environ["ABBYY_APPID"]

If that is correct, the readme should explain that you need to set those variables.

If that is incorrect, please let me know what I'm missing here?

Test CometDocs

Cometdocs provides an API, and can make credits available if you want to add them to the review.

See https://twitter.com/cometdocs/status/1098240764813938689 for more.

IRE members are entitled to a free Cometdocs premium account, but if you're not eligible for IRE membership, Cometdocs pricing starts at $10/mo and tops out at $130 for a lifetime account -- not a bad deal if their OCR works.

Still waiting on Attention OCR details

I still need more details so I can incorporate Attention OCR into our writeup.

When you think there's enough there that I can write around it, let me know and I'll do that.

Wrong Yanukovych document in Azure

It looks like you OCR'd the wrong Yanukovych document in Azure.

Remove duplicate .rb files

lib/ocr/azure.rb duplicates azure/azure.rb -- we're only using one of those.

Address Problem: different tools handle all of this differently

As noted on #13 "Just noting again that the different tools handle all of this differently. Google, Azure, Abbyy and Tesseract all automatically rotate the pages."

This seems to reflect a larger concern than the fairly straightforward task of re-running the now-rotated Yanukovych document through Ocropus and Calamari.

Where and how should this be addressed?

Review Attention OCR graf in post

Can you confirm that what I put in the post is accurate and reasonable?

Add Ukrainian doc to viewer

Need to add the Yanukovych document to the viewer. Could go to press without it.

What's the conclusion?

Thank you for your excellent work! I read your article https://source.opennews.org/articles/so-many-ocr-options/ with great interest.

Your article opens with "Do you need to pay a lot of money to get reliable OCR results? Is Google Cloud Vision actually better than Tesseract? Are any cutting edge neural network-based OCR engines worth the time investment of getting them set up?" Unfortunately I didn't find any of those answers in your article :(

What's the bottom line on OCR these days? What's your conclusion?

no calamari install link

The Calamari instructions don't include cloning or downloading Calamari.
