ocr_testing's Introduction

Welcome

This repository contains the scripts and outputs from our OCR comparison tests.

We identified a few sample documents to run through OCR systems so we could compare the results. These are the documents we used in our final write-up:

  • A receipt -- This receipt from the Rikers Island commissary was included in States of Incarceration, a collaborative storytelling project and traveling exhibition about incarceration in America.
  • A heavily redacted document -- Carter Page's FISA warrant is a legal filing with a lot of redacted portions, just the kind of exasperating thing reporters deal with all the time.
  • Something historical -- Executive Order 9066 authorized the internment of Japanese Americans in 1942. The scanned image available in the National Archives is fairly high quality, but it is still an old, typewritten document.
  • A form -- This Texas campaign finance report, from a Texas Tribune story about abuses in the juvenile justice system, has very clean text, but the formatting is important to understanding the document.
  • Something wrinkled -- In early 2014 a group of divers retrieved hundreds of pages of documents from a lake at Ukrainian President Viktor Yanukovych's vast country estate. The former president or his staff had dumped the records there in hopes of destroying them, but many pages were still at least somewhat legible. Reporters laid them out to dry and began the process of transcribing the waterlogged papers. We selected a page that is more or less readable to the human eye but definitely warped with water damage.

We also tested the OCR engines against a handful of alternate documents. We've preserved two of those documents here so that you can look them over, too. Both are relatively easy to read and all the OCR engines we tested handled them well.

The first, cepr_oversight_order, is an order giving the Puerto Rico Energy Commission oversight powers over the Puerto Rico Electric Power Authority, after the latter's highly unusual $300 million contract with Whitefish Energy came under scrutiny.

The second, whitefish_energy_vs_commonwealth_puerto_rico, is the full text of a legal filing in the protracted fight over who is responsible for delays in rebuilding Puerto Rico's electric grid. These two articles are a great place to get more context:

Using this Repository

The /lib/ directory includes the scripts that we used to test each OCR client. Each tool requires some setup, but once you've got a tool installed, you can invoke it with:

ruby ./lib/ocr.rb {command}

For example, once you have installed Tesseract, ruby ./lib/ocr.rb tesseract documents will use Tesseract to OCR all the images in the "documents" directory.

Once you have set up Google Cloud services and stored your credentials, ruby ./lib/ocr.rb google google_cloud_vision/credentials.json documents/historical-executive_order_9066-japanese_internment.jpg will use Google Cloud Vision to OCR a single image.
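The two examples above pass either a directory or a single image. As a rough illustration of how such an argument could be expanded into a list of files, here is a minimal sketch. It is not taken from lib/ocr.rb: the helper name image_inputs and the extension list are assumptions made for this example.

```ruby
# Hypothetical helper, not part of this repository's scripts.
# The extension list is an assumption; adjust it to match your documents.
IMAGE_EXTENSIONS = %w[.jpg .jpeg .png .tif .tiff].freeze

# Returns [path] for a single file, or the sorted image files inside a directory.
def image_inputs(path)
  return [path] if File.file?(path)

  Dir.children(path)
     .select { |name| IMAGE_EXTENSIONS.include?(File.extname(name).downcase) }
     .sort
     .map { |name| File.join(path, name) }
end
```

Calling image_inputs("documents") would return every matching file in that directory, while passing a single .jpg returns just that file.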

Installation

The scripts in this repository depend on a few Ruby gems. Install them with:

  • Install Bundler first: gem install bundler
  • Then install gems in the Gemfile: bundle install

This script uses mutool, a PDF processing tool included in mupdf, to convert multi-page PDFs into images. Install with:

  • Mac/Homebrew: brew install mupdf
  • Ubuntu: apt install mupdf-tools
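For reference, a multi-page PDF can be split into one image per page with mutool draw. The sketch below only builds the command line and prints it. The helper name, the 300 dpi default, and the example paths are assumptions for illustration and may differ from what the scripts in /lib/ actually do.

```ruby
# Hypothetical helper: builds (but does not run) a mutool command line.
# The 300 dpi default is an assumption, not a setting taken from this repository.
def mutool_draw_command(pdf_path, out_dir, dpi: 300)
  base = File.basename(pdf_path, ".pdf")
  # mutool replaces %d in the output name with the page number.
  pattern = File.join(out_dir, "#{base}-%d.png")
  ["mutool", "draw", "-r", dpi.to_s, "-o", pattern, pdf_path]
end

# Placeholder paths, used only to show the resulting command.
cmd = mutool_draw_command("documents/filing.pdf", "images")
puts cmd.join(" ")
# prints: mutool draw -r 300 -o images/filing-%d.png documents/filing.pdf

# To actually run it (requires mutool on the PATH):
#   system(*cmd) or abort("mutool failed")
```

Returning the command as an array rather than a string lets system(*cmd) run it without going through the shell, so file names with spaces need no quoting.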

Cloud Services

Each of the cloud services we tested requires you to authenticate your account. Our scripts look for those credentials in a credentials.json file in each service's directory.
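If you want to catch a missing or incomplete credentials file before a script calls a paid API, a check along these lines may help. This is a sketch, not code from this repository: load_credentials is a hypothetical helper, and the required key names are passed in by the caller because each service's credentials.sample.json defines its own fields.

```ruby
require "json"

# Hypothetical helper, not part of this repository's scripts.
# required_keys should list the fields from the service's credentials.sample.json.
def load_credentials(path, required_keys)
  raise ArgumentError, "missing credentials file: #{path}" unless File.file?(path)

  creds = JSON.parse(File.read(path))
  missing = required_keys.reject { |key| creds.key?(key) }
  raise ArgumentError, "#{path} is missing: #{missing.join(', ')}" unless missing.empty?

  creds
end
```

Failing early with a clear message is usually easier to debug than an authentication error returned by the service itself.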

Google Cloud Vision

The Ruby Gems that Google Cloud Vision requires are included in the bundle install for this repository.

Google Cloud Vision requires authentication credentials. Use the example in google_cloud_vision/credentials.sample.json to create your own credentials.json, and make sure to point to it when you invoke ./lib/ocr.rb, e.g.:

ruby ./lib/ocr.rb google google_cloud_vision/credentials.json documents/document.jpg

Microsoft Azure Computer Vision

The Ruby Gems that Microsoft Azure requires are included in the bundle install for this repository.

Use the example in azure/credentials.sample.json to create your own credentials.json and make sure to point to it when you invoke ./lib/ocr.rb.

Abbyy

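Abbyy provides a Python script, which is what we used to test documents in Abbyy. It reads your application ID and password from the ABBYY_APPID and ABBYY_PWD environment variables rather than from a credentials.json file, so set them when you run the script: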

ABBYY_APPID="{YOUR APPID}" ABBYY_PWD="{YOUR PASSWORD}" python process.py {PATH TO IMAGE} {PATH TO OUTPUT}
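Because the Abbyy script takes its credentials from the environment, it can be useful to confirm that both variables are set before running it. The following is a minimal sketch: only the two variable names come from the command above, and missing_abbyy_vars is a hypothetical helper.

```ruby
# Hypothetical pre-flight check; only the variable names come from this README.
ABBYY_VARS = %w[ABBYY_APPID ABBYY_PWD].freeze

# Returns the names of the variables that are unset or blank in `env`.
def missing_abbyy_vars(env = ENV)
  ABBYY_VARS.select { |name| env[name].to_s.strip.empty? }
end

missing = missing_abbyy_vars
warn "Set #{missing.join(' and ')} before running process.py" unless missing.empty?
```

Passing the environment in as an argument keeps the check easy to try with a plain Hash instead of the real ENV.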

Command Line Tools

The free and open source tools that we tested are all command line applications that you'll run locally.

Tesseract

tesseract is far and away the best maintained and easiest to use of the command line tools we tested. You should be able to install it with a package manager.

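macOS: brew install tesseract --with-all-languages (on Homebrew releases that no longer accept formula options, brew install tesseract tesseract-lang installs the language data instead)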

Ubuntu/Debian: apt install tesseract-ocr tesseract-ocr-*

Calamari

Calamari depends on OCRopus's tools to improve contrast, and to deskew and split images. Unfortunately, Calamari requires Python 3.x, and OCRopus requires Python 2.x. Because TensorFlow has issues with Python 3.7, we used Python 3.6. In retrospect, using kraken might have been much smoother, but here's what we actually did:

We used pyenv and virtualenv to manage multiple Python instances. (If you're using pyenv, please also note their installation instructions.)

We installed Python 3.6 with pyenv, and then used virtualenv to create a space to install Calamari and its dependencies.

# from the root of this directory first install Python 3.6 and create a virtual env.
mkdir -p venv
pyenv install 3.6.8
virtualenv -p ~/.pyenv/versions/3.6.8/bin/python venv/calamari
# activate the virtualenv
source venv/calamari/bin/activate
# Clone the calamari code
cd ..
git clone https://github.com/Calamari-OCR/calamari.git
cd calamari
# then install the dependencies and library.
pip install -r requirements.txt
python setup.py install

Calamari provides some pre-trained data models to power its recognizer. You should download them into a models directory in the Calamari directory.

git clone https://github.com/Calamari-OCR/calamari_models.git models

If your installation was successful, calamari-predict will be available at the command line, and you can run ruby ./lib/ocr.rb calamari {filename} to OCR files with Calamari.

OCRopus

OCRopus requires Python 2.7, so it's helpful to use pyenv to manage instances.

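mkdir -p venv
# use a full version number so the install path below is predictable; 2.7.18 is the last 2.7 release
pyenv install 2.7.18
virtualenv -p ~/.pyenv/versions/2.7.18/bin/python venv/ocropus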

Clone OCRopus with:

git clone https://github.com/tmbdev/ocropy.git

# activate the ocropus virtualenv
source venv/ocropus/bin/activate
# find the ocropus source directory
cd ../ocropy
# and install the dependencies
pip install -r requirements.txt
python setup.py install

To get OCRopus working you'll also need to download trained models. Prebuilt models for OCRopus can be found on the OCRopus wiki. You should download the English model into a models directory in the OCRopus directory.

If your installation was successful, ocropus-rpred will be available at the command line, and you can run ruby ./lib/ocr.rb ocropus {filename} to OCR files with OCRopus.
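To confirm that calamari-predict and ocropus-rpred really are available at the command line (for example, that the right virtualenv is active), you can search the PATH from Ruby. This is a sketch with a hypothetical helper name, executable_on_path?, and is not part of the scripts in /lib/.

```ruby
# Hypothetical helper: true when an executable called `name` exists in one of
# the directories listed in `path` (defaults to the current PATH).
def executable_on_path?(name, path = ENV.fetch("PATH", ""))
  path.split(File::PATH_SEPARATOR).any? do |dir|
    candidate = File.join(dir, name)
    File.file?(candidate) && File.executable?(candidate)
  end
end

%w[calamari-predict ocropus-rpred].each do |tool|
  status = executable_on_path?(tool) ? "found" : "not found (is the virtualenv active?)"
  puts "#{tool}: #{status}"
end
```

Since each tool lives in its own virtualenv, expect only one of the two to be found at any given time.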

ocr_testing's People

Contributors

amandabee, knowtheory

ocr_testing's Issues

More detail questions

How would I run azure.rb? Assuming all the necessary libraries are installed?

Are you assuming a virtualenv or no?

Complete Attention OCR Testing

If we do complete testing on Attention OCR, these are the notes we had in the post draft on things we wanted to flag.

THINGS TO BE ALERT TO WHEN SETTING UP

  • Gotta make certain you’ve installed the right dependencies (tensorflow version).
  • Training requires specifying what the longest possible word it should expect is.

HOW DO THE TWO REPOSITORIES RELATE? WE AT LEAST JUST ACKNOWLEDGE THAT THEY BOTH EXIST.
The original authors of Attention-OCR and the paper that introduced it placed their code in https://github.com/da03/Attention-OCR . They published a paper on Attention-OCR as well.
Ed Medvedev (who appears to be just some random guy?) created a modified fork which is stylistically closer to TensorFlow’s recommended usage: https://github.com/emedvedev/attention-ocr Medvedev’s branch includes the instructions necessary to use

WHAT FORMATS WILL IT RETURN? HOCR? JSON? PDF?
The input into Attention-OCR is a single line of text at a time, so it’s just providing text output.

ANY OBSERVATIONS ABOUT THE RESULTS/THINGS TO FLAG?

Where are ocropus models?

The README says you need to download models to run OCRopus, but it wasn't obvious at a glance from the repository how I would do that. Can you give me a URL that we should point to for that piece of the instructions?

Review the Tesseract, OCRopus and Calamari instructions

Can you set aside a few focused minutes to walk through the install instructions for the command line tools and confirm that they seem to make sense to you? I'm still fixing a lot of little things that seem obvious (if you're already in /calamari/ you shouldn't need to cd into that directory to clone the models).

Clarify setting ENV

@knowtheory

At the top of the azure.rb we need to clarify setting the ENV variable -- if you set it in a virtualenv will that stick? Are you assuming a virtualenv? Or will that AZURE_KEY be available globally?

Clarify Calamari instructions

The Calamari setup instructions end with:

"calamari-predict should now be available on the commandline"

Am I correct that once that is running you would actually invoke Calamari with lib/ocr.rb calamari {filename} ? If so, we should say that explicitly.

Where do I put my Abbyy credentials?

I don't see a credentials.json file, and it looks like the Python script is expecting an environment variable: os.environ["ABBYY_APPID"]

If that is correct, the readme should explain that you need to set those variables.

If that is incorrect, please let me know what I'm missing here?

Still waiting on Attention OCR details

I still need more details so I can incorporate Attention OCR into our writeup.

When you think there's enough there that I can write around it, let me know and I'll do that.

Address Problem: different tools handle all of this differently

As noted on #13 "Just noting again that the different tools handle all of this differently. Google, Azure, Abbyy and Tesseract all automatically rotate the pages."

This seems to reflect a larger concern than the fairly straightforward task of re-running the now-rotated Yanukovych document through Ocropus and Calamari.

Where and how should this be addressed?

What's the conclusion?

Thank you for your excellent work! I read your article https://source.opennews.org/articles/so-many-ocr-options/ with great interest.

Your article opens with "Do you need to pay a lot of money to get reliable OCR results? Is Google Cloud Vision actually better than Tesseract? Are any cutting edge neural network-based OCR engines worth the time investment of getting them set up?" Unfortunately I didn't find any of those answers in your article :(

What's the bottom line on OCR these days? What's your conclusion?
