tleyden / open-ocr

Run your own OCR-as-a-Service using Tesseract and Docker

License: Apache License 2.0

Go 80.80% Shell 10.45% API Blueprint 8.75%

open-ocr's Introduction

OpenOCR makes it simple to host your own OCR REST API.

The heavy lifting OCR work is handled by Tesseract OCR.

Docker is used to containerize the various components of the service.

Features

Scalable message passing architecture via RabbitMQ.
Platform independence via Docker containers.
Kubernetes support: workers can run in a Kubernetes Replication Controller
Supports 31 languages in addition to English
Ability to use an image pre-processing chain. An example using Stroke Width Transform is provided.
PDF support via a PDF preprocessor
Pass arguments to Tesseract such as character whitelist and page segment mode.
REST API docs
A Go REST client is available.

Launching OpenOCR on a Docker PAAS

OpenOCR can easily run on any PAAS that supports Docker containers. Here are the instructions for a few that have already been tested:

If your preferred PAAS isn't listed, please open a Github issue to request instructions.

Launching OpenOCR on Ubuntu 14.04

OpenOCR can be launched on anything that supports Docker, such as Ubuntu 14.04.

Here's how to install it from scratch and verify that it's working correctly.

Install Docker

See Installing Docker on Ubuntu instructions.

Find out your host address

$ ifconfig
eth0      Link encap:Ethernet  HWaddr 08:00:27:43:40:c7
          inet addr:10.0.2.15  Bcast:10.0.2.255  Mask:255.255.255.0
          ...

The IP address 10.0.2.15 will be used as the RABBITMQ_HOST environment variable below.
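To avoid copying the address by hand, the lookup above can be scripted. This is a minimal sketch and not part of OpenOCR: the function name first_inet_addr is hypothetical, and it assumes ifconfig prints either the older "inet addr:10.0.2.15" layout shown above or the newer "inet 10.0.2.15" layout.

```shell
# Print the first IPv4 address found in ifconfig-style output read from stdin.
# Lines starting with "inet6" are ignored because the pattern requires "inet ".
first_inet_addr() {
  sed -n 's/^[[:space:]]*inet \(addr:\)\{0,1\}\([0-9][0-9.]*\).*/\2/p' | head -n 1
}

# Usage (not run here; eth0 is the interface from the output above):
#   export RABBITMQ_HOST=$(ifconfig eth0 | first_inet_addr)
```

On a host with several interfaces, pass the interface name to ifconfig as in the usage comment so the function does not pick up an unrelated address.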

Launching OpenOCR with run.sh

Install docker
Install docker-compose
git clone https://github.com/tleyden/open-ocr.git
cd open-ocr/docker-compose
Type ./run.sh (if you don't have execute permission, run sudo chmod +x run.sh first)
The runner will ask whether you want to delete the images (choose y or n for each)
The runner will ask you to choose between version 1 and version 2
- Version 1 uses Tesseract 3.04. Memory usage is light, it is fairly fast, and it needs little disk space (a simple AWS instance with 1 GB of RAM and 8 GB of storage is sufficient). Results are acceptable.
- Version 2 uses Tesseract 4.00. Memory usage is light, but it is slower than Tesseract 3 and needs more disk space (a simple AWS instance with 1 GB of RAM is sufficient, but with a 16 GB EBS volume). Results are much better than with version 3.04.
- For a comparison, see the official Tesseract page

You can also use docker-compose directly, without run.sh:

# for v1
export OPEN_OCR_INSTANCE=open-ocr

# for v2
export OPEN_OCR_INSTANCE=open-ocr-2

# then bring it up (add -d to run it as a daemon)
docker-compose up
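The two export lines above can be wrapped in a small helper so that the version choice is explicit. This is only a sketch: the function name open_ocr_instance is hypothetical, and the two values are the ones listed above for v1 and v2.

```shell
# Map the version number (1 or 2) to the matching OPEN_OCR_INSTANCE value.
open_ocr_instance() {
  case "$1" in
    1) echo "open-ocr" ;;     # Tesseract 3.04
    2) echo "open-ocr-2" ;;   # Tesseract 4.00
    *) echo "unknown version: $1 (expected 1 or 2)" >&2; return 1 ;;
  esac
}

# Usage (not run here):
#   export OPEN_OCR_INSTANCE=$(open_ocr_instance 2) && docker-compose up -d
```

Returning a non-zero status for any other input keeps a typo from silently starting the wrong instance.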

Docker Compose will start four docker instances

You are now ready to decode images → text via your REST API.

Launching OpenOCR with Docker Compose on OSX

Install docker
Install docker toolbox
Check out the OpenOCR repository
cd into the docker-compose directory
docker-machine start default
docker-machine env
Look at the Docker host IP address
Run docker-compose up -d to run containers as daemons or docker-compose up to see the log in console
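The Docker host IP address is embedded in the DOCKER_HOST value that docker-machine env prints (for example tcp://192.168.99.100:2376). As a sketch, it can be extracted with plain shell parameter expansion; the function name docker_host_ip is hypothetical. Running docker-machine ip default should print the same address directly.

```shell
# Extract the host part from a DOCKER_HOST value such as tcp://192.168.99.100:2376.
docker_host_ip() {
  hostport=${1#*://}       # drop the scheme, leaving 192.168.99.100:2376
  echo "${hostport%%:*}"   # drop the port, leaving 192.168.99.100
}

# Usage (not run here), after eval $(docker-machine env):
#   docker_host_ip "$DOCKER_HOST"
```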

How to test the REST API after running docker-compose up

In the request below, IP_ADDRESS_OF_DOCKER_HOST is the address shown by docker-machine env (e.g. 192.168.99.100), and HTTP_PORT is the port number set in the .yml file in the docker-compose directory, which is presumably 9292.

Request

$ curl -X POST -H "Content-Type: application/json" -d '{"img_url":"http://bit.ly/ocrimage","engine":"tesseract"}' http://IP_ADDRESS_OF_DOCKER_HOST:HTTP_PORT/ocr

Assuming the values are 192.168.99.100 and 9292 respectively:

$ curl -X POST -H "Content-Type: application/json" -d '{"img_url":"http://bit.ly/ocrimage","engine":"tesseract"}' http://192.168.99.100:9292/ocr

Response

It will return the decoded text for the test image:

< HTTP/1.1 200 OK
< Date: Tue, 13 May 2014 16:18:50 GMT
< Content-Length: 283
< Content-Type: text/plain; charset=utf-8
<
You can create local variables for the pipelines within the template by
preﬁxing the variable name with a “$" sign. Variable names have to be
composed of alphanumeric characters and the underscore. In the example
below I have used a few variations that work for variable names.
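The request shown earlier has two variable parts: the endpoint URL and the JSON body. As a rough sketch, they can be assembled with two helpers. The function names are hypothetical, and the body builder does no JSON escaping, so it assumes the image URL contains no double quotes or backslashes.

```shell
# Build the /ocr endpoint URL from a host and a port.
ocr_endpoint() {
  printf 'http://%s:%s/ocr\n' "$1" "$2"
}

# Build the JSON body for an image URL, using the tesseract engine.
ocr_url_payload() {
  printf '{"img_url":"%s","engine":"tesseract"}\n' "$1"
}

# Usage (not run here):
#   curl -X POST -H "Content-Type: application/json" \
#        -d "$(ocr_url_payload http://bit.ly/ocrimage)" \
#        "$(ocr_endpoint 192.168.99.100 9292)"
```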

Test the REST API

With image url

Request

$ curl -X POST -H "Content-Type: application/json" -d '{"img_url":"http://bit.ly/ocrimage","engine":"tesseract"}' http://10.0.2.15:$HTTP_PORT/ocr

Response

It will return the decoded text for the test image:

< HTTP/1.1 200 OK
< Date: Tue, 13 May 2014 16:18:50 GMT
< Content-Length: 283
< Content-Type: text/plain; charset=utf-8
<
You can create local variables for the pipelines within the template by
preﬁxing the variable name with a “$" sign. Variable names have to be
composed of alphanumeric characters and the underscore. In the example
below I have used a few variations that work for variable names.

With image base64

Request

$ curl -X POST -H "Content-Type: application/json" -d '{"img_base64":"<YOUR BASE 64 HERE>","engine":"tesseract"}' http://10.0.2.15:$HTTP_PORT/ocr
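The placeholder in the request above has to be filled with the base64 encoding of the image file. One way to do this, as a sketch, is shown below. The function name ocr_base64_payload is hypothetical, and it assumes a base64 command is available that reads from stdin (as on most Linux and macOS systems).

```shell
# Build the JSON body for a local image file, sent as base64.
# Newlines are stripped because some base64 implementations wrap their output.
ocr_base64_payload() {
  encoded=$(base64 < "$1" | tr -d '\n')
  printf '{"img_base64":"%s","engine":"tesseract"}\n' "$encoded"
}

# Usage (not run here); piping through "-d @-" avoids very long command lines:
#   ocr_base64_payload ocrimage | curl -X POST -H "Content-Type: application/json" \
#        -d @- http://10.0.2.15:$HTTP_PORT/ocr
```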

The REST API also supports:

Uploading the image content via multipart/related, rather than passing an image URL (example client code is provided in the Go REST client).
Tesseract config vars (e.g., the equivalent of -c arguments when using Tesseract via the command line) and Page Seg Mode
Ability to use an image pre-processing chain, e.g. Stroke Width Transform.
Non-English languages

See the REST API docs and the Go REST client for details.
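As an illustration of the options listed above, the following sketch builds request bodies that pass a language, a page segmentation mode, and a character whitelist. The field names engine_args, lang, psm, and config_vars are taken from requests quoted in the issue reports further down this page, where config_vars is said to belong inside engine_args. They should be checked against the REST API docs. The function names are hypothetical, and tessedit_char_whitelist is the Tesseract variable for whitelisting characters.

```shell
# JSON body with a Tesseract language code ($2) and page segmentation mode ($3).
ocr_payload_with_args() {
  printf '{"img_url":"%s","engine":"tesseract","engine_args":{"lang":"%s","psm":"%s"}}\n' "$1" "$2" "$3"
}

# JSON body that restricts recognition to the characters in $2.
# config_vars is nested inside engine_args (assumption, see the lead-in).
ocr_payload_with_whitelist() {
  printf '{"img_url":"%s","engine":"tesseract","engine_args":{"config_vars":{"tessedit_char_whitelist":"%s"}}}\n' "$1" "$2"
}

# Usage (not run here):
#   curl -X POST -H "Content-Type: application/json" \
#        -d "$(ocr_payload_with_args http://bit.ly/ocrimage eng 3)" \
#        http://10.0.2.15:$HTTP_PORT/ocr
```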

Client Libraries

These client libraries make it easier to invoke the REST API:

Uploading local files using curl

The supplied docs/upload-local-file.sh provides an example of how to upload a local file using curl with multipart/related encoding of the json and image data:

usage: docs/upload-local-file.sh <urlendpoint> <file> [mimetype]
download the example OCR image: wget http://bit.ly/ocrimage
example: docs/upload-local-file.sh http://10.0.2.15:$HTTP_PORT/ocr-file-upload ocrimage

Community

Follow @OpenOCR on Twitter
Check out the GitHub issue tracker

License

OpenOCR is Open Source and available under the Apache 2 License.

open-ocr's Issues

Installation guide for other popular docker PAAS platforms such as GCE

I am new to Docker, but very interested in running open-ocr.

There are several docker PAAS platforms but I prefer to use GCE, as it gives $500 credit which basically allows us to run GCE for one year.

I found openocr at index.docker.io, but after I ran "docker pull tleyden5iwx/open-ocr"
I had no idea how to run it.

Could you kindly provide a detailed guide on how to run open-ocr on GCE?

Thanks a lot.

Support non-english languages

Update OpenOCR to support non-english languages

Failed to connect to 0.0.0.0 port 9292: Connection refused

I have a similar problem to issue #26. I've checked 127.0.0.1:9292, localhost:9292, and 0.0.0.0:9292.
I created the instances using docker-machine on OS X.
Here is the docker ps result:

CONTAINER ID        IMAGE                               COMMAND                  CREATED             STATUS              PORTS                    NAMES
7086f4b1f569        tleyden5iwx/open-ocr                "/opt/open-ocr/open-o"   34 minutes ago      Up 28 minutes       0.0.0.0:9292->9292/tcp   dockercompose_openocr_1
073e0fb08b65        tleyden5iwx/open-ocr-preprocessor   "/opt/open-ocr/open-o"   34 minutes ago      Up 28 minutes                                dockercompose_strokewidthtransform_1
614d0df1ff2b        tleyden5iwx/open-ocr                "/opt/open-ocr/open-o"   34 minutes ago      Up 28 minutes                                dockercompose_openocrworker_1
b6accfeba49b        tutum/rabbitmq                      "/run.sh"                34 minutes ago      Up 28 minutes       5672/tcp, 15672/tcp      dockercompose_rabbitmq_1

docker top on first container:

PID                 USER                COMMAND
4422                root                {open-ocr-httpd} /bin/sh /opt/open-ocr/open-ocr-httpd -amqp_uri amqp://admin:Phaish9ohbaidei6oole@rabbitmq/ -http_port 9292
4533                root                open-ocr-httpd -amqp_uri amqp://admin:Phaish9ohbaidei6oole@rabbitmq/ -http_port 9292
4972                root                bash

hOCR output?

Is there a way to use the REST API to request hOCR output from Tesseract via open-ocr?

Using standalone Tesseract, a command to do this could be:
tesseract -l eng inputfile.png output hocr tess.cfg

Position & Size of text

I'm wondering if it's possible to get the x/y/width/height of the scanned text. Take a physical market receipt (the kind you get in a convenience store): you normally have some text in the middle, then some text left- and right-aligned. Would it be possible to recognize the relative position of every word/line/character? Great project, works like a charm!! :-)

Img URL not parsed from request body

Hello,

I created a fig file which does the same as setup.sh but when I try to call the rest service it seems like the img_url is not passed correctly.

Fig file:

rabbitmq:
  image: tutum/rabbitmq
  dns: 8.8.8.8
  # ports:
  #     - "5672:5672"
  #     - "15672:15672"
  environment:
    - "RABBITMQ_PASS=1234"

openocr:
  image: tleyden5iwx/open-ocr
  dns: 8.8.8.8
  links:
    - rabbitmq
  ports:
    - "9292:9292"
  command: open-ocr-httpd -amqp_uri "amqp://admin:1234@rabbitmq/" -http_port 9292

openocrworker:
  image: tleyden5iwx/open-ocr
  dns: 8.8.8.8
  links:
    - rabbitmq
  command: open-ocr-worker -amqp_uri "amqp://admin:1234@rabbitmq/"

Curl request:

curl -v -X POST -H "Content-Type: application/json" -d '{"img_url":"http://bit.ly/ocrimage","engine":"tesseract"}' http://192.168.59.103:9292/ocr
* Hostname was NOT found in DNS cache
*   Trying 192.168.59.103...
* Connected to 192.168.59.103 (192.168.59.103) port 9292 (#0)
> POST /ocr HTTP/1.1
> User-Agent: curl/7.37.1
> Host: 192.168.59.103:9292
> Accept: */*
> Content-Type: application/json
> Content-Length: 57
>
* upload completely sent off: 57 out of 57 bytes
< HTTP/1.1 500 Internal Server Error
< Content-Type: text/plain; charset=utf-8
< Date: Sun, 25 Jan 2015 22:08:02 GMT
< Content-Length: 71
<
Unable to perform OCR decode.  Error: Timeout waiting for RPC response
* Connection #0 to host 192.168.59.103 left intact

Open OCR output:

Creating restocr_rabbitmq_1...
Creating restocr_openocr_1...
Creating restocr_openocrworker_1...
Attaching to restocr_rabbitmq_1, restocr_openocr_1, restocr_openocrworker_1
openocr_1       | 22:04:17.441712 OCR_HTTP: Starting listener on :9292
rabbitmq_1      | => Securing RabbitMQ with a preset password
rabbitmq_1      | => Done!
rabbitmq_1      | ========================================================================
rabbitmq_1      | You can now connect to this RabbitMQ server using, for example:
rabbitmq_1      |
rabbitmq_1      |     curl --user admin:1234 http://<host>:<port>/api/vhosts
rabbitmq_1      |
rabbitmq_1      | Please remember to change the above password as soon as possible!
rabbitmq_1      | ========================================================================
openocrworker_1 | 22:04:17.809227 OCR_WORKER: Creating new OCR Worker
openocrworker_1 | 22:04:17.809743 OCR_WORKER: Run() called...
openocrworker_1 | 22:04:17.809750 OCR_WORKER: dialing "amqp://admin:1234@rabbitmq/"
rabbitmq_1      |
rabbitmq_1      |               RabbitMQ 3.4.0. Copyright (C) 2007-2014 GoPivotal, Inc.
rabbitmq_1      |   ##  ##      Licensed under the MPL.  See http://www.rabbitmq.com/
rabbitmq_1      |   ##  ##
rabbitmq_1      |   ##########  Logs: /var/log/rabbitmq/[email protected]
rabbitmq_1      |   ######  ##        /var/log/rabbitmq/[email protected]
rabbitmq_1      |   ##########
rabbitmq_1      |               Starting broker... completed with 6 plugins.
openocr_1       | 22:06:01.212527 OCR_HTTP: serveHttp called
openocr_1       | 22:06:01.212827 OCR_CLIENT: dialing "amqp://admin:1234@rabbitmq/"
openocr_1       | 22:06:01.225323 OCR_CLIENT: callbackQueue name: amq.gen-tvP8DR0dvqGbTdVHnLenaA
openocr_1       | 22:06:01.225860 OCR_CLIENT: looping over deliveries..
openocr_1       | 22:06:02.480430 OCR_CLIENT: ocrRequest before: ImgUrl: , EngineType: ENGINE_TESSERACT, Preprocessors: []
openocr_1       | 22:06:02.480525 OCR_CLIENT: publishing with routing key "decode-ocr"
openocr_1       | 22:06:02.480538 OCR_CLIENT: ocrRequest after: ImgUrl: , EngineType: ENGINE_TESSERACT, Preprocessors: []
openocr_1       | 22:08:02.485886 ERROR: Timeout waiting for RPC response -- open-ocr.HandleOcrRequest() at ocr_http_handler.go:80
openocr_1       | 22:08:02.485944 ERROR: Unable to perform OCR decode.  Error: Timeout waiting for RPC response -- open-ocr.(*OcrHttpHandler).ServeHTTP() at ocr_http_handler.go:40
openocr_1       | 22:08:05.672308 OCR_HTTP: serveHttp called
openocr_1       | 22:08:05.672368 ERROR: EOF -- open-ocr.(*OcrHttpHandler).ServeHTTP() at ocr_http_handler.go:30
openocr_1       | 22:08:45.817174 OCR_HTTP: serveHttp called
openocr_1       | 22:08:45.817242 OCR_CLIENT: dialing "amqp://admin:1234@rabbitmq/"
openocr_1       | 22:08:45.826096 OCR_CLIENT: callbackQueue name: amq.gen-CS1047W2EPQv2it79aEgeA
openocr_1       | 22:08:45.826630 OCR_CLIENT: looping over deliveries..
openocr_1       | 22:08:46.501806 OCR_CLIENT: ocrRequest before: ImgUrl: , EngineType: ENGINE_TESSERACT, Preprocessors: []
openocr_1       | 22:08:46.501902 OCR_CLIENT: publishing with routing key "decode-ocr"
openocr_1       | 22:08:46.501905 OCR_CLIENT: ocrRequest after: ImgUrl: , EngineType: ENGINE_TESSERACT, Preprocessors: []
openocr_1       | 22:09:12.505743 OCR_HTTP: serveHttp called
openocr_1       | 22:09:12.505833 OCR_CLIENT: dialing "amqp://admin:1234@rabbitmq/"
openocr_1       | 22:09:12.511872 OCR_CLIENT: callbackQueue name: amq.gen-0xthgpXMHuCWnt9YKEPhKA
openocr_1       | 22:09:12.512488 OCR_CLIENT: looping over deliveries..
openocr_1       | 22:09:13.255830 OCR_CLIENT: ocrRequest before: ImgUrl: , EngineType: ENGINE_TESSERACT, Preprocessors: []
openocr_1       | 22:09:13.255890 OCR_CLIENT: publishing with routing key "decode-ocr"
openocr_1       | 22:09:13.255899 OCR_CLIENT: ocrRequest after: ImgUrl: , EngineType: ENGINE_TESSERACT, Preprocessors: []
openocr_1       | 22:10:46.511147 ERROR: Timeout waiting for RPC response -- open-ocr.HandleOcrRequest() at ocr_http_handler.go:80
openocr_1       | 22:10:46.511213 ERROR: Unable to perform OCR decode.  Error: Timeout waiting for RPC response -- open-ocr.(*OcrHttpHandler).ServeHTTP() at ocr_http_handler.go:40
openocr_1       | 22:11:13.261620 ERROR: Timeout waiting for RPC response -- open-ocr.HandleOcrRequest() at ocr_http_handler.go:80
openocr_1       | 22:11:13.261686 ERROR: Unable to perform OCR decode.  Error: Timeout waiting for RPC response -- open-ocr.(*OcrHttpHandler).ServeHTTP() at ocr_http_handler.go:40
openocr_1       | 22:13:35.549198 OCR_HTTP: serveHttp called
openocr_1       | 22:13:35.549294 OCR_CLIENT: dialing "amqp://admin:1234@rabbitmq/"
openocr_1       | 22:13:35.561996 OCR_CLIENT: callbackQueue name: amq.gen-UYpRwcD9xbPfGVnbEhhjcw
openocr_1       | 22:13:35.562443 OCR_CLIENT: looping over deliveries..
openocr_1       | 22:13:37.276163 OCR_CLIENT: ocrRequest before: ImgUrl: , EngineType: ENGINE_TESSERACT, Preprocessors: []
openocr_1       | 22:13:37.276358 OCR_CLIENT: publishing with routing key "decode-ocr"
openocr_1       | 22:13:37.276376 OCR_CLIENT: ocrRequest after: ImgUrl: , EngineType: ENGINE_TESSERACT, Preprocessors: []
openocr_1       | 22:15:37.283421 ERROR: Timeout waiting for RPC response -- open-ocr.HandleOcrRequest() at ocr_http_handler.go:80
openocr_1       | 22:15:37.283486 ERROR: Unable to perform OCR decode.  Error: Timeout waiting for RPC response -- open-ocr.(*OcrHttpHandler).ServeHTTP() at ocr_http_handler.go:40

It seems like the img_url is not parsed from request body

openocr_1 | 22:09:13.255899 OCR_CLIENT: ocrRequest after: ImgUrl: , EngineType: ENGINE_TESSERACT, Preprocessors: []

Unable to perform OCR decode. Error: dial tcp

After installation, the host says
OpenOCR is running!

But the POST request fails with
Unable to perform OCR decode. Error: dial tcp <hostip>:5672: no route to host

my system:
Ubuntu 2.6.32-573.3.1.el6.x86_64 #1 SMP Thu Aug 13 22:55:16 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

docker:
Docker version 1.7.1, build 786b29d

how to test open-ocr in macbook pro using docker-compose up

Using Macbook Pro

OS X 10.10.5 Yosemite

Docker version 1.10.1, build 9e83765

docker-compose version 1.6.0, build d99cad6

docker-machine version 0.6.0, build e27fb87

How did I start up?

After installing docker toolbox,

I did this:

# this is required. why? i dunno
docker-machine start default

# this runs the 4 instances as daemon
docker-compose up -d
Creating dockercompose_rabbitmq_1
Creating dockercompose_openocr_1
Creating dockercompose_strokewidthtransform_1
Creating dockercompose_openocrworker_1

I am not sure what's my RABBITMQ_HOST ip address in order to test.

The following is my docker-machine env

export DOCKER_TLS_VERIFY="1"
export DOCKER_HOST="tcp://192.168.99.100:2376"
export DOCKER_CERT_PATH="/Users/kim/.docker/machine/machines/default"
export DOCKER_MACHINE_NAME="default"
# Run this command to configure your shell:
# eval $(docker-machine env)

By test, I mean https://github.com/tleyden/open-ocr#test-the-rest-api

curl -X POST -H "Content-Type: application/json" -d '{"img_url":"http://bit.ly/ocrimage","engine":"tesseract"}' http://10.0.2.15:$HTTP_PORT/ocr

How do I change 10.0.2.15 to something else?

Log files and debugging

How can I see the logs of open-ocr after it is run on the orchard servers? Is there a good debugging tool to use with open-ocr?

Worker failed with error: <nil>

orchard docker logs a1c08fd16f0a shows:

06:01:10.840962 PANIC: Worker failed with error: <nil> -- main.main() at main.go:31
panic: Worker failed with error: <nil>

goroutine 1 [running]:
runtime.panic(0x61dca0, 0xc21004b9c0)
    /usr/lib/go/src/pkg/runtime/panic.c:266 +0xb6
github.com/couchbaselabs/logg.LogPanic(0x7285f0, 0x1c, 0x7f88ea0f2d50, 0x1, 0x1)
    /opt/go/src/github.com/couchbaselabs/logg/logg.go:136 +0xec
main.main()
    /opt/go/src/github.com/tleyden/open-ocr/cli-worker/main.go:31 +0x26e

goroutine 5 [syscall]:
runtime.goexit()
    /usr/lib/go/src/pkg/runtime/proc.c:1394

goroutine 15 [finalizer wait]:
runtime.park(0x40c360, 0xbe94e0, 0xbd5a28)
    /usr/lib/go/src/pkg/runtime/proc.c:1342 +0x66
runfinq()
    /usr/lib/go/src/pkg/runtime/mgc0.c:2279 +0x84
runtime.goexit()
    /usr/lib/go/src/pkg/runtime/proc.c:1394

Unable to test SWT preprocessor

I followed the documentation for AWS to set everything up https://github.com/tleyden/open-ocr/wiki/Installation-on-CoreOS-Fleet

Although the OCR works, it doesn't work with the preprocessor.

curl -X POST -H "Content-Type: application/json" -d '{"img_url":"http://bit.ly/ocrimage-swt","engine":"tesseract", "preprocessors":["stroke-width-transform"]}' http://_my_httpd_service:8080/ocr
Unable to perform OCR decode.  Error: Timeout waiting for RPC response

How can I run open-ocr on IBM bluemix

Hi,
Can you give instructions for launching open-ocr on IBM Bluemix?

invalid json should be rejected

invalid json data should be rejected

invalid content (config_vars must be inside engine_args)

{
    "engine": "tesseract",
    "engine_args": {
        "lang": "fra",
        "psm": "3"
    },
      "config_vars": {
        "tessedit_create_hocr": "1",
        "tessedit_pageseg_mode": "1"
    },
    "img_url": "http://i.imgur.com/xYAaDjV.png"
}

fig should be switched to docker-compose

Fig was replaced by Docker Compose.

In order to continue to be relevant, we should change the fig scripts to Docker Compose scripts.

CORS for ajax clients

How about adding a configurable param for the docker images to allow activation of headers for CORS? Maybe it's in there somewhere but I don't see it in the code, and I'm getting CORS problems when I try to drive it with the tutum hosted docker setup from my browser-based angular app.

Something like this:

Access-Control-Allow-Headers: Content-Type
Access-Control-Allow-Methods: GET, POST, OPTIONS
Access-Control-Allow-Origin: *

In general it would be great to be able to make these docker images myself. Maybe I'll have to figure that out too, but your docker-based tutum setup is so easy to use! I don't want to figure all that out at this time.

push image in addition to pull

I've never messed with Go, so I haven't figured out the code yet. The examples show it pulling images. I want to do a push - a POST with different -d '{"img_url":"http://bit.ly/ocrimage","engine":"tesseract"}' maybe. I suppose it supports that already. If not, where do I look in the code to figure out how that's handled so I can tweak it?

docs link seems to show nothing

I've tried a few different browsers and I always get an empty docs dir when I try to view them.

When I run docker-compose up I get an error when pulling strokewidthtransform

This is my output when I run docker-compose up

I have no way of getting it to work.

e4b5185880dc: Pull complete
bb3a20c79671: Pull complete
d180bfcd2661: Pull complete
92bd0cb7f0db: Pull complete
27ca5c22f162: Pull complete
347fcf793eea: Pull complete
Digest: sha256:94dfd2b14c7be479f7db16a9d94374f1510004d3911bb7465a0466b5908ac327
Status: Downloaded newer image for tleyden5iwx/open-ocr:latest
Creating dockercompose_openocr_1
Pulling strokewidthtransform (tleyden5iwx/open-ocr-preprocessor:latest)...
ERROR: Get https://registry-1.docker.io/v2/tleyden5iwx/open-ocr-preprocessor/manifests/latest: Get https://auth.docker.io/token?scope=repository%3Atleyden5iwx%2Fopen-ocr-preprocessor%3Apull&service=registry.docker.io: net/http: TLS handshake timeout

Unable to perform OCR decode. Error: Get http://bit.ly/ocrimage: dial tcp: lookup bit.ly: no such host

root@admaticvel:/home/admatic# docker run -d -p 5672:5672 -p 15672:15672 -e RABBITMQ_PASS=mypass tutum/rabbitmq

root@admaticvel:/home/admatic# docker run -d -p 8080:8080 tleyden5iwx/open-ocr open-ocr-httpd -amqp_uri "amqp://admin:[email protected]/" -http_port 8080

root@admaticvel:/home/admatic# docker run -d tleyden5iwx/open-ocr open-ocr-worker -amqp_uri "amqp://admin:[email protected]/"

root@admaticvel:/home/admatic# curl -X POST -H "Content-Type: application/json" -d '{"img_url":"http://bit.ly/ocrimage","engine":"tesseract"}' http://172.16.0.2:8080/ocr

Unable to perform OCR decode. Error: Get http://bit.ly/ocrimage: dial tcp: lookup bit.ly: no such host

But I can do a wget of the same URL (which means the Internet connection is working).

Use stdout rather than writing to a file

As described here: #18 (comment)

tesseract pre-process by zooming up the given image

Is there a way to let tesseract zoom up the given image, and then do the processing?

How do I know what version of tesseract this is using?

I ask because I want to add training for more languages, and the process differs depending on whether it is Tesseract 2 or Tesseract 3.

The latest is v3.04.01 on Feb 2016

How to get the hocr output

How do I get hOCR as the output for a given image?

HTTP Server won't run on Tutum

I filed this issue to Tutum support:

I'm getting the following error:

[open-ocr-1] [Jun 13 23:45:10.237] 23:45:10.237667 ERROR: open /dev/urandom: no such file or directory -- open-ocr.(*OcrHttpHandler).ServeHTTP() at ocr_http_handler.go:42

Here are the full container logs:

https://app.tutum.co/container/apps/show/e287283d-df41-4fd5-86d2-2a60d71bb499/#container-logs

I'm trying to get OpenOCR running per these instructions:

https://github.com/tleyden/open-ocr/wiki/Installation-on-Tutum

The code for the http listener is here:

https://github.com/tleyden/open-ocr/blob/master/cli-httpd/main.go

Unable to connect from public dns (AWS)

While testing connection it passes on public ip
curl -X POST -H "Content-Type: application/json" -d '{"img_url":"http://bit.ly/ocrimage","engine":"tesseract"}' http://172.31.43.136:8080/ocr
if run from httpd.service instance.

But when I try do it from the same machine but via public DNS:
curl -X POST -H "Content-Type: application/json" -d '{"img_url":"http://bit.ly/ocrimage","engine":"tesseract"}' http://ec2-52-18-254-60.eu-west-1.compute.amazonaws.com:8080/ocr

from my local machine, curl reports:
(7) Failed to connect to ec2-52-17-28-26.eu-west-1.compute.amazonaws.com port 8080: Connection refused

Any ideas?

what is the CURL command for local image file at /tmp/ocrimage.png

How do you run a curl command for a local image file at /tmp/ocrimage.png?

The docs are misleading / confusing about using curl with a local file

http://docs.openocr.apiary.io/

engine_args is ignored when specify preprocessor

I am trying to use preprocessors. I post a request like the following:

curl -X 8080 -H "Content-Type: application/json" -d '{"img_url":"http://foo.bar/baz.jpg", "engine":"tesseract", "preprocessors": ["stroke-width-transform"], "engine_args":{"lang":"jpn"}}' http://10.0.2.15:8080/ocr

But the result I get is invalid; it looks like the English traineddata is used for Japanese text.

win u.-

63 ll}? "

2013fﬁn238 4:32



u n.on
HI ‘ ID“-
~' In v11. 50
E TC 5 V250
‘é‘ﬁ ¥11. 400}?
I 3 "IL Yucca
I '  «12732
M515. :hbun‘cac'a‘w "a.
as-‘n'imagurx

Bea-ﬁﬁw
\I

Rm

909 uy9

s oa-asna-oaoo

v

‘ t

"(mi

In server log:

 00:56:05.939277 OCR_TESSERACT: cmdArgs: [/tmp/4ff50eb5-c8db-47b8-651f-293becc5641e /tmp/4ff50eb5-c8db-47b8-651f-293becc5641e]

There is no -l jpn argument.

Then I post a request like the following:

curl -X 8080 -H "Content-Type: application/json" -d '{"img_url":"http://foo.bar/baz.jpg", "engine":"tesseract", "engine_args":{"lang":"jpn"}}' http://10.0.2.15:8080/ocr

Result seems good.

*g image 口 因

領収 書

犯ー3 彙m 月 惣日 捌
メー夕丶遣賃 \ーL伽 円
遠距離 創引 一 \2ー0 円
固定迎車料金 + \伽 円

遣賃料金計 \ーー'ー印 円
ETc 料金 + \剛円

合計 \ーL 400円

現 金 支払 \ーー,伽円

車輌番号 m2732

毎盧こ`乗車ぁりがとうごさいます〟
ぉ忘れ物は当杜へ

日興 動車鎚爛

ご要望は当社又は

(財凍京タクシ一セン夕一

I think engine_args is ignored when a preprocessor is specified.

I use OpenOCR with Docker-Compose on Ubuntu 14.04.

Ability to provide a character whitelist

Feature request from @barrypitman on twitter:

@OpenOCR cool project! How would I go about providing a character whitelist? And it'd be great to be able to upload a file, not just URL
— Barry Pitman (@barrypitman) June 25, 2014

How do I add more steps such as rescaling, textcleaning etc to improve the performance of the ocr?

I read this

https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality#examples

and let's say I want to add TEXTCLEANER as part of the preprocessing step.

How do I go about adding that to this?

connect to rabbitmq error on Ubuntu 14.04

Hi,
I installed it following the README step by step, but when I test the API, I get an error: Unable to perform OCR decode. Error: dial tcp 127.0.0.1:5672: connection refused.

here is my docker ps
CONTAINER ID   IMAGE                         COMMAND                CREATED          STATUS          PORTS                                              NAMES
113840c08653   tleyden5iwx/open-ocr:latest   open-ocr-worker -amq   11 seconds ago   Up 10 seconds                                                      stoic_pike
5b7c8a8d9340   tleyden5iwx/open-ocr:latest   open-ocr-httpd -amqp   14 seconds ago   Up 12 seconds   0.0.0.0:8080->8080/tcp                             evil_lalande
090b2f66c809   tutum/rabbitmq:latest         /run.sh                46 seconds ago   Up 45 seconds   0.0.0.0:5672->5672/tcp, 0.0.0.0:15672->15672/tcp   nostalgic_brown

Android

How can I write an Android client to consume an open-ocr service that is running on Google Compute Engine?

OCR precision vs Image accuracy - Tuning

As stated on the comments of the related docker repository page (https://registry.hub.docker.com/u/tleyden5iwx/open-ocr/):

I see that ocr precision increases with image size without changing the original image accuracy.

Is it possible to tune ocr precision through any rest parameter?

Tested on a short Italian text captured through different screenshots of the same web page.

I have just checked the REST API documentation but I haven't seen such a thing.

Regards, Giovanni

always return the same

I have my server running. When I go to 192.168.0.164 I see

OpenOCR is running!
Need docs?

Same with localhost.
But when I send a request like this:
curl -X POST -H "Content-Type: application/json" -d '{"img_url":"http://bit.ly/ocrimage","engine":"tesseract"}' http://192.168.0.164:8080

the response is:

OpenOCR is running!
Need docs?
Am I doing anything wrong?

Thanks

Add retry ability if cannot connect to rabbitmq

As seen in #32, if an httpd or worker cannot connect to rabbitmq, it should do a backoff / retry loop rather than getting stuck in a broken state.
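The backoff / retry idea can be sketched at the shell level, for example in a wrapper that waits for RabbitMQ before starting the httpd or worker process. This is an illustration of the technique only, not how the Go code in open-ocr handles it. The function name retry_with_backoff is hypothetical.

```shell
# retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds, doubling the delay after each failure.
# Returns 0 on the first success, 1 once MAX_ATTEMPTS attempts have failed.
retry_with_backoff() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    "$@" && return 0
    [ "$attempt" -ge "$max_attempts" ] && return 1
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Usage (not run here; assumes a netcat that supports -z and a host named rabbitmq):
#   retry_with_backoff 6 1 nc -z rabbitmq 5672 && echo "rabbitmq is reachable"
```

With 6 attempts and an initial delay of 1 second, the waits are 1, 2, 4, 8, and 16 seconds, so the wrapper gives up after about half a minute instead of looping forever.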

Error: Unable to unmarshal json while running open-ocr on Tutum

Hi,

I was trying to run open-ocr on Tutum for a personal project.

I've followed the instructions on https://github.com/tleyden/open-ocr/wiki/Installation-on-Tutum

I've completed all the steps but when I finally try to use it, I get the following error:

Unable to unmarshal json

The command I am running is:

C:\Users\sarkar\Downloads\curl-7.45.0-win64-mingw\bin>curl -X POST -H "Content-Type: application/json" -d '{"img_url":"http://bit.ly/ocrimage","engine":"tesseract"}' http://open-ocr-httpd-1.avinashsonee.cont.tutum.io:32772/ocr

In tutum logs, I get the following error:

open-ocr-httpd-1 | 2015-11-19T15:33:22.424910294Z 15:33:22.424668 OCR_HTTP: Starting listener on :8080
open-ocr-httpd-1 | 2015-11-19T15:34:23.939174296Z 15:34:23.938799 OCR_HTTP: serveHttp called
open-ocr-httpd-1 | 2015-11-19T15:34:23.939211461Z 15:34:23.938850 ERROR: invalid character ''' looking for beginning of value -- open-ocr.(*OcrHttpHandler).ServeHTTP() at ocr_http_handler.go:30

Attached below are my configurations on Tutum for your reference.

Let me know if you require any more details.

Thanks :)

fleetifying tleyden/open-ocr

I have trouble getting the open-ocr-worker up with fleet.

I have taken "docker pull" out of the fleet files, since those calls have been error prone when getting httpd and rabbitmq up. So for test purpose I do the pull manually before fleeting. Also, I'm running two nodes hence the Conflict configurations.

These are my fleet files (beware: hard coded config for now):

core@core-02 ~/share $ more httpd.service
[Unit]
Description=httpd
Requires=rabbitmq.service

[Service]
TimeoutStartSec=1
ExecStartPre=-/usr/bin/docker kill httpd
ExecStartPre=-/usr/bin/docker rm httpd
ExecStart=/usr/bin/docker run -p 8080:8080 --name httpd tleyden5iwx/open-ocr open-ocr-httpd -amqp_uri "amqp://admin:[email protected]" -http_port 8080
ExecStop=/usr/bin/docker stop httpd

core@core-02 ~/share $ more rabbitmq.service
[Unit]
Description=rabbitmq

[Service]
TimeoutStartSec=0
ExecStartPre=-/usr/bin/docker kill rabbitmq
ExecStartPre=-/usr/bin/docker rm rabbitmq
ExecStartPre=/usr/bin/docker pull tutum/rabbitmq
ExecStart=/usr/bin/docker run --name rabbitmq -p 5672:5672 -p 15672:15672 -e RABBITMQ_PASS=rabbit1 tutum/rabbitmq
ExecStop=/usr/bin/docker stop rabbitmq

core@core-02 ~/share $ more tesseract.service
[Unit]
Description=tesseract
After=rabbitmq.service
Requires=rabbitmq.service

[Service]
TimeoutStartSec=30
ExecStartPre=-/usr/bin/docker kill tesseract
ExecStartPre=-/usr/bin/docker rm tesseract
ExecStart=/usr/bin/docker run --name tesseract tleyden5iwx/open-ocr open-ocr-worker -amqp_uri "amqp://admin:[email protected]"
ExecStop=/usr/bin/docker stop tesseract

[X-Fleet]
Conflicts=rabbitmq*
Conflicts=httpd*

Integrating libreoffice/pdftk for pdf, docx, etc support

Related to #17.

I wrote a bash script a while ago (https://github.com/jlyon/ocr-anything) that analyzes the mimetype headers of the document and does the following:

For images, run tesseract
For image pdfs, split each page out into an image and run tesseract
For pdfs with embedded text, extract the text with pdftk
For Office documents (docx, Excel, etc), extract the text with Libre Office (run as headless)

I would like to try to integrate the efforts with open-ocr. What do you think is the best way to go about this? This is my thinking:

Mimetype analyzer preprocessor: analyzes the mimetype, sets additional preprocessors and the engine
PDF splitter preprocessor: queues up multiple images as their own engine task (for multipage scanned pdfs)
PDF engine: extracts embedded pdf text
LibreOffice engine: for docx, etc files

Does this seem right?
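
The routing proposed above can be sketched as a single lookup from mimetype to a processing plan. This is only an illustration under stated assumptions: the preprocessor name `pdf-splitter` and the engine names `pdf-text` and `libreoffice` are hypothetical and are not confirmed to exist in open-ocr, and the `has_embedded_text` flag stands in for whatever check would detect a text layer in a PDF.

```python
# Hypothetical names throughout: none of these preprocessor or engine
# identifiers (other than "tesseract") are confirmed to exist in open-ocr.
OFFICE_MIMETYPES = {
    "application/msword",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.ms-excel",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


def plan_for_document(mimetype, has_embedded_text=False):
    """Map a mimetype to the preprocessors and engine that should handle it."""
    if mimetype.startswith("image/"):
        return {"preprocessors": [], "engine": "tesseract"}
    if mimetype == "application/pdf":
        if has_embedded_text:
            # Text is already in the PDF, so no OCR is needed.
            return {"preprocessors": [], "engine": "pdf-text"}
        # Scanned PDF: split into page images, then OCR each page.
        return {"preprocessors": ["pdf-splitter"], "engine": "tesseract"}
    if mimetype in OFFICE_MIMETYPES:
        return {"preprocessors": [], "engine": "libreoffice"}
    raise ValueError("unsupported mimetype: " + mimetype)


print(plan_for_document("application/pdf"))
```

Keeping the decision in one function means the mimetype analyzer only has to set fields on the request, and the splitter and engines stay unaware of each other.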

Rabbit abuse

One picture worth a thousand words, most of which start w/ the letter F.

http://tleyden-misc.s3.amazonaws.com/blog_images/rabbit_mq.png

Update Home Page for New RabbitMQ var

export RABBITMQ_PASS=supersecret2 HTTP_PORT=8080 RABBITMQ_HOST=tcp://10.0.2.15:2375

Bounding boxes for characters or words

I don't know tesseract very well, but is it possible to get bounding boxes for words or characters?

feature request: howto add custom preprocessor

I would like to add custom preprocessors (such as cropping, page rotation, or splitting a PDF with multiple pages).

Do you have any HOWTO documentation on adding custom preprocessors?

What comes to my mind would be:

creating a custom docker container based on a openocr-preprocessor base image.
adding my custom binary converter
some way to register my custom preprocessor on running my container.
the base container will call my custom binary using the incoming data

This way we could create custom preprocessor in any language (shell, go, ruby...)
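
The idea above, a base process that hands incoming data to a registered external binary, can be sketched as follows. This is a minimal illustration and not how open-ocr works today: the registry, the function names, and the stdin/stdout contract are all assumptions. A one-line Python program stands in for the custom binary so that the sketch runs anywhere.

```python
import subprocess
import sys

# Hypothetical registry: preprocessor name -> command line to execute.
PREPROCESSORS = {}


def register_preprocessor(name, command):
    """Register an external command (a list of argv strings) under a name."""
    PREPROCESSORS[name] = list(command)


def run_preprocessor(name, data, timeout=60):
    """Pipe the incoming bytes to the command's stdin and return its stdout."""
    result = subprocess.run(
        PREPROCESSORS[name],
        input=data,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout


# Stand-in for a custom binary: a one-line Python program that reverses
# the bytes it reads. A real preprocessor would crop or rotate an image.
register_preprocessor(
    "reverse-bytes",
    [sys.executable, "-c",
     "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read()[::-1])"],
)

print(run_preprocessor("reverse-bytes", b"abc"))
```

Because the contract is just bytes in on stdin and bytes out on stdout, the registered command could be written in shell, Go, Ruby, or anything else available in the container.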

RPC Timeout

This may be an easy one. I'm simply trying to run the stroke width transform. The setup works great for straight tesseract, but when I run the test curl for the stroke width transform, I get an RPC timeout. My suspicion is that my box is taking too long with the transform, so the RPC is kicking back that general network error, assuming the call has failed (it can't know whether it's a slow network or something else).

Here is the error:
john@...$ curl -X POST -H "Content-Type: application/json" -d '{"img_url":"http://bit.ly/ocrimage-swt","engine":"tesseract", "preprocessors":["stroke-width-transform"]}' http://${DOCKER_HOST}:${HTTP_PORT}/ocr

Unable to perform OCR decode. Error: Timeout waiting for RPC response

Thoughts?

Thanks!

ERROR: Get http://bit.ly/ocrimage: dial tcp: lookup bit.ly: no such host

After setting up the docker containers, the following error is given while testing

ERROR: Get http://bit.ly/ocrimage: dial tcp: lookup bit.ly: no such host

It seems that the Docker container has no access to the outside network.

REST API call with local image curl: (52) Empty reply from server

Could you please help with the curl command for open-ocr for a local image (rather than a remote URL)?

I tried the reference API doc too, but it gave me no results.

Please help
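
One way to send a local image is to embed it in the JSON body as base64. The sketch below assumes the request field is named `img_base64`, as suggested by the README's "With image base64" section; treat that name as an assumption and check it against the REST API docs. The helper name `build_base64_body` is made up for illustration.

```python
import base64
import json


def build_base64_body(image_bytes, engine="tesseract"):
    """Embed raw image bytes in a JSON request body as base64 text."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    # "img_base64" is assumed from the README's "With image base64" section.
    return json.dumps({"img_base64": encoded, "engine": engine})


# In practice the bytes would come from reading the image file in binary
# mode; a short literal keeps the sketch self-contained.
body = build_base64_body(b"\x89PNG\r\n\x1a\n")
print(body)
```

The resulting string can be saved to a file and posted with curl using `-H "Content-Type: application/json" -d @body.json`, which also avoids shell quoting problems with large payloads.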

Test the REST API -- Connection Refused Issue

Everything in the install went well up to the "Test the REST API" step, where I get a "connection refused" error.

For the curl command, I am substituting my local IP address (at this time, 10.58.105.52) for the example's IP, so:
curl -X POST -H "Content-Type: application/json" -d '{"img_url":"http://bit.ly/ocrimage","engine":"tesseract"}' http://10.58.105.52:$HTTP_PORT/ocr

I get a:
Failed to connect to 10.58.105.52 port 8080: Connection refused

Any clues?

Cannot load Chinese language

Try passing: "engine_args": {"lang": "chi-sim"} to the REST API.
You should see this:

openocrworker_1 | 02:50:41.532795 OCR_WORKER: got 255962 byte delivery: [1]. Routing key: decode-ocr Reply to: amq.gen-kOrQgLhd316ODB9hX370sg
openocrworker_1 | 02:50:41.540446 OCR_TESSERACT: cmdArgs: [/tmp/2c9ed9c6-81d3-4a86-76e0-bb84d99cdbb2 /tmp/2c9ed9c6-81d3-4a86-76e0-bb84d99cdbb2 -l chi-sim]
openocrworker_1 | 02:50:41.546023 OCR_TESSERACT: Error exec tesseract: exit status 1 Tesseract Open Source OCR Engine v3.03 with Leptonica
openocrworker_1 | Error opening data file /usr/share/tesseract-ocr/tessdata/chi-sim.traineddata
openocrworker_1 | Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
openocrworker_1 | Failed loading language 'chi-sim'
openocrworker_1 | Tesseract couldn't load any languages!
openocrworker_1 | Could not initialize tesseract.
openocrworker_1 | 02:50:41.546128 ERROR: Error processing image url: . Error: exit status 1 -- open-ocr.(*OcrRpcWorker).resultForDelivery() at ocr_rpc_worker.go:182
openocrworker_1 | 02:50:41.546143 ERROR: Error generating ocr result. Error: exit status 1 -- open-ocr.(*OcrRpcWorker).handle() at ocr_rpc_worker.go:144
openocrworker_1 | 02:50:41.546153 OCR_WORKER: Sending rpc response: {Error processing image url: . Error: exit status 1}
openocrworker_1 | 02:50:41.546168 OCR_WORKER: sendRpcResponse to: amq.gen-kOrQgLhd316ODB9hX370sg
openocr_1 | 02:50:41.546578 OCR_CLIENT: got 51B delivery: [1] "Error processing image url: . Error: exit status 1". Reply to:
openocr_1 | 02:50:41.546597 OCR_CLIENT: send result to rpcResponseChan
openocr_1 | 02:50:41.546602 OCR_CLIENT: sent result to rpcResponseChan
openocr_1 | 02:50:41.547511 OCR_HTTP: ocrResult: {Error processing image url: . Error: exit status 1}
openocrworker_1 | 02:50:41.548791 OCR_WORKER: sendRpcResponse succeeded
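
The log shows Tesseract looking for `chi-sim.traineddata`. Tesseract's own language codes use underscores (Simplified Chinese is `chi_sim`), so the hyphenated name may simply not match any data file. A small sketch of normalizing the code before building the request follows; the helper names are hypothetical, and whether the worker image ships the `chi_sim` data file is not confirmed here.

```python
import json


def normalize_lang(lang):
    """Convert a hyphenated code such as 'chi-sim' to Tesseract's 'chi_sim'."""
    return lang.replace("-", "_")


def build_lang_body(img_url, lang):
    """Build a request body that passes the language through engine_args."""
    return json.dumps({
        "img_url": img_url,
        "engine": "tesseract",
        "engine_args": {"lang": normalize_lang(lang)},
    })


print(build_lang_body("http://bit.ly/ocrimage", "chi-sim"))
```

If the error persists with `chi_sim`, the data file is probably missing from the directory named in the log, which would need to be fixed in the worker image rather than in the request.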

Open-OCR with preprocessor doesn't work

Open-OCR with a preprocessor doesn't work. I tried the curl command below:

curl -X POST -H "Content-Type: application/json" -d '{"img_url":"http://bit.ly/ocrimage","engine":"tesseract", "preprocessors":["stroke-width-transform"]}' http://172.16.0.20:8080/ocr

There is no result; the request runs indefinitely and ends in a timeout. Without the preprocessor it works.

I followed these steps on our local Docker host:

docker run -d -p 5672:5672 -p 15672:15672 -e RABBITMQ_PASS=mypass tutum/rabbitmq

docker run -d -p 8080:8080 tleyden5iwx/open-ocr open-ocr-httpd -amqp_uri "amqp://admin:[email protected]/" -http_port 8080

docker run -d tleyden5iwx/open-ocr open-ocr-worker -amqp_uri "amqp://admin:[email protected]/"

curl -X POST -H "Content-Type: application/json" -d '{"img_url":"http://bit.ly/ocrimage","engine":"tesseract"}' http://172.16.0.20:8080/ocr

But this works:

curl -X POST -H "Content-Type: application/json" -d '{"img_url":"http://bit.ly/ocrimage","engine":"tesseract"}' http://172.16.0.20:8080/ocr

Support file upload in addition to a file URL

Feature request from @barrypitman on twitter:

@OpenOCR cool project! How would I go about providing a character whitelist? And it'd be great to be able to upload a file, not just URL
— Barry Pitman (@barrypitman) June 25, 2014

Memory leak

The container running the ocr-worker has been running for a while on a Docker host with low memory limits (256 MB). From looking at the memory usage graphs, it's clear that it's leaking memory.

feature: support hocr configfile for tesseract

How can I pass a configfile parameter to the Tesseract engine?

tesseract imagename|stdin outputbase|stdout [options…] [configfile…]

see tesseract doc

I use the hocr config file

hOCR is an open standard of data representation for formatted text obtained from OCR
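
For reference, the usage line above puts config file names after the options, so requesting hOCR output means appending `hocr` to the end of the command. The sketch below only assembles such an argument list; it is an illustration of the command-line shape and not open-ocr's actual code, and the function name and parameters are hypothetical.

```python
def build_tesseract_args(image_path, output_base, lang=None, configfiles=()):
    """Assemble an argv list following the tesseract usage line above."""
    args = ["tesseract", image_path, output_base]
    if lang:
        args += ["-l", lang]  # options come before config file names
    args += list(configfiles)  # e.g. "hocr" to request hOCR output
    return args


print(build_tesseract_args("page.png", "page", lang="eng", configfiles=["hocr"]))
```

Supporting this in the REST API would then come down to accepting a list of config file names in the request and appending them in this position.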
