dcgm / pero-ocr

License: BSD 3-Clause "New" or "Revised" License

Python 98.96% Dockerfile 0.11% Shell 0.93%

pero-ocr's Introduction

pero-ocr

The package provides a full OCR pipeline including text paragraph detection, text line detection, text transcription, and text refinement using a language model. It can be used as a command line application or as a Python package, which provides a document processing class and a class representing document page content.

Please cite

If you use pero-ocr, please cite:

O Kodym, M Hradiš: Page Layout Analysis System for Unconstrained Historic Documents. ICDAR, 2021.
M Kišš, K Beneš, M Hradiš: AT-ST: Self-Training Adaptation Strategy for OCR in Domains with Limited Transcriptions. ICDAR, 2021.
J Kohút, M Hradiš: TS-Net: OCR Trained to Switch Between Text Transcription Styles. ICDAR, 2021.

Running stuff

Scripts (as well as tests) assume that it is possible to import pero_ocr and its components.

For the current shell session, this can be achieved by setting PYTHONPATH up:

export PYTHONPATH=/path/to/the/repo:$PYTHONPATH

As a more permanent solution, a very simplistic setup.py is prepared:

python setup.py develop

Beware that the setup.py does not promise to bring all the required stuff, e.g. setting CUDA up is up to you.

Pero can be later removed from your Python distribution by running:

python setup.py develop --uninstall
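
To confirm that either approach worked, a quick check like the one below can be run from the same shell. This is only a sketch: `is_importable` is a hypothetical helper name, not part of pero-ocr, and it relies solely on the standard library's `importlib.util.find_spec`.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be found on the current import path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_importable("pero_ocr"):
        print("pero_ocr is importable")
    else:
        print("pero_ocr not found; check PYTHONPATH or run setup.py develop")
```

The check only locates the package. It does not import it, so missing dependencies such as CUDA libraries would still surface later.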

Available models

General layout analysis (printed and handwritten) with European printed OCR specialized for Czech newspapers can be downloaded here. The OCR engine is suitable for most European printed documents. It is specialized for low-quality Czech newspapers digitized from microfilms, but it provides very good results for almost all types of printed documents in most languages. If you are interested in processing printed Fraktur fonts, handwritten documents or medieval manuscripts, feel free to contact the authors. The newest OCR engines are available at pero-ocr.fit.vutbr.cz. OCR engines are also available through an API running at pero-ocr.fit.vutbr.cz/api (see its GitHub repository).

Command line application

A command line application is ./user_scripts/parse_folder.py. It is able to process images in a directory using an OCR engine. It can render detected lines in an image and provide document content in Page XML and ALTO XML formats. Additionally, it is able to crop all text lines as rectangular regions of normalized size and save them into separate image files.

Running command line application in container

A Docker container can be built from the source code to run scripts and programs based on pero-ocr. Example of running the parse_folder.py script to generate page-xml files for images in an input directory:

docker run --rm --tty --interactive \
     --volume path/to/input/dir:/input \
     --volume path/to/output/dir:/output \
     --volume path/to/ocr/engine:/engine \
     --gpus all \
     pero-ocr /usr/bin/python3 user_scripts/parse_folder.py \
          --config /engine/config.ini \
          --input-image-path /input \
          --output-xml-path /output

Be sure to use container-internal paths for the data passed in the command. Because of container isolation, all input and output data locations have to be passed to the container via the --volume argument. See the docker run command reference for more information.
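
The split between host paths and container paths can be made explicit by building the command programmatically. The sketch below assumes the same image name (pero-ocr), mount points and script flags as the command above. `docker_parse_folder_cmd` is a hypothetical helper, and the --tty and --interactive flags are left out because the command is meant to be launched from a script.

```python
from pathlib import Path


def docker_parse_folder_cmd(input_dir, output_dir, engine_dir, gpus="all"):
    """Build the argument list for the docker run call shown above."""
    # (host directory, mount point inside the container)
    mounts = [
        (Path(input_dir), "/input"),
        (Path(output_dir), "/output"),
        (Path(engine_dir), "/engine"),
    ]
    cmd = ["docker", "run", "--rm"]
    for host_dir, container_dir in mounts:
        # Resolve to an absolute host path; a bare relative name
        # may be interpreted by Docker as a named volume.
        cmd += ["--volume", f"{host_dir.resolve()}:{container_dir}"]
    cmd += [
        "--gpus", gpus,
        "pero-ocr", "/usr/bin/python3", "user_scripts/parse_folder.py",
        # Everything below refers to paths inside the container.
        "--config", "/engine/config.ini",
        "--input-image-path", "/input",
        "--output-xml-path", "/output",
    ]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Only the `--volume` entries mention host directories, and the script arguments always use the container-side mount points.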

Container can be built like this:

docker build -f Dockerfile -t pero-ocr .

Integration of the pero-ocr python module

This example shows how to use the OCR pipeline provided by the pero-ocr package directly, and thus how to integrate pero-ocr into other applications. The PageLayout class represents the content of a single document page; it can be loaded from Page XML and exported to Page XML and ALTO XML formats. The OCR pipeline is represented by the PageParser class.

import os
import configparser
import cv2
import numpy as np
from pero_ocr.core.layout import PageLayout
from pero_ocr.document_ocr.page_parser import PageParser

# Read config file.
config_path = "./config_file.ini"
config = configparser.ConfigParser()
config.read(config_path)

# Init the OCR pipeline. 
# You have to specify config_path to be able to use relative paths
# inside the config file.
page_parser = PageParser(config, config_path=os.path.dirname(config_path))

# Read the document page image.
input_image_path = "page_image.jpg"
image = cv2.imread(input_image_path, 1)

# Init empty page content. 
# This object will be updated by the ocr pipeline. id can be any string and it is used to identify the page.
page_layout = PageLayout(id=input_image_path,
     page_size=(image.shape[0], image.shape[1]))

# Process the image by the OCR pipeline
page_layout = page_parser.process_page(image, page_layout)

page_layout.to_pagexml('output_page.xml') # Save results as Page XML.
page_layout.to_altoxml('output_ALTO.xml') # Save results as ALTO XML.

# Render detected text regions and text lines into the image and
# save it into a file.
rendered_image = page_layout.render_to_image(image) 
cv2.imwrite('page_image_render.jpg', rendered_image)

# Save each cropped text line in a separate .jpg file.
for region in page_layout.regions:
    for line in region.lines:
        cv2.imwrite(f'file_id-{line.id}.jpg', line.crop.astype(np.uint8))
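
To process a whole folder rather than a single page, the example can be wrapped in a loop. The sketch below covers only the bookkeeping part and uses just the standard library. `plan_outputs` and the list of image extensions are assumptions made for illustration, not part of pero-ocr.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to the formats you actually use.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def plan_outputs(input_dir, output_dir):
    """Pair every image in input_dir with the Page XML path to write for it."""
    pairs = []
    for image_path in sorted(Path(input_dir).iterdir()):
        if image_path.suffix.lower() in IMAGE_SUFFIXES:
            xml_path = Path(output_dir) / (image_path.stem + ".xml")
            pairs.append((image_path, xml_path))
    return pairs


# Intended use with the objects from the example above (not executed here):
# for image_path, xml_path in plan_outputs("images", "xml"):
#     image = cv2.imread(str(image_path), 1)
#     page_layout = PageLayout(id=image_path.stem,
#                              page_size=(image.shape[0], image.shape[1]))
#     page_layout = page_parser.process_page(image, page_layout)
#     page_layout.to_pagexml(str(xml_path))
```

Creating the PageParser once and reusing it for all pages avoids reloading the models for every image.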

Contributing

Working changes are expected to happen on the develop branch, so if you plan to contribute, check it out directly when cloning:

git clone -b develop git@github.com:DCGM/pero-ocr.git pero-ocr

Testing

Currently, only unit tests are provided, and they cover only some of the code. Simply run your preferred test runner, e.g.:

~/pero-ocr $ green

Simple regression testing

Regression testing can be done with test/processing_test.sh. The script calls the containerized parse_folder.py to process input images and page-xml files, then calls a user-supplied comparison script to compare the outputs with example outputs supplied by the user. The PERO-OCR container has to be built in advance to run the test; see the 'Running command line application in container' chapter. The script can be called like this:

sh test/processing_test.sh \
     --input-images path/to/input/image/directory \
     --input-xmls path/to/input/page-xml/directory \
     --output-dir path/to/output/dir \
     --configuration path/to/ocr/engine/config.ini \
     --example path/to/example/output/data \
     --test-utility path/to/test/script \
     --test-output path/to/testscript/output/dir \
     --gpu-ids gpu ids for docker container

The first 4 arguments are mandatory. --gpu-ids defaults to 'all', which passes all GPUs to the container. The test utility, example outputs and test output folder have to be set only if the results should be compared. The test utility is expected to be the path to the eval_ocr_pipeline_xml.py script from the pero repository. Be sure to set PYTHONPATH correctly and install the dependencies of the pero repository for the utility to work. Another script can be used if it takes the same arguments. Otherwise, the output data can of course be compared manually after processing.

pero-ocr's Issues

Layout analysis crashes

Crashed on two files in my new collection. Problem in live system.

Job ID: fb48773658124afab23ac9854ea5e56d
Document ID: 1e4d33dc189c4a2bb93eaebf722432e4
Image: 9823218f-12c1-4ede-ba68-897e055e5580
Errors:
Processing 9823218f-12c1-4ede-ba68-897e055e5580
ERROR: Failed to process file 9823218f-12c1-4ede-ba68-897e055e5580.
The operation 'GEOSUnion_r' could not be performed. Likely cause is invalidity of the geometry <shapely.geometry.polygon.Polygon object at 0x7f249c0cd050>

Traceback (most recent call last):
File "/home/pero/pero/pero-ocr/user_scripts/parse_folder.py", line 205, in main
page_layout = page_parser.process_page(image, page_layout)
File "/home/pero/pero/pero-ocr/pero_ocr/document_ocr/page_parser.py", line 372, in process_page
page_layout = layout_parser.process_page(image, page_layout)
File "/home/pero/pero/pero-ocr/pero_ocr/document_ocr/page_parser.py", line 169, in process_page
p_list, b_list, h_list, t_list = self.engine.detect(img, rot=rot)
File "/home/pero/pero/pero-ocr/pero_ocr/layout_engines/cnn_layout_engine.py", line 127, in detect
region_poly = helpers.region_from_textlines(region_textlines)
File "/home/pero/pero/pero-ocr/pero_ocr/layout_engines/layout_helpers.py", line 100, in region_from_textlines
region_poly = region_poly.union(textline_poly)
File "/home/pero/python_environment/pero_ocr_web_clients/lib/python3.7/site-packages/shapely/geometry/base.py", line 658, in union
return geom_factory(self.impl['union'](self, other))
File "/home/pero/python_environment/pero_ocr_web_clients/lib/python3.7/site-packages/shapely/topology.py", line 70, in __call__
self._check_topology(err, this, other)
File "/home/pero/python_environment/pero_ocr_web_clients/lib/python3.7/site-packages/shapely/topology.py", line 38, in _check_topology
self.fn.__name__, repr(geom)))
shapely.errors.TopologicalError: The operation 'GEOSUnion_r' could not be performed. Likely cause is invalidity of the geometry <shapely.geometry.polygon.Polygon object at 0x7f249c0cd050>
TopologyException: Input geom 1 is invalid: Self-intersection at or near point 2347.0777238895662 -44.069123013668701 at 2347.0777238895662 -44.069123013668701

Website typo Layout Analysis

I suppose website related issues can also be mentioned here.

I noticed a typo for selecting the layout analysis.
Shouldn't Select baseline detector be Select layout detector?

PERO generates 0kB ALTO files

checking for zero ALTO file size would probably help
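
A check of this kind could be sketched as follows. It is an illustration only: `find_empty_files` is a hypothetical helper, and the assumption that the ALTO files carry an .xml extension and sit under one output directory comes from the command line example, not from the live system.

```python
from pathlib import Path


def find_empty_files(output_dir, pattern="*.xml"):
    """Return output files matching pattern whose size is zero bytes."""
    return sorted(
        path
        for path in Path(output_dir).rglob(pattern)
        if path.is_file() and path.stat().st_size == 0
    )
```

Running this after a batch finishes and reprocessing, or at least reporting, the returned paths would catch the 0 kB exports before they reach users.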

Out of memory in layout parsing

~/PERO/ocr_client_data/97a7fb57-93ed-4b02-a7d4-de313daca5bf/images/a0126ddf-ed27-4ab5-90d1-4257b1fa6d23.jpeg

Problem with the pretrained model not available

File "/usr/local/lib/python3.9/dist-packages/torch/jit/_serialization.py", line 149, in load
raise ValueError(f"The provided filename {f} does not exist") # type: ignore[str-bytes-safe]
ValueError: The provided filename /opt/pero/pero-ocr/ocr_model/checkpoint_646000.ckpt does not exist

Line crop fails, probably due to empty mapping

Error log:
line_coords = self.get_crop_inputs(baseline, height, self.line_height)
Traceback (most recent call last):
File "/home/pero/PERO/pero-ocr/user_scripts/parse_folder.py", line 176, in main
page_layout = page_parser.process_page(image, page_layout)
File "/home/pero/PERO/pero-ocr/pero_ocr/document_ocr/page_parser.py", line 408, in process_page
page_layout = self.line_cropper.process_page(image, page_layout)
File "/home/pero/PERO/pero-ocr/pero_ocr/document_ocr/page_parser.py", line 348, in process_page
line.crop = self.crop_engine.crop(img, line.baseline, line.heights)
File "/home/pero/PERO/pero-ocr/pero_ocr/document_ocr/crop_engine.py", line 78, in crop
interpolation=cv2.INTER_LINEAR, borderMode=cv2.BORDER_CONSTANT)
cv2.error: OpenCV(4.2.0) /io/opencv/modules/imgproc/src/imgwarp.cpp:1703: error: (-215:Assertion failed) !_map1.empty() in function 'remap'

Clustering layout probably fails on pages/regions with no lines?

Data in BUGS/a69eb9c4-ae17-4429-aa70-c636ee0051b0
log:
ERROR: Failed to process file 9d24471a-280b-4e2b-a175-d65910c7c548.
need at least one array to concatenate
Traceback (most recent call last):
File "/home/pero/PERO/pero-ocr/user_scripts/parse_folder.py", line 176, in main
page_layout = page_parser.process_page(image, page_layout)
File "/home/pero/PERO/pero-ocr/pero_ocr/document_ocr/page_parser.py", line 404, in process_page
page_layout = self.layout_parser.process_page(image, page_layout)
File "/home/pero/PERO/pero-ocr/pero_ocr/document_ocr/page_parser.py", line 141, in process_page
polygons_list, baselines_list, heights_list, textlines_list = self.region_engine.detect(img)
File "/home/pero/PERO/pero-ocr/pero_ocr/region_engine/region_engine_splic.py", line 65, in detect
region_poly_points = np.concatenate(region_textlines, axis=0)
File "<array_function internals>", line 6, in concatenate
ValueError: need at least one array to concatenate

Transcription

For old Latin transcription, which model should I select to generate the OCR of the image below?

Getting KeyError

I was trying the pero-ocr on a png image with table and text but got the error below. Please, how do I resolve this?

training model

Hello again, just wondering where I can find the code that can be used to train a handwritten text recognition model.
I only find in this repository code which can be used to score an existing image, not for training a model.

Music pull request feedback

@vlachvojta :

@vlachvojta with @ikiss-fit :

Check API and web compatibility (after adding line confidence in the OCR engine)

add support for gif format

Do not mix lines when exporting txt

When exporting txt (at least txt), export the lines in the order they appear on page or in column to column order. Currently, they are sometimes ordered rather strangely. See https://pero-ocr.fit.vutbr.cz/ocr/show_results/2269d1ae-d61c-4129-9c7c-3c78bc81cd8b
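
One naive way to get a column-by-column order is sketched below. It is not how pero-ocr orders lines. It assumes that every region exposes `lines`, that every line has a non-empty `baseline` given as (x, y) points, and that regions correspond roughly to columns. `lines_in_reading_order` is a hypothetical helper.

```python
def _baseline_start(line):
    """Leftmost x and mean y of a baseline given as (x, y) points."""
    xs = [point[0] for point in line.baseline]
    ys = [point[1] for point in line.baseline]
    return min(xs), sum(ys) / len(ys)


def lines_in_reading_order(regions):
    """Order lines region by region (left to right), top to bottom inside each."""

    def region_key(region):
        starts = [_baseline_start(line) for line in region.lines]
        # Sort regions by their left edge first, then by their top line.
        return (min(x for x, _ in starts), min(y for _, y in starts))

    ordered = []
    non_empty = [region for region in regions if region.lines]
    for region in sorted(non_empty, key=region_key):
        ordered.extend(sorted(region.lines, key=lambda l: _baseline_start(l)[1]))
    return ordered
```

Sorting by the exact left edge is fragile when columns are slightly skewed, so a real implementation would likely need to cluster regions into columns with some tolerance.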

ALTO export BUG

Export fails when text line has no points?

For example, document c1951833-8440-4851-93b5-6dfc6c3663bf, second page fe55b56c-341e-48d3-82ac-e3a971a0a124.

Error:
Aug 31 07:59:00 pero-ocr gunicorn[12175]: Traceback (most recent call last):
Aug 31 07:59:00 pero-ocr gunicorn[12175]: File "/home/pero/env/pero-ocr/lib/python3.6/site-packages/flask/app.py", line 2447, in wsgi_app
Aug 31 07:59:00 pero-ocr gunicorn[12175]: response = self.full_dispatch_request()
Aug 31 07:59:00 pero-ocr gunicorn[12175]: File "/home/pero/env/pero-ocr/lib/python3.6/site-packages/flask/app.py", line 1952, in full_dispatch_request
Aug 31 07:59:00 pero-ocr gunicorn[12175]: rv = self.handle_user_exception(e)
Aug 31 07:59:00 pero-ocr gunicorn[12175]: File "/home/pero/env/pero-ocr/lib/python3.6/site-packages/flask/app.py", line 1821, in handle_user_exception
Aug 31 07:59:00 pero-ocr gunicorn[12175]: reraise(exc_type, exc_value, tb)
Aug 31 07:59:00 pero-ocr gunicorn[12175]: File "/home/pero/env/pero-ocr/lib/python3.6/site-packages/flask/_compat.py", line 39, in reraise
Aug 31 07:59:00 pero-ocr gunicorn[12175]: raise value
Aug 31 07:59:00 pero-ocr gunicorn[12175]: File "/home/pero/env/pero-ocr/lib/python3.6/site-packages/flask/app.py", line 1950, in full_dispatch_request
Aug 31 07:59:00 pero-ocr gunicorn[12175]: rv = self.dispatch_request()
Aug 31 07:59:00 pero-ocr gunicorn[12175]: File "/home/pero/env/pero-ocr/lib/python3.6/site-packages/flask/app.py", line 1936, in dispatch_request
Aug 31 07:59:00 pero-ocr gunicorn[12175]: return self.view_functions[rule.endpoint](**req.view_args)
Aug 31 07:59:00 pero-ocr gunicorn[12175]: File "/home/pero/env/pero-ocr/lib/python3.6/site-packages/flask_login/utils.py", line 272, in decorated_view
Aug 31 07:59:00 pero-ocr gunicorn[12175]: return func(*args, **kwargs)
Aug 31 07:59:00 pero-ocr gunicorn[12175]: File "/home/pero/pero/pero_ocr_web/app/document/routes.py", line 185, in get_alto_xml
Aug 31 07:59:00 pero-ocr gunicorn[12175]: return create_string_response(filename, page_layout.to_altoxml_string(), minetype='text/xml')
Aug 31 07:59:00 pero-ocr gunicorn[12175]: File "/home/pero/pero/pero-ocr/pero_ocr/document_ocr/layout.py", line 335, in to_altoxml_string
Aug 31 07:59:00 pero-ocr gunicorn[12175]: string.set("HEIGHT", str(int((np.max(all_y) - np.min(all_y)))))
Aug 31 07:59:00 pero-ocr gunicorn[12175]: File "<array_function internals>", line 6, in amax
Aug 31 07:59:00 pero-ocr gunicorn[12175]: File "/home/pero/env/pero-ocr/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 2668, in amax
Aug 31 07:59:00 pero-ocr gunicorn[12175]: keepdims=keepdims, initial=initial, where=where)
Aug 31 07:59:00 pero-ocr gunicorn[12175]: File "/home/pero/env/pero-ocr/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 90, in _wrapreduction
Aug 31 07:59:00 pero-ocr gunicorn[12175]: return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
Aug 31 07:59:00 pero-ocr gunicorn[12175]: ValueError: zero-size array to reduction operation maximum which has no identity

where can we find the pretrained models?

are there checkpoints of models which can be downloaded available somewhere?

OMR transformers produce nonsense transcriptions

Could be due to different input size

1. Test if OCR Transformers work.
2. Train OCR Transformer with different input size and test it.
3. Re-check network input.
4. If 2 works and 3 is not conclusive, re-train OMR models.

Add support for upside-down text

example: https://pero-ocr.fit.vutbr.cz/ocr/show_results/b7cd8304-aed4-4857-89a1-2410826478f3

Where should the model for the region detector be placed?

I ran the script with layout detection.
The class EngineRegionDetector raises the following error at line 75:
Cannot interpret feed_dict key as Tensor: The name 'inference_input:0' refers to a Tensor which does not exist. The operation, 'inference_input', does not exist in the graph.

ALTO - word blocks are shifted

Add region categories

Internal export: (pseudo PageXML)

All regions are RegionLayout with category attribute (saved to XML as TextRegion element with category in custom attribute)
Set OCR/OMR Engines to work only with some types of lines
Set Layout Engines to work only with some types of regions
Merging overlapping regions. (A text layout engine which detects a region/line inside another region adds its lines to the given region, using geometry and coords to determine whether a region/line is inside some region.) - not a useful feature

layout detection not good on exercise books

The layout detector identifies a lot of small regions instead of few larger ones.

problem of numpy version

Hello, when running the Integration of the pero-ocr python module, I encountered a problem with the numpy version, the error showed:

AttributeError: module 'numpy' has no attribute 'float'.
np.float was a deprecated alias for the builtin float. To avoid this error in existing code, use float by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use np.float64 here.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

If I want to lower the numpy version, scipy, numba, etc. also need to lower the version for compatibility, but many lower versions cannot be installed on my computer. What suggestions do you have? Thanks in advance!
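
NumPy 1.24 removed the aliases named in the message, so the options are to pin an older NumPy or to restore the aliases before the failing code is imported. The sketch below shows the second option as a stopgap. It is a monkey-patch, not a fix endorsed by the pero-ocr authors, and `restore_removed_aliases` is a hypothetical helper.

```python
import builtins

# Aliases of Python builtins that NumPy 1.24 removed (np.float, np.int, ...).
REMOVED_ALIASES = ("float", "int", "bool", "object", "complex", "str")


def restore_removed_aliases(module):
    """Re-add missing builtin aliases to module; return the names that were added."""
    added = []
    for name in REMOVED_ALIASES:
        if not hasattr(module, name):
            setattr(module, name, getattr(builtins, name))
            added.append(name)
    return added


# Intended use (not executed here), before importing pero_ocr:
# import numpy as np
# restore_removed_aliases(np)
```

Names that the installed NumPy still provides are left untouched, so the helper is harmless on older versions. The cleaner long-term fix is to replace np.float with float in the calling code, as the error message suggests.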

Layout for one of the pages not displayed on preview

set# 4cd39c80-6d50-43d2-8389-50efd3404714

Rename "Download Pages" to "Download transcriptions"

Website: correct textlines

We can correct the layout model (text regions) and the OCR.
Isn't there also a need to be able to correct the text lines?

I understand that this is difficult as text line detection is done together with OCR'ing and I will now use Transkribus to correct the text lines as a post-correction.

How to correct erroneous line identification?

Failed line cropping in page_parser

Line crop fails. Job saved at /mnt/matylda1/hradis/PERO/BUGS/a9ccd42b-9b26-40ae-9c3b-6e4d26c21ee0

Processing 4/24 (16.67 %) [id: b0a89e97-5c8a-4511-94db-7fed583bcba9]
Traceback (most recent call last):
File "/home/ihradis/projects/2018-01-15_PERO/pero-ocr-live/user_scripts/parse_folder.py", line 172, in <module>
main()
File "/home/ihradis/projects/2018-01-15_PERO/pero-ocr-live/user_scripts/parse_folder.py", line 150, in main
page_layout = page_parser.process_page(image, page_layout)
File "/home/ihradis/projects/2018-01-15_PERO/pero-ocr-live/pero_ocr/document_ocr/page_parser.py", line 256, in process_page
page_layout = self.line_cropper.process_page(image, page_layout)
File "/home/ihradis/projects/2018-01-15_PERO/pero-ocr-live/pero_ocr/document_ocr/page_parser.py", line 201, in process_page
line.crop = self.crop_engine.crop(img, line.baseline, line.heights)
File "/home/ihradis/projects/2018-01-15_PERO/pero-ocr-live/pero_ocr/document_ocr/crop_engine.py", line 70, in crop
line_crop = cv2.remap(img_crop, coords[:, :, 0], coords[:, :, 1], interpolation=cv2.INTER_LINEAR, borderMode=cv2.BORDER_TRANSPARENT)
cv2.error: OpenCV(4.0.0) /io/opencv/modules/imgproc/src/imgwarp.cpp:666: error: (-215:Assertion failed) !ssize.empty() in function 'remapBilinear'

FIX: Switch WIDTH and HEIGHT in ALTO export

XML headers

As mentioned in issue #49, Pero generates ALTO files without proper XML headers (<?xml version='1.0' encoding='utf-8'?>). Was that intended, or could that be fixed?
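
For reference, the header in question can be produced and checked with the standard library alone. The sketch below does not reflect how pero-ocr serializes ALTO internally. Both function names are hypothetical, and the `xml_declaration` argument of `ElementTree.tostring` requires Python 3.8 or newer.

```python
import xml.etree.ElementTree as ET


def serialize_with_declaration(root: ET.Element) -> bytes:
    """Serialize an element tree with a leading XML declaration."""
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def has_xml_declaration(data: bytes) -> bool:
    """Check whether serialized XML starts with an XML declaration."""
    return data.startswith(b"<?xml")
```

A check like `has_xml_declaration` could also be applied to existing exports to find files that were written without the header.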

Add support for vertical text

example: see "tabulka - text svisle" job

Page color does not switch to DONE if some lines were deleted.

Page processing fail in line detection

Processing 20/25 (80.00 %) [id: 371eaaf3-a3e7-45c9-8410-0e0f9ac872da]
Traceback (most recent call last):
File "/home/ihradis/projects/2018-01-15_PERO/pero-ocr-live/user_scripts/parse_folder.py", line 172, in <module>
main()
File "/home/ihradis/projects/2018-01-15_PERO/pero-ocr-live/user_scripts/parse_folder.py", line 150, in main
page_layout = page_parser.process_page(image, page_layout)
File "/home/ihradis/projects/2018-01-15_PERO/pero-ocr-live/pero_ocr/document_ocr/page_parser.py", line 246, in process_page
page_layout = self.line_parser.process_page(image, page_layout)
File "/home/ihradis/projects/2018-01-15_PERO/pero-ocr-live/pero_ocr/document_ocr/page_parser.py", line 129, in process_page
region = self.assign_lines_to_region(baseline_list, heights_list, textline_list, region)
File "/home/ihradis/projects/2018-01-15_PERO/pero-ocr-live/pero_ocr/document_ocr/page_parser.py", line 115, in assign_lines_to_region
baseline_intersection, textline_intersection = linepp.mask_textline_by_region(baseline, textline, region.polygon)
File "/home/ihradis/projects/2018-01-15_PERO/pero-ocr-live/pero_ocr/line_engine/line_postprocessing.py", line 179, in mask_textline_by_region
baseline_is = region_shpl.intersection(baseline_shpl)
File "/home/ihradis/env/tf/lib/python3.6/site-packages/shapely/geometry/base.py", line 620, in intersection
return geom_factory(self.impl['intersection'](self, other))
File "/home/ihradis/env/tf/lib/python3.6/site-packages/shapely/topology.py", line 70, in __call__
self._check_topology(err, this, other)
File "/home/ihradis/env/tf/lib/python3.6/site-packages/shapely/topology.py", line 38, in _check_topology
self.fn.__name__, repr(geom)))
shapely.errors.TopologicalError: The operation 'GEOSIntersection_r' could not be performed. Likely cause is invalidity of the geometry <shapely.geometry.polygon.Polygon object at 0x7f57dc052be0>

Add support for mixed handwritten and printed text

... as well as combined with gothic etc.

Can't install through pip

Hi, I'm trying to use this repository in a college project, but I can't seem to run pip install pero-ocr.

I'm getting the following error

The conflict is caused by:
    pero-ocr 0.5 depends on tensorflow-gpu==1.15
    pero-ocr 0.4 depends on tensorflow-gpu==1.15
    pero-ocr 0.3 depends on tensorflow-gpu==1.15
    pero-ocr 0.2 depends on tensorflow-gpu==1.14
    pero-ocr 0.1.1 depends on tensorflow-gpu==1.14

But when trying to install that version of tensorflow-gpu, I can't seem to get a valid version.

Thank you.