

pero-ocr

The package provides a full OCR pipeline including text paragraph detection, text line detection, text transcription, and text refinement using a language model. It can be used as a command line application or as a Python package, which provides a document processing class and a class that represents document page content.

Please cite

If you use pero-ocr, please cite:

  • O Kodym, M Hradiš: Page Layout Analysis System for Unconstrained Historic Documents. ICDAR, 2021.
  • M Kišš, K Beneš, M Hradiš: AT-ST: Self-Training Adaptation Strategy for OCR in Domains with Limited Transcriptions. ICDAR, 2021.
  • J Kohút, M Hradiš: TS-Net: OCR Trained to Switch Between Text Transcription Styles. ICDAR, 2021.

Running stuff

Scripts (as well as tests) assume that it is possible to import pero_ocr and its components.

For the current shell session, this can be achieved by setting PYTHONPATH up:

export PYTHONPATH=/path/to/the/repo:$PYTHONPATH

As a more permanent solution, a very simplistic setup.py is prepared:

python setup.py develop

Beware that the setup.py does not promise to bring all the required stuff, e.g. setting CUDA up is up to you.

Pero can be later removed from your Python distribution by running:

python setup.py develop --uninstall
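
Whichever route is taken, a quick way to confirm that the package is visible to the interpreter is to look it up without actually importing it. This is a minimal sketch using only the standard library; the helper name is_importable is illustrative and not part of pero-ocr.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be located on sys.path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_importable("pero_ocr"):
        print("pero_ocr is importable")
    else:
        print("pero_ocr not found; check PYTHONPATH or run setup.py develop")
```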

Available models

A general layout analysis model (printed and handwritten) with a European printed OCR engine specialized for Czech newspapers can be downloaded here. The OCR engine is suitable for most European printed documents. It is specialized for low-quality Czech newspapers digitized from microfilms, but it provides very good results for almost all types of printed documents in most languages. If you are interested in processing printed Fraktur fonts, handwritten documents or medieval manuscripts, feel free to contact the authors. The newest OCR engines are available at pero-ocr.fit.vutbr.cz. OCR engines are also available through an API running at pero-ocr.fit.vutbr.cz/api (see the GitHub repository).

Command line application

The command line application is ./user_scripts/parse_folder.py. It processes images in a directory using an OCR engine. It can render detected lines into an image and provide document content in Page XML and ALTO XML formats. Additionally, it can crop all text lines as rectangular regions of normalized size and save them into separate image files.

Running command line application in container

A Docker container can be built from the source code to run scripts and programs based on pero-ocr. The following example runs the parse_folder.py script to generate Page XML files for images in the input directory:

docker run --rm --tty --interactive \
     --volume path/to/input/dir:/input \
     --volume path/to/output/dir:/output \
     --volume path/to/ocr/engine:/engine \
     --gpus all \
     pero-ocr /usr/bin/python3 user_scripts/parse_folder.py \
          --config /engine/config.ini \
          --input-image-path /input \
          --output-xml-path /output

Be sure to use container-internal paths for the data passed in the command. All input and output data locations have to be passed to the container via the --volume argument due to container isolation. See the docker run command reference for more information.

The container can be built like this:

docker build -f Dockerfile -t pero-ocr .
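
When the container is launched from another program, the same command can be assembled as an argument list. The sketch below mirrors the docker run example above; the function name and its parameters are illustrative assumptions, and the --tty and --interactive flags are dropped on the assumption that the run is non-interactive. Host directories are resolved to absolute paths because Docker expects an absolute host path for a bind mount.

```python
from pathlib import Path


def docker_parse_folder_cmd(input_dir, output_dir, engine_dir,
                            image="pero-ocr", gpus="all"):
    """Build the argument list for running parse_folder.py in the container."""

    def mount(host_dir, container_dir):
        # Map a host directory to a fixed container-internal path.
        return ["--volume", f"{Path(host_dir).resolve()}:{container_dir}"]

    cmd = ["docker", "run", "--rm"]
    cmd += mount(input_dir, "/input")
    cmd += mount(output_dir, "/output")
    cmd += mount(engine_dir, "/engine")
    if gpus:
        cmd += ["--gpus", gpus]
    # Everything after the image name uses container-internal paths only.
    cmd += [image, "/usr/bin/python3", "user_scripts/parse_folder.py",
            "--config", "/engine/config.ini",
            "--input-image-path", "/input",
            "--output-xml-path", "/output"]
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True) once the image has been built.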

Integration of the pero-ocr python module

This example shows how to use the OCR pipeline provided by the pero-ocr package directly, and thus how to integrate pero-ocr into other applications. The PageLayout class represents the content of a single document page; it can be loaded from Page XML and exported to Page XML and ALTO XML formats. The OCR pipeline is represented by the PageParser class.

import os
import configparser
import cv2
import numpy as np
from pero_ocr.core.layout import PageLayout
from pero_ocr.document_ocr.page_parser import PageParser

# Read config file.
config_path = "./config_file.ini"
config = configparser.ConfigParser()
config.read(config_path)

# Init the OCR pipeline. 
# You have to specify config_path to be able to use relative paths
# inside the config file.
page_parser = PageParser(config, config_path=os.path.dirname(config_path))

# Read the document page image.
input_image_path = "page_image.jpg"
image = cv2.imread(input_image_path, 1)

# Init empty page content. 
# This object will be updated by the ocr pipeline. id can be any string and it is used to identify the page.
page_layout = PageLayout(id=input_image_path,
     page_size=(image.shape[0], image.shape[1]))

# Process the image by the OCR pipeline
page_layout = page_parser.process_page(image, page_layout)

page_layout.to_pagexml('output_page.xml') # Save results as Page XML.
page_layout.to_altoxml('output_ALTO.xml') # Save results as ALTO XML.

# Render detected text regions and text lines into the image and
# save it into a file.
rendered_image = page_layout.render_to_image(image) 
cv2.imwrite('page_image_render.jpg', rendered_image)

# Save each cropped text line in a separate .jpg file.
for region in page_layout.regions:
  for line in region.lines:
     cv2.imwrite(f'file_id-{line.id}.jpg', line.crop.astype(np.uint8))
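
Downstream code that only needs the transcriptions can read them back from the exported Page XML without depending on pero-ocr. The following is a minimal sketch using the standard library; it assumes the usual PAGE layout of TextLine elements with a direct TextEquiv/Unicode child, and the function name read_line_texts is illustrative.

```python
import xml.etree.ElementTree as ET


def _local(tag):
    """Strip the '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def read_line_texts(source):
    """Return (line id, text) pairs for every TextLine in a Page XML file.

    `source` may be a file path or an open file object.
    """
    lines = []
    for elem in ET.parse(source).iter():
        if _local(elem.tag) != "TextLine":
            continue
        text = ""
        # Only direct children, so word-level TextEquiv elements are skipped.
        for child in elem:
            if _local(child.tag) == "TextEquiv":
                for uni in child:
                    if _local(uni.tag) == "Unicode":
                        text = uni.text or ""
        lines.append((elem.get("id"), text))
    return lines
```

For example, read_line_texts('output_page.xml') would list the lines written by the to_pagexml call above.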

Contributing

Working changes are expected to happen on the develop branch, so if you plan to contribute, check it out right when cloning:

git clone -b develop git@github.com:DCGM/pero-ocr.git pero-ocr

Testing

Currently, only unit tests are provided, and only for some of the code. Simply run your preferred test runner, e.g.:

~/pero-ocr $ green

Simple regression testing

Regression testing can be done with test/processing_test.sh. The script calls the containerized parse_folder.py to process input images and Page XML files, then calls a user-supplied comparison script to compare the outputs to example outputs supplied by the user. The PERO-OCR container has to be built in advance to run the test; see the 'Running command line application in container' chapter. The script can be called like this:

sh test/processing_test.sh \
     --input-images path/to/input/image/directory \
     --input-xmls path/to/input/page-xml/directory \
     --output-dir path/to/output/dir \
     --configuration path/to/ocr/engine/config.ini \
     --example path/to/example/output/data \
     --test-utility path/to/test/script \
     --test-output path/to/testscript/output/dir \
     --gpu-ids gpu ids for docker container

The first 4 arguments are mandatory. --gpu-ids defaults to 'all', which passes all GPUs to the container. The test utility, example outputs and test output folder have to be set only if the results should be compared. The test utility is expected to be the path to the eval_ocr_pipeline_xml.py script from the pero repository. Be sure to set PYTHONPATH correctly and install the dependencies of the pero repository for the utility to work. Another script can be used if it takes the same arguments. Otherwise, the output data can be compared manually after processing.
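
The split between mandatory and optional arguments can be made explicit when the test script is driven from Python. This is a sketch only: the function name is hypothetical, and treating the three comparison options as an all-or-nothing group is one reading of the paragraph above rather than documented behaviour of the script.

```python
def processing_test_cmd(input_images, input_xmls, output_dir, configuration,
                        example=None, test_utility=None, test_output=None,
                        gpu_ids=None):
    """Build the argument list for test/processing_test.sh."""
    comparison = {
        "--example": example,
        "--test-utility": test_utility,
        "--test-output": test_output,
    }
    given = [flag for flag, value in comparison.items() if value is not None]
    # Assumption: comparison only makes sense with all three options set.
    if given and len(given) != len(comparison):
        missing = sorted(set(comparison) - set(given))
        raise ValueError(f"comparison also needs: {', '.join(missing)}")

    cmd = ["sh", "test/processing_test.sh",
           "--input-images", str(input_images),
           "--input-xmls", str(input_xmls),
           "--output-dir", str(output_dir),
           "--configuration", str(configuration)]
    for flag in given:
        cmd += [flag, str(comparison[flag])]
    if gpu_ids is not None:
        # When omitted, the script itself falls back to 'all'.
        cmd += ["--gpu-ids", str(gpu_ids)]
    return cmd
```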

pero-ocr's People

Contributors

ibenes, ikiss-fit, kohuthonza, lachubcz, matusbako, michal-hradis, oldakodym, xraurp


pero-ocr's Issues


Layout analysis crashes

Crashed on two files in my new collection. Problem in live system.

Job ID: fb48773658124afab23ac9854ea5e56d
Document ID: 1e4d33dc189c4a2bb93eaebf722432e4
Image: 9823218f-12c1-4ede-ba68-897e055e5580
Errors:
Processing 9823218f-12c1-4ede-ba68-897e055e5580
ERROR: Failed to process file 9823218f-12c1-4ede-ba68-897e055e5580.
The operation 'GEOSUnion_r' could not be performed. Likely cause is invalidity of the geometry <shapely.geometry.polygon.Polygon object at 0x7f249c0cd050>

Traceback (most recent call last):
File "/home/pero/pero/pero-ocr/user_scripts/parse_folder.py", line 205, in main
page_layout = page_parser.process_page(image, page_layout)
File "/home/pero/pero/pero-ocr/pero_ocr/document_ocr/page_parser.py", line 372, in process_page
page_layout = layout_parser.process_page(image, page_layout)
File "/home/pero/pero/pero-ocr/pero_ocr/document_ocr/page_parser.py", line 169, in process_page
p_list, b_list, h_list, t_list = self.engine.detect(img, rot=rot)
File "/home/pero/pero/pero-ocr/pero_ocr/layout_engines/cnn_layout_engine.py", line 127, in detect
region_poly = helpers.region_from_textlines(region_textlines)
File "/home/pero/pero/pero-ocr/pero_ocr/layout_engines/layout_helpers.py", line 100, in region_from_textlines
region_poly = region_poly.union(textline_poly)
File "/home/pero/python_environment/pero_ocr_web_clients/lib/python3.7/site-packages/shapely/geometry/base.py", line 658, in union
return geom_factory(self.impl['union'](self, other))
File "/home/pero/python_environment/pero_ocr_web_clients/lib/python3.7/site-packages/shapely/topology.py", line 70, in __call__
self._check_topology(err, this, other)
File "/home/pero/python_environment/pero_ocr_web_clients/lib/python3.7/site-packages/shapely/topology.py", line 38, in _check_topology
self.fn.__name__, repr(geom)))
shapely.errors.TopologicalError: The operation 'GEOSUnion_r' could not be performed. Likely cause is invalidity of the geometry <shapely.geometry.polygon.Polygon object at 0x7f249c0cd050>
TopologyException: Input geom 1 is invalid: Self-intersection at or near point 2347.0777238895662 -44.069123013668701 at 2347.0777238895662 -44.069123013668701
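
The log points to a self-intersecting polygon being passed to union. One common workaround, sketched here under the assumption that an approximate repair is acceptable, is to run invalid geometries through a zero-width buffer first. The helper names are illustrative and not part of pero-ocr, and buffer(0) may change the shape (for example by dropping part of a self-intersecting outline), so this is a mitigation rather than a fix for whatever produces the invalid polygon.

```python
from shapely.geometry import Polygon


def repaired(geom):
    """Return geom unchanged if valid, otherwise a zero-width-buffered copy."""
    return geom if geom.is_valid else geom.buffer(0)


def safe_union(a, b):
    """Union two geometries after repairing any invalid input."""
    return repaired(a).union(repaired(b))


# A "bowtie" outline crosses itself at (1, 1), so it is not a valid polygon.
bowtie = Polygon([(0, 0), (2, 2), (2, 0), (0, 2)])
square = Polygon([(1, 0), (3, 0), (3, 2), (1, 2)])
merged = safe_union(bowtie, square)
```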

Website typo Layout Analysis

I suppose website related issues can also be mentioned here.

I noticed a typo for selecting the layout analysis.
Shouldn't Select baseline detector be Select layout detector?



Problem with the pretrained model not available

File "/usr/local/lib/python3.9/dist-packages/torch/jit/_serialization.py", line 149, in load
raise ValueError(f"The provided filename {f} does not exist") # type: ignore[str-bytes-safe]
ValueError: The provided filename /opt/pero/pero-ocr/ocr_model/checkpoint_646000.ckpt does not exist
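
An error like this can be caught before the pipeline is initialized by checking that every file referenced in the engine configuration exists. The sketch below is generic: it treats any option value ending in one of a few suffixes as a file path relative to the config file, which is an assumption about how the config is laid out, and both the suffix list and the function name are hypothetical.

```python
import configparser
import os

# Assumption: values with these endings refer to files on disk.
FILE_SUFFIXES = (".ckpt", ".pt", ".pth", ".json")


def find_missing_files(config_path, suffixes=FILE_SUFFIXES):
    """Return (section, option, resolved path) for each referenced file that is absent."""
    config = configparser.ConfigParser()
    config.read(config_path)
    base_dir = os.path.dirname(os.path.abspath(config_path))

    missing = []
    for section in config.sections():
        # raw=True avoids interpolation errors on values containing '%'.
        for option, value in config.items(section, raw=True):
            if not value.lower().endswith(suffixes):
                continue
            path = value if os.path.isabs(value) else os.path.join(base_dir, value)
            if not os.path.isfile(path):
                missing.append((section, option, path))
    return missing
```

An empty result means every referenced file was found; otherwise the returned paths show where the engine expects its model files.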

Line crop fails, probably due to an empty mapping

Error log:
line_coords = self.get_crop_inputs(baseline, height, self.line_height)
Traceback (most recent call last):
File "/home/pero/PERO/pero-ocr/user_scripts/parse_folder.py", line 176, in main
page_layout = page_parser.process_page(image, page_layout)
File "/home/pero/PERO/pero-ocr/pero_ocr/document_ocr/page_parser.py", line 408, in process_page
page_layout = self.line_cropper.process_page(image, page_layout)
File "/home/pero/PERO/pero-ocr/pero_ocr/document_ocr/page_parser.py", line 348, in process_page
line.crop = self.crop_engine.crop(img, line.baseline, line.heights)
File "/home/pero/PERO/pero-ocr/pero_ocr/document_ocr/crop_engine.py", line 78, in crop
interpolation=cv2.INTER_LINEAR, borderMode=cv2.BORDER_CONSTANT)
cv2.error: OpenCV(4.2.0) /io/opencv/modules/imgproc/src/imgwarp.cpp:1703: error: (-215:Assertion failed) !_map1.empty() in function 'remap'

Clustering layout probably fails on pages/regions with no lines?

Data in BUGS/a69eb9c4-ae17-4429-aa70-c636ee0051b0
log:
ERROR: Failed to process file 9d24471a-280b-4e2b-a175-d65910c7c548.
need at least one array to concatenate
Traceback (most recent call last):
File "/home/pero/PERO/pero-ocr/user_scripts/parse_folder.py", line 176, in main
page_layout = page_parser.process_page(image, page_layout)
File "/home/pero/PERO/pero-ocr/pero_ocr/document_ocr/page_parser.py", line 404, in process_page
page_layout = self.layout_parser.process_page(image, page_layout)
File "/home/pero/PERO/pero-ocr/pero_ocr/document_ocr/page_parser.py", line 141, in process_page
polygons_list, baselines_list, heights_list, textlines_list = self.region_engine.detect(img)
File "/home/pero/PERO/pero-ocr/pero_ocr/region_engine/region_engine_splic.py", line 65, in detect
region_poly_points = np.concatenate(region_textlines, axis=0)
File "<array_function internals>", line 6, in concatenate
ValueError: need at least one array to concatenate
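
The traceback shows np.concatenate being called with an empty list, which NumPy rejects. A guard of the following kind would avoid the exception; it is a sketch with a hypothetical function name, and returning None for a region without text lines is an assumption about what the caller could handle.

```python
import numpy as np


def region_points_from_textlines(region_textlines):
    """Stack text line polygons into one (N, 2) point array, or None if empty."""
    arrays = [np.asarray(t) for t in region_textlines if len(t) > 0]
    if not arrays:
        # Nothing to concatenate: np.concatenate([]) would raise ValueError.
        return None
    return np.concatenate(arrays, axis=0)
```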

Transcription

For old Latin transcription, which model should I select to generate the OCR of the image below, please?

Getting KeyError

I was trying the pero-ocr on a png image with table and text but got the error below. Please, how do I resolve this?


training model

Hello again, just wondering where I can find the code that can be used to train a handwritten text recognition model.
I only find in this repository code which can be used to score an existing image, not for training a model.

Music pull request feedback

@vlachvojta :

  • Layout_parser: Line categories (LINE_CATEGORIES)
  • Layout_parser: Filter output categories (CATEGORIES)
  • decoder filter categories
  • music_dictionary -> output_substitution_table (change order of key, value)
  • Add minimalistic CLI for export music to user_scripts
  • Render with categories box (just name of category)
  • Normalize category names in render using code from pero/unicode_normalization
  • OCR Engine get line confidence
  • Test new YOLO model
  • Check all changed logging statements
  • Test OCR with old configs
  • check backward compatibility of custom tag in page-xml
  • Add Atomic option to output substitution + add setting options to config

@vlachvojta with @ikiss-fit :

  • Check API and web compatibility (after adding line confidence in the OCR engine)

ALTO export BUG

Export fails when text line has no points?

For example document c1951833-8440-4851-93b5-6dfc6c3663bf, second page fe55b56c-341e-48d3-82ac-e3a971a0a124.

Error:
Aug 31 07:59:00 pero-ocr gunicorn[12175]: Traceback (most recent call last):
Aug 31 07:59:00 pero-ocr gunicorn[12175]: File "/home/pero/env/pero-ocr/lib/python3.6/site-packages/flask/app.py", line 2447, in wsgi_app
Aug 31 07:59:00 pero-ocr gunicorn[12175]: response = self.full_dispatch_request()
Aug 31 07:59:00 pero-ocr gunicorn[12175]: File "/home/pero/env/pero-ocr/lib/python3.6/site-packages/flask/app.py", line 1952, in full_dispatch_request
Aug 31 07:59:00 pero-ocr gunicorn[12175]: rv = self.handle_user_exception(e)
Aug 31 07:59:00 pero-ocr gunicorn[12175]: File "/home/pero/env/pero-ocr/lib/python3.6/site-packages/flask/app.py", line 1821, in handle_user_exception
Aug 31 07:59:00 pero-ocr gunicorn[12175]: reraise(exc_type, exc_value, tb)
Aug 31 07:59:00 pero-ocr gunicorn[12175]: File "/home/pero/env/pero-ocr/lib/python3.6/site-packages/flask/_compat.py", line 39, in reraise
Aug 31 07:59:00 pero-ocr gunicorn[12175]: raise value
Aug 31 07:59:00 pero-ocr gunicorn[12175]: File "/home/pero/env/pero-ocr/lib/python3.6/site-packages/flask/app.py", line 1950, in full_dispatch_request
Aug 31 07:59:00 pero-ocr gunicorn[12175]: rv = self.dispatch_request()
Aug 31 07:59:00 pero-ocr gunicorn[12175]: File "/home/pero/env/pero-ocr/lib/python3.6/site-packages/flask/app.py", line 1936, in dispatch_request
Aug 31 07:59:00 pero-ocr gunicorn[12175]: return self.view_functions[rule.endpoint](**req.view_args)
Aug 31 07:59:00 pero-ocr gunicorn[12175]: File "/home/pero/env/pero-ocr/lib/python3.6/site-packages/flask_login/utils.py", line 272, in decorated_view
Aug 31 07:59:00 pero-ocr gunicorn[12175]: return func(*args, **kwargs)
Aug 31 07:59:00 pero-ocr gunicorn[12175]: File "/home/pero/pero/pero_ocr_web/app/document/routes.py", line 185, in get_alto_xml
Aug 31 07:59:00 pero-ocr gunicorn[12175]: return create_string_response(filename, page_layout.to_altoxml_string(), minetype='text/xml')
Aug 31 07:59:00 pero-ocr gunicorn[12175]: File "/home/pero/pero/pero-ocr/pero_ocr/document_ocr/layout.py", line 335, in to_altoxml_string
Aug 31 07:59:00 pero-ocr gunicorn[12175]: string.set("HEIGHT", str(int((np.max(all_y) - np.min(all_y)))))
Aug 31 07:59:00 pero-ocr gunicorn[12175]: File "<array_function internals>", line 6, in amax
Aug 31 07:59:00 pero-ocr gunicorn[12175]: File "/home/pero/env/pero-ocr/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 2668, in amax
Aug 31 07:59:00 pero-ocr gunicorn[12175]: keepdims=keepdims, initial=initial, where=where)
Aug 31 07:59:00 pero-ocr gunicorn[12175]: File "/home/pero/env/pero-ocr/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 90, in _wrapreduction
Aug 31 07:59:00 pero-ocr gunicorn[12175]: return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
Aug 31 07:59:00 pero-ocr gunicorn[12175]: ValueError: zero-size array to reduction operation maximum which has no identity
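
The failing statement takes np.max of an empty coordinate array. A size computation that tolerates lines without points could look like the sketch below; the function name is hypothetical, and reporting a zero size for an empty line is an assumption (skipping such lines in the export would be another option).

```python
import numpy as np


def bbox_size(points):
    """Return (width, height) of the bounding box of 2D points, (0, 0) if empty."""
    pts = np.asarray(points, dtype=float).reshape(-1, 2)
    if pts.size == 0:
        # np.max on a zero-size array raises ValueError, so return early.
        return (0, 0)
    width = int(pts[:, 0].max() - pts[:, 0].min())
    height = int(pts[:, 1].max() - pts[:, 1].min())
    return (width, height)
```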

OMR transformers produce nonsense transcriptions

  • Could be due to different input size
  1. Test if OCR Transformers work.
  2. Train OCR Transformer with different input size and test it.
  3. Re-check network input.
  4. If 2 works and 3 is not conclusive, re-train OMR models.

Where does model for region detector place?

I ran the script with layout detection. The class EngineRegionDetector fails at line 75 with this error:
Cannot interpret feed_dict key as Tensor: The name 'inference_input:0' refers to a Tensor which does not exist. The operation, 'inference_input', does not exist in the graph.

Add region categories

Internal export: (pseudo PageXML)

  • All regions are RegionLayout with category attribute (saved to XML as TextRegion element with category in custom attribute)
  • Set OCR/OMR Engines to work only with some types of lines
  • Set Layout Engines to work only with some types of regions
    Merging overlapping regions. (A text layout engine which detects a region/line inside another region adds its lines to the given region, using geometry and coords to determine whether a region/line lies inside some region.) - not a useful feature

problem of numpy version

Hello, when running the Integration of the pero-ocr python module, I encountered a problem with the numpy version, the error showed:

AttributeError: module 'numpy' has no attribute 'float'.
np.float was a deprecated alias for the builtin float. To avoid this error in existing code, use float by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use np.float64 here.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

If I want to lower the numpy version, scipy, numba, etc. also need to lower the version for compatibility, but many lower versions cannot be installed on my computer. What suggestions do you have? Thanks in advance!
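
NumPy 1.24 removed the aliases that were deprecated in 1.20, so code that still refers to np.float fails on newer releases. Besides pinning NumPy below 1.24, one possible stopgap is to restore the missing names before importing the affected code. This is a sketch of that shim, not an official fix; it patches the numpy module globally, and fixing the offending np.float call sites is the cleaner long-term solution.

```python
import warnings

import numpy as np

# Aliases removed in NumPy 1.24, mapped to the builtins they stood for.
_REMOVED_ALIASES = {"float": float, "int": int, "complex": complex,
                    "object": object, "str": str}

with warnings.catch_warnings():
    # NumPy 1.20 to 1.23 still resolves these names but emits a deprecation warning.
    warnings.simplefilter("ignore")
    for _name, _builtin in _REMOVED_ALIASES.items():
        if not hasattr(np, _name):
            setattr(np, _name, _builtin)

# Import pero_ocr only after this point so it sees the restored names.
```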

Website: correct textlines

We can correct the layout model (text regions) and the OCR.
Isn't there also a need to be able to correct the text lines?

I understand that this is difficult as text line detection is done together with OCR'ing and I will now use Transkribus to correct the text lines as a post-correction.

Failed line cropping in page_parser

Line crop fails. Job saved at /mnt/matylda1/hradis/PERO/BUGS/a9ccd42b-9b26-40ae-9c3b-6e4d26c21ee0

Processing 4/24 (16.67 %) [id: b0a89e97-5c8a-4511-94db-7fed583bcba9]
Traceback (most recent call last):
File "/home/ihradis/projects/2018-01-15_PERO/pero-ocr-live/user_scripts/parse_folder.py", line 172, in <module>
main()
File "/home/ihradis/projects/2018-01-15_PERO/pero-ocr-live/user_scripts/parse_folder.py", line 150, in main
page_layout = page_parser.process_page(image, page_layout)
File "/home/ihradis/projects/2018-01-15_PERO/pero-ocr-live/pero_ocr/document_ocr/page_parser.py", line 256, in process_page
page_layout = self.line_cropper.process_page(image, page_layout)
File "/home/ihradis/projects/2018-01-15_PERO/pero-ocr-live/pero_ocr/document_ocr/page_parser.py", line 201, in process_page
line.crop = self.crop_engine.crop(img, line.baseline, line.heights)
File "/home/ihradis/projects/2018-01-15_PERO/pero-ocr-live/pero_ocr/document_ocr/crop_engine.py", line 70, in crop
line_crop = cv2.remap(img_crop, coords[:, :, 0], coords[:, :, 1], interpolation=cv2.INTER_LINEAR, borderMode=cv2.BORDER_TRANSPARENT)
cv2.error: OpenCV(4.0.0) /io/opencv/modules/imgproc/src/imgwarp.cpp:666: error: (-215:Assertion failed) !ssize.empty() in function 'remapBilinear'

XML headers

As mentioned in issue #49, Pero generates ALTO files without proper XML headers (<?xml version='1.0' encoding='utf-8'?>). Was that intended, or could that be fixed?
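
For reference, this is how an XML declaration can be requested when serializing with the Python standard library (the xml_declaration argument of tostring requires Python 3.8 or newer). It is only an illustration: the XML library that pero-ocr actually uses for the export may differ, and the helper name is hypothetical.

```python
import xml.etree.ElementTree as ET


def to_xml_bytes(root):
    """Serialize an element tree to UTF-8 bytes including the XML declaration."""
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


root = ET.Element("alto")
ET.SubElement(root, "Layout")
data = to_xml_bytes(root)
```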

Page processing fail in line detection

Processing 20/25 (80.00 %) [id: 371eaaf3-a3e7-45c9-8410-0e0f9ac872da]
Traceback (most recent call last):
File "/home/ihradis/projects/2018-01-15_PERO/pero-ocr-live/user_scripts/parse_folder.py", line 172, in <module>
main()
File "/home/ihradis/projects/2018-01-15_PERO/pero-ocr-live/user_scripts/parse_folder.py", line 150, in main
page_layout = page_parser.process_page(image, page_layout)
File "/home/ihradis/projects/2018-01-15_PERO/pero-ocr-live/pero_ocr/document_ocr/page_parser.py", line 246, in process_page
page_layout = self.line_parser.process_page(image, page_layout)
File "/home/ihradis/projects/2018-01-15_PERO/pero-ocr-live/pero_ocr/document_ocr/page_parser.py", line 129, in process_page
region = self.assign_lines_to_region(baseline_list, heights_list, textline_list, region)
File "/home/ihradis/projects/2018-01-15_PERO/pero-ocr-live/pero_ocr/document_ocr/page_parser.py", line 115, in assign_lines_to_region
baseline_intersection, textline_intersection = linepp.mask_textline_by_region(baseline, textline, region.polygon)
File "/home/ihradis/projects/2018-01-15_PERO/pero-ocr-live/pero_ocr/line_engine/line_postprocessing.py", line 179, in mask_textline_by_region
baseline_is = region_shpl.intersection(baseline_shpl)
File "/home/ihradis/env/tf/lib/python3.6/site-packages/shapely/geometry/base.py", line 620, in intersection
return geom_factory(self.impl['intersection'](self, other))
File "/home/ihradis/env/tf/lib/python3.6/site-packages/shapely/topology.py", line 70, in __call__
self._check_topology(err, this, other)
File "/home/ihradis/env/tf/lib/python3.6/site-packages/shapely/topology.py", line 38, in _check_topology
self.fn.__name__, repr(geom)))
shapely.errors.TopologicalError: The operation 'GEOSIntersection_r' could not be performed. Likely cause is invalidity of the geometry <shapely.geometry.polygon.Polygon object at 0x7f57dc052be0>

Can't install through pip

Hi, I'm trying to use this repository in a college project, but I can't seem to run pip install pero-ocr.

I'm getting the following error

The conflict is caused by:
    pero-ocr 0.5 depends on tensorflow-gpu==1.15
    pero-ocr 0.4 depends on tensorflow-gpu==1.15
    pero-ocr 0.3 depends on tensorflow-gpu==1.15
    pero-ocr 0.2 depends on tensorflow-gpu==1.14
    pero-ocr 0.1.1 depends on tensorflow-gpu==1.14

But when trying to install that version of tensorflow-gpu, I can't seem to get a valid version.

Thank you.
