ocr-d / ocrd_calamari

Recognize text using Calamari OCR and the OCR-D framework

License: Apache License 2.0

ocrd_calamari's Introduction

ocrd_calamari

Recognize text using Calamari OCR.

Introduction

ocrd_calamari offers an OCR-D-compliant workspace processor for the functionality of Calamari OCR. It uses OCR-D workspaces (METS) with PAGE XML documents as input and output.

This processor only operates on the text line level and so needs a line segmentation (and by extension a binarized image) as its input.

In addition to the line text it may also output word and glyph segmentation including per-glyph confidence values and per-glyph alternative predictions as provided by the Calamari OCR engine, using a textequiv_level of word or glyph. Note that while Calamari does not provide word segmentation, this processor produces word segmentation inferred from text segmentation and the glyph positions. The provided glyph and word segmentation can be used for text extraction and highlighting, but is probably not useful for further image-based processing.

Installation

From PyPI

pip install ocrd_calamari

From the git repository

pip install .

Install models

Download models trained on GT4HistOCR data:

make qurator-gt4histocr-1.0
ls .local/share/ocrd-resources/ocrd-calamari-recognize/*

Manual download: model.tar.xz

Example Usage

Before using ocrd-calamari-recognize, download a model and some example data:

# Download model and example data
make qurator-gt4histocr-1.0
make example

The example already contains a binarized and line-segmented page, so we are ready to go. Recognize the text using ocrd_calamari and the downloaded model:

cd actevedef_718448162.first-page+binarization+segmentation
ocrd-calamari-recognize \
  -P checkpoint_dir qurator-gt4histocr-1.0 \
  -I OCR-D-SEG-LINE-SBB -O OCR-D-OCR-CALAMARI

You may want to have a look at the ocrd-tool.json descriptions for additional parameters and default values.

Development & Testing

For information regarding development and testing, please see README-DEV.md.

ocrd_calamari's Issues

Line confidences

Check if line confidence values are generated.

Release v1.0.4

checkpoint-dir changes
- Release info should note the (breaking) change in options
#75

Support Python 3.9

In a Python 3.9 venv:

% python --version
Python 3.9.5
% pip install -e .
Obtaining file:///home/mike/devel/ocrd_calamari
Collecting h5py<3
  Using cached h5py-2.10.0.tar.gz (301 kB)
ERROR: Could not find a version that satisfies the requirement tensorflow<2.5.0,>=2.3.0rc2 (from ocrd-calamari) (from versions: 2.5.0rc0, 2.5.0rc1, 2.5.0rc2, 2.5.0rc3, 2.5.0, 2.6.0rc0)
ERROR: No matching distribution found for tensorflow<2.5.0,>=2.3.0rc2

Related to #64 and possibly #61.

Build (and Test) Docker container according to specs

Provide a more useful usage example

README should provide a more useful and ready-to-run usage example, including:

A downloadable test workspace
compatible binarization
compatible segmentation
use the GT4HistOCR model

Fix CircleCI tests

https://circleci.com/gh/OCR-D/ocrd_calamari/65 fails because https://file.spk-berlin.de:8443/ is offline.

Adjust word segmentation to the expectations of the OCR-D validator

As @bertsky suggested in #9:

Should not be too difficult – just use sentence.split() instead of uniseg.wordbreak.words(sentence).

Split on spaces
Add test to test for word and glyph segmentation conforming to the OCR-D PAGE specs
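
The whitespace-based splitting suggested above can be sketched as follows, assuming each glyph is available as a character with a horizontal extent. The Glyph and Word classes and the split_words function are hypothetical illustrations, not part of ocrd_calamari or Calamari. The sketch also shows the property such a test would check: concatenating the glyphs of a word gives exactly the word text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    # Hypothetical stand-in for one recognized character and its x-range.
    char: str
    x_start: int
    x_end: int


@dataclass
class Word:
    text: str
    glyphs: List[Glyph]

    @property
    def x_start(self) -> int:
        return min(g.x_start for g in self.glyphs)

    @property
    def x_end(self) -> int:
        return max(g.x_end for g in self.glyphs)


def split_words(glyphs: List[Glyph]) -> List[Word]:
    """Group glyphs into words, breaking on whitespace only."""
    words, current = [], []
    for glyph in glyphs:
        if glyph.char.isspace():
            # A space ends the current word; the space itself is dropped.
            if current:
                words.append(Word("".join(g.char for g in current), current))
                current = []
        else:
            current.append(glyph)
    if current:
        words.append(Word("".join(g.char for g in current), current))
    return words


# Made-up example line: every character is 10 pixels wide.
line = "Tit. 9."
glyphs = [Glyph(c, 10 * i, 10 * i + 9) for i, c in enumerate(line)]
words = split_words(glyphs)
```

Because punctuation stays attached to its word, the result matches line.split(), which is what distinguishes this approach from a Unicode word-break algorithm.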

Review tests

Remove existing TextEquivs

TextEquivs not generated by this processor should be removed. This already broke a test here because we are using GT segmentation with GT text. While ocrd_calamari was overwriting line and region texts, the words were GT and the test wrongly asserted success.
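
A minimal sketch of such a clean-up, working directly on PAGE XML with Python's standard library. The function name remove_text_equivs is hypothetical, the namespace assumes the 2019-07-15 PAGE schema, and the processor itself presumably works on parsed PAGE objects rather than raw XML, so this only illustrates the idea.

```python
import xml.etree.ElementTree as ET

# Assumption: the document uses the 2019-07-15 PAGE namespace.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"


def remove_text_equivs(root: ET.Element) -> int:
    """Remove every TextEquiv element below root; return how many were removed."""
    tag = f"{{{PAGE_NS}}}TextEquiv"
    removed = 0
    # Materialize the traversal first so removing children is safe.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == tag:
                parent.remove(child)
                removed += 1
    return removed


sample = f"""
<PcGts xmlns="{PAGE_NS}"><Page><TextRegion id="r1"><TextLine id="l1">
<Word id="w1"><TextEquiv><Unicode>GT</Unicode></TextEquiv></Word>
<TextEquiv><Unicode>GT text</Unicode></TextEquiv>
</TextLine></TextRegion></Page></PcGts>
""".strip()

root = ET.fromstring(sample)
removed = remove_text_equivs(root)
```

After this step only text written by the recognizer remains, so a test can no longer pass on leftover ground-truth words.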

Using the ocrd/all:maximum-cuda Docker image: No supported devices found for platform CUDA

I am using ocrd-calamari-recognize in my workflow using the ocrd/all:maximum-cuda Docker image. I have NVIDIA driver 455.23.05, CUDA 11.1, and two Tesla T4 GPUs. I can successfully run the following command:

$ docker run --rm --gpus all ocrd/all:maximum-cuda nvidia-smi

However, when I run my workflow, I can see in nvidia-smi that my GPUs are not used by ocrd-calamari-recognize. Do you have any idea why that could be?

Handle empty images gracefully

OK, I've installed OCR-D for the first time, it worked in most parts out of the box and I was able to reproduce the problem. Your errors seem to be caused by OCR-D processors, not by calamari.
Somehow the line segmentation produces empty lines or lines that are outside of text regions. When the empty images are converted to numpy (by ocrd_calamari, not by calamari), numpy throws an uncaught exception. You could fix it by inserting before line 77 in ocrd_calamari/recognize.py something like line_image = line_image if all(line_image.size) else [[0]], but that's only a temporary hack to avoid the error. I'm also not sure if their workspace.image_from_segment or even the line segmentation processor is supposed to produce empty lines at all, so maybe the real problem is somewhere deeper in the guts of the OCR-D machinery.

Originally posted by @andbue in Calamari-OCR/calamari#193 (comment)
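
One way to avoid the [[0]] hack is to filter out empty line images before prediction and log them. A sketch, assuming each line image exposes a width and a height; LineImage and partition_lines are hypothetical names, not part of ocrd_calamari.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class LineImage:
    # Hypothetical stand-in for a cropped text line image.
    line_id: str
    width: int
    height: int


def partition_lines(
    lines: Iterable[LineImage],
) -> Tuple[List[LineImage], List[LineImage]]:
    """Split lines into (usable, empty); empty means zero width or height."""
    usable, empty = [], []
    for line in lines:
        if line.width > 0 and line.height > 0:
            usable.append(line)
        else:
            empty.append(line)
    return usable, empty


lines = [LineImage("l1", 640, 48), LineImage("l2", 0, 48), LineImage("l3", 120, 0)]
usable, empty = partition_lines(lines)
for line in empty:
    # Report skipped lines instead of failing later during array conversion.
    print(f"Skipping empty line image {line.line_id!r}")
```

Only the usable lines would then be passed on to the recognizer. How the skipped lines should appear in the output is a separate decision.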

Fix word coordinates when using textequiv_level = "word"

Hi all,

When using ocrd-calamari-recognize with textequiv_level word, the pc:Word elements appear to have wrong y-coordinates in their Coords. It looks like all words are lowered to the bottom of the text region they belong to.

For instance, when drawing the line polygons, the coords are right:

But when drawing the word polygons, the coords are wrong:

I am using cv2 to draw the polygons, but I double-checked in the PAGE XML file, and the words of a text region (sometimes of the entire page) all have the same y-coordinates.

Here is the entire code used to generate the OCR :

docker run --rm -u $(id -u) -v $PWD:/data -w /data -- ocrd/all:maximum ocrd process \
  "cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN" \
  "anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP" \
  "skimage-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P method li" \
  "skimage-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -P level-of-operation page" \
  "tesserocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW -P operation_level page" \
  "cis-ocropy-segment -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG -P level-of-operation page" \
  "cis-ocropy-dewarp -I OCR-D-SEG -O OCR-D-SEG-LINE-RESEG-DEWARP" \
  "calamari-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR -P checkpoint /data/calamari_model/\*.ckpt.json -P textequiv_level word"

License template not completely filled out

   Copyright [yyyy] [name of copyright owner]

Test Calamari 1.0.x

Slightly worse line text results

The way we do the line text prediction now (https://github.com/OCR-D/ocrd_calamari/blob/master/ocrd_calamari/recognize.py#L96-L128) we get slightly worse (and different) results than when using simply the Calamari prediction.

We should investigate the issue here, as it seems related: https://github.com/OCR-D/ocrd_calamari/blob/master/ocrd_calamari/recognize.py#L99

-P checkpoint_dir does not seem to work

ocrd-calamari-recognize -P checkpoint_dir qurator-gt4histocr-1.0 -I OCR-D-SEG-LINE-SBB -O OCR-D-OCR-CALA

with a model in qurator-gt4histocr-1.0 does not seem to work:

resolve_resource - Could not find resource 'qurator-gt4histocr-1.0' for executable 'ocrd-calamari-recognize'. Try 'ocrd resmgr download ocrd-calamari-recognize qurator-gt4histocr-1.0' to download this resource.

I had some confusion due to the behavior of resolve_resource (OCR-D/core#727), so I have to test again whether this is indeed the case.

dependencies correct?

Is it possible that the range of calamari-ocr and tensorflow-gpu versions supported by this repo is too broad?

With Tensorflow 2.0 I get on Calamari 0.3.1:

File "ocrd_calamari/recognize.py", line 31, in _init_calamari
    self.predictor = MultiPredictor(checkpoints=checkpoints)
  File "calamari_ocr/ocr/predictor.py", line 203, in __init__
    data_preproc=data_preproc, processes=processes) for cp in checkpoints]
  File "calamari_ocr/ocr/predictor.py", line 203, in <listcomp>
    data_preproc=data_preproc, processes=processes) for cp in checkpoints]
  File "calamari_ocr/ocr/predictor.py", line 100, in __init__
    ckpt = Checkpoint(checkpoint, auto_update=self.auto_update_checkpoints)
  File "calamari_ocr/ocr/checkpoint.py", line 35, in __init__
    self.checkpoint = json_format.Parse(f.read(), CheckpointParams())
  File "protobuf/json_format.py", line 406, in Parse
    return ParseDict(js, message, ignore_unknown_fields)
  File "protobuf/json_format.py", line 421, in ParseDict
    parser.ConvertMessage(js_dict, message)
  File "protobuf/json_format.py", line 452, in ConvertMessage
    self._ConvertFieldValuePair(value, message)
  File "protobuf/json_format.py", line 552, in _ConvertFieldValuePair
    raise ParseError('Failed to parse {0} field: {1}'.format(name, e))
google.protobuf.json_format.ParseError: Failed to parse model field: Failed to parse network field: Failed to parse backend field: Message type "BackendParams" has no field named "shuffleBufferSize".
 Available Fields(except extensions): <MessageFields sequence>

... and on Calamari 0.3.5:

File "ocrd_calamari/recognize.py", line 31, in _init_calamari
    self.predictor = MultiPredictor(checkpoints=checkpoints)
  File "calamari_ocr/ocr/predictor.py", line 220, in __init__
    data_preproc=data_preproc, processes=processes) for cp in checkpoints]
  File "calamari_ocr/ocr/predictor.py", line 220, in <listcomp>
    data_preproc=data_preproc, processes=processes) for cp in checkpoints]
  File "calamari_ocr/ocr/predictor.py", line 106, in __init__
    backend = create_backend_from_proto(self.network_params, restore=self.checkpoint, processes=processes)
  File "calamari_ocr/ocr/backends/factory.py", line 28, in create_backend_from_proto
    from calamari_ocr.ocr.backends.tensorflow_backend.tensorflow_backend import TensorflowBackend
  File "calamari_ocr/ocr/backends/tensorflow_backend/tensorflow_backend.py", line 4, in <module>
    from calamari_ocr.ocr.backends.tensorflow_backend.tensorflow_model import TensorflowModel
  File "calamari_ocr/ocr/backends/tensorflow_backend/tensorflow_model.py", line 3, in <module>
    import tensorflow.contrib.cudnn_rnn as cudnn_rnn
  File "tensorflow/contrib/__init__.py", line 33, in <module>
    from tensorflow.contrib import cudnn_rnn
  File "tensorflow/contrib/cudnn_rnn/__init__.py", line 34, in <module>
    from tensorflow.contrib.cudnn_rnn.python.layers import *
  File "tensorflow/contrib/cudnn_rnn/python/layers/__init__.py", line 23, in <module>
    from tensorflow.contrib.cudnn_rnn.python.layers.cudnn_rnn import *
  File "tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 20, in <module>
    from tensorflow.contrib.cudnn_rnn.python.ops import cudnn_rnn_ops
  File "tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 20, in <module>
    from tensorflow.contrib.eager.python import checkpointable_utils
  File "tensorflow/contrib/eager/python/checkpointable_utils.py", line 38, in <module>
    from tensorflow.python.training import checkpointable as core_checkpointable
ImportError: cannot import name 'checkpointable'

...when trying to recognize with the checkpoint files provided by @mikegerber

Glyph segmentation produces invalid results

<report valid="false">
  <error>INCONSISTENCY in Word ID 'l5_word0000' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'ſecund.' != concatenated 'ſecund'</error>
  <error>INCONSISTENCY in Word ID 'l5_word0001' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'Fatin.' != concatenated 'Fatin'</error>
  <error>INCONSISTENCY in Word ID 'l5_word0002' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'Tit.' != concatenated 'Tit'</error>
  <error>INCONSISTENCY in Word ID 'l5_word0003' of file 'OCR-D-OCR-CALAMARI_00000024': text results '9.' != concatenated '9'</error>
  <error>INCONSISTENCY in Word ID 'l5_word0004' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'qu.' != concatenated 'qu'</error>
  <error>INCONSISTENCY in Word ID 'l5_word0005' of file 'OCR-D-OCR-CALAMARI_00000024': text results '6.' != concatenated '6'</error>
  <error>INCONSISTENCY in Word ID 'l5_word0006' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'p.' != concatenated 'p'</error>
  <error>INCONSISTENCY in Word ID 'l5_word0008' of file 'OCR-D-OCR-CALAMARI_00000024': text results '320.' != concatenated '320'</error>
  <error>INCONSISTENCY in Word ID 'l7_word0000' of file 'OCR-D-OCR-CALAMARI_00000024': text results '.35)' != concatenated '.35'</error>
  <error>INCONSISTENCY in Word ID 'l7_word0001' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'So' != concatenated 'S'</error>
  <error>INCONSISTENCY in Word ID 'l7_word0002' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'viel' != concatenated 'vie'</error>
  <error>INCONSISTENCY in Word ID 'l7_word0003' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'die' != concatenated 'di'</error>
  <error>INCONSISTENCY in Word ID 'l7_word0004' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'von' != concatenated 'vo'</error>
  <error>INCONSISTENCY in Word ID 'l7_word0005' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'der' != concatenated 'de'</error>
  <error>INCONSISTENCY in Word ID 'l7_word0006' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'Inquiſitin' != concatenated 'Inquiſiti'</error>
  <error>INCONSISTENCY in Word ID 'l24_word0000' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'Der' != concatenated 'D'</error>
  <error>INCONSISTENCY in Word ID 'l24_word0001' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'Schnitheiß' != concatenated 'rSchnithe'</error>
  <error>INCONSISTENCY in Word ID 'l24_word0002' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'zu' != concatenated 'ß'</error>
  <error>INCONSISTENCY in Word ID 'l24_word0003' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'Oberrod,' != concatenated 'uOberro'</error>
  <error>INCONSISTENCY in Word ID 'l24_word0004' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'der' != concatenated ',d'</error>
  <error>INCONSISTENCY in Word ID 'l24_word0005' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'Wirth' != concatenated 'rWir'</error>
  <error>INCONSISTENCY in Word ID 'l24_word0006' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'Krebs' != concatenated 'hKre'</error>
  <error>INCONSISTENCY in Word ID 'l24_word0007' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'und' != concatenated 'su'</error>
  <error>INCONSISTENCY in Word ID 'l24_word0008' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'Hr.' != concatenated 'dH'</error>
  <error>INCONSISTENCY in Word ID 'l24_word0009' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'Notarius' != concatenated '.Notari'</error>
  <error>INCONSISTENCY in Word ID 'l24_word0010' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'Tribert' != concatenated 'sTribe'</error>
  <error>INCONSISTENCY in Word ID 'l24_word0011' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'ſind' != concatenated 'tſi'</error>
  <error>INCONSISTENCY in Word ID 'l24_word0012' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'bereits' != concatenated 'dberei'</error>
  <error>INCONSISTENCY in Word ID 'l26_word0000' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'Die' != concatenated 'Di'</error>
  <error>INCONSISTENCY in Word ID 'l42_word0001' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'auf' != concatenated 'a'</error>
  <error>INCONSISTENCY in Word ID 'l42_word0002' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'dieſem' != concatenated 'fdieſ'</error>
  <error>INCONSISTENCY in Word ID 'l42_word0003' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'Fall' != concatenated 'mFa'</error>
  <error>INCONSISTENCY in Word ID 'l42_word0004' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'inioid.' != concatenated 'linioi'</error>
  <error>INCONSISTENCY in Word ID 'l42_word0005' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'Cr.' != concatenated '.C'</error>
  <error>INCONSISTENCY in Word ID 'l42_word0006' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'art.' != concatenated '.ar'</error>
  <error>INCONSISTENCY in Word ID 'l42_word0007' of file 'OCR-D-OCR-CALAMARI_00000024': text results '12.' != concatenated '.1'</error>
  <error>INCONSISTENCY in Word ID 'l42_word0008' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'vom' != concatenated '.v'</error>
  <error>INCONSISTENCY in Word ID 'l42_word0009' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'peinlichen' != concatenated 'mpeinlich'</error>
  <error>INCONSISTENCY in Word ID 'l42_word0010' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'Klaͤger' != concatenated 'nKlaͤg'</error>
  <error>INCONSISTENCY in Word ID 'l42_word0011' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'erforderte' != concatenated 'rerforder'</error>
  <error>INCONSISTENCY in Word ID 'l61_word0005' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'ad' != concatenated 'a'</error>
  <error>INCONSISTENCY in Word ID 'l61_word0006' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'L.' != concatenated 'L'</error>
  <error>INCONSISTENCY in Word ID 'l61_word0007' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'Corn.' != concatenated 'Corn'</error>
  <error>INCONSISTENCY in Word ID 'l61_word0008' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'de' != concatenated 'd'</error>
  <error>INCONSISTENCY in Word ID 'l61_word0009' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'fali.' != concatenated 'fali'</error>
</report>

On some documents, the new glyph segmentation produces invalid results. ~~The problem seems to be wrong glyph positions from the Calamari engine.~~

Reproducer:

#!/bin/bash
set -e
export TF_FORCE_GPU_ALLOW_GROWTH=true

cd `mktemp -d`
wget https://qurator-data.de/examples/actevedef_718448162.first-page+binarization+segmentation.zip
unzip actevedef_718448162.first-page+binarization+segmentation
cd actevedef_718448162.first-page+binarization+segmentation

ocrd workspace remove-group -rf OCR-D-OCR-CALAMARI
ocrd-calamari-recognize \
  -p '{ "checkpoint": "/home/mike/devel/ocrd_calamari/gt4histocr-calamari/*.ckpt.json", "textequiv_level": "glyph" }' \
  -I OCR-D-SEG-LINE-SBB -O OCR-D-OCR-CALAMARI

ocrd workspace validate --skip dimension --skip pixel_density --page-coordinate-consistency off

Calamari 2.2

Calamari 2.0 is out.

I don't see benefits from updating the dependency, other than staying up to date and compatible.

Search path for model files

@bertsky in #6:

Moreover, I think it would be useful to not rely on the CWD for relative paths, because this is not reliable across the many layers (e.g. a script which calls ocrd-calamari-recognize which calls calamari.ocr.MultiPredictor). Instead, like TESSDATA for Tesseract, one could define an installation prefix via setuptools (overridable via environment variable), or simply use os.path.dirname(os.path.abspath(__file__)) as reference, i.e. the directory where ocrd_calamari is installed. Absolute pathnames should stay untouched, however.
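
The lookup order proposed here could look roughly like the following sketch. The environment variable name OCRD_CALAMARI_MODELS and the function resolve_model_path are made up for illustration; ocrd_calamari does not define them. In a real module, install_dir would be os.path.dirname(os.path.abspath(__file__)).

```python
import os
from typing import Mapping, Optional

# Hypothetical variable name, not defined by ocrd_calamari or OCR-D.
ENV_VAR = "OCRD_CALAMARI_MODELS"


def resolve_model_path(
    path: str,
    install_dir: str,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Resolve a model path without depending on the current directory."""
    env = os.environ if env is None else env
    # Absolute paths stay untouched.
    if os.path.isabs(path):
        return path
    # Relative paths resolve against the override, else the install directory.
    base = env.get(ENV_VAR, install_dir)
    return os.path.join(base, path)


absolute = os.path.abspath("models")
from_install = resolve_model_path("gt4histocr", "/opt/ocrd_calamari", {})
from_env = resolve_model_path(
    "gt4histocr", "/opt/ocrd_calamari", {ENV_VAR: "/srv/models"}
)
unchanged = resolve_model_path(absolute, "/opt/ocrd_calamari", {})
```

Passing the environment as a parameter keeps the function easy to test, while the default still reads the real process environment.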

Tests broken since last update

Since the last update, the tests are broken:

------------------------------------------------------------------- Captured stderr call --------------------------------------------------------------------
11:00:07.844 INFO processor.CalamariRecognize - INPUT FILE 0 / phys_0001
--------------------------------------------------------------------- Captured log call ---------------------------------------------------------------------
INFO     processor.CalamariRecognize:recognize.py:81 INPUT FILE 0 / phys_0001
================================================================== short test summary info ==================================================================
FAILED test/test_recognize.py::test_recognize - requests.exceptions.MissingSchema: Invalid URL 'OCR-D-IMG/INPUT_0017.tif': No scheme supplied. Perhaps you...
FAILED test/test_recognize.py::test_recognize_should_warn_if_given_rgb_image_and_single_channel_model - requests.exceptions.MissingSchema: Invalid URL 'OC...
FAILED test/test_recognize.py::test_word_segmentation - requests.exceptions.MissingSchema: Invalid URL 'OCR-D-IMG/INPUT_0017.tif': No scheme supplied. Per...
FAILED test/test_recognize.py::test_glyphs - requests.exceptions.MissingSchema: Invalid URL 'OCR-D-IMG/INPUT_0017.tif': No scheme supplied. Perhaps you me...
==================================================================== 4 failed in 16.04s =====================================================================
make: *** [Makefile:77: test] Error 1

Observations:

The new code from @bertsky's change in 1f0252d should download OCR-D-IMG/INPUT_0017.tif but doesn't:

% ls /tmp/test-ocrd-calamari/OCR-D-IMG 
OCR-D-IMG_0001.tif  OCR-D-IMG_0002.tif

--version prints None

$ ocrd-calamari-recognize --version
Version None, ocrd/core 2.0.1

Add results below the line level

Other OCR processors like Tesseract and Ocropy go out of their way to provide annotation on the Word or even Glyph level if requested. (They use the line result with precise character-wise confidences and relative positions for that.)

I know this will be a bit harder for Calamari with its text post-processing and voting capabilities. But IMO this is worth the effort.

Decide how to handle textline without text

@kba showed us this pc:TextLine from an ocrd_calamari output:

<pc:TextLine id="l2">                                     
    <pc:Coords points="302,655 532,653 533,728 302,732"/> 
    <pc:TextEquiv conf="0.">                              
        <pc:Unicode></pc:Unicode>                         
    </pc:TextEquiv>                                       
</pc:TextLine>

Because this raises the issue of how to handle this in a subsequent transformation to ALTO - which requires text - we should think about how to handle this.

Pro removing the line:

No problem with ALTO

Con removing the line:

The line disappears and users can't just compare layout detection output with OCR output (i.e. "In this line no text was detected")

I'm leaning towards keeping the line and letting the ALTO conversion handle it. But I'll check if ocrd workspace validate considers a line with empty text valid output (I hope it does, I see no reason why it should be invalid).

@kba What are your thoughts on this?

Include default output format in documentation

The documentation should mention that the default output format is PAGE-XML.

Test calamari_models

I have never ever gotten a result from the official models at https://github.com/Calamari-OCR/calamari_models.

Remove them from Makefile and README.
Test them again

Include alternative predictions

Calamari provides alternative predictions for characters. These should be included in the PAGE output.

See also #17 (extended prediction data) and #9 (results below line level)

Failing tests due to incompatible Pillow upgrade

https://app.circleci.com/pipelines/github/OCR-D/ocrd_calamari/175/workflows/945cc406-80f6-41b6-a24b-ff58f1a12c78/jobs/166

18:00:35.751 INFO processor.CalamariRecognize - INPUT FILE 0 / phys_0001
18:00:35.793 INFO ocrd.workspace.image_from_page - Cropping original image for page 'phys_0001'
18:00:35.815 INFO processor.CalamariRecognize - About to recognize 1 lines of region 'r_1_1'
18:00:35.817 WARNING processor.CalamariRecognize - Using raw image for line 'tl_1' in region 'r_1_1'
------------------------------ Captured log call -------------------------------
INFO     processor.CalamariRecognize:recognize.py:76 INPUT FILE 0 / phys_0001
INFO     ocrd.workspace.image_from_page:workspace.py:968 Cropping original image for page 'phys_0001'
INFO     processor.CalamariRecognize:recognize.py:88 About to recognize 1 lines of region 'r_1_1'
WARNING  processor.CalamariRecognize:recognize.py:100 Using raw image for line 'tl_1' in region 'r_1_1'
=========================== short test summary info ============================
FAILED test/test_recognize.py::test_recognize - TypeError: __array__() takes ...
FAILED test/test_recognize.py::test_recognize_with_checkpoint_dir - TypeError...
FAILED test/test_recognize.py::test_recognize_should_warn_if_given_rgb_image_and_single_channel_model
FAILED test/test_recognize.py::test_word_segmentation - TypeError: __array__(...
FAILED test/test_recognize.py::test_glyphs - TypeError: __array__() takes 1 p...
============================== 5 failed in 46.54s ==============================
Makefile:77: recipe for target 'test' failed
make[1]: *** [test] Error 1
make[1]: Leaving directory '/root/project'
Makefile:82: recipe for target 'coverage' failed
make: *** [coverage] Error 2

Test on Python 3.11

Python 3.11 is scheduled to be released in 2022-10, and Fedora Linux 37 is going to be released with it. So we should test on it.

Documentation: README completeness, debug ocrd-tool.json

Please debug your ocrd-tool.json file.
I found some errors:

<report valid="false">
  <error>[tools.ocrd-calamari-recognize.parameters.checkpoint] 'description' is a required property</error>
  <error>[tools.ocrd-calamari-recognize.parameters.voter] 'description' is a required property</error>
</report>

You can find the ocrd-tool.json documentation: https://ocr-d.github.io/ocrd_tool

Please check your README file and complete it. An ideal README file looks like this:

# Name of application


## Introduction
...

## Installation
...

## Usage
...

## Testing
...

Thank you very much.

Does ocrd_calamari use GPU using ocrd/all:maximum-cuda image?

Reported by @jbarth-ubhd in Gitter (Oct 06 10:50)

calamari-recognize does not run on GPU, even with maximum-cuda

This seems to be concluded from this memory usage graph:

gpu memory usage, red=sbb-binarize, blue=eynollah-segment, green=calamari-recognize

Does not set pcGtsId

% ocrd workspace validate --skip dimension --skip pixel_density --page-strictness lax --page-coordinate-consistency off
<report valid="false">
  <warning>pc:PcGts/@pcGtsId differs from mets:file/@ID: "OCR-D-SEG-LINE_00000024" !== "OCR-D-OCR-CALAMARI_00000024"</warning>
</report>

Missing PAGE processingStep

Review docs regarding glyph and word segmentation

Docs and logging should reflect that glyph and word segmentation are derived from LSTM positions and that those are not suitable for any image-based processing.

Add parameter to control output granularity below textline

ocrd-calamari-recognize now supports #9 (adding words and glyphs from the textline decoder results).

It would be favourable to allow controlling the level of segmentation detail to be added below line, as do ocrd-tesserocr-recognize and ocrd-cis-ocropy-recognize with textequiv_level:

line: do not add further segmentation (e.g. to save computation)
word: add Word elements
glyph: add Word and Glyph elements

NB: In the Tesseract wrapper, that parameter is also responsible for controlling the level of operation (page segmentation mode) if segmentation is already present at the lower levels. For example, with textequiv_level=word, if there are already Word elements below a TextLine, Tesseract only gets images cropped around these words and runs in PSM.SINGLE_WORD instead of PSM.SINGLE_LINE. IMHO there is no need/use in attempting something like this for engines that do not natively provide modes other than textline processing.
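
The three levels listed above map to a small lookup. This is only an illustrative sketch: the function name elements_below_line is hypothetical, and the element names are taken from the list in this issue.

```python
from typing import List

LEVELS = ("line", "word", "glyph")


def elements_below_line(textequiv_level: str) -> List[str]:
    """Return the PAGE element types to add below each TextLine."""
    if textequiv_level not in LEVELS:
        raise ValueError(
            f"textequiv_level must be one of {LEVELS}, got {textequiv_level!r}"
        )
    # Each level includes the elements of the coarser levels before it.
    return {
        "line": [],
        "word": ["Word"],
        "glyph": ["Word", "Glyph"],
    }[textequiv_level]


for level in LEVELS:
    print(level, elements_below_line(level))
```

Rejecting unknown values early gives users a clear message instead of silently falling back to one of the levels.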

ValueError: Error when checking input: expected input_1 to have shape (..., ..., ...) but got array with shape (..., ..., ...)

ValueError: Error when checking input: expected input_1 to have shape (448, 896, 3) but got array with shape (448, 4, 3)

https://digi.ub.uni-heidelberg.de/diglitData/v/ocrd/lichtwark1932bd2_-_h.tif

workflow:

ocrd-sbb-binarize -I OCR-D-IMG -O OCR-D-001 -P model $HOME/ocrd_models/sbb/binarization/models
ocrd-cis-ocropy-deskew -I OCR-D-001 -O OCR-D-002
ocrd-sbb-textline-detector -I OCR-D-002 -O OCR-D-003 -P model $HOME/ocrd_models/sbb/textline
ocrd-calamari-recognize -I OCR-D-003 -O OCR-D-OCR -P checkpoint "$HOME/ocrd_models/calamari/calamari_models/gt4histocr/*.ckpt.json"

ocrd-calamari-recognize loses trained formatting characters

I have a model trained to represent italics and bold face with opening and closing characters: angle brackets "<", ">" for italics and curly brackets "{", "}" for bold face. For example, an italic "this" would be recognized as <this>. A similar case is https://github.com/poke1024/origami_models. Problems arise when a line ends in italics or bold face. In these cases the recognition result should include the closing character, > or }, before the line break, but these characters do not show up. Other positions are not affected: at the beginning of a line and inside a line, italics and bold face are successfully recognized with their opening and closing characters. The problem also does not appear when using calamari-predict from calamari-ocr with the same model, and this calamari-ocr is the same installation used in the ocrd venv in question. That is, recognizing the same files with calamari-predict gives correct closing characters at the end of the line.

@bertsky has suggested a reason for this behaviour in a conversation in the OCR-D Lobby, pointing to the part of the code that may be responsible.

ocrd==2.23.0
calamari-ocr==1.0.5
ocrd_calamari==1.0.2
tensorflow==2.3.0

I have used the same method previously without this problem, but perhaps with some other versions of calamari-ocr and ocrd_calamari.

Review preprocessing of text lines

Private email from @andbue to @kba, copied with permission:

What I also find concerning is that the line images do not run through the standard MultiDataProcessor. I don't fully grasp everything workspace.image_from_segment does, but Calamari scales, normalizes, pads (16px white) and runs the data through a CenterNormalizer, as in good old Ocropus. In my own experience, the output is only optimal when prediction uses the same preprocessing as training. As I said, I don't have a full picture of image_from_segment right now, but perhaps you should take a look at it. As an example of how the standard preprocessor could be integrated, here is a link to my code from the client:
Instantiation of the DataPreprocessor (lines 426-436):

https://github.com/andbue/nashi/blob/dd533d193264472a4cfc96aab69fadd9ca52732c/ocr/nashi_ocr/nashi_client.py#L426

Usage:

https://github.com/andbue/nashi/blob/dd533d193264472a4cfc96aab69fadd9ca52732c/ocr/nashi_ocr/nashi_client.py#L211

Recognize more than one line at a time

Calamari segfaults

Using the ocrd/maximum docker image from 5 days ago (2020-09-18, 9165ddaf96bc), I am receiving a segfault when running ocrd-calamari-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR -p checkpoint /models/\*.ckpt.json, where the models are those from the model.tar.xz archive as suggested by the OCR-D project web site. Only 5 xml files are produced before the segfault, so I am assuming the issue is with the sixth image (OCR-D-SEG-LINE-RESEG-DEWARP_f103.xml) and its line segments. Would you like me to submit the problematic line segments for reproduction?

I noticed that a new ocrd/maximum image has been published a day later. Do you suppose the changes may affect calamari?

Check prediction runtime performance

Last time I checked, ocrd_calamari based on Calamari 1 was 20% slower than ocrd_calamari based on Calamari 0.3.5. This should be checked again.

Configurable cut-off confidence value for alternative glyph predictions

Currently, we include all character predictions for a position, as returned by Calamari. In some cases, the engine returns over 20 possibilities, most of them with very low probability. As this needlessly bloats the PAGE output, there should be a configurable cut-off value with a reasonable default, e.g. 1e-04.

As suggested by @bertsky in #9 (comment).
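
A minimal sketch of such a cut-off, assuming the alternatives for one glyph position are available as (character, probability) pairs. The function name filter_alternatives and the sample probabilities are made up for illustration; the default mirrors the 1e-04 suggested above.

```python
from typing import List, Tuple

DEFAULT_CUTOFF = 1e-4  # default suggested in this issue

Alternative = Tuple[str, float]


def filter_alternatives(
    alternatives: List[Alternative], cutoff: float = DEFAULT_CUTOFF
) -> List[Alternative]:
    """Keep alternatives at or above the cut-off, most probable first."""
    kept = [(char, prob) for char, prob in alternatives if prob >= cutoff]
    return sorted(kept, key=lambda alt: alt[1], reverse=True)


# Made-up predictions for a single glyph position.
alternatives = [("c", 0.05), ("e", 0.93), ("o", 0.00003), ("a", 1e-4)]
kept = filter_alternatives(alternatives)
strict = filter_alternatives(alternatives, cutoff=0.5)
```

With the default, the very unlikely "o" is dropped while the other three remain. A stricter cut-off of 0.5 keeps only the top prediction.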

Fix or disable Docker build

Docker builds are failing, and the "Details" links lead to a 404 error page.

Fix the Docker build
Do not use calamari/build make target, we are using the version from PyPI
~~Check if this Docker build is useful (not using it personally)~~ (→ #36)
Fix the build on dockercloud?

Support Calamari's "extended prediction data" output

This could be useful to:

provide more than the best result (postcorrection!)
provide word or even glyph segmentation (see also #9)

Support single model prediction

Currently, the processor only supports a prediction using confidence voting of multiple models. While this is superior, it makes sense to support single model prediction, too.

Test CPU support

Fix CircleCI tests (again...)

On CircleCI, neither tensorflow-gpu==1.15.2 nor tensorflow==1.15.x can be installed:

Collecting tensorflow==1.15.* (from ocrd-calamari==0.0.3)
  Could not find a version that satisfies the requirement tensorflow==1.15.* (from ocrd-calamari==0.0.3) (from versions: 0.12.1, 1.0.0, 1.0.1, 1.1.0rc0, 1.1.0rc1, 1.1.0rc2, 1.1.0, 1.2.0rc0, 1.2.0rc1, 1.2.0rc2, 1.2.0, 1.2.1, 1.3.0rc0, 1.3.0rc1, 1.3.0rc2, 1.3.0, 1.4.0rc0, 1.4.0rc1, 1.4.0, 1.4.1, 1.5.0rc0, 1.5.0rc1, 1.5.0, 1.5.1, 1.6.0rc0, 1.6.0rc1, 1.6.0, 1.7.0rc0, 1.7.0rc1, 1.7.0, 1.7.1, 1.8.0rc0, 1.8.0rc1, 1.8.0, 1.9.0rc0, 1.9.0rc1, 1.9.0rc2, 1.9.0, 1.10.0rc0, 1.10.0rc1, 1.10.0, 1.10.1, 1.11.0rc0, 1.11.0rc1, 1.11.0rc2, 1.11.0, 1.12.0rc0, 1.12.0rc1, 1.12.0rc2, 1.12.0, 1.12.2, 1.12.3, 1.13.0rc0, 1.13.0rc1, 1.13.0rc2, 1.13.1, 1.13.2, 1.14.0rc0, 1.14.0rc1, 1.14.0, 2.0.0a0, 2.0.0b0, 2.0.0b1)
No matching distribution found for tensorflow==1.15.* (from ocrd-calamari==0.0.3)
Makefile:34: recipe for target 'install' failed
make: *** [install] Error 1

Full log: https://circleci.com/gh/OCR-D/ocrd_calamari/77

The install works on my local machine, using a virtualenv.

One thing to try would be updating pip itself.

Provide more useful info than "No checkpoints provided."

From one of @kba's conversations in Gitter:

"checkpoint": "/home/najem/testa/calamari_model/*.ckpt.json"
Exception: No checkpoints provided.

I see this error from time to time, and it would be a lot clearer for users if we could help with messages like:

"File /home/najem/testa/calamari_model/*.ckpt.json not found"
"/home/najem/testa/calamari_model/*.ckpt.json" are Calamari 0.3 models, we need Calamari 1.0 models

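The first of these messages could be produced by checking the glob pattern before handing it to the predictor. A sketch using only the standard library; the function name find_checkpoints and the message wording are hypothetical. Detecting Calamari 0.3 models would need knowledge of the checkpoint format and is not covered here.

```python
import glob
import os
import tempfile
from typing import List


def find_checkpoints(pattern: str) -> List[str]:
    """Expand a checkpoint glob pattern, failing with a specific message."""
    matches = sorted(glob.glob(pattern))
    if matches:
        return matches
    directory = os.path.dirname(pattern) or "."
    if not os.path.isdir(directory):
        raise FileNotFoundError(
            f"Checkpoint directory {directory!r} does not exist"
        )
    raise FileNotFoundError(f"No files match checkpoint pattern {pattern!r}")


# Demonstration with a temporary directory holding one empty checkpoint file.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "0.ckpt.json"), "w").close()
    found = find_checkpoints(os.path.join(tmp, "*.ckpt.json"))
    try:
        find_checkpoints(os.path.join(tmp, "missing", "*.ckpt.json"))
    except FileNotFoundError as exc:
        message = str(exc)
```

Separating "directory missing" from "nothing matched" tells the user whether the path or the file pattern is wrong.
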
Word/glyph segmentation coordinates are broken

There is a cluster of bounding boxes elsewhere on the page, so the coordinates are probably broken.

ocrd_calamari PAGE-XML output not valid PAGE-XML?

It appears the PAGE-XML output produced by ocrd_calamari is not valid PAGE-XML. At least the files do not open in Aletheia, the official PRImA tool for working with PAGE-XML files (tested with this output file from ocrd_calamari and Aletheia v4.0).