
ocrd_calamari's Introduction

ocrd_calamari

Recognize text using Calamari OCR.


Introduction

ocrd_calamari offers an OCR-D-compliant workspace processor for the functionality of Calamari OCR. It uses OCR-D workspaces (METS) with PAGE XML documents as input and output.

This processor only operates on the text line level and so needs a line segmentation (and by extension a binarized image) as its input.

In addition to the line text it may also output word and glyph segmentation including per-glyph confidence values and per-glyph alternative predictions as provided by the Calamari OCR engine, using a textequiv_level of word or glyph. Note that while Calamari does not provide word segmentation, this processor produces word segmentation inferred from text segmentation and the glyph positions. The provided glyph and word segmentation can be used for text extraction and highlighting, but is probably not useful for further image-based processing.
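
To illustrate how word segmentation can be inferred from glyph positions, here is a minimal sketch. It is not the processor's actual implementation: the `Glyph` and `Word` classes and the `words_from_glyphs` function are hypothetical names, and it assumes that each glyph carries its horizontal start and end position within the line image and that words are separated by space characters.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    char: str
    x_start: int  # left edge in line-image pixels (assumed representation)
    x_end: int    # right edge in line-image pixels


@dataclass
class Word:
    text: str
    x_start: int
    x_end: int


def words_from_glyphs(glyphs: List[Glyph]) -> List[Word]:
    """Split a line's glyphs at spaces; a word spans its first to last glyph."""
    words: List[Word] = []
    current: List[Glyph] = []
    # A sentinel space at the end flushes the last word.
    for glyph in list(glyphs) + [Glyph(" ", 0, 0)]:
        if glyph.char == " ":
            if current:
                words.append(Word(
                    text="".join(g.char for g in current),
                    x_start=current[0].x_start,
                    x_end=current[-1].x_end,
                ))
                current = []
        else:
            current.append(glyph)
    return words
```

Because the word boxes are derived purely from glyph positions reported by the recognizer, they are only as precise as those positions, which is why they suit text extraction and highlighting better than further image processing.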

Example output as viewed in PAGE Viewer

Installation

From PyPI

pip install ocrd_calamari

From the git repository

pip install .

Install models

Download models trained on GT4HistOCR data:

make qurator-gt4histocr-1.0
ls .local/share/ocrd-resources/ocrd-calamari-recognize/*

Manual download: model.tar.xz

Example Usage

Before using ocrd-calamari-recognize, get some example data and a model:

# Download model and example data
make qurator-gt4histocr-1.0
make example

The example already contains a binarized and line-segmented page, so we are ready to go. Recognize the text using ocrd_calamari and the downloaded model:

cd actevedef_718448162.first-page+binarization+segmentation
ocrd-calamari-recognize \
  -P checkpoint_dir qurator-gt4histocr-1.0 \
  -I OCR-D-SEG-LINE-SBB -O OCR-D-OCR-CALAMARI

You may want to have a look at the ocrd-tool.json descriptions for additional parameters and default values.

Development & Testing

For information regarding development and testing, please see README-DEV.md.

ocrd_calamari's People

Contributors

bertsky, dependabot[bot], kba, mikegerber

ocrd_calamari's Issues

Release v1.0.4

  • checkpoint-dir changes
    • Release info should note the (breaking) change in options
  • #75

Support Python 3.9

In a Python 3.9 venv:

% python --version
Python 3.9.5
% pip install -e .
Obtaining file:///home/mike/devel/ocrd_calamari
Collecting h5py<3
  Using cached h5py-2.10.0.tar.gz (301 kB)
ERROR: Could not find a version that satisfies the requirement tensorflow<2.5.0,>=2.3.0rc2 (from ocrd-calamari) (from versions: 2.5.0rc0, 2.5.0rc1, 2.5.0rc2, 2.5.0rc3, 2.5.0, 2.6.0rc0)
ERROR: No matching distribution found for tensorflow<2.5.0,>=2.3.0rc2

Related to #64 and possibly #61.

Provide a more useful usage example

README should provide a more useful and ready-to-run usage example, including:

  • A downloadable test workspace
  • compatible binarization
  • compatible segmentation
  • use the GT4HistOCR model

Review tests

  • Run on all targeted Python versions (to avoid surprises with TensorFlow versions)
  • We're working with GT. Make the GT text removal more robust and implement safeguards ("No TextLine in the GT!")
  • I also noticed an issue where I had an old checkout of repo/assets which would NOT trigger test fails but a fresh checkout would
  • test/assets are copied once and are then reused
  • We reuse the temporary workspace which is also potentially a problem
    • Don't reuse the workspace directory name
  • Run scheduled tests (There have been subtle changes, e.g. in OCR-D, that changed the filenames of created files)
  • Make pytest work - currently only make test works
  • Why is the CircleCI result not accessible/private?
  • Consider dropping Python 3.6 support: It's EOL,
    • and tests now spend most time on compiling OpenCV for Python 3.6...

Remove existing TextEquivs

TextEquivs not generated by this processor should be removed. This already broke a test here because we are using GT segmentation with GT text. While ocrd_calamari was overwriting line and region texts, the words were GT and the test wrongly asserted success.
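
A minimal sketch of such a cleanup, using only the Python standard library rather than the OCR-D API: `remove_text_equivs` is a hypothetical helper, and the namespace is assumed to be the 2019-07-15 version of the PAGE schema. It removes every TextEquiv element at every level, so that only text written afterwards by the processor remains.

```python
import xml.etree.ElementTree as ET

# Assumption: the document uses the 2019-07-15 PAGE namespace.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"


def remove_text_equivs(root: ET.Element) -> int:
    """Remove every TextEquiv element below root; return how many were removed."""
    tag = f"{{{PAGE_NS}}}TextEquiv"
    removed = 0
    # Materialize the element list first, because the tree is modified while walking.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == tag:
                parent.remove(child)
                removed += 1
    return removed
```

A test can then assert that every TextEquiv present after recognition was produced by the processor, instead of accidentally comparing against leftover GT text.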

Using the ocrd/all:maximum-cuda Docker image: No supported devices found for platform CUDA

I am using ocrd-calamari-recognize in my workflow using the ocrd/all:maximum-cuda Docker image. I have NVIDIA driver 455.23.05, CUDA 11.1, and two Tesla T4 GPUs. I can successfully run the following command:

$ docker run --rm --gpus all ocrd/all:maximum-cuda nvidia-smi

However, when I run my workflow, I can see in nvidia-smi that my GPUs are not used by ocrd-calamari-recognize. Do you have any idea why that could be?

Handle empty images gracefully

OK, I've installed OCR-d for the first time, it worked in most parts out of the box and I was able to reproduce the problem. Your errors seem to be caused by OCR-d processors, not by calamari.
Somehow the line segmentation produces empty lines or lines that are outside of text regions. When the empty images are converted to numpy (by ocrd_calamari, not by calamari), numpy throws an uncaught exception. You could fix it by inserting before line 77 in ocrd_calamari/recognize.py something like line_image = line_image if all(line_image.size) else [[0]], but that's only a temporary hack to avoid the error. I'm also not sure if their workspace.image_from_segment or even the line segmentation processor is supposed to produce empty lines at all, so maybe the real problem is somewhere deeper in the guts of the OCR-d machinery.

Originally posted by @andbue in Calamari-OCR/calamari#193 (comment)
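
As an alternative to the temporary hack quoted above, empty line images could be skipped before they reach the conversion to numpy. The following is only a sketch under assumptions: `has_pixels` and `partition_lines` are hypothetical helpers, and each line is represented by its ID and a (width, height) tuple such as the `size` attribute of a PIL image.

```python
from typing import Iterable, List, Tuple

Size = Tuple[int, int]  # (width, height), as in a PIL image's .size


def has_pixels(size: Size) -> bool:
    """True if both width and height are non-zero."""
    return all(dim > 0 for dim in size)


def partition_lines(lines: Iterable[Tuple[str, Size]]) -> Tuple[List[str], List[str]]:
    """Split (line_id, size) pairs into recognizable and skipped line IDs."""
    usable: List[str] = []
    skipped: List[str] = []
    for line_id, size in lines:
        (usable if has_pixels(size) else skipped).append(line_id)
    return usable, skipped
```

The skipped IDs could be logged as warnings so that the upstream segmentation problem stays visible instead of being silently papered over.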

Fix word coordinates when using textequiv_level = "word"

Hi all,

When using ocrd-calamari-recognize with textequiv_level word, the pc:Word elements appear to have wrong y-coordinates in their Coords elements. It looks like all words are lowered to the bottom of the text region they belong to.

For instance:
When drawing the line polygons, the coords are right:
[Screenshot 2021-02-24, 16:03:32]

But when drawing the word polygons, the coords are wrong:
[Screenshot 2021-02-24, 16:03:06]

I am using cv2 to draw the polygons, but I double-checked in the PAGE XML file, and the words of a text region (sometimes of the entire page) all have the same y-coordinates.

Here is the entire code used to generate the OCR:

docker run --rm -u $(id -u) -v $PWD:/data -w /data -- ocrd/all:maximum ocrd process \
  "cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN" \
  "anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP" \
  "skimage-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P method li" \
  "skimage-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -P level-of-operation page" \
  "tesserocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW -P operation_level page" \
  "cis-ocropy-segment -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG -P level-of-operation page" \
  "cis-ocropy-dewarp -I OCR-D-SEG -O OCR-D-SEG-LINE-RESEG-DEWARP" \
  "calamari-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR -P checkpoint /data/calamari_model/\*.ckpt.json -P textequiv_level word"

-P checkpoint_dir does not seem to work

ocrd-calamari-recognize -P checkpoint_dir qurator-gt4histocr-1.0 -I OCR-D-SEG-LINE-SBB -O OCR-D-OCR-CALA

with a model in qurator-gt4histocr-1.0 does not seem to work:

resolve_resource - Could not find resource 'qurator-gt4histocr-1.0' for executable 'ocrd-calamari-recognize'. Try 'ocrd resmgr download ocrd-calamari-recognize qurator-gt4histocr-1.0' to download this resource.

I had some confusion due to the behavior of resolve_resource (OCR-D/core#727), so I have to test this again to see whether this is indeed the case.

dependencies correct?

Is it possible that the range of calamari-ocr and tensorflow-gpu versions supported by this repo is too broad?

With Tensorflow 2.0 I get on Calamari 0.3.1:

File "ocrd_calamari/recognize.py", line 31, in _init_calamari
    self.predictor = MultiPredictor(checkpoints=checkpoints)
  File "calamari_ocr/ocr/predictor.py", line 203, in __init__
    data_preproc=data_preproc, processes=processes) for cp in checkpoints]
  File "calamari_ocr/ocr/predictor.py", line 203, in <listcomp>
    data_preproc=data_preproc, processes=processes) for cp in checkpoints]
  File "calamari_ocr/ocr/predictor.py", line 100, in __init__
    ckpt = Checkpoint(checkpoint, auto_update=self.auto_update_checkpoints)
  File "calamari_ocr/ocr/checkpoint.py", line 35, in __init__
    self.checkpoint = json_format.Parse(f.read(), CheckpointParams())
  File "protobuf/json_format.py", line 406, in Parse
    return ParseDict(js, message, ignore_unknown_fields)
  File "protobuf/json_format.py", line 421, in ParseDict
    parser.ConvertMessage(js_dict, message)
  File "protobuf/json_format.py", line 452, in ConvertMessage
    self._ConvertFieldValuePair(value, message)
  File "protobuf/json_format.py", line 552, in _ConvertFieldValuePair
    raise ParseError('Failed to parse {0} field: {1}'.format(name, e))
google.protobuf.json_format.ParseError: Failed to parse model field: Failed to parse network field: Failed to parse backend field: Message type "BackendParams" has no field named "shuffleBufferSize".
 Available Fields(except extensions): <MessageFields sequence>

... and on Calamari 0.3.5:

File "ocrd_calamari/recognize.py", line 31, in _init_calamari
    self.predictor = MultiPredictor(checkpoints=checkpoints)
  File "calamari_ocr/ocr/predictor.py", line 220, in __init__
    data_preproc=data_preproc, processes=processes) for cp in checkpoints]
  File "calamari_ocr/ocr/predictor.py", line 220, in <listcomp>
    data_preproc=data_preproc, processes=processes) for cp in checkpoints]
  File "calamari_ocr/ocr/predictor.py", line 106, in __init__
    backend = create_backend_from_proto(self.network_params, restore=self.checkpoint, processes=processes)
  File "calamari_ocr/ocr/backends/factory.py", line 28, in create_backend_from_proto
    from calamari_ocr.ocr.backends.tensorflow_backend.tensorflow_backend import TensorflowBackend
  File "calamari_ocr/ocr/backends/tensorflow_backend/tensorflow_backend.py", line 4, in <module>
    from calamari_ocr.ocr.backends.tensorflow_backend.tensorflow_model import TensorflowModel
  File "calamari_ocr/ocr/backends/tensorflow_backend/tensorflow_model.py", line 3, in <module>
    import tensorflow.contrib.cudnn_rnn as cudnn_rnn
  File "tensorflow/contrib/__init__.py", line 33, in <module>
    from tensorflow.contrib import cudnn_rnn
  File "tensorflow/contrib/cudnn_rnn/__init__.py", line 34, in <module>
    from tensorflow.contrib.cudnn_rnn.python.layers import *
  File "tensorflow/contrib/cudnn_rnn/python/layers/__init__.py", line 23, in <module>
    from tensorflow.contrib.cudnn_rnn.python.layers.cudnn_rnn import *
  File "tensorflow/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 20, in <module>
    from tensorflow.contrib.cudnn_rnn.python.ops import cudnn_rnn_ops
  File "tensorflow/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 20, in <module>
    from tensorflow.contrib.eager.python import checkpointable_utils
  File "tensorflow/contrib/eager/python/checkpointable_utils.py", line 38, in <module>
    from tensorflow.python.training import checkpointable as core_checkpointable
ImportError: cannot import name 'checkpointable'

...when trying to recognize with the checkpoint files provided by @mikegerber

Glyph segmentation produces invalid results

<report valid="false">
  <error>INCONSISTENCY in Word ID 'l5_word0000' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'ſecund.' != concatenated 'ſecund'</error>
  <error>INCONSISTENCY in Word ID 'l5_word0001' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'Fatin.' != concatenated 'Fatin'</error>
  <error>INCONSISTENCY in Word ID 'l5_word0002' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'Tit.' != concatenated 'Tit'</error>
  <error>INCONSISTENCY in Word ID 'l5_word0003' of file 'OCR-D-OCR-CALAMARI_00000024': text results '9.' != concatenated '9'</error>
  <error>INCONSISTENCY in Word ID 'l5_word0004' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'qu.' != concatenated 'qu'</error>
  <error>INCONSISTENCY in Word ID 'l5_word0005' of file 'OCR-D-OCR-CALAMARI_00000024': text results '6.' != concatenated '6'</error>
  <error>INCONSISTENCY in Word ID 'l5_word0006' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'p.' != concatenated 'p'</error>
  <error>INCONSISTENCY in Word ID 'l5_word0008' of file 'OCR-D-OCR-CALAMARI_00000024': text results '320.' != concatenated '320'</error>
  <error>INCONSISTENCY in Word ID 'l7_word0000' of file 'OCR-D-OCR-CALAMARI_00000024': text results '.35)' != concatenated '.35'</error>
  <error>INCONSISTENCY in Word ID 'l7_word0001' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'So' != concatenated 'S'</error>
  <error>INCONSISTENCY in Word ID 'l7_word0002' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'viel' != concatenated 'vie'</error>
  <error>INCONSISTENCY in Word ID 'l7_word0003' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'die' != concatenated 'di'</error>
  <error>INCONSISTENCY in Word ID 'l7_word0004' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'von' != concatenated 'vo'</error>
  <error>INCONSISTENCY in Word ID 'l7_word0005' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'der' != concatenated 'de'</error>
  <error>INCONSISTENCY in Word ID 'l7_word0006' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'Inquiſitin' != concatenated 'Inquiſiti'</error>
  <error>INCONSISTENCY in Word ID 'l24_word0000' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'Der' != concatenated 'D'</error>
  <error>INCONSISTENCY in Word ID 'l24_word0001' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'Schnitheiß' != concatenated 'rSchnithe'</error>
  <error>INCONSISTENCY in Word ID 'l24_word0002' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'zu' != concatenated 'ß'</error>
  <error>INCONSISTENCY in Word ID 'l24_word0003' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'Oberrod,' != concatenated 'uOberro'</error>
  <error>INCONSISTENCY in Word ID 'l24_word0004' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'der' != concatenated ',d'</error>
  <error>INCONSISTENCY in Word ID 'l24_word0005' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'Wirth' != concatenated 'rWir'</error>
  <error>INCONSISTENCY in Word ID 'l24_word0006' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'Krebs' != concatenated 'hKre'</error>
  <error>INCONSISTENCY in Word ID 'l24_word0007' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'und' != concatenated 'su'</error>
  <error>INCONSISTENCY in Word ID 'l24_word0008' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'Hr.' != concatenated 'dH'</error>
  <error>INCONSISTENCY in Word ID 'l24_word0009' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'Notarius' != concatenated '.Notari'</error>
  <error>INCONSISTENCY in Word ID 'l24_word0010' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'Tribert' != concatenated 'sTribe'</error>
  <error>INCONSISTENCY in Word ID 'l24_word0011' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'ſind' != concatenated 'tſi'</error>
  <error>INCONSISTENCY in Word ID 'l24_word0012' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'bereits' != concatenated 'dberei'</error>
  <error>INCONSISTENCY in Word ID 'l26_word0000' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'Die' != concatenated 'Di'</error>
  <error>INCONSISTENCY in Word ID 'l42_word0001' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'auf' != concatenated 'a'</error>
  <error>INCONSISTENCY in Word ID 'l42_word0002' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'dieſem' != concatenated 'fdieſ'</error>
  <error>INCONSISTENCY in Word ID 'l42_word0003' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'Fall' != concatenated 'mFa'</error>
  <error>INCONSISTENCY in Word ID 'l42_word0004' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'inioid.' != concatenated 'linioi'</error>
  <error>INCONSISTENCY in Word ID 'l42_word0005' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'Cr.' != concatenated '.C'</error>
  <error>INCONSISTENCY in Word ID 'l42_word0006' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'art.' != concatenated '.ar'</error>
  <error>INCONSISTENCY in Word ID 'l42_word0007' of file 'OCR-D-OCR-CALAMARI_00000024': text results '12.' != concatenated '.1'</error>
  <error>INCONSISTENCY in Word ID 'l42_word0008' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'vom' != concatenated '.v'</error>
  <error>INCONSISTENCY in Word ID 'l42_word0009' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'peinlichen' != concatenated 'mpeinlich'</error>
  <error>INCONSISTENCY in Word ID 'l42_word0010' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'Klaͤger' != concatenated 'nKlaͤg'</error>
  <error>INCONSISTENCY in Word ID 'l42_word0011' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'erforderte' != concatenated 'rerforder'</error>
  <error>INCONSISTENCY in Word ID 'l61_word0005' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'ad' != concatenated 'a'</error>
  <error>INCONSISTENCY in Word ID 'l61_word0006' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'L.' != concatenated 'L'</error>
  <error>INCONSISTENCY in Word ID 'l61_word0007' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'Corn.' != concatenated 'Corn'</error>
  <error>INCONSISTENCY in Word ID 'l61_word0008' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'de' != concatenated 'd'</error>
  <error>INCONSISTENCY in Word ID 'l61_word0009' of file 'OCR-D-OCR-CALAMARI_00000024': text results 'fali.' != concatenated 'fali'</error>
</report>

On some documents, the new glyph segmentation produces invalid results. The problem seems to be wrong glyph positions from the Calamari engine.

Reproducer:

#!/bin/bash
set -e
export TF_FORCE_GPU_ALLOW_GROWTH=true

cd `mktemp -d`
wget https://qurator-data.de/examples/actevedef_718448162.first-page+binarization+segmentation.zip
unzip actevedef_718448162.first-page+binarization+segmentation
cd actevedef_718448162.first-page+binarization+segmentation

ocrd workspace remove-group -rf OCR-D-OCR-CALAMARI
ocrd-calamari-recognize \
  -p '{ "checkpoint": "/home/mike/devel/ocrd_calamari/gt4histocr-calamari/*.ckpt.json", "textequiv_level": "glyph" }' \
  -I OCR-D-SEG-LINE-SBB -O OCR-D-OCR-CALAMARI

ocrd workspace validate --skip dimension --skip pixel_density --page-coordinate-consistency off
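
The INCONSISTENCY errors in the report above compare each word's text with the concatenation of its glyph texts. A minimal sketch of that check follows; it is not the validator's actual code, and `find_inconsistent_words` with its tuple-based input is a hypothetical illustration.

```python
from typing import Iterable, List, Sequence, Tuple


def find_inconsistent_words(
    words: Iterable[Tuple[str, str, Sequence[str]]]
) -> List[Tuple[str, str, str]]:
    """Return (word_id, word_text, concatenated_glyphs) for each mismatching word.

    Each input item is (word_id, word_text, glyph_texts).
    """
    mismatches = []
    for word_id, text, glyph_texts in words:
        concatenated = "".join(glyph_texts)
        if text != concatenated:
            mismatches.append((word_id, text, concatenated))
    return mismatches
```

Running such a check in the processor's own tests would catch shifted or dropped glyphs before the output reaches ocrd workspace validate.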

Calamari 2.2

Calamari 2.0 is out.

I don't see benefits from updating the dependency, other than staying up to date and compatible.

Search path for model files

@bertsky in #6:

Moreover, I think it would be useful to not rely on the CWD for relative paths, because this is not reliable across the many layers (e.g. a script which calls ocrd-calamari-recognize which calls calamari.ocr.MultiPredictor). Instead, like TESSDATA for Tesseract, one could define an installation prefix via setuptools (overridable via environment variable), or simply use os.path.dirname(os.path.abspath(__file__)) as reference, i.e. the directory where ocrd_calamari is installed. Absolute pathnames should stay untouched, however.
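
The proposal could be sketched roughly as follows. This is an illustration only: `resolve_model_path` and the environment variable name `OCRD_CALAMARI_MODEL_PREFIX` are hypothetical and not part of ocrd_calamari. In real use, `package_dir` would be the directory where the package is installed.

```python
import os
from typing import Mapping, Optional


def resolve_model_path(
    path: str,
    package_dir: str,
    env: Optional[Mapping[str, str]] = None,
    env_var: str = "OCRD_CALAMARI_MODEL_PREFIX",  # hypothetical variable name
) -> str:
    """Resolve a model path without depending on the current working directory."""
    # Absolute paths stay untouched.
    if os.path.isabs(path):
        return path
    env = os.environ if env is None else env
    # The environment variable overrides the installation directory.
    prefix = env.get(env_var) or package_dir
    return os.path.join(prefix, path)
```

Passing the environment as a parameter keeps the function easy to test; callers would normally omit it and let it fall back to os.environ.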

Tests broken since last update

Since the last update, the tests are broken:

------------------------------------------------------------------- Captured stderr call --------------------------------------------------------------------
11:00:07.844 INFO processor.CalamariRecognize - INPUT FILE 0 / phys_0001
--------------------------------------------------------------------- Captured log call ---------------------------------------------------------------------
INFO     processor.CalamariRecognize:recognize.py:81 INPUT FILE 0 / phys_0001
================================================================== short test summary info ==================================================================
FAILED test/test_recognize.py::test_recognize - requests.exceptions.MissingSchema: Invalid URL 'OCR-D-IMG/INPUT_0017.tif': No scheme supplied. Perhaps you...
FAILED test/test_recognize.py::test_recognize_should_warn_if_given_rgb_image_and_single_channel_model - requests.exceptions.MissingSchema: Invalid URL 'OC...
FAILED test/test_recognize.py::test_word_segmentation - requests.exceptions.MissingSchema: Invalid URL 'OCR-D-IMG/INPUT_0017.tif': No scheme supplied. Per...
FAILED test/test_recognize.py::test_glyphs - requests.exceptions.MissingSchema: Invalid URL 'OCR-D-IMG/INPUT_0017.tif': No scheme supplied. Perhaps you me...
==================================================================== 4 failed in 16.04s =====================================================================
make: *** [Makefile:77: test] Error 1

Observations:

The new code from @bertsky's change in 1f0252d should download OCR-D-IMG/INPUT_0017.tif but doesn't:

% ls /tmp/test-ocrd-calamari/OCR-D-IMG 
OCR-D-IMG_0001.tif  OCR-D-IMG_0002.tif

Add results below the line level

Other OCR processors like Tesseract and Ocropy go out of their way to provide annotation on the Word or even Glyph level if requested. (They use the line result with precise character-wise confidences and relative positions for that.)

I know this will be a bit harder for Calamari with its text post-processing and voting capabilities. But IMO this is worth the effort.

Decide how to handle textline without text

@kba showed us this pc:TextLine from an ocrd_calamari output:

<pc:TextLine id="l2">                                     
    <pc:Coords points="302,655 532,653 533,728 302,732"/> 
    <pc:TextEquiv conf="0.">                              
        <pc:Unicode></pc:Unicode>                         
    </pc:TextEquiv>                                       
</pc:TextLine>   

Because this raises the issue of how to handle this in a subsequent transformation to ALTO - which requires text - we should think about how to handle this.

Pro removing the line:

  • No problem with ALTO

Con removing the line:

  • The line disappears and users can't just compare layout detection output with OCR output (i.e. "In this line no text was detected")

I'm leaning towards keeping the line and letting the ALTO conversion handle it. But I'll check if ocrd workspace validate considers a line with empty text valid output (I hope it does, I see no reason why it should be invalid).

@kba What are your thoughts on this?

Include alternative predictions

Calamari provides alternative predictions for characters. These should be included in the PAGE output.

See also #17 (extended prediction data) and #9 (results below line level)

Failing tests due to incompatible Pillow upgrade

https://app.circleci.com/pipelines/github/OCR-D/ocrd_calamari/175/workflows/945cc406-80f6-41b6-a24b-ff58f1a12c78/jobs/166

18:00:35.751 INFO processor.CalamariRecognize - INPUT FILE 0 / phys_0001
18:00:35.793 INFO ocrd.workspace.image_from_page - Cropping original image for page 'phys_0001'
18:00:35.815 INFO processor.CalamariRecognize - About to recognize 1 lines of region 'r_1_1'
18:00:35.817 WARNING processor.CalamariRecognize - Using raw image for line 'tl_1' in region 'r_1_1'
------------------------------ Captured log call -------------------------------
INFO     processor.CalamariRecognize:recognize.py:76 INPUT FILE 0 / phys_0001
INFO     ocrd.workspace.image_from_page:workspace.py:968 Cropping original image for page 'phys_0001'
INFO     processor.CalamariRecognize:recognize.py:88 About to recognize 1 lines of region 'r_1_1'
WARNING  processor.CalamariRecognize:recognize.py:100 Using raw image for line 'tl_1' in region 'r_1_1'
=========================== short test summary info ============================
FAILED test/test_recognize.py::test_recognize - TypeError: __array__() takes ...
FAILED test/test_recognize.py::test_recognize_with_checkpoint_dir - TypeError...
FAILED test/test_recognize.py::test_recognize_should_warn_if_given_rgb_image_and_single_channel_model
FAILED test/test_recognize.py::test_word_segmentation - TypeError: __array__(...
FAILED test/test_recognize.py::test_glyphs - TypeError: __array__() takes 1 p...
============================== 5 failed in 46.54s ==============================
Makefile:77: recipe for target 'test' failed
make[1]: *** [test] Error 1
make[1]: Leaving directory '/root/project'
Makefile:82: recipe for target 'coverage' failed
make: *** [coverage] Error 2

Documentation: README completeness, debug ocrd-tool.json

Please debug your ocrd_tool.json file.
I found some errors:

<report valid="false">
  <error>[tools.ocrd-calamari-recognize.parameters.checkpoint] 'description' is a required property</error>
  <error>[tools.ocrd-calamari-recognize.parameters.voter] 'description' is a required property</error>
</report>

You can find the ocrd-tool.json documentation at https://ocr-d.github.io/ocrd_tool

Please check your README file and complete it. An ideal README file looks like this:

# Name of application


## Introduction
...

## Installation
...

## Usage
...

## Testing
...

Thank you very much.

Does not set pcGtsId

% ocrd workspace validate --skip dimension --skip pixel_density --page-strictness lax --page-coordinate-consistency off
<report valid="false">
  <warning>pc:PcGts/@pcGtsId differs from mets:file/@ID: "OCR-D-SEG-LINE_00000024" !== "OCR-D-OCR-CALAMARI_00000024"</warning>
</report>
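
The warning goes away once the pcGtsId attribute of the output document matches the ID of the output file in the METS. A minimal standard-library sketch of that step follows; `set_pcgts_id` is a hypothetical helper, and the processor itself would more likely do this through the OCR-D PAGE API than through ElementTree.

```python
import xml.etree.ElementTree as ET


def set_pcgts_id(page_xml: str, file_id: str) -> str:
    """Return the PAGE XML with the root's pcGtsId set to the output file ID."""
    root = ET.fromstring(page_xml)
    root.set("pcGtsId", file_id)
    return ET.tostring(root, encoding="unicode")
```

For the example above, the output document would receive the ID "OCR-D-OCR-CALAMARI_00000024" instead of keeping the ID inherited from the input file.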

Add parameter to control output granularity below textline

ocrd-calamari-recognize now supports #9 (adding words and glyphs from the textline decoder results).

It would be favourable to allow controlling the level of segmentation detail to be added below line, as do ocrd-tesserocr-recognize and ocrd-cis-ocropy-recognize with textequiv_level:

  • line: do not add further segmentation (e.g. to save computation)
  • word: add Word elements
  • glyph: add Word and Glyph elements

NB: In the Tesseract wrapper, that parameter is also responsible for controlling the level of operation (page segmentation mode) if segmentation is already present at the lower levels. For example, with textequiv_level=word, if there are already Word elements below a TextLine, Tesseract only gets images cropped around these words and runs in PSM.SINGLE_WORD instead of PSM.SINGLE_LINE. IMHO there is no need/use in attempting something like this for engines that do not natively provide modes other than textline processing.

ValueError: Error when checking input: expected input_1 to have shape (..., ..., ...) but got array with shape (..., ..., ...)

ValueError: Error when checking input: expected input_1 to have shape (448, 896, 3) but got array with shape (448, 4, 3)

https://digi.ub.uni-heidelberg.de/diglitData/v/ocrd/lichtwark1932bd2_-_h.tif

workflow:

ocrd-sbb-binarize -I OCR-D-IMG -O OCR-D-001 -P model $HOME/ocrd_models/sbb/binarization/models
ocrd-cis-ocropy-deskew -I OCR-D-001 -O OCR-D-002
ocrd-sbb-textline-detector -I OCR-D-002 -O OCR-D-003 -P model $HOME/ocrd_models/sbb/textline
ocrd-calamari-recognize -I OCR-D-003 -O OCR-D-OCR -P checkpoint "$HOME/ocrd_models/calamari/calamari_models/gt4histocr/*.ckpt.json"

ocrd-calamari-recognize loses trained formatting characters

I have a model trained to represent italics and bold face with opening and closing characters: angle brackets '<' and '>' for italics, and curly brackets '{' and '}' for bold face. For example, this (set in italics) would be recognized as <this>. A similar case is, for example, https://github.com/poke1024/origami_models. Problems arise with lines that end in italics or bold face. In these cases the recognition result should include the closing character, > or }, before the line break, but these characters don't show up. For other positions this is not a problem: at the beginning of a line and inside a line, italics and bold face are successfully recognized with the opening and closing characters. Further, this problem doesn't appear when using calamari-predict from calamari-ocr with the same model. This is the same calamari-ocr installation that is used in the ocrd venv in question. That is, recognizing the same files with calamari-predict gives correct closing characters at the end of the line.

@bertsky has suggested a reason for this behaviour in a conversation in the OCR-D Lobby, pointing to a possible part of the code.

ocrd==2.23.0
calamari-ocr==1.0.5
ocrd_calamari==1.0.2
tensorflow==2.3.0

I have used the same method previously without this problem, but perhaps with some other versions of calamari-ocr and ocrd_calamari.

Review preprocessing of text lines

Private email from @andbue to @kba, copied with permission:

What I also find worrying is that the line images do not run through the standard MultiDataProcessor. I don't fully grasp everything that workspace.image_from_segment does, but Calamari scales, normalizes, pads (16 px white) and runs the data through a CenterNormalizer, as in good old Ocropus. My own experience is that the output is only optimal if prediction uses the same preprocessing as training. As I said, I don't fully understand image_from_segment at the moment, but maybe you should take a look at it. As an example of how the standard preprocessor could be integrated, here is a link to my code from the client:
Instantiation of the DataPreprocessor (lines 426-436):

https://github.com/andbue/nashi/blob/dd533d193264472a4cfc96aab69fadd9ca52732c/ocr/nashi_ocr/nashi_client.py#L426

Usage:

https://github.com/andbue/nashi/blob/dd533d193264472a4cfc96aab69fadd9ca52732c/ocr/nashi_ocr/nashi_client.py#L211

Calamari segfaults

Using the ocrd/maximum docker image from 5 days ago (2020-09-18, 9165ddaf96bc), I am receiving a segfault when running ocrd-calamari-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR -p checkpoint /models/\*.ckpt.json, where the models are those from the model.tar.xz archive as suggested by the OCR-D project web site. Only 5 xml files are produced before the segfault, so I am assuming the issue is with the sixth image (OCR-D-SEG-LINE-RESEG-DEWARP_f103.xml) and its line segments. Would you like me to submit the problematic line segments for reproduction?

I noticed that a new ocrd/maximum image has been published a day later. Do you suppose the changes may affect calamari?

Check prediction runtime performance

Last time I checked, ocrd_calamari based on Calamari 1 was 20% slower than ocrd_calamari based on Calamari 0.3.5. This should be checked again.

Fix or disable Docker build

Docker builds are failing, and the "Details" link leads to a 404 error page.

  • Fix the Docker build
  • Do not use calamari/build make target, we are using the version from PyPI
  • Check if this Docker build is useful (not using it personally) (→ #36)
  • Fix the build on dockercloud?

Support single model prediction

Currently, the processor only supports a prediction using confidence voting of multiple models. While this is superior, it makes sense to support single model prediction, too.

Fix CircleCI tests (again...)

On CircleCI, neither tensorflow-gpu==1.15.2 nor tensorflow==1.15.x can be installed:

Collecting tensorflow==1.15.* (from ocrd-calamari==0.0.3)
  Could not find a version that satisfies the requirement tensorflow==1.15.* (from ocrd-calamari==0.0.3) (from versions: 0.12.1, 1.0.0, 1.0.1, 1.1.0rc0, 1.1.0rc1, 1.1.0rc2, 1.1.0, 1.2.0rc0, 1.2.0rc1, 1.2.0rc2, 1.2.0, 1.2.1, 1.3.0rc0, 1.3.0rc1, 1.3.0rc2, 1.3.0, 1.4.0rc0, 1.4.0rc1, 1.4.0, 1.4.1, 1.5.0rc0, 1.5.0rc1, 1.5.0, 1.5.1, 1.6.0rc0, 1.6.0rc1, 1.6.0, 1.7.0rc0, 1.7.0rc1, 1.7.0, 1.7.1, 1.8.0rc0, 1.8.0rc1, 1.8.0, 1.9.0rc0, 1.9.0rc1, 1.9.0rc2, 1.9.0, 1.10.0rc0, 1.10.0rc1, 1.10.0, 1.10.1, 1.11.0rc0, 1.11.0rc1, 1.11.0rc2, 1.11.0, 1.12.0rc0, 1.12.0rc1, 1.12.0rc2, 1.12.0, 1.12.2, 1.12.3, 1.13.0rc0, 1.13.0rc1, 1.13.0rc2, 1.13.1, 1.13.2, 1.14.0rc0, 1.14.0rc1, 1.14.0, 2.0.0a0, 2.0.0b0, 2.0.0b1)
No matching distribution found for tensorflow==1.15.* (from ocrd-calamari==0.0.3)
Makefile:34: recipe for target 'install' failed
make: *** [install] Error 1

Full log: https://circleci.com/gh/OCR-D/ocrd_calamari/77

The install works on my local machine, using a virtualenv.

A thing that could be tried is an update of pip itself?

Provide more useful info than "No checkpoints provided."

From one of @kba's conversations in Gitter:

"checkpoint": "/home/najem/testa/calamari_model/*.ckpt.json"
Exception: No checkpoints provided.

I see this error from time to time, and it would be a lot clearer for users if we could help with messages like:

  • "File /home/najem/testa/calamari_model/*.ckpt.json not found"
  • "/home/najem/testa/calamari_model/*.ckpt.json" are Calamari 0.3 models, we need Calamari 1.0 models
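
A minimal sketch of the first kind of message, assuming that the checkpoint parameter is a glob pattern: `expand_checkpoints` is a hypothetical helper and not the processor's actual code. Detecting the model version, as in the second message, is not covered here.

```python
import glob
import os
from typing import List


def expand_checkpoints(pattern: str) -> List[str]:
    """Expand a checkpoint glob pattern, or fail with a descriptive message."""
    matches = sorted(glob.glob(pattern))
    if matches:
        return matches
    directory = os.path.dirname(pattern) or "."
    if not os.path.isdir(directory):
        raise FileNotFoundError(
            f"Checkpoint directory {directory!r} does not exist (pattern: {pattern!r})"
        )
    raise FileNotFoundError(
        f"No files match checkpoint pattern {pattern!r} in directory {directory!r}"
    )
```

Distinguishing a missing directory from a pattern that matches nothing tells the user whether the path is wrong or the model files are simply absent.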
