
ocrd_keraslm

character-level language modelling using Keras


Introduction

This is a tool for statistical language modelling (predicting text from context) with recurrent neural networks. It models probabilities not on the word level but on the character level, which allows open-vocabulary processing (avoiding problems of morphology, historic orthography and word segmentation). It manages a vocabulary of mapped characters, which can easily be extended by training on more text. Beyond that, unmapped characters are handled via underspecification.

In addition to character sequences, (meta-data) context variables can be configured as extra input.

Architecture

The model consists of:

  1. an input layer: characters are represented as indexes from the vocabulary mapping, in windows of a fixed number (length) of characters,
  2. a character embedding layer: window sequences are converted into dense vectors by looking up the indexes in an embedding weight matrix,
  3. a context embedding layer: context variables are converted into dense vectors by looking up the indexes in an embedding weight matrix,
  4. a concatenation of the character and context vector sequences,
  5. a configurable number (depth) of hidden layers, each with a configurable number (width) of recurrent LSTM (Long Short-Term Memory) units, stacked on top of each other,
  6. an output layer derived from the transposed character embedding matrix (weight tying): hidden activations are projected linearly to vectors of dimensionality equal to the character vocabulary size, then softmax is applied, returning a probability for each possible value of the next character.

model graph depiction
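For illustration only (a sketch, not the actual ocrd_keraslm code), a character-only version of this stack can be written in Keras roughly as follows; all sizes are made up, and the output projection is shown as a plain Dense layer instead of the tied embedding weights:

from tensorflow import keras

voc_size = 100  # size of the character vocabulary (index 0 is reserved)
length = 256    # window length in characters
width = 128     # hidden units per LSTM layer
depth = 2       # number of stacked LSTM layers

chars = keras.Input(shape=(length,), dtype='int32')     # 1. input layer
vects = keras.layers.Embedding(voc_size, width)(chars)  # 2. character embedding
hidden = vects
for _ in range(depth):                                  # 5. stacked LSTM layers
    hidden = keras.layers.LSTM(width, return_sequences=True)(hidden)
probs = keras.layers.Dense(voc_size, activation='softmax')(hidden)  # 6. output layer
model = keras.Model(chars, probs)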

The model is trained by feeding windows of text in index representation to the input layer, calculating the output, and comparing it to the same text shifted backward by 1 character, represented as unit vectors ("one-hot coding"), as target. The loss is calculated as the (unweighted) cross-entropy between target and output. Backpropagation yields error gradients for each layer, which are used to iteratively update the weights (stochastic gradient descent).
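Continuing the sketch above (illustrative variable names; sparse integer targets are used here for brevity instead of explicit one-hot coding):

# windows: integer array of shape (num_windows, length+1) with index-coded text
x_train = windows[:, :-1]  # input characters
y_train = windows[:, 1:]   # target: the same text shifted backward by 1
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
model.fit(x_train, y_train)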

This is implemented in Keras with Tensorflow as backend. If available, it automatically uses a fast CUDA-optimized LSTM implementation (requires an Nvidia GPU and a Tensorflow installation with GPU support, see below), both in the learning and the prediction phase.

Modes of operation

Notably, this model (by default) runs statefully, i.e. it implicitly passes hidden state from one window (batch of samples) to the next. That way, the context available for predictions can be arbitrarily long (above length, e.g. the complete document up to that point) or short (below length, e.g. at the start of a text). (However, above length this remains a passive perspective, because errors are never back-propagated further in time during gradient-descent training.) This is favourable to stateless mode because all characters can be output in parallel, and no partial windows need to be presented during training (which would slow it down).

Besides stateful mode, the model can also be run incrementally, i.e. with the caller explicitly passing hidden state in and out. That way, multiple alternative hypotheses can be processed together. This is used for generation (sampling from the model) and alternative decoding (finding the best path through a sequence of alternatives).
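In Keras terms, incremental mode amounts to something like the following sketch (not the actual ocrd_keraslm API): the recurrent layer returns its state explicitly, and the caller stores one state per hypothesis and feeds it back in:

lstm = keras.layers.LSTM(width, return_sequences=True, return_state=True)
outputs, h, c = lstm(inputs)  # hidden state is returned explicitly
# ... later, resume any hypothesis from its stored state:
outputs2, h2, c2 = lstm(next_inputs, initial_state=[h, c])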

Context conditioning

Every text has meta-data like time, author, text type, genre, production features (e.g. print vs typewriter vs digital-born rich text, OCR version), language, structural element (e.g. title vs heading vs paragraph vs footer vs marginalia), font family (e.g. Antiqua vs Fraktura) and font shape (e.g. bold vs letter-spaced vs italic vs normal), etc.

This information (however noisy) can be very useful to facilitate stochastic modelling, since language is extremely diverse and complex. To that end, models can be conditioned on extra inputs, here termed context variables. The model learns to represent these high-dimensional discrete values as low-dimensional continuous vectors (embeddings), which also enter the recurrent hidden layers (as a form of simple additive adaptation).
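Continuing the architecture sketch above (hypothetical sizes and names), one such context variable could be added like this:

n_values = 10  # number of known values of this context variable (made up)
context = keras.Input(shape=(length,), dtype='int32')
ctx_vects = keras.layers.Embedding(n_values, 8)(context)   # 3. context embedding
combined = keras.layers.Concatenate()([vects, ctx_vects])  # 4. concatenation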

Underspecification

Index zero is reserved for unmapped characters (unseen contexts). During training, its embedding vector is regularised to occupy a central position among all mapped characters (all other contexts), and the hidden layers get to see it every now and then via random degradation of the input. At runtime, therefore, an unknown character (or unknown context) represented as zero does not disturb follow-up predictions too much.
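For example (illustrative, with a made-up vocabulary), the mapping simply falls back to index zero:

mapping = {'a': 1, 'b': 2, 'c': 3}            # hypothetical character vocabulary
indexes = [mapping.get(c, 0) for c in 'abx']  # 'x' is unmapped, yields [1, 2, 0]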

Installation

Required Ubuntu packages:

  • Python (python or python3)
  • pip (python-pip or python3-pip)
  • virtualenv (python-virtualenv or python3-virtualenv)

Create and activate a virtualenv as usual.

If you need a custom version of keras or tensorflow (e.g. with GPU support), install them via pip now.

To install Python dependencies and this module, do:

make deps install

Which is the equivalent of:

pip install -r requirements.txt
pip install -e .

Useful environment variables are:

  • TF_CPP_MIN_LOG_LEVEL (set to 1 to suppress most of Tensorflow's messages)
  • CUDA_VISIBLE_DEVICES (set empty to force CPU even in a GPU installation)
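For example (illustrative file names), to force CPU and quiet logging in one invocation:

TF_CPP_MIN_LOG_LEVEL=1 CUDA_VISIBLE_DEVICES= keraslm-rate test -m model_dta_64_4_256.h5 corpus.txt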

Usage

This package has two user interfaces:

command line interface keraslm-rate

To be used with string arguments and plain-text files.

Usage: keraslm-rate [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  train                           train a language model
  test                            get overall perplexity from language model
  apply                           get individual probabilities from language model
  generate                        sample characters from language model
  print-charset                   Print the mapped characters
  prune-charset                   Delete one character from mapping
  plot-char-embeddings-similarity
                                  Paint a heat map of character embeddings
  plot-context-embeddings-similarity
                                  Paint a heat map of context embeddings
  plot-context-embeddings-projection
                                  Paint a 2-d PCA projection of context embeddings

Examples:

keraslm-rate train --width 64 --depth 4 --length 256 --model model_dta_64_4_256.h5 dta_komplett_2017-09-01/txt/*.tcf.txt
keraslm-rate generate -m model_dta_64_4_256.h5 --number 6 "für die Wiſſen"
keraslm-rate apply -m model_dta_64_4_256.h5 "so schädlich ist es Borkickheile zu pflanzen"
keraslm-rate test -m model_dta_64_4_256.h5 dta_komplett_2017-09-01/txt/grimm_*.tcf.txt

OCR-D processor interface ocrd-keraslm-rate

To be used with PageXML documents in an OCR-D annotation workflow. Input could be anything with a textual annotation (TextEquiv on the given textequiv_level). The LM rater could be used for both quality control (without alternative decoding, using only each first index TextEquiv) and part of post-correction (with alternative_decoding=True, finding the best path among TextEquiv indexes).

Usage: ocrd-keraslm-rate [worker|server] [OPTIONS]

  Rate elements of the text with a character-level LSTM language model in Keras

  > Rate text with the language model, either for scoring or finding the
  > best path across alternatives.

  > Open and deserialise PAGE input files, then iterate over the segment
  > hierarchy down to the requested `textequiv_level`, making sequences
  > of first TextEquiv objects (if `alternative_decoding` is false), or
  > of lists of all TextEquiv objects (otherwise) as a linear graph for
  > input to the LM. If the level is above glyph, then insert artificial
  > whitespace TextEquiv where implicit tokenisation rules require it.

  > Next, if `alternative_decoding` is false, then pass the concatenated
  > string of the page text to the LM and map the returned sequence of
  > probabilities to the substrings in the input TextEquiv. For each
  > TextEquiv, calculate the average character probability (LM score)
  > and combine that with the input confidence (OCR score) by applying
  > `lm_weight`. Assign the resulting probability as new confidence to
  > the TextEquiv, and ensure no other TextEquiv remain on the segment.
  > Finally, calculate the overall average LM probability, and the
  > character and segment-level perplexity, and print it on the logger.

  > Otherwise (i.e. with `alternative_decoding=true`), search for the
  > best paths through the input graph of the page (with TextEquiv
  > string alternatives as edges) by applying the LM successively via
  > beam search using `beam_width` (keeping a traceback of LM state
  > history at each node, passing and updating LM state explicitly). As
  > in the trivial case above (without `alternative_decoding`),
  > combine LM scores weighted by `lm_weight` with input confidence on
  > the graph's edges. Also, prune worst paths and apply LM state
  > history clustering to avoid expanding all possible combinations.
  > Finally, look into the current best overall path, traversing back to
  > the last node of the previous page's graph. Lock into that node by
  > removing all current paths that do not derive from it, and making
  > its history path the final decision for the previous page: Apply
  > that path by removing all but the chosen TextEquiv alternatives,
  > assigning the resulting confidences, and making the levels above
  > `textequiv_level` consistent with that textual result (via
  > concatenation joined by whitespace). Also, calculate the overall
  > average LM probability, and the character and segment-level
  > perplexity, and print it on the logger. Moreover, at the last page
  > at the end of the document, lock into the current best path
  > analogously.

  > Produce new output files by serialising the resulting hierarchy for
  > each page.

Subcommands:
    worker      Start a processing worker rather than do local processing
    server      Start a processor server rather than do local processing

Options for processing:
  -m, --mets URL-PATH             URL or file path of METS to process [./mets.xml]
  -w, --working-dir PATH          Working directory of local workspace [dirname(URL-PATH)]
  -I, --input-file-grp USE        File group(s) used as input
  -O, --output-file-grp USE       File group(s) used as output
  -g, --page-id ID                Physical page ID(s) to process instead of full document []
  --overwrite                     Remove existing output pages/images
                                  (with "--page-id", remove only those)
  --profile                       Enable profiling
  --profile-file PROF-PATH        Write cProfile stats to PROF-PATH. Implies "--profile"
  -p, --parameter JSON-PATH       Parameters, either verbatim JSON string
                                  or JSON file path
  -P, --param-override KEY VAL    Override a single JSON object key-value pair,
                                  taking precedence over --parameter
  -U, --mets-server-url URL       URL of a METS Server for parallel incremental access to METS
                                  If URL starts with http:// start an HTTP server there,
                                  otherwise URL is a path to an on-demand-created unix socket
  -l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
                                  Override log level globally [INFO]

Options for information:
  -C, --show-resource RESNAME     Dump the content of processor resource RESNAME
  -L, --list-resources            List names of processor resources
  -J, --dump-json                 Dump tool description as JSON
  -D, --dump-module-dir           Show the 'module' resource location path for this processor
  -h, --help                      Show this message
  -V, --version                   Show version

Parameters:
   "model_file" [string - REQUIRED]
    path of h5py weight/config file for model trained with keraslm
   "textequiv_level" [string - "glyph"]
    PAGE XML hierarchy level to evaluate TextEquiv sequences on
    Possible values: ["region", "line", "word", "glyph"]
   "alternative_decoding" [boolean - true]
    whether to process all TextEquiv alternatives, finding the best path
    via beam search, and delete each non-best alternative
   "beam_width" [number - 10]
    maximum number of best partial paths to consider during search with
    alternative_decoding
   "lm_weight" [number - 0.5]
    share of the LM scores over the input confidences
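For example (illustrative values and file group names), the parameters can also be passed as a JSON file, say params.json, via -p:

{
  "model_file": "model_dta_full.h5",
  "textequiv_level": "word",
  "alternative_decoding": false,
  "lm_weight": 0.5
}

ocrd-keraslm-rate -I OCR-D-OCR -O OCR-D-LM -p params.json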

Examples:

make deps-test # installs ocrd_tesserocr
make test/assets # downloads GT, imports PageXML, builds workspaces
ocrd workspace -d ws1 clone -a test/assets/kant_aufklaerung_1784/mets.xml
cd ws1
ocrd-tesserocr-segment-region -I OCR-D-IMG -O OCR-D-SEG-BLOCK
ocrd-tesserocr-segment-line -I OCR-D-SEG-BLOCK -O OCR-D-SEG-LINE
ocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS-WORD -P textequiv_level word -P model Fraktur
ocrd-tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESS-GLYPH -P textequiv_level glyph -P model deu-frak
# download Deutsches Textarchiv language model
ocrd resmgr download ocrd-keraslm-rate model_dta_full.h5
# get confidences and perplexity:
ocrd-keraslm-rate -I OCR-D-OCR-TESS-WORD -O OCR-D-OCR-LM-WORD -P model_file model_dta_full.h5 -P textequiv_level word -P alternative_decoding false
# also get best path:
ocrd-keraslm-rate -I OCR-D-OCR-TESS-GLYPH -O OCR-D-OCR-LM-GLYPH -P model_file model_dta_full.h5 -P textequiv_level glyph -P alternative_decoding true -P beam_width 10

Models

Pretrained models will be published as GitHub release assets and made available via the OCR-D Resource Manager.

So far, the only published model is:

  • model_dta_full.h5
    This LM was configured as a stateful contiguous LSTM model (2 layers, 128 hidden nodes each, window length 256), and trained on the complete Deutsches Textarchiv fulltext (80%/20% train/validation split).
    It achieves a perplexity of 2.51 on the validation subset after 4 epochs.
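    (Perplexity is the exponentiated average cross-entropy per character, so this means the model is, on average, about as uncertain as a uniform choice among roughly 2.5 characters.)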

Testing

make deps-test test

Which is the equivalent of:

pip install -r requirements_test.txt
test -e test/assets || test/prepare_gt.bash test/assets
test -f model_dta_test.h5 || keraslm-rate train -m model_dta_test.h5 test/assets/*.txt
keraslm-rate test -m model_dta_test.h5 test/assets/*.txt
python -m pytest test $(PYTEST_ARGS)

Set PYTEST_ARGS="-s --verbose" to see log output (-s) and individual test results (--verbose).

ocrd_keraslm's People

Contributors

bertsky, kba, stweil, wrznr


ocrd_keraslm's Issues

AttributeError: 'KerasRate' object has no attribute 'rater'

In docker, ocrd/all:maximum

jb@pers16:~/workspace/ocrd-keras> date
Fr 22. Mär 10:37:52 CET 2024

jb@pers16:~/workspace/ocrd-keras> docker pull ocrd/all:maximum
maximum: Pulling from ocrd/all
Digest: sha256:f0321d84bdb293294e6a36efb5d4addca8acf305ee218a4860a54763d9f253d2
Status: Image is up to date for ocrd/all:maximum
docker.io/ocrd/all:maximum

jb@pers16:~/workspace/ocrd-keras> cat run.sh
#!/bin/bash
set -x
set -e
# docker-ocrd ocrd-import .
# docker-ocrd ocrd-tesserocr-recognize -I OCR-D-IMG -O OCR-D-OCR -P segmentation_level region -P textequiv_level word -P model deu
docker-ocrd ocrd-keraslm-rate -I OCR-D-OCR -O OCR-D-KERAS -P model_file /home/jb/ocrd-models/ocrd-keraslm-rate/model_dta_full.h5 -P textequiv_level word -P alternative_decoding false

jb@pers16:~/workspace/ocrd-keras> ./run.sh
+ docker-ocrd ocrd-keraslm-rate -I OCR-D-OCR -O OCR-D-KERAS -P model_file /home/jb/ocrd-models/ocrd-keraslm-rate/model_dta_full.h5 -P textequiv_level word -P alternative_decoding false
09:36:11.559 INFO processor.KerasRate - INPUT FILE 0 / p0002
09:36:11.636 INFO processor.KerasRate - Scoring text in page 'OCR-D-OCR_test-fouche10_5' at the word level
09:36:11.637 INFO ocrd.page_validator.validate - Validating input file 'OCR-D-OCR_test-fouche10_5'
09:36:11.870 INFO processor.KerasRate - Rating 1003 elements with a total of 3383 characters
09:36:11.870 ERROR ocrd.processor.helpers.run_processor - Failure in processor 'ocrd-keraslm-rate'
Traceback (most recent call last):
  File "/build/core/src/ocrd/processor/helpers.py", line 130, in run_processor
    processor.process()
  File "/build/ocrd_keraslm/ocrd_keraslm/wrapper/rate.py", line 110, in process
    confidences = self.rater.rate(textstring, context) # much faster
AttributeError: 'KerasRate' object has no attribute 'rater'
Traceback (most recent call last):
  File "/usr/local/sub-venv/headless-tf1/bin/ocrd-keraslm-rate", line 33, in <
►module>
    sys.exit(load_entry_point('ocrd-keraslm', 'console_scripts', 'ocrd-keraslm-
►rate')())
  File "/usr/local/sub-venv/headless-tf1/lib/python3.8/site-packages/click/core.
►py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/sub-venv/headless-tf1/lib/python3.8/site-packages/click/core.
►py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/sub-venv/headless-tf1/lib/python3.8/site-packages/click/core.
►py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/sub-venv/headless-tf1/lib/python3.8/site-packages/click/core.
►py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/build/ocrd_keraslm/ocrd_keraslm/wrapper/cli.py", line 9, in ocrd_
►keraslm_rate
    return ocrd_cli_wrap_processor(KerasRate, *args, **kwargs)
  File "/build/core/src/ocrd/decorators/__init__.py", line 133, in ocrd_cli_wrap
►_processor
    run_processor(processorClass, mets_url=mets, workspace=workspace, **kwargs)
  File "/build/core/src/ocrd/processor/helpers.py", line 133, in run_processor
    raise err
  File "/build/core/src/ocrd/processor/helpers.py", line 130, in run_processor
    processor.process()
  File "/build/ocrd_keraslm/ocrd_keraslm/wrapper/rate.py", line 110, in process
    confidences = self.rater.rate(textstring, context) # much faster
AttributeError: 'KerasRate' object has no attribute 'rater'

Hyphenated words

Dear reader,
does keraslm-rate take hyphenated words into account?

Using this demo file https://digi.ub.uni-heidelberg.de/diglitData/v/keraslm/test-fouche10,5-s1.pdf

It seems that many of the low rated words have hyphens:

With hyphenation:

# median: 0.962098 0.622701 ; mean: 0.948695 0.625144, correlation: 0.315179
# OCR-D-OCR OCR-D-KERAS
0.693236 0.410939  # region0002_line0021_word0003 daf3
0.927003 0.468318  # region0002_line0029_word0006 Rä-
0.932888 0.480686  # region0002_line0021_word0002 Lyon,
0.904642 0.484226  # region0002_line0032_word0001 Kerker.
0.909297 0.484817  # region0002_line0032_word0004 klaubt
0.931271 0.489822  # region0002_line0000_word0005 pas-
0.928169 0.491138  # region0000_line0004_word0007 sozia-
0.927566 0.492916  # region0002_line0014_word0003 Pythia;
0.958217 0.494058  # region0000_line0002_word0003 Lyon,
0.963757 0.494978  # region0003_line0001_word0005 Lyon,
0.926153 0.495819  # region0003_line0000_word0004 Kon-
0.960306 0.496031  # region0002_line0010_word0007 Lyon
0.911557 0.496326  # region0002_line0001_word0004 Rousseaus
0.967390 0.496934  # region0000_line0011_word0003 1792
0.929831 0.497394  # region0002_line0004_word0003 im
0.960453 0.498529  # region0002_line0017_word0006 Lyon
0.910209 0.499826  # region0002_line0018_word0002 Instinktiv
...

Without (manually removed) hyphenation:

# median: 0.962198 0.623943 ; mean: 0.949162 0.628181, correlation: 0.278264
# OCR-D-OCRNOHYP OCR-D-KERNOHYP
0.693236 0.411037  # region0002_line0021_word0003 daf3
0.932888 0.480686  # region0002_line0021_word0002 Lyon,
0.904642 0.484226  # region0002_line0032_word0001 Kerker.
0.909297 0.484817  # region0002_line0032_word0004 klaubt
0.927566 0.492916  # region0002_line0014_word0003 Pythia;
0.958217 0.494058  # region0000_line0002_word0003 Lyon,
0.963757 0.494945  # region0003_line0001_word0005 Lyon,
0.960306 0.496031  # region0002_line0010_word0007 Lyon
0.911557 0.496306  # region0002_line0001_word0004 Rousseaus
0.967390 0.496923  # region0000_line0011_word0003 1792
0.929831 0.497394  # region0002_line0004_word0003 im
0.960453 0.498542  # region0002_line0017_word0006 Lyon
0.910209 0.499822  # region0002_line0018_word0002 Instinktiv
...

documentation: debug ocrd-tool.json

Please debug your ocrd_tool.json file.
I found an error:

<report valid="false">
  <error>[tools.ocrd-keraslm-rate.parameters.model_file.content-type] 'application/x-hdf;subtype=bag' does not match '^[a-z0-9\\._-]+/[A-Za-z0-9\\._\\+-]+$'</error>
</report>

You can find the ocrd-tool.json documentation: https://ocr-d.github.io/ocrd_tool

Thank you very much.

networkx fails to reach an end node

NB: This issue refers to bertsky/ocrd_keraslm state 87015de, but there is no possibility to open an issue there and the code is already supposed to be used by cor-asv-fst.

The attached test script fails (error message below), although a path from node 0 to 8 clearly exists in the input graph (e.g. 0 -> 1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7 -> 8).

$ python3 test-keraslm.py
Using TensorFlow backend.
2019-07-02 16:55:47.105686: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
16:55:47.112 INFO ocrd_keraslm.lib.rating - using CPU LSTM implementation to compile stateless incremental model of depth 2 width 512 length 512 size 362
Traceback (most recent call last):
  File "test-keraslm.py", line 112, in <module>
    beam_clustering_dist = 5)
  File "/home/msumalvico/venv/ocrd/lib/python3.5/site-packages/ocrd_keraslm-0.3.1-py3.5.egg/ocrd_keraslm/lib/rating.py", line 823, in rate_best
AssertionError: breadth-first search failed to reach true end node (7 instead of 8)

Versions:
python 3.5.3 (from debian-stable)
networkx 2.3

test-keraslm.py.txt
model.zip

(change the path in line 5 of the script to the path of the attached model file)
