qurator-spk / ocrd_repair_inconsistencies

Automatically re-order lines, words and glyphs to become textually consistent with their parents.

License: Apache License 2.0

ocrd_repair_inconsistencies's Introduction

Caution

This was a one-off script, useful to solve a specific problem. We do not maintain it anymore, but in case you want to use it, we appreciate an e-mail to [email protected]

ocrd_repair_inconsistencies

Automatically re-order lines, words and glyphs to become textually consistent with their parents.

Introduction

PAGE-XML elements with textual annotation are re-ordered by their centroid coordinates if and only if such re-ordering fixes the inconsistency between their appropriately concatenated TextEquiv texts and their parent's TextEquiv text.

If TextEquiv is missing, skip the respective elements.
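
The rule above can be sketched in plain Python. This is a minimal illustration and not the processor's actual code: the `Segment` class, the `centroid` and `repair_order` helpers, the vertex-mean centroid and the choice of joiner (for example a space between words) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, float]


@dataclass
class Segment:
    """Hypothetical stand-in for a PAGE-XML TextLine, Word or Glyph."""
    text: Optional[str]      # TextEquiv text, or None if missing
    points: Sequence[Point]  # polygon outline of the segment


def centroid(seg: Segment) -> Point:
    # Assumption: the mean of the polygon vertices is a good enough centroid.
    xs = [x for x, _ in seg.points]
    ys = [y for _, y in seg.points]
    return sum(xs) / len(xs), sum(ys) / len(ys)


def repair_order(
    parent_text: Optional[str],
    children: Sequence[Segment],
    joiner: str,
    key: Callable[[Segment], float],
) -> Optional[List[Segment]]:
    """Return the children in repaired order, or None if nothing should change."""
    if parent_text is None or any(c.text is None for c in children):
        return None  # missing TextEquiv: skip
    if joiner.join(c.text for c in children) == parent_text:
        return None  # already consistent with the parent
    candidate = sorted(children, key=key)
    if joiner.join(c.text for c in candidate) == parent_text:
        return candidate  # re-ordering fixes the inconsistency
    return None  # sorting does not help: leave the order untouched


# Two words stored in the wrong XML order, sorted left to right by centroid x.
words = [
    Segment("world", [(60, 0), (100, 0), (100, 10), (60, 10)]),
    Segment("hello", [(0, 0), (50, 0), (50, 10), (0, 10)]),
]
fixed = repair_order("hello world", words, " ", key=lambda s: centroid(s)[0])
print([w.text for w in fixed])  # ['hello', 'world']
```

Returning `None` in every non-repair case keeps the "only if it fixes the inconsistency" condition explicit: the caller changes the element order only when a candidate is returned.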

Where available, respect the annotated visual order:

  • For regions vs lines, sort in top-to-bottom fashion, unless another textLineOrder is annotated.
    (Both left-to-right and right-to-left will be skipped currently.)
  • For lines vs words and words vs glyphs, sort in left-to-right fashion, unless another readingDirection is annotated.
    (Both top-to-bottom and bottom-to-top will be skipped currently.)
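
As a rough sketch, the two bullets can be written as lookup tables from the annotated direction to a sort key over centroids, with `None` marking the skipped cases. The table names and the `sort_key` helper are hypothetical. Handling `bottom-to-top` and `right-to-left` by negating the coordinate is an assumption inferred from the text, not confirmed behaviour.

```python
from typing import Callable, Dict, Optional, Tuple

Point = Tuple[float, float]
SortKey = Callable[[Point], float]

# Keys take a centroid (x, y). Image coordinates are assumed to grow
# rightwards (x) and downwards (y).

# Regions vs lines: governed by textLineOrder.
LINE_KEYS: Dict[Optional[str], Optional[SortKey]] = {
    None: lambda c: c[1],              # nothing annotated: top-to-bottom
    "top-to-bottom": lambda c: c[1],
    "bottom-to-top": lambda c: -c[1],  # assumption: reversed vertical order
    "left-to-right": None,             # skipped
    "right-to-left": None,             # skipped
}

# Lines vs words and words vs glyphs: governed by readingDirection.
WORD_GLYPH_KEYS: Dict[Optional[str], Optional[SortKey]] = {
    None: lambda c: c[0],              # nothing annotated: left-to-right
    "left-to-right": lambda c: c[0],
    "right-to-left": lambda c: -c[0],  # assumption: reversed horizontal order
    "top-to-bottom": None,             # skipped
    "bottom-to-top": None,             # skipped
}


def sort_key(level: str, direction: Optional[str]) -> Optional[SortKey]:
    """Return the sort key for a level ('line', 'word' or 'glyph'), or None to skip."""
    table = LINE_KEYS if level == "line" else WORD_GLYPH_KEYS
    return table.get(direction)


line_centroids = [(0, 30), (0, 10), (0, 20)]
print(sorted(line_centroids, key=sort_key("line", None)))
# [(0, 10), (0, 20), (0, 30)]
print(sort_key("line", "left-to-right"))  # None
```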

This processor does not affect ReadingOrder between regions, just the order of the XML elements below the region level, and only if not contradicting the annotated textLineOrder/readingDirection.

We wrote this as a one-shot script to fix some files. Use with caution.

Installation

(In your venv, run:)

make deps     # or pip install -r requirements.txt
make install  # or pip install .

Usage

Offers the following user interfaces:

OCR-D processor CLI ocrd-repair-inconsistencies

To be used with PageXML documents in an OCR-D annotation workflow.

Example

Use the following script to repair OCR-D-GT-PAGE annotation in workspaces, and then replace it with the output on success:

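#!/bin/bash
# Abort on the first failing command, so the originals are only overwritten on success
set -e

# Temporary output file group with a random suffix
tmp_fg=FIXED_$RANDOM

ocrd-repair-inconsistencies -I OCR-D-GT-PAGE -O "$tmp_fg"

# Copy each repaired file over its original, swapping the file group prefix in the name
for f in "$tmp_fg"/*; do
  g="OCR-D-GT-PAGE/OCR-D-GT-PAGE_${f#${tmp_fg}/${tmp_fg}_}"
  cp "$f" "$g"
done

# Remove the temporary file group from the workspace again
ocrd workspace remove-group -rf "$tmp_fg"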

ocrd_repair_inconsistencies's People

Contributors

bertsky, kba, mikegerber, stweil

ocrd_repair_inconsistencies's Issues

Find a better name

It repairs inconsistencies in segment texts vs their children texts vs XML order and the name should reflect that.

generalize to other textLineOrder

Currently the repair does not apply when the region's @textLineOrder is something other than top-to-bottom. But the code could easily be generalized to all allowed values (while still assuming top-to-bottom when nothing is annotated).

Moreover, IMO readingDirection should not be checked at all on the region vs line level (only on the line vs word and word vs glyph levels).

add tests and CI

I imagine this can fail in many ways. Do you have good example data? Or rather, create them artificially by re-ordering segments in good GT ad-hoc?

As for negative tests, we could probably use kant_aufklaerung_1784 from OCR-D/assets because of its bad tokenization, plus some bags/filegrps without text or with missing text.
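
One way to create such artificial test data could look like the following sketch. It assumes the children of a good ground-truth segment are available as a plain list. The `scramble` helper is hypothetical and not part of this repository.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def scramble(children: Sequence[T], seed: int = 0) -> List[T]:
    """Return a copy of children in a different order, reproducibly."""
    items = list(children)
    rng = random.Random(seed)  # seeded, so test data stays stable across runs
    for _ in range(10):
        shuffled = items[:]
        rng.shuffle(shuffled)
        if shuffled != items:
            return shuffled
    # Fallback: rotate by one (only identical to the input if all items are equal)
    return items[1:] + items[:1]


lines = ["first line", "second line", "third line"]
broken = scramble(lines)
# The parent text still matches the original order, the children no longer do.
assert "\n".join(broken) != "\n".join(lines)
assert sorted(broken) == sorted(lines)
```

A positive test would then check that the repair restores the original order of `lines` from `broken`.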
