qurator-spk / dinglehopper

An OCR evaluation tool

License: Apache License 2.0

Python 62.10% Jupyter Notebook 31.62% JavaScript 0.91% Jsonnet 0.34% Jinja 5.02%
ocr ocr-evaluation alto-xml alto page-xml page ocr-d qurator

dinglehopper's Introduction

dinglehopper

dinglehopper is an OCR evaluation tool that reads ALTO, PAGE and text files. It compares a ground truth (GT) document page with an OCR result page to compute metrics and a report of word/character differences. It also supports batch processing by generating, aggregating and summarizing multiple reports.

Goals

  • Useful
    • As a UI tool
    • For an automated evaluation
    • As a library
  • Unicode support

Installation

It's best to use pip to install the package from PyPI, e.g.:

pip install dinglehopper

Usage

Usage: dinglehopper [OPTIONS] GT OCR [REPORT_PREFIX] [REPORTS_FOLDER]

  Compare the PAGE/ALTO/text document GT against the document OCR.

  dinglehopper detects if GT/OCR are ALTO or PAGE XML documents to extract
  their text and falls back to plain text if no ALTO or PAGE is detected.

  The files GT and OCR are usually a ground truth document and the result of
  an OCR software, but you may use dinglehopper to compare two OCR results. In
  that case, use --no-metrics to disable the then meaningless metrics and also
  change the color scheme from green/red to blue.

  The comparison report will be written to
  $REPORTS_FOLDER/$REPORT_PREFIX.{html,json}, where $REPORTS_FOLDER defaults
  to the current working directory and $REPORT_PREFIX defaults to "report".
  The reports include the character error rate (CER) and the word error rate
  (WER).

  By default, the text of PAGE files is extracted on 'region' level. You may
  use "--textequiv-level line" to extract from the level of TextLine tags.

Options:
  --metrics / --no-metrics  Enable/disable metrics and green/red
  --differences BOOLEAN     Enable reporting character and word level
                            differences
  --textequiv-level LEVEL   PAGE TextEquiv level to extract text from
  --progress                Show progress bar
  --help                    Show this message and exit.

For example:

dinglehopper some-document.gt.page.xml some-document.ocr.alto.xml

This generates report.html and report.json.
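The JSON report can be consumed by other tools. The following is a minimal sketch, assuming the report contains the keys cer, wer, n_characters and n_words (these appear in a report excerpt quoted in one of the issues on this page, not in a documented schema); the helper name load_metrics is hypothetical:

```python
import json
from pathlib import Path


def load_metrics(report_path):
    """Return the metrics stored in a dinglehopper JSON report as a dict."""
    with open(report_path, encoding="utf-8") as f:
        report = json.load(f)
    # Assumption: these key names are taken from a report excerpt,
    # not from a documented schema. Missing keys are returned as None.
    keys = ("cer", "wer", "n_characters", "n_words")
    return {key: report.get(key) for key in keys}


if __name__ == "__main__":
    path = Path("report.json")
    if path.exists():
        print(load_metrics(path))
```

Python's json module accepts the non-standard value Infinity by default, so a report with an infinite WER can still be loaded this way.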

[Screenshot: dinglehopper displaying metrics and character differences]

Batch comparison between folders of GT and OCR files can be done by simply providing folders:

dinglehopper gt/ ocr/ report output_folder/

This assumes that you have files with the same name in both folders, e.g. gt/00000001.page.xml and ocr/00000001.alto.xml.

The example generates reports for each set of files, with the prefix report, in the (automatically created) folder output_folder/.

By default, the JSON report does not contain the character and word differences, only the calculated metrics. If you want to include the differences, use the --differences flag:

dinglehopper gt/ ocr/ report output_folder/ --differences

dinglehopper-summarize

A set of (JSON) reports can be summarized into a single set of reports. This is useful after having generated reports in batch. Example:

dinglehopper-summarize output_folder/

This generates summary.html and summary.json in the same output_folder.

If you are summarizing many reports and have used the --differences flag while generating them, it may be useful to limit the number of differences reported by using the --occurences-threshold parameter. This will reduce the size of the generated HTML report, making it easier to open and navigate. Note that the JSON report will still contain all differences. Example:

dinglehopper-summarize output_folder/ --occurences-threshold 10

dinglehopper-line-dirs

You may also want to compare a directory of GT text files (e.g. gt/line0001.gt.txt) with a directory of OCR text files (e.g. ocr/line0001.some-ocr.txt) using a separate CLI:

dinglehopper-line-dirs gt/ ocr/

dinglehopper-extract

The tool dinglehopper-extract extracts the text of the given input file on stdout, for example:

dinglehopper-extract --textequiv-level line OCR-D-GT-PAGE/00000024.page.xml

OCR-D

As an OCR-D processor:

ocrd-dinglehopper -I OCR-D-GT-PAGE,OCR-D-OCR-TESS -O OCR-D-OCR-TESS-EVAL

This generates HTML and JSON reports in the OCR-D-OCR-TESS-EVAL filegroup.

The OCR-D processor has these parameters:

Parameter                Meaning
-P metrics false         Disable metrics and the green-red color scheme (default: enabled)
-P textequiv_level line  (PAGE) Extract text from TextLine level (default: TextRegion level)

For example:

ocrd-dinglehopper -I ABBYY-FULLTEXT,OCR-D-OCR-CALAMARI -O OCR-D-OCR-COMPARE-ABBYY-CALAMARI -P metrics false

Developer information

Please refer to README-DEV.md.

dinglehopper's People

Contributors

b2m, bertsky, circleci-config-suggestions-bot, cneud, kba, maxbachmann, mikegerber, neingeist, rfdj, sadra-barikbin, stweil, wrznr

dinglehopper's Issues

COMBINING LATIN SMALL LETTER O and COMBINING RING ABOVE can be ignored in compare

Unicode allows different representations of the same character. dinglehopper currently does not detect that such representations denote identical characters and treats them as recognition errors.

This can be fixed by normalizing the text before doing the comparison.

Example: We just had a case where the GT transcription used zuͦſein (u + COMBINING RING ABOVE) while the OCR detected zůſein (LATIN SMALL LETTER U WITH RING ABOVE). See https://digi.bib.uni-mannheim.de/fileadmin/digi/477429599/dh_055.html.
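The proposed normalization can be illustrated with Python's standard unicodedata module. This is a sketch of the general idea, not dinglehopper's actual implementation; the helper name normalize is hypothetical:

```python
import unicodedata


def normalize(text):
    """Return the NFC-normalized (canonically composed) form of text."""
    return unicodedata.normalize("NFC", text)


decomposed = "zu\u030a"   # u followed by COMBINING RING ABOVE (U+030A)
precomposed = "z\u016f"   # LATIN SMALL LETTER U WITH RING ABOVE (U+016F)

# The raw strings differ, the normalized strings do not.
print(decomposed == precomposed)
print(normalize(decomposed) == normalize(precomposed))
```

NFC only unifies canonically equivalent sequences. A u followed by COMBINING LATIN SMALL LETTER O (U+0366) is not canonically equivalent to U+016F, so treating those two as equal would need an explicit substitution rule in addition to normalization.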

Add --progress parameter

Please provide a --verbose parameter. I just ran a comparison of a text file with an XML file (both not part of any OCR-D process) and waited forever. Finally, I aborted the process. It would be nice to know what was the issue internally....

In #26 a progress bar was proposed by @mikegerber .

Check licenses of used libraries

dinglehopper is Apache-licensed. All libraries used as libraries need to have a compatible license, e.g. BSD, MIT, Apache or public domain. GPL-licensed programs used seem to be fine. See also #48 for a relevant discussion.

Checklist from requirements*.txt:

  • click
  • jinja2
  • lxml
  • uniseg
  • numpy
  • colorama
  • MarkupSafe
  • ocrd >= 2.20.1
  • attrs
  • multimethod == 1.3
  • tqdm
  • pytest
  • pytest-flake8
  • pytest-cov
  • pytest-mypy
  • black

Improve performance when calculating sequence alignment

Dinglehopper is using a custom Python implementation of the Levenshtein distance to calculate, score and show an alignment of two given texts.

According to my performance analysis done for #47, the distance and editops functions of this custom implementation are the main bottleneck when comparing particularly bad or large OCR results.

In #48 I proposed to use the C based python-Levenshtein as a replacement, which we discarded for the following reasons:

  1. No support for aligning sequences of words (see comment by @mikegerber).
  2. Currently no active maintenance.
  3. Viral license (GPL 2)

One alternative and fast implementation for distance calculation is RapidFuzz, where @maxbachmann already started to address the issue of the distance calculation for arbitrary sequences in rapidfuzz/RapidFuzz#100.

At the moment RapidFuzz does not support the calculation of edit operations (see comment by @maxbachmann).
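To make the requirement "distance for arbitrary sequences" concrete, here is a minimal pure-Python Levenshtein distance that works on strings as well as on lists of words. It is a textbook sketch for illustration only, not dinglehopper's implementation and not the RapidFuzz API; the function name levenshtein is hypothetical:

```python
def levenshtein(s1, s2):
    """Return the Levenshtein distance between two sequences.

    Works for any sequences whose items support ==, e.g. strings
    (character level) or lists of words (word level).
    """
    s1, s2 = list(s1), list(s2)
    # previous[j] holds the distance between the processed prefix of s1
    # and the first j items of s2.
    previous = list(range(len(s2) + 1))
    for i, a in enumerate(s1, start=1):
        current = [i]
        for j, b in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (a != b),  # substitution (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
print(levenshtein("a b c".split(), "a x c".split()))
```

Both time and memory use of the inner loop grow with the product of the sequence lengths, which is why a compiled backend makes such a difference for full pages.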

try cutting corners to become faster

I know the task of aligning concatenated pages is much harder than just aligning text lines from the same segmentation (as in ocrd-cor-asv-ann-evaluate). But here it's all the more pressing to get decent performance IMO.

Therefore I would suggest the following:

  1. replace the matrix calculation with some C library backend (there are many existing packages for Python that already do this). The algorithm will still be O(n²) but the linear factor will be an order of magnitude smaller IIRC.

  2. Instead of aligning the full page, try to cut corners by making the problem n smaller: first compare pairs of regions, taking the best fit, then compare pairs of lines, taking the best fit. This comparison above the line level could also be merely approximate (you only need a proportional score, no actual alignment, and you can have boundaries, no exhaustive search). For example, difflib.SequenceMatcher.quick_ratio can do this.
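The second suggestion could look roughly like the following sketch, which uses difflib.SequenceMatcher.quick_ratio from the standard library to pick the most similar OCR line for a GT line without computing an alignment. The helper name best_match and the line-list input are assumptions made for illustration:

```python
from difflib import SequenceMatcher


def best_match(gt_line, ocr_lines):
    """Return (index, score) of the OCR line most similar to gt_line.

    quick_ratio() is only an upper bound on the real similarity ratio,
    so the result is an approximate ranking, not an alignment.
    """
    best_index, best_score = None, -1.0
    for index, candidate in enumerate(ocr_lines):
        score = SequenceMatcher(None, gt_line, candidate).quick_ratio()
        if score > best_score:
            best_index, best_score = index, score
    return best_index, best_score


index, score = best_match("Sorgen wegen", ["xyz", "Sorgen wegcn", "abc"])
print(index, round(score, 3))
```

Only the pairs selected this way would then be passed on to the expensive full alignment.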

getLogger Irritation with regular CLI

Issue description

A recent version (1778b3) of dinglehopper complains because of the OCR-D logger:

dinglehopper 1300565-gt.xml 1300565.xml

=> 

21:17:33.416 CRITICAL root - getLogger was called before initLogging. Source of the call:
21:17:33.416 CRITICAL root -   File "/home/hartwig/Projekte/work/mlu/ulb/ulb-sachsen-anhalt-dinglehopper/qurator/dinglehopper/extracted_text.py", line 243, in get_first_textequiv
21:17:33.416 CRITICAL root -     log = getLogger("processor.OcrdDinglehopperEvaluate")

Even though all report-files are being generated, the output is somehow irritating.

Steps to reproduce the issue

  1. call dinglehopper 1300565-gt.xml 1300565.xml (attached)

What's the expected result?

  • No logging error or no logging at all if no OCR-D is around

Additional details

The problem could be worked around by calling OCR-D's initLogging in the non-OCR-D CLI as well, e.g. by adding something like this to cli.py:

initLogging()
Config.progress = progress
process(gt, ocr, report_prefix, metrics=metrics, textequiv_level=textequiv_level)

Does dinglehopper want to stick with the OCR-D logger in potential non-OCR-D contexts as well? Further, it looks like dinglehopper currently lacks any dedicated logging configuration, which couples it rather strongly not only to the OCR-D logging logic, but also to its configuration.

1300565-test.zip

Horrible failure with large documents

@stweil reported in Gitter:

Improvements of dinglehopper are very welcome. The old version took more than 4 hours to process two text files with 1875 lines each and required about 30 GB RAM. The new version terminates after 2 minutes, but with out of memory: it was killed by the Linux kernel after using more than 60 GB RAM. :-(

@cneud also submitted a large document (a newspaper page).

  • Investigate why the new version uses even more memory
  • Consider falling back to more efficient algorithms if necessary
  • Consider a regression test for this

DingleHopper does not create results

Using ocrd, version 2.38.0

I have tried out ocrd-dinglehopper like this:

ocrd-dinglehopper -l DEBUG -I $gtfileGrp,$ocrFileGrp -O $dinglefolder -P textequiv_level line

Somehow, from one of the input files no text is taken:

{
    "gt": "OCR-D-SEG-KRAK/OCR-D-SEG-KRAK_4749_007817786_00157.xml",
    "ocr": "OCR-D-TESS-OCR-MOD-04/OCR-D-TESS-OCR-MOD-04-4749_007817786_00157.xml",

    "cer": 4.5,
    "wer": Infinity,

    "n_characters": 56,
    "n_words": 0
}

Any idea?

Display image

The tool should display an image corresponding to the text line/OCR error selected.

I'll probably use the local images to display, not IIIF as this seems more general.

  • Explore JS/CSS possibilities to crop images
  • What about TIFF support in the browser? Or mandate converting to PNG/JPEG?
  • Implement it

dinglehopper keeps hanging and test errors

Running dinglehopper gt txt and dinglehopper-line-dirs keeps hanging without any message, and pytest returns errors:

collected 62 items / 18 deselected / 44 selected                                                   

qurator/dinglehopper/tests/extracted_text_test.py .............                              [ 29%]
qurator/dinglehopper/tests/test_align.py .......F..                                          [ 52%]
qurator/dinglehopper/tests/test_character_error_rate.py ..                                   [ 56%]
qurator/dinglehopper/tests/test_edit_distance.py .                                           [ 59%]
qurator/dinglehopper/tests/test_editops.py ..                                                [ 63%]
qurator/dinglehopper/tests/test_ocr_files.py .............                                   [ 93%]
qurator/dinglehopper/tests/test_word_error_rate.py ...                                       [100%]

============================================= FAILURES =============================================
__________________________________ test_with_some_fake_ocr_errors __________________________________

    def test_with_some_fake_ocr_errors():
>       result = list(
            align(
                "Über die vielen Sorgen wegen desselben vergaß",
                "SomeJunk MoreJunk Übey die vielen Sorgen wegen AdditionalJunk deffelben vcrgab",
            )
        )

qurator/dinglehopper/tests/test_align.py:70: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

s1 = ['Ü', 'b', 'e', 'r', ' ', 'd', ...], s2 = ['S', 'o', 'm', 'e', 'J', 'u', ...]

    def seq_align(s1, s2):
        """Align general sequences."""
        s1 = list(s1)
        s2 = list(s2)
        ops = levenshtein_editops(s1, s2)
        i = 0
        j = 0
    
        while i < len(s1) or j < len(s2):
            o = None
            try:
                ot = ops[0]
                if ot[1] == i and ot[2] == j:
                    ops = ops[1:]
                    o = ot
            except IndexError:
                pass
    
            if o:
                if o[0] == "insert":
                    yield None, s2[j]
                    j += 1
                elif o[0] == "delete":
                    yield s1[i], None
                    i += 1
                elif o[0] == "replace":
                    yield s1[i], s2[j]
                    i += 1
                    j += 1
            else:
>               yield s1[i], s2[j]
E               IndexError: list index out of range

qurator/dinglehopper/align.py:42: IndexError
===================================== short test summary info ======================================
FAILED qurator/dinglehopper/tests/test_align.py::test_with_some_fake_ocr_errors - IndexError: lis...
=========================== 1 failed, 43 passed, 18 deselected in 30.24s ===========================

also stuck with:
qurator/dinglehopper/tests/test_integ_table_extraction.py ..... [ 83%]
qurator/dinglehopper/tests/test_integ_word_error_rate_ocr.py ..

python version 3.9.0. Thanks.

Documentation: README completeness, debug ocrd-tool.json

Please debug your ocrd_tool.json file.
I found an error:

<report valid="false">
  <error>[] 'version' is a required property</error>
</report>

You can find the ocrd-tool.json documentation: https://ocr-d.github.io/ocrd_tool

Please check your README file and complete it. Your README is fine, but compare it with the ideal README structure:

# Name of application


## Introduction
...

## Installation
...

## Usage
...

## Testing
...

Thank you very much.

Improve printing HTML report

When printed, the HTML report has the following issues for long texts:

  • Page break between "Character differences" header and the differences
  • Long texts are cut off after one page in the report

Things to do when dropping Python 3.5 support

This issue is to collect stuff pertaining to dropping Python 3.5 support when it's possible:

  • str() on Path objects is not necessary anymore on Python 3.6+
    • in qurator/dinglehopper/tests/test_integ*.py
  • Use type annotations instead of type= for attr classes
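The first point refers to the fact that, from Python 3.6 on, built-in functions such as open() accept pathlib.Path objects directly (PEP 519). A small self-contained sketch, using a temporary file as a stand-in for the test data:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.txt"
    path.write_text("dinglehopper", encoding="utf-8")

    # On Python 3.5 this needed open(str(path)); on 3.6+ the Path
    # object can be passed as-is.
    with open(path, encoding="utf-8") as f:
        content = f.read()

print(content)
```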

Display document page metadata

ALTO files contain metadata like this:

<OCRProcessing ID="IdOcr">
  <ocrProcessingStep>
    <processingDateTime>2014-05-21</processingDateTime>
    <processingSoftware>
      <softwareCreator>ABBYY</softwareCreator>
      <softwareName>ABBYY FineReader Engine</softwareName>
      <softwareVersion>11</softwareVersion>
    </processingSoftware>
  </ocrProcessingStep>
</OCRProcessing>

The report should display it.

Offline use

The stylesheet for the report linked as <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css"...> should be distributed with the sources in order to support the rendering even when the tool is used offline.

Use black code formatter

Joking aside, I think I'll just use the black code formatter in the future, reasonable results and no more arguing about bike sheds... eh code formatting.

Originally posted by @mikegerber in #37 (comment)

Note: black requires Python >= 3.6!

Todo:

Add naming parameter for output file

There should be a way to parameterize the name of the output file.

If I evaluate multiple files, with the dinglehopper command, it should be possible to name every output file. Currently, I have to run a script that renames each report.json or report.html.

Feature Request: Comparison options

It would be nice to have an option to ...

  • fold accented and uppercase characters to lowercase (Ä → a) and ignore punctuation.
  • let dinglehopper try to arrange paragraphs so that wrong segmentation order (perhaps not so important for full text search) can be ignored
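The first requested option could be approximated with the standard library as in the following sketch. It illustrates the idea only and is not an existing dinglehopper option; the function name fold is hypothetical:

```python
import unicodedata


def fold(text):
    """Strip diacritics and punctuation, then lowercase the text."""
    # NFD splits accented characters into base character + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    kept = [
        ch for ch in decomposed
        if not unicodedata.combining(ch)                    # drop combining marks
        and not unicodedata.category(ch).startswith("P")    # drop punctuation
    ]
    return "".join(kept).lower()


print(fold("Ä"))
print(fold("Über, die!"))
```

Applying fold to both GT and OCR text before the comparison would make the metrics insensitive to these differences.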

Generate per-workspace CER + WER

  1. It's easy to calculate this from the individual CER/WER and the character/word counts.
  2. But how to save a global JSON report in the METS? It would not "manifest a physical page" which OCR-D seems to demand for any file
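Point 1 can be sketched as a count-weighted average. The sketch assumes that each per-page rate is the number of errors divided by the page's character (or word) count, so that rate × count recovers the error count; if a different normalization is used, the weights have to change accordingly. The function name aggregate_rate is hypothetical:

```python
def aggregate_rate(pages):
    """Combine per-page error rates into one overall rate.

    pages is an iterable of (rate, count) tuples, e.g. (cer, n_characters).
    Assumption: rate == errors / count for each page.
    """
    total_errors = 0.0
    total_count = 0
    for rate, count in pages:
        total_errors += rate * count
        total_count += count
    if total_count == 0:
        return float("nan")  # no characters/words at all: rate undefined
    return total_errors / total_count


# Two pages: 10 errors in 100 characters and 10 errors in 20 characters.
print(aggregate_rate([(0.1, 100), (0.5, 20)]))
```

Note that this differs from the plain mean of the per-page rates, which would over-weight short pages.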

Support comparing line GT directories with line OCR directories

In #62, @stweil's original problem was, as I understand it, to compare a directory of line GT text files with a directory of line OCR text files. For now I've created fake test data to implement this: fake-line-gt.zip. It looks like this:

% ls *
gt:
line001.gt.txt  line003.gt.txt  line005.gt.txt  line007.gt.txt  line009.gt.txt  line011.gt.txt
line002.gt.txt  line004.gt.txt  line006.gt.txt  line008.gt.txt  line010.gt.txt

some-ocr:
line001.some-ocr.txt  line003.some-ocr.txt  line005.some-ocr.txt  line007.some-ocr.txt  line009.some-ocr.txt  line011.some-ocr.txt
line002.some-ocr.txt  line004.some-ocr.txt  line006.some-ocr.txt  line008.some-ocr.txt  line010.some-ocr.txt

A first implementation should compare the text of pairs of files (matched by filename) and produce a report of metrics and differences over all of the lines. A first idea for the CLI:

dinglehopper-lines gt/ --gt-suffix .gt.txt some-ocr/ --ocr-suffix .some-ocr.txt

I'm not sure if this will be the final CLI interface but it's what seems necessary on first glance.
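Matching by filename with the two suffixes could work as in this sketch. The function name pair_line_files and its defaults are hypothetical and only mirror the suffixes from the example above:

```python
from pathlib import Path


def pair_line_files(gt_dir, ocr_dir, gt_suffix=".gt.txt", ocr_suffix=".some-ocr.txt"):
    """Return a sorted list of (gt_file, ocr_file) pairs sharing a base name.

    GT files without a matching OCR file are skipped.
    """
    pairs = []
    for gt_file in sorted(Path(gt_dir).glob("*" + gt_suffix)):
        # "line001.gt.txt" -> "line001"
        base = gt_file.name[: -len(gt_suffix)]
        ocr_file = Path(ocr_dir) / (base + ocr_suffix)
        if ocr_file.exists():
            pairs.append((gt_file, ocr_file))
    return pairs


if __name__ == "__main__":
    for gt_file, ocr_file in pair_line_files("gt", "some-ocr"):
        print(gt_file, ocr_file)
```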

UnorderedGroup

@cneud reported problems with the ENP dataset. Example files:

example.zip

The GT file contains an UnorderedGroup, which triggers a NotImplementedError:

% dinglehopper 00008061.gt.xml 00008061.eng.xml
Traceback (most recent call last):
  File "/home/mike/.virtualenvs/dinglehopper-github/bin/dinglehopper", line 11, in <module>
    load_entry_point('dinglehopper', 'console_scripts', 'dinglehopper')()
  File "/home/mike/.virtualenvs/dinglehopper-github/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/mike/.virtualenvs/dinglehopper-github/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/mike/.virtualenvs/dinglehopper-github/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/mike/.virtualenvs/dinglehopper-github/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/mike/devel/dinglehopper-github/qurator/dinglehopper/cli.py", line 180, in main
    process(gt, ocr, report_prefix, metrics=metrics, textequiv_level=textequiv_level)
  File "/home/mike/devel/dinglehopper-github/qurator/dinglehopper/cli.py", line 93, in process
    gt_text = extract(gt, textequiv_level=textequiv_level)
  File "/home/mike/devel/dinglehopper-github/qurator/dinglehopper/ocr_files.py", line 155, in extract
    return page_extract(tree, textequiv_level=textequiv_level)
  File "/home/mike/devel/dinglehopper-github/qurator/dinglehopper/ocr_files.py", line 79, in page_extract
    raise NotImplementedError
NotImplementedError
  • Make this a warning and read UnorderedGroups in XML order
  • Check what other tools do with this
  • Find a proper solution (Hard!)

Add a parameter for selection of text level (PAGE XML)

Currently, dinglehopper extracts text from PAGE XML files on the region level (https://github.com/qurator-spk/dinglehopper/blob/master/qurator/dinglehopper/ocr_files.py#L50). It would be wonderful if you could add a level-of-operation parameter to allow for extraction from line or word level. (Manual OCR correction is often done on a specific level and propagation of text through the different levels is not widely implemented, i.e. I only know of the Aletheia pro edition which does it in both directions)

Must HTML-quote angle brackets etc.

I often get these:

htmlParseStartTag: invalid element name
iff">h</span>-<br>Berlin. In dieſem Jahre iſt no<span class="cdiff1975 diff"><
tmlParseStartTag: invalid element name
e;<br>| Ih ſ<span class="cdiff3399 diff">c</span><span class="cdiff3400 diff"><

Web browsers seem to go along with this, but tools don't.
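The general fix is to escape text before embedding it in markup. The sketch below uses html.escape from the standard library; it illustrates the principle and is not the project's actual report template code. The helper name diff_span is hypothetical:

```python
from html import escape


def diff_span(css_class, text):
    """Wrap text in a <span>, escaping characters that are special in HTML."""
    return '<span class="{}">{}</span>'.format(
        escape(css_class, quote=True),  # attribute value: also escape quotes
        escape(text),                   # element content: escape <, > and &
    )


print(diff_span("cdiff1975 diff", "<"))
```

With a template engine, enabling its autoescaping feature achieves the same effect without manual calls.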

Update RapidFuzz again

For now, we stick to rapidfuzz ~ 2.0.0, because later rapidfuzz versions seem to produce un-sliceable editops (see #67). I've reverted #67 for now, but once this is fixed, the changes need to be merged again.

Honor TextEquiv index

https://ocr-d.de/de/gt-guidelines/pagexml/pagecontent_xsd_Complex_Type_pc_TextLineType.html#TextLineType_TextEquiv

@JKamlah wrote:

This is due to the work with LAREX. We did some corrections with LAREX on line level to produce GT files. LAREX kept both the original text and the corrected text in the result file and separated them by index. The original text got the index 1 and the corrected ones index 0, not corrected lines got no index at all. I don't know if that is a LAREX specific procedure(?). Link to the LAREX example.

PAGE specs:

Used for sort order in case multiple TextEquivs are defined. The text content with the lowest index should be interpreted as the main text content.

(See #5 (comment))
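The rule quoted from the PAGE specification can be sketched as follows. TextEquivs are represented here as plain (index, text) tuples rather than XML elements, and the handling of TextEquivs without an index (used only when no indexed one exists) is an assumption, not something the specification text above prescribes. The function name main_text is hypothetical:

```python
def main_text(text_equivs):
    """Pick the main text from a list of (index, text) tuples.

    index is an int or None (attribute missing). The TextEquiv with the
    lowest index wins. Assumption: entries without an index are only
    used if no entry has an index; then the first one is taken.
    """
    if not text_equivs:
        return None
    indexed = [(index, text) for index, text in text_equivs if index is not None]
    if indexed:
        return min(indexed, key=lambda pair: pair[0])[1]
    return text_equivs[0][1]


# LAREX-style line: original text at index 1, corrected text at index 0.
print(main_text([(1, "vcrgab"), (0, "vergaß")]))
```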

Feature request: list with error frequencies in report

A useful feature for further error analysis or post-correction, provided by other OCR evaluation tools, is additional statistics, e.g. lists with the frequency of character/word errors per page and the types of errors (insertion/deletion/substitution). See e.g. the output example of ocrevalUAtion.

Support optional stopword list

A common use case for OCR evaluation (e.g. for search engine indexing, text and data mining, etc.) is to omit stopwords from the word evaluation to get an understanding of the correctness of "significant words" only.

It would therefore be useful if dinglehopper would also support the optional use of a stopword list provided via parameter/config file. This is already supported in ocrevalUAtion.
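A stopword filter applied to both word sequences before the word-level comparison might look like this sketch. It is not an existing dinglehopper feature; the function name remove_stopwords and the case-insensitive matching are assumptions:

```python
def remove_stopwords(words, stopwords):
    """Return words without those contained in stopwords (case-insensitive)."""
    stop = {word.casefold() for word in stopwords}
    return [word for word in words if word.casefold() not in stop]


gt_words = "Über die vielen Sorgen".split()
print(remove_stopwords(gt_words, ["die", "der", "das"]))
```

The word error rate would then be computed on the filtered GT and OCR word lists instead of the full ones.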

Support disabling metrics + green/red

When comparing two OCR results it should be possible to disable metrics and the green/red color as they do not make sense.

Option/OCR-D parameter could be --no-metrics/no-metrics.

  • Add option in the CLI
  • Add parameter in the OCR-D processor
  • What to do about the JSON report?

Documentation Enhancement

Please consider explaining some details in your documentation:

  • What are WER and CER? If you are not familiar with these terms, you don't grasp them immediately, although the concept is easy.
  • Does dinglehopper automatically recognize the import format?
  • How are text files and XML files compared? Are the XML files simply stripped down to their text representation? How do you assure that there is no additional (or missing) empty paragraph screwing the evaluation?

Also, please provide a --verbose parameter. I just ran a comparison of a text file with an XML file (both not part of any OCR-D process) and waited forever. Finally, I aborted the process. It would be nice to know what the issue was internally. (Moved to #30)

And, despite this critique, thank you for providing such a handy tool! :)

Edit:
I found even more:

  • How can I process a bunch of ground truth files that are not part of the OCR-D mets.xml. Or, how can I assign them their corresponding page in the mets.xml? There should be some way!

Release on PyPI

  • Create a GitHub Actions workflow to release on PyPI
  • After fixing qurator-spk/setuptools_ocrd#10, remove the workaround MANIFEST.in again
  • Review contents of sdist
  • Update README (to recommend installing from PyPI)
