qurator-spk / dinglehopper

An OCR evaluation tool

License: Apache License 2.0

Python 62.10% Jupyter Notebook 31.62% JavaScript 0.91% Jsonnet 0.34% Jinja 5.02%

ocr ocr-evaluation alto-xml alto page-xml page ocr-d qurator

dinglehopper's Introduction

dinglehopper

dinglehopper is an OCR evaluation tool that reads ALTO, PAGE and text files. It compares a ground truth (GT) document page with an OCR result page to compute metrics and a report of word/character differences. It also supports batch processing by generating, aggregating and summarizing multiple reports.

Goals

- Useful
  - As a UI tool
  - For an automated evaluation
  - As a library
- Unicode support

Installation

It's best to use pip to install the package from PyPI, e.g.:

pip install dinglehopper

Usage

Usage: dinglehopper [OPTIONS] GT OCR [REPORT_PREFIX] [REPORTS_FOLDER]

  Compare the PAGE/ALTO/text document GT against the document OCR.

  dinglehopper detects if GT/OCR are ALTO or PAGE XML documents to extract
  their text and falls back to plain text if no ALTO or PAGE is detected.

  The files GT and OCR are usually a ground truth document and the result of
  an OCR software, but you may use dinglehopper to compare two OCR results. In
  that case, use --no-metrics to disable the then meaningless metrics and also
  change the color scheme from green/red to blue.

  The comparison report will be written to
  $REPORTS_FOLDER/$REPORT_PREFIX.{html,json}, where $REPORTS_FOLDER defaults
  to the current working directory and $REPORT_PREFIX defaults to "report".
  The reports include the character error rate (CER) and the word error rate
  (WER).

  By default, the text of PAGE files is extracted on 'region' level. You may
  use "--textequiv-level line" to extract from the level of TextLine tags.

Options:
  --metrics / --no-metrics  Enable/disable metrics and green/red
  --differences BOOLEAN     Enable reporting character and word level
                            differences
  --textequiv-level LEVEL   PAGE TextEquiv level to extract text from
  --progress                Show progress bar
  --help                    Show this message and exit.

For example:

dinglehopper some-document.gt.page.xml some-document.ocr.alto.xml

This generates report.html and report.json.
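For automated evaluation, the JSON report can be read back with a few lines of Python. The following is a minimal sketch, assuming the report contains top-level "cer" and "wer" keys as in the report excerpt quoted in one of the issues further down; the helper name load_metrics and the sample values are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def load_metrics(report_path):
    """Return (cer, wer) from a dinglehopper-style JSON report."""
    with open(report_path, encoding="utf-8") as f:
        report = json.load(f)
    return report["cer"], report["wer"]


# Stand-in report with invented values, written to a temporary folder so
# that the example runs without an actual dinglehopper run.
sample = {"cer": 0.05, "wer": 0.2, "n_characters": 200, "n_words": 43}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "report.json"
    path.write_text(json.dumps(sample), encoding="utf-8")
    cer, wer = load_metrics(path)

print(f"CER={cer:.2%} WER={wer:.2%}")
```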

Batch comparison between folders of GT and OCR files can be done by simply providing folders:

dinglehopper gt/ ocr/ report output_folder/

This assumes that you have files with the same name in both folders, e.g. gt/00000001.page.xml and ocr/00000001.alto.xml.

The example generates reports for each set of files, with the prefix report, in the (automatically created) folder output_folder/.

By default, the JSON report does not contain the character and word differences, only the calculated metrics. If you want to include the differences, use the --differences flag:

dinglehopper gt/ ocr/ report output_folder/ --differences

dinglehopper-summarize

A set of (JSON) reports can be summarized into a single set of reports. This is useful after having generated reports in batch. Example:

dinglehopper-summarize output_folder/

This generates summary.html and summary.json in the same output_folder.

If you are summarizing many reports and have used the --differences flag while generating them, it may be useful to limit the number of differences reported by using the --occurences-threshold parameter. This will reduce the size of the generated HTML report, making it easier to open and navigate. Note that the JSON report will still contain all differences. Example:

dinglehopper-summarize output_folder/ --occurences-threshold 10

dinglehopper-line-dirs

You may also want to compare a directory of GT text files (e.g. gt/line0001.gt.txt) with a directory of OCR text files (e.g. ocr/line0001.some-ocr.txt). A separate CLI is provided for this:

dinglehopper-line-dirs gt/ ocr/

dinglehopper-extract

The tool dinglehopper-extract extracts the text of the given input file on stdout, for example:

dinglehopper-extract --textequiv-level line OCR-D-GT-PAGE/00000024.page.xml

OCR-D

As an OCR-D processor:

ocrd-dinglehopper -I OCR-D-GT-PAGE,OCR-D-OCR-TESS -O OCR-D-OCR-TESS-EVAL

This generates HTML and JSON reports in the OCR-D-OCR-TESS-EVAL filegroup.

The OCR-D processor has these parameters:

Parameter	Meaning
`-P metrics false`	Disable metrics and the green-red color scheme (default: enabled)
`-P textequiv_level line`	(PAGE) Extract text from TextLine level (default: TextRegion level)

For example:

ocrd-dinglehopper -I ABBYY-FULLTEXT,OCR-D-OCR-CALAMARI -O OCR-D-OCR-COMPARE-ABBYY-CALAMARI -P metrics false

Developer information

Please refer to README-DEV.md.

dinglehopper's People

stweil amitdo kba wrznr bobld jkamlah bertsky diegosiqueir4 trendingtechnology circleci-config-suggestions-bot inl sadra-barikbin

dinglehopper's Issues

COMBINING LATIN SMALL LETTER O and COMBINING RING ABOVE can be ignored in compare

Unicode allows different representations for the same character. dinglehopper currently does not detect that such different representations are identical characters, but handles them like a recognition error.

This can be fixed by normalizing the text before doing the comparison.

Example: We just had a case where the GT transcription used zuͦſein (u + COMBINING RING ABOVE) while the OCR detected zůſein (LATIN SMALL LETTER U WITH RING ABOVE). See https://digi.bib.uni-mannheim.de/fileadmin/digi/477429599/dh_055.html.
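The normalization suggested above can be sketched with Python's standard unicodedata module. This is an illustration only, not dinglehopper's code; the helper name normalize_for_comparison is hypothetical. Note that NFC only unifies canonically equivalent sequences: u + COMBINING RING ABOVE (U+030A) composes to ů, whereas u + COMBINING LATIN SMALL LETTER O (U+0366) has no precomposed form and would need an explicit substitution.

```python
import unicodedata


def normalize_for_comparison(text):
    """Bring text into NFC so canonically equivalent strings compare equal."""
    return unicodedata.normalize("NFC", text)


decomposed = "zu\u030a\u017fein"   # u + COMBINING RING ABOVE
precomposed = "z\u016f\u017fein"   # LATIN SMALL LETTER U WITH RING ABOVE
small_o = "zu\u0366\u017fein"      # u + COMBINING LATIN SMALL LETTER O

# Without normalization the two spellings differ code point by code point.
print(decomposed == precomposed)
# After NFC they are identical.
print(normalize_for_comparison(decomposed) == normalize_for_comparison(precomposed))
# The small-o variant is left untouched by NFC and still differs.
print(normalize_for_comparison(small_o) == precomposed)
```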

Skip when there is no file matching the pageId

ocrd-dinglehopper should issue a warning and skip a page if there is no matching GT or OCR file for a page.

Reported by @mnoelte in Gitter:
https://gitter.im/OCR-D/Lobby?at=5f76f0750dbbcf3dfa50648f

Add --progress parameter

Please provide a --verbose parameter. I just ran a comparison of a text file with an XML file (both not part of any OCR-D process) and waited forever. Finally, I aborted the process. It would be nice to know what was the issue internally....

In #26 a progress bar was proposed by @mikegerber .

Check licenses of used libraries

dinglehopper is Apache-licensed. All libraries used as libraries need to have a compatible license, e.g. BSD, MIT, Apache or public domain. GPL-licensed programs used seem to be fine. See also #48 for a relevant discussion.

Checklist from requirements*.txt:

More elegant handling of NFC conversion

Improve performance when calculating sequence alignment

Dinglehopper is using a custom Python implementation of the Levenshtein distance to calculate, score and show an alignment of two given texts.

According to my performance analysis done for #47, the distance and editops functions of this custom implementation are the main bottleneck when comparing particularly bad or large OCR results.

In #48 I proposed to use the C based python-Levenshtein as a replacement, which we discarded for the following reasons:

No support for aligning sequences of words (see comment by @mikegerber).
Currently no active maintenance.
Viral license (GPL 2)

One alternative and fast implementation for distance calculation is RapidFuzz, where @maxbachmann already started to address the issue of the distance calculation for arbitrary sequences in rapidfuzz/RapidFuzz#100.

At the moment RapidFuzz does not support the calculation of edit operations (see comment by @maxbachmann).

No tests are run for PRs?

#67 caused the code to fail, and the tests weren't run for the PR. Why?

try cutting corners to become faster

I know the task of aligning concatenated pages is much harder than just aligning text lines from the same segmentation (as in ocrd-cor-asv-ann-evaluate). But here it's all the more pressing to get decent performance IMO.

Therefore I would suggest the following:

Replace the matrix calculation with some C library backend (there are many existing packages for Python that already do this). The algorithm will still be O(n²) but the linear factor will be an order of magnitude smaller IIRC.
Instead of aligning the full page, try to cut corners by making the problem n smaller: first compare pairs of regions, taking the best fit, then compare pairs of lines, taking the best fit. This comparison above the line level could also be merely approximate (you only need a proportional score, no actual alignment, and you can have boundaries, no exhaustive search). For example, difflib.SequenceMatcher.quick_ratio can do this.
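The second suggestion can be sketched as follows: before any exact alignment, a cheap approximate score picks the most plausible OCR line for each GT line. This is a minimal sketch using difflib.SequenceMatcher.quick_ratio from the standard library; the function best_match and the sample lines are hypothetical. quick_ratio only compares character counts (it ignores order), so it is an upper bound on the real similarity and can only serve as a pre-filter.

```python
from difflib import SequenceMatcher


def best_match(gt_line, ocr_lines):
    """Return the index of the OCR line with the highest quick_ratio score."""
    matcher = SequenceMatcher(autojunk=False)
    # SequenceMatcher caches information about the second sequence,
    # so it is set once and only the first sequence is swapped.
    matcher.set_seq2(gt_line)
    scores = []
    for line in ocr_lines:
        matcher.set_seq1(line)
        scores.append(matcher.quick_ratio())
    return max(range(len(ocr_lines)), key=scores.__getitem__)


ocr_lines = ["Berlin. In diefem Jahre", "Uber die vielen Sorgen"]
index = best_match("Über die vielen Sorgen", ocr_lines)
print(ocr_lines[index])
```

Only the pairs selected this way would then be passed to the expensive exact alignment, which keeps each alignment problem small.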

Feature request: fold to unaccented & lowercase

For search engine performance evaluation it would be nice to be able to compare text based on its base characters (e. g. Ä → a).

An option to ignore punctuation would be nice, too.
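What such folding could look like, as a sketch with the standard unicodedata module (the function name fold and its strip_punctuation parameter are hypothetical, not an existing dinglehopper option): the text is decomposed, combining marks are dropped, and the result is lowercased. Because compatibility decomposition (NFKD) is used, this also maps characters such as the long s (ſ) to s, which may or may not be wanted.

```python
import unicodedata


def fold(text, strip_punctuation=False):
    """Reduce text to lowercase base characters, optionally without punctuation."""
    decomposed = unicodedata.normalize("NFKD", text)
    kept = []
    for ch in decomposed:
        category = unicodedata.category(ch)
        if category.startswith("M"):
            # Combining marks (accents, diaeresis, ...) are dropped.
            continue
        if strip_punctuation and category.startswith("P"):
            continue
        kept.append(ch)
    return "".join(kept).lower()


print(fold("Ä"))
print(fold("Über, die Sorgen!", strip_punctuation=True))
```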

Review error rate definitions etc.

Move Travis builds to CircleCI

getLogger Irritation with regular CLI

Issue description

Running a recent version (1778b3) of dinglehopper produces complaints from the OCR-D logger:

dinglehopper 1300565-gt.xml 1300565.xml

=> 

21:17:33.416 CRITICAL root - getLogger was called before initLogging. Source of the call:
21:17:33.416 CRITICAL root -   File "/home/hartwig/Projekte/work/mlu/ulb/ulb-sachsen-anhalt-dinglehopper/qurator/dinglehopper/extracted_text.py", line 243, in get_first_textequiv
21:17:33.416 CRITICAL root -     log = getLogger("processor.OcrdDinglehopperEvaluate")

Even though all report files are being generated, the output is rather confusing.

Steps to reproduce the issue

call dinglehopper 1300565-gt.xml 1300565.xml (attached)

What's the expected result?

No logging error or no logging at all if no OCR-D is around

Additional details

The problem could be worked around by using OCR-D's initLogging also within the context of the non-OCR-D CLI, adding something like this in cli.py:

initLogging()
Config.progress = progress
process(gt, ocr, report_prefix, metrics=metrics, textequiv_level=textequiv_level)

Does dinglehopper want to stick with the OCR-D logger also in potential non-OCR-D contexts? Further, it looks like dinglehopper is currently missing any dedicated logging configuration, which couples it rather strongly not only to the OCR-D logging logic, but also to its configuration.

1300565-test.zip

Horrible failure with large documents

@stweil reported in Gitter:

Improvements of dinglehopper are very welcome. The old version took more than 4 hours to process two text files with 1875 lines each and required about 30 GB RAM. The new version terminates after 2 minutes, but with out of memory: it was killed by the Linux kernel after using more than 60 GB RAM. :-(

@cneud also submitted a large document (a newspaper page).

Investigate why the new version uses even more memory
Consider falling back to more efficient algorithms if necessary
Consider a regression test for this

Do not hardcode equivalences/substitutions

2020-10 actevedef_718448162-GT-ORDER-WRONG

DingleHopper does not create results

Using ocrd, version 2.38.0

I have tried out ocrd-dinglehopper like this:

ocrd-dinglehopper -l DEBUG -I $gtfileGrp,$ocrFileGrp -O $dinglefolder -P textequiv_level line

Somehow, from one of the input files no text is taken:

{
    "gt": "OCR-D-SEG-KRAK/OCR-D-SEG-KRAK_4749_007817786_00157.xml",
    "ocr": "OCR-D-TESS-OCR-MOD-04/OCR-D-TESS-OCR-MOD-04-4749_007817786_00157.xml",

    "cer": 4.5,
    "wer": Infinity,

    "n_characters": 56,
    "n_words": 0
}

Any idea?
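One reading of the numbers in this report, offered as a sketch rather than as a description of dinglehopper's actual code: an error rate is the number of edits divided by the length of the reference, so a reference with zero words and at least one edit yields an infinite WER, and a CER of 4.5 on 56 reference characters corresponds to 252 character edits, i.e. far more OCR text than GT text. The function error_rate below is hypothetical and only encodes this common convention.

```python
def error_rate(edits, reference_length):
    """Edits per reference unit; infinite if there is nothing to compare against."""
    if edits == 0:
        return 0.0
    if reference_length == 0:
        return float("inf")
    return edits / reference_length


print(error_rate(252, 56))  # more edits than reference characters
print(error_rate(3, 0))     # empty reference, some OCR words
```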

Test on Python 3.10 fails

https://app.circleci.com/pipelines/github/qurator-spk/dinglehopper/17/workflows/92b2aab0-1517-4b2f-8099-15f7190426ce/jobs/79

Display image

The tool should display an image corresponding to the text line/OCR error selected.

I'll probably use the local images to display, not IIIF as this seems more general.

Explore JSCSS possibilities to crop images
What about TIFF support in the browser? Or mandate converting to PNG/JPEG?
Implement it

dinglehopper keep hanging and test errors

Running "dinglehopper gt txt" and dinglehopper-line-dirs both hang without any message, and pytest reports errors:

collected 62 items / 18 deselected / 44 selected                                                   

qurator/dinglehopper/tests/extracted_text_test.py .............                              [ 29%]
qurator/dinglehopper/tests/test_align.py .......F..                                          [ 52%]
qurator/dinglehopper/tests/test_character_error_rate.py ..                                   [ 56%]
qurator/dinglehopper/tests/test_edit_distance.py .                                           [ 59%]
qurator/dinglehopper/tests/test_editops.py ..                                                [ 63%]
qurator/dinglehopper/tests/test_ocr_files.py .............                                   [ 93%]
qurator/dinglehopper/tests/test_word_error_rate.py ...                                       [100%]

============================================= FAILURES =============================================
__________________________________ test_with_some_fake_ocr_errors __________________________________

    def test_with_some_fake_ocr_errors():
>       result = list(
            align(
                "Über die vielen Sorgen wegen desselben vergaß",
                "SomeJunk MoreJunk Übey die vielen Sorgen wegen AdditionalJunk deffelben vcrgab",
            )
        )

qurator/dinglehopper/tests/test_align.py:70: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

s1 = ['Ü', 'b', 'e', 'r', ' ', 'd', ...], s2 = ['S', 'o', 'm', 'e', 'J', 'u', ...]

    def seq_align(s1, s2):
        """Align general sequences."""
        s1 = list(s1)
        s2 = list(s2)
        ops = levenshtein_editops(s1, s2)
        i = 0
        j = 0
    
        while i < len(s1) or j < len(s2):
            o = None
            try:
                ot = ops[0]
                if ot[1] == i and ot[2] == j:
                    ops = ops[1:]
                    o = ot
            except IndexError:
                pass
    
            if o:
                if o[0] == "insert":
                    yield None, s2[j]
                    j += 1
                elif o[0] == "delete":
                    yield s1[i], None
                    i += 1
                elif o[0] == "replace":
                    yield s1[i], s2[j]
                    i += 1
                    j += 1
            else:
>               yield s1[i], s2[j]
E               IndexError: list index out of range

qurator/dinglehopper/align.py:42: IndexError
===================================== short test summary info ======================================
FAILED qurator/dinglehopper/tests/test_align.py::test_with_some_fake_ocr_errors - IndexError: lis...
=========================== 1 failed, 43 passed, 18 deselected in 30.24s ===========================

also stuck with:
qurator/dinglehopper/tests/test_integ_table_extraction.py ..... [ 83%]
qurator/dinglehopper/tests/test_integ_word_error_rate_ocr.py ..

python version 3.9.0. Thanks.

Set code width to 90

For flake8
For vim

Documentation: README completeness, debug ocrd-tool.json

Please debug your ocrd_tool.json file.
I found an error:

<report valid="false">
  <error>[] 'version' is a required property</error>
</report>

You can find the ocrd-tool.json documentation: https://ocr-d.github.io/ocrd_tool

Please check your README file and complete it. Your README is fine, but please compare it with the ideal README file:

# Name of application


## Introduction
...

## Installation
...

## Usage
...

## Testing
...

Thank you very much.

Improve printing HTML report

When printed, the HTML report has the following issues for long texts:

Page break between "Character differences" header and the differences
Long texts are cut off after one page in the report

Things to do when dropping Python 3.5 support

This issue is to collect stuff pertaining to dropping Python 3.5 support when it's possible:

str() on Path objects is not necessary anymore on Python 3.6+
- in qurator/dinglehopper/tests/test_integ*.py
Use type annotations instead of type= for attr classes

Update dinglehopper version in ocrd_all

Wait for #10 (Display image)
Wait for #9 (Display segment ID)
Wait for #24 (Release on PyPI)

Display document page metadata

ALTO files contain meta information like this:

<OCRProcessing ID="IdOcr">
  <ocrProcessingStep>
    <processingDateTime>2014-05-21</processingDateTime>
    <processingSoftware>
       <softwareCreator>ABBYY</softwareCreator>
       <softwareName>ABBYY FineReader Engine</softwareName>
      <softwareVersion>11</softwareVersion>
    </processingSoftware>
  </ocrProcessingStep>
</OCRProcessing>

The report should display it.

Offline use

The stylesheet for the report linked as <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css"...> should be distributed with the sources in order to support the rendering even when the tool is used offline.

Improve visual alignment for longer documents

@stweil asked in #62:

Unrelated: in the result the lines from GT and OCR result are side by side at the beginning, but that synchronization gets lost later. Why?

Use black code formatter

Joking aside, I think I'll just use the black code formatter in the future, reasonable results and no more arguing about bike sheds... eh code formatting.

Originally posted by @mikegerber in #37 (comment)

Note: black requires Python >= 3.6!

Todo:

list black in developer requirements
Update setup.cfg and .editorconfig according to https://black.readthedocs.io/en/stable/compatible_configs.html
Maybe: Add pre-commit Hook https://black.readthedocs.io/en/stable/version_control_integration.html
[-] Maybe: Add GitHub Action: https://black.readthedocs.io/en/stable/github_actions.html
Maybe: Inform contributors about the code formatting choice and give a hint about editor integrations: https://black.readthedocs.io/en/stable/editor_integration.html
Reformat codebase without compromising git blame: https://black.readthedocs.io/en/stable/installation_and_usage.html#migrating-your-code-style-without-ruining-git-blame

Extend json report to allow evaluation of a series of pages

To analyze a series of pages as a whole, it would be helpful to include the number of characters/words in the created json file.
e.g.: noOfCharacters=200
noOfWords=43

Add naming parameter for output file

There should be a way to parameterize the name of the output file.

If I evaluate multiple files, with the dinglehopper command, it should be possible to name every output file. Currently, I have to run a script that renames each report.json or report.html.

Test using Python 3.9

Feature Request: Comparison options

It would be nice to have an option to ...

fold accented and uppercase characters to lowercase (Ä → a) and to ignore punctuation.
let dinglehopper try to arrange paragraphs so that wrong segmentation order (perhaps not so important for full text search) can be ignored

Feature request: status line text with segment IDs

To make navigation in the source annotation easier, show the current TextRegion / TextLine / Word / Glyph ids in the browser's status bar.

Generate per-workspace CER + WER

It's easy to calculate this from the individual CER/WER and the character/word counts.
But how to save a global JSON report in the METS? It would not "manifest a physical page" which OCR-D seems to demand for any file
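The calculation mentioned in the first point can be sketched as a weighted average: since each page's rate is its edit count divided by its reference length, multiplying the rate by the count recovers the edits, and the sums give the global rate. The function aggregate_rate is hypothetical; pages with an infinite rate (empty reference) would need separate handling and are not covered here.

```python
def aggregate_rate(pages):
    """Combine per-page (rate, count) pairs into one overall error rate."""
    pages = list(pages)
    total = sum(count for _, count in pages)
    if total == 0:
        # Nothing to weight by; returning 0.0 is an arbitrary choice.
        return 0.0
    edits = sum(rate * count for rate, count in pages)
    return edits / total


# Two pages: 10 edits in 100 characters and 150 edits in 300 characters.
overall_cer = aggregate_rate([(0.1, 100), (0.5, 300)])
print(overall_cer)
```

A plain mean of the per-page rates would give 0.3 here, which overstates the influence of the short page; the weighted form gives 160 edits over 400 characters.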

Support comparing line GT directories with line OCR directories

In #62, @stweil's original problem was - as I understand it - to compare a directory with line GT text files with a directory of line OCR text files. For now I've created fake test data to implement this fake-line-gt.zip. It looks like this:

% ls *
gt:
line001.gt.txt  line003.gt.txt  line005.gt.txt  line007.gt.txt  line009.gt.txt  line011.gt.txt
line002.gt.txt  line004.gt.txt  line006.gt.txt  line008.gt.txt  line010.gt.txt

some-ocr:
line001.some-ocr.txt  line003.some-ocr.txt  line005.some-ocr.txt  line007.some-ocr.txt  line009.some-ocr.txt  line011.some-ocr.txt
line002.some-ocr.txt  line004.some-ocr.txt  line006.some-ocr.txt  line008.some-ocr.txt  line010.some-ocr.txt

A first implementation should compare the text of pairs of files (matching by filename) and produce a report of metrics & differences over all of the lines. First idea of the CLI:

dinglehopper-lines gt/ --gt-suffix .gt.txt some-ocr/ --ocr-suffix .some-ocr.txt

I'm not sure if this will be the final CLI interface but it's what seems necessary on first glance.
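The filename matching described above could look roughly like this. It is a sketch under the assumption that a pair is formed by stripping the GT suffix and appending the OCR suffix; the function pair_line_files is hypothetical and the demo uses a temporary folder with made-up files.

```python
import tempfile
from pathlib import Path


def pair_line_files(gt_dir, gt_suffix, ocr_dir, ocr_suffix):
    """Return (gt_path, ocr_path) pairs that share the same base name."""
    pairs = []
    for gt_path in sorted(Path(gt_dir).glob("*" + gt_suffix)):
        base = gt_path.name[: -len(gt_suffix)]
        ocr_path = Path(ocr_dir) / (base + ocr_suffix)
        if ocr_path.exists():
            pairs.append((gt_path, ocr_path))
    return pairs


with tempfile.TemporaryDirectory() as tmp:
    gt_dir = Path(tmp) / "gt"
    ocr_dir = Path(tmp) / "some-ocr"
    gt_dir.mkdir()
    ocr_dir.mkdir()
    (gt_dir / "line001.gt.txt").write_text("Über", encoding="utf-8")
    (gt_dir / "line002.gt.txt").write_text("die", encoding="utf-8")
    (ocr_dir / "line001.some-ocr.txt").write_text("Uber", encoding="utf-8")
    pairs = pair_line_files(gt_dir, ".gt.txt", ocr_dir, ".some-ocr.txt")
    names = [(gt.name, ocr.name) for gt, ocr in pairs]

print(names)
```

In this sketch GT lines without an OCR counterpart (line002 above) are silently skipped; a real tool would probably warn about them.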

UnorderedGroup

@cneud reported problems with the ENP dataset. Example files:

example.zip

The GT file contains an UnorderedGroup which triggers a NotImplementedError:

% dinglehopper 00008061.gt.xml 00008061.eng.xml
Traceback (most recent call last):
  File "/home/mike/.virtualenvs/dinglehopper-github/bin/dinglehopper", line 11, in <module>
    load_entry_point('dinglehopper', 'console_scripts', 'dinglehopper')()
  File "/home/mike/.virtualenvs/dinglehopper-github/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/mike/.virtualenvs/dinglehopper-github/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/mike/.virtualenvs/dinglehopper-github/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/mike/.virtualenvs/dinglehopper-github/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/mike/devel/dinglehopper-github/qurator/dinglehopper/cli.py", line 180, in main
    process(gt, ocr, report_prefix, metrics=metrics, textequiv_level=textequiv_level)
  File "/home/mike/devel/dinglehopper-github/qurator/dinglehopper/cli.py", line 93, in process
    gt_text = extract(gt, textequiv_level=textequiv_level)
  File "/home/mike/devel/dinglehopper-github/qurator/dinglehopper/ocr_files.py", line 155, in extract
    return page_extract(tree, textequiv_level=textequiv_level)
  File "/home/mike/devel/dinglehopper-github/qurator/dinglehopper/ocr_files.py", line 79, in page_extract
    raise NotImplementedError
NotImplementedError

Make this a warning and read UnorderedGroups in XML order
Check what other tools do with this
Find a proper solution (Hard!)

Add a parameter for selection of text level (PAGE XML)

Currently, dinglehopper extracts text from PAGE XML files on the region level (https://github.com/qurator-spk/dinglehopper/blob/master/qurator/dinglehopper/ocr_files.py#L50). It would be wonderful if you could add a level-of-operation parameter to allow for extraction from line or word level. (Manual OCR correction is often done on a specific level and propagation of text through the different levels is not widely implemented, i.e. I only know of the Aletheia pro edition which does it in both directions)

Must HTML-quote angle brackets etc.

I often get these:

htmlParseStartTag: invalid element name
iff">h</span>-<br>Berlin. In dieſem Jahre iſt no<span class="cdiff1975 diff"><

tmlParseStartTag: invalid element name
e;<br>| Ih ſ<span class="cdiff3399 diff">c</span><span class="cdiff3400 diff"><

Web browsers seem to go along with this, but tools don't.
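The kind of quoting asked for here can be illustrated with html.escape from the Python standard library. This is a sketch only: the helper diff_span is hypothetical, the markup merely mimics the spans in the error messages above, and how dinglehopper actually builds its report is not shown.

```python
from html import escape


def diff_span(css_class, text):
    """Build a diff span whose text content is safe to embed in HTML."""
    return f'<span class="{escape(css_class)} diff">{escape(text)}</span>'


print(diff_span("cdiff1975", "<"))
print(diff_span("cdiff3400", "a & b"))
```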

Update RapidFuzz again

For now, we stick to rapidfuzz ~ 2.0.0, because later rapidfuzz versions seem to produce un-slice-able editops (see #67). I've reverted #67 for now, but, when fixed, the changes need to be merged again.

Honor TextEquiv index

https://ocr-d.de/de/gt-guidelines/pagexml/pagecontent_xsd_Complex_Type_pc_TextLineType.html#TextLineType_TextEquiv

@JKamlah wrote:

This is due to the work with LAREX. We did some corrections with LAREX on line level to produce GT files. LAREX kept both the original text and the corrected text in the result file and separated them by index. The original text got the index 1 and the corrected ones index 0, not corrected lines got no index at all. I don't know if that is a LAREX specific procedure(?). Link to the LAREX example.

PAGE specs:

Used for sort order in case multiple TextEquivs are defined. The text content with the lowest index should be interpreted as the main text content.

(See #5 (comment))
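The rule quoted from the PAGE specification can be sketched with the standard library's ElementTree. The treatment of TextEquivs without an index (use the first one, and only if no indexed TextEquiv exists) is an assumption made for this example, as are the function name main_text and the sample line.

```python
import xml.etree.ElementTree as ET

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

SAMPLE = """\
<TextLine xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15" id="l1">
  <TextEquiv index="1"><Unicode>orginal OCR text</Unicode></TextEquiv>
  <TextEquiv index="0"><Unicode>corrected text</Unicode></TextEquiv>
</TextLine>
"""


def main_text(element):
    """Return the text of the TextEquiv with the lowest index, if any."""
    equivs = element.findall("pc:TextEquiv", NS)
    indexed = [te for te in equivs if te.get("index") is not None]
    if indexed:
        chosen = min(indexed, key=lambda te: int(te.get("index")))
    elif equivs:
        # Assumption: without any index, take the first in document order.
        chosen = equivs[0]
    else:
        return None
    return chosen.findtext("pc:Unicode", default="", namespaces=NS)


line = ET.fromstring(SAMPLE)
print(main_text(line))
```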

Monthly scheduled tests

Feature request: list with error frequencies in report

Other OCR evaluation tools provide additional statistics that are useful for further error analysis or for post-correction, e.g. lists with the frequency of character/word errors per page and the types of errors (insertion/deletion/substitution). See e.g. the output example of ocrevalUAtion.
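Such a frequency list could be derived from an alignment of GT and OCR text. The sketch below uses difflib.SequenceMatcher from the standard library as a stand-in aligner; it does not guarantee a minimal edit script and is not the alignment dinglehopper uses. The function error_counts is hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def error_counts(gt, ocr):
    """Count (operation, gt_part, ocr_part) triples for all non-matching spans."""
    counts = Counter()
    matcher = SequenceMatcher(None, gt, ocr, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # tag is one of "replace", "delete" or "insert".
            counts[(tag, gt[i1:i2], ocr[j1:j2])] += 1
    return counts


counts = error_counts("vergaß", "vcrgab")
for (tag, gt_part, ocr_part), n in counts.most_common():
    print(tag, repr(gt_part), "->", repr(ocr_part), n)
```

Summing such counters over all pages of a document would give the per-document error frequencies.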

Support optional stopword list

A common use case for OCR evaluation (e.g. for search engine indexing, text and data mining, etc.) is to omit stopwords from the word evaluation to get an understanding of the correctness of "significant words" only.

It would therefore be useful if dinglehopper also supported the optional use of a stopword list provided via parameter/config file. This is already supported in ocrevalUAtion.
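The filtering step itself is small. As a sketch (the function without_stopwords is hypothetical, and case-insensitive matching is an assumption), stopwords would be removed from both the GT and the OCR word lists before the word-level comparison:

```python
def without_stopwords(words, stopwords):
    """Drop all words that appear in the stopword list, ignoring case."""
    stop = {word.casefold() for word in stopwords}
    return [word for word in words if word.casefold() not in stop]


gt_words = "Über die vielen Sorgen".split()
print(without_stopwords(gt_words, ["die", "der", "das"]))
```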

Support disabling metrics + green/red

When comparing two OCR results it should be possible to disable metrics and the green/red color as they do not make sense.

Option/OCR-D parameter could be --no-metrics/no-metrics.

Add option in the CLI
Add parameter in the OCR-D processor
What to do about the JSON report?

Review ALTO text extraction

PPN768641977, page 0002
Is "ist nutzlos geworden. [...]" at the right place?

Feature request: rearrange paragraphs for minimum difference

It would be nice if dinglehopper could try to arrange paragraphs so that wrong segmentation order (perhaps not so important for full text search) could be ignored.

Documentation Enhancement

Please consider explaining some details in your documentation:

What are WER and CER? If you are not familiar with these terms, you don't grasp them immediately, although it is an easy concept.
Does dinglehopper automatically recognize the import format?
How are text files and XML files compared? Are the XML files simply stripped down to their text representation? How do you assure that there is no additional (or missing) empty paragraph screwing the evaluation?
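For the first question, the textbook definitions can be written down in a few lines: the character error rate (CER) is the Levenshtein distance between GT and OCR text divided by the number of GT characters, and the word error rate (WER) is the same on word sequences. The sketch below works on plain code points and splits words at whitespace, which are simplifications chosen for this example and may differ from how dinglehopper counts characters and words; it also does not handle an empty GT.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,             # delete x
                    current[j - 1] + 1,          # insert y
                    previous[j - 1] + (x != y),  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def cer(gt, ocr):
    return levenshtein(gt, ocr) / len(gt)


def wer(gt, ocr):
    gt_words = gt.split()
    return levenshtein(gt_words, ocr.split()) / len(gt_words)


print(cer("vergaß", "vcrgab"))
print(wer("Über die Sorgen", "Uber die Sorgen"))
```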

Also, please provide a --verbose parameter. I just ran a comparison of a text file with an XML file (both not part of any OCR-D process) and waited forever. Finally, I aborted the process. It would be nice to know what was the issue internally.... (Moved to #30)

And, despite this critique, thank you for providing such a handy tool! :)

Edit:
I found even more:

How can I process a bunch of ground truth files that are not part of the OCR-D mets.xml? Or, how can I assign them their corresponding page in the mets.xml? There should be some way!

Discuss IIIF support

There has been some unresolved discussion with @cneud regarding IIIF support. This should be cleared up.

Release on PyPI

Create a GitHub Actions workflow to release on PyPI
After fixing qurator-spk/setuptools_ocrd#10, remove the workaround MANIFEST.in again
Review contents of sdist
Update README (to recommend installing from PyPI)

Warn if there is text missing in the ReadingOrder

For 00451941.gt.xml, dinglehopper-extract does not extract the header's text DE L'ESPRIT DE L'HOMME.