
NewsEye / READ OCR training dataset from Austrian Newspapers (19th C.)

About the original data set

Austrian Newspapers is a ground truth data set created with Transkribus from Austrian newspapers by the Library Labs of the Austrian National Library (Österreichische Nationalbibliothek). See this publication for details:

Günter Mühlberger, & Günter Hackl. (2019). NewsEye / READ OCR training dataset from Austrian Newspapers (19th C.) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.3387369

The original data set was published under the Creative Commons Attribution 4.0 International license.

Austrian Newspapers 2.0.0 (April 2023)

A revision of the data set was carried out by Mannheim University Library from November 2022 to April 2023 using Transkribus. All transcriptions are provided as PAGE XML in the data folder. The original separation of the data set into TrainingSet_ONB_Newseye_GT_M1+ and ValidationSet_ONB_Newseye_GT_M1+ was kept.

The revision includes:

  1. Layout correction of text regions, text lines and baselines.
  2. Region labeling ("header", "headings", "paragraphs", "reference", "footer").
  3. Correction and enhancement of transcriptions according to OCR-D Ground Truth Guidelines Level 2.

Statistics

Find more information about the revised dataset in our wiki.

Transcription guidelines

The transcription rules are based on the OCR-D Ground Truth Guidelines Level 2 with some exceptions (see below):

  1. Special characters:

    • Long s (ſ)
    • Fractions (¼ ½ ¾ ⅐ ⅑ ⅒ ⅓ ⅔ ⅕ ⅖ ⅗ ⅘ ⅙ ⅚ ⅛ ⅜ ⅝ ⅞)
    • Fraction slash (⁄) (U+2044), if
      • the fraction can't be transcribed with a single Unicode fraction character, or
      • numerator and denominator are not at the same baseline height
    • R rotunda (ꝛ)
    • Dagger (†)
    • Black Right Pointing Index (☛)
    • Black Left Pointing Index (☚)
    • Superscript Numbers 0-9 (⁰ ¹ ² ³ ⁴ ⁵ ⁶ ⁷ ⁸ ⁹)
    • Subscript Numbers 0-9 (₀ ₁ ₂ ₃ ₄ ₅ ₆ ₇ ₈ ₉)
    • White square (□)
    • White medium square (◻)
    • Black square (■)
    • White up-pointing triangle (△)
    • Black up-pointing triangle (▲)
    • Bullet (•)
    • Black circle (●)
    • Black large circle (⬤)
    • Heavy four balloon-spoked asterisk (✤)
  2. Additional characters transcribed true to original (contrary to OCR-D Level 2):

    • Double oblique hyphen (⸗)
    • Em dash (—) instead of En dash (–)
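Transcriptions can be screened for unexpected code points against the allowed characters above. A minimal sketch (the whitelist covers only a subset of the characters listed above, and the accepted Latin range is an illustrative simplification, not part of the guidelines):

```python
# Partial whitelist of the allowed special characters listed above.
ALLOWED_EXTRAS = set('ſ⸗—†☛☚•ꝛ¼½¾⅓⅔⁄')

def unexpected_chars(line: str) -> set:
    """Return the characters of a transcription line that are neither
    printable ASCII, nor Latin-1/Latin-Extended-A letters, nor in the
    (partial) whitelist of allowed special characters."""
    return {c for c in line
            if not (c.isascii() and c.isprintable())
            and not ('\u00C0' <= c <= '\u017F')
            and c not in ALLOWED_EXTRAS}
```

Characters such as `¬` (a typical artefact of older OCR output) would be flagged, while `ſ` and `⸗` pass.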

Funding

This revision is part of the OCR-D project and predominantly funded by the German Research Foundation (DFG).

Contributors

jkamlah, stweil, tsmdt, wollmers


Issues

Question: Format of pull request

As I understand it, updates and corrections are applied to both the whole PAGE XML and the line text files.

I will look into whether a quick hack to update the XML pays off. I use my own JSON format based on hOCR and add information such as the font. PAGE XML and the other popular formats are on my TODO list.

Deskew the line images?

Some line images are more or less skewed and contain fragments of the preceding or following line.

E. g. ONB_aze_18950706_1.jpg_tl_303.png


In the above image of TextLine id="tl_303", the text

Mit den Fliekenden drangen wir durch die Thore von

is not even completely represented.

I would therefore suggest cutting the lines out more accurately with improved image processing.

Some PNGs in gt/train are truncated

The line images in gt/train are sometimes too short, e.g.

ONB_ibn_19110701_018.tif_tl_6.gt.txt:###wertet. Das geringſte Gebot beträgt 10.020 Kronen.
ONB_ibn_19110701_018.tif_tl_63.gt.txt:b###chten Utenſilien, über welche die öffentliche Ver⸗
ONB_ibn_19110701_018.tif_tl_64.gt.txt:###ßerung ausgeſchrieben wird. Offerte bis 4. Juli.
ONB_ibn_19110701_018.tif_tl_69.gt.txt:lo###e der k. k. Forſt⸗ und Domänen ⸗ Direktion in
ONB_ibn_19110701_018.tif_tl_98.gt.txt:###war der Hausanteil auf 5538 Kronen und der Anteil

Just a reminder to explore this issue later.

Discussion: Transcription of decimal dot in numbers

There are some numbers in the original images where the decimal dot does not sit near the baseline: it is either at the height of the hyphen or at the top edge (the height of capitals).

IMHO, for broader use of the GT files (OCR training, benchmarking), an intermediate transcription should be used, i.e. Unicode without PUA and as close as possible to the original glyphs (long s), spelling, etc. Conversion to a basic level (current spelling, German keyboard) is easier than conversion in the other direction.

What dots are available in Unicode:

char cpoint  name (general category)
'.'  U+002E  FULL STOP (Other_Punctuation)
'·'  U+00B7  MIDDLE DOT (Other_Punctuation)

'˙'  U+02D9  DOT ABOVE (Modifier_Symbol)
'·'  U+0387  GREEK ANO TELEIA (Other_Punctuation)
'᛫'  U+16EB  RUNIC SINGLE PUNCTUATION (Other_Punctuation)
'․'  U+2024  ONE DOT LEADER (Other_Punctuation)
'‧'  U+2027  HYPHENATION POINT (Other_Punctuation)
'∙'  U+2219  BULLET OPERATOR (Math_Symbol)
'⋅'  U+22C5  DOT OPERATOR (Math_Symbol)
'⸱'  U+2E31  WORD SEPARATOR MIDDLE DOT (Other_Punctuation)
'⸳'  U+2E33  RAISED DOT (Other_Punctuation)
'・' U+30FB  KATAKANA MIDDLE DOT (Other_Punctuation)
'ꞏ'  U+A78F  LATIN LETTER SINOLOGICAL DOT (Other_Letter)

MIDDLE DOT appears frequently in current and old typography and is available in most fonts.

But I hesitate to use DOT ABOVE because it is a modifier symbol. We can use it now and maybe convert it later after gathering some opinions.
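The names and general categories above can be verified with Python's standard `unicodedata` module, which is also a convenient way to double-check a candidate before settling on it (a quick sketch):

```python
import unicodedata

# Candidate decimal-dot characters: print code point, Unicode name and
# general category (Po = Other_Punctuation, Sk = Modifier_Symbol,
# Sm = Math_Symbol).
for ch in ['.', '\u00B7', '\u02D9', '\u2024', '\u2219', '\u22C5']:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch):25}  {unicodedata.category(ch)}")
```

The `Sk` category of DOT ABOVE is exactly what makes it a Modifier_Symbol and hence a questionable choice for a decimal separator.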

Official Announcement: Release of the revised version according to the OCR-D Level 2 guidelines

Hello everyone,
we plan to publish the revised version according to the OCR-D Level 2 guidelines in this repository in the coming days.
In the past months, the ground truth was corrected and upgraded by student assistants and project staff. The long s and the double oblique hyphen were updated consistently, and any remaining transcription errors were corrected as far as possible. The polygons of the regions and text lines have been corrected, as has the reading order. The regions were also tagged. Further information on the optimisations will be available at the time of publication.

The following steps are planned for the publication:

  • Tagging of the current status
  • The 'master' branch will be renamed to 'main'
  • Replace the current files with the revised files (OCR-D Level 2) in the main branch. There will be only PAGE XML files plus instructions on how to generate line files from them. The structure of the repository will follow the OCR-D GT repo template, and the README will be adapted accordingly. The source of origin and other references will remain the same.

We hope you will enjoy the revised version.

J/I transcription in Fraktur

At the moment, every Fraktur J before a vowel is transcribed as I in the text files.

I am not sure whether this harms the quality of training. Maybe not, because training takes the adjacent characters into account.

On the other hand, I prefer a J in the image to come out as a J in the result. Most blackletter fonts did not have an I. After ~1900, an I began to be cut for some (~25 %) blackletter fonts. The difference can only be seen when both appear in the same text.

AFAIK GT4Hist keeps J. That can be a problem if GT4Hist is combined with AustrianNewspapers.

Quick proof:

$ grep -R --exclude *.png 'J[bcdfghjklmnpqrsſtvwxz]' .
ONB_ibn_19110701_037.tif_tl_13.gt.txt:Seit Ende Mai ds. Js. ſind ſowohl die Zugänge als als auch die
ONB_ibn_18640702_003.tif_tl_13.gt.txt:machung : In Folge hoher k. k. Statthalterei⸗Kundmachung vom 31. Mai d. Js.
ONB_ibn_18640702_003.tif_tl_16.gt.txt:d. Js. beginnt, und es haben ſich die aus dem Civilſtande Eintretenden mit dem
ONB_ibn_18640702_012.tif_tl_38.gt.txt:d. Js. um ſo gewiſſer anher einzuzahlen, als ſonſt nach Ablauf dieſer Friſt die
ONB_ibn_18640702_012.tif_tl_16.gt.txt:Am 4. Juli d. Js. um 9 Uhr früh angefangen, werden im Hauſe Nr. 57
ONB_ibn_19110701_035.tif_tl_174.gt.txt:Kufſtein (Tirol) iſt mit 1. September ds. Js.
ONB_ibn_19110701_027.tif_tl_7.gt.txt:Das Schuljahr 1911|12 beginnt am 16. September ds. Js. Die Schüleraufnahme
ONB_ibn_18640702_009.tif_tl_17.gt.txt:Pachtliebhaber wollen ſich bis Jakobi d. Js. bei der gräfl. v. Enzenberg⸗
ONB_ibn_18640702_009.tif_tl_12.gt.txt:Auf Martini d. Js. kommt zu verpachten:

$ grep -R --exclude *.png 'I[bcdefghjklmnpqrsſtvwxz]' . | wc -l
    2536

Improve line images

As already mentioned in issues #29, #28, #3 and #2, there are problems with the line images: they do not contain the text of the corresponding *.gt.txt files 1:1.

Problems are:

  • truncated left, top, right or bottom (parts of characters or complete characters are not in the image)
  • contain parts of other lines
  • skewed
  • some rotated 90, 180, or 270 degrees
  • warped

In short: line segmentation should be improved.

To do this automatically, the usual methods of a best-practice OCR workflow should be used, without repeating the manual step of segmenting pages into regions. The regions in the PAGE XML seem OK.

Rotate by multiples of 90 degrees

Judging by the Baseline tags in the PAGE XML, there are 225 lines rotated to within +/- 10 degrees of 90, 180 or 270 degrees, most of them within +/- 2 degrees. Rotation by exactly 90, 180 or 270 degrees would be lossless.

Deskew within 10 degrees

10,603 of the ~57,000 images have a skew between 0.5 and 10 degrees; the majority are skewed within +/- 2 degrees. It would be better to express the lower threshold in pixels (0.5 or 1.0), because it makes no sense to rotate when the difference at the left or right end of a line is less than one pixel. This may need tests on a larger number of lines, because good image manipulation programs like ImageMagick work internally at subpixel level. Tesseract also achieves better accuracy on exactly deskewed lines.
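The per-line skew can be estimated directly from the end points of the PAGE XML Baseline. A minimal sketch with made-up coordinates (not taken from the data set):

```python
import math

def baseline_skew_degrees(points):
    """Estimate the skew of a text line from the first and last point of
    its PAGE XML Baseline (parsed into (x, y) tuples). Image coordinates
    have y growing downwards, so a baseline that drops to the right
    yields a negative angle here."""
    (x0, y0), (x1, y1) = points[0], points[-1]
    return math.degrees(math.atan2(y0 - y1, x1 - x0))

# Made-up example: a line 1000 px wide whose baseline drops 35 px,
# i.e. about -2 degrees of skew.
angle = baseline_skew_degrees([(100, 500), (1100, 535)])
```

The pixel threshold discussed above would then translate to skipping lines where the vertical offset between the baseline end points is below ~1 px.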

The Baseline in the XML is not always reliable.

Looking at some samples between 2 and 10 degrees, they seem to be short sequences of characters at different vertical positions, which may themselves be skewed within 2 degrees. It would therefore be better to estimate the average skew of each TextRegion first and take that into consideration.

Deskew between 10 and 80 degrees

There are ~50 images in this range. Some are diagonal labels in tables; some are more like illustrations that are part of advertisements. This sort of advertisement is a special case without an easy solution (it is a separate issue).

Segment regions into lines

Maybe a better approach is to create region images of pure text regions first and use ocropy for segmentation and dewarping. The advantages are that ocropy uses masks, removes speckles outside the mask, and adds white pixels at the borders. The disadvantages are binary images (no colour), lost position information, and that ocropus-dewarp scales the images down (a closer look into the source code might turn up a better solution). Also, ocropy does not ignore some noise at the beginning and end of a line; this can be cut away using the GT texts.

BTW, adding white pixels around the borders of the line images in GT4Hist would maybe be an improvement as well.

Each of these steps improves the recognition accuracy of degraded images by a few percent. Should GT images for training be "too good"? IMHO it is not cheating as long as it is done with the available tools of a modern OCR workflow; the remaining image quality will be noisy enough.

Text of large height (e.g. titles)

They are all truncated at the top. The TextRegion seems OK, but the polygon of the TextLine has a wrong height. If the region contains only one line, this still allows cutting out a region image and segmenting it.

Update information in Page-XML

In the case of skew and warp, updating the polygons makes no sense and is not easy. For large titles it does make sense.

Also, fontFamily and fontSize, e.g.

<TextStyle fontFamily="Times New Roman" fontSize="4.5"/>

could be updated with a better guess. fontSize (which should be in points according to the PAGE XML specification) needs reliable information about the scanning dpi or the original paper format. For fontFamily it is questionable: it could default to "'Times New Roman', serif" and be changed to sans-serif, Fraktur, Textura or Gothic where automatic classification says so.
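A fontSize guess could use the standard pixels-to-points conversion. A sketch under the assumption that the scan resolution is known (the numbers are illustrative):

```python
def font_size_points(line_height_px: float, dpi: float) -> float:
    """Convert a measured line height in pixels to points (1 pt = 1/72 in).
    A reliable dpi is required; mapping the measured height to the nominal
    font size would still need calibration per typeface."""
    return line_height_px / dpi * 72.0

# Hypothetical example: a 40 px tall line at 300 dpi is about 9.6 pt.
size = font_size_points(40, 300)
```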

Line files for TextLine ids not following the pattern tl_\d+ are missing

E.g. the training set file ONB_aze_18950706_1.xml contains

<TextLine id="line_1545028417729_5" custom="readingOrder {index:1;}">

but there are only line files following the id pattern tl_\d+.

Either we rename the line ids in the XML, or we use the existing ids from the XML for the line files.
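A first step for either option is listing all TextLine ids that do not follow the pattern of the existing line files; a sketch using Python's `xml.etree` (the 2013-07-15 PAGE namespace is an assumption and may differ per file):

```python
import re
import xml.etree.ElementTree as ET

# Assumed PAGE content namespace; check the xmlns of the actual files.
NS = '{http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15}'

def nonmatching_line_ids(page_root):
    """Return all TextLine ids in a parsed PAGE XML tree that do not
    follow the tl_<number> pattern used for the line files."""
    ids = (tl.get('id') for tl in page_root.iter(NS + 'TextLine'))
    return [i for i in ids if not re.fullmatch(r'tl_\d+', i)]
```

Applied to the example above, `tl_303` passes while `line_1545028417729_5` is reported.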

Info: current statistics

Compared the original XML "ONB_newseye" to the current line texts "AustrianNewspapers".

compare_xml.pl Version 0.01

Compare XML text output against ground truth (GRT):
XML: ONB_newseye
GRT: AustrianNewspapers

Summary:

              lines   words   chars
items ocr:    57541  326524 2198240 matches + inserts + substitutions
items grt:    57541  326394 2198051 matches + deletions + substitutions
matches:      23961  265356 2125325 matches
edits:        33580   61346   73806 inserts + deletions + substitutions
 subss:       33580   60860   71835 substitutions
 inserts:         0     308    1080 inserts
 deletions:       0     178     891 deletions
precision:   0.4164  0.8127  0.9668 matches / (matches + substitutions + inserts)
recall:      0.4164  0.8130  0.9669 matches / (matches + substitutions + deletions)
accuracy:    0.4164  0.8122  0.9664 matches / (matches + substitutions + inserts + deletions)
f-score:     0.4164  0.8128  0.9669 ( 2 * recall * precision ) / (recall + precision )
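The character-level figures can be reproduced from the raw counts with the formulas printed in the summary:

```python
# Character counts taken from the summary above.
matches, subs, inserts, deletions = 2125325, 71835, 1080, 891

precision = matches / (matches + subs + inserts)              # 0.9668
recall    = matches / (matches + subs + deletions)            # 0.9669
accuracy  = matches / (matches + subs + inserts + deletions)  # 0.9664
f_score   = 2 * recall * precision / (recall + precision)     # 0.9669
```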

Shortened list of the edits/mismatches:

Character match (confusion) table:
GRT => OCR  ratio  errors   count
---    --- ------ ------- -------
'ſ' => 's' 0.9985   56885   56971
'⸗' => '-' 0.0052      61   11639
'⸗' => '=' 0.3232    3762   11639
'⸗' => '¬' 0.6691    7788   11639
                    -----
SUM                 68496
+ transcription      1000   estimated transcription level 1 -> 2
                    -----
TOTAL transcription 69496

edits               73806
- transcription    -69496
                    -----
corrections          4310  (0.20% of all characters)

Rough guess of errors still in the GRT: 1000 - 2000.

Missing line files (txt and png)

Shocked at first, I checked the git history to see whether I had deleted them by mistake. No, they never existed.

E. g.

ONB_nfp_19110701_006.tif_tl_6.gt.txt
ONB_nfp_19110701_006.tif_tl_6.png

In the XML they are empty:

    <TextRegion type="paragraph" id="r_6_1" custom="readingOrder {index:5;}">
      <Coords points="3235,186 3239,186 3239,190 3235,190"/>
      <TextLine id="tl_6" primaryLanguage="German" custom="readingOrder {index:0;}">
        <Coords points="3236,187 3238,187 3238,189 3236,189"/>
        <Baseline points="3236,189 3238,189"/>
        <TextEquiv>
          <Unicode/>
        </TextEquiv>
        <TextStyle fontFamily="Times New Roman" fontSize="5.0" bold="true" italic="true"/>
      </TextLine>
      <TextEquiv>
        <Unicode></Unicode>
      </TextEquiv>
    </TextRegion>

This does not seem to be an important problem; I just did not expect the need to check for and handle missing or empty files everywhere.
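When generating line files from the PAGE XML, such empty lines can simply be skipped. A sketch (the 2013-07-15 PAGE namespace is an assumption and may differ per file):

```python
import xml.etree.ElementTree as ET

# Assumed PAGE content namespace; check the xmlns of the actual files.
NS = '{http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15}'

def nonempty_line_texts(page_root):
    """Yield (id, text) for each TextLine whose TextEquiv/Unicode actually
    contains text, skipping empty lines like tl_6 above."""
    for tl in page_root.iter(NS + 'TextLine'):
        uni = tl.find(NS + 'TextEquiv/' + NS + 'Unicode')
        if uni is not None and uni.text and uni.text.strip():
            yield tl.get('id'), uni.text
```

A line like tl_6 above, whose `<Unicode/>` element is empty, would then never produce a .gt.txt/.png pair in the first place.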

Rotated Characters, Typesetting Errors

Just for the records.

There are some characters rotated in steps of 90 degrees. This happens often with n/u: in low-quality binarised images and in Fraktur, the difference between n/u and their 180-degree rotations is seldom visible. I give spelling precedence.

Also, in low image quality and in Fraktur, R/K, B/V and M/W look very similar.

Sometimes it seems intended by the typesetter to use a long s turned 180 degrees as a separator in Hungarian phone numbers. I transcribe these as '|'.

This one is funny (see the attached screenshot from 2020-07-15).
