
ub-mannheim / reichsanzeiger-gt


Ground truth for German newspaper "Deutscher Reichsanzeiger und Preußischer Staatsanzeiger" (1819–1945)

License: Creative Commons Zero v1.0 Universal

Shell 100.00%
fraktur ground-truth latin-language newspaper ocr ocr-d-level-2


reichsanzeiger-gt

DOI 10.5281/zenodo.10144428

Ground truth for German newspaper "Deutscher Reichsanzeiger und Preußischer Staatsanzeiger" (German Imperial Gazette and Prussian Official Gazette), which was published under changing names from 1819 to 1945 (https://digi.bib.uni-mannheim.de/periodika/reichsanzeiger/ausgaben).

The ground truth is provided as PAGE XML files together with URLs for the corresponding newspaper scans. Use the provided Bash script to download the images.

Images:

The images can be downloaded with the provided script:

./download_images.sh

Quantity:

  • 197 single newspaper pages
  • 119 429 ground truth lines

Period:

1820–1939

Font / Writing class:

Fraktur, Latin

Languages:

German, English, French, Portuguese, Italian, Latin

Transcription guidelines:

All transcriptions were created using Transkribus. The transcription rules are based on the OCR-D transcription guidelines Level 2 with some exceptions (see below):

Special characters:

  • Long s (ſ)
  • Currency symbols: German Mark (ℳ) and Pfennig (₰), $, £
  • Fractions (¼ ½ ¾ ⅐ ⅑ ⅒ ⅓ ⅔ ⅕ ⅖ ⅗ ⅘ ⅙ ⅚ ⅛ ⅜ ⅝ ⅞)
  • Fraction slash (⁄, U+2044), if
    • the fraction cannot be transcribed with a Unicode fraction character
    • numerator and denominator are not on the same baseline
  • R rotunda (ꝛ)
  • Combining Latin Small Letter E for old German Umlaut ( ͤ )
  • Dagger (†)
  • Black Right Pointing Index (☛)
  • Black Left Pointing Index (☚)
  • White square (□)
  • Superscript Numbers 0-9 (⁰¹²³⁴⁵⁶⁷⁸⁹)

Normalizations:

  • Roman numerals Ⅰ Ⅴ Ⅹ Ⅼ Ⅽ Ⅾ Ⅿ --> I V X L C D M
  • Em dash (—) instead of En dash (–)
  • Asterisk (*) used for both standard asterisk (*) and tear-drop asterisk (✽)

Additional characters transcribed true to original (contrary to OCR-D Level 2):

  • Double oblique hyphen (⸗)

Funding

This revision is predominantly funded by the German Research Foundation (DFG).


Contributors

jkamlah, shigapov, stweil, tsmdt


Issues

PAGE XML contains `TextEquiv` with empty `Unicode`

The PAGE XML files contain many text regions whose TextEquiv has no text, and a few text lines whose TextEquiv has no text:

# Text regions without text.
% git grep "^                <Unicode></Unicode>" | wc -l
   17097
# Text lines without text:
% git grep "^                    <Unicode></Unicode>" | wc -l  
       6

If a region has text in its lines but an empty TextEquiv at region level, that text is lost when the PAGE XML file is converted to plain text with ocr-transform.
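A rough way to find the affected regions is to parse each PAGE XML file and compare region-level and line-level text. The sketch below uses only the Python standard library and assumes Python 3.8 or later for the `{*}` namespace wildcard; the function names are hypothetical and it has not been run against the repository's files.

```python
import xml.etree.ElementTree as ET


def own_text(element) -> str:
    """Return the stripped text of the element's direct TextEquiv/Unicode child."""
    node = element.find("{*}TextEquiv/{*}Unicode")
    return (node.text or "").strip() if node is not None else ""


def regions_losing_text(page_xml: str) -> list:
    """List ids of text regions with empty region text but non-empty line text."""
    root = ET.fromstring(page_xml)
    affected = []
    for region in root.findall(".//{*}TextRegion"):
        if own_text(region):
            continue  # region-level text is present, nothing is lost
        lines = region.findall("{*}TextLine")
        if any(own_text(line) for line in lines):
            affected.append(region.get("id"))
    return affected
```

Regions reported by such a check would need their region-level text filled from the line texts before a region-based conversion to plain text.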
