assets's Introduction

OCR-D/assets

Test data for testing specs and software in @OCR-D

assets's People

Contributors

bertsky, boenig, cneud, j-panzer, kba, n00blet, stweil, tboenig, wrznr

assets's Issues

Most/All workspaces in bag files don't validate

#!/bin/sh

set -e

cd "$(mktemp -d)"
virtualenv venv
. venv/bin/activate
pip install --pre ocrd

wget https://ocr-d-repo.scc.kit.edu/api/v1/dataresources/558280e0-c40a-49ae-81ab-679bc29567c3/data/gerstner_mechanik01_1831.zip
dtrx gerstner_mechanik01_1831.zip

cd gerstner_mechanik01_1831/data
ocrd workspace validate mets.xml

yields:

[...]
OSError: cannot identify image file 'OCR-D-GT-SEG-PAGE/OCR-D-GT-SEG-PAGE_0019.jpg'

This is due to METS specifying an image/jpeg type for the PAGE XML here. The validation process copies the file adding a .jpg extension and then validation breaks because it can't read it as an image.

I've seen this in some files, possibly all are affected.
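
To find out how many files are affected, something like the following could scan a workspace for entries that METS declares as image/* but that actually contain XML. This is only a sketch: the helper names looks_like_xml and mislabelled_images are made up for illustration, the check is a simple content sniff (first non-whitespace byte is "<"), and only local hrefs relative to the METS file are considered.

```python
import os
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def looks_like_xml(path):
    """Heuristic: the file starts with '<' after an optional UTF-8 BOM and whitespace."""
    with open(path, "rb") as fh:
        head = fh.read(64)
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]
    return head.lstrip().startswith(b"<")


def mislabelled_images(mets_path):
    """Return (ID, href) of local files declared image/* in METS that contain XML."""
    base = os.path.dirname(os.path.abspath(mets_path))
    root = ET.parse(mets_path).getroot()
    hits = []
    for f in root.iter("{%s}file" % METS_NS):
        if not f.get("MIMETYPE", "").startswith("image/"):
            continue
        flocat = f.find("{%s}FLocat" % METS_NS)
        href = flocat.get(XLINK_HREF) if flocat is not None else None
        if not href or "://" in href:
            continue  # skip missing or remote locations
        path = os.path.join(base, href)
        if os.path.isfile(path) and looks_like_xml(path):
            hits.append((f.get("ID"), href))
    return hits
```

Running mislabelled_images("mets.xml") inside each extracted bag's data directory would list the suspicious entries without triggering the image loader.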

no deskewing/orientation in GT

I don't know how this is supposed to work at all. Usually the images need no deskewing, but when they do, that information is missing in PAGE. (I would at least expect some orientation angle in the text regions. Or is Baseline the place to look for this information?)

E.g. in weigel_gnothi02_1618, page phys_0001 needs to be rotated about -2.0 degrees (clockwise). The effect is also pronounced in the GT annotation itself: it contains coordinates that effectively chop off parts of the glyphs in some corners, e.g. region TextRegion_1479403414297_29 line tl_1 (chopped "V"), region TextRegion_1488379719413_342 line tl_22 (chopped "durch ſein") and region TextRegion_1488379733255_361 (chopped "ſein").

multiple issues in kant_aufklaerung_1784-page-block-line-word_glyph

  1. there are filegroups referencing invalid URLs:
ocrd workspace clone -a data/kant_aufklaerung_1784-page-block-line-word_glyph/data/mets.xml
Exception: Not found: https://github.com/OCR-D/assets/raw/master/data/kant_aufklaerung_1784/OCR-D-GT-PAGE/PAGE_0017_PAGE (HTTP 404)

(it seems that data/ is missing in the path)

  1. the two filegroups OCR-D-GT-SEG-WORD and OCR-D-GT-SEG-WORD-GLYPH use the same directory and files

  2. glyph coordinates are not rectangles or rough polygons but detailed envelope paths like 135,374 135,375 137,375 137,376 156,376 156,377 158,377 158,378 159,378 159,379 160,379 160,380 161,380 161,381 162,381 162,386 164,386 164,398 165,398 165,400 166,400 166,401 167,401 167,404 168,404 168,410 167,410 167,414 166,414 166,415 165,415 165,417 164,417 164,419 163,419 163,420 162,420 162,421 161,421 161,422 160,422 160,423 159,423 159,424 157,424 157,425 155,425 155,426 154,426 154,427 151,427 151,428 147,428 147,429 142,429 142,430 139,430 139,429 137,429 137,428 135,428 135,427 134,427 134,426 116,426 116,425 115,425 115,420 114,420 115,420 115,404 116,404 116,382 117,382 117,379 118,379 118,378 120,378 120,377 121,377 121,376 123,376 123,375 124,375 124,374 – is this correct? (I am asking because I am still considering using coordinates for alignment, which would become impractical with such data.)

  3. non-standard or non-normalised characters like U+F502 (from the private use area, instead of the letter c) or U+E644 (also from the private use area, instead of the letter ö or a combination)

  4. the README.md still mentions the deleted page-with-glyphs.xml but not the new file
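
Regarding the detailed glyph envelopes: if alignment only needs rough boxes, a consumer could reduce any points string to its bounding rectangle. A minimal sketch, assuming the usual "x,y x,y ..." points syntax with integer coordinates (the function names are hypothetical):

```python
def parse_points(points):
    """Parse a points string 'x1,y1 x2,y2 ...' into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def bounding_box(points):
    """Return (x_min, y_min, x_max, y_max) of a points string."""
    pts = parse_points(points)
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    return min(xs), min(ys), max(xs), max(ys)


def box_as_points(box):
    """Serialize a box as a 4-point polygon, clockwise from top-left."""
    x0, y0, x1, y1 = box
    return "%d,%d %d,%d %d,%d %d,%d" % (x0, y0, x1, y0, x1, y1, x0, y1)
```

This loses the envelope detail, so it only helps where a rectangle is good enough for matching.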

missing words in GT

Probably just a singular error:

In weigel_gnothi02_1618, on page phys_0001 region TextRegion_1488379719413_342 line tl_7, the first 3 words ("auf den Erſten") are missing from the annotation (both as Word elements and in the TextEquiv of the TextLine).

Inconsistently sorted coordinate points for certain words in kant_aufklaerung sample

E.g. all Word with ID word_* in https://github.com/OCR-D/assets/blob/master/data/kant_aufklaerung_1784/page/kant_aufklaerung_1784_0020.xml, such as https://github.com/OCR-D/assets/blob/master/data/kant_aufklaerung_1784/page/kant_aufklaerung_1784_0020.xml#L71

Everywhere else, coordinates are sorted clockwise starting with the top-left point, but these coordinates start with the bottom-right point.

Can this be fixed upstream? If not, we could adapt the coordinate translation utilities in core.
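
If it ends up being handled on our side, the normalization could look roughly like this. It is a sketch, not what core does: it assumes image coordinates (y grows downwards), treats the point with the smallest x+y as "top-left", and the name normalize_polygon is made up.

```python
def normalize_polygon(points):
    """Reorder polygon points to run clockwise (as seen on screen), starting at top-left.

    points: list of (x, y) tuples in image coordinates (y grows downwards).
    """
    pts = list(points)
    n = len(pts)
    # Shoelace sum: positive means clockwise on screen when y points down.
    area2 = sum(
        pts[i][0] * pts[(i + 1) % n][1] - pts[(i + 1) % n][0] * pts[i][1]
        for i in range(n)
    )
    if area2 < 0:
        pts.reverse()
    # Top-left heuristic: smallest x + y, ties broken by smaller y.
    start = min(range(n), key=lambda i: (pts[i][0] + pts[i][1], pts[i][1]))
    return pts[start:] + pts[:start]
```

For near-rectangular word boxes the x+y heuristic should be safe; for strongly skewed polygons it may pick a different corner.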

@tboenig @bertsky

compression artifacts in GT

Another report on GT issues (not assets):

In …

…images show clear signs of JPEG compression, with notable artifacts around sharp contrast like graphemes. ImageMagick identifies them as TIFF with 200 PPI (or 72 PPI or no resolution tag at all), without compression, and without any crs or exif tags and with very few tiff tags (e.g. no software or artist).

(In contrast, "good" images in other workspaces are identified as TIFF with 300 PPI without compression with full aux, crs, xmp, exif and tiff tags, which list the camera model, exposure settings, the true date stamp – somewhere in 2011 – and that it was created with Adobe Photoshop Lightroom. Sometimes, they are also TIFF with 300 PPI without compression without those tags but listing IrfanView or PROView or OmniScan or multidotscan as creator software.)

I found this because I had trouble binarizing such images: I would always get too many (un)connected components, regardless of threshold settings.

@tboenig I'd say this is the most urgent issue so far.

Border instead of PrintSpace in GT

The current GT bags in the repo all use PrintSpace to annotate page-level cropping (including marginals and page numbers). But according to the PAGE standard, Border should be used for that.

Please re-export, so that PAGE tools dealing with coordinates do not have to be bent to accommodate this deviation.

provide TableRegion/Grid examples

The PAGE-XML specification contains means to describe the inner structure of a TableRegion (i.e. as a coordinate matrix via Grid/GridPoints/@points).

However, so far there is no single document among assets, structural GT (1000pages / current repo) and text GT (dta / old repo) with an instance of this. (There's also no example in the PAGE-XML specification repo.)

We need these kinds of data both for making our processors table-capable and as training/evaluation data for layout analysis.

scribo-test: invert binarized images

The test images for binarization from scribo should be inverted (so they show positive). With ImageMagick installed, a simple convert filename -negate filename does the trick.

Self-contained make "update-bagit" target

Currently, make update-bagit depends on this zsh script:

#!/bin/zsh
sha512sum data/**/*(.) >! manifest-sha512.txt 
sha512sum manifest-sha512.txt bagit.txt bag-info.txt >! tagmanifest-sha512.txt
file_no=$(du -bs data/**/*(.)|wc -l)
oxum=$(du -bs data/**/*(.)|awk '{s+=$1} END {print s}')
sed -i "s/^ *Payload-Oxum:.*/Payload-Oxum: $oxum.$file_no/" bag-info.txt

I shall port this to bash and include it in the makefile.

And possibly expose via ocrd zip update-baginfo.
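
For comparison, here is roughly what the same steps look like in Python using only the standard library. This is not the bash port mentioned above, just a sketch; update_bag and its helpers are hypothetical names. One deliberate difference: it rewrites bag-info.txt before hashing it, so the tag manifest matches the updated Payload-Oxum. Unlike the zsh glob, os.walk also picks up dot-files.

```python
import hashlib
import os
import re


def sha512_of(path):
    """Hex SHA-512 digest of a file, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(bag_dir, manifest_name, rel_paths):
    """Write '<digest>  <path>' lines for the given bag-relative paths."""
    with open(os.path.join(bag_dir, manifest_name), "w") as out:
        for rel in rel_paths:
            out.write("%s  %s\n" % (sha512_of(os.path.join(bag_dir, rel)), rel))


def update_bag(bag_dir):
    """Regenerate both manifests and the Payload-Oxum line; return the Oxum."""
    payload = []
    for root, _dirs, files in os.walk(os.path.join(bag_dir, "data")):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), bag_dir)
            payload.append(rel.replace(os.sep, "/"))
    payload.sort()
    write_manifest(bag_dir, "manifest-sha512.txt", payload)

    # Payload-Oxum is "<total bytes>.<number of files>".
    octets = sum(os.path.getsize(os.path.join(bag_dir, rel)) for rel in payload)
    oxum = "%d.%d" % (octets, len(payload))
    info_path = os.path.join(bag_dir, "bag-info.txt")
    with open(info_path) as fh:
        info = fh.read()
    info = re.sub(r"(?m)^ *Payload-Oxum:.*$", "Payload-Oxum: " + oxum, info)
    with open(info_path, "w") as fh:
        fh.write(info)

    # Hash the tag files last, after bag-info.txt has been rewritten.
    write_manifest(bag_dir, "tagmanifest-sha512.txt",
                   ["manifest-sha512.txt", "bagit.txt", "bag-info.txt"])
    return oxum
```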

word segmentation in kant_aufklaerung_1784 GT PageXML

Is it correct for assets/data/kant_aufklaerung_1784/page/kant_aufklaerung_1784_0020.xml to have punctuation characters as separate Word elements, even if they are written adjacent to other words (i.e. without whitespace)?

For example, in the first line tl_2 of region r_2_1, the word word_1478541900479_904 depicts a single semicolon token. But the line reads

gewiegelt worden; ſo ſchaͤdlich iſt es Vorurtheile zu

so it should be part of the second Word (i.e. worden;).

In principle, punctuation characters can occur

  • at the start of a token (e.g. opening parenthesis or quotation marks etc.)
  • at the end of a token (e.g. closing parenthesis or quotation marks, comma, colon, semicolon, sentence punctuation, hyphen etc.)
  • as an isolated token (e.g. dash)

And of course, these might be combined, as in

„die Urſach jener intereſſanten Erſcheinung ſeyn ſollte?“ —

or

sind ganz entzückt über diesen glänzenden (!) Sieg der gerechten Sache

or

„Wir alten Republikaner“, sagt Guinard, „die seit 30 Jahren über die Republik wachten, wir werden doch einem Leon Faucher zur Seite noch ferner darüber wachen dürfen.“

So again (see #12) there is no way to reproduce the TextLine content from Word level annotation, except using coordinate heuristics.

Wouldn't it be better to have a strict rule to only segment at whitespace? (This is what segmentation using OCR-D/ocrd_tesserocr does now.)
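
To make the proposed rule concrete: with whitespace-only segmentation, joining the Word texts with single spaces has to give back the TextLine text. A small sketch of that check (words_reproduce_line is a made-up helper, not part of any OCR-D module):

```python
def whitespace_tokens(line_text):
    """Segment a line strictly at whitespace; punctuation stays attached."""
    return line_text.split()


def words_reproduce_line(line_text, word_texts):
    """True if the Word texts, joined by single spaces, equal the line text
    (after collapsing runs of whitespace in the line)."""
    return " ".join(word_texts) == " ".join(line_text.split())


line = "gewiegelt worden; ſo ſchaͤdlich iſt es Vorurtheile zu"
tokens = whitespace_tokens(line)          # second token is "worden;"
separate = ["gewiegelt", "worden", ";"] + tokens[2:]  # semicolon as its own Word
```

Here words_reproduce_line(line, tokens) holds, while the variant with the semicolon as a separate Word fails the check.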

@tboenig @kba @finkf

Change the file name in DFKI test data

Could you please change the file name to match the METS file?
The href in mets points to "becker_quaestio_1586_00013.tif" but the physical file name in OCR-D-IMG is different (tiff.tif).
It should be a simple filename change, thank you.

Repository not usable on case insensitive filesystems (like macOS and Windows)

Case insensitive filesystems like those used on macOS and Windows by default have problems with these files:

modified:   data/scribo-test/data/OCR-D-SEG-PAGE-kim/OCR-D-SEG-PAGE-kim-orig_tiff.xml
modified:   data/scribo-test/data/OCR-D-SEG-PAGE-niblack/OCR-D-SEG-PAGE-niblack-orig_tiff.xml
modified:   data/scribo-test/data/OCR-D-SEG-PAGE-otsu/OCR-D-SEG-PAGE-otsu-orig_tiff.xml
modified:   data/scribo-test/data/OCR-D-SEG-PAGE-sauvola-ms-fg/OCR-D-SEG-PAGE-sauvola-ms-fg-orig_tiff.xml
modified:   data/scribo-test/data/OCR-D-SEG-PAGE-sauvola-ms-split/OCR-D-SEG-PAGE-sauvola-ms-split-orig_tiff.xml
modified:   data/scribo-test/data/OCR-D-SEG-PAGE-sauvola-ms/OCR-D-SEG-PAGE-sauvola-ms-orig_tiff.xml
modified:   data/scribo-test/data/OCR-D-SEG-PAGE-sauvola/OCR-D-SEG-PAGE-sauvola-orig_tiff.xml
modified:   data/scribo-test/data/OCR-D-SEG-PAGE-singh/OCR-D-SEG-PAGE-singh-orig_tiff.xml
modified:   data/scribo-test/data/OCR-D-SEG-PAGE-wolf/OCR-D-SEG-PAGE-wolf-orig_tiff.xml

This is caused by directory names such as OCR-D-SEG-PAGE-kim and OCR-D-SEG-PAGE-KIM, which differ only in case.
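
A check along these lines could catch such collisions before they are committed. It is only a sketch: case_collisions is a hypothetical helper that takes repository-relative paths with forward slashes (for example, the output of git ls-files) and reports every group of paths or parent directories that differ only in case.

```python
from collections import defaultdict


def case_collisions(paths):
    """Return sorted groups of paths/directories that are equal when case-folded."""
    groups = defaultdict(set)
    for path in paths:
        parts = path.split("/")
        # Register every prefix so colliding directories are found, not just files.
        for i in range(1, len(parts) + 1):
            prefix = "/".join(parts[:i])
            groups[prefix.casefold()].add(prefix)
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)
```

An empty result means the tree should check out cleanly on a case-insensitive filesystem.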

language and fontFamily in GT

Please correct me: I believe Word/@language and TextStyle/@fontFamily should be either used properly or not at all in the GT.

But I often see them wrong (with a tendency towards blackletter and German). E.g. in weigel_gnothi02_1618, word word_1479403541433_37 (which is clearly Greek) has language=German / fontFamily=antiqua, words w_w1aab1b1b2b3b1b1b7, w_w1aab1b1b2b3b1b1c23 and w_w1aab1b1b2b7b3ac21 (which are Latin) have language=German, and words w_w1aab1b1b2b7b9ac33, word_1488831895760_120 and word_1488831812931_118 (which are Latin antiqua) have language=German / fontFamily=blackletter.

Also, I believe that on the TextLine and TextRegion level, the TextStyle/@fontFamily should always be a (comma-separated) list of all the values on the lower levels.

pseudo TextLine in GT

I don't know if this is the right place to report errors in GT (not assets).

Also, I am not sure if this is a systematic error or a singular phenomenon.

In euler_rechenkunst01_1738, page phys_0005 region TextRegion_1475759982805_45, the first line line_1475759982883_47 is bogus: its y coordinates extend only 2 pixels, and its TextEquiv is empty.
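
To see whether this is systematic, one could flag every TextLine that is both nearly flat and empty. A rough sketch, where the 5-pixel threshold and the function name are arbitrary assumptions:

```python
def is_degenerate_line(points, text, min_height=5):
    """True if the line's polygon is less than min_height pixels tall
    and its text is empty or whitespace-only.

    points: a points string 'x1,y1 x2,y2 ...'; text: the TextEquiv content.
    """
    ys = [int(pair.split(",")[1]) for pair in points.split()]
    return (max(ys) - min(ys)) < min_height and not text.strip()
```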

Update scribo-tests with correct `k` parameters for sauvola-ms-fg

@bertsky in OCR-D/ocrd_olena#42 (comment)

The CI failure speaks of the need to update our reference data as well: data/scribo-test/data/OCR-D-IMG-BIN-SAUVOLA-MS-FG/OCR-D-SEG-PAGE-SAUVOLA-MS-FG-orig_tiff-BIN_sauvola-ms-fg.png should now be generated with --all-k 0.34 in effect (instead of the previous result which silently used --k2 0.2 --k3 0.3 --k4 0.5).

(I know: this does decrease the quality for this algorithm even further. But this is not a good time to discuss a good way to wrap the different parameterizations. If we want to have control over k for all impl, we must accept that the k2/k3/k4 difference will disappear.)

Lots of XSD validation errors

Found thanks to OCR-D/core#470:

<report valid="false">
  <error>assets/data/page_dewarp/data/mets.xml: Line 3: Element '{http://www.loc.gov/METS/}metsHdr', attribute 'CREATEDATE': '2018-11-14 18:22:49.558544' is not a valid value of the atomic type 'xs:dateTime'.</error>
  <error>assets/data/page_dewarp/data/mets.xml: Line 22: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
  <error>assets/data/page_dewarp/data/mets.xml: Line 25: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
  <error>assets/data/page_dewarp/data/mets.xml: Line 28: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
  <error>assets/data/page_dewarp/data/mets.xml: Line 31: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
</report>
<report valid="false">
  <error>assets/data/leptonica_samples/data/mets.xml: Line 3: Element '{http://www.loc.gov/METS/}metsHdr', attribute 'CREATEDATE': '2018-11-14 18:14:27.999250' is not a valid value of the atomic type 'xs:dateTime'.</error>
  <error>assets/data/leptonica_samples/data/mets.xml: Line 22: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
  <error>assets/data/leptonica_samples/data/mets.xml: Line 25: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
</report>
<report valid="false">
  <error>assets/data/column-samples/data/mets.xml: Line 39: Element '{http://www.loc.gov/METS/}fileSec': This element is not expected. Expected is one of ( {http://www.loc.gov/METS/}structMap, {http://www.loc.gov/METS/}structLink, {http://www.loc.gov/METS/}behaviorSec ).</error>
</report>
<report valid="false">
  <error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 3: Element '{http://www.loc.gov/METS/}metsHdr', attribute 'CREATEDATE': '2018-11-14 16:44:18.171353' is not a valid value of the atomic type 'xs:dateTime'.</error>
  <error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 22: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
  <error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 25: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
  <error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 28: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
  <error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 31: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
  <error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 34: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
  <error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 37: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
  <error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 40: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
  <error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 43: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
  <error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 48: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
  <error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 51: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
  <error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 54: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
  <error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 57: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
  <error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 60: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
  <error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 63: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
  <error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 66: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
  <error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 69: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
</report>
<report valid="false">
  <error>assets/data/grenzboten-test/data/mets.xml: Line 3: Element '{http://www.loc.gov/METS/}metsHdr', attribute 'CREATEDATE': '2019-08-07 17:52:26.109166' is not a valid value of the atomic type 'xs:dateTime'.</error>
  <error>assets/data/grenzboten-test/data/mets.xml: Line 15: Element '{http://www.loc.gov/METS/}dmdSec': This element is not expected. Expected is one of ( {http://www.loc.gov/METS/}structMap, {http://www.loc.gov/METS/}structLink, {http://www.loc.gov/METS/}behaviorSec ).</error>
</report>
<report valid="false">
  <error>FILE_0001_FULLTEXT: Line 2: Element '{http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15}PcGts', attribute 'pcGtsId': '00000001' is not a valid value of the atomic type 'xs:ID'.</error>
  <error>FILE_0001_FULLTEXT: Line 6: Element '{http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15}Page': This element is not expected. Expected is ( {http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15}Metadata ).</error>
  <error>FILE_0002_FULLTEXT: Line 2: Element '{http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15}PcGts', attribute 'pcGtsId': '00000002' is not a valid value of the atomic type 'xs:ID'.</error>
  <error>FILE_0002_FULLTEXT: Line 6: Element '{http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15}Page': This element is not expected. Expected is ( {http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15}Metadata ).</error>
</report>
<report valid="false">
  <error>assets/data/kant_aufklaerung_1784-binarized/data/mets.xml: Line 59: Element '{http://www.loc.gov/METS/}div', attribute 'ID': 'P_0017' is not a valid value of the atomic type 'xs:ID'.</error>
  <error>assets/data/kant_aufklaerung_1784-binarized/data/mets.xml: Line 66: Element '{http://www.loc.gov/METS/}div', attribute 'ID': 'P_0020' is not a valid value of the atomic type 'xs:ID'.</error>
</report>
<report valid="false">
  <error>assets/data/scribo-test/data/mets.xml: Line 33: Element '{http://www.loc.gov/METS/}dmdSec': This element is not expected. Expected is one of ( {http://www.loc.gov/METS/}structMap, {http://www.loc.gov/METS/}structLink, {http://www.loc.gov/METS/}behaviorSec ).</error>
</report>
<report valid="false">
  <error>assets/data/communist_manifesto/data/mets.xml: Line 3: Element '{http://www.loc.gov/METS/}metsHdr', attribute 'CREATEDATE': '2019-03-24 22:16:26.006316' is not a valid value of the atomic type 'xs:dateTime'.</error>
  <error>assets/data/communist_manifesto/data/mets.xml: Line 15: Element '{http://www.loc.gov/METS/}dmdSec': This element is not expected. Expected is one of ( {http://www.loc.gov/METS/}structMap, {http://www.loc.gov/METS/}structLink, {http://www.loc.gov/METS/}behaviorSec ).</error>
</report>
<report valid="false">
  <error>assets/data/dfki-testdata/data/mets.xml: Line 3: Element '{http://www.loc.gov/METS/}metsHdr', attribute 'CREATEDATE': '2018-11-22 10:31:05.897472' is not a valid value of the atomic type 'xs:dateTime'.</error>
  <error>assets/data/dfki-testdata/data/mets.xml: Line 32: Element '{http://www.loc.gov/METS/}fileSec': This element is not expected. Expected is one of ( {http://www.loc.gov/METS/}structMap, {http://www.loc.gov/METS/}structLink, {http://www.loc.gov/METS/}behaviorSec ).</error>
</report>
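
The CREATEDATE errors all have the same shape: a space instead of the "T" that xs:dateTime requires between date and time. The space-separated form is what str() of a Python datetime produces, though whether that is how these values were written is a guess. A minimal sketch of the conversion (to_xs_datetime is a made-up name):

```python
from datetime import datetime


def to_xs_datetime(value):
    """Convert 'YYYY-MM-DD HH:MM:SS.ffffff' into the ISO 8601 form with 'T'."""
    return datetime.fromisoformat(value).isoformat()


fixed = to_xs_datetime("2018-11-14 18:22:49.558544")
# fixed == "2018-11-14T18:22:49.558544"
```

The missing LOCTYPE attributes and the misplaced dmdSec/fileSec elements need separate fixes in the METS files themselves.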

Invalid xml file in dorn_uppedat_1507.zip

The file page/dorn_uppedat_1507_00004.xml in the ground-truth zip file is invalid.
It is just a missing < in the metadata at Creator:

<Metadata>
        <Creator>OCR-D/Creator>
        <Created>2016-12-07T15:18:07.272+01:00</Created>
        <LastChange>2017-03-07T18:04:10.221+01:00</LastChange>
      
    </Metadata>
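
Errors like this can be caught with a plain well-formedness check over all PAGE files before publishing. A minimal sketch using the standard library parser (well_formed is a hypothetical helper, and the strings are shortened versions of the snippet above):

```python
import xml.etree.ElementTree as ET


def well_formed(xml_text):
    """True if the text parses as XML, False on any parse error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


broken = "<Metadata><Creator>OCR-D/Creator><Created>2016-12-07T15:18:07.272+01:00</Created></Metadata>"
repaired = broken.replace("OCR-D/Creator>", "OCR-D</Creator>")
```

Here well_formed(broken) is False and well_formed(repaired) is True.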

ordering of Word elements in page-with-glyphs.xml

Is it correct for assets/data/page-with-glyphs.xml to have its Word elements misordered w.r.t. the linear reading order as seen by TextLine?

For example, in the first line N66290 of region r0,

Ich. Chriian Edlen von S  midt

the first word N72746 is actually the last element. This striking disorder repeats throughout this file. The only information to reproduce the TextLine content here is in the coordinates.

In principle, we could have:

  1. coordinates
  2. readingOrder indexing in the custom attribute
  3. XML ordering

What source can/must postcorrection rely on?
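
If coordinates turn out to be the only reliable source, the fallback for a single line would be something like the sketch below. It assumes horizontal left-to-right text and simply sorts by each Word's left edge; the function names and the sample data are made up.

```python
def left_edge(points):
    """Smallest x value in a points string 'x1,y1 x2,y2 ...'."""
    return min(int(pair.split(",")[0]) for pair in points.split())


def words_in_reading_order(words):
    """words: list of (text, points) tuples from one TextLine, in XML order.
    Returns the texts sorted left to right by left edge."""
    return [text for text, _ in sorted(words, key=lambda w: left_edge(w[1]))]


# Hypothetical line whose first word comes last in XML order.
xml_order = [
    ("zweites", "120,10 200,10 200,40 120,40"),
    ("drittes", "210,10 300,10 300,40 210,40"),
    ("erstes", "10,10 110,10 110,40 10,40"),
]
```

This breaks down for vertical or right-to-left text, which is one reason an explicit ordering rule would be preferable.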

@tboenig

missing drop-capital in GT

Again, I do not know if this is systematic:

In weigel_gnothi02_1618, on page phys_0001 region TextRegion_1488379719413_342, a drop-capital is missing in the annotation, i.e. it became part of the adjacent paragraph region. Worse, its (larger) line height spilled over into tl_4, the first TextLine of the region, so it has a height of 369 pixels and overlaps tl_5 through tl_9.

invalid TIFF tags in GT

Again, a report on GT issues (not assets):

In ...

...images have badly formatted tags, which cause ImageMagick to issue warnings:

identify estor_rechtsgelehrsamkeit02_1758.ocrd/data/OCR-D-IMG/OCR-D-IMG_0001 
estor_rechtsgelehrsamkeit02_1758.ocrd/data/OCR-D-IMG/OCR-D-IMG_0001 TIFF 1489x2526 1489x2526+0+0 8-bit sRGB 11.95MB 0.000u 0:00.000
identify-im6.q16: Incorrect value for "Photoshop"; tag ignored. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/912.
identify praetorius_verrichtung_1668.ocrd/data/OCR-D-IMG/OCR-D-IMG_0001 
praetorius_verrichtung_1668.ocrd/data/OCR-D-IMG/OCR-D-IMG_0001 TIFF 1240x1948 1240x1948+0+0 8-bit sRGB 7.247MB 0.000u 0:00.000
identify-im6.q16: Incorrect value for "Photoshop"; tag ignored. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/912.
identify justi_abhandlung01_1758.ocrd/data/OCR-D-IMG/OCR-D-IMG_0001 
justi_abhandlung01_1758.ocrd/data/OCR-D-IMG/OCR-D-IMG_0001 TIFF 1542x2386 1542x2386+0+0 8-bit sRGB 11.35MB 0.000u 0:00.010
identify-im6.q16: ASCII value for tag "DocumentName" does not end in null byte. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/912.
identify-im6.q16: ASCII value for tag "ImageDescription" does not end in null byte. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/912.
identify-im6.q16: ASCII value for tag "Make" does not end in null byte. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/912.
identify-im6.q16: ASCII value for tag "PageName" does not end in null byte. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/912.
identify-im6.q16: ASCII value for tag "Software" does not end in null byte. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/912.
identify-im6.q16: ASCII value for tag "Artist" does not end in null byte. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/912.

See OCR-D/ocrd_olena#4 – this subsequently causes OLENA binarization to fail with a segfault!

Of course, our components should be robust against such problems. But given that OLENA is not maintained and OCR-D module projects are in the middle of the final sprint, and you are going to re-publish GT bags anyway because of the other issues: Can you please fix this in the GT repo now?
