ocr-d / assets
Test data for testing specs and software in @OCR-D
The PAGE-XML specification contains means to describe the inner structure of a TableRegion (i.e. as a coordinate matrix via Grid/GridPoints/@points). However, so far there is no single document among assets, structural GT (1000pages / current repo) and text GT (dta / old repo) with an instance of this. (There's also no example in the PAGE-XML specification repo.)
We need these kinds of data both for making our processors table-capable and as training/evaluation data for layout analysis.
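To illustrate what such data would look like, here is a hypothetical TableRegion with an inner Grid, plus a sketch of parsing the GridPoints coordinate matrix. The fragment is invented (no real instance exists yet, as noted above); element and attribute names follow the PAGE-XML schema.

```python
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"

# Invented example: a 2x2-cell table described by a 3x3 grid of intersections.
FRAGMENT = f"""
<TableRegion xmlns="{PAGE_NS}" id="table_1">
  <Coords points="100,100 500,100 500,300 100,300"/>
  <Grid>
    <GridPoints index="0" points="100,100 300,100 500,100"/>
    <GridPoints index="1" points="100,200 300,200 500,200"/>
    <GridPoints index="2" points="100,300 300,300 500,300"/>
  </Grid>
</TableRegion>"""

def parse_grid(xml_text):
    """Return the grid as a list of rows, each row a list of (x, y) tuples."""
    root = ET.fromstring(xml_text)
    rows = []
    for gp in root.iter(f"{{{PAGE_NS}}}GridPoints"):
        row = [tuple(map(int, pt.split(","))) for pt in gp.get("points").split()]
        rows.append(row)
    return rows

grid = parse_grid(FRAGMENT)
```

Each row of GridPoints holds the horizontal intersections of one grid line, so a table with m×n cells needs (m+1)×(n+1) points.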
They are annotated in great detail on the first page but not on the following pages, which makes them very difficult to use as GT.
The file page/dorn_uppedat_1507_00004.xml in the ground-truth zip file is invalid. It is just a missing < in the metadata at Creator:
<Metadata>
<Creator>OCR-D/Creator>
<Created>2016-12-07T15:18:07.272+01:00</Created>
<LastChange>2017-03-07T18:04:10.221+01:00</LastChange>
</Metadata>
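A minimal well-formedness check with the Python standard library would catch this kind of error automatically before the files are published (the broken fragment below is copied from above):

```python
import xml.etree.ElementTree as ET

BROKEN = """<Metadata>
<Creator>OCR-D/Creator>
<Created>2016-12-07T15:18:07.272+01:00</Created>
<LastChange>2017-03-07T18:04:10.221+01:00</LastChange>
</Metadata>"""

def is_well_formed(xml_text):
    """True iff the text parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```

With the missing < restored (i.e. `</Creator>`), the same check passes.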
I don't know how this is supposed to work at all. Usually the images need no deskewing, but when they do, that information is missing in PAGE. (I would at least expect some orientation angle in the text regions. Or is Baseline the place to look for this information?)
E.g. in weigel_gnothi02_1618, page phys_0001 needs to be rotated by about -2.0 degrees (clockwise). The effect is also pronounced in the GT annotation itself: it contains coordinates that effectively chop off parts of the glyphs in some corners, e.g. region TextRegion_1479403414297_29 line tl_1 (chopped "V"), region TextRegion_1488379719413_342 line tl_22 (chopped "durch ſein") and region TextRegion_1488379733255_361 (chopped "ſein").
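Absent an orientation angle in PAGE, consumers are left to deskew heuristically. As an illustration (all numbers are invented, and the sign convention depends on the image coordinate system with its downward y axis), the polygon coordinates themselves could be corrected by rotating them about the page centre:

```python
import math

def rotate_points(points, angle_deg, cx, cy):
    """Rotate (x, y) points by angle_deg about the centre (cx, cy)."""
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    out = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        out.append((cx + dx * cos_a - dy * sin_a,
                    cy + dx * sin_a + dy * cos_a))
    return out

# A point at the rotation centre stays fixed; all others move slightly.
deskewed = rotate_points([(1000, 1500), (100, 100)], 2.0, 1000, 1500)
```

The same rotation would of course have to be applied to the image (or its derived coordinates) consistently, which is exactly why the angle should be annotated in the first place.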
The footnote is separated in two text regions where the second one contains a table. Either use a container region for the whole footnote (containing smaller regions) or structure the footnote completely (thereby avoiding the embedding of the table which makes no sense).
Region r0 on page 312 is actually a separator, not a TextRegion.
The lines in the header of the pages have to be annotated somehow.
#!/bin/sh
set -e
cd "$(mktemp -d)"
virtualenv venv
. venv/bin/activate
pip install --pre ocrd
wget https://ocr-d-repo.scc.kit.edu/api/v1/dataresources/558280e0-c40a-49ae-81ab-679bc29567c3/data/gerstner_mechanik01_1831.zip
dtrx gerstner_mechanik01_1831.zip
cd gerstner_mechanik01_1831/data
ocrd workspace validate mets.xml
yields:
[...]
OSError: cannot identify image file 'OCR-D-GT-SEG-PAGE/OCR-D-GT-SEG-PAGE_0019.jpg'
This is due to METS specifying an image/jpeg MIME type for the PAGE-XML file here. The validation process copies the file, adding a .jpg extension, and then validation breaks because it can't read it as an image.
I've seen this in some files, possibly all are affected.
Files referenced in the METS and stored in the data directory should have file extensions to make life easier for PAGE Viewer etc.
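A cheap sanity check that would have caught this, sketched with an invented helper: sniff the actual file content and compare it against the MIMETYPE declared in METS. PAGE-XML files start with an XML declaration, not a JPEG magic number.

```python
def sniff_mimetype(data: bytes):
    """Very rough content sniffing for the file types involved here."""
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.lstrip().startswith(b"<?xml") or data.lstrip().startswith(b"<"):
        return "application/xml"
    return None

# A PAGE-XML file wrongly declared as image/jpeg in METS:
declared = "image/jpeg"
actual = sniff_mimetype(b'<?xml version="1.0" encoding="UTF-8"?><PcGts/>')
mismatch = declared != actual
```

Running such a check over all FLocat entries would flag every affected file at once, rather than waiting for validation to trip over one.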
The current GT bags in the repo all use PrintSpace to annotate page-level cropping (including marginals and page numbers). But according to the PAGE standard, Border should be used for that. Please re-export, so that PAGE tools dealing with coordinates do not have to work around this deviation.
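If re-exporting takes a while, a mechanical rename is conceivable as a stopgap. This sketch assumes the annotated PrintSpace coordinates really denote the page Border, as the bags suggest; the sample Page fragment is invented.

```python
import xml.etree.ElementTree as ET

NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"

page_xml = f"""<Page xmlns="{NS}" imageWidth="1000" imageHeight="1500">
  <PrintSpace><Coords points="50,50 950,50 950,1450 50,1450"/></PrintSpace>
</Page>"""

root = ET.fromstring(page_xml)
# Rename every PrintSpace element to Border, keeping its Coords child intact.
for elem in root.iter(f"{{{NS}}}PrintSpace"):
    elem.tag = f"{{{NS}}}Border"

fixed = ET.tostring(root, encoding="unicode")
```

A real fix would still need to respect the PAGE schema's element ordering within Page, so re-exporting remains the cleaner solution.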
Is it correct for assets/data/kant_aufklaerung_1784/page/kant_aufklaerung_1784_0020.xml to have punctuation characters as separate Word elements, even if they are written adjacent to other words (i.e. without whitespace)? For example, in the first line tl_2 of region r_2_1, the word word_1478541900479_904 depicts a single semicolon token. But the line reads
gewiegelt worden; ſo ſchaͤdlich iſt es Vorurtheile zu
so it should be part of the second Word (i.e. worden;).
In principle, punctuation characters can occur adjacent to a word on either side.
And of course, these might be combined, as in
„die Urſach jener intereſſanten Erſcheinung ſeyn ſollte?“ —
or
sind ganz entzückt über diesen glänzenden (!) Sieg der gerechten Sache
or
„Wir alten Republikaner“, sagt Guinard, „die seit 30 Jahren über die Republik wachten, wir werden doch einem Leon Faucher zur Seite noch ferner darüber wachen dürfen.“
So again (see #12) there is no way to reproduce the TextLine content from Word-level annotation, except using coordinate heuristics.
Wouldn't it be better to have a strict rule to only segment at whitespace? (This is what segmentation using OCR-D/ocrd_tesserocr does now.)
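To make the mismatch concrete, here is a minimal round-trip check (the line text is quoted from above; the annotated word list is my reading of the GT):

```python
line_text = "gewiegelt worden; ſo ſchaͤdlich iſt es Vorurtheile zu"

# Words as annotated in the GT: the semicolon is its own Word element.
words_as_annotated = ["gewiegelt", "worden", ";", "ſo", "ſchaͤdlich",
                      "iſt", "es", "Vorurtheile", "zu"]

# Words under a strict segment-at-whitespace rule.
words_if_split_at_whitespace = line_text.split()

broken = " ".join(words_as_annotated)            # inserts a space before ";"
roundtrip = " ".join(words_if_split_at_whitespace)
```

Only the whitespace rule lets a consumer reconstruct the TextLine string by simple joining; the annotated segmentation requires knowing which tokens attach without a space.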
Is it correct for assets/data/page-with-glyphs.xml to have its Word elements misordered w.r.t. the linear reading order as seen by TextLine? For example, in the first line N66290 of region r0,
Ich. Chriian Edlen von S midt
the first word N72746 is actually the last element. This striking disorder repeats throughout this file. The only information to reproduce the TextLine content here is in the coordinates.
In principle, we could have readingOrder indexing in the custom attribute.
What source can/must post-correction rely on?
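For what it's worth, the coordinate heuristic could look like this minimal sketch (the suffixed IDs and all coordinates are invented; it also only works within a single left-to-right line):

```python
def leftmost_x(points_attr):
    """Smallest x in a PAGE @points string like '135,374 160,380 ...'."""
    return min(int(pt.split(",")[0]) for pt in points_attr.split())

# Words in document order as found in the file (misordered):
words = [
    ("N72746",  "midt",    "900,100 980,100 980,140 900,140"),
    ("N66290a", "Ich.",    "100,100 160,100 160,140 100,140"),
    ("N66290b", "Chriian", "200,100 380,100 380,140 200,140"),
]

# Restore reading order by sorting on the leftmost x coordinate.
in_reading_order = sorted(words, key=lambda w: leftmost_x(w[2]))
```

This breaks down for multi-column lines, rotated text, or RTL scripts, which is why explicit ordering in the annotation would be preferable.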
The file https://github.com/OCR-D/assets/blob/master/data/kant_aufklaerung_1784-page-block/data/OCR-D-GT-SEG-BLOCK/OCR-D-GT-SEG-BLOCK_0008.xml references an image file which does not exist in the bag.
Quote: If used, the GROUPID of the page MUST BE the ID of the file that represents the original image. In other words: For the file representing the original image, ID and GROUPID must be identical.
Probably just a singular error:
In weigel_gnothi02_1618, on page phys_0001, region TextRegion_1488379719413_342, line tl_7, the first 3 words ("auf den Erſten") are missing from the annotation (both as Word elements and in the TextEquiv of the TextLine).
In euler_rechenkunst01_1738 from the OCR-D structure+text GT, there is a mismatch between physical pages (and files) versus structLink references: The latter includes a (non-existent) phys_0000 but misses the (existing) phys_0006.
The test images for binarization from scribo should be inverted (so they show positive). With ImageMagick installed, a simple convert -negate filename does the trick.
Another report on GT issues (not assets):
In …
…images show clear signs of JPEG compression, with notable artifacts around sharp contrast like graphemes. ImageMagick identifies them as TIFF with 200 PPI (or 72 PPI or no resolution tag at all), without compression, without any crs or exif tags, and with very few tiff tags (e.g. no software or artist).
(In contrast, "good" images in other workspaces are identified as TIFF with 300 PPI, without compression, with full aux, crs, xmp, exif and tiff tags, which list the camera model, exposure settings, the true date stamp – somewhere in 2011 – and that it was created with Adobe Photoshop Lightroom. Sometimes, they are also TIFF with 300 PPI, without compression, without those tags, but listing IrfanView or PROView or OmniScan or multidotscan as creator software.)
I found this because I had trouble binarizing such images: I would always get too many (un)connected components, regardless of threshold settings.
@tboenig I'd say this is the most urgent issue so far.
Found thanks to OCR-D/core#470:
<report valid="false">
<error>assets/data/page_dewarp/data/mets.xml: Line 3: Element '{http://www.loc.gov/METS/}metsHdr', attribute 'CREATEDATE': '2018-11-14 18:22:49.558544' is not a valid value of the atomic type 'xs:dateTime'.</error>
<error>assets/data/page_dewarp/data/mets.xml: Line 22: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
<error>assets/data/page_dewarp/data/mets.xml: Line 25: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
<error>assets/data/page_dewarp/data/mets.xml: Line 28: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
<error>assets/data/page_dewarp/data/mets.xml: Line 31: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
</report>
<report valid="false">
<error>assets/data/leptonica_samples/data/mets.xml: Line 3: Element '{http://www.loc.gov/METS/}metsHdr', attribute 'CREATEDATE': '2018-11-14 18:14:27.999250' is not a valid value of the atomic type 'xs:dateTime'.</error>
<error>assets/data/leptonica_samples/data/mets.xml: Line 22: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
<error>assets/data/leptonica_samples/data/mets.xml: Line 25: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
</report>
<report valid="false">
<error>assets/data/column-samples/data/mets.xml: Line 39: Element '{http://www.loc.gov/METS/}fileSec': This element is not expected. Expected is one of ( {http://www.loc.gov/METS/}structMap, {http://www.loc.gov/METS/}structLink, {http://www.loc.gov/METS/}behaviorSec ).</error>
</report>
<report valid="false">
<error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 3: Element '{http://www.loc.gov/METS/}metsHdr', attribute 'CREATEDATE': '2018-11-14 16:44:18.171353' is not a valid value of the atomic type 'xs:dateTime'.</error>
<error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 22: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
<error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 25: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
<error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 28: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
<error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 31: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
<error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 34: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
<error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 37: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
<error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 40: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
<error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 43: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
<error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 48: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
<error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 51: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
<error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 54: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
<error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 57: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
<error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 60: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
<error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 63: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
<error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 66: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
<error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 69: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
</report>
<report valid="false">
<error>assets/data/grenzboten-test/data/mets.xml: Line 3: Element '{http://www.loc.gov/METS/}metsHdr', attribute 'CREATEDATE': '2019-08-07 17:52:26.109166' is not a valid value of the atomic type 'xs:dateTime'.</error>
<error>assets/data/grenzboten-test/data/mets.xml: Line 15: Element '{http://www.loc.gov/METS/}dmdSec': This element is not expected. Expected is one of ( {http://www.loc.gov/METS/}structMap, {http://www.loc.gov/METS/}structLink, {http://www.loc.gov/METS/}behaviorSec ).</error>
</report>
<report valid="false">
<error>FILE_0001_FULLTEXT: Line 2: Element '{http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15}PcGts', attribute 'pcGtsId': '00000001' is not a valid value of the atomic type 'xs:ID'.</error>
<error>FILE_0001_FULLTEXT: Line 6: Element '{http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15}Page': This element is not expected. Expected is ( {http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15}Metadata ).</error>
<error>FILE_0002_FULLTEXT: Line 2: Element '{http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15}PcGts', attribute 'pcGtsId': '00000002' is not a valid value of the atomic type 'xs:ID'.</error>
<error>FILE_0002_FULLTEXT: Line 6: Element '{http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15}Page': This element is not expected. Expected is ( {http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15}Metadata ).</error>
</report>
<report valid="false">
<error>assets/data/kant_aufklaerung_1784-binarized/data/mets.xml: Line 59: Element '{http://www.loc.gov/METS/}div', attribute 'ID': 'P_0017' is not a valid value of the atomic type 'xs:ID'.</error>
<error>assets/data/kant_aufklaerung_1784-binarized/data/mets.xml: Line 66: Element '{http://www.loc.gov/METS/}div', attribute 'ID': 'P_0020' is not a valid value of the atomic type 'xs:ID'.</error>
</report>
<report valid="false">
<error>assets/data/scribo-test/data/mets.xml: Line 33: Element '{http://www.loc.gov/METS/}dmdSec': This element is not expected. Expected is one of ( {http://www.loc.gov/METS/}structMap, {http://www.loc.gov/METS/}structLink, {http://www.loc.gov/METS/}behaviorSec ).</error>
</report>
<report valid="false">
<error>assets/data/communist_manifesto/data/mets.xml: Line 3: Element '{http://www.loc.gov/METS/}metsHdr', attribute 'CREATEDATE': '2019-03-24 22:16:26.006316' is not a valid value of the atomic type 'xs:dateTime'.</error>
<error>assets/data/communist_manifesto/data/mets.xml: Line 15: Element '{http://www.loc.gov/METS/}dmdSec': This element is not expected. Expected is one of ( {http://www.loc.gov/METS/}structMap, {http://www.loc.gov/METS/}structLink, {http://www.loc.gov/METS/}behaviorSec ).</error>
</report>
<report valid="false">
<error>assets/data/dfki-testdata/data/mets.xml: Line 3: Element '{http://www.loc.gov/METS/}metsHdr', attribute 'CREATEDATE': '2018-11-22 10:31:05.897472' is not a valid value of the atomic type 'xs:dateTime'.</error>
<error>assets/data/dfki-testdata/data/mets.xml: Line 32: Element '{http://www.loc.gov/METS/}fileSec': This element is not expected. Expected is one of ( {http://www.loc.gov/METS/}structMap, {http://www.loc.gov/METS/}structLink, {http://www.loc.gov/METS/}behaviorSec ).</error>
</report>
@bertsky in OCR-D/ocrd_olena#42 (comment)
The CI failure speaks of the need to update our reference data as well:
data/scribo-test/data/OCR-D-IMG-BIN-SAUVOLA-MS-FG/OCR-D-SEG-PAGE-SAUVOLA-MS-FG-orig_tiff-BIN_sauvola-ms-fg.png
should now be generated with --all-k 0.34 in effect (instead of the previous result, which silently used --k2 0.2 --k3 0.3 --k4 0.5).
(I know: this does decrease the quality for this algorithm even further. But this is not a good time to discuss a good way to wrap the different parameterizations. If we want to have control over k for all impl, we must accept that the k2/k3/k4 difference will disappear.)
Again, a report on GT issues (not assets):
In ...
...images have badly formatted tags, which cause ImageMagick to issue warnings:
identify estor_rechtsgelehrsamkeit02_1758.ocrd/data/OCR-D-IMG/OCR-D-IMG_0001
estor_rechtsgelehrsamkeit02_1758.ocrd/data/OCR-D-IMG/OCR-D-IMG_0001 TIFF 1489x2526 1489x2526+0+0 8-bit sRGB 11.95MB 0.000u 0:00.000
identify-im6.q16: Incorrect value for "Photoshop"; tag ignored. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/912.
identify praetorius_verrichtung_1668.ocrd/data/OCR-D-IMG/OCR-D-IMG_0001
praetorius_verrichtung_1668.ocrd/data/OCR-D-IMG/OCR-D-IMG_0001 TIFF 1240x1948 1240x1948+0+0 8-bit sRGB 7.247MB 0.000u 0:00.000
identify-im6.q16: Incorrect value for "Photoshop"; tag ignored. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/912.
identify justi_abhandlung01_1758.ocrd/data/OCR-D-IMG/OCR-D-IMG_0001
justi_abhandlung01_1758.ocrd/data/OCR-D-IMG/OCR-D-IMG_0001 TIFF 1542x2386 1542x2386+0+0 8-bit sRGB 11.35MB 0.000u 0:00.010
identify-im6.q16: ASCII value for tag "DocumentName" does not end in null byte. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/912.
identify-im6.q16: ASCII value for tag "ImageDescription" does not end in null byte. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/912.
identify-im6.q16: ASCII value for tag "Make" does not end in null byte. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/912.
identify-im6.q16: ASCII value for tag "PageName" does not end in null byte. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/912.
identify-im6.q16: ASCII value for tag "Software" does not end in null byte. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/912.
identify-im6.q16: ASCII value for tag "Artist" does not end in null byte. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/912.
See OCR-D/ocrd_olena#4 – this subsequently causes Olena binarization to fail with a segfault!
Of course, our components should be robust against such problems. But given that Olena is not maintained, the OCR-D module projects are in the middle of the final sprint, and you are going to re-publish GT bags anyway because of the other issues: can you please fix this in the GT repo now?
The large text region has to be split (like it is done on most other pages).
It can happen that a comparison between the content of the elements <Word> and <TextEquiv><Unicode> detects differences. To check for such differences, a Schematron rule is necessary.
Sometimes they are marked as separators, sometimes not.
Case-insensitive filesystems, as used by default on macOS and Windows, have problems with these files:
modified: data/scribo-test/data/OCR-D-SEG-PAGE-kim/OCR-D-SEG-PAGE-kim-orig_tiff.xml
modified: data/scribo-test/data/OCR-D-SEG-PAGE-niblack/OCR-D-SEG-PAGE-niblack-orig_tiff.xml
modified: data/scribo-test/data/OCR-D-SEG-PAGE-otsu/OCR-D-SEG-PAGE-otsu-orig_tiff.xml
modified: data/scribo-test/data/OCR-D-SEG-PAGE-sauvola-ms-fg/OCR-D-SEG-PAGE-sauvola-ms-fg-orig_tiff.xml
modified: data/scribo-test/data/OCR-D-SEG-PAGE-sauvola-ms-split/OCR-D-SEG-PAGE-sauvola-ms-split-orig_tiff.xml
modified: data/scribo-test/data/OCR-D-SEG-PAGE-sauvola-ms/OCR-D-SEG-PAGE-sauvola-ms-orig_tiff.xml
modified: data/scribo-test/data/OCR-D-SEG-PAGE-sauvola/OCR-D-SEG-PAGE-sauvola-orig_tiff.xml
modified: data/scribo-test/data/OCR-D-SEG-PAGE-singh/OCR-D-SEG-PAGE-singh-orig_tiff.xml
modified: data/scribo-test/data/OCR-D-SEG-PAGE-wolf/OCR-D-SEG-PAGE-wolf-orig_tiff.xml
This is caused by directory names like, for example, OCR-D-SEG-PAGE-kim and OCR-D-SEG-PAGE-KIM, which differ only in case.
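A simple repository check, sketched here in Python, could flag such collisions before they reach a case-insensitive filesystem (paths abbreviated):

```python
from collections import defaultdict

def case_collisions(paths):
    """Return groups of paths that differ only in letter case."""
    groups = defaultdict(list)
    for p in paths:
        groups[p.lower()].append(p)
    return [g for g in groups.values() if len(g) > 1]

paths = [
    "data/scribo-test/data/OCR-D-SEG-PAGE-kim/x.xml",
    "data/scribo-test/data/OCR-D-SEG-PAGE-KIM/x.xml",
    "data/scribo-test/data/OCR-D-SEG-PAGE-otsu/x.xml",
]
collisions = case_collisions(paths)
```

Run over the output of `git ls-files`, this could even serve as a CI guard for the repo.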
I don't know if this is the right place to report errors in GT (not assets).
Also, I am not sure if this is a systematic error or a singular phenomenon.
In euler_rechenkunst01_1738, page phys_0005, region TextRegion_1475759982805_45, the first line line_1475759982883_47 is bogus: its y coordinates extend only 2 pixels, and its TextEquiv is empty.
the links to the zip files in the README don't work
ocrd workspace clone -a data/kant_aufklaerung_1784-page-block-line-word_glyph/data/mets.xml
Exception: Not found: https://github.com/OCR-D/assets/raw/master/data/kant_aufklaerung_1784/OCR-D-GT-PAGE/PAGE_0017_PAGE (HTTP 404)
(it seems that data/ is missing in the path)
the two file groups OCR-D-GT-SEG-WORD and OCR-D-GT-SEG-WORD-GLYPH use the same directory and files
glyph coordinates are not rectangles or rough polygons but detailed envelope paths like 135,374 135,375 137,375 137,376 156,376 156,377 158,377 158,378 159,378 159,379 160,379 160,380 161,380 161,381 162,381 162,386 164,386 164,398 165,398 165,400 166,400 166,401 167,401 167,404 168,404 168,410 167,410 167,414 166,414 166,415 165,415 165,417 164,417 164,419 163,419 163,420 162,420 162,421 161,421 161,422 160,422 160,423 159,423 159,424 157,424 157,425 155,425 155,426 154,426 154,427 151,427 151,428 147,428 147,429 142,429 142,430 139,430 139,429 137,429 137,428 135,428 135,427 134,427 134,426 116,426 116,425 115,425 115,420 114,420 115,420 115,404 116,404 116,382 117,382 117,379 118,379 118,378 120,378 120,377 121,377 121,376 123,376 123,375 124,375 124,374
– is this correct? (I am asking because I am still considering using coordinates for alignment, which would become impractical with such data.)
non-standard or non-normalised characters, like U+F502 (from the Private Use Area, instead of letter c) or U+E644 (also Private Use Area, instead of letter ö or the combination oͤ)
the README.md still mentions the deleted page-with-glyphs.xml but not the new file
Again, I do not know if this is systematic:
In weigel_gnothi02_1618, on page phys_0001, region TextRegion_1488379719413_342, a drop capital is missing in the annotation, i.e. it became part of the adjacent paragraph region. Worse, its (larger) line height spilled over into tl_4, the first TextLine of the region, so it has a height of 369 pixels and overlaps tl_5 through tl_9.
Assets of the GT repo can now be referenced via their OCR-D identifier,
e.g.: https://ocr-d-repo.scc.kit.edu/api/v1/metastore/bagit/ocrdidentifier?ocrdidentifier=ocrd_data_structur_wundt_grundriss_1896
Please correct me: I believe Word/@language and TextStyle/@fontFamily should be either used properly or not at all in the GT. But I often see them wrong (with a tendency towards blackletter and German). E.g. in weigel_gnothi02_1618, word word_1479403541433_37 (which is clearly Greek) has language=German / fontFamily=antiqua; words w_w1aab1b1b2b3b1b1b7, w_w1aab1b1b2b3b1b1c23 and w_w1aab1b1b2b7b3ac21 (which are Latin) have language=German; and words w_w1aab1b1b2b7b9ac33, word_1488831895760_120 and word_1488831812931_118 (which are Latin antiqua) have language=German / fontFamily=blackletter.
Also, I believe that on the TextLine and TextRegion level, TextStyle/@fontFamily should always be a (comma-separated) list of all the values on the lower levels.
The mandatory element Border is missing in page 0007 (i.e. the only page) of the bag berg_ostasien02_1866.
Currently, make update-bagit depends on this zsh script:
#!/bin/zsh
sha512sum data/**/*(.) >! manifest-sha512.txt
sha512sum manifest-sha512.txt bagit.txt bag-info.txt >! tagmanifest-sha512.txt
file_no=$(du -bs data/**/*(.)|wc -l)
oxum=$(du -bs data/**/*(.)|awk '{s+=$1} END {print s}')
sed -i "s/^ *Payload-Oxum:.*/Payload-Oxum: $oxum.$file_no/" bag-info.txt
I shall port this to bash and include it in the Makefile. And possibly expose it via ocrd zip update-baginfo.
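A possible shape for the ported Payload-Oxum logic, sketched in Python (the function name and the eventual ocrd zip update-baginfo wiring are assumptions): BagIt defines Payload-Oxum as total payload octets, a dot, then the file count.

```python
import os
import tempfile

def payload_oxum(data_dir):
    """Return the BagIt Payload-Oxum string '<octets>.<streamcount>'."""
    octets = 0
    count = 0
    for dirpath, _dirnames, filenames in os.walk(data_dir):
        for name in filenames:
            octets += os.path.getsize(os.path.join(dirpath, name))
            count += 1
    return f"{octets}.{count}"

# Demo on a throwaway directory with a single 5-byte payload file.
demo = tempfile.mkdtemp()
with open(os.path.join(demo, "a.bin"), "wb") as f:
    f.write(b"12345")
oxum = payload_oxum(demo)
```

Unlike the du/awk pipeline, os.walk only visits regular files, so directories never contribute to the octet count.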
The text-wrapped figure on page 14 of fischer_werkzeugmachinen01_1900 is completely covered by a large text region.
Can you please change the file name to match the METS file? The href in METS points to "becker_quaestio_1586_00013.tif" but the physical file name in OCR-D-IMG is different (tiff.tif).
It should be a simple filename change, thank you.
E.g. all Word elements with ID word_* in https://github.com/OCR-D/assets/blob/master/data/kant_aufklaerung_1784/page/kant_aufklaerung_1784_0020.xml, such as https://github.com/OCR-D/assets/blob/master/data/kant_aufklaerung_1784/page/kant_aufklaerung_1784_0020.xml#L71.
Everywhere else, coordinates are sorted clockwise starting with top-left, but these coordinates start with bottom-right.
Can this be fixed upstream? If not, we could adapt the coordinate translation utilities in core.
Something's gone wrong. Only 7 of the 28 files referenced in METS are still there.
Also the paths for the existing ones are not valid.