ocr-d / assets
Test data for testing specs and software in @OCR-D
The PAGE-XML specification contains means to describe the inner structure of a TableRegion (i.e. as a coordinate matrix via Grid/GridPoints/@points). However, so far there is no single document among assets, structural GT (1000pages / current repo) and text GT (dta / old repo) with an instance of this. (There's also no example in the PAGE-XML specification repo.)
We need these kinds of data both for making our processors table-capable and as training/evaluation data for layout analysis.
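To illustrate what such data would look like, here is a hypothetical TableRegion with an inner Grid, plus a sketch of parsing the GridPoints coordinate matrix. The fragment is invented (no real instance exists yet, as noted above); element and attribute names follow the PAGE-XML schema.

```python
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"

# Invented example: a 2x2-cell table described by a 3x3 grid of intersections.
FRAGMENT = f"""
<TableRegion xmlns="{PAGE_NS}" id="table_1">
  <Coords points="100,100 500,100 500,300 100,300"/>
  <Grid>
    <GridPoints index="0" points="100,100 300,100 500,100"/>
    <GridPoints index="1" points="100,200 300,200 500,200"/>
    <GridPoints index="2" points="100,300 300,300 500,300"/>
  </Grid>
</TableRegion>"""

def parse_grid(xml_text):
    """Return the grid as a list of rows, each row a list of (x, y) tuples."""
    root = ET.fromstring(xml_text)
    rows = []
    for gp in root.iter(f"{{{PAGE_NS}}}GridPoints"):
        row = [tuple(map(int, pt.split(","))) for pt in gp.get("points").split()]
        rows.append(row)
    return rows

grid = parse_grid(FRAGMENT)
```

Each row of GridPoints holds the horizontal intersections of one grid line, so a table with m×n cells needs (m+1)×(n+1) points.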
They are annotated in great detail on the first page but not on the following pages, which makes them very difficult to use as GT.
The file page/dorn_uppedat_1507_00004.xml in the ground-truth zip file is invalid. It is just a missing < in the metadata at Creator:
<Metadata>
<Creator>OCR-D/Creator>
<Created>2016-12-07T15:18:07.272+01:00</Created>
<LastChange>2017-03-07T18:04:10.221+01:00</LastChange>
</Metadata>
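A minimal well-formedness check with the Python standard library would catch this kind of error automatically before the files are published (the broken fragment below is copied from above):

```python
import xml.etree.ElementTree as ET

BROKEN = """<Metadata>
<Creator>OCR-D/Creator>
<Created>2016-12-07T15:18:07.272+01:00</Created>
<LastChange>2017-03-07T18:04:10.221+01:00</LastChange>
</Metadata>"""

def is_well_formed(xml_text):
    """True iff the text parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```

With the missing < restored (i.e. `</Creator>`), the same check passes.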
I don't know how this is supposed to work at all. Usually the images need no deskewing, but when they do, that information is missing in PAGE. (I would at least expect some orientation angle in the text regions. Or is Baseline the place to look for this information?)
E.g. in weigel_gnothi02_1618, page phys_0001 needs to be rotated by about -2.0 degrees (clockwise). The effect is also pronounced in the GT annotation itself: it contains coordinates that effectively chop off parts of the glyphs in some corners, e.g. region TextRegion_1479403414297_29 line tl_1 (chopped "V"), region TextRegion_1488379719413_342 line tl_22 (chopped "durch ſein") and region TextRegion_1488379733255_361 (chopped "ſein").
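Absent an orientation angle in PAGE, consumers are left to deskew heuristically. As an illustration (all numbers are invented, and the sign convention depends on the image coordinate system with its downward y axis), the polygon coordinates themselves could be corrected by rotating them about the page centre:

```python
import math

def rotate_points(points, angle_deg, cx, cy):
    """Rotate (x, y) points by angle_deg about the centre (cx, cy)."""
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    out = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        out.append((cx + dx * cos_a - dy * sin_a,
                    cy + dx * sin_a + dy * cos_a))
    return out

# A point at the rotation centre stays fixed; all others move slightly.
deskewed = rotate_points([(1000, 1500), (100, 100)], 2.0, 1000, 1500)
```

The same rotation would of course have to be applied to the image (or its derived coordinates) consistently, which is exactly why the angle should be annotated in the first place.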
The footnote is separated in two text regions where the second one contains a table. Either use a container region for the whole footnote (containing smaller regions) or structure the footnote completely (thereby avoiding the embedding of the table which makes no sense).
Region r0 on page 312 is actually a separator, not a TextRegion.
The lines in the header of the pages have to be annotated somehow.
#!/bin/sh
set -e
cd "$(mktemp -d)"
virtualenv venv
. venv/bin/activate
pip install --pre ocrd
wget https://ocr-d-repo.scc.kit.edu/api/v1/dataresources/558280e0-c40a-49ae-81ab-679bc29567c3/data/gerstner_mechanik01_1831.zip
dtrx gerstner_mechanik01_1831.zip
cd gerstner_mechanik01_1831/data
ocrd workspace validate mets.xml
yields:
[...]
OSError: cannot identify image file 'OCR-D-GT-SEG-PAGE/OCR-D-GT-SEG-PAGE_0019.jpg'
This is due to METS specifying an image/jpeg MIME type for the PAGE-XML file here. The validation process copies the file, adding a .jpg extension, and then validation breaks because it can't read it as an image.
I've seen this in some files, possibly all are affected.
Files referenced in the METS and stored in the data directory should have file extensions to make life easier for PAGE Viewer etc.
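A cheap sanity check that would have caught this, sketched with an invented helper: sniff the actual file content and compare it against the MIMETYPE declared in METS. PAGE-XML files start with an XML declaration, not a JPEG magic number.

```python
def sniff_mimetype(data: bytes):
    """Very rough content sniffing for the file types involved here."""
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.lstrip().startswith(b"<?xml") or data.lstrip().startswith(b"<"):
        return "application/xml"
    return None

# A PAGE-XML file wrongly declared as image/jpeg in METS:
declared = "image/jpeg"
actual = sniff_mimetype(b'<?xml version="1.0" encoding="UTF-8"?><PcGts/>')
mismatch = declared != actual
```

Running such a check over all FLocat entries would flag every affected file at once, rather than waiting for validation to trip over one.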
The current GT bags in the repo all use PrintSpace to annotate page-level cropping (including marginals and page numbers). But according to the PAGE standard, Border should be used for that. Please re-export, so that PAGE tools dealing with coordinates do not have to work around this deviation.
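If re-exporting takes a while, a mechanical rename is conceivable as a stopgap. This sketch assumes the annotated PrintSpace coordinates really denote the page Border, as the bags suggest; the sample Page fragment is invented.

```python
import xml.etree.ElementTree as ET

NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"

page_xml = f"""<Page xmlns="{NS}" imageWidth="1000" imageHeight="1500">
  <PrintSpace><Coords points="50,50 950,50 950,1450 50,1450"/></PrintSpace>
</Page>"""

root = ET.fromstring(page_xml)
# Rename every PrintSpace element to Border, keeping its Coords child intact.
for elem in root.iter(f"{{{NS}}}PrintSpace"):
    elem.tag = f"{{{NS}}}Border"

fixed = ET.tostring(root, encoding="unicode")
```

A real fix would still need to respect the PAGE schema's element ordering within Page, so re-exporting remains the cleaner solution.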
Is it correct for assets/data/kant_aufklaerung_1784/page/kant_aufklaerung_1784_0020.xml to have punctuation characters as separate Word elements, even if they are written adjacent to other words (i.e. without whitespace)? For example, in the first line tl_2 of region r_2_1, the word word_1478541900479_904 depicts a single semicolon token. But the line reads
gewiegelt worden; ſo ſchaͤdlich iſt es Vorurtheile zu
so it should be part of the second Word (i.e. worden;).
In principle, punctuation characters can occur adjacent to a word on either side.
And of course, these might be combined, as in
„die Urſach jener intereſſanten Erſcheinung ſeyn ſollte?“ —
or
sind ganz entzückt über diesen glänzenden (!) Sieg der gerechten Sache
or
„Wir alten Republikaner“, sagt Guinard, „die seit 30 Jahren über die Republik wachten, wir werden doch einem Leon Faucher zur Seite noch ferner darüber wachen dürfen.“
So again (see #12) there is no way to reproduce the TextLine content from Word-level annotation, except using coordinate heuristics.
Wouldn't it be better to have a strict rule to only segment at whitespace? (This is what segmentation using OCR-D/ocrd_tesserocr does now.)
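To make the mismatch concrete, here is a minimal round-trip check (the line text is quoted from above; the annotated word list is my reading of the GT):

```python
line_text = "gewiegelt worden; ſo ſchaͤdlich iſt es Vorurtheile zu"

# Words as annotated in the GT: the semicolon is its own Word element.
words_as_annotated = ["gewiegelt", "worden", ";", "ſo", "ſchaͤdlich",
                      "iſt", "es", "Vorurtheile", "zu"]

# Words under a strict segment-at-whitespace rule.
words_if_split_at_whitespace = line_text.split()

broken = " ".join(words_as_annotated)            # inserts a space before ";"
roundtrip = " ".join(words_if_split_at_whitespace)
```

Only the whitespace rule lets a consumer reconstruct the TextLine string by simple joining; the annotated segmentation requires knowing which tokens attach without a space.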
Is it correct for assets/data/page-with-glyphs.xml to have its Word elements misordered w.r.t. the linear reading order as seen by TextLine? For example, in the first line N66290 of region r0,
Ich. Chriian Edlen von S midt
the first word N72746 is actually the last element. This striking disorder repeats throughout this file. The only information to reproduce the TextLine content here is in the coordinates.
In principle, we could have readingOrder indexing in the custom attribute.
What source can/must post-correction rely on?
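For what it's worth, the coordinate heuristic could look like this minimal sketch (the suffixed IDs and all coordinates are invented; it also only works within a single left-to-right line):

```python
def leftmost_x(points_attr):
    """Smallest x in a PAGE @points string like '135,374 160,380 ...'."""
    return min(int(pt.split(",")[0]) for pt in points_attr.split())

# Words in document order as found in the file (misordered):
words = [
    ("N72746",  "midt",    "900,100 980,100 980,140 900,140"),
    ("N66290a", "Ich.",    "100,100 160,100 160,140 100,140"),
    ("N66290b", "Chriian", "200,100 380,100 380,140 200,140"),
]

# Restore reading order by sorting on the leftmost x coordinate.
in_reading_order = sorted(words, key=lambda w: leftmost_x(w[2]))
```

This breaks down for multi-column lines, rotated text, or RTL scripts, which is why explicit ordering in the annotation would be preferable.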
The file https://github.com/OCR-D/assets/blob/master/data/kant_aufklaerung_1784-page-block/data/OCR-D-GT-SEG-BLOCK/OCR-D-GT-SEG-BLOCK_0008.xml references an image file which does not exist in the bag.
Quote: If used, the GROUPID of the page MUST BE the ID of the file that represents the original image. In other words: For the file representing the original image, ID and GROUPID must be identical.
Probably just a singular error:
In weigel_gnothi02_1618, on page phys_0001, region TextRegion_1488379719413_342, line tl_7, the first 3 words ("auf den Erſten") are missing from the annotation (both as Word elements and in the TextEquiv of the TextLine).
In euler_rechenkunst01_1738 from the OCR-D structure+text GT, there is a mismatch between physical pages (and files) versus structLink references: The latter includes a (non-existent) phys_0000 but misses the (existing) phys_0006.
The test images for binarization from scribo should be inverted (so they show positive). With ImageMagick installed, a simple convert -negate filename does the trick.
Another report on GT issues (not assets):
In …
…images show clear signs of JPEG compression, with notable artifacts around sharp contrast like graphemes. ImageMagick identifies them as TIFF with 200 PPI (or 72 PPI or no resolution tag at all), without compression, without any crs or exif tags, and with very few tiff tags (e.g. no software or artist).
(In contrast, "good" images in other workspaces are identified as TIFF with 300 PPI, without compression, with full aux, crs, xmp, exif and tiff tags, which list the camera model, exposure settings, the true date stamp – somewhere in 2011 – and that it was created with Adobe Photoshop Lightroom. Sometimes, they are also TIFF with 300 PPI, without compression, without those tags, but listing IrfanView or PROView or OmniScan or multidotscan as creator software.)
I found this because I had trouble binarizing such images: I would always get too many (un)connected components, regardless of threshold settings.
@tboenig I'd say this is the most urgent issue so far.
Found thanks to OCR-D/core#470:
<report valid="false">
<error>assets/data/page_dewarp/data/mets.xml: Line 3: Element '{http://www.loc.gov/METS/}metsHdr', attribute 'CREATEDATE': '2018-11-14 18:22:49.558544' is not a valid value of the atomic type 'xs:dateTime'.</error>
<error>assets/data/page_dewarp/data/mets.xml: Line 22: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
<error>assets/data/page_dewarp/data/mets.xml: Line 25: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
<error>assets/data/page_dewarp/data/mets.xml: Line 28: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
<error>assets/data/page_dewarp/data/mets.xml: Line 31: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
</report>
<report valid="false">
<error>assets/data/leptonica_samples/data/mets.xml: Line 3: Element '{http://www.loc.gov/METS/}metsHdr', attribute 'CREATEDATE': '2018-11-14 18:14:27.999250' is not a valid value of the atomic type 'xs:dateTime'.</error>
<error>assets/data/leptonica_samples/data/mets.xml: Line 22: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
<error>assets/data/leptonica_samples/data/mets.xml: Line 25: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
</report>
<report valid="false">
<error>assets/data/column-samples/data/mets.xml: Line 39: Element '{http://www.loc.gov/METS/}fileSec': This element is not expected. Expected is one of ( {http://www.loc.gov/METS/}structMap, {http://www.loc.gov/METS/}structLink, {http://www.loc.gov/METS/}behaviorSec ).</error>
</report>
<report valid="false">
<error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 3: Element '{http://www.loc.gov/METS/}metsHdr', attribute 'CREATEDATE': '2018-11-14 16:44:18.171353' is not a valid value of the atomic type 'xs:dateTime'.</error>
<error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 22: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
<error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 25: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
<error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 28: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
<error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 31: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
<error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 34: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
<error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 37: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
<error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 40: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
<error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 43: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
<error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 48: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
<error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 51: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
<error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 54: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
<error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 57: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
<error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 60: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
<error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 63: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
<error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 66: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
<error>assets/data/DIBCO11-machine_printed/data/mets.xml: Line 69: Element '{http://www.loc.gov/METS/}FLocat': The attribute 'LOCTYPE' is required but missing.</error>
</report>
<report valid="false">
<error>assets/data/grenzboten-test/data/mets.xml: Line 3: Element '{http://www.loc.gov/METS/}metsHdr', attribute 'CREATEDATE': '2019-08-07 17:52:26.109166' is not a valid value of the atomic type 'xs:dateTime'.</error>
<error>assets/data/grenzboten-test/data/mets.xml: Line 15: Element '{http://www.loc.gov/METS/}dmdSec': This element is not expected. Expected is one of ( {http://www.loc.gov/METS/}structMap, {http://www.loc.gov/METS/}structLink, {http://www.loc.gov/METS/}behaviorSec ).</error>
</report>
<report valid="false">
<error>FILE_0001_FULLTEXT: Line 2: Element '{http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15}PcGts', attribute 'pcGtsId': '00000001' is not a valid value of the atomic type 'xs:ID'.</error>
<error>FILE_0001_FULLTEXT: Line 6: Element '{http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15}Page': This element is not expected. Expected is ( {http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15}Metadata ).</error>
<error>FILE_0002_FULLTEXT: Line 2: Element '{http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15}PcGts', attribute 'pcGtsId': '00000002' is not a valid value of the atomic type 'xs:ID'.</error>
<error>FILE_0002_FULLTEXT: Line 6: Element '{http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15}Page': This element is not expected. Expected is ( {http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15}Metadata ).</error>
</report>
<report valid="false">
<error>assets/data/kant_aufklaerung_1784-binarized/data/mets.xml: Line 59: Element '{http://www.loc.gov/METS/}div', attribute 'ID': 'P_0017' is not a valid value of the atomic type 'xs:ID'.</error>
<error>assets/data/kant_aufklaerung_1784-binarized/data/mets.xml: Line 66: Element '{http://www.loc.gov/METS/}div', attribute 'ID': 'P_0020' is not a valid value of the atomic type 'xs:ID'.</error>
</report>
<report valid="false">
<error>assets/data/scribo-test/data/mets.xml: Line 33: Element '{http://www.loc.gov/METS/}dmdSec': This element is not expected. Expected is one of ( {http://www.loc.gov/METS/}structMap, {http://www.loc.gov/METS/}structLink, {http://www.loc.gov/METS/}behaviorSec ).</error>
</report>
<report valid="false">
<error>assets/data/communist_manifesto/data/mets.xml: Line 3: Element '{http://www.loc.gov/METS/}metsHdr', attribute 'CREATEDATE': '2019-03-24 22:16:26.006316' is not a valid value of the atomic type 'xs:dateTime'.</error>
<error>assets/data/communist_manifesto/data/mets.xml: Line 15: Element '{http://www.loc.gov/METS/}dmdSec': This element is not expected. Expected is one of ( {http://www.loc.gov/METS/}structMap, {http://www.loc.gov/METS/}structLink, {http://www.loc.gov/METS/}behaviorSec ).</error>
</report>
<report valid="false">
<error>assets/data/dfki-testdata/data/mets.xml: Line 3: Element '{http://www.loc.gov/METS/}metsHdr', attribute 'CREATEDATE': '2018-11-22 10:31:05.897472' is not a valid value of the atomic type 'xs:dateTime'.</error>
<error>assets/data/dfki-testdata/data/mets.xml: Line 32: Element '{http://www.loc.gov/METS/}fileSec': This element is not expected. Expected is one of ( {http://www.loc.gov/METS/}structMap, {http://www.loc.gov/METS/}structLink, {http://www.loc.gov/METS/}behaviorSec ).</error>
</report>
@bertsky in OCR-D/ocrd_olena#42 (comment)
The CI failure speaks of the need to update our reference data as well:
data/scribo-test/data/OCR-D-IMG-BIN-SAUVOLA-MS-FG/OCR-D-SEG-PAGE-SAUVOLA-MS-FG-orig_tiff-BIN_sauvola-ms-fg.png
should now be generated with --all-k 0.34 in effect (instead of the previous result, which silently used --k2 0.2 --k3 0.3 --k4 0.5).
(I know: this does decrease the quality for this algorithm even further. But this is not a good time to discuss a good way to wrap the different parameterizations. If we want to have control over k for all impl, we must accept that the k2/k3/k4 difference will disappear.)
Again, a report on GT issues (not assets):
In ...
...images have badly formatted tags, which cause ImageMagick to issue warnings:
identify estor_rechtsgelehrsamkeit02_1758.ocrd/data/OCR-D-IMG/OCR-D-IMG_0001
estor_rechtsgelehrsamkeit02_1758.ocrd/data/OCR-D-IMG/OCR-D-IMG_0001 TIFF 1489x2526 1489x2526+0+0 8-bit sRGB 11.95MB 0.000u 0:00.000
identify-im6.q16: Incorrect value for "Photoshop"; tag ignored. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/912.
identify praetorius_verrichtung_1668.ocrd/data/OCR-D-IMG/OCR-D-IMG_0001
praetorius_verrichtung_1668.ocrd/data/OCR-D-IMG/OCR-D-IMG_0001 TIFF 1240x1948 1240x1948+0+0 8-bit sRGB 7.247MB 0.000u 0:00.000
identify-im6.q16: Incorrect value for "Photoshop"; tag ignored. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/912.
identify justi_abhandlung01_1758.ocrd/data/OCR-D-IMG/OCR-D-IMG_0001
justi_abhandlung01_1758.ocrd/data/OCR-D-IMG/OCR-D-IMG_0001 TIFF 1542x2386 1542x2386+0+0 8-bit sRGB 11.35MB 0.000u 0:00.010
identify-im6.q16: ASCII value for tag "DocumentName" does not end in null byte. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/912.
identify-im6.q16: ASCII value for tag "ImageDescription" does not end in null byte. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/912.
identify-im6.q16: ASCII value for tag "Make" does not end in null byte. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/912.
identify-im6.q16: ASCII value for tag "PageName" does not end in null byte. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/912.
identify-im6.q16: ASCII value for tag "Software" does not end in null byte. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/912.
identify-im6.q16: ASCII value for tag "Artist" does not end in null byte. `TIFFFetchNormalTag' @ warning/tiff.c/TIFFWarnings/912.
See OCR-D/ocrd_olena#4 – this subsequently causes Olena binarization to fail with a segfault!
Of course, our components should be robust against such problems. But given that Olena is not maintained, the OCR-D module projects are in the middle of the final sprint, and you are going to re-publish GT bags anyway because of the other issues: can you please fix this in the GT repo now?
The large text region has to be split (like it is done on most other pages).
It can happen that a comparison between the content of the elements <Word> and <TextEquiv><Unicode> detects differences. To check for such differences, a Schematron rule is necessary.
Sometimes they are marked as separators, sometimes not.
Case-insensitive filesystems, as used by default on macOS and Windows, have problems with these files:
modified: data/scribo-test/data/OCR-D-SEG-PAGE-kim/OCR-D-SEG-PAGE-kim-orig_tiff.xml
modified: data/scribo-test/data/OCR-D-SEG-PAGE-niblack/OCR-D-SEG-PAGE-niblack-orig_tiff.xml
modified: data/scribo-test/data/OCR-D-SEG-PAGE-otsu/OCR-D-SEG-PAGE-otsu-orig_tiff.xml
modified: data/scribo-test/data/OCR-D-SEG-PAGE-sauvola-ms-fg/OCR-D-SEG-PAGE-sauvola-ms-fg-orig_tiff.xml
modified: data/scribo-test/data/OCR-D-SEG-PAGE-sauvola-ms-split/OCR-D-SEG-PAGE-sauvola-ms-split-orig_tiff.xml
modified: data/scribo-test/data/OCR-D-SEG-PAGE-sauvola-ms/OCR-D-SEG-PAGE-sauvola-ms-orig_tiff.xml
modified: data/scribo-test/data/OCR-D-SEG-PAGE-sauvola/OCR-D-SEG-PAGE-sauvola-orig_tiff.xml
modified: data/scribo-test/data/OCR-D-SEG-PAGE-singh/OCR-D-SEG-PAGE-singh-orig_tiff.xml
modified: data/scribo-test/data/OCR-D-SEG-PAGE-wolf/OCR-D-SEG-PAGE-wolf-orig_tiff.xml
This is caused by directory names like, for example, OCR-D-SEG-PAGE-kim and OCR-D-SEG-PAGE-KIM, which differ only in case.
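A simple repository check, sketched here in Python, could flag such collisions before they reach a case-insensitive filesystem (paths abbreviated):

```python
from collections import defaultdict

def case_collisions(paths):
    """Return groups of paths that differ only in letter case."""
    groups = defaultdict(list)
    for p in paths:
        groups[p.lower()].append(p)
    return [g for g in groups.values() if len(g) > 1]

paths = [
    "data/scribo-test/data/OCR-D-SEG-PAGE-kim/x.xml",
    "data/scribo-test/data/OCR-D-SEG-PAGE-KIM/x.xml",
    "data/scribo-test/data/OCR-D-SEG-PAGE-otsu/x.xml",
]
collisions = case_collisions(paths)
```

Run over the output of `git ls-files`, this could even serve as a CI guard for the repo.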
I don't know if this is the right place to report errors in GT (not assets).
Also, I am not sure if this is a systematic error or a singular phenomenon.
In euler_rechenkunst01_1738, page phys_0005, region TextRegion_1475759982805_45, the first line line_1475759982883_47 is bogus: its y coordinates extend only 2 pixels, and its TextEquiv is empty.
the links to the zip files in the README don't work
ocrd workspace clone -a data/kant_aufklaerung_1784-page-block-line-word_glyph/data/mets.xml
Exception: Not found: https://github.com/OCR-D/assets/raw/master/data/kant_aufklaerung_1784/OCR-D-GT-PAGE/PAGE_0017_PAGE (HTTP 404)
(it seems that data/ is missing in the path)
the two file groups OCR-D-GT-SEG-WORD and OCR-D-GT-SEG-WORD-GLYPH use the same directory and files
glyph coordinates are not rectangles or rough polygons but detailed envelope paths like 135,374 135,375 137,375 137,376 156,376 156,377 158,377 158,378 159,378 159,379 160,379 160,380 161,380 161,381 162,381 162,386 164,386 164,398 165,398 165,400 166,400 166,401 167,401 167,404 168,404 168,410 167,410 167,414 166,414 166,415 165,415 165,417 164,417 164,419 163,419 163,420 162,420 162,421 161,421 161,422 160,422 160,423 159,423 159,424 157,424 157,425 155,425 155,426 154,426 154,427 151,427 151,428 147,428 147,429 142,429 142,430 139,430 139,429 137,429 137,428 135,428 135,427 134,427 134,426 116,426 116,425 115,425 115,420 114,420 115,420 115,404 116,404 116,382 117,382 117,379 118,379 118,378 120,378 120,377 121,377 121,376 123,376 123,375 124,375 124,374
– is this correct? (I am asking because I am still considering using coordinates for alignment, which would become impractical with such data.)
non-standard or non-normalised characters, like U+F502 (from the Private Use Area, instead of letter c) or U+E644 (also Private Use Area, instead of letter ö or the combination oͤ)
the README.md still mentions the deleted page-with-glyphs.xml but not the new file
Again, I do not know if this is systematic:
In weigel_gnothi02_1618, on page phys_0001, region TextRegion_1488379719413_342, a drop capital is missing in the annotation, i.e. it became part of the adjacent paragraph region. Worse, its (larger) line height spilled over into tl_4, the first TextLine of the region, so it has a height of 369 pixels and overlaps tl_5 through tl_9.
Assets of the GT repo can now be referenced via their OCR-D identifier,
e.g.: https://ocr-d-repo.scc.kit.edu/api/v1/metastore/bagit/ocrdidentifier?ocrdidentifier=ocrd_data_structur_wundt_grundriss_1896
Please correct me: I believe Word/@language and TextStyle/@fontFamily should be either used properly or not at all in the GT. But I often see them wrong (with a tendency towards blackletter and German). E.g. in weigel_gnothi02_1618, word word_1479403541433_37 (which is clearly Greek) has language=German / fontFamily=antiqua; words w_w1aab1b1b2b3b1b1b7, w_w1aab1b1b2b3b1b1c23 and w_w1aab1b1b2b7b3ac21 (which are Latin) have language=German; and words w_w1aab1b1b2b7b9ac33, word_1488831895760_120 and word_1488831812931_118 (which are Latin antiqua) have language=German / fontFamily=blackletter.
Also, I believe that on the TextLine and TextRegion level, TextStyle/@fontFamily should always be a (comma-separated) list of all the values on the lower levels.
The mandatory element Border is missing in page 0007 (i.e. the only page) of the bag berg_ostasien02_1866.
Currently, make update-bagit depends on this zsh script:
#!/bin/zsh
sha512sum data/**/*(.) >! manifest-sha512.txt
sha512sum manifest-sha512.txt bagit.txt bag-info.txt >! tagmanifest-sha512.txt
file_no=$(du -bs data/**/*(.)|wc -l)
oxum=$(du -bs data/**/*(.)|awk '{s+=$1} END {print s}')
sed -i "s/^ *Payload-Oxum:.*/Payload-Oxum: $oxum.$file_no/" bag-info.txt
I shall port this to bash and include it in the Makefile. And possibly expose it via ocrd zip update-baginfo.
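A possible shape for the ported Payload-Oxum logic, sketched in Python (the function name and the eventual ocrd zip update-baginfo wiring are assumptions): BagIt defines Payload-Oxum as total payload octets, a dot, then the file count.

```python
import os
import tempfile

def payload_oxum(data_dir):
    """Return the BagIt Payload-Oxum string '<octets>.<streamcount>'."""
    octets = 0
    count = 0
    for dirpath, _dirnames, filenames in os.walk(data_dir):
        for name in filenames:
            octets += os.path.getsize(os.path.join(dirpath, name))
            count += 1
    return f"{octets}.{count}"

# Demo on a throwaway directory with a single 5-byte payload file.
demo = tempfile.mkdtemp()
with open(os.path.join(demo, "a.bin"), "wb") as f:
    f.write(b"12345")
oxum = payload_oxum(demo)
```

Unlike the du/awk pipeline, os.walk only visits regular files, so directories never contribute to the octet count.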
The text-wrapped figure on page 14 of fischer_werkzeugmachinen01_1900 is completely covered by a large text region.
Can you please change the file name to match the METS file? The href in METS points to "becker_quaestio_1586_00013.tif" but the physical file name in OCR-D-IMG is different (tiff.tif).
It should be a simple filename change, thank you.
E.g. all Word elements with ID word_* in https://github.com/OCR-D/assets/blob/master/data/kant_aufklaerung_1784/page/kant_aufklaerung_1784_0020.xml, such as https://github.com/OCR-D/assets/blob/master/data/kant_aufklaerung_1784/page/kant_aufklaerung_1784_0020.xml#L71.
Everywhere else, coordinates are sorted clockwise starting with top-left, but these coordinates start with bottom-right.
Can this be fixed upstream? If not, we could adapt the coordinate translation utilities in core.
Something's gone wrong. Only 7 of the 28 files referenced in METS are still there.
Also the paths for the existing ones are not valid.