Comments (15)
This file is from the IMPACT interoperability framework testfiles: https://github.com/impactcentre/iif-testfiles
I explicitly chose an external example for glyphs that has been widely used and is established, and look where it led us... :)
IMHO the order of preference for Words (and Glyphs) should be
- document order (XML ordering)
- Page readingOrder mechanism (which seems overkill for inline elements)
- coordinates
from assets.
I see. Well I traced it down from there (gt.xml) to the IMPACT component USAL Line and Word Segmentation. Not having a login for IMPACT, I can only speculate what these data are and how others deal with them. But since this is announced as testdata (as opposed to GT data, despite the file name), isn't it possible these files are expected to fail, too?
Back to our problem: I don't think we can have any such thing as an order of preference here. What if document ordering (your number one) is wrong (as in the case at hand)? How to detect this if not by comparing the other two options (ranked lower in your list)? And what if those disagree as well? Take a majority vote? And shouldn't looking at coordinates be ruled out entirely, because it is expensive (performance-wise), error-prone, and complex (at least for components that would not need to deal with them otherwise)?
I agree that `readingOrder` would be overkill for elements below the block/region level. Provided you share my objection to coordinates, that leaves us with XML ordering only.
Looking at it from the PAGE specification, IIRC, XML ordering must agree with `textLineOrder` and `readingDirection`.
I meant the order of my personal (humble) preference, not an order for implementation. I'm no expert, but I could imagine that `readingOrder` is unavoidable for RTL or for some constructs in non-Latin scripts, with a fallback to document order if no reading order is defined. I think coordinates are too error-prone, too.
As for the test data: it could very well be that this is an expected failure. I'll try to find out more. Thanks for digging into it.
Can @tboenig and @wrznr offer a more qualified opinion on modeling order of inline elements?
> I meant order of my personal (humble) preference not for implementation.
Oh sorry, I got you completely wrong there before. So I concur!
Let me summarize:
- There is only a "region" order, the so-called Reading Order. A proposal for elements below the block/region level would be overkill.
- The coordinates are an indicator for the Reading Order. For polygons, the definition of the 'order' indicator is not so simple. Therefore the definition of a bounding box would be useful in this case. Furthermore, the reading direction and the line order must also be considered in this case.
- The Reading Order is defined by the sequence of the word elements. The content of the `<TextEquivType><Unicode>` elements may differ. In this case, it is more likely that an entry error occurred.
Proposal Decision:
The Reading Order is defined by the sequence of the word elements. An evaluation of the `<TextEquivType><Unicode>` elements can be neglected. If a comparison is made between the contents of the `<Word>` elements and the `<TextEquivType><Unicode>` elements, the rich sequence of the `<Word>` elements is always recommended.
@tboenig I agree that the XML ordering is the most practical and flexible solution for us. Especially with so-called "Schmuckdruck" in mind. We should therefore handle this issue as a bug in the assets data which should be fixed by a PR.
@kba RTL is a special case. But there are additional mechanisms in PAGE XML for handling this.
@tboenig Just to make sure we understand each other:
Your point 1 is not the same as my point 2: I was not referring to the `ReadingOrder` element, but the `readingOrder` key of the `custom` attribute.
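For readers unfamiliar with that key, here is a minimal sketch of how a processor could read it and fall back to XML order when it is missing. It assumes the syntax `readingOrder {index:N;}` inside the `custom` attribute, which is how some tools write it; the helper names `reading_order_index` and `ordered_words` and the sample line are made up for illustration and are not part of PAGE or of any OCR-D API.

```python
import re
import xml.etree.ElementTree as ET

# Assumed syntax of the key inside the custom attribute: 'readingOrder {index:3;}'
READING_ORDER = re.compile(r"readingOrder\s*\{\s*index\s*:\s*(\d+)\s*;?\s*\}")


def reading_order_index(element):
    """Return the readingOrder index from the custom attribute, or None if absent."""
    match = READING_ORDER.search(element.get("custom", ""))
    return int(match.group(1)) if match else None


def ordered_words(line):
    """Return Word children sorted by index if every word has one, else in XML order."""
    # Compare local names so the sketch works with or without a namespace.
    words = [child for child in line if child.tag.rsplit("}", 1)[-1] == "Word"]
    indices = [reading_order_index(word) for word in words]
    if words and all(index is not None for index in indices):
        pairs = sorted(zip(indices, words), key=lambda pair: pair[0])
        return [word for _, word in pairs]
    return words  # fallback: document (XML) order


# Hypothetical line whose XML order disagrees with the readingOrder key.
LINE = ET.fromstring(
    '<TextLine id="l1">'
    '<Word id="w2" custom="readingOrder {index:1;}"/>'
    '<Word id="w1" custom="readingOrder {index:0;}"/>'
    "</TextLine>"
)
print([word.get("id") for word in ordered_words(LINE)])  # ['w1', 'w2']
```

The sketch only illustrates the mechanism under discussion; whether a processor should honour the key below the region level at all is exactly the open question here.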
> 1. It would be overkill a proposal for elements below the block/region.
Okay fine, but why does it appear in `kant_aufklaerung_1784-page-block-line-word` up to the `Word` level? Merely as an illustration of (non-standard) possibilities?
> 3. The content of the elements `<TextEquivType><Unicode>` may differ. In this case, it is more likely that an entry error occurred.
Do you mean you don't know for sure now (whether this is an error in the assets), or rather there is always a possibility of error but one can never know (and specify) with certainty? My concern is with what processors can safely assume from incoming data, so a strict rule is needed, and one that can easily be implemented (like relying on XML ordering exclusively).
> If a comparison is made between the contents of the `<Word>` elements and the `<TextEquivType><Unicode>` elements, the rich sequence of the `<Word>` elements is always recommended.
Does that mean the contents of `TextLine:TextEquiv:Unicode` may (is allowed to) in fact deviate from the concatenation of its content `TextLine:Word:TextEquiv:Unicode` (plus whitespace)? I was actually hoping for a specification that would rule out such deviations entirely.
@bertsky The concatenation of `TextLine:Word:TextEquiv:Unicode` contents is not allowed to deviate from the corresponding `TextLine:TextEquiv:Unicode` contents. That's why this issue has been marked with the label bug. The file in assets will be fixed asap.
The rule:
- The Reading Order is defined by the sequence of the word elements.
- If a comparison is made between the contents of the `<Word>` elements and the `<TextEquivType><Unicode>` elements, the contents of both elements must correspond.
- If a comparison between the contents of the `<Word>` elements and the `<TextEquivType><Unicode>` elements shows a difference, then there is an error. However, if this file is processed further, the order of the `<Word>` elements must be followed.
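The rule above could be checked roughly as follows. This is a sketch under stated assumptions, not a reference validator: it assumes the 2013-07-15 PAGE namespace, joins word texts with a single space, and runs on an invented two-line sample; `unicode_of` and `check_lines` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Assumption: files declare the 2013-07-15 PAGE namespace; adjust to your schema version.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

# Invented sample: line l1 is consistent, line l2 has its words in the wrong XML order.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page>
    <TextRegion id="r1">
      <TextLine id="l1">
        <Word id="w1"><TextEquiv><Unicode>Was</Unicode></TextEquiv></Word>
        <Word id="w2"><TextEquiv><Unicode>ist</Unicode></TextEquiv></Word>
        <TextEquiv><Unicode>Was ist</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="l2">
        <Word id="w3"><TextEquiv><Unicode>Aufklärung?</Unicode></TextEquiv></Word>
        <Word id="w4"><TextEquiv><Unicode>ist</Unicode></TextEquiv></Word>
        <TextEquiv><Unicode>ist Aufklärung?</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def unicode_of(element):
    """Return the text of the element's own TextEquiv/Unicode child, or ''."""
    node = element.find("pc:TextEquiv/pc:Unicode", NS)
    return (node.text or "") if node is not None else ""


def check_lines(root):
    """Yield (line_id, word_text, line_text, ok) for each TextLine that has words."""
    for line in root.iter("{%s}TextLine" % NS["pc"]):
        words = line.findall("pc:Word", NS)  # findall returns document (XML) order
        if not words:
            continue  # nothing to compare on lines without word segmentation
        # Assumption: words are separated by exactly one space.
        word_text = " ".join(unicode_of(word) for word in words)
        line_text = unicode_of(line)
        yield line.get("id"), word_text, line_text, word_text == line_text


root = ET.fromstring(SAMPLE)
for line_id, word_text, line_text, ok in check_lines(root):
    status = "ok" if ok else "ERROR, following Word order"
    print(f"{line_id}: {word_text!r} vs {line_text!r} -> {status}")
```

On a mismatch the sketch reports the error but still returns the text built from the `<Word>` sequence, which mirrors the last bullet of the rule.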
@wrznr @tboenig Thanks for clarifying!
@tboenig Pls. repair the erroneous files and close this issue.
@tboenig PUSH.
@bertsky: see the document:
https://github.com/OCR-D/assets/tree/master/data/kant_enlightenment_1784-page-block-block-line-word_glyph/data/OCR-D-GT-SEG-WORD_GLYPH
Here you will find an example for the recording of Region, Word and Glyph.
@tboenig thanks!
Alas, see #26