
Comments (15)

kba commented on June 16, 2024

This file is from the IMPACT interoperability framework testfiles: https://github.com/impactcentre/iif-testfiles

I deliberately chose an external example for glyphs, one that is widely used and well established, and see where it led us... :)

IMHO the order of preference for Words (and Glyphs) should be

  • document order (XML ordering)
  • Page readingOrder mechanism (which seems overkill for inline elements)
  • coordinates

from assets.

bertsky commented on June 16, 2024

I see. Well I traced it down from there (gt.xml) to the IMPACT component USAL Line and Word Segmentation. Not having a login for IMPACT, I can only speculate what these data are and how others deal with them. But since this is announced as testdata (as opposed to GT data, despite the file name), isn't it possible these files are expected to fail, too?

Back to our problem: I don't think we can have any such thing as an order of preference here. What if document ordering (your number one) is wrong (as in the case at hand)? How would we detect this, if not by comparing the other two options (ranked lower in your list)? And what if those disagree as well? Take a majority vote? And shouldn't looking at coordinates be ruled out entirely, because it is expensive (performance-wise), error-prone, and complex (at least for components that would not need to deal with them otherwise)?

I agree that readingOrder would be overkill for elements below the block/region level. Provided you share my objection to coordinates, that leaves us with XML ordering only.

Looking at it from the PAGE specification, IIRC, XML ordering must agree with textLineOrder and readingDirection.


kba commented on June 16, 2024

I meant order of my personal (humble) preference not for implementation. I'm no expert, but I could imagine that readingOrder is unavoidable for RTL or for some constructs in non-Latin scripts, with a fallback to document order if no readingOrder is defined. I also think coordinates are too error-prone.

As for the test data: it could very well be that this is an expected failure. I'll try to find out more. Thanks for digging into it.

Can @tboenig and @wrznr offer a more qualified opinion on modeling order of inline elements?


bertsky commented on June 16, 2024

> I meant order of my personal (humble) preference not for implementation.

Oh sorry, I got you completely wrong there before. So I concur!


tboenig commented on June 16, 2024

Let me summarize:

  1. There is only an order for regions, the so-called Reading Order.
    A proposal for elements below the block/region level would be overkill.
  2. The coordinates are an indicator of the Reading Order. For polygons, defining such an 'order' indicator is not so simple, so a bounding box would be useful in this case. Furthermore, the reading direction and LineOrder must also be considered.
  3. The Reading Order is defined by the sequence of the word elements. The content of the elements <TextEquivType><Unicode> may differ. In this case, it is more likely that an entry error occurred.
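
Point 2 can be illustrated with a minimal sketch. It assumes that word outlines are given as PAGE-style `points` strings (`"x1,y1 x2,y2 ..."`) and that words on one line are ordered by the left edge of their bounding box (or by the right edge for right-to-left text). The function names and the sample coordinates are hypothetical and only serve as illustration; this is not how any existing processor does it.

```python
from typing import Dict, List, Tuple


def parse_points(points: str) -> List[Tuple[int, int]]:
    """Parse a points string such as '10,20 30,20 30,40' into (x, y) pairs."""
    pairs = []
    for pair in points.split():
        x, y = pair.split(",")
        pairs.append((int(x), int(y)))
    return pairs


def bounding_box(points: str) -> Tuple[int, int, int, int]:
    """Reduce a polygon to its bounding box (min_x, min_y, max_x, max_y)."""
    coords = parse_points(points)
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return min(xs), min(ys), max(xs), max(ys)


def order_by_coordinates(words: Dict[str, str], rtl: bool = False) -> List[str]:
    """Order word ids within one line by their bounding boxes.

    Assumption: left-to-right lines sort by left edge, right-to-left lines
    by right edge in descending order. Vertical text is not handled.
    """
    boxes = {word_id: bounding_box(p) for word_id, p in words.items()}
    if rtl:
        return sorted(boxes, key=lambda w: -boxes[w][2])
    return sorted(boxes, key=lambda w: boxes[w][0])


# Hypothetical sample: two slightly skewed word polygons on one line.
WORDS = {
    "w2": "120,10 200,12 198,50 118,48",
    "w1": "10,10 100,10 100,50 10,50",
}
print(order_by_coordinates(WORDS))
print(order_by_coordinates(WORDS, rtl=True))
```

Even this small sketch needs the reading direction as an extra input and ignores overlapping or vertically stacked boxes, which is in line with the concern raised above that coordinates are an error-prone basis for ordering.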

Proposed decision:
The Reading Order is defined by the sequence of the word elements. An evaluation of the <TextEquivType><Unicode> elements can be omitted. If a comparison is made between the contents of the <Word> elements and the <TextEquivType><Unicode> elements, the sequence of the <Word> elements is always recommended.


wrznr commented on June 16, 2024

@tboenig I agree that the XML ordering is the most practical and flexible solution for us. Especially with so-called "Schmuckdruck" in mind. We should therefore handle this issue as a bug in the assets data which should be fixed by a PR.


wrznr commented on June 16, 2024

@kba RTL is a special case. But there are additional mechanisms in PAGE XML for handling this.


bertsky commented on June 16, 2024

@tboenig Just to make sure we understand each other:

Your point 1 is not the same as my point 2: I was not referring to the ReadingOrder element, but the readingOrder key of the custom attribute.

> 1. A proposal for elements below the block/region level would be overkill.

Okay fine, but why does it appear in kant_aufklaerung_1784-page-block-line-word up to the Word level? Merely as an illustration of (non-standard) possibilities?

> 3. The content of the elements `<TextEquivType><Unicode>` may differ. In this case, it is more likely that an entry error occurred.

Do you mean you don't know for sure now (whether this is an error in the assets), or rather there is always a possibility of error but one can never know (and specify) with certainty? My concern is with what processors can safely assume from incoming data, so a strict rule is needed, and one that can easily be implemented (like relying on XML ordering exclusively).

> If a comparison is made between the contents of the <Word> elements and the <TextEquivType><Unicode> elements, the sequence of the <Word> elements is always recommended.

Does that mean the contents of TextLine:TextEquiv:Unicode may (is allowed to) in fact deviate from the concatenation of its content TextLine:Word:TextEquiv:Unicode (plus whitespace)? I was actually hoping for a specification that would rule out such deviations entirely.


wrznr commented on June 16, 2024

@bertsky The concatenation of TextLine:Word:TextEquiv:Unicode contents is not allowed to deviate from the corresponding TextLine:TextEquiv:Unicode contents. That's why this issue has been marked with the label bug. The file in assets will be fixed asap.


tboenig commented on June 16, 2024

The rule:

  1. The Reading Order is defined by the sequence of the word elements.
  2. If a comparison is made between the contents of the <Word> elements and the <TextEquivType><Unicode> elements, the contents of both elements must correspond.
  3. If a comparison between the contents of the <Word> elements and the <TextEquivType><Unicode> elements shows a difference, then there is an error. However, if this file is processed further, the order of the <Word> elements must be followed.
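
The rule above could be checked mechanically along the following lines. This is only a sketch under stated assumptions: the words of a line are joined with a single space, element names are matched by local name so that any PAGE namespace version is accepted, and the function names and the sample fragment are hypothetical (the fragment is not a complete PAGE document).

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple


def local(tag: str) -> str:
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def direct_text(elem: ET.Element) -> Optional[str]:
    """Return the TextEquiv/Unicode text that belongs to elem itself."""
    for child in elem:
        if local(child.tag) == "TextEquiv":
            for sub in child:
                if local(sub.tag) == "Unicode":
                    return sub.text or ""
    return None


def check_lines(xml_string: str) -> List[Tuple[Optional[str], str, str]]:
    """Report lines whose text differs from their words in document order.

    Each entry is (line id, line text, words joined by a single space).
    Lines without words or without their own text are skipped.
    """
    root = ET.fromstring(xml_string)
    mismatches = []
    for line in root.iter():
        if local(line.tag) != "TextLine":
            continue
        words = [child for child in line if local(child.tag) == "Word"]
        line_text = direct_text(line)
        if not words or line_text is None:
            continue
        # Rule 1: the XML sequence of the Word elements is the reading order.
        joined = " ".join(direct_text(word) or "" for word in words)
        # Rules 2 and 3: any difference is an error in the file.
        if joined != line_text:
            mismatches.append((line.get("id"), line_text, joined))
    return mismatches


# Hypothetical fragment: line "l2" lists its words in the wrong order.
SAMPLE = """<Page>
  <TextLine id="l1">
    <Word id="w1"><TextEquiv><Unicode>Hello</Unicode></TextEquiv></Word>
    <Word id="w2"><TextEquiv><Unicode>world</Unicode></TextEquiv></Word>
    <TextEquiv><Unicode>Hello world</Unicode></TextEquiv>
  </TextLine>
  <TextLine id="l2">
    <Word id="w3"><TextEquiv><Unicode>world</Unicode></TextEquiv></Word>
    <Word id="w4"><TextEquiv><Unicode>Hello</Unicode></TextEquiv></Word>
    <TextEquiv><Unicode>Hello world</Unicode></TextEquiv>
  </TextLine>
</Page>"""

for line_id, line_text, joined in check_lines(SAMPLE):
    print(f"{line_id}: line text {line_text!r} != words {joined!r}")
```

A processor that follows rule 3 would report such a mismatch as an error but still continue with the order of the Word elements rather than trying to repair it from the line text or from coordinates.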


bertsky commented on June 16, 2024

@wrznr @tboenig Thanks for clarifying!


wrznr commented on June 16, 2024

@tboenig Pls. repair the erroneous files and close this issue.


wrznr commented on June 16, 2024

@tboenig PUSH.


tboenig commented on June 16, 2024

@bertsky: See this document:
https://github.com/OCR-D/assets/tree/master/data/kant_enlightenment_1784-page-block-block-line-word_glyph/data/OCR-D-GT-SEG-WORD_GLYPH
There you will find an example of how Region, Word and Glyph are recorded.


bertsky commented on June 16, 2024

@tboenig thanks!

Alas, see #26
