Comments (12)
This point is probably decisive. Even if we fixed this upstream now, we can never be sure how components will evolve. Since we have no simple, reliable way to check whether the assumption (clockwise order, starting at the top-left corner) holds, things will go wrong some day.
That goes for weaker assumptions too: we cannot enforce a clockwise, or even an ordered, path for the points via XML schema.
So how much more expensive would a robust solution in utils.py be? Every PAGE element has coordinates, every page goes through several processing steps involving core's functions.
One more thing: are 4 points really needed? If we use rectangles, top-left and bottom-right would be sufficient.
I interpret page:Coords to be points of a polygon, not necessarily a rectangle, cf. https://github.com/PRImA-Research-Lab/PAGE-XML/blob/master/pagecontent/schema/pagecontent.xsd#L441
A two-coordinate tuple could be a special case but translating between all these representations is confusing enough as it is IMHO.
Yes OK.
But then I would suggest not relying on any ordering of the points. If you have more or fewer than 4 points (technically a triangle is a polygon, too), you need a more robust way to calculate the corresponding bounding boxes anyway.
And you cannot check/enforce the ordering in a schema AFAIK.
The problem behind this issue was a segfault in Tesseract for certain words, IIRC.
I wouldn't want to enforce this via schema, I was just curious how this happens since the coordinates are shifted only in these specific cases.
Good point about polygons and bounding boxes. So far we do not support bounding polygons with boxes at all, because we assume coordinates to be coordinates.
@cneud @wrznr @tboenig Do we have samples of non-rectangular text blocks to test?
@kba Yes, we have plenty of examples (incl. some with PAGE ground truth) of non-rectangular text blocks. I will try to upload some samples over the next few days.
A simple programmatic solution for this would be to calculate the min/max x and y coordinates over all points. I have a simple fix for this, if you are interested.
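For readers following along, here is a minimal sketch of the min/max idea. It is not the fix offered above or the code in utils.py; the function names `bbox_from_points` and `parse_points` are made up for illustration, and the `"x,y x,y ..."` string format is an assumption about how the points are serialized.

```python
def parse_points(points_attr):
    """Parse a string such as "10,20 110,20" into [(10, 20), (110, 20)].

    Assumption: points are whitespace-separated "x,y" integer pairs.
    """
    return [tuple(int(v) for v in pair.split(",")) for pair in points_attr.split()]


def bbox_from_points(points):
    """Return (min_x, min_y, max_x, max_y) for any number of (x, y) points.

    The result does not depend on the order of the points, so it works
    for rectangles, triangles and arbitrary polygons alike.
    """
    points = list(points)
    if not points:
        raise ValueError("at least one point is required")
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


# A rectangle whose points do not start at the top-left corner
rect = parse_points("110,80 10,80 10,20 110,20")
print(bbox_from_points(rect))
```

Because only the extremes are used, a rectangle listed in any order or starting at any corner gives the same box.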
> So how much more expensive would a robust solution in utils.py be?
Not that much I guess. I'll send a PR.
> I do have a simple fix for this -- if you are interested in it.
Didn't see this before. Contributions welcome :)
For posterity's sake: the original problem was a bug in Transkribus that has been fixed upstream and will be rolled out in the next release. HT @tboenig