Comments (9)
OK, so a stress test of sorts, that should be doable.
from assets.
@kba I guess this can be closed?
I don't remember what I meant by this. I'll try to open more descriptive issues in the future 😬
I think this was to have a realistic test case for performance issues with large METS. Large could mean many fileGrps, many files therein, or many pages – or any combination of these. This came up earlier when some change to the PAGE model (esp. the pageId lookup) severely degraded performance on my workspaces, to the point where they became unusable.
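To make the three dimensions concrete, here is a minimal sketch of a synthetic METS generator using only the Python standard library. It is not how the assets repo builds its test data: the function names `build_mets` and `files_for_page`, the `OCR-D-STEP-NNN` fileGrp names, the `PHYS_NNNN` page IDs and the MIME type are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def q(prefix, tag):
    """Return the namespace-qualified tag name ElementTree expects."""
    return "{%s}%s" % (NS[prefix], tag)


def build_mets(n_groups, n_pages):
    """Build a synthetic METS tree: n_groups fileGrps with one file per page each."""
    root = ET.Element(q("mets", "mets"))
    file_sec = ET.SubElement(root, q("mets", "fileSec"))
    struct_map = ET.SubElement(root, q("mets", "structMap"), TYPE="PHYSICAL")
    seq = ET.SubElement(struct_map, q("mets", "div"), TYPE="physSequence")
    # One physical page div per page (IDs are hypothetical).
    pages = [
        ET.SubElement(seq, q("mets", "div"), TYPE="page", ID="PHYS_%04d" % p)
        for p in range(1, n_pages + 1)
    ]
    for g in range(1, n_groups + 1):
        grp_name = "OCR-D-STEP-%03d" % g  # hypothetical fileGrp name
        grp = ET.SubElement(file_sec, q("mets", "fileGrp"), USE=grp_name)
        for p, page in enumerate(pages, start=1):
            file_id = "%s_%04d" % (grp_name, p)
            f = ET.SubElement(
                grp, q("mets", "file"),
                ID=file_id, MIMETYPE="application/vnd.prima.page+xml",
            )
            ET.SubElement(f, q("mets", "FLocat"), {
                "LOCTYPE": "OTHER",
                "OTHERLOCTYPE": "FILE",
                q("xlink", "href"): "%s/%s.xml" % (grp_name, file_id),
            })
            # Link the file to its page in the physical structMap.
            ET.SubElement(page, q("mets", "fptr"), FILEID=file_id)
    return root


def files_for_page(root, page_id):
    """Naive pageId lookup: scan the structMap for the page, return its file IDs."""
    path = ".//mets:div[@ID='%s']/mets:fptr" % page_id
    return [fptr.get("FILEID") for fptr in root.findall(path, NS)]
```

Scaling `n_groups` and `n_pages` independently gives the "many fileGrps", "many pages" and combined cases. Wrapping repeated `files_for_page` calls in `time.perf_counter()` measurements would show how a linear-scan lookup behaves as the tree grows, which is the kind of degradation described above.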
probably sth like this? http://digital.slub-dresden.de/id336927223
> probably sth like this? http://digital.slub-dresden.de/id336927223
well, 300 pages is not that much of a stretch. How about: http://digital.slub-dresden.de/id507244877-18920000
That would cover the many pages scenario. But how about many fileGrps? The METS from Kitodo.Presentation is rather small (just FULLTEXT, ORIGINAL and various JPEG qualities). All I can think of is an OCR-D workspace after running lots of different workflows with many steps.
> That would cover the many pages scenario
Or rather: I could give you the METS built from https://github.com/bertsky/ocrd_publaynet – it contains 671407 pages in the training set and 56227 in the validation set.
my example above is 1400 pages, nothing compared to your publaynet though
> my example above is 1400 pages, nothing compared to your publaynet though
oh, right! Sorry, got confused. Yes, I do think the bible should be a test case. PubLayNet is an extreme (probably never used that way) – I actually recommend against having it included in the auto regression tests, as it's such a drag. (But it might help to have it somewhere ...)
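One way to "have it somewhere" without slowing down the regular regression run is an opt-in test. The following is only a sketch with the standard `unittest` module: the environment variable `RUN_HUGE_METS_TESTS`, the class name and the helper `page_ids` are hypothetical, and the test bodies are stand-ins for real workspace checks.

```python
import os
import unittest

# Hypothetical switch: the extreme case only runs when explicitly requested.
RUN_HUGE = os.environ.get("RUN_HUGE_METS_TESTS") == "1"


def page_ids(n_pages):
    """Stand-in for loading a workspace: just generate n_pages page IDs."""
    return ["PHYS_%07d" % p for p in range(1, n_pages + 1)]


class LargeMetsTest(unittest.TestCase):

    def test_many_pages(self):
        # Roughly the size of the 1400-page example; always part of the run.
        ids = page_ids(1400)
        self.assertEqual(len(set(ids)), 1400)

    @unittest.skipUnless(RUN_HUGE, "set RUN_HUGE_METS_TESTS=1 to run")
    def test_publaynet_scale(self):
        # PubLayNet sizes quoted above: 671407 training + 56227 validation pages.
        ids = page_ids(671407 + 56227)
        self.assertEqual(len(set(ids)), 727634)
```

Run as usual, the PubLayNet-scale test is reported as skipped; running with `RUN_HUGE_METS_TESTS=1` in the environment includes it.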