Comments (16)
The PRImA developers don't use the current version of itextpdf. I guess this problem may stem from that. Sorry, I am not really deep into the specifics here. Maybe someone else can help.
from ocrd_pagetopdf.
Okay, so switching to OpenJDK 8 helped make that warning go away.
But I still get no text layer! (My source file group had TextEquiv at the word, line and region level.)
How is this supposed to work?
Did you set the "-text-source" parameter correctly?
https://github.com/JKamlah/ocrd_pagetopdf/blob/master/ocrd-tool.json#L38
Maybe I should set it to a default value like "T".
Did you set the "-text-source" parameter correctly?
https://github.com/JKamlah/ocrd_pagetopdf/blob/master/ocrd-tool.json#L38
Here's what I did:
ocrd-pagetopdf -m assets/data/kant_aufklaerung_1784/data/mets.xml -I OCR-D-GT-PAGE,OCR-D-IMG -O OCR-D-PDF -p '{"outlines": "T", "text-source": "T"}'
where assets is our GT test repo.
My bad. It works with:
ocrd-pagetopdf -m assets/data/kant_aufklaerung_1784/data/mets.xml -I OCR-D-GT-PAGE,OCR-D-IMG -O OCR-D-PDF -p '{"outlines": "R", "text-source": "R"}'
The data seems to have some negative coordinate values, with the added script it works:
ocrd-pagetopdf -m assets/data/kant_aufklaerung_1784/data/mets.xml -I OCR-D-GT-PAGE,OCR-D-IMG -O OCR-D-PDF -p '{"outlines": "T", "text-source": "R", "negative2zero":true}'
No, negative2zero makes no difference. But I do get text when I set text-source to anything other than T (both word and region level work, without negative2zero). So at least there is a bug at the textline level.
Also, I don't see outlines other than on the word level. Perhaps because the other levels have non-rectangular polygons?
Another problem seems to be that letters like ſ are lost.
The lost-letter problem should be solved by using another font:
"font":"/usr/share/fonts/truetype/ubuntu/UbuntuMono-R.ttf"
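As a rough sketch, the font override could be combined with the call that worked earlier in this thread. This assumes the font file exists at that path on your system and that "font" sits in the same -p JSON as the other parameters; the font and params variables are only illustrative helpers. The command is printed rather than run, so it can be checked first.

```shell
# Font path taken from the comment above; adjust if your system stores it elsewhere.
font="/usr/share/fonts/truetype/ubuntu/UbuntuMono-R.ttf"

# Warn (but do not abort) if the font file is missing.
if [ ! -f "$font" ]; then
  echo "warning: font file not found: $font" >&2
fi

# Build the parameter JSON; "R" is the value reported as working earlier in the thread.
params=$(printf '{"outlines": "R", "text-source": "R", "font": "%s"}' "$font")

# Print the full command for inspection; drop the echo to actually run it.
echo "ocrd-pagetopdf -m assets/data/kant_aufklaerung_1784/data/mets.xml -I OCR-D-GT-PAGE,OCR-D-IMG -O OCR-D-PDF -p '$params'"
```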
The lost-letter problem should be solved by using another font:
"font":"/usr/share/fonts/truetype/ubuntu/UbuntuMono-R.ttf"
Indeed, thanks!
I think you could be right about the polygons. I will test it with a transformation.
So at least there is a bug at the textline level.
The reason is simply that this has since been renamed from T to L!
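For reference, a minimal sketch of the earlier invocation with the renamed value. It assumes only what is stated here, namely that the text-line code for text-source is now L. Whether "outlines" uses the same letter is not confirmed in this thread, so it is left out. The params and cmd variables are illustrative helpers, and the command is echoed instead of executed.

```shell
# Parameter JSON with the renamed text-line value ("L" instead of "T").
# Assumption: "outlines" is omitted because the rename is only confirmed for text-source.
params='{"text-source": "L"}'

# Assemble the command as a string so it can be reviewed before running.
cmd="ocrd-pagetopdf -m assets/data/kant_aufklaerung_1784/data/mets.xml -I OCR-D-GT-PAGE,OCR-D-IMG -O OCR-D-PDF -p '$params'"

# Print the command; run it manually once it looks right.
echo "$cmd"
```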
Thanks: 4933e4c
I think you could be right about the polygons. I will test it with a transformation.
Should not be the reason: the converter appears to use polygons itself.
@JKamlah can you please document the JDK version required (with a pointer to prima-page-to-pdf, in case that should change in the future)?