
Comments (16)

JKamlah commented on June 16, 2024

The PRImA developers don't use the current version of itextpdf. I guess this problem could stem from that. Sorry, I am not really deep into the specifics here. Maybe someone else can help.

from ocrd_pagetopdf.

bertsky commented on June 16, 2024

Okay, so switching to OpenJDK 8 helped make that warning go away.

But I still get no text layer! (My source file group had TextEquiv at the word, line and region level.)

How is this supposed to work?


JKamlah commented on June 16, 2024

Did you set the "-text-source" parameter correctly?
https://github.com/JKamlah/ocrd_pagetopdf/blob/master/ocrd-tool.json#L38


JKamlah commented on June 16, 2024

Maybe I should set it to a default value like "T".


bertsky commented on June 16, 2024

> Did you set the "-text-source" parameter correctly?
> https://github.com/JKamlah/ocrd_pagetopdf/blob/master/ocrd-tool.json#L38

Here's what I did:

ocrd-pagetopdf -m assets/data/kant_aufklaerung_1784/data/mets.xml -I OCR-D-GT-PAGE,OCR-D-IMG -O OCR-D-PDF -p '{"outlines": "T", "text-source": "T"}'

where assets is our GT test repo.


JKamlah commented on June 16, 2024

My bad. It works with:
ocrd-pagetopdf -m assets/data/kant_aufklaerung_1784/data/mets.xml -I OCR-D-GT-PAGE,OCR-D-IMG -O OCR-D-PDF -p '{"outlines": "R", "text-source": "R"}'


bertsky commented on June 16, 2024

> The data seems to have some negative coordinate values; with the added script it works:
> ocrd-pagetopdf -m assets/data/kant_aufklaerung_1784/data/mets.xml -I OCR-D-GT-PAGE,OCR-D-IMG -O OCR-D-PDF -p '{"outlines": "T", "text-source": "R", "negative2zero":true}'

No, negative2zero makes no difference. But I do get text when I set text-source to anything other than T (both word and region level work – without negative2zero). So at least there is a bug at the textline level.

Also, I don't see outlines other than on the word level. Perhaps because the other levels have non-rectangular polygons?
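
For readers unfamiliar with the negative2zero option discussed above: a minimal sketch of what clamping negative coordinates to zero could look like for a PAGE-XML points string. This is not the script shipped with ocrd_pagetopdf; the function name clamp_points and the sample coordinates are hypothetical and only illustrate the idea.

```python
def clamp_points(points: str) -> str:
    """Clamp negative x/y values in a PAGE-XML style points string to zero.

    Hypothetical helper for illustration; expects "x1,y1 x2,y2 ..." with
    integer coordinates.
    """
    clamped = []
    for pair in points.split():
        x, y = (int(value) for value in pair.split(","))
        # Replace any negative coordinate with 0, keep the rest unchanged
        clamped.append(f"{max(x, 0)},{max(y, 0)}")
    return " ".join(clamped)


# Made-up polygon with two negative values
print(clamp_points("-5,10 120,-3 120,80 -5,80"))  # prints "0,10 120,0 120,80 0,80"
```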


bertsky commented on June 16, 2024

Another problem seems to be that letters like ſ are lost.


JKamlah commented on June 16, 2024

The lost-letter problem should be solved by using another font.


JKamlah commented on June 16, 2024

"font":"/usr/share/fonts/truetype/ubuntu/UbuntuMono-R.ttf"


bertsky commented on June 16, 2024

> The lost-letter problem should be solved by using another font.
> "font":"/usr/share/fonts/truetype/ubuntu/UbuntuMono-R.ttf"

Indeed, thanks!


JKamlah commented on June 16, 2024

I think you could be right about the polygons. I will test it with a transformation.


bertsky commented on June 16, 2024

> So at least there is a bug at the textline level.

The reason is simply that this has since been renamed from T to L!
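
To make the rename concrete, here is a small sketch that assembles the same call as above with "text-source" set to "L" instead of "T", plus the optional font mentioned earlier in the thread. The helper build_pagetopdf_command is hypothetical (it is not part of ocrd_pagetopdf), and using "R" for outlines just mirrors the earlier working command; whether other outline levels behave as expected is not confirmed here.

```python
import json
import shlex


def build_pagetopdf_command(mets, input_groups, output_group,
                            outlines, text_source="L", font=None):
    """Build an ocrd-pagetopdf argument list (hypothetical helper).

    text_source defaults to "L", assuming the line level is now named "L"
    rather than "T", as stated in this thread.
    """
    params = {"outlines": outlines, "text-source": text_source}
    if font:
        # Optional font path, e.g. one that covers glyphs like the long s
        params["font"] = font
    return [
        "ocrd-pagetopdf",
        "-m", mets,
        "-I", ",".join(input_groups),
        "-O", output_group,
        "-p", json.dumps(params),
    ]


cmd = build_pagetopdf_command(
    "assets/data/kant_aufklaerung_1784/data/mets.xml",
    ["OCR-D-GT-PAGE", "OCR-D-IMG"],
    "OCR-D-PDF",
    outlines="R",
    font="/usr/share/fonts/truetype/ubuntu/UbuntuMono-R.ttf",
)
# Print a shell-quoted command line that can be pasted into a terminal
print(shlex.join(cmd))
```

The list form can also be passed directly to subprocess.run if the processor is installed.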


JKamlah commented on June 16, 2024

Thanks: 4933e4c


bertsky commented on June 16, 2024

> I think you could be right about the polygons. I will test it with a transformation.

Should not be the reason: the converter appears to use polygons itself.


bertsky commented on June 16, 2024

@JKamlah can you please document the JDK version required (with a pointer to prima-page-to-pdf, in case that should change in the future)?

