Comments (9)

MerlijnWajer commented on May 22, 2024

This would probably also help with the text backdrop/shade in the background generation: if we improve the mask generation, that should start working better as well.

from archive-pdf-tools.

MerlijnWajer commented on May 22, 2024

In particular, the scribo implementation(s) could be helpful: https://github.com/OCR-D/olena/tree/master/scribo/scribo

ocrd/olena:latest contains scribo-cli but also its OCR-D wrapper ocrd-olena-binarize (which uses bash and xmlstarlet for all the METS/PAGE-XML interfacing) and is ~300MB (built from https://github.com/OCR-D/ocrd_olena/blob/master/Dockerfile)
ocrd/olena:build-olena contains scribo-cli only and is ~100MB (built from https://github.com/OCR-D/ocrd_olena/blob/master/build-olena.dockerfile)

MerlijnWajer commented on May 22, 2024

I've been working on implementing support for the scribo (https://github.com/OCR-D/olena/tree/master/scribo/scribo) binarisation methods, and in particular looked at the intersection of singh and wolf/sauvola_ms, since singh doesn't seem to produce letters as fat as most other algorithms do. With that in place, the foreground text is definitely better colour-wise, and also sharper, but the background has more artifacts, since the borders of the text are not removed from the background.

This makes me wonder if it makes sense to introduce some third layer (not in the final result), which contains the text borders and other pixels that are (for example) not in singh but are in wolf. We would then ultimately place those pixels in the foreground image, but not use them when creating the 'smoothed' foreground image. But we would use them (the 'second' mask) when smoothing out the background.
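
To make the layer idea concrete, here is a minimal, dependency-free sketch. It assumes both binarisers return masks of the same size in which True means "text"; the function name `combine_masks` and the list-of-lists representation are hypothetical and not taken from archive-pdf-tools (real code would more likely operate on NumPy arrays).

```python
# Masks are lists of rows of booleans; True marks a pixel that the
# binariser classified as text. All names here are illustrative only.

def combine_masks(singh, wolf):
    """Split two binarisation masks into the layers discussed above.

    Returns (core, border, background_exclude):
      core               -- pixels set in both masks (thin, confident text)
      border             -- pixels in wolf but not in singh (text edges)
      background_exclude -- core plus border, to keep out of the background
    """
    core = [[s and w for s, w in zip(srow, wrow)]
            for srow, wrow in zip(singh, wolf)]
    border = [[w and not s for s, w in zip(srow, wrow)]
              for srow, wrow in zip(singh, wolf)]
    background_exclude = [[c or b for c, b in zip(crow, brow)]
                          for crow, brow in zip(core, border)]
    return core, border, background_exclude


# Toy 1x5 scanline: wolf produces a fatter stroke than singh.
singh = [[False, False, True, False, False]]
wolf = [[False, True, True, True, False]]
core, border, background_exclude = combine_masks(singh, wolf)
print(core)    # [[False, False, True, False, False]]
print(border)  # [[False, True, False, True, False]]
```

Under this reading, `core` alone would drive the 'smoothed' foreground, while `background_exclude` would mark everything to leave out when smoothing the background.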

MerlijnWajer commented on May 22, 2024

[image: diff]

After messing around a bit and adding the 'extra' temporary layer, I think the background generation has gotten quite a bit better. The mask is created using Singh's algorithm, with any other binarisation mixed in to filter out some noise, and then the hOCR layer is mixed in with Singh to find the borders/backdrop of text. Will push code in the next few days.
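
One possible reading of "mixing in the hOCR layer", sketched under loud assumptions: word boxes come from the `bbox x0 y0 x1 y1` property of hOCR `title` attributes, the right and bottom edges are treated as exclusive, and every pixel inside a word box that the text mask did not claim counts as backdrop. The helper names `parse_bbox` and `backdrop_mask` are hypothetical, and this is not the code that was pushed.

```python
import re

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from an hOCR title attribute."""
    match = BBOX_RE.search(title)
    if match is None:
        raise ValueError(f"no bbox in {title!r}")
    return tuple(int(v) for v in match.groups())


def backdrop_mask(text_mask, boxes):
    """Mark pixels inside any word box that the text mask did not claim."""
    height, width = len(text_mask), len(text_mask[0])
    backdrop = [[False] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        # Clamp the box to the page; x1/y1 are treated as exclusive here.
        for y in range(max(y0, 0), min(y1, height)):
            for x in range(max(x0, 0), min(x1, width)):
                backdrop[y][x] = not text_mask[y][x]
    return backdrop


# 3x4 toy page with one word box covering columns 1..2 of rows 0..1.
text_mask = [
    [False, True, False, False],
    [False, False, True, False],
    [False, False, False, False],
]
boxes = [parse_bbox("bbox 1 0 3 2; x_wconf 93")]
backdrop = backdrop_mask(text_mask, boxes)
for row in backdrop:
    print(row)
```

Pixels outside every word box stay False, so stray noise far from recognised text would not be pulled into the backdrop layer.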

(Edit: left is new, right is old.)

MerlijnWajer commented on May 22, 2024

[image: diff2]

Another example with a large newspaper; left is old, right is new.

MerlijnWajer commented on May 22, 2024

We could/should also consider swapping out the sauvola algorithm for the sauvola_ms algorithm with the right window size -- that might further improve quality and compress masks.
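
For reference, a minimal single-scale Sauvola sketch showing where the window size enters. It uses the standard threshold T = m * (1 + k * (s / R - 1)), with m and s the local mean and standard deviation, computed via summed-area tables. This is not the archive-pdf-tools implementation and not scribo's multi-scale sauvola_ms; the defaults for `window`, `k` and `r` are illustrative assumptions.

```python
import math


def integral(img, power=1):
    """Summed-area table of img**power with an extra zero row/column."""
    h, w = len(img), len(img[0])
    table = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x] ** power
            table[y + 1][x + 1] = table[y][x + 1] + row_sum
    return table


def window_sum(table, x0, y0, x1, y1):
    """Sum over the half-open rectangle [x0, x1) x [y0, y1)."""
    return table[y1][x1] - table[y0][x1] - table[y1][x0] + table[y0][x0]


def sauvola(img, window=15, k=0.34, r=128.0):
    """Return a mask where True marks pixels darker than the local threshold."""
    h, w = len(img), len(img[0])
    half = window // 2
    sums = integral(img, 1)
    squares = integral(img, 2)
    mask = []
    for y in range(h):
        y0, y1 = max(y - half, 0), min(y + half + 1, h)
        row = []
        for x in range(w):
            x0, x1 = max(x - half, 0), min(x + half + 1, w)
            n = (x1 - x0) * (y1 - y0)  # window is clipped at the page edges
            mean = window_sum(sums, x0, y0, x1, y1) / n
            var = window_sum(squares, x0, y0, x1, y1) / n - mean * mean
            std = math.sqrt(max(var, 0.0))
            threshold = mean * (1.0 + k * (std / r - 1.0))
            row.append(img[y][x] < threshold)
        mask.append(row)
    return mask


# 5x5 light page (grey value 200) with one dark pixel in the centre.
img = [[200] * 5 for _ in range(5)]
img[2][2] = 20
mask = sauvola(img, window=5)
print(sum(v for row in mask for v in row))  # 1: only the centre pixel is text
```

A larger window averages over more of the page, which tends to suppress isolated noise but can lose thin strokes on uneven backgrounds; a multi-scale variant tries to choose the scale per region instead of using one global window.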

MerlijnWajer commented on May 22, 2024

I have improved the background generation significantly, and in a simpler manner, in this commit: 3cbcc90

It doesn't also make the images sharper, but arguably that shouldn't happen if they weren't sharp to begin with. Leaving this open for now, but many of the "shade backdrop" problems are gone.

rmast commented on May 22, 2024

Have you seen the OCR-D project, which contains lots of binarizations? OCR4all is trying to use it in an upcoming edition.

Gamera 4 also provides some binarization algorithms, for example an incomplete DjVu binarization that leaves out the part covered by a software patent (which is going to expire in two months); that part helps with inverting white-on-black regions in the binarized image.

MerlijnWajer commented on May 22, 2024

Yes, I'm in pretty good contact with the OCR-D people, and the branch linked here uses various algorithms from the OCR-D folks. I ended up going with just Sauvola recently because my implementation is that much faster than basically anything in the scribo toolbox, and performance matters for archive.org since we deal with millions of pages a day.

I'm happy to add support for alternative binarisation methods, btw. What I have found, however, is that it's really hard to find a 'one size fits all' approach, and the current code (1.4.9) is on par with, or often even better than, the commercial Foxit compression.
