There are a few things to improve in the mask generation: The Sauv


Improve mask and background generation about archive-pdf-tools HOT 9 OPEN

internetarchive commented on May 22, 2024

Improve mask and background generation

from archive-pdf-tools.

Comments (9)

MerlijnWajer commented on May 22, 2024

This would probably also help with the text backdrop/shade on the background generation - if we improve the mask generation, that should probably start working better as well.


MerlijnWajer commented on May 22, 2024

In particular, the scribo implementation(s) could be helpful: https://github.com/OCR-D/olena/tree/master/scribo/scribo

ocrd/olena:latest contains scribo-cli but also its OCR-D wrapper ocrd-olena-binarize (which uses bash and xmlstarlet for all the METS/PAGE-XML interfacing) and is ~300MB (built from https://github.com/OCR-D/ocrd_olena/blob/master/Dockerfile)
ocrd/olena:build-olena contains scribo-cli only and is ~100MB (built from https://github.com/OCR-D/ocrd_olena/blob/master/build-olena.dockerfile)


MerlijnWajer commented on May 22, 2024

I've been working on supporting the scribo (https://github.com/OCR-D/olena/tree/master/scribo/scribo) binarisation methods, and in particular looked at the intersection of singh and wolf/sauvola_ms, since singh doesn't seem to produce letters as fat as most other algorithms do. With that in place, the foreground text is definitely better colour-wise, and also sharper, but the background has more artifacts, since the borders of the text are not removed from the background.

This makes me wonder if it makes sense to introduce some third layer (not in the final result), which contains the text borders and other pixels that are (for example) not in singh but are in wolf. We would then ultimately place those pixels in the foreground image, but not use them when creating the 'smoothed' foreground image. But we would use them (the 'second' mask) when smoothing out the background.
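A minimal sketch of that layering idea, on plain nested lists so it has no dependencies. The names `split_layers`, `union`, `core` and `border` are hypothetical and only illustrate the set logic described above; this is not the code in archive-pdf-tools, and real masks would come from the actual singh and wolf binarisers rather than the toy rows used here.

```python
from typing import List, Tuple

Mask = List[List[bool]]  # True marks a text (foreground) pixel


def split_layers(singh: Mask, wolf: Mask) -> Tuple[Mask, Mask]:
    """Return (core, border) masks.

    core:   pixels that both binarisations mark as text.
    border: pixels that only the fatter binarisation (wolf) marks as text,
            i.e. the temporary 'third layer' with text borders.
    """
    core = [[s and w for s, w in zip(srow, wrow)]
            for srow, wrow in zip(singh, wolf)]
    border = [[w and not s for s, w in zip(srow, wrow)]
              for srow, wrow in zip(singh, wolf)]
    return core, border


def union(a: Mask, b: Mask) -> Mask:
    """Pixel-wise OR of two masks of the same shape."""
    return [[x or y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy 1x5 row: singh finds a thin stroke, wolf finds the stroke plus its halo.
singh = [[False, False, True, False, False]]
wolf = [[False, True, True, True, False]]

core, border = split_layers(singh, wolf)

# Pixels that end up in the foreground image of the final result.
placed_in_foreground = union(core, border)
# Pixels ignored when smoothing out the background (the 'second' mask).
excluded_from_background = union(core, border)
# Only these pixels feed the 'smoothed' foreground colour.
colour_samples = core
```

In this sketch the border pixels are kept out of `colour_samples`, so they do not influence the smoothed foreground colour, but they are still excluded from the background.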


MerlijnWajer commented on May 22, 2024

After messing around a bit and adding the 'extra' temporary layer, I think the background generation has gotten quite a bit better. The mask is created using Singh's algorithm, with any other binarisation mixed in to filter out some noise, and then the hOCR layer is mixed in with Singh to find the borders/backdrop of text. Will push code in the next few days.

(E: Left is new, right is old)


MerlijnWajer commented on May 22, 2024

Another example with a large newspaper, left is old, right is new.


MerlijnWajer commented on May 22, 2024

We could/should also consider swapping out the sauvola algorithm for the sauvola_ms algorithm with the right window size -- that might further improve quality and compress masks.
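For reference, a minimal pure-Python sketch of plain single-scale Sauvola thresholding, to show where the window size enters. It is not sauvola_ms (the multi-scale variant from scribo) and not the implementation used in archive-pdf-tools; the function name `sauvola_mask` and the default values for `window`, `k` and `r` are assumptions chosen for illustration. The per-pixel threshold is `mean * (1 + k * (std / r - 1))`, computed over the local window.

```python
from math import sqrt
from typing import List


def sauvola_mask(img: List[List[int]], window: int = 3,
                 k: float = 0.34, r: float = 128.0) -> List[List[bool]]:
    """Return a mask where True marks pixels darker than the local threshold.

    img is a greyscale image as rows of 0..255 integers. The window is
    clipped at the image borders rather than padded.
    """
    h, w = len(img), len(img[0])
    half = window // 2
    mask = []
    for y in range(h):
        row = []
        for x in range(w):
            # Gather the neighbourhood around (x, y), clipped to the image.
            vals = [img[yy][xx]
                    for yy in range(max(0, y - half), min(h, y + half + 1))
                    for xx in range(max(0, x - half), min(w, x + half + 1))]
            mean = sum(vals) / len(vals)
            var = sum((v - mean) ** 2 for v in vals) / len(vals)
            # Sauvola threshold: lower in flat regions, closer to the mean
            # in high-contrast regions.
            threshold = mean * (1.0 + k * (sqrt(var) / r - 1.0))
            row.append(img[y][x] < threshold)
        mask.append(row)
    return mask


# Toy page: light background with one dark pixel in the middle.
page = [
    [200, 200, 200],
    [200, 20, 200],
    [200, 200, 200],
]
text_mask = sauvola_mask(page)
```

A larger `window` makes the statistics more global, which is the knob the multi-scale variant effectively varies per region.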


MerlijnWajer commented on May 22, 2024

I have improved the background generation significantly in a simpler manner in this commit: 3cbcc90

It doesn't also make the images sharper, but arguably that shouldn't happen if they weren't sharp to begin with. Leaving this open for now, but many of the "shade backdrop" problems are gone.


rmast commented on May 22, 2024

Have you seen the OCR-D project, which contains lots of binarizations? OCR4all is trying to use it in an upcoming edition.

Gamera 4 also provides some binarization algorithms, for example an incomplete DjVu binarization that leaves out the patented part (the software patent is going to expire in two months), which helps invert white-on-black parts in the binarized image.


MerlijnWajer commented on May 22, 2024

Yes, I'm in pretty good contact with the OCR-D people, and the branch linked here uses various algorithms from the OCR-D folks. I ended up going with just Sauvola recently because my implementation is that much faster than basically anything in the scribo toolbox, and performance matters for archive.org since we deal with millions of pages a day.

I'm happy to add support for alternative binarisation methods, btw. What I have found, however, is that it's really hard to find a 'one size fits all' approach, and the current code (1.4.9) is on par with or often even better than the commercial Foxit compression.

