Comments (9)
This would probably also help with the text backdrop/shade in the generated background: if we improve the mask generation, that should start working better as well.
from archive-pdf-tools.
In particular, the scribo implementation(s) could be helpful: https://github.com/OCR-D/olena/tree/master/scribo/scribo
ocrd/olena:latest contains scribo-cli but also its OCR-D wrapper ocrd-olena-binarize (which uses bash and xmlstarlet for all the METS/PAGE-XML interfacing) and is ~300MB (built from https://github.com/OCR-D/ocrd_olena/blob/master/Dockerfile)
ocrd/olena:build-olena contains scribo-cli only and is ~100MB (built from https://github.com/OCR-D/ocrd_olena/blob/master/build-olena.dockerfile)
from archive-pdf-tools.
I've been working on implementing support for the scribo (https://github.com/OCR-D/olena/tree/master/scribo/scribo) binarisation methods, and in particular looked at the intersection of singh and wolf/sauvola_ms, since singh doesn't seem to produce letters as fat as most other algorithms do. With that in place, the foreground text is definitely better colour-wise, and also sharper, but the background has more artifacts, since the borders of the text are not removed from the background.
This makes me wonder if it makes sense to introduce some third layer (not in the final result), which contains the text borders and other pixels that are (for example) not in singh but are in wolf. We would then ultimately place those pixels in the foreground image, but not use them when creating the 'smoothed' foreground image. But we would use them (the 'second' mask) when smoothing out the background.
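A minimal sketch of that layer split, to make the idea concrete. This is not the code from the branch: the names `split_layers`, `union`, `strict` and `loose` are made up for illustration, and masks are plain nested lists of booleans rather than real image arrays.

```python
from typing import List, Tuple

Mask = List[List[bool]]


def split_layers(strict: Mask, loose: Mask) -> Tuple[Mask, Mask]:
    """Return (core, border) masks.

    core   : pixels marked by the strict binarisation (e.g. singh)
    border : pixels marked only by the loose one (e.g. wolf), i.e. the
             text borders that the strict mask leaves out
    """
    core = [row[:] for row in strict]
    border = [
        [in_loose and not in_strict for in_strict, in_loose in zip(srow, lrow)]
        for srow, lrow in zip(strict, loose)
    ]
    return core, border


def union(a: Mask, b: Mask) -> Mask:
    """Pixel-wise OR of two masks of the same size."""
    return [[x or y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# One scan line of a thin stroke: the loose mask is one pixel fatter on each side.
strict = [[False, False, True, False, False]]
loose = [[False, True, True, True, False]]

core, border = split_layers(strict, loose)

# Borders are left out when building the smoothed foreground ...
foreground_smoothing_mask = core
# ... but are treated as "not background" when smoothing the background.
background_exclusion_mask = union(core, border)
```

The border layer only exists while the images are being built; it never ends up as a separate layer in the output.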
from archive-pdf-tools.
After messing around a bit and adding the 'extra' temporary layer, I think the background generation has gotten quite a bit better. The mask is created using Singh's algorithm, with another binarisation mixed in to filter out some noise, and then the hOCR layer is mixed in with Singh to find the borders/backdrop of the text. Will push code in the next few days.
(Edit: left is new, right is old)
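One way the hOCR step described above could look, as a rough sketch only. It assumes the backdrop is "every pixel inside an hOCR word box that the text mask did not mark"; that is a guess, not the pushed code. `parse_bbox` and `backdrop_mask` are hypothetical helpers, and the box is treated as half-open (x1/y1 exclusive), which is also an assumption. The only thing taken from hOCR itself is the `bbox x0 y0 x1 y1` entry in the `title` attribute.

```python
import re
from typing import List, Tuple

Mask = List[List[bool]]
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def parse_bbox(title: str) -> Tuple[int, int, int, int]:
    """Extract (x0, y0, x1, y1) from an hOCR title attribute."""
    m = BBOX_RE.search(title)
    if m is None:
        raise ValueError(f"no bbox in title: {title!r}")
    x0, y0, x1, y1 = (int(g) for g in m.groups())
    return x0, y0, x1, y1


def backdrop_mask(text_mask: Mask, titles: List[str]) -> Mask:
    """Mark pixels that lie inside a word box but are not text pixels."""
    height = len(text_mask)
    width = len(text_mask[0]) if height else 0
    out = [[False] * width for _ in range(height)]
    for title in titles:
        x0, y0, x1, y1 = parse_bbox(title)
        # Clip the box to the image; x1/y1 are treated as exclusive here.
        for y in range(max(y0, 0), min(y1, height)):
            for x in range(max(x0, 0), min(x1, width)):
                out[y][x] = not text_mask[y][x]
    return out


# Tiny 4x3 page with a diagonal two-pixel "stroke" and one word box around it.
text_mask = [
    [False, True, False, False],
    [False, False, True, False],
    [False, False, False, False],
]
backdrop = backdrop_mask(text_mask, ["bbox 1 0 3 2; x_wconf 95"])
```

Pixels outside every word box stay unmarked, so noise elsewhere on the page does not leak into the backdrop.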
from archive-pdf-tools.
Another example with a large newspaper, left is old, right is new.
from archive-pdf-tools.
We could/should also consider swapping out the sauvola algorithm for the sauvola_ms algorithm with the right window size -- that might further improve quality and compress masks better.
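For reference, a naive sketch of plain single-scale Sauvola thresholding, to show where the window size comes in. It is neither the implementation used in this project nor scribo's sauvola_ms (the multi-scale variant is not shown). The threshold is the standard Sauvola formula T = m * (1 + k * (s / R - 1)), with m and s the local mean and standard deviation; the function name and the defaults `window=15`, `k=0.2` and `r=128` are choices made for this example.

```python
import math
from typing import List

Gray = List[List[int]]
Mask = List[List[bool]]


def sauvola_mask(img: Gray, window: int = 15, k: float = 0.2, r: float = 128.0) -> Mask:
    """Return a mask that is True where a pixel is darker than its local threshold."""
    half = window // 2
    height = len(img)
    width = len(img[0]) if height else 0
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the window around (x, y), clipped at the image borders.
            values = [
                img[yy][xx]
                for yy in range(max(0, y - half), min(height, y + half + 1))
                for xx in range(max(0, x - half), min(width, x + half + 1))
            ]
            mean = sum(values) / len(values)
            var = sum((v - mean) ** 2 for v in values) / len(values)
            threshold = mean * (1.0 + k * (math.sqrt(var) / r - 1.0))
            row.append(img[y][x] < threshold)
        mask.append(row)
    return mask


# A light 5x5 patch with a single dark pixel in the middle.
img = [[200] * 5 for _ in range(5)]
img[2][2] = 20
mask = sauvola_mask(img, window=3)
```

This version recomputes the statistics for every pixel, so it is only useful for reading; a real implementation would use integral images or a similar trick to get the local mean and variance cheaply.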
from archive-pdf-tools.
I have improved the background generation significantly, in a simpler manner, in this commit: 3cbcc90
It doesn't also make the images sharper, but arguably that shouldn't happen if they weren't sharp to begin with. Leaving this open for now, but most of the "shade backdrop" problems are gone.
from archive-pdf-tools.
Have you seen the OCR-D project, which contains lots of binarization methods? Ocr4all is trying to use it in an upcoming edition.
Gamera 4 also provides some binarization algorithms, for example an incomplete DjVu binarization that leaves out the part covered by a software patent (which is going to expire in two months); that part helps with inverting white-on-black areas in the binarized image.
from archive-pdf-tools.
Yes, I'm in pretty good contact with the OCR-D people, and the branch linked here uses various algorithms from the OCR-D folks. I recently ended up going with just Sauvola because my implementation is that much faster than basically anything in the scribo toolbox, and performance matters for archive.org since we deal with millions of pages a day.
I'm happy to add support for alternative binarisation methods, btw. What I have found, however, is that it's really hard to find a 'one size fits all' approach, and the current code (1.4.9) is on par with, or often even better than, the commercial Foxit compression.
from archive-pdf-tools.