Comments (14)
I just looked up the issue myself. There's something wrong with the ratio-determination:
The number of 0's is measured against the size of the complete image instead of the size of the text box.
If you correct that, the issue is gone.
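A minimal sketch of the difference, assuming a binarized NumPy array in which 1 marks foreground. The helper `fg_ratio_in_box` and the box coordinates are hypothetical names used for illustration; they are not taken from archive-pdf-tools:

```python
import numpy as np

def fg_ratio_in_box(binary, left, top, right, bottom):
    """Fraction of foreground (1) pixels, measured against the box area only."""
    box = binary[top:bottom, left:right]
    if box.size == 0:
        return 0.0
    return float(np.count_nonzero(box)) / box.size

# Hypothetical 100x100 page with one 10x10 text box that is half foreground.
page = np.zeros((100, 100), dtype=np.uint8)
page[20:30, 40:45] = 1

ratio_box = fg_ratio_in_box(page, left=40, top=20, right=50, bottom=30)
# Same pixel count, but divided by the size of the whole page.
ratio_page = float(np.count_nonzero(page)) / page.size
```

Dividing by the page size makes the same box look almost empty, which is the kind of skew described above.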
from archive-pdf-tools.
By the way, nice tool, PyCharm. It looks somewhat like the IntelliJ that I tried before.
I created a pull request for the solution. You might even be tempted to merge it into master.
Thanks for finding this, this is indeed a real problem. I will take a look and see if this fix is ok, but will need to do some local testing on my test images to make sure everything looks ok.
So with the change from your pull request, there are regressions in some of my tests (before/after comparison images not preserved).
Removing the `*100` makes it better, but some other images still regress, so I will need to spend a bit more time on this later. Thanks for noticing. I also found another typo: it writes `- ones` instead of `- ones_i`.
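To illustrate why the `*100` matters, here is a small sketch (not the project's code) that computes the same foreground ratio once as a fraction and once as a percentage, then compares each against 0.3. The 0.3 value is the one that appears in the diff later in this thread; whether it was meant as a fraction or as a percentage is an assumption here.

```python
import numpy as np

# Hypothetical 10x10 box in which 20 of 100 pixels are foreground.
box = np.zeros((10, 10), dtype=np.uint8)
box[:, :2] = 1

ones = np.count_nonzero(box)
zero = box.size - ones

ratio_fraction = ones / (zero + ones)         # value in the range 0..1
ratio_percent = (ones / (zero + ones)) * 100  # value in the range 0..100

# The same threshold gives opposite answers depending on the scale.
below_as_fraction = ratio_fraction < 0.3
below_as_percent = ratio_percent < 0.3
```

With the percentage scale, only boxes with less than 0.3% foreground pass the comparison, so the scale and the thresholds need to agree.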
I don't remember exactly what I toyed with, but I definitely tried to do something like that: trying to rely on what makes a character vs noise in the bounding box. I think my idea was to use the "ratio": characters usually don't fill up most of the pixels in the bounding box, and if you apply that to an entire word or even a line, then any outliers (e.g. 'w') will be filtered out. That's what the original code was designed to do. On top of that, I then added some simple noise estimation to filter out noise.
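The ratio idea can be sketched roughly as follows. This is an illustration under assumptions (a binarized word or line image in which 1 is supposed to mark ink, and a hypothetical 0.5 cut-off), not the code in `mrc.py`:

```python
import numpy as np

def pick_polarity(thres, max_ink=0.5):
    """Return thres, or its inverse if 'ink' fills most of the box.

    Assumption: over a whole word or line, text covers a minority of
    the pixels, so a majority of 1s suggests the polarity is flipped.
    """
    if thres.size == 0:
        return thres
    ink = np.count_nonzero(thres) / thres.size
    if ink > max_ink:
        return np.logical_not(thres).astype(thres.dtype)
    return thres

# Hypothetical light-on-dark line: background was thresholded as 1.
line = np.ones((8, 40), dtype=np.uint8)
line[2:6, 5:15] = 0  # the actual glyph pixels

fixed = pick_polarity(line)
```

A single wide glyph could exceed the cut-off on its own, which is why the comment above applies the ratio to a whole word or line instead of a single character.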
I think I locally have some changes that improve somewhat over the current code in my test cases and don't have the bug you found, but I'll need to do further evaluation.
Continuing on your thought, I would expect an if-construction like this:
diff --git a/internetarchivepdf/mrc.py b/internetarchivepdf/mrc.py
index f6290db..e2bb6c0 100644
--- a/internetarchivepdf/mrc.py
+++ b/internetarchivepdf/mrc.py
@@ -237,12 +237,11 @@ def create_hocr_mask(img, mask_arr, hocr_word_data, downsample=None, dpi=None, t
zero_i = thres_invert[np.where(thres_invert == 0)].size
inv_ratio = (ones_i/(zero_i+ones_i))*100
+
+
if ratio < 0.3 or inv_ratio < 0.3:
th = None
- perc_larger = 0.
- if inv_ratio != 0.0:
- perc_larger = (ratio / inv_ratio) * 100
if inv_ratio > 0.2 and ratio < 0.2:
th = thres
@@ -261,9 +260,16 @@ def create_hocr_mask(img, mask_arr, hocr_word_data, downsample=None, dpi=None, t
th = thres_invert
elif ratio < 0.2:
th = thres
-
- if th is not None:
- mask_arr[top:bottom, left:right] = th
+ else:
+ perc_larger = 0.
+ if inv_ratio != 0.0:
+ perc_larger = (ratio / inv_ratio) * 100
+ if perc_larger < 50:
+ th = thres
+ else:
+ th = thres_invert
+ if th is not None:
+ mask_arr[top:bottom, left:right] = th
if timing_data is not None:
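Read on its own, the `else` branch added by this diff amounts to the fallback below. This is a sketch for illustration only: `fallback_choice` is a hypothetical name, it returns labels in place of the actual arrays, and the conditions between the two hunks are not shown in the diff, so they are left out here.

```python
def fallback_choice(ratio, inv_ratio):
    """Mirror the added else-branch: pick a mask based on perc_larger."""
    perc_larger = 0.0
    if inv_ratio != 0.0:
        # How large ratio is relative to inv_ratio, as a percentage.
        perc_larger = (ratio / inv_ratio) * 100
    if perc_larger < 50:
        return "thres"
    return "thres_invert"

# Hypothetical inputs
choice_a = fallback_choice(10.0, 40.0)  # perc_larger is 25
choice_b = fallback_choice(40.0, 10.0)  # perc_larger is 400
choice_c = fallback_choice(5.0, 0.0)    # inv_ratio is 0, perc_larger stays 0
```

Note that when `inv_ratio` is 0 the fallback always selects `thres`, because `perc_larger` keeps its initial value.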
By the way, DjVu has an expired patented algorithm for foreground/background separation: https://patents.google.com/patent/US6901169
However, it performs worse when there's noise in the scan that looks like holes in the mask:
jwilk/didjvu#21
Just a heads up, I'm branching the current code to a `1.4.x` branch so that I can build future archive.org releases based on that, which allows for master to have more "compression" breaking changes.
We performed a lot of QA on the output with the current parameters/code, so I don't feel confident just rolling out changes, however minor. This should set us up to make some more breaking changes in `master`.
If you're looking to fix deskew issues mostly automatically, for text content, we use this: https://git.archive.org/archivecd/tesserotate/
It's only applied to our books and microfilm, but it works wonders, in my experience. It's based on Tesseract. It's combined with some heuristics, but by itself it works pretty decently. (Better than leptonica's deskew imho)
In the past I decided not to include deskew and such in archive-pdf-tools, as such preprocessing could be done by another tool prior to invoking `recode_pdf`. That's why stuff like this is not included.