
Comments (14)

rmast avatar rmast commented on May 16, 2024

I just looked up the issue myself. There's something wrong with the ratio-determination:
image

When determining the number of 0s, the size of the complete image is used instead of the size of the text box.
If you correct that, the issue is gone:
image
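To illustrate the kind of mistake described above, here is a minimal sketch in plain Python. It is not the code from `mrc.py`; the function names and the toy mask are hypothetical, and it assumes a binary mask where 1 means foreground. It only shows how dividing by the whole image size instead of the text box size skews the ratio.

```python
def count_ones(mask, left, top, right, bottom):
    """Count foreground (1) pixels inside the box [top:bottom, left:right]."""
    return sum(sum(row[left:right]) for row in mask[top:bottom])


def ratio_in_box(mask, left, top, right, bottom):
    """Foreground fraction measured against the text box area (intended behaviour)."""
    ones = count_ones(mask, left, top, right, bottom)
    area = (bottom - top) * (right - left)
    return ones / area if area else 0.0


def ratio_against_whole_image(mask, left, top, right, bottom):
    """Foreground fraction measured against the full image size (the suspected bug)."""
    ones = count_ones(mask, left, top, right, bottom)
    return ones / (len(mask) * len(mask[0]))


# Toy mask: 4 rows x 10 columns, with a 2x2 block of foreground pixels.
mask = [[0] * 10 for _ in range(4)]
for y in (1, 2):
    for x in (1, 2):
        mask[y][x] = 1

box = (1, 1, 3, 3)  # left, top, right, bottom (hypothetical text box)
print(ratio_in_box(mask, *box))               # 1.0
print(ratio_against_whole_image(mask, *box))  # 0.1
```

The same box looks completely filled in one case and almost empty in the other, which is enough to flip any threshold-based decision.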

from archive-pdf-tools.

rmast avatar rmast commented on May 16, 2024

By the way, nice tool, PyCharm. It looks somewhat like IntelliJ, which I tried before.


rmast avatar rmast commented on May 16, 2024

I created a pull request for the solution. You might even be tempted to merge it into master.


MerlijnWajer avatar MerlijnWajer commented on May 16, 2024

> I just looked up the issue myself. There's something wrong with the ratio-determination: [image]

Thanks for finding this, this is indeed a real problem. I will take a look and see if this fix is ok, but will need to do some local testing on my test images to make sure everything looks ok.


MerlijnWajer avatar MerlijnWajer commented on May 16, 2024

So with the change from your pull request, there are regressions in some of my tests, for example:

Before:

burn-care-inverted

After:

burn-care-inverted


MerlijnWajer avatar MerlijnWajer commented on May 16, 2024

Removing the `*100` makes it better, but some other images still regress, so I will need to spend a bit more time on this later. Thanks for noticing. I also found another typo: it writes `- ones` instead of `- ones_i`.


rmast avatar rmast commented on May 16, 2024


MerlijnWajer avatar MerlijnWajer commented on May 16, 2024

I don't remember exactly what I toyed with, but I definitely tried something like that: relying on what distinguishes a character from noise in the bounding box. I think my idea was to use the "ratio": characters usually don't fill most of the pixels in their bounding box, and if you apply that to an entire word or even a line, then any outliers (e.g. 'w') will be filtered out. That's what the original code was designed to do. On top of that, I added some simple noise estimation to filter out noise.
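A rough sketch of why measuring the ratio over a whole word smooths out dense glyphs, assuming we already have per-glyph foreground counts and box areas. The numbers and the `pooled_ratio` helper are made up for illustration and are not taken from the actual code.

```python
def pooled_ratio(glyphs):
    """Foreground fraction over all glyph boxes combined.

    `glyphs` is a list of (ones, area) pairs: the number of foreground
    pixels and the bounding box area of each glyph (hypothetical input).
    """
    total_ones = sum(ones for ones, _ in glyphs)
    total_area = sum(area for _, area in glyphs)
    return total_ones / total_area if total_area else 0.0


# Illustrative values only: a thin glyph, a dense 'w'-like glyph, another thin glyph.
glyphs = [(10, 100), (60, 100), (12, 100)]

per_glyph = [ones / area for ones, area in glyphs]
word_ratio = pooled_ratio(glyphs)

print(per_glyph)   # the dense glyph stands out on its own
print(word_ratio)  # pooled over the word, the outlier is damped
```

Judged alone, the dense glyph would exceed a low fill threshold, while the pooled value for the word stays well below it.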

I think I locally have some changes that improve somewhat over the current code in my test cases and don't have the bug you found, but I'll need to do further evaluation.


rmast avatar rmast commented on May 16, 2024

Continuing on your thought I would expect an if-construction like this:

diff --git a/internetarchivepdf/mrc.py b/internetarchivepdf/mrc.py
index f6290db..e2bb6c0 100644
--- a/internetarchivepdf/mrc.py
+++ b/internetarchivepdf/mrc.py
@@ -237,12 +237,11 @@ def create_hocr_mask(img, mask_arr, hocr_word_data, downsample=None, dpi=None, t
             zero_i = thres_invert[np.where(thres_invert == 0)].size
             inv_ratio = (ones_i/(zero_i+ones_i))*100

+
+
             if ratio < 0.3 or inv_ratio < 0.3:
                 th = None

-                perc_larger = 0.
-                if inv_ratio != 0.0:
-                    perc_larger = (ratio / inv_ratio) * 100

                 if inv_ratio > 0.2 and ratio < 0.2:
                     th = thres
@@ -261,9 +260,16 @@ def create_hocr_mask(img, mask_arr, hocr_word_data, downsample=None, dpi=None, t
                         th = thres_invert
                     elif ratio < 0.2:
                         th = thres
-
-                if th is not None:
-                    mask_arr[top:bottom, left:right] = th
+            else:
+                perc_larger = 0.
+                if inv_ratio != 0.0:
+                    perc_larger = (ratio / inv_ratio) * 100
+                if perc_larger < 50:
+                    th = thres
+                else:
+                    th = thres_invert
+            if th is not None:
+                mask_arr[top:bottom, left:right] = th


     if timing_data is not None:
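The new `else` branch in the diff above can be read as a small standalone rule. The sketch below isolates only that branch. The function name and the string return values are placeholders for illustration, and it assumes `ratio` and `inv_ratio` are expressed on the same scale, which the thread suggests may not hold in the current code because of the `*100`.

```python
def pick_by_relative_size(ratio, inv_ratio, cutoff=50.0):
    """Choose between the normal and inverted threshold result.

    Mirrors the else-branch of the proposed diff: `perc_larger` is how large
    `ratio` is relative to `inv_ratio`, as a percentage. Returns the name of
    the mask that would be written ("thres" or "thres_invert").
    """
    perc_larger = 0.0
    if inv_ratio != 0.0:
        perc_larger = (ratio / inv_ratio) * 100
    # Note: when inv_ratio is 0, perc_larger stays 0 and "thres" is chosen.
    return "thres" if perc_larger < cutoff else "thres_invert"


print(pick_by_relative_size(0.2, 0.8))  # thres        (25% of inv_ratio)
print(pick_by_relative_size(0.4, 0.6))  # thres_invert (about 67% of inv_ratio)
print(pick_by_relative_size(0.5, 0.0))  # thres        (division guarded)
```

One consequence worth checking against real test images: a box whose inverted threshold has no foreground at all always falls back to `thres`, regardless of `ratio`.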



rmast avatar rmast commented on May 16, 2024

By the way, DjVu has an algorithm for foreground/background separation whose patent has expired: https://patents.google.com/patent/US6901169
However, it performs worse when there is noise in the scan that looks like holes in the mask:
jwilk/didjvu#21


MerlijnWajer avatar MerlijnWajer commented on May 16, 2024

Just a heads up, I'm branching the current code to a 1.4.x branch so that I can build future archive.org releases based on that, which allows for master to have more "compression" breaking changes.

We performed a lot of QA on the output on current parameters/code, so I don't feel confident just rolling out changes, however minor, so this should set us up to make some more breaking changes in master.


rmast avatar rmast commented on May 16, 2024


MerlijnWajer avatar MerlijnWajer commented on May 16, 2024

If you're looking to fix deskew issues mostly automatically, for text content, we use this: https://git.archive.org/archivecd/tesserotate/

It's only applied to our books and microfilm, but it works wonders, in my experience. It's based on Tesseract. It's combined with some heuristics, but by itself it works pretty decently. (Better than leptonica's deskew imho)


MerlijnWajer avatar MerlijnWajer commented on May 16, 2024

In the past I decided not to include deskew and similar steps in archive-pdf-tools, since such preprocessing can be done by another tool prior to invoking recode_pdf. That's why stuff like this is not included.

