Hi Merlijn, I like this repo as it looks like the first serious open

Lot of fuzz in background picture (internetarchive/archive-pdf-tools, Issue #26)

rmast commented on May 16, 2024

By the way, you'll probably need a good IPS-monitor to see that fuzz in the lower 2 bits. It doesn't show on my old Medion-monitor.
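Without such a monitor, the low-order bits can also be made visible in software. A minimal sketch in plain Python, assuming 8-bit grey values in a flat list; the function name `amplify_low_bits` is made up for illustration and is not part of any tool mentioned in this thread:

```python
def amplify_low_bits(pixels, bits=2):
    """Stretch the lowest `bits` bits of each 8-bit value over 0..255."""
    mask = (1 << bits) - 1   # 0b11 for bits=2
    scale = 255 // mask      # 85 for bits=2
    return [(p & mask) * scale for p in pixels]


# A "white" background that actually varies in its two lowest bits.
background = [252, 253, 254, 255, 252]
print(amplify_low_bits(background))  # [0, 85, 170, 255, 0]
```

Applied to every pixel of the background layer, this turns differences of one or two grey levels into clearly visible steps.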

from archive-pdf-tools.

MerlijnWajer commented on May 16, 2024

I ran your image in recode_pdf:

cp outputbase2-000-raar-effect-onderste-regel-didjvu\ zonder\ tekst.tif img.tif
tesseract img.tif - hocr > img.hocr
recode_pdf --from-imagestack /tmp/img.tif --hocr-file /tmp/img.hocr -o /tmp/out.pdf
Processed 1 pages at 3.73 seconds/page
mrcview /tmp/out.pdf /tmp/mrc.pdf

See file below -- do you see the same background fuzz?

out.pdf: https://github.com/internetarchive/archive-pdf-tools/files/7597950/out.pdf
mrc.pdf: https://github.com/internetarchive/archive-pdf-tools/files/7597951/mrc.pdf

rmast commented on May 16, 2024

Yes. Acrobat Reader DC takes some time to render the foreground, and for that fraction of a second I can see a lot of fuzzy edges around the letters.

rmast commented on May 16, 2024

I'm even able to read the fuzz in the background

rmast commented on May 16, 2024

If you don't have an IPS monitor: my Android phone also shows it.

MerlijnWajer commented on May 16, 2024

Right, so you're referring to the background not fully having the text layer removed. There are some tricks you can do, see for example #8 (looks like you found it) - but they also harm the quality in some other cases.

In general, the latest release (the one I shared with you) has gotten much better at making the background just background. Giving it the right DPI matters too, for example, and so does the binarisation method. At archive.org we also deal with a lot of books; if you know that you're dealing only with text, there's a lot more that can be done.

In general, this project is pretty new: it's hardly a year old, and I started it because most of the tools (as you have found) don't do MRC at all, so I decided to write it myself.
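The layer model behind the background fuzz can be sketched as follows: a bitonal mask chooses, per pixel, between a foreground and a background image. This is a toy illustration in plain Python with a made-up function name (`compose_mrc`), not the code in archive-pdf-tools; it only shows why text residue left in the background is hidden where the mask covers it and visible where it does not.

```python
def compose_mrc(mask, foreground, background):
    """Per pixel: take the foreground where the mask is 1, else the background."""
    return [f if m else b for m, f, b in zip(mask, foreground, background)]


# One scan line across a dark stroke on light paper (8-bit grey values).
mask = [0, 0, 1, 1, 0, 0]
foreground = [0, 0, 0, 0, 0, 0]              # text colour
clean_bg = [250, 250, 250, 250, 250, 250]    # stroke fully removed
fuzzy_bg = [250, 180, 60, 60, 170, 250]      # anti-aliased edge left behind

print(compose_mrc(mask, foreground, clean_bg))  # [250, 250, 0, 0, 250, 250]
print(compose_mrc(mask, foreground, fuzzy_bg))  # [250, 180, 0, 0, 170, 250]
```

In the second line the values 180 and 170 survive next to the stroke: that is the kind of residue that shows up as fuzz around letters when the background is viewed on its own or rendered before the foreground.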

MerlijnWajer commented on May 16, 2024

BTW: the PDFs I shared above are already ~40x smaller than the TIFF that you uploaded, which is about as much as you can expect from MRC. If you want more compression, you can downsample the background, like so:

$ recode_pdf --from-imagestack /tmp/img.tif --hocr-file /tmp/img.hocr -o /tmp/out.pdf --bg-downsample 3 -v --dpi 600 --fg-compression-flags '-slope 45000' --mask-compression jbig2
	 MMX
	 SSE
	 SSE2
	 SSE3
	 SSSE3
	 SSE41
	 POPCNT
	 SSE42
	 AVX
	 F16C
	 FMA3
	 AVX2
Creating text only PDF
Starting page generation at 2021-11-24T19:34:25.015312
Finished page generation at 2021-11-24T19:34:25.025512
Creating text pages took 0.0102 seconds
Inserting (and compressing) images
Converting with image mode: 2
MRC time breakdown: {'image_load': 0, 'grey_conversion': 358, 'hocr_mask_gen': 101, 'est_1': 91, 'threshold': 350, 'fast_denoise': 22, 'mask_jbig2': 174, 'fg_partial_blur': 719, 'fg_jp2': 309, 'bg_partial_blur': 770, 'bg_downsample': 270, 'bg_jp2': 82, 'page_image_insertion': 0}
Saving PDF now
Processed 1 pages at 3.63 seconds/page
Compression ratio: 66.214312

which leads to a 64K PDF with the following breakdown:

$ pdfimagesmrc /tmp/out.pdf
backsize: 4.00% 2.43kB
frntsize: 40.16% 24.42kB
masksize: 39.18% 23.82kB
restsize: 16.66% 10.13kB

If you remove the text layer (restsize), it'd be a bit smaller still.
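The percentages in that breakdown follow directly from the four sizes. A small sketch in Python that recomputes them and the size without the text layer; the variable and function names are chosen here for illustration, and only the four kB values come from the `pdfimagesmrc` output above:

```python
sizes_kb = {
    "backsize": 2.43,   # background image
    "frntsize": 24.42,  # foreground image
    "masksize": 23.82,  # mask
    "restsize": 10.13,  # everything else, including the text layer
}


def breakdown(sizes):
    """Return each component's share of the total, in percent."""
    total = sum(sizes.values())
    return {name: round(100 * size / total, 2) for name, size in sizes.items()}


total = sum(sizes_kb.values())
print(f"total: {total:.2f} kB")                                   # total: 60.80 kB
print(breakdown(sizes_kb))
print(f"without restsize: {total - sizes_kb['restsize']:.2f} kB")  # without restsize: 50.67 kB
```

The foreground and mask together account for roughly 80% of the file, so after downsampling the background they are where any further savings would have to come from.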

MerlijnWajer commented on May 16, 2024

I've been looking to collect a relatively large sample of different test images. I only have a few here now: https://github.com/internetarchive/archive-pdf-tools/tree/tests/tests/files

If you'd like to contribute some, that could also help in the future when trying to further improve the background generation.

MerlijnWajer commented on May 16, 2024

Do you have some pointers to C44 implementations (C or Python)? I'd like to try it on some of the books with images/photos that I have.

rmast commented on May 16, 2024

It's part of the DjVuLibre toolset on sourceforge

rmast commented on May 16, 2024

didjvu calls it together with a mask image, so that it can leave all occluded parts out of the wavelet-compressed result.

MerlijnWajer commented on May 16, 2024

Interesting. In the past I had also looked at specifying the region of interest (https://www.researchgate.net/publication/252087916_Selecting_the_don't_care_bits_in_JPEG2000_ROI_coding) with kakadu or openjpeg, but it didn't seem to make a difference.

Do you perhaps have a few instructions on how to use this? I'm working on these kinds of improvements in my spare time and I don't have a lot of time to investigate the DjVu tooling. The SourceForge website isn't too helpful, and I got the same impression when I last checked this out: the project looks a bit dead and it's not clear how they want you (as a user) to use it. Thanks in advance!

MerlijnWajer commented on May 16, 2024

Hm -- looks like I have c44 already installed on my system...

MerlijnWajer commented on May 16, 2024

It looks like c44 can create a djvu file from a mask file and an input image, and then djvuextract can extract some parts (although I am not sure how to read it yet)

e.g.

c44 /tmp/img.pnm -mask /tmp/out-mask.pbm /tmp/out.djvu

and

$ djvuextract /tmp/out.djvu BG44=/tmp/out-djvu-bg
  BG44=/tmp/out-djvu-bg --> "/tmp/out-djvu-bg" (290742 bytes)
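The two steps above can be wrapped for scripting. A minimal sketch in Python that only builds the argument lists, mirroring the two command lines exactly as typed above; the helper names are hypothetical, and nothing here is verified against the DjVuLibre documentation beyond what those two commands show:

```python
import shlex


def c44_mask_command(image, mask, output):
    """Argument list for wavelet-encoding `image` with c44, given a mask file."""
    return ["c44", image, "-mask", mask, output]


def djvuextract_bg_command(djvu, bg_out):
    """Argument list for writing the BG44 chunk of `djvu` to `bg_out`."""
    return ["djvuextract", djvu, f"BG44={bg_out}"]


commands = [
    c44_mask_command("/tmp/img.pnm", "/tmp/out-mask.pbm", "/tmp/out.djvu"),
    djvuextract_bg_command("/tmp/out.djvu", "/tmp/out-djvu-bg"),
]
for cmd in commands:
    # To actually run a step: subprocess.run(cmd, check=True)
    print(shlex.join(cmd))
```

Keeping the commands as lists rather than strings avoids shell quoting problems with file names that contain spaces, such as the TIFF earlier in this thread.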

rmast commented on May 16, 2024

Yes. You can also use DjVuToy to convert the DjVu to PDF. Unfortunately DjVuToy is freeware, but not open source. DjVuToy even translates a JB2 to a JBIG2 dictionary 1:1.

MerlijnWajer commented on May 16, 2024

For my understanding, do you know how this deals with non-text, like images? I was able to see some output (by converting to PostScript and then PostScript to PDF) and it seems to perform quantization on the background, which is fine if the background is uniform, but not if it contains any kind of drawings or interesting colours?

MerlijnWajer commented on May 16, 2024

Here are some examples from (much) older versions of this software: https://archive.org/~merlijn/projects/archive-pdf-tools/index.html#mrc-examples

rmast commented on May 16, 2024

If you want to see the whole package of DjVu patents at work: DjVuSolo 3.1 converts TIFFs to a DjVu that is displayed immediately. You can select info and layers via the menus.

rmast commented on May 16, 2024

I was very surprised with the small result.

rmast commented on May 16, 2024

I see two large pictures with text fuzz at the bottom of your example page, which together only add one little colored picture to the left of an otherwise bitonal image. Leaving out the fuzz, and at least the lower-resolution one of the two pictures, would probably make the result much smaller, since the empty space would then be run-length encoded.

rmast commented on May 16, 2024

If you are interested in compressing comics, then Ma Jian of DjVuToy also has a lot of knowledge on that subject. He even pointed me to another freeware tool he maintains to preprocess those pictures. One of the steps you would want is posterizing and palettizing, to reduce the colors to the visually different colors.

One of the problems with the usual jb2/jbig2 coder is that it needs background space between foreground pictures. I read that djvupalette can make jb2's with consecutive colored glyphs. Coloring jb2 glyphs with a fgbz segment is, however, not possible in PDF. I saw DjVuToy split such colored JB2's into multiple JBIG2's with a shared dictionary indirection. Those separate JBIG2's can then have different colors and together form the original colored picture.

Alexander Truvanof has the opinion that automatic compression always has flaws that need manual correction, so you'll always need some workflow to be able to visually judge the intermediate results.

By the way, I don't think MRC would be the best approach to compress comics. The content abides by several rules, such as homogeneous fill color, pen thickness and splines, so I guess a vector language like PostScript would be the closest language to describe such drawings with minimal overhead. http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=93DF060319E2A6241F40D1F261A22D0C?doi=10.1.1.489.8522&rep=rep1&type=pdf

I also see mention of .CBR, a Comic Book Reader compression format.
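As an illustration of the posterizing and palettizing step mentioned above, here is a minimal sketch in plain Python. It uses simple uniform quantization per channel, which is an assumption made for the example; it is not the method of DjVuToy or of the preprocessing tool referred to, and the function names are made up.

```python
def posterize(pixels, levels=4):
    """Snap each 8-bit channel to one of `levels` evenly spaced values."""
    step = 255 / (levels - 1)
    return [tuple(round(round(c / step) * step) for c in px) for px in pixels]


def palettize(pixels):
    """Return (palette, indices): the unique colours and one index per pixel."""
    palette = sorted(set(pixels))
    lookup = {colour: i for i, colour in enumerate(palette)}
    return palette, [lookup[px] for px in pixels]


# Five RGB pixels: three slightly different reds, two slightly different blues.
scan = [(250, 12, 8), (255, 0, 0), (247, 9, 14), (3, 2, 250), (0, 0, 255)]
palette, indices = palettize(posterize(scan))
print(palette)  # [(0, 0, 255), (255, 0, 0)]
print(indices)  # [1, 1, 1, 0, 0]
```

Five nearly identical colours collapse to a two-entry palette, which is the effect that makes flat comic fills cheap to store.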

MerlijnWajer commented on May 16, 2024

Manual correction is hard at the scale at which I/we usually use this program (many millions of pages every day).

You can find a comic I tested on recently here: https://archive.org/details/bruno-de-bever-a01

The compression ratio is only ~4x because of a specific archive.org policy where we use higher quality compression for the first 10 and last 5 pages of an artifact (and this one only has 33 pages).
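How much of a short item that policy covers is easy to work out. A sketch in Python, assuming exactly the rule as stated above (first 10 and last 5 pages at higher quality); the function name is hypothetical and this is not archive.org's code:

```python
def high_quality_pages(n_pages, first=10, last=5):
    """0-based indices of the pages that would get the higher-quality setting."""
    head = set(range(min(first, n_pages)))
    tail = set(range(max(n_pages - last, 0), n_pages))
    return sorted(head | tail)


pages = high_quality_pages(33)
print(f"{len(pages)} of 33 pages, {len(pages) / 33:.0%}")  # 15 of 33 pages, 45%
```

With 15 of 33 pages compressed at the higher quality, the overall ratio for this comic is dominated by those pages.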

rmast commented on May 16, 2024

If I look at the text balloon in https://wizzup.org/bruno.pdf, p. 18 ("Waar heb je het over, ik heb geen spullen" "Ben je echt alleen"):

I see a lot of fuzzy color in the text balloons, where the original has more white in them, together with some JPEG artefacts around the text. Given the resolution of the color in the text balloons I would expect some blocking at the edges, but not that the color of the blackbird's hat extends 3 pixels above the black line.

MerlijnWajer commented on May 16, 2024

Right, that happens because of an algorithm I wrote that attempts to make the images a bit more smooth to compress, but can cause the colours to bleed through a bit: https://github.com/internetarchive/archive-pdf-tools/blob/master/cython/optimiser.pyx#L65 - I've observed similar things with the PDFs produced by the foxit (luratech) pdf compressor

MerlijnWajer commented on May 16, 2024

It might be possible to add some more algorithms and to allow more parameters to be tweaked, to create "profiles" for certain input types. I've tried to make it all pretty generic so that it should do a decent job for a wide range of input documents.

MerlijnWajer commented on May 16, 2024

(Very open to improvements btw -- I've been toying with this on and off for a year and this is the best I could come up with currently)

rmast commented on May 16, 2024

The first thing I can think of is to try the same with DjVuSolo 3.1, to see whether there are better options for which the patents will expire very soon. The American patent, however, has a strange end date with no clear relation to the filing date; the difference should be 20 years at most. One of the European patents expires in two months.

rmast commented on May 16, 2024

I've made a DjVu from the picture with the standard settings of DjVuSolo 3.1, which looks like this in WinDjView:

The resulting DjVu of this picture is about the same size as the page split from your PDF with pdfsam, which looks like this in Acrobat DC:

Unfortunately, when I have DjVuToy convert the DjVu to PDF, it becomes twice as big.
Documents.zip

Your algorithm seems to preserve more of the drawn lines if you look at the eyes and the nose on the right, but both damage the tree above the character on the right. This was the original:

MerlijnWajer commented on May 16, 2024

Interesting how the DjVu code does the inverse: removing colour around the beak of the raven and the teeth of the beaver.

rmast commented on May 16, 2024

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8404928/#B41-jimaging-07-00153
https://github.com/WangJieying/SDMD-resources
https://github.com/WangJieying/SDMD-resources/blob/main/cartoon.md
So I would propose to diverge from PDF for this purpose, or to check whether some vector description language available within PDF could be used for the vectors.

Unfortunately this SDMD requires CUDA, which I don't have. I could probably rent one at AWS or Azure.

Edit: I've tried to bring the contents of these repos to life, but an important part was missing: the class BSplineCurveGenerate.

I succeeded in getting the missing files and have already seen the compile finish. I'm now rearranging the dependencies and will then try again to get that B-spline compression done on these pictures on an Azure CUDA server.

MerlijnWajer commented on May 16, 2024

It might be worth checking out #33 and also perhaps toying with some of the JPEG2000 compression parameters. I think I currently always use just one layer, but multiple layers with region of interest coding could also help. We could mark the text as 'not interesting' in the background image, and hope that some of the background noise just disappears because of that.

rmast commented on May 16, 2024

I'm now at the point where I can compile the SDMD code that I had to piece together from several repos, but still no luck in getting the skeleton. While searching for the end goal (a PDF that redraws the pictures with fitted Bezier curves or something like that, I guess using PostScript), I found the complete legacy of Zunzunsite: http://web.archive.org/web/20200315020359/http://zunzun.com/

https://bitbucket.org/zunzuncode/zunzunsite3/src/master/

Unfortunately the owner had to quit due to health reasons:
https://groups.google.com/g/zunzun_dot_com/c/n7Uk9P_CDe4

It's revived! http://www.findcurves.com/

The site offers lots of curve formulas, so I doubt it fits bezier curves right into PostScript in a PDF.

rmast commented on May 16, 2024

With a little help from @WangJieying I was able to reconstruct and recompose the SDMD image compression on an Azure server:
https://github.com/rmast/SDMD-resources
This is the result I get with the default config.txt; the compressed output.sir is 211590 bytes:

For convenience, the original:

So this SDMD algorithm does not seem well suited to preserving the black lines of a comic. I wonder whether comic compression with splines, to create a PostScript B-spline PDF, could become a new CUDA research subject ;-).

MerlijnWajer commented on May 16, 2024

Hmm... interesting, is there another part where the foreground or other components are visible? Like, how is it combined?

rmast commented on May 16, 2024

The decomposition is in B-spline coordinates for three grey-level pictures, after thresholding them at some levels. The B-spline coordinates are then compressed into the final output.sir. The image is then recomposed by reversing those steps. All of this should be explained in the corresponding scientific publication. I think that for cartoon pictures with clear black separators the approach should be different.
