internetarchive / archive-pdf-tools

Fast PDF generation and compression. Deals with millions of pages daily.

Home Page: https://archive-pdf-tools.readthedocs.io/en/latest/

License: GNU Affero General Public License v3.0

Python 87.76% Cython 11.74% Shell 0.49%
compression ocr pdf pdf-compression pdf-generation pdf-generator pdf-to-image python pdf-compressor

archive-pdf-tools's Introduction

Internet Archive PDF tools

Date: 2021-11-14 18:00

This repository contains a library to perform MRC (Mixed Raster Content) compression on images[1], which offers lossy, high-ratio compression of images, in particular images with text.

Additionally, the library can generate MRC-compressed PDF files with hOCR[2] text layers mixed into the PDF, which makes searching and copy-pasting of the PDF possible. PDFs generated by bin/recode_pdf should be PDF/A 3b and PDF/UA compatible.

Some of the tooling also supports specific Internet Archive file formats (such as the "scandata.xml" files), but the tooling should work fine without those files, too.

While the code is already used internally to create PDFs at the Internet Archive, it still needs more documentation and cleaning up, so don't expect it to be super well documented just yet.

Features

  • Reliable: has produced over 6 million PDFs in 2021 alone (each with many hundreds of pages)
  • Fast and robust compression: Competes directly with the proprietary software offerings when it comes to speed and compressibility (often outperforming in both)
  • MRC compression of images, leading to anywhere from 3-15x compression ratios, depending on the quality setting provided.
  • Creates PDF from a directory of images
  • Improved compression based on OCR results (hOCR files)
  • Hidden text layer insertion based on hOCR files, which makes a PDF searchable and the text copy-pasteable.
  • PDF/A 3b compatible.
  • Basic PDF/UA support (accessibility features)
  • Creation of 1 bit (black and white) PDFs

Dependencies

  • Python 3.x
  • Python packages (also see requirements.txt)
  • For JBIG2 compression: jbig2enc (and PyMuPDF 1.19.0 or higher)

Installation

First install dependencies. For example, in Ubuntu:

sudo apt install libleptonica-dev libopenjp2-tools libxml2-dev libxslt-dev python3-dev python3-pip

# Build and install jbig2enc (optional, needed for JBIG2 mask compression):
sudo apt install automake libtool
git clone https://github.com/agl/jbig2enc
cd jbig2enc
./autogen.sh
./configure && make
sudo make install

Because archive-pdf-tools is on the Python Package Index (PyPI), you can use pip (the Python 3 version is often called pip3) to install the latest version:

# Latest version
pip3 install archive-pdf-tools

# Specific version
pip3 install archive-pdf-tools==1.4.14

Alternatively, if you want a specific commit or unreleased version, check out the master branch or a tagged release and use pip to install:

git clone https://github.com/internetarchive/archive-pdf-tools.git
cd archive-pdf-tools
pip3 install .

Finally, if you've downloaded a wheel to test a specific commit, you can also install it using `pip`:

pip3 install --force-reinstall -U --no-deps ./archive_pdf_tools-${version}.whl

To see if archive-pdf-tools is installed correctly for your user, run:

recode_pdf --version

Not well tested features

  • "Recoding" an existing PDF (extracting its images and creating a new PDF from them) is not well tested. It works OK if every PDF page has just a single image.

Known issues

  • Using --image-mode 0 and --image-mode 1 is currently broken, so only MRC or no images is supported.
  • It is not possible to recode/compress a PDF without hOCR files. This will be addressed in the future, since it should not be a problem to generate a PDF lacking hOCR data.

Planned features

  • Addition of a second set of fonts in the PDFs, so that hidden selected text also renders the original glyphs.
  • Better background generation (text shade removal from the background)
  • Better compression parameter selection; I have not toyed around much with kakadu and grok/openjpeg2000 parameters.

MRC

The goal of Mixed Raster Content compression is to decompose the image into a background, foreground and mask. The background should contain components that are not of particular interest, whereas the foreground would contain all glyphs/text on a page, as well as the lines and edges of various drawings or images. The mask is a 1-bit image which has the value '1' when a pixel is part of the foreground.

This decomposition can then be used to compress the different components individually, applying much higher compression to specific components, usually the background, which can be downscaled as well. The foreground can also be compressed quite heavily, since it mostly just needs to contain the approximate colours of the text and other lines; any artifacts introduced during foreground compression (e.g. ugly artifacts around text borders) are removed by overlaying the mask component of the image, which is losslessly compressed (typically using either JBIG2 or CCITT).

In a PDF, this usually means the background image is inserted into a page, followed by the foreground image, which uses the mask as its alpha layer.
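The decomposition described above can be sketched in a few lines. This is a toy illustration with a fixed global threshold, not the adaptive binarisation the library actually performs; the function name and threshold value are made up for the example:

```python
import numpy as np

def mrc_decompose(img, threshold=128):
    """Toy MRC split of a grayscale image into background, foreground and mask."""
    img = np.asarray(img, dtype=np.uint8)
    mask = (img < threshold).astype(np.uint8)   # 1-bit mask: 1 where pixel is foreground
    fg = np.where(mask == 1, img, 0)            # keeps the approximate text colours
    bg = np.where(mask == 0, img, 255)          # text removed; safe to downsample heavily
    return bg, fg, mask
```

A real implementation binarises adaptively and fills the text areas of the background with surrounding colours rather than plain white, so the background compresses smoothly.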

Usage

Creating a PDF from a set of images is pretty straightforward:

recode_pdf --from-imagestack 'sim_english-illustrated-magazine_1884-12_2_15_jp2/*' \
    --hocr-file sim_english-illustrated-magazine_1884-12_2_15_hocr.html \
    --dpi 400 --bg-downsample 3 \
    -m 2 -t 10 --mask-compression jbig2 \
    -o /tmp/example.pdf
[...]
Processed 9 pages at 1.16 seconds/page
Compression ratio: 7.144962

Or, to scan a document, OCR it with Tesseract and save the result as a compressed PDF (JPEG2000 compression with OpenJPEG, background downsampled three times), with text layer:

scanimage --resolution 300 --mode Color --format tiff | tee /tmp/scan.tiff | \
    tesseract - - hocr > /tmp/scan.hocr ; \
recode_pdf -v -J openjpeg --bg-downsample 3 \
    --from-imagestack /tmp/scan.tiff --hocr-file /tmp/scan.hocr -o /tmp/scan.pdf
[...]
Processed 1 pages at 11.40 seconds/page
Compression ratio: 249.876613

Examining the results

mrcview (tools/mrcview) is shipped with the package and can be used to turn an MRC-compressed PDF into a PDF with each layer on a separate page; this is the easiest way to inspect the resulting compression. Run it like so:

mrcview /tmp/compressed.pdf /tmp/mrc.pdf

There is also maskview, which just renders the masks of a PDF to another PDF.

Alternatively, one could use pdfimages to extract the image layers of a specific page and then view them with your favourite image viewer:

pageno=0; pdfimages -f $pageno -l $pageno -png path_to_pdf extracted_image_base
feh extracted_image_base*.png

tools/pdfimagesmrc can be used to check how the size of the PDF is broken down into the foreground, background, masks and text layer.

License

License for all code (minus internetarchivepdf/pdfrenderer.py) is AGPL 3.0.

internetarchivepdf/pdfrenderer.py is Apache 2.0, which matches the Tesseract license for that file.


  1. https://en.wikipedia.org/wiki/Mixed_raster_content

  2. http://kba.cloud/hocr-spec/1.2/

archive-pdf-tools's People

Contributors

cclauss, jrochkind, mara004, merlijnwajer, redsandro, stweil, tfmorris

archive-pdf-tools's Issues

pillow is not working properly

Using -J pillow results in terrible images. It looks like the image is resampled 4 to 1.

recode_pdf -v --dpi 300 \
  -J pillow \
  -I in.png --hocr-file in.hocr -o out-pillow.pdf

Here is the -J pillow foreground layer:
[image: pillow foreground layer]

For comparison, here is -J kakadu:
[image: kakadu foreground layer]

The resulting files are approximately similar in size. Is pillow really absurdly bad, or does it need different compression parameters? When I wanted to try this out, recode_pdf didn't like the documented compression-flags and threw an error:

recode_pdf -v --dpi 300 \
  --fg-compression-flags ' -r 750' \
  -J pillow \
  -I in.png --hocr-file in.hocr -o out-pillow-r750.pdf

  File "internetarchivepdf/jpeg2000.py", line 188, in _jpeg2000_pillow_str_to_kwargs
    k, v = en.split(':', maxsplit=1)
ValueError: not enough values to unpack (expected 2, got 1)
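Judging from the traceback, `_jpeg2000_pillow_str_to_kwargs` expects colon-separated key:value entries rather than opj_compress/kakadu-style flags, so a bare ' -r 750' has no colon to split on. The following is a hypothetical re-creation of that parsing step, inferred from the traceback alone (names and format are assumptions, not the actual source):

```python
def pillow_flags_to_kwargs(flags):
    # Inferred from the traceback: each whitespace-separated entry must be "key:value".
    kwargs = {}
    for en in flags.split():
        k, v = en.split(':', maxsplit=1)  # "-r" has no ':' -> ValueError as reported
        kwargs[k] = v
    return kwargs
```

If that reading is right, Pillow-style flags would presumably look like `quality_mode:rates quality_layers:[750]` instead of `-r 750`; documenting the expected format would remove the ambiguity.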

Additional info

Linux Mint 20.2 AKA Ubuntu 20.04.3

Test scan to experiment with

test_1.png.zip

Suggested actionables

  • Use sane defaults for pillow so quality is reasonable.
  • Show clear distinct error message so user doesn't get ambiguous ValueError when following the docs.
  • Update documentation with Pillow compression flags.

Support PDF generation without MRC

This used to work (currently called image mode 1), but it definitely doesn't work right now, so it would be nice to make it work again.

Some other errors with the current version: I can't get it to work with an hOCR file coming from pdftree to extract the current searchable text from a PDF

I now work with an hOCR file coming from pdftree to extract the current searchable text from a PDF, as suggested at the bottom of this issue:
ocropus/hocr-tools#117

recode_pdf --from-imagestack './2022-01-08*.tif' --hocr-file anonymized.hocr --dpi 400 --bg-downsample 3 --mask-compression jbig2 -o 2022-01-08a.pdf
Traceback (most recent call last):
  File "/usr/local/bin/recode_pdf", line 4, in <module>
    __import__('pkg_resources').run_script('archive-pdf-tools==1.4.11', 'recode_pdf')
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 667, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1463, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/EGG-INFO/scripts/recode_pdf", line 288, in <module>
    res = recode(args.from_pdf, args.from_imagestack, args.dpi, args.hocr_file,
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/internetarchivepdf/recode.py", line 741, in recode
    outdoc.save(outfile, deflate=True, pretty=True)
  File "/usr/local/lib/python3.8/dist-packages/PyMuPDF-1.19.2-py3.8-linux-x86_64.egg/fitz/fitz.py", line 4416, in save
    raise ValueError("cannot save with zero pages")
ValueError: cannot save with zero pages

recode_pdf --from-pdf Afbeeldingen/scantailorin/out/2022-01-08a.pdf --hocr-file anonymized.hocr --dpi 400 --bg-downsample 3 --mask-compression jbig2 -o 220108uitvoer.pdf
Traceback (most recent call last):
  File "/usr/local/bin/recode_pdf", line 4, in <module>
    __import__('pkg_resources').run_script('archive-pdf-tools==1.4.11', 'recode_pdf')
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 667, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1463, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/EGG-INFO/scripts/recode_pdf", line 288, in <module>
    res = recode(args.from_pdf, args.from_imagestack, args.dpi, args.hocr_file,
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/internetarchivepdf/recode.py", line 741, in recode
    outdoc.save(outfile, deflate=True, pretty=True)
  File "/usr/local/lib/python3.8/dist-packages/PyMuPDF-1.19.2-py3.8-linux-x86_64.egg/fitz/fitz.py", line 4416, in save
    raise ValueError("cannot save with zero pages")
ValueError: cannot save with zero pages

Even if I leave out the hocr-file in the hope the input PDF should be already taken for the searchable text inside there's still an error:
recode_pdf --from-pdf Afbeeldingen/scantailorin/out/2022-01-08a.pdf -o 220108uitvoer.pdf
Traceback (most recent call last):
  File "/usr/local/bin/recode_pdf", line 4, in <module>
    __import__('pkg_resources').run_script('archive-pdf-tools==1.4.11', 'recode_pdf')
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 667, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1463, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/EGG-INFO/scripts/recode_pdf", line 288, in <module>
    res = recode(args.from_pdf, args.from_imagestack, args.dpi, args.hocr_file,
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/internetarchivepdf/recode.py", line 628, in recode
    create_tess_textonly_pdf(hocr_file, tess_tmp_path, in_pdf=in_pdf,
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.11-py3.8-linux-x86_64.egg/internetarchivepdf/recode.py", line 110, in create_tess_textonly_pdf
    for idx, hocr_page in enumerate(hocr_iter):
  File "/usr/local/lib/python3.8/dist-packages/archive_hocr_tools-1.1.13-py3.8.egg/hocr/parse.py", line 42, in hocr_page_iterator
    fp.seek(0)
AttributeError: 'NoneType' object has no attribute 'seek'

I anonymized the hOCR with :%s/>.*<\/span>/>bla<\/span>/
anonymized.zip

License (in)compatibility

Hi, any progress on the license-incompatibility with OcrMyPDF (MPL-2.0)?

Would GScan2PDF (GPLv3) be a better fit? I'll try to study the differences...

Add another font beyond the glyphless font to actually render fonts of the languages that are in use

There is an old branch here that implements the concept:

https://github.com/internetarchive/archive-pdf-tools/tree/show-text-on-selection

It looks a bit messy, and the code was older (wrt font sizes when I wrote it), but something was working back then:

[screenshots of the rendered text selection]

This table is a set of fonts that we could expect to have around I believe (system wide?):

Font name   Installed base font   Comments
china-s     Heiti                 simplified Chinese
china-ss    Song                  simplified Chinese (serif)
china-t     Fangti                traditional Chinese
china-ts    Ming                  traditional Chinese (serif)
japan       Gothic                Japanese
japan-s     Mincho                Japanese (serif)
korea       Dotum                 Korean
korea-s     Batang                Korean (serif)
Then the question becomes -- what do we do for Arabic fonts?

We will want to add the language to the word data as returned by archive-hocr-tools, and then on a per page basis insert the right font.


(Old bug: https://git.archive.org/merlijn/archive-pdf-tools/-/issues/4)

Small difference in compression ratio

See my post after the closed #30

There is a small compression-ratio difference between your and my setup. Could that signal a memory-leak, or some other difference in setup?

Support pillow jpeg2000 writing

It's probably not as great as grok or kakadu, but it'd be nice to support it for folks who don't have the other programs installed.

Wrong resolution of mask image when foreground image is downsampled

I tried to use recode_pdf from imagestack together with the option to downsample the foreground ("--fg-downsample 4").
The resulting pdf was unreadable.

I found out that the foreground (meaning the colour layer) was resampled as expected.
When the PDF is written, the resolution of the mask layer (which should stay at the original size) is taken from the foreground and is therefore wrong.

As a solution I changed mrc.py to return the size of the mask and used those values in recode.py.

This works fine when encoding images to PDF.
I did not test it with other modes.

Attached you find patches for mrc.py and recode.py.
patches.tar.gz

Error with hocr-files from Tesseract

When Tesseract generates this HOCR-file
img.zip

I get this error:

recode_pdf --from-imagestack ../210923-005.tif --hocr-file ~/img.hocr -o /tmp/outf.pdf --bg-downsample 3 -v --dpi 300 --fg-compression-flags '-slope 45000' --mask-compression jbig2
	 MMX
	 SSE
	 SSE2
	 SSE3
	 SSSE3
	 SSE41
	 POPCNT
	 SSE42
	 AVX
	 F16C
	 XOP
	 FMA4
	 FMA3
Creating text only PDF
Starting page generation at 2021-11-28T10:59:56.133494
Traceback (most recent call last):
  File "/usr/local/bin/recode_pdf", line 4, in <module>
    __import__('pkg_resources').run_script('archive-pdf-tools==1.4.9', 'recode_pdf')
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 667, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1463, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.9-py3.8-linux-x86_64.egg/EGG-INFO/scripts/recode_pdf", line 262, in <module>
    res = recode(args.from_pdf, args.from_imagestack, args.dpi, args.hocr_file,
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.9-py3.8-linux-x86_64.egg/internetarchivepdf/recode.py", line 1070, in recode
    create_tess_textonly_pdf(hocr_file, tess_tmp_path, in_pdf=in_pdf,
  File "/usr/local/lib/python3.8/dist-packages/archive_pdf_tools-1.4.9-py3.8-linux-x86_64.egg/internetarchivepdf/recode.py", line 189, in create_tess_textonly_pdf
    imgfile = image_files[idx]
IndexError: list index out of range

For the first 5 pages there was no issue with the same command; it's only this page, so the hOCR coming from Tesseract contains something not allowed.

Use JBIG2 compression to determine if we want to blur or denoise before thresholding

We can threshold the original image, optimistically do the JBIG2 conversion, and only when the JBIG2 doesn't compress well, apply blur to the image and re-threshold, and/or denoise the threshold result (mask).

JBIG2 compression is fast, and our current noise estimation is not. Since our JBIG2 is lossless, good compression suggests that the image is not noisy.

This will help speed up the PDF generation, since the Gaussian noise estimation is currently the most CPU-intensive part, which is kind of silly.
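The proposed heuristic boils down to a size-ratio check on the losslessly compressed mask; a minimal sketch, where the function name and the 20x ratio threshold are illustrative, not from the codebase:

```python
def mask_needs_denoise(raw_mask_bytes, jbig2_bytes, min_ratio=20.0):
    # Lossless JBIG2 compresses clean masks very well; a poor compression
    # ratio suggests a noisy mask, so blur/denoise and re-threshold.
    ratio = raw_mask_bytes / max(jbig2_bytes, 1)
    return ratio < min_ratio
```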

Windows port

  • Add pillow fallback for reading images (in case kakadu or openjpeg2000 or grok is not available)
  • os.remove causes sharing violations since I remove files after I open them, which Windows doesn't allow
  • jbig2 encoding is not available right now

Usefulness of MRC for decent quality compression of scanned book pages with illustrations

Opening a new issue as requested.

Here are some samples: https://mega.nz/folder/BRhChKob#xo-HHaJrD9VYN6YV3ur9WA

128.tif & 188.tif - original cleaned up 600dpi scans
*-scantailor.tif - 600dpi mixed output with bitonal text and color photos, as autodetected
*-scantailor-pdfbeads.pdf - the above .tif split into two layers, with the text layer jbig2-encoded and the background layer JP2-encoded downsampled to 150dpi, everything combined into a PDF using pdfbeads
*.jp2 - some compressed versions of the original; I forgot the settings. Page 128 is almost half the size of the PDFs, so I assume PDF sizes can be slightly improved.

The folders have some residual files. ScanTailor itself can now split tiffs, though I have no idea how to merge them as layers in a PDF. (That would be useful to learn.)

Can MRC output get to be anything comparable to the PDFs at the same or lower size? I'm also curious whether it can be achieved directly from the original cleaned up scan or the ScanTailor mixed output step is still advised.

--jbig2 deprecated

I built the newest version of this tool, and it states I should use [--mask-compression {jbig2,ccitt}]. So the main readme should be adapted accordingly.

Improve mask and background generation

There are a few things to improve in the mask generation:

  • The Sauvola binarisation currently uses fixed parameters, which is not ideal. We probably want to make some of those parameters dependent on the image DPI, and change the k value to 0.34 as default.
  • We could look into better binarisation algorithms like multi-scale sauvola as mentioned here: tesseract-ocr/tesseract#3083 (comment)

The same applies to the hocr-specific mask generation.
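For reference, the Sauvola threshold for a pixel is computed from the local window mean m and standard deviation s as T = m * (1 + k * (s/R - 1)); a sketch using the k = 0.34 default proposed above, with R the dynamic range of the standard deviation (typically 128 for 8-bit images):

```python
def sauvola_threshold(mean, std, k=0.34, R=128.0):
    # Sauvola: T = m * (1 + k * (s/R - 1)); pixels below T become mask foreground.
    return mean * (1.0 + k * (std / R - 1.0))
```

Making the window size DPI-dependent, as suggested, would then change which local mean and std feed into this formula.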

Add/implement regression tests for MRC

We can leverage the scripts in tools/ to perform the MRC compression separately, and merge the final result, and create a diff of the output of the original image and the MRC compressed image. This way, if we have a database of images, we could improve the algorithm and see how it performs against known data/images.
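Such a regression test only needs a scalar similarity metric to track over time; a minimal sketch using root-mean-square error over flattened pixel values (RMSE is one reasonable choice of metric, not necessarily what the project would settle on):

```python
import math

def rmse(original, recompressed):
    """Root-mean-square error between two equal-length pixel sequences."""
    assert len(original) == len(recompressed)
    n = len(original)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(original, recompressed)) / n)
```

With a database of images, each algorithm change could then be gated on the metric not regressing beyond a tolerance.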

pdfcomp: problems with inverted text that is often better in hocr.

This form https://www.kvk.nl/download/Formulier-14-wijziging-ondernemings-en-vestigingsgegevens_tcm109-365607.pdf

First page saved to jpeg via this site: https://smallpdf.com

0001

Result of the left column is quite readable at the right screen-resolution.

ocrmypdf --pdfa-image-compression lossless -O0  0001.jpg formulierhocrjpg.pdf
Input file is not a PDF, checking if it is an image...
Input file is an image
Input image has no ICC profile, assuming sRGB
Image seems valid. Try converting to PDF...
Successfully converted to PDF, processing...
Scanning contents: 100%|████████████████████████| 1/1 [00:00<00:00, 73.93page/s]
OCR: 100%|██████████████████████████████████| 1.0/1.0 [00:09<00:00,  9.92s/page]
Postprocessing...
PDF/A conversion: 100%|█████████████████████████| 1/1 [00:00<00:00,  2.46page/s]
Optimize ratio: 1.00 savings: 0.0%
Output file is a PDF/A-2B (as expected)

pdfcomp formulierhocrjpg.pdf formulierhocrjpgkleiner.pdf
Compression factor: 9.617848822158944

formulierhocrjpgkleiner.pdf

It contains unreadable text on the left. The hOCR contains "Toelichting 1.1", but in the PDF it is completely unreadable.

My patch for the inversion ratio makes it better readable:

formulierhocrjpgkleinerpatch.pdf

However, if you look at the mask picture, it doesn't contain this text in the left column at all.

So my patch isn't the only change needed for that routine.

Some scans become inverted

I've noticed it twice before, and I thought it was a computer issue because I scanned too large at 600 dpi. But now I encounter this a third time, this time while scanning a small card at 300 dpi. I'm beginning to think this might be a bug.

Original: Left. recode_pdf: Right.
image

My normal workflow:

ls -1 *.png > in.txt
tesseract -l nld+eng --dpi 300 in.txt out hocr
recode_pdf -v -m 2 --dpi 300 --from-imagestack "./*.png" --hocr-file out.hocr -o "out-recode.pdf"

Is this a known issue? Is there a known workaround? I did a quick search, which didn't turn up anything.
I'm not sure I can share the full resolution card openly because it is copyrighted, but if this issue has never been seen before I am willing to email the full resolution file for testing purposes.

$ recode_pdf --version
internetarchivepdf 1.4.14

I don't understand this picture

image

Why would we need so many colors smeared to the bottom right that are not behind the foreground mask?

Those could all be optimized away to facilitate Run Length Encoding.

Bug in foreground/background separator choosing a massive block instead of character outlines.

Partly anonymized replay of my previous finding on compressing the bank statement with a downsampled foreground, revealing a bug in the foreground binarizer/separator.

image

Add fg_downsample=12 in compress-pdf-images:

    mrc_gen = create_mrc_hocr_components(pil_image, hocr_word_data,
    #mrc_gen = create_mrc_hocr_components(pil_image, [],
            denoise_mask=DENOISE_FAST,
            bg_downsample=3,
            fg_downsample=12
            )

bankstatementgeknipt8noalphag.zip

ocrmypdf --pdfa-image-compression lossless -O0 --image-dpi 600 bankstatementgeknipt8noalphag.tiff outgeknipt8g.pdf
pdfcomp outgeknipt8g.pdf outgeknipt8-12g.pdf
outgeknipt8-12g.pdf
outgeknipt8g.pdf

openjpeg is not working properly

Recommended actions discussed in this issue:

  • Remove -threads (or place flag last) for OpenJpeg (done: 31def81)
  • Allow threads to be specified (encoder agnostic e.g. -num_threads for Kakadu)
  • Merge debug messages from ROI build

(Original issue below.)


Using -J openjpeg results in lossless compression. Probably because it is the default:

$ opj_compress -h | grep -A 3 "Default encoding"
Default encoding options:
-------------------------

 * Lossless

My test image compresses 0.29 times (got 3 times bigger).

recode_pdf -v --dpi 300 \
  -J openjpeg \
  -I in.png --hocr-file in.hocr -o out-openjpeg.pdf

recode_pdf should probably use some sane defaults with -J openjpeg.

Command line arguments don't work.

Unfortunately, manually setting the compression options doesn't work. According to opj_compress -h, the compression ratio can be adjusted with -q or -r:

-r <compression ratio>,<compression ratio>,...
    Different compression ratios for successive layers.
    The rate specified for each quality level is the desired
    compression factor (use 1 for lossless)
    Decreasing ratios required.
      Example: -r 20,10,1 means 
            quality layer 1: compress 20x, 
            quality layer 2: compress 10x 
            quality layer 3: compress lossless
    Options -r and -q cannot be used together.
-q <psnr value>,<psnr value>,<psnr value>,...
    Different psnr for successive layers (-q 30,40,50).
    Increasing PSNR values required, except 0 which can
    be used for the last layer to indicate it is lossless.
    Options -r and -q cannot be used together.

Yet the resulting files are identical size-wise regardless of compression-flags:

recode_pdf -v --dpi 300 \
  --fg-compression-flags ' -r 20' \
  --bg-compression-flags ' -r 20' \
  -J openjpeg \
  -I in.png --hocr-file in.hocr -o out-openjpeg.pdf
recode_pdf -v --dpi 300 \
  --fg-compression-flags ' -q 5' \
  --bg-compression-flags ' -q 5' \
  -J openjpeg \
  -I in.png --hocr-file in.hocr -o out-openjpeg.pdf

Am I missing something, or is this a bug with either recode_pdf or the documentation?

Testing openjpeg directly

$ opj_compress -r 750 -i in.png -o out.jp2

3.0 MB -> 34,3 kB

Additional info

Linux Mint 20.2 AKA Ubuntu 20.04.3

openjpeg version

$ opj_compress -h | grep openjp2
It has been compiled against openjp2 library v2.3.1.

Test scan to experiment with

test_1.png.zip

pdfcomp: new tool, discussion, compression questions

The tool needs command line arguments much like recode_pdf (which we might want to rename), and those flags probably ought to be mostly shared.

Let's also use this to discuss issues of people testing pdfcomp now.

Add tests...

We have none, and it would be good to have some.

Support PDF generation/compression without hOCR files

This should be a no-brainer, but we need to deal with a few things:

  • We use hOCR files to estimate the page size based on the DPI encoded in the hOCR files (if present), otherwise we estimate it.
  • The code that generates the initial PDF with text layer obviously relies on hOCR. We could just make a PDF with empty pages of the right size as alternative when we have no hOCR.
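Estimating the empty-page size without hOCR is straightforward given the image dimensions and a DPI (read from the image, or assumed): PDF user space is 72 points per inch. A sketch (the function name is illustrative):

```python
def page_size_points(width_px, height_px, dpi=300.0):
    # PDF user space units are 1/72 inch, so points = pixels / dpi * 72.
    return (width_px / dpi * 72.0, height_px / dpi * 72.0)
```

For example, a 2480x3508 px scan at 300 DPI maps to roughly 595x842 points, i.e. A4.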

Define scope of tooling and work to improve for that scope

Right now the tool naming is a bit confusing. The main tool is called "recode_pdf", but it really doesn't do PDF recoding; it does PDF creation, inserts text layers, and performs MRC compression.

Since I am working on adding a tool to actually recode existing PDFs (MRC compressing them, and not doing anything else for starters), it might make sense to think about renaming the tool names, but also define what the tools ought to do.

I think there are a few scenarios:

  • Given a set of images (and hOCR results), create a (compressed) PDF - like what ocrmypdf does.
  • Given an input PDF with just one image per page, do what the above step does.
  • Given an uncompressed PDF, compress (recode) the PDF. Optional features here are to (1) insert a text layer (2) make the PDF PDF/A compatible

Can others think of other scenarios?

I guess there could be a tool that also incorporates calling Tesseract, but I think that should probably be out of scope for this particular project (I am interested in building public tooling for this, just not in the scope of this repo).

Run noise estimation on a part of the image

We probably only need to analyse a part of the image to get a decent sense of camera (or other) noise. Running it on the whole image takes quite some time (it's the most costly operation currently).
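A centre crop is probably representative enough for camera noise; a sketch of cropping before estimation, with a simple first-difference proxy standing in for the actual Gaussian noise estimator (crop size and metric are assumptions for illustration):

```python
import numpy as np

def estimate_noise(img, crop=256):
    """Estimate noise on a centre crop instead of the full image."""
    h, w = img.shape[:2]
    y0 = max((h - crop) // 2, 0)
    x0 = max((w - crop) // 2, 0)
    patch = img[y0:y0 + crop, x0:x0 + crop].astype(np.float64)
    # Proxy metric: std of horizontal first differences; flat regions expose noise.
    return float(np.diff(patch, axis=1).std())
```

Running the real estimator on such a patch instead of the full page would cut its cost roughly by the area ratio.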

Missing test suite?

It looks like archive-pdf-tools currently does not have an automated test suite.
I know that not all developers like to work this way, but I think providing a test suite can be very advantageous for quality assurance, to make sure the library works equally well on different platforms. It may also be helpful to verify changes for correctness and avoid regressions.
pytest is a popular choice as a test framework among open-source Python projects, for instance.
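A first pytest module could start small, e.g. exercising a pure helper; this is a hypothetical example (the helper is a stand-in, not an actual archive-pdf-tools API):

```python
# test_masks.py -- run with: pytest test_masks.py
def make_mask(pixels, threshold=128):
    """Stand-in for the mask step: 1 where a pixel is dark enough to count as text."""
    return [1 if p < threshold else 0 for p in pixels]

def test_make_mask():
    assert make_mask([0, 200, 127, 128]) == [1, 0, 1, 0]
```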

First recode_pdf test: 'numpy' has no attribute 'int'.

Just followed the install instructions, but the test recode_pdf --version gets a numpy-related error:

david@DESKTOP5:~/src/jbig2enc$ recode_pdf --version
Traceback (most recent call last):
  File "/home/david/.local/bin/recode_pdf", line 4, in <module>
    from internetarchivepdf.recode import recode
  File "/home/david/.local/lib/python3.10/site-packages/internetarchivepdf/__init__.py", line 2, in <module>
    from . import mrc
  File "/home/david/.local/lib/python3.10/site-packages/internetarchivepdf/mrc.py", line 36, in <module>
    from optimiser import optimise_gray, optimise_rgb, optimise_gray2, optimise_rgb2, fast_mask_denoise
  File "cython/optimiser.pyx", line 11, in init optimiser
  File "/home/david/.local/lib/python3.10/site-packages/numpy/__init__.py", line 305, in __getattr__
    raise AttributeError(__former_attrs__[attr])
AttributeError: module 'numpy' has no attribute 'int'.
`np.int` was a deprecated alias for the builtin `int`. To avoid this error in existing code, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

Thanks.

Lot of fuzz in background picture

Hi Merlijn,

I like this repo as it looks like the first serious open source MRC PDF solution I've found. I recently filed an issue with didjvu, which has equally bad background fuzz; that was diminished by using the better DjVu algorithm from c44 instead of DjVuMake for removing the surrounding pixels from characters:

jwilk/didjvu#18

I guess when you try the picture over there you'll find a similarly fuzzy background with this MRC PDF compressor. It might be interesting to study the algorithm in the open source c44 to better separate the foreground from the background.

PDF/UA improvements

VeraPDF now supports PDF/UA verification:

~/verapdf/verapdf --format xml --flavour ua1 /tmp/test.pdf  > /tmp/out.xml

We should fix the problems that it finds with our PDFs, I suspect that this will also help with the problems that Adobe finds.

This means at least:

  • Add the primary language
  • Mark Figures as Artifacts
  • Add alt text to Figures (we might not need to if we mark them as Artifacts)
  • Define the language for text blocks
  • Potentially indicate the reading order?

Use "linear" option from new pymupdf (if it doesn't break metadata writing)

This option could be used:

linear (bool) – Save a linearised version of the document. This option creates a file format for improved performance for Internet access. Excludes “incremental”.

Last time I tried to use it, it heavily broke evince/poppler, so we might need to file bugs with them first.
