Describe the bug I am not sure what are the limits of ocrmypdf. I

[Bug]: ocrmypdf invoked oom-killer (closed)

munzirtaha commented on June 3, 2024

[Bug]: ocrmypdf invoked oom-killer


Comments (4)

jbarlow83 commented on June 3, 2024

OCRmyPDF uses all available resources by design, so it is the process most likely to be targeted by the OOM killer (a busy process tree using lots of CPU and memory). It's precisely the process that ought to be killed in a system under memory pressure.

Your file has some large images at high resolution (11000x14000 @ 400 dpi is roughly 28x35 inches).
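The physical size implied by a pixel count and a resolution is simple arithmetic, and the same numbers give a feel for why such pages are heavy on memory. The following is a minimal sketch, not anything taken from OCRmyPDF itself: the function names are hypothetical, and the memory figure assumes a single uncompressed 8-bit RGB buffer, which is only a lower bound on what a worker process actually needs.

```python
def physical_size_inches(width_px, height_px, dpi):
    """Return (width, height) in inches for a raster of the given pixel size."""
    return width_px / dpi, height_px / dpi


def raw_size_mib(width_px, height_px, channels=3, bytes_per_channel=1):
    """Size of one uncompressed pixel buffer in MiB (8-bit RGB by default)."""
    return width_px * height_px * channels * bytes_per_channel / 2**20


w_in, h_in = physical_size_inches(11000, 14000, 400)
print(f"{w_in:.1f} x {h_in:.1f} inches")                 # 27.5 x 35.0 inches
print(f"{raw_size_mib(11000, 14000):.0f} MiB per RGB page")
```

With several workers each holding a page of this size, memory use multiplies quickly.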

You can use --jobs to limit the number of simultaneous worker processes, and there are also options to adjust behavior on large images, including skipping them outright.
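As a rough illustration of the advice above, the sketch below builds an ocrmypdf command line that caps the number of workers with --jobs. The comment does not name the large-image options; --skip-big (skip OCR on pages above a size in megapixels) is used here as one assumed example. The helper names, the file names, and the 50-megapixel threshold are illustrative choices, not recommendations from the thread.

```python
import subprocess


def build_ocrmypdf_command(input_pdf, output_pdf, jobs=1, skip_big_mpixels=None):
    """Build an ocrmypdf argv list that caps the number of worker processes."""
    cmd = ["ocrmypdf", "--jobs", str(jobs)]
    if skip_big_mpixels is not None:
        # Assumed option: skip OCR on pages larger than this many megapixels.
        cmd += ["--skip-big", str(skip_big_mpixels)]
    cmd += [input_pdf, output_pdf]
    return cmd


def run_ocrmypdf(cmd):
    """Run the command; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(cmd, check=True)


# Only print the command here; call run_ocrmypdf(cmd) to actually execute it.
cmd = build_ocrmypdf_command("03.pdf", "03_ocr.pdf", jobs=2, skip_big_mpixels=50)
print(" ".join(cmd))
```

Fewer jobs trade speed for a lower peak memory footprint, since fewer pages are rasterized at the same time.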


munzirtaha commented on June 3, 2024

Thank you so much for the explanation. I managed to avoid the OOM killer by increasing the swap space. However, I still wonder whether there is a bug here: the increase in file size seems far too large.

BIG2                 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
os.symlink(/tmp/ocrmypdf.io.kuutl_np/optimize.opt.pdf, /tmp/ocrmypdf.io.kuutl_np/optimize.pdf)                                                            helpers.py:178
Running: ['jbig2', '--version']                                                                                                                          __init__.py:134
Running: ['pngquant', '--version']                                                                                                                       __init__.py:134
Image optimization ratio: 1.95 savings: 48.7%                                                                                                           _pipeline.py:904
Total file size ratio: 0.15 savings: -573.8%                                                                                                            _pipeline.py:907
/tmp/ocrmypdf.io.kuutl_np/optimize.pdf -> 03_ocr.pdf                                                                                                    _pipeline.py:979
The output file size is 6.74× larger than the input file.                                                                                             _validation.py:375
Possible reasons for this include:
--force-ocr was issued, causing transcoding.

> ls 03*
╭───┬────────────┬──────┬───────────┬────────────────╮
│ # │    name    │ type │   size    │    modified    │
├───┼────────────┼──────┼───────────┼────────────────┤
│ 0 │ 03.pdf     │ file │  48.9 MiB │ 11 months ago  │
│ 1 │ 03_ocr.pdf │ file │ 329.7 MiB │ 12 minutes ago │
╰───┴────────────┴──────┴───────────┴────────────────╯
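The figures in the log can be cross-checked against the file listing. The sketch below assumes that the logged "ratio" is input size divided by output size and that "savings" is one minus the inverse of that ratio; this reading is an assumption, chosen because it is consistent with the numbers shown. The function names are hypothetical.

```python
def size_report(input_size, output_size):
    """Return (ratio, savings, growth) for an input and an output file size."""
    ratio = input_size / output_size          # below 1.0 means the output grew
    savings = 1 - output_size / input_size    # negative means the output grew
    growth = output_size / input_size
    return ratio, savings, growth


def savings_from_ratio(ratio):
    """Convert a size ratio (before / after) into fractional savings."""
    return 1 - 1 / ratio


# Sizes in MiB as shown by ls; they are rounded, so results are approximate.
ratio, savings, growth = size_report(48.9, 329.7)
print(f"ratio {ratio:.2f}, savings {savings:.1%}, growth {growth:.2f}x")
print(f"image savings {savings_from_ratio(1.95):.1%}")
```

This gives a ratio of about 0.15 and a growth of about 6.74x, matching the log. The computed savings land near, but not exactly on, the logged -573.8% because the listed sizes are rounded to one decimal place.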

The original image list is:

> pdfimages -list 03.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    2000  2588  rgb     3   8  jpeg   no         8  0    72    72  267K 1.8%
   2     1 image    2000  2588  gray    1   8  jpeg   no        22  0    72    72 42.4K 0.8%

After running ocrmypdf, they are upsampled to a huge 11111x14378 pixels, at 400 ppi instead of the original 72 ppi.

> pdfimages -list 03_ocr.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image   11111 14378  rgb     3   8  jpeg   no       707  0   400   400 2435K 0.5%
   2     1 image   11111 14378  gray    1   8  jpeg   no         4  0   400   400 87.9K 0.1%
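The new dimensions follow directly from re-rendering the same physical page at 400 ppi instead of 72 ppi. A minimal sketch of that arithmetic (the function name is hypothetical):

```python
def rescale_pixels(pixels, src_ppi, dst_ppi):
    """Pixel count needed to cover the same physical length at a new resolution."""
    return round(pixels * dst_ppi / src_ppi)


width = rescale_pixels(2000, 72, 400)     # 11111
height = rescale_pixels(2588, 72, 400)    # 14378

# The pixel count grows with the square of the resolution ratio.
pixel_growth = (width * height) / (2000 * 2588)
print(width, height, f"{pixel_growth:.1f}x more pixels")
```

Each page ends up with roughly 31 times as many pixels, which accounts for both the memory pressure and much of the growth in file size.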


jbarlow83 commented on June 3, 2024

Can't be sure without reviewing the actual file, but most likely those pages contain vector artwork, so the whole page had to be promoted to high resolution. Using --force-ocr asks for this to be done. There's an argument to control the DPI used.
The other option in this sort of situation is to use really aggressive optimization.


munzirtaha commented on June 3, 2024

I managed to get a better result by using
pdfimages -all; img2pdf *.jpg | ocrmypdf - --output-type pdf
This results in a smaller file, better accuracy, and lower resource usage. Here I don't need to use --force-ocr. Originally, running ocrmypdf directly without --force-ocr failed with this error:

ocrmypdf.exceptions.PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR;
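The one-line command above is shorthand: it omits the input file and image prefix for pdfimages and the output file for ocrmypdf. A fuller version of the same extract, wrap, and OCR workflow might look like the sketch below. It assumes the embedded images are JPEGs (as in the pdfimages listing), uses an intermediate PDF instead of a pipe, and the helper names, the "page" prefix, and the "images.pdf" file name are all hypothetical.

```python
import glob
import subprocess


def extract_images_cmd(input_pdf, prefix):
    # pdfimages -all writes each embedded image in its native format
    # as <prefix>-NNN.<ext>.
    return ["pdfimages", "-all", input_pdf, prefix]


def img2pdf_cmd(image_files, wrapped_pdf):
    # img2pdf wraps the images in a PDF without re-encoding them.
    return ["img2pdf", *image_files, "-o", wrapped_pdf]


def ocr_cmd(wrapped_pdf, output_pdf):
    # No --force-ocr needed: the image-only PDF has no existing text layer.
    return ["ocrmypdf", "--output-type", "pdf", wrapped_pdf, output_pdf]


def run_workflow(input_pdf, output_pdf, prefix="page", wrapped_pdf="images.pdf"):
    """Extract images, wrap them in a PDF, then OCR that PDF."""
    subprocess.run(extract_images_cmd(input_pdf, prefix), check=True)
    images = sorted(glob.glob(f"{prefix}-*.jpg"))
    subprocess.run(img2pdf_cmd(images, wrapped_pdf), check=True)
    subprocess.run(ocr_cmd(wrapped_pdf, output_pdf), check=True)
```

Calling run_workflow("03.pdf", "03_ocr.pdf") would run the three tools in sequence. Note that this route keeps only the embedded images, so any vector artwork or existing text in the original PDF is dropped.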

jbarlow83 wrote: "There's an argument to control the DPI"

But --image-dpi seems to work only with image inputs, not PDFs.

If the proper workflow is to extract the images with pdfimages and wrap them with img2pdf before running ocrmypdf, I can live with that. But if ocrmypdf should be able to do this directly and there is something that could be improved, I will be glad to share the file privately.

