
Comments (4)

jbarlow83 commented on June 3, 2024

OCRmyPDF uses all available resources by design, so it is the process most likely to be targeted by the OOM killer (a busy process tree using lots of CPU and memory). It's precisely the process that ought to be killed in a system under memory pressure.

Your file has some large images at high resolution (11000x14000 @ 400dpi is ~28x36 inches).

You can use --jobs to limit the number of simultaneous worker processes, and there are also options to adjust behavior on large images, including skipping them outright.
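As a rough illustration of the options mentioned above, the following sketch builds an ocrmypdf command line that caps the worker count with `--jobs` and skips OCR on very large pages with `--skip-big` (a megapixel threshold). The helper name `limited_ocr_cmd` and the example values are assumptions for illustration, not recommended settings.

```python
def limited_ocr_cmd(src, dst, jobs=2, skip_big_mpixels=None):
    """Build an ocrmypdf argument list that limits resource use.

    jobs: maximum number of simultaneous worker processes (--jobs).
    skip_big_mpixels: if set, pages larger than this many megapixels
    are passed through without OCR (--skip-big).
    """
    cmd = ["ocrmypdf", "--jobs", str(jobs)]
    if skip_big_mpixels is not None:
        cmd += ["--skip-big", str(skip_big_mpixels)]
    cmd += [src, dst]
    return cmd


# Hypothetical values: one worker, skip pages above 100 megapixels.
print(" ".join(limited_ocr_cmd("03.pdf", "03_ocr.pdf", jobs=1, skip_big_mpixels=100)))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where ocrmypdf is installed.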


munzirtaha commented on June 3, 2024

Thank you so much for the explanation. I managed to avoid the OOM killer by increasing the swap space. However, I still wonder whether there is a bug here: the increase in file size seems excessive.

BIG2                 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
os.symlink(/tmp/ocrmypdf.io.kuutl_np/optimize.opt.pdf, /tmp/ocrmypdf.io.kuutl_np/optimize.pdf)                                                            helpers.py:178
Running: ['jbig2', '--version']                                                                                                                          __init__.py:134
Running: ['pngquant', '--version']                                                                                                                       __init__.py:134
Image optimization ratio: 1.95 savings: 48.7%                                                                                                           _pipeline.py:904
Total file size ratio: 0.15 savings: -573.8%                                                                                                            _pipeline.py:907
/tmp/ocrmypdf.io.kuutl_np/optimize.pdf -> 03_ocr.pdf                                                                                                    _pipeline.py:979
The output file size is 6.74× larger than the input file.                                                                                             _validation.py:375
Possible reasons for this include:
--force-ocr was issued, causing transcoding.

> ls 03*
╭───┬────────────┬──────┬───────────┬────────────────╮
│ # │    name    │ type │   size    │    modified    │
├───┼────────────┼──────┼───────────┼────────────────┤
│ 0 │ 03.pdf     │ file │  48.9 MiB │ 11 months ago  │
│ 1 │ 03_ocr.pdf │ file │ 329.7 MiB │ 12 minutes ago │
╰───┴────────────┴──────┴───────────┴────────────────╯
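The figures in the log can be cross-checked against the two sizes listed above. This is a small sketch that assumes the logged "ratio" is input size divided by output size, which is what the logged value of 0.15 suggests. It uses the rounded MiB values from the listing, so the computed savings differ slightly from the logged -573.8%.

```python
# Sizes in MiB, taken from the directory listing above.
input_mib = 48.9
output_mib = 329.7

growth = output_mib / input_mib   # how many times larger the output is
ratio = input_mib / output_mib    # assumed meaning of "Total file size ratio"
savings = 1 - growth              # negative when the output is larger

print(f"{growth:.2f}x larger, ratio {ratio:.2f}, savings {savings:.1%}")
```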

The original image list is:

> pdfimages -list 03.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    2000  2588  rgb     3   8  jpeg   no         8  0    72    72  267K 1.8%
   2     1 image    2000  2588  gray    1   8  jpeg   no        22  0    72    72 42.4K 0.8%
 

After running ocrmypdf, they are upscaled to 11111x14378 pixels at 400 ppi instead of the original 2000x2588 at 72 ppi.

> pdfimages -list 03_ocr.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image   11111 14378  rgb     3   8  jpeg   no       707  0   400   400 2435K 0.5%
   2     1 image   11111 14378  gray    1   8  jpeg   no         4  0   400   400 87.9K 0.1%
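The new dimensions follow from keeping the physical page size fixed while changing the resolution from 72 to 400 ppi. The sketch below reproduces that arithmetic and estimates the uncompressed size of one RGB page, which gives a sense of the memory pressure involved. The function name is hypothetical, and the raw-size estimate assumes 3 bytes per pixel with no compression.

```python
def rasterized_size(width_px, height_px, src_ppi, dst_ppi):
    """Pixel dimensions after resampling to dst_ppi at the same physical size."""
    scale = dst_ppi / src_ppi
    return round(width_px * scale), round(height_px * scale)


w, h = rasterized_size(2000, 2588, src_ppi=72, dst_ppi=400)
page_inches = (2000 / 72, 2588 / 72)   # physical size implied by the 72 ppi listing
raw_mib = w * h * 3 / 2**20            # uncompressed 8-bit RGB, 3 bytes per pixel

print(w, h)                            # 11111 14378
print(f"{page_inches[0]:.1f} x {page_inches[1]:.1f} in, {raw_mib:.0f} MiB uncompressed")
```

Each such page is roughly 160 megapixels, so several worker processes holding one decompressed page each can plausibly exhaust memory on a modest machine.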


jbarlow83 commented on June 3, 2024

Can't be sure without reviewing the actual file, but most likely those pages contain vector artwork, so the whole page had to be promoted to high resolution. Using --force-ocr asks for this to be done. There's an argument to control the DPI used.
The other option in this sort of situation is to use really aggressive optimization.
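One way the "aggressive optimization" route might look is sketched below, using ocrmypdf's `--optimize 3` level together with `--jpeg-quality`. The helper name and the quality value of 50 are assumptions for illustration. The comment above does not say which settings were meant, and it does not name the DPI argument, so none is shown here.

```python
def aggressive_optimize_cmd(src, dst, jpeg_quality=50):
    """Build an ocrmypdf argument list using the most aggressive optimization level.

    --optimize 3 enables lossy optimizations; --jpeg-quality sets the
    target JPEG quality (the default of 50 here is an arbitrary example).
    """
    return [
        "ocrmypdf",
        "--force-ocr",
        "--optimize", "3",
        "--jpeg-quality", str(jpeg_quality),
        src,
        dst,
    ]


print(" ".join(aggressive_optimize_cmd("03.pdf", "03_ocr.pdf")))
```

Lower JPEG quality trades visible artifacts for file size, so the value is worth checking against a sample page before processing a batch.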


munzirtaha commented on June 3, 2024

I managed to get a better result by using:
pdfimages -all; img2pdf *.jpg | ocrmypdf - --output-type pdf
This gives a smaller file and better accuracy, and it uses fewer resources. Here I don't need --force-ocr. Originally, running ocrmypdf directly without --force-ocr failed with this error:

ocrmypdf.exceptions.PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR;
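The extract-then-rebuild workflow described above can be spelled out step by step. The following is a sketch under stated assumptions: the file names, the `img` prefix, and the helper names are hypothetical, the extracted images are assumed to be JPEGs, and an intermediate PDF file is used instead of a pipe. Note that this route keeps only the embedded images, so any vector artwork or existing text in the original PDF is discarded.

```python
import subprocess
from pathlib import Path


def extract_cmd(src, root):
    # pdfimages -all writes each embedded image in its native format as root-NNN.ext
    return ["pdfimages", "-all", src, root]


def merge_cmd(images, merged):
    # img2pdf wraps the images into a PDF without re-encoding them;
    # sorting relies on pdfimages' zero-padded numbering for page order
    return ["img2pdf", *sorted(images), "-o", merged]


def ocr_cmd(merged, dst):
    # The rebuilt PDF has no text layer, so --force-ocr is not needed
    return ["ocrmypdf", "--output-type", "pdf", merged, dst]


def run_workflow(src, dst, workdir):
    """Run all three steps; requires pdfimages, img2pdf and ocrmypdf on PATH."""
    work = Path(workdir)
    subprocess.run(extract_cmd(src, str(work / "img")), check=True)
    images = [str(p) for p in work.glob("img-*.jpg")]
    merged = str(work / "images.pdf")
    subprocess.run(merge_cmd(images, merged), check=True)
    subprocess.run(ocr_cmd(merged, dst), check=True)
```

A call such as `run_workflow("03.pdf", "03_ocr.pdf", "/tmp/work")` would run the three tools in sequence, assuming the working directory already exists.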

"There's an argument to control the DPI"

But --image-dpi seems to work only with image input, not PDF.

If the proper workflow is to extract the images with pdfimages and rebuild the PDF with img2pdf before running ocrmypdf, I can live with that. But if ocrmypdf should be able to do this directly and there is something that could be improved, I will be glad to share the file privately.
