Comments (4)
OCRmyPDF uses all available resources by design, so it is the process most likely to be targeted by the OOM killer (a busy process tree using lots of CPU and memory). It's precisely the process that ought to be killed in a system under memory pressure.
Your file has some large images at high resolution (11000x14000 pixels at 400 dpi is roughly 28x35 inches).
You can use --jobs to limit the number of simultaneous worker processes, and there are also options to adjust behavior on large images, including skipping them outright.
from ocrmypdf.
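As a rough illustration of the advice above, the sketch below assembles an ocrmypdf command line that caps worker processes with --jobs and skips OCR on very large pages with --skip-big. The helper name build_ocrmypdf_command, the file names, and the limits used (2 jobs, 50 megapixels) are assumptions for illustration, not values recommended anywhere in this thread.

```python
import shlex


def build_ocrmypdf_command(input_pdf, output_pdf, jobs=2, skip_big_mpixels=None):
    """Return an argv list for ocrmypdf (hypothetical helper, not part of OCRmyPDF)."""
    # --jobs limits the number of simultaneous worker processes.
    cmd = ["ocrmypdf", "--jobs", str(jobs)]
    if skip_big_mpixels is not None:
        # --skip-big skips OCR on pages whose images exceed this many megapixels.
        cmd += ["--skip-big", str(skip_big_mpixels)]
    cmd += [input_pdf, output_pdf]
    return cmd


# Assumed values: 2 workers and a 50-megapixel threshold.
command = build_ocrmypdf_command("03.pdf", "03_ocr.pdf", jobs=2, skip_big_mpixels=50)
print(shlex.join(command))
```

Fewer jobs means fewer pages held in memory at once, at the cost of a longer run; the right values depend on the machine and the file.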
Thank you so much for the explanation. I managed to avoid the OOM killer by increasing the swap space. However, I still wonder whether there is a bug here: the increase in file size seems far too large.
BIG2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
os.symlink(/tmp/ocrmypdf.io.kuutl_np/optimize.opt.pdf, /tmp/ocrmypdf.io.kuutl_np/optimize.pdf) helpers.py:178
Running: ['jbig2', '--version'] __init__.py:134
Running: ['pngquant', '--version'] __init__.py:134
Image optimization ratio: 1.95 savings: 48.7% _pipeline.py:904
Total file size ratio: 0.15 savings: -573.8% _pipeline.py:907
/tmp/ocrmypdf.io.kuutl_np/optimize.pdf -> 03_ocr.pdf _pipeline.py:979
The output file size is 6.74× larger than the input file. _validation.py:375
Possible reasons for this include:
--force-ocr was issued, causing transcoding.
> ls 03*
╭───┬────────────┬──────┬───────────┬────────────────╮
│ # │ name │ type │ size │ modified │
├───┼────────────┼──────┼───────────┼────────────────┤
│ 0 │ 03.pdf │ file │ 48.9 MiB │ 11 months ago │
│ 1 │ 03_ocr.pdf │ file │ 329.7 MiB │ 12 minutes ago │
╰───┴────────────┴──────┴───────────┴────────────────╯
The original image list is:
> pdfimages -list 03.pdf
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1 0 image 2000 2588 rgb 3 8 jpeg no 8 0 72 72 267K 1.8%
2 1 image 2000 2588 gray 1 8 jpeg no 22 0 72 72 42.4K 0.8%
After running ocrmypdf, they are upscaled to 11111x14378 pixels at 400 ppi instead of the original 72 ppi.
> pdfimages -list 03_ocr.pdf
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1 0 image 11111 14378 rgb 3 8 jpeg no 707 0 400 400 2435K 0.5%
2 1 image 11111 14378 gray 1 8 jpeg no 4 0 400 400 87.9K 0.1%
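The numbers in the two listings are consistent with the same physical page being re-rendered at a higher resolution. The sketch below redoes that arithmetic; the function name rasterized_size is hypothetical, and the memory estimate assumes 3 components at 8 bits each, as reported by pdfimages for page 1.

```python
def rasterized_size(width_px, height_px, source_ppi, target_ppi):
    """Pixel dimensions of the same physical area rendered at target_ppi."""
    width_in = width_px / source_ppi    # physical width in inches
    height_in = height_px / source_ppi  # physical height in inches
    return round(width_in * target_ppi), round(height_in * target_ppi)


# Values taken from the pdfimages listings above.
new_w, new_h = rasterized_size(2000, 2588, source_ppi=72, target_ppi=400)
pixels = new_w * new_h
growth = pixels / (2000 * 2588)  # how many times more pixels than the original
raw_rgb_bytes = pixels * 3       # assumes 3 components, 8 bits each, uncompressed

print(f"{new_w}x{new_h} px, {growth:.1f}x the pixels, "
      f"{raw_rgb_bytes / 2**20:.0f} MiB uncompressed")
```

A 2000x2588 image at 72 ppi covers about 27.8x35.9 inches, so rendering that area at 400 ppi yields the 11111x14378 size shown. Each such page decodes to several hundred MiB of raw RGB data, which fits both the memory pressure and the larger output file.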
Can't be sure without reviewing the actual file, but most likely those pages contain vector artwork, so the whole page had to be promoted to high resolution. Using --force-ocr asks for this to be done. There's an argument to control the DPI used.
The other option in this sort of situation is to use really aggressive optimization.
I managed to get a better result by using
pdfimages -all; img2pdf *.jpg | ocrmypdf - --output-type pdf
This gives a smaller file and better accuracy, and uses fewer resources. Here I don't need to use --force-ocr. Originally, running ocrmypdf directly without --force-ocr failed with this error:
ocrmypdf.exceptions.PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR;
There's an argument to control the DPI
But --image-dpi seems to work only with images, not PDF.
If the proper workflow is to convert to images first and use pdfimages and img2pdf before using ocrmypdf, I can live with that, but if ocrmypdf should be able to do this directly and there is something that could be improved, I will be glad to share the file privately.
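For readers who want to script the extract-then-OCR workaround described in this comment, here is a minimal sketch using subprocess. The helper names, the image root "page", and the file names are assumptions; the command in the comment omits the input and output arguments, so those are filled in here only for illustration. It also assumes every extracted image is a JPEG, as in the listing for 03.pdf.

```python
import glob
import subprocess


def extract_cmd(input_pdf, image_root):
    # pdfimages -all writes embedded images in their native format as
    # <image_root>-NNN.<ext>.
    return ["pdfimages", "-all", input_pdf, image_root]


def assemble_cmd(image_files):
    # Without -o, img2pdf writes the resulting PDF to standard output.
    return ["img2pdf", *sorted(image_files)]


def ocr_cmd(output_pdf):
    # "-" tells ocrmypdf to read the input PDF from standard input.
    return ["ocrmypdf", "-", output_pdf, "--output-type", "pdf"]


def run_workflow(input_pdf, output_pdf, image_root="page"):
    """Hypothetical wrapper: extract images, rebuild a PDF, then OCR it."""
    subprocess.run(extract_cmd(input_pdf, image_root), check=True)
    images = glob.glob(f"{glob.escape(image_root)}-*.jpg")
    if not images:
        raise RuntimeError("no JPEG images were extracted")
    assembled = subprocess.run(assemble_cmd(images), check=True, capture_output=True)
    subprocess.run(ocr_cmd(output_pdf), input=assembled.stdout, check=True)


# Example call (requires pdfimages, img2pdf and ocrmypdf on PATH):
# run_workflow("03.pdf", "03_ocr.pdf")
```

This route discards any existing text layer and vector content, since only the embedded images are carried over, which is why --force-ocr is not needed.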
Related Issues (20)
- Trying to debug OCR_ON_SUCCESS_DELETE flag not being executed - add exit code to watcher.py? HOT 2
- [Bug]: Watcher doesnt notice changes after update
- [Bug]: version confusion HOT 1
- [Bug]: OCRmyPDF succeeded with warning(s): InputFileError: pdfminer could not process page 0 HOT 1
- Error: jbig2 not found on path, even though installed HOT 4
- [Bug]: OCRmyPDF Docker Hot Folder Option OCR_ON_SUCCESS_ARCHIVE OCR_ON_SUCCESS_DELETE doesnt work
- [Bug]: dpi-problem with rasterizing text HOT 5
- [Bug]: Ghostscript PDF/A rendering failed HOT 1
- [Bug]: "Corrupt JPEG data: premature end of data segment" with some files
- [Bug]: AttributeError: 'NoneType' object has no attribute 'get'
- [Bug]: Missing support for certain unicode characters HOT 4
- Recommended settings for dealing with text superimposed on clipart? HOT 1
- [Bug]: The file size increases significantly by OCR even without image recompression HOT 2
- Allow resuming OCR after DecompressionBombError HOT 3
- [Bug] SubprocessOutputError HOT 2
- [Feature]: Choose between NFKC and NFC normalization for Unicode characters so copy-pasting works HOT 5
- max_workers must be greater than 0 HOT 2
- [Feature]: Could watcher.py be enhanced to support the conversion of single or multi TIF and JPG files to PDF?
- [Bug]: DecompressionBombWarning HOT 1
- [Bug]: Memory Error