Comments (4)
There's no official way, but you can try (ab)using the --tesseract-config argument, which forwards one argument at a time to tesseract.
For example, for a single line of text:
ocrmypdf [other options] --tesseract-config '--psm' --tesseract-config '7'
I'm not sure I'd implement this, since most PDF images contain a full page of text, not a single line or word.
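One caveat with passing '--psm' this way, assuming the command line is parsed with Python's argparse (the error format reported later in this thread suggests so, but the actual parser definition is not shown here): a value that begins with dashes is read as another option rather than as a value. The sketch below uses a hypothetical stand-in parser, `parse_tesseract_config`, that borrows only the option name from the thread. It shows plain argparse behavior and has not been checked against OCRmyPDF itself.

```python
import argparse
import contextlib
import io


def parse_tesseract_config(argv):
    """Return the collected --tesseract-config values, or None if parsing fails."""
    # Hypothetical stand-in for the real parser; only the option name comes
    # from the thread. action="append" collects one value per occurrence.
    parser = argparse.ArgumentParser(prog="ocrmypdf-sketch")
    parser.add_argument("--tesseract-config", action="append", default=[])
    try:
        # argparse writes its error to stderr and then exits; capture both.
        with contextlib.redirect_stderr(io.StringIO()):
            return parser.parse_args(argv).tesseract_config
    except SystemExit:
        return None


# A value that starts with dashes and contains no space looks like another
# option, so argparse rejects it with "expected one argument".
rejected = parse_tesseract_config(
    ["--tesseract-config", "--psm", "--tesseract-config", "7"]
)

# A value containing a space is treated as an ordinary value and accepted,
# but it arrives as the single string "--psm 4".
accepted = parse_tesseract_config(["--tesseract-config", "--psm 4"])

# The "=" form attaches the value explicitly, so leading dashes are accepted.
explicit = parse_tesseract_config(
    ["--tesseract-config=--psm", "--tesseract-config=7"]
)
```

Under these assumptions, the `=` form is the only one of the three that yields separate `--psm` and `7` values. Whether the program then forwards them to tesseract as intended is a separate question.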
Hey thanks for your help!
Using the command above, I get the error: ocrmypdf: error: argument --tesseract-config: expected one argument
And using the command ocrmypdf [other options] --tesseract-config '--psm 4'
it produces a conversion error:
________________________________________
Tasks which will be run:
Task enters queue = 'ocrmypdf.main.repair_pdf'
[{'height_pixels': 1887, 'height_inches': Decimal('6.29'), 'width_inches': Decimal('3.15'), 'has_text': False, 'xres': Decimal('299.683'), 'yres': Decimal('3E+2'), 'width_pixels': 944, 'images': [{'enc': 'jpeg', 'dpi': Decimal('299.841'), 'color': 'rgb', 'width': 944, 'comp': 3, 'bpc': 8, 'height': 1887, 'dpi_w': Decimal('299.683'), 'dpi_h': Decimal('3E+2')}], 'pageno': 0}, {'height_pixels': 1887, 'height_inches': Decimal('6.29'), 'width_inches': Decimal('3.15'), 'has_text': False, 'xres': Decimal('299.683'), 'yres': Decimal('3E+2'), 'width_pixels': 944, 'images': [{'enc': 'ccitt', 'dpi': Decimal('299.841'), 'color': 'gray', 'width': 944, 'comp': 1, 'bpc': 1, 'height': 1887, 'dpi_w': Decimal('299.683'), 'dpi_h': Decimal('3E+2')}], 'pageno': 1}]
Completed Task = 'ocrmypdf.main.repair_pdf'
Task enters queue = 'ocrmypdf.main.split_pages'
Task enters queue = 'ocrmypdf.main.generate_postscript_stub'
os.symlink(/tmp/com.github.ocrmypdf.muvt83ir/000002.page.pdf, /tmp/com.github.ocrmypdf.muvt83ir/000002.ocr.page.pdf)
os.symlink(/tmp/com.github.ocrmypdf.muvt83ir/000001.page.pdf, /tmp/com.github.ocrmypdf.muvt83ir/000001.ocr.page.pdf)
Completed Task = 'ocrmypdf.main.split_pages'
Task enters queue = 'ocrmypdf.main.rasterize_with_ghostscript'
Task enters queue = 'ocrmypdf.main.skip_page'
Uptodate Task = 'ocrmypdf.main.skip_page'
WARNING:
In Task 'ocrmypdf.main.skip_page':
No jobs were run because no file names matched.
Please make sure that the regular expression is correctly specified.
Rendering 000001.ocr.page.pdf with png16m
Completed Task = 'ocrmypdf.main.generate_postscript_stub'
Rendering 000002.ocr.page.pdf with pngmono
Completed Task = 'ocrmypdf.main.rasterize_with_ghostscript'
Task enters queue = 'ocrmypdf.main.preprocess_deskew'
os.symlink(/tmp/com.github.ocrmypdf.muvt83ir/000001.page.png, /tmp/com.github.ocrmypdf.muvt83ir/000001.pp-deskew.png)
os.symlink(/tmp/com.github.ocrmypdf.muvt83ir/000002.page.png, /tmp/com.github.ocrmypdf.muvt83ir/000002.pp-deskew.png)
Completed Task = 'ocrmypdf.main.preprocess_deskew'
Task enters queue = 'ocrmypdf.main.preprocess_clean'
os.symlink(/tmp/com.github.ocrmypdf.muvt83ir/000001.pp-deskew.png, /tmp/com.github.ocrmypdf.muvt83ir/000001.pp-clean.png)
os.symlink(/tmp/com.github.ocrmypdf.muvt83ir/000002.pp-deskew.png, /tmp/com.github.ocrmypdf.muvt83ir/000002.pp-clean.png)
Completed Task = 'ocrmypdf.main.preprocess_clean'
Task enters queue = 'ocrmypdf.main.select_image_for_pdf'
Task enters queue = 'ocrmypdf.main.ocr_tesseract_hocr'
os.symlink(/tmp/com.github.ocrmypdf.muvt83ir/000002.page.png, /tmp/com.github.ocrmypdf.muvt83ir/000002.image)
Completed Task = 'ocrmypdf.main.select_image_for_pdf'
Original exceptions:
Exception #1
'builtins.TypeError(Can't convert 'list' object to str implicitly)' raised in ...
Task = def ocrmypdf.main.ocr_tesseract_hocr(...):
Job = [.../com.github.ocrmypdf.muvt83ir/000001.pp-clean.png -> .../com.github.ocrmypdf.muvt83ir/000001.hocr, <ocrmypdf.main.WrappedLogger>, [{'height_pixels': 1887, 'height_inches': Decimal('6.29'), 'width_inches': Decimal('3.15'), 'has_text': False, 'xres': Decimal('299.683'), 'yres': Decimal('3E+2'), 'width_pixels': 944, 'images': [{'enc': 'jpeg', 'dpi': Decimal('299.841'), 'color': 'rgb', 'width': 944, 'comp': 3, 'bpc': 8, 'height': 1887, 'dpi_w': Decimal('299.683'), 'dpi_h': Decimal('3E+2')}], 'pageno': 0}, {'height_pixels': 1887, 'height_inches': Decimal('6.29'), 'width_inches': Decimal('3.15'), 'has_text': False, 'xres': Decimal('299.683'), 'yres': Decimal('3E+2'), 'width_pixels': 944, 'images': [{'enc': 'ccitt', 'dpi': Decimal('299.841'), 'color': 'gray', 'width': 944, 'comp': 1, 'bpc': 1, 'height': 1887, 'dpi_w': Decimal('299.683'), 'dpi_h': Decimal('3E+2')}], 'pageno': 1}], <_thread.lock>]
Traceback (most recent call last):
File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
register_cleanup, touch_files_only)
File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 567, in job_wrapper_io_files
ret_val = user_defined_work_func(*params)
File "/home/florian/Downloads/OCRmyPDF-3.1/ocrmypdf/main.py", line 560, in ocr_tesseract_hocr
universal_newlines=True)
File "/usr/lib/python3.4/subprocess.py", line 848, in __init__
restore_signals, start_new_session)
File "/usr/lib/python3.4/subprocess.py", line 1384, in _execute_child
restore_signals, start_new_session, preexec_fn)
TypeError: Can't convert 'list' object to str implicitly
Exception #2
'builtins.TypeError(Can't convert 'list' object to str implicitly)' raised in ...
Task = def ocrmypdf.main.ocr_tesseract_hocr(...):
Job = [.../com.github.ocrmypdf.muvt83ir/000002.pp-clean.png -> .../com.github.ocrmypdf.muvt83ir/000002.hocr, <ocrmypdf.main.WrappedLogger>, [{'height_pixels': 1887, 'height_inches': Decimal('6.29'), 'width_inches': Decimal('3.15'), 'has_text': False, 'xres': Decimal('299.683'), 'yres': Decimal('3E+2'), 'width_pixels': 944, 'images': [{'enc': 'jpeg', 'dpi': Decimal('299.841'), 'color': 'rgb', 'width': 944, 'comp': 3, 'bpc': 8, 'height': 1887, 'dpi_w': Decimal('299.683'), 'dpi_h': Decimal('3E+2')}], 'pageno': 0}, {'height_pixels': 1887, 'height_inches': Decimal('6.29'), 'width_inches': Decimal('3.15'), 'has_text': False, 'xres': Decimal('299.683'), 'yres': Decimal('3E+2'), 'width_pixels': 944, 'images': [{'enc': 'ccitt', 'dpi': Decimal('299.841'), 'color': 'gray', 'width': 944, 'comp': 1, 'bpc': 1, 'height': 1887, 'dpi_w': Decimal('299.683'), 'dpi_h': Decimal('3E+2')}], 'pageno': 1}], <_thread.lock>]
Traceback (most recent call last):
File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 751, in run_pooled_job_without_exceptions
register_cleanup, touch_files_only)
File "/usr/local/lib/python3.4/dist-packages/ruffus/task.py", line 567, in job_wrapper_io_files
ret_val = user_defined_work_func(*params)
File "/home/florian/Downloads/OCRmyPDF-3.1/ocrmypdf/main.py", line 560, in ocr_tesseract_hocr
universal_newlines=True)
File "/usr/lib/python3.4/subprocess.py", line 848, in __init__
restore_signals, start_new_session)
File "/usr/lib/python3.4/subprocess.py", line 1384, in _execute_child
restore_signals, start_new_session, preexec_fn)
TypeError: Can't convert 'list' object to str implicitly
It's not really about a one-line or one-word PDF. My problem is the automatic column detection, which ruins my OCR (the page is a mix of two-column and one-column text).
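The TypeError in the traceback above is raised inside subprocess. That is consistent with a list ending up nested inside the argument list handed to subprocess (for example through `append()` where `extend()` was intended). The OCRmyPDF 3.1 source is not shown in this thread, so that diagnosis is an assumption. Below is a minimal sketch of building a flat argument list. The helper name `build_tesseract_argv` and the file names are hypothetical, and this does not describe the fix that was later committed.

```python
import shlex


def build_tesseract_argv(base_argv, config_values):
    """Return a flat list of strings suitable for passing to subprocess.

    base_argv: the fixed part of the command (placeholder values below).
    config_values: extra values collected from the command line; each may
    hold several whitespace-separated tokens, such as "--psm 4".
    """
    argv = list(base_argv)
    for value in config_values:
        # extend(), not append(): append() would nest a list inside argv,
        # which subprocess rejects with a TypeError.
        argv.extend(shlex.split(value))
    return argv


# Placeholder command; the real invocation used by OCRmyPDF is not shown here.
argv = build_tesseract_argv(["tesseract", "page.png", "page"], ["--psm 4"])
```

Splitting each value with `shlex.split` means a quoted value such as '--psm 4' reaches the child process as two separate arguments, while single-token values pass through unchanged.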
Implemented in commit 8d323ae.
Officially released in v3.2