leofcardoso / pdf2pdfocr

A free tool to OCR a PDF and add a text "layer" in the original file, making a searchable PDF. Use only open source tools. Please tip!

License: Apache License 2.0

Python 94.98% Shell 0.24% Dockerfile 1.26% VBScript 3.53%

pdf ocr pdftk docker python tesseract

pdf2pdfocr's Issues

Multiple Files Together

Hey, is there any option to OCR multiple files together (or any script that I can use)? Doing them one by one takes a lot of time. Thanks for this awesome tool, btw!
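One way to approach this is a small wrapper that calls the script once per PDF in a folder. The sketch below is only an illustration: it assumes `pdf2pdfocr.py` is reachable under that name and uses only the `-i` flag that appears in the reports on this page; the helper names `command_for`, `build_commands` and `run_all` are made up here.

```python
from pathlib import Path
import subprocess


def command_for(pdf, script="pdf2pdfocr.py"):
    """Build the command line for one input file (only the -i flag is used)."""
    return ["python3", script, "-i", str(pdf)]


def build_commands(folder, script="pdf2pdfocr.py"):
    """Return one command per *.pdf file found directly inside `folder`."""
    return [command_for(pdf, script) for pdf in sorted(Path(folder).glob("*.pdf"))]


def run_all(folder, script="pdf2pdfocr.py"):
    """Run the commands one after another; map each input file to its exit code."""
    results = {}
    for cmd in build_commands(folder, script):
        results[cmd[-1]] = subprocess.run(cmd).returncode
    return results
```

Calling `run_all("scans")` would process every PDF in a hypothetical `scans` folder in sequence and report which files failed through their non-zero exit codes.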

Improve efficiency on PDFs which contain large amounts of text

If a PDF contains a large amount of text and a small amount of pictures, we only want to OCR the pictures. The script currently OCRs the whole pages, including any existing text, which is undesirable because of the CPU consumption, and degradation of existing text.

I want to implement a change (probably optional, enabled by a flag) to only run OCR on the images, not on any existing text. I would split the images away from the PDF using pdfimages, and then somehow re-create the layer sandwich using only the OCR generated for those images. The original text inside the files should be left untouched.

Do you have any pointers on doing this? I have a couple of ideas I want to investigate:

- process everything page by page
  - edit the PDF to make all text invisible (same color as background)
  - run pipeline as it is -- OCR should be faster, since most of the page is blank
  - recombine original PDF text & image layers with the new OCR layer overlay (still page by page)
  - still inefficient -- OCR needs to scan through a lot of empty pages
- process everything image by image
  - run pdfimages to extract images from PDF (along with page number, img size and coordinates)
  - maybe use pdf2html to get image location & position
  - create PDF sandwiches for each image separately (using pdf2pdfocr, of course)
  - re-combine them in the original PDF using pdfjam and pdftk
  - more efficient -- we don't give blank images to the OCR engine
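As a starting point for the image-by-image idea, the pages that actually contain images can be found from the output of `pdfimages -list` (poppler-utils). This is a rough sketch, not part of pdf2pdfocr: the helper names are invented, and the parser only assumes that each data row starts with an integer page number while the header and separator rows do not.

```python
import subprocess


def list_images(pdf_path):
    """Run `pdfimages -list` and return its raw text output."""
    result = subprocess.run(
        ["pdfimages", "-list", pdf_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def pages_with_images(listing):
    """Return the sorted page numbers that have at least one image row."""
    pages = set()
    for line in listing.splitlines():
        fields = line.split()
        # Data rows begin with the page number; header/separator rows do not.
        if fields and fields[0].isdigit():
            pages.add(int(fields[0]))
    return sorted(pages)


# Made-up listing, for illustration only.
sample = """page   num  type   width height color comp bpc  enc interp  object ID
--------------------------------------------------------------------
   1     0 image    1240  1754  gray    1   8  jpeg   no         7  0
   3     1 image     600   400  rgb     3   8  image  no        12  0
"""
print(pages_with_images(sample))
```

Pages that do not show up in the result could then be skipped entirely, leaving their existing text untouched.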

"-g grayscale" fail

script aborting

File doesn't pass PDF/A validation after OCR

File used PDF A-1b.pdf
Site for validation: VeraPDF Demo
Terminal output:

eduardo@000563-desk:~/Área\ de Trabalho/testepdf$ python3 ~/Área\ de\ Trabalho/pdf2pdfocr/pdf2pdfocr.py -w -o pdfabr.pdf -v -l por -i ~/Documentos/PDF\ A-1b.pdf
File: /home/eduardo/Documentos/PDF A-1b.pdf
[2023-02-13 10:25:11.326759] [DEBUG] Tesseract can 'textonly_pdf': True
[2023-02-13 10:25:11.329506] [DEBUG] Tesseract version: 4
[2023-02-13 10:25:11.329598] [DEBUG] cuneiform not available
[2023-02-13 10:25:11.336831] [DEBUG] Pdftoppm version: 22.2.0
[2023-02-13 10:25:11.340303] [DEBUG] Qpdf version: 10.6.3
[2023-02-13 10:25:11.340382] [DEBUG] Temp dir is /tmp/pdf2pdfocr_O4M39/
[2023-02-13 10:25:11.340396] [DEBUG] Prefix is O4M39
[2023-02-13 10:25:11.340413] [DEBUG] Script dir is /home/eduardo/Área de Trabalho/pdf2pdfocr/
[2023-02-13 10:25:11.340442] [DEBUG] Parallel operations will use 8 CPUs
[2023-02-13 10:25:11.349594] [LOG] Welcome to pdf2pdfocr version 1.12.0 marapurense  - https://github.com/LeoFCardoso/pdf2pdfocr
[2023-02-13 10:25:11.350756] [LOG] Input file /home/eduardo/Documentos/PDF A-1b.pdf: type is application/pdf
[2023-02-13 10:25:11.352185] [DEBUG] User conversion params: 
[2023-02-13 10:25:11.352209] [DEBUG] Output file: pdfabr.pdf for PDF and pdfabr.pdf.txt for TXT
[2023-02-13 10:25:11.352249] [LOG] Converting input file to images
[2023-02-13 10:25:11.427845] [LOG] Checking blank pages
[2023-02-13 10:25:11.928593] [LOG] Starting OCR with tesseract...
[2023-02-13 10:25:13.430893] [LOG] OCR completed
[2023-02-13 10:25:13.431167] [DEBUG] We have 1 ocr'ed files
[2023-02-13 10:25:13.432973] [DEBUG] Joined ocr'ed PDF files
[2023-02-13 10:25:13.433220] [LOG] Created final text file
[2023-02-13 10:25:13.433241] [DEBUG] Merging with OCR
[2023-02-13 10:25:13.445310] [DEBUG] Autorotate skipped
[2023-02-13 10:25:13.445371] [DEBUG] Editing producer
[2023-02-13 10:25:13.458038] [DEBUG] Output file created
[2023-02-13 10:25:13.466554] [LOG] Success in 2.117 seconds!

Validation output:

file already has text and check text mode is enabled. Exiting.

Hi, I got the following error while trying to add OCR to a PDF:
--> Errors/Warnings:
already has text and check text mode is enabled. Exiting.

You can find the problematic PDF on Google Drive:
https://drive.google.com/open?id=0B4mLkzBXmYycQ2N5OGpneWd5dzQ

Create language-based Docker images

Hey, could you please create Dockerfiles for different languages and upload the tagged images to Docker Hub?

Alternatively you could add all tesseract ocr language packages to the Dockerfile, but this would nearly triple the image size:

larsk@MacBook-Pro pdf2pdfocr % docker image ls
REPOSITORY                           TAG                                              IMAGE ID            CREATED             SIZE
pdf2pdfocr                           all-lang                                         a74b8d22d02b        6 seconds ago       1.1GB
pdf2pdfocr                           latest                                           09eccd997dd3        6 minutes ago       417MB

Should I add a PR for this issue?

Zero OCR'ed files

File: D:\Google_drive_sola\Sola\2022-2023\ROP - Reologija polimerov\RLP - Reologija polimerov.pdf
[2023-01-14 19:20:35.717707] [DEBUG] Tesseract can 'textonly_pdf': True
[2023-01-14 19:20:35.733704] [DEBUG] Tesseract version: 5
[2023-01-14 19:20:35.736704] [DEBUG] cuneiform not available
[2023-01-14 19:20:35.781705] [DEBUG] Pdftoppm version: 22.12.0
[2023-01-14 19:20:35.811712] [DEBUG] Qpdf version: 11.2.0
[2023-01-14 19:20:35.811712] [DEBUG] Temp dir is C:\Users\ADMINI~1\AppData\Local\Temp\pdf2pdfocr_L3VRF
[2023-01-14 19:20:35.811712] [DEBUG] Prefix is L3VRF
[2023-01-14 19:20:35.811712] [DEBUG] Script dir is c:\Users\Administrator\anaconda3\Scripts
[2023-01-14 19:20:35.812712] [DEBUG] Parallel operations will use 20 CPUs
[2023-01-14 19:20:35.861715] [LOG] Welcome to pdf2pdfocr version 1.12.0 marapurense - https://github.com/LeoFCardoso/pdf2pdfocr
[2023-01-14 19:20:35.903716] [LOG] Input file D:\Google_drive_sola\Sola\2022-2023\ROP - Reologija polimerov\RLP - Reologija polimerov.pdf: type is application/pdf
[2023-01-14 19:20:35.918716] [DEBUG] User conversion params: best
[2023-01-14 19:20:35.918716] [DEBUG] Output file: D:\Google_drive_sola\Sola\2022-2023\ROP - Reologija polimerov\RLP - Reologija polimerov-OCR.pdf for PDF and D:\Google_drive_sola\Sola\2022-2023\ROP - Reologija polimerov\RLP - Reologija polimerov-OCR.pdf.txt for TXT
[2023-01-14 19:20:35.918716] [LOG] Converting input file to images...
[2023-01-14 19:20:43.633767] [LOG] Checking blank pages
C:\Users\Administrator\anaconda3\lib\site-packages\PIL\Image.py:3074: DecompressionBombWarning: Image size (105023996 pixels) exceeds limit of 89478485 pixels, could be decompression bomb DOS attack.
warnings.warn(
[2023-01-14 19:20:44.652767] [LOG] Starting OCR with tesseract...
[2023-01-14 19:20:45.154768] [LOG] OCR completed
[2023-01-14 19:20:45.155767] [DEBUG] We have 0 ocr'ed files
Error: No PDF files generated after OCR. This is not expected. Aborting.

Application icon

It would be nice for the installation script to create an icon for the gui that would appear as an Application. This would allow running without starting the terminal.

TypeError: expected str, bytes or os.PathLike object, not NoneType

]# python3 pdf2pdfocr.py -i /home/amuthuraman/NonOcrpdf/test.pdf
[2019-05-30 02:43:00.729260] [LOG] Tesseract can 'textonly_pdf': False
[2019-05-30 02:43:00.739282] [LOG] Tesseract version: 3
Traceback (most recent call last):
  File "pdf2pdfocr.py", line 1214, in <module>
    pdf2ocr = Pdf2PdfOcr(args)
  File "pdf2pdfocr.py", line 450, in __init__
    self.check_external_tools()
  File "pdf2pdfocr.py", line 531, in check_external_tools
    if not self.test_convert():
  File "pdf2pdfocr.py", line 1031, in test_convert
    stderr=subprocess.DEVNULL, shell=self.shell_mode)
  File "/usr/lib64/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/usr/lib64/python3.6/subprocess.py", line 1278, in _execute_child
    executable = os.fsencode(executable)
  File "/usr/lib64/python3.6/os.py", line 800, in fsencode
    filename = fspath(filename) # Does type-checking of filename.
TypeError: expected str, bytes or os.PathLike object, not NoneType

autorotation is broken with tesseract 4

With tesseract 4, the script always returns an error when autorotate (-u) is used.

Output file could not be created

Any ideas why this would be failing? Unable to generate a final PDF

pdf2pdfocr.py -i test.pdf -o test2.pdf -v -k -r 200

[2020-11-03 11:16:18.697012] [LOG] Tesseract can 'textonly_pdf': True
[2020-11-03 11:16:18.702393] [LOG] Tesseract version: 4
[2020-11-03 11:16:18.702628] [DEBUG] cuneiform not available
[2020-11-03 11:16:18.716257] [DEBUG] Temp dir is /tmp/
[2020-11-03 11:16:18.716342] [DEBUG] Prefix is C6UIH
[2020-11-03 11:16:18.716374] [DEBUG] Script dir is /usr/local/bin/
[2020-11-03 11:16:18.716462] [DEBUG] Parallel operations will use 1 CPUs
[2020-11-03 11:16:18.716560] [LOG] Welcome to pdf2pdfocr version 1.6.1 marapurense - https://github.com/LeoFCardoso/pdf2pdfocr
[2020-11-03 11:16:18.719250] [LOG] Input file /home/john/test.pdf: type is application/pdf
[2020-11-03 11:16:18.720583] [DEBUG] Output file: test2.pdf for PDF and test2.pdf.txt for TXT
[2020-11-03 11:16:18.720644] [LOG] Converting input file to images...
[2020-11-03 11:16:19.544142] [LOG] Starting OCR with tesseract...
[2020-11-03 11:16:19.550422] [LOG] Waiting for OCR to complete. 0/1 pages completed...
[2020-11-03 11:16:24.553051] [LOG] OCR completed
[2020-11-03 11:16:24.553681] [DEBUG] We have 1 ocr'ed files
[2020-11-03 11:16:24.557630] [DEBUG] Joined ocr'ed PDF files
[2020-11-03 11:16:24.557677] [DEBUG] Merging with OCR
[2020-11-03 11:16:24.564783] [DEBUG] Fail to merge source PDF with extracted OCR text. Trying to fix source PDF to build final file...
[2020-11-03 11:16:25.222864] [DEBUG] Merging with OCR
Output file could not be created :( Exiting with error code.

What does cmd_file imply?

The script sets cmd_file = 'file'. What is this variable for? Its path always comes out as None, and I am not sure whether I need to hard-code it to some value or whether I missed something during installation.

I have installed everything on Windows.
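A path of None usually means the executable could not be located. The sketch below shows that behaviour with the standard library's `shutil.which`; whether pdf2pdfocr resolves `cmd_file` this way is an assumption (the lookup code is not shown on this page), and `find_tool` and `require_tool` are hypothetical helpers. Note that `file` is a Unix utility that a stock Windows installation does not ship.

```python
import shutil


def find_tool(name):
    """Return the full path of an executable found on PATH, or None if missing."""
    return shutil.which(name)


def require_tool(name):
    """Like find_tool, but fail early with a readable message."""
    path = find_tool(name)
    if path is None:
        raise FileNotFoundError(f"'{name}' was not found on PATH")
    return path
```

Checking up front like this gives a clearer message than passing None on to `subprocess`, which fails with "expected str, bytes or os.PathLike object, not NoneType".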

PIL.Image.DecompressionBombError: Image size (235978454 pixels) exceeds limit of 178956970 pixels, could be decompression bomb DOS attack.

While applying OCR to a PDF, using the docker image of the repo "leofcardoso/pdf2pdfocr:latest", this error occurred:

[2023-09-05 10:35:58.939733] [LOG] Welcome to pdf2pdfocr version 1.12.0 marapurense - https://github.com/LeoFCardoso/pdf2pdfocr
[2023-09-05 10:35:58.959460] [LOG] Input file /home/docker/Dummy_IS.pdf: type is application/pdf
[2023-09-05 10:35:59.047502] [LOG] Converting input file to images...
[2023-09-05 10:36:38.577186] [LOG] Checking blank pages
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/usr/local/bin/pdf2pdfocr.py", line 249, in do_check_img_colors_size
    im = Image.open(param_image_file)
  File "/usr/local/lib/python3.10/dist-packages/PIL/Image.py", line 3172, in open
    im = _open_core(fp, filename, prefix, formats)
  File "/usr/local/lib/python3.10/dist-packages/PIL/Image.py", line 3159, in _open_core
    _decompression_bomb_check(im.size)
  File "/usr/local/lib/python3.10/dist-packages/PIL/Image.py", line 3068, in _decompression_bomb_check
    raise DecompressionBombError(
PIL.Image.DecompressionBombError: Image size (235978454 pixels) exceeds limit of 178956970 pixels, could be decompression bomb DOS attack.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/pdf2pdfocr.py", line 1530, in <module>
    pdf2ocr.ocr()
  File "/usr/local/bin/pdf2pdfocr.py", line 712, in ocr
    self.check_blank_pages(image_file_list)
  File "/usr/local/bin/pdf2pdfocr.py", line 1010, in check_blank_pages
    blank_map_values = colors_size_pool_map.get()
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 774, in get
    raise self._value
PIL.Image.DecompressionBombError: Image size (235978454 pixels) exceeds limit of 178956970 pixels, could be decompression bomb DOS attack.
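The two limits quoted in the reports on this page are related: Pillow warns once an image exceeds `Image.MAX_IMAGE_PIXELS` (89478485 by default) and raises `DecompressionBombError` once it exceeds twice that value, which is the 178956970 shown here. The sketch below mirrors that rule in plain Python so the numbers can be checked without Pillow installed; `bomb_check` is an illustrative helper, not a Pillow function.

```python
DEFAULT_LIMIT = 89478485  # Pillow's default Image.MAX_IMAGE_PIXELS


def bomb_check(pixels, limit=DEFAULT_LIMIT):
    """Classify an image size the way Pillow's decompression bomb check does."""
    if limit is None:
        return "ok"        # the check is disabled when the limit is None
    if pixels > 2 * limit:
        return "error"     # DecompressionBombError
    if pixels > limit:
        return "warning"   # DecompressionBombWarning
    return "ok"


print(bomb_check(105023996))  # size from the warning in the Windows report
print(bomb_check(235978454))  # size from the error in this report
```

For trusted input, raising `PIL.Image.MAX_IMAGE_PIXELS` before opening images is one possible workaround; rendering pages at a lower resolution is another, since the pixel count grows with the square of the resolution.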

PyPDF2 moved PdfReadError from utils to errors

I followed the instruction guide for Windows and noticed an error when running the SendTo VBScript:
"ModuleNotFoundError: No module named 'PyPDF2.utils'"

Looks like the latest version of PyPDF2 moved PdfReadError from utils to errors

Changing line 41 from

from PyPDF2.utils import PdfReadError

to

from PyPDF2.errors import PdfReadError

fixed the problem.

Cheers
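An import that works with both old and new PyPDF2 releases could look like the following. This is a sketch rather than the project's actual fix; the final fallback class exists only so the snippet still loads when PyPDF2 is not installed at all.

```python
try:
    # Newer PyPDF2 releases keep the exception in PyPDF2.errors.
    from PyPDF2.errors import PdfReadError
except ImportError:
    try:
        # Older releases exposed it from PyPDF2.utils.
        from PyPDF2.utils import PdfReadError
    except ImportError:
        # PyPDF2 is missing entirely: placeholder for illustration only.
        class PdfReadError(Exception):
            pass
```

`ModuleNotFoundError` is a subclass of `ImportError`, so the same `except` clause covers both a missing module and a missing name.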

result pdf file is blank

Text can be extracted, but all pages are blank.

join_ocred_pdf failing due to "cannot read an empty file"

The error I get is "PyPDF2.errors.PdfReadError: Cannot read an empty file". I experimented with the first 2 pages of this pdf; individually the two pages OCR'ed fine (neither page was empty, and the OCR'ed text was not empty either), but when I tried to do the 2 pages together, it gave me

Traceback (most recent call last):
  File "/usr/local/bin/pdf2pdfocr.py", line 1526, in <module>
    pdf2ocr.ocr()
  File "/usr/local/bin/pdf2pdfocr.py", line 717, in ocr
    self.join_ocred_pdf()
  File "/usr/local/bin/pdf2pdfocr.py", line 952, in join_ocred_pdf
    pdf_merger.append(PyPDF2.PdfFileReader(text_pdf_file, strict=False))
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_reader.py", line 1856, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_reader.py", line 277, in __init__
    self.read(stream)
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_reader.py", line 1301, in read
    raise PdfReadError("Cannot read an empty file")
PyPDF2.errors.PdfReadError: Cannot read an empty file.

Side note, on the successful runs it gave me the warnings

UserWarning: isString is deprecated and will be removed in PyPDF2 2.0.0. [_utils.py:76]
UserWarning: namedDestinations will be removed in PyPDF2 2.0.0. Use named_destinations instead. [_reader.py:519]
UserWarning: addMetadata is deprecated and will be removed in PyPDF2 2.0.0. Use add_metadata instead. [_writer.py:793]

Text extraction returns every character twice after OCR

After OCR processing, extracting text from these PDFs, whether with code or by copying directly from the browser-rendered PDF, returns the entire text twice.

For instance, if the original text contains 5 characters, extraction after OCR returns 10 characters, so the content is duplicated.

Font issue on macOS Catalina Dark Appearance

When running the GUI I receive the message shown below. It does not seem to have consequences when macOS is configured in light mode; in dark mode, however, the non-selected UI controls are displayed empty (see attached screenshot).

% pdf2pdfocr_gui.py
2020-04-11 08:16:46.702 Python[45232:697646] CoreText note: Client requested name ".SFNS-Regular", it will get Times-Roman rather than the intended font. All system UI font access should be through proper APIs such as CTFontCreateUIFontForLanguage() or +[NSFont systemFontOfSize:].
2020-04-11 08:16:46.702 Python[45232:697646] CoreText note: Set a breakpoint on CTFontLogSystemFontNameRequest to debug.

How to use it on Windows 10? Can PaddleOCR be used as an OCR engine?

No PDF files generated after OCR. This is not expected. Aborting.

Something seems to be wrong. I am running macOS 10.13.6 with a fresh MacPorts installation.

[2018-07-21 12:18:53.392458] [LOG]      Input file /Users/emoret/Downloads/01-19-2017.pdf: type is application/pdf
PdfReadWarning: Multiple definitions in dictionary at byte 0xa9769 for key /Outlines [generic.py:588]
[2018-07-21 12:18:53.400228] [DEBUG]    Output file: /Users/emoret/Downloads/01-19-2017-OCR.pdf for PDF and /Users/emoret/Downloads/01-19-2017-OCR.pdf.txt for TXT
[2018-07-21 12:18:53.400349] [LOG]      Converting input file to images...
[2018-07-21 12:18:54.256488] [LOG]      Starting OCR...
[2018-07-21 12:18:54.268721] [LOG]      Waiting for OCR to complete. 0/5 pages completed...
[2018-07-21 12:18:59.271505] [LOG]      OCR completed
[2018-07-21 12:18:59.273427] [DEBUG]    We have 0 ocr'ed files
No PDF files generated after OCR. This is not expected. Aborting.

RuntimeError: can't start new thread

I did a quick test and got this error below
System information

Distributor ID: Ubuntu
Description:    Ubuntu 22.04.4 LTS
Release:        22.04
Codename:       jammy

Linux dev4-1 5.15.0-113-generic #123-Ubuntu SMP Mon Jun 10 08:16:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Client: Docker Engine - Community
 Version:           27.0.3
 API version:       1.46
 Go version:        go1.21.11
 Git commit:        7d4bcd8
 Built:             Sat Jun 29 00:02:33 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          27.0.3
  API version:      1.46 (minimum version 1.24)
  Go version:       go1.21.11
  Git commit:       662f78c
  Built:            Sat Jun 29 00:02:33 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.7.18
  GitCommit:        ae71819c4f5e67bb4d5ae76a6b735f29cc25774e
 nvidia:
  Version:          1.7.18
  GitCommit:        v1.1.13-0-g58aa920
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Error log

❯ docker run --rm -v "$(pwd):/home/docker" leofcardoso/pdf2pdfocr -v -i ./inby.pdf
Unable to find image 'leofcardoso/pdf2pdfocr:latest' locally
latest: Pulling from leofcardoso/pdf2pdfocr
37aaf24cf781: Pull complete 
da892f4d0cb0: Pull complete 
df89c9ce1e48: Pull complete 
d2a3165daa7e: Pull complete 
663286a455ab: Pull complete 
4f4fb700ef54: Pull complete 
35693ee7cdbf: Pull complete 
4215239b5448: Pull complete 
Digest: sha256:6f446c6fa612ffd304bede285556cc0190f53c6506f8a7200a69a603261643a6
Status: Downloaded newer image for leofcardoso/pdf2pdfocr:latest
-------------------------------------
File: ./inby.pdf
[2024-07-10 01:00:35.107971] [DEBUG] Tesseract can 'textonly_pdf': True
[2024-07-10 01:00:35.117933] [DEBUG] Tesseract version: 4
[2024-07-10 01:00:35.144010] [DEBUG] Pdftoppm version: 22.2.0
[2024-07-10 01:00:35.151576] [DEBUG] Qpdf version: 10.6.3
[2024-07-10 01:00:35.151798] [DEBUG] Temp dir is /tmp/pdf2pdfocr_F7DGC/
[2024-07-10 01:00:35.151836] [DEBUG] Prefix is F7DGC
[2024-07-10 01:00:35.151884] [DEBUG] Script dir is /usr/local/bin/
[2024-07-10 01:00:35.151972] [DEBUG] Parallel operations will use 40 CPUs
Traceback (most recent call last):
  File "/usr/local/bin/pdf2pdfocr.py", line 1509, in <module>
    pdf2ocr = Pdf2PdfOcr(pdf2ocr_args, file_to_process)
  File "/usr/local/bin/pdf2pdfocr.py", line 585, in __init__
    self.main_pool = multiprocessing.Pool(self.cpu_to_use)
  File "/usr/lib/python3.10/multiprocessing/context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 235, in __init__
    self._worker_handler.start()
  File "/usr/lib/python3.10/threading.py", line 935, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
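The log shows the pool being created for 40 CPUs just before the failure. Inside a container the visible CPU count can be larger than what the process and thread limits allow, so one thing to try is capping the pool size. The sketch below only illustrates the idea: `choose_workers` is a made-up helper, the cap of 8 is arbitrary, and it is not confirmed that this is the cause of the error here.

```python
import os


def choose_workers(detected=None, cap=8):
    """Pick a worker count: the detected CPU count, limited to `cap`, at least 1."""
    if detected is None:
        detected = os.cpu_count() or 1
    return max(1, min(detected, cap))


print(choose_workers(40))  # a 40-CPU host would get 8 workers
```

The returned value could then be passed to `multiprocessing.Pool(...)` instead of the raw CPU count.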

GUI

It would be nice to have a simple GUI for this tool. Maybe a contextual menu in Finder, a print driver, or even an icon that accepts dragged-and-dropped PDF files ready to OCR?

Blank file

With some old poppler versions and specific PDFs, the script generates blank pages.

How to use this directly without docker on windows 11?

How can I use this tool directly on Windows 11 without Docker?
I'd like to utilize it as a python function API that accepts arguments and generates the OCR'd file.

Bad insertion text on PDF

PDF source :
Module 3 - .v2.pdf

I'm trying to OCR the text in my PDF for personal use.
I've checked the generated TXT file, and it works (I see the proper text).
But when I open the OCR'ed PDF and search for text, it does not work. If I copy and paste text from the PDF, I get garbled text:

22tropnosseccaevitartsinimdatimrepdluowspuorgeerhtllA.puorgrevresnoitacilppaehtotylnonepo
erucesylhgihyolpednacuoy,msinahcemsihthtiW.krowtenetaroprocs’remotsucehtmorfylnotub .snoitacilppa

Poor performance in docker container

Execution inside docker container takes too much time.

Documentation update

I found that following the MacPorts installation documentation did not properly install the modules with pip. To make it work, I had to use pip3 as follows:

sudo pip3 install reportlab Gooey
sudo pip3 install https://github.com/mstamy2/PyPDF2/archive/master.zip
sudo pip3 install lxml beautifulsoup4

Error Message by OCR via GUI

Dear Leonardo,
I get an error when I try to OCR a PDF file. Maybe you can help me?
I use Windows 10 21H1 in a VirtualBox VM with 4 cores and 16 GB of memory.
The message is:
[2022-02-19 18:44:06.020876] [DEBUG] Tesseract can 'textonly_pdf': True
[2022-02-19 18:44:06.050413] [DEBUG] Tesseract version: 5
[2022-02-19 18:44:06.050413] [DEBUG] cuneiform not available
[2022-02-19 18:44:06.282093] [DEBUG] Pdftoppm version: 22.01.0
[2022-02-19 18:44:06.391073] [DEBUG] Qpdf version: 10.6.2
[2022-02-19 18:44:06.391073] [DEBUG] Temp dir is C:\Users\Martin\AppData\Local\Temp\pdf2pdfocr_ONWZ5
[2022-02-19 18:44:06.391073] [DEBUG] Prefix is ONWZ5
[2022-02-19 18:44:06.391073] [DEBUG] Script dir is C:\Users\Martin\pdf2pdfocr-venv\Scripts
[2022-02-19 18:44:06.391073] [DEBUG] Parallel operations will use 4 CPUs
[2022-02-19 18:44:06.507230] [LOG] Welcome to pdf2pdfocr version 1.9.1 marapurense - https://github.com/LeoFCardoso/pdf2pdfocr
[2022-02-19 18:44:06.641453] [LOG] Input file C:\Users\Martin\Desktop\471685214.pdf: type is application/pdf
[2022-02-19 18:44:06.641453] [DEBUG] Conversion params:
[2022-02-19 18:44:06.641453] [DEBUG] Output file: C:\Users\Martin\Desktop\471685214-OCR.pdf for PDF and C:\Users\Martin\Desktop\471685214-OCR.pdf.txt for TXT
[2022-02-19 18:44:06.641453] [LOG] Converting input file to images...
[2022-02-19 18:44:06.903005] [LOG] Starting OCR with tesseract...
[2022-02-19 18:44:07.365611] [LOG] OCR completed
[2022-02-19 18:44:07.365611] [DEBUG] We have 0 ocr'ed files
No PDF files generated after OCR. This is not expected. Aborting.

Do I need cuneiform? I read your Windows install.txt file and understood it to be optional, but maybe I'm wrong.
It's an interesting tool and would be a perfect fit for building a database of my private papers.
Thanks a lot
Martin

rebuild_and_merge fails on Windows with big files

Sometimes execution fails on Windows with a "long command line" error when ImageMagick is called.

pdf2pdfocr gui error when selecting output file

If the output file is selected in pdf2pdfocr gui, this file must currently already exist, which is obviously not reasonable.

Do we have any parameter / flag for pdf compression here, to reduce pdf size after applying OCR?

script hangs on windows and python 3.7.2

script hangs forever with python 3.7.2 and windows.

merging multiple files into one pdf-file

At the moment, a PDF file is created for each file if the option "-i" is used for a directory. There should be an option that packs all the files into one single PDF file. This would be useful when a directory holds several "edited" image files after a scan (e.g. processed with scantailor, one file per page). An option for sorting these pages "correctly" has to be kept in mind as well.
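For the sorting part, a natural sort (where "page2" comes before "page10") is probably what "correct" means for scanned pages. A minimal sketch, with `natural_key` as an illustrative helper and made-up file names:

```python
import re


def natural_key(name):
    """Split a name into text and number chunks so numbers compare numerically."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


pages = ["page10.tif", "page2.tif", "page1.tif"]
print(sorted(pages, key=natural_key))
```

A plain `sorted(pages)` would put `page10.tif` before `page2.tif`, which is the ordering problem a merge option would need to avoid.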

file not found. Aborting...

python pdf2pdfocr.py -v -r 200 -i Dummy_IS_4.pdf

File: Dummy_IS_4.pdf
[2023-09-14 11:37:30.229202] [DEBUG] Tesseract can 'textonly_pdf': True
[2023-09-14 11:37:30.247081] [DEBUG] Tesseract version: 5
[2023-09-14 11:37:30.247081] [DEBUG] cuneiform not available
file not found. Aborting...

In GUI:

A rectangular block is the only portion being selected from within a paragraph.

As you can see in the below image,
any solution to this problem?

Tesseract 4 LSTM (--oem 1)

Is there a flag to set --oem 1 for tesseract 4, as documented here?

pdf2pdfocr changing languages

This wasn't included in the README, so here is some info for anyone else who is lost.
You can change which language model is downloaded by editing this line:
aria2c "https://github.com/tesseract-ocr/tessdata/blob/main/por.traineddata?raw=true" --dir="%TESSDATA_PREFIX%"
Change the language prefix to the language you want, as long as it is available in the tesseract repo. For example, Swedish is "swe".

Further info here:
https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc#LANGUAGES

To change the default language, edit line 548 of pdf2pdfocr.py and replace Portuguese + English ("por+eng") with the languages you want. I use Swedish + English ("swe+eng"), so I changed
self.tess_langs = "por+eng" # Default
to
self.tess_langs = "swe+eng" # Default

For example to get Swedish
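The download URL follows a simple pattern, so it can be built for any language code. The sketch below just fills the code into the URL quoted above; `tessdata_url` is an illustrative helper, and the language code still has to exist in the tessdata repository.

```python
def tessdata_url(lang):
    """Build the raw download URL for one traineddata file, e.g. lang='swe'."""
    return f"https://github.com/tesseract-ocr/tessdata/blob/main/{lang}.traineddata?raw=true"


print(tessdata_url("swe"))
```

The resulting URL can be passed to the same aria2c command in place of the Portuguese one.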

the right edge of the text is not fully highlighted

Hello

In the PDF, after recognition with pdf2pdfocr_gui.py, the right edge of the text is not fully highlighted if you select 'tesseract' for the '-e' option.
If you select 'native', the entire text is highlighted correctly, but there are no Cyrillic characters.
See the screenshots...

I don't think this is a tesseract problem, because I recognize DjVu files with 'ocrodjvu' and the text is highlighted correctly after recognition.

can this be corrected?

Integration with Google Vision API

Hello,
I want to replace the tesseract engine with the Google Vision API. Can you please suggest how to do that?
Thanks

Missing space

When I use pdf2pdfocr, the text generated includes no space between the words recognized. As a result when I copy/paste the resulting text it is difficult to use as I have to manually reintroduce all missing spaces.

TypeError: can't concat str to ByteStringObject in edit_producer

Specify Output Folder using pdf2pdfocr.vbs

Dear Leo,

I love your project but would like to write the OCRed files directly to a new directory.
Therefore I tried to amend the default options: default_option = "-stp -j 0.9 -o %Userprofile%"
But no matter which directory I add, I always get a permission error:

Traceback (most recent call last):
  File "C:\Users\Christoph\pdf2pdfocr-venv\Scripts\pdf2pdfocr.py", line 1249, in <module>
    pdf2ocr.ocr()
  File "C:\Users\Christoph\pdf2pdfocr-venv\Scripts\pdf2pdfocr.py", line 605, in ocr
    self.initial_cleanup()
  File "C:\Users\Christoph\pdf2pdfocr-venv\Scripts\pdf2pdfocr.py", line 952, in initial_cleanup
    Pdf2PdfOcr.best_effort_remove(self.output_file)
  File "C:\Users\Christoph\pdf2pdfocr-venv\Scripts\pdf2pdfocr.py", line 1154, in best_effort_remove
    os.remove(filename)
PermissionError: [WinError 5] Zugriff verweigert: 'C:\Users\Christoph'

Any ideas how to fix it?

Thank you so much.

BR
Christoph
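The traceback shows `os.remove` being called on 'C:\Users\Christoph', which is the directory given to `-o`, so the option appears to expect a file path rather than a folder. Assuming that is the case, the output file name could be derived per input file as sketched below; `output_file_for` is a hypothetical helper, and the "-OCR" suffix simply mirrors the default names seen in the logs on this page.

```python
from pathlib import Path


def output_file_for(input_pdf, output_dir, suffix="-OCR"):
    """Return <output_dir>/<input name><suffix>.pdf for a given input file."""
    input_pdf = Path(input_pdf)
    return Path(output_dir) / f"{input_pdf.stem}{suffix}{input_pdf.suffix}"


print(output_file_for("scan.pdf", "archive").name)
```

The full path returned by the helper, not the bare directory, would then be the value to pass to `-o`.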

Error/Warning: Mogrify from ImageMagick not found. Aborting ...

Hi there,

I have always been looking for an open source tool suite like yours. I installed everything as explained for a Windows 7 x64 system according to your README.

Right-clicking a PDF -> Send To -> VBS Script gives the error above.

However, mogrify is installed: I can run it successfully from the command line with "magick mogrify".

Looking at the Python code, it looks like it should work. Can you help me out?

Thank you so much.
