leofcardoso / pdf2pdfocr
A free tool to OCR a PDF and add a text "layer" in the original file, making a searchable PDF. Use only open source tools. Please tip!
License: Apache License 2.0
Hey, is there any option to OCR multiple files together (or any script that I can use)? Doing them one by one takes a lot of time. Thanks for this awesome tool, by the way!
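One way to batch the work is a small wrapper that calls the script once per file. The following is a minimal sketch, not part of the project: it assumes the script is started as `pdf2pdfocr.py -i <file>` (the form seen in the logs on this page) and that output files end in `-OCR.pdf` (as one log here shows). The helper names `build_commands` and `run_all` are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(folder, script="pdf2pdfocr.py"):
    """Return one pdf2pdfocr command line per input PDF found in `folder`."""
    pdfs = sorted(
        p for p in Path(folder).iterdir()
        if p.suffix.lower() == ".pdf"
        # Assumption: files ending in -OCR.pdf are earlier outputs, so skip them.
        and not p.name.endswith("-OCR.pdf")
    )
    # Only the -i (input) flag is used; every other option keeps its default.
    return [[sys.executable, script, "-i", str(pdf)] for pdf in pdfs]


def run_all(folder, script="pdf2pdfocr.py"):
    """Run the commands one after another and return the inputs that failed."""
    failed = []
    for cmd in build_commands(folder, script):
        if subprocess.run(cmd).returncode != 0:
            failed.append(cmd[-1])
    return failed
```

Another report on this page mentions that "-i" also accepts a directory and produces one PDF per file, which may make a wrapper like this unnecessary.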
If a PDF contains a large amount of text and a small amount of pictures, we only want to OCR the pictures. The script currently OCRs the whole pages, including any existing text, which is undesirable because of the CPU consumption, and degradation of existing text.
I want to implement a change (probably optional, enabled by a flag) to only run OCR on the images, not on any existing text. I would split the images away from the PDF using pdfimages, and then somehow re-create the layer sandwich using only the OCR generated for those images. The original text inside the files should be left untouched.
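As a starting point for the pdfimages idea, the sketch below parses the table printed by `pdfimages -list` (poppler-utils) to find which pages contain images at all. It is only an illustration: the function names are hypothetical, only the first five columns (page, num, type, width, height) are read, and the listing read this way carries no position on the page, so coordinates would have to come from elsewhere.

```python
import subprocess


def parse_pdfimages_list(listing):
    """Parse the table printed by `pdfimages -list` into a list of dicts."""
    images = []
    for line in listing.splitlines():
        fields = line.split()
        # Data rows start with a page number; header and ruler lines do not.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        images.append({
            "page": int(fields[0]),
            "num": int(fields[1]),
            "type": fields[2],
            "width": int(fields[3]),
            "height": int(fields[4]),
        })
    return images


def list_images(pdf_path):
    """Run pdfimages on `pdf_path` and return the parsed image list."""
    out = subprocess.run(["pdfimages", "-list", pdf_path],
                         capture_output=True, text=True, check=True).stdout
    return parse_pdfimages_list(out)


def pages_with_images(images):
    """Pages that contain at least one image, i.e. candidates for OCR."""
    return sorted({img["page"] for img in images})
```

Pages missing from `pages_with_images(...)` could then be skipped entirely, which addresses the CPU cost even before the harder problem of merging per-image OCR back into the page.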
Do you have any pointers on doing this? I have a couple of ideas I want to investigate:
Use pdfimages to extract images from the PDF (along with page number, image size and coordinates)
script aborting
File used PDF A-1b.pdf
Site for validation: VeraPDF Demo
Terminal output:
eduardo@000563-desk:~/Área\ de Trabalho/testepdf$ python3 ~/Área\ de\ Trabalho/pdf2pdfocr/pdf2pdfocr.py -w -o pdfabr.pdf -v -l por -i ~/Documentos/PDF\ A-1b.pdf
File: /home/eduardo/Documentos/PDF A-1b.pdf
[2023-02-13 10:25:11.326759] [DEBUG] Tesseract can 'textonly_pdf': True
[2023-02-13 10:25:11.329506] [DEBUG] Tesseract version: 4
[2023-02-13 10:25:11.329598] [DEBUG] cuneiform not available
[2023-02-13 10:25:11.336831] [DEBUG] Pdftoppm version: 22.2.0
[2023-02-13 10:25:11.340303] [DEBUG] Qpdf version: 10.6.3
[2023-02-13 10:25:11.340382] [DEBUG] Temp dir is /tmp/pdf2pdfocr_O4M39/
[2023-02-13 10:25:11.340396] [DEBUG] Prefix is O4M39
[2023-02-13 10:25:11.340413] [DEBUG] Script dir is /home/eduardo/Área de Trabalho/pdf2pdfocr/
[2023-02-13 10:25:11.340442] [DEBUG] Parallel operations will use 8 CPUs
[2023-02-13 10:25:11.349594] [LOG] Welcome to pdf2pdfocr version 1.12.0 marapurense - https://github.com/LeoFCardoso/pdf2pdfocr
[2023-02-13 10:25:11.350756] [LOG] Input file /home/eduardo/Documentos/PDF A-1b.pdf: type is application/pdf
[2023-02-13 10:25:11.352185] [DEBUG] User conversion params:
[2023-02-13 10:25:11.352209] [DEBUG] Output file: pdfabr.pdf for PDF and pdfabr.pdf.txt for TXT
[2023-02-13 10:25:11.352249] [LOG] Converting input file to images
[2023-02-13 10:25:11.427845] [LOG] Checking blank pages
[2023-02-13 10:25:11.928593] [LOG] Starting OCR with tesseract...
[2023-02-13 10:25:13.430893] [LOG] OCR completed
[2023-02-13 10:25:13.431167] [DEBUG] We have 1 ocr'ed files
[2023-02-13 10:25:13.432973] [DEBUG] Joined ocr'ed PDF files
[2023-02-13 10:25:13.433220] [LOG] Created final text file
[2023-02-13 10:25:13.433241] [DEBUG] Merging with OCR
[2023-02-13 10:25:13.445310] [DEBUG] Autorotate skipped
[2023-02-13 10:25:13.445371] [DEBUG] Editing producer
[2023-02-13 10:25:13.458038] [DEBUG] Output file created
[2023-02-13 10:25:13.466554] [LOG] Success in 2.117 seconds!
Hi, I got the following error while trying to add OCR to a PDF:
--> Errors/Warnings:
already has text and check text mode is enabled. Exiting.
You can download the problematic PDF from Google Drive:
https://drive.google.com/open?id=0B4mLkzBXmYycQ2N5OGpneWd5dzQ
Hey, could you please create Dockerfiles for different languages and upload the tagged images to Docker Hub?
Alternatively you could add all tesseract ocr language packages to the Dockerfile, but this would nearly triple the image size:
larsk@MacBook-Pro pdf2pdfocr % docker image ls
REPOSITORY TAG IMAGE ID CREATED SIZE
pdf2pdfocr all-lang a74b8d22d02b 6 seconds ago 1.1GB
pdf2pdfocr latest 09eccd997dd3 6 minutes ago 417MB
Should I add a PR for this issue?
File: D:\Google_drive_sola\Sola\2022-2023\ROP - Reologija polimerov\RLP - Reologija polimerov.pdf
[2023-01-14 19:20:35.717707] [DEBUG] Tesseract can 'textonly_pdf': True
[2023-01-14 19:20:35.733704] [DEBUG] Tesseract version: 5
[2023-01-14 19:20:35.736704] [DEBUG] cuneiform not available
[2023-01-14 19:20:35.781705] [DEBUG] Pdftoppm version: 22.12.0
[2023-01-14 19:20:35.811712] [DEBUG] Qpdf version: 11.2.0
[2023-01-14 19:20:35.811712] [DEBUG] Temp dir is C:\Users\ADMINI~1\AppData\Local\Temp\pdf2pdfocr_L3VRF
[2023-01-14 19:20:35.811712] [DEBUG] Prefix is L3VRF
[2023-01-14 19:20:35.811712] [DEBUG] Script dir is c:\Users\Administrator\anaconda3\Scripts
[2023-01-14 19:20:35.812712] [DEBUG] Parallel operations will use 20 CPUs
[2023-01-14 19:20:35.861715] [LOG] Welcome to pdf2pdfocr version 1.12.0 marapurense - https://github.com/LeoFCardoso/pdf2pdfocr
[2023-01-14 19:20:35.903716] [LOG] Input file D:\Google_drive_sola\Sola\2022-2023\ROP - Reologija polimerov\RLP - Reologija polimerov.pdf: type is application/pdf
[2023-01-14 19:20:35.918716] [DEBUG] User conversion params: best
[2023-01-14 19:20:35.918716] [DEBUG] Output file: D:\Google_drive_sola\Sola\2022-2023\ROP - Reologija polimerov\RLP - Reologija polimerov-OCR.pdf for PDF and D:\Google_drive_sola\Sola\2022-2023\ROP - Reologija polimerov\RLP - Reologija polimerov-OCR.pdf.txt for TXT
[2023-01-14 19:20:35.918716] [LOG] Converting input file to images...
[2023-01-14 19:20:43.633767] [LOG] Checking blank pages
C:\Users\Administrator\anaconda3\lib\site-packages\PIL\Image.py:3074: DecompressionBombWarning: Image size (105023996 pixels) exceeds limit of 89478485 pixels, could be decompression bomb DOS attack.
warnings.warn(
[2023-01-14 19:20:44.652767] [LOG] Starting OCR with tesseract...
[2023-01-14 19:20:45.154768] [LOG] OCR completed
[2023-01-14 19:20:45.155767] [DEBUG] We have 0 ocr'ed files
Error: No PDF files generated after OCR. This is not expected. Aborting.
It would be nice for the installation script to create an icon for the gui that would appear as an Application. This would allow running without starting the terminal.
]# python3 pdf2pdfocr.py -i /home/amuthuraman/NonOcrpdf/test.pdf
[2019-05-30 02:43:00.729260] [LOG] Tesseract can 'textonly_pdf': False
[2019-05-30 02:43:00.739282] [LOG] Tesseract version: 3
Traceback (most recent call last):
File "pdf2pdfocr.py", line 1214, in <module>
pdf2ocr = Pdf2PdfOcr(args)
File "pdf2pdfocr.py", line 450, in __init__
self.check_external_tools()
File "pdf2pdfocr.py", line 531, in check_external_tools
if not self.test_convert():
File "pdf2pdfocr.py", line 1031, in test_convert
stderr=subprocess.DEVNULL, shell=self.shell_mode)
File "/usr/lib64/python3.6/subprocess.py", line 729, in __init__
restore_signals, start_new_session)
File "/usr/lib64/python3.6/subprocess.py", line 1278, in _execute_child
executable = os.fsencode(executable)
File "/usr/lib64/python3.6/os.py", line 800, in fsencode
filename = fspath(filename) # Does type-checking of filename
.
TypeError: expected str, bytes or os.PathLike object, not NoneType
With Tesseract 4, the script always returns an error when autorotate (-u) is used.
Any ideas why this would be failing? Unable to generate a final PDF
pdf2pdfocr.py -i test.pdf -o test2.pdf -v -k -r 200
[2020-11-03 11:16:18.697012] [LOG] Tesseract can 'textonly_pdf': True
[2020-11-03 11:16:18.702393] [LOG] Tesseract version: 4
[2020-11-03 11:16:18.702628] [DEBUG] cuneiform not available
[2020-11-03 11:16:18.716257] [DEBUG] Temp dir is /tmp/
[2020-11-03 11:16:18.716342] [DEBUG] Prefix is C6UIH
[2020-11-03 11:16:18.716374] [DEBUG] Script dir is /usr/local/bin/
[2020-11-03 11:16:18.716462] [DEBUG] Parallel operations will use 1 CPUs
[2020-11-03 11:16:18.716560] [LOG] Welcome to pdf2pdfocr version 1.6.1 marapurense - https://github.com/LeoFCardoso/pdf2pdfocr
[2020-11-03 11:16:18.719250] [LOG] Input file /home/john/test.pdf: type is application/pdf
[2020-11-03 11:16:18.720583] [DEBUG] Output file: test2.pdf for PDF and test2.pdf.txt for TXT
[2020-11-03 11:16:18.720644] [LOG] Converting input file to images...
[2020-11-03 11:16:19.544142] [LOG] Starting OCR with tesseract...
[2020-11-03 11:16:19.550422] [LOG] Waiting for OCR to complete. 0/1 pages completed...
[2020-11-03 11:16:24.553051] [LOG] OCR completed
[2020-11-03 11:16:24.553681] [DEBUG] We have 1 ocr'ed files
[2020-11-03 11:16:24.557630] [DEBUG] Joined ocr'ed PDF files
[2020-11-03 11:16:24.557677] [DEBUG] Merging with OCR
[2020-11-03 11:16:24.564783] [DEBUG] Fail to merge source PDF with extracted OCR text. Trying to fix source PDF to build final file...
[2020-11-03 11:16:25.222864] [DEBUG] Merging with OCR
Output file could not be created :( Exiting with error code.
cmd_file = 'file': may I know the purpose of this variable? Its path always resolves to None, and I am not sure whether we need to hard-code it to some value or whether I missed something during installation.
I have installed everything on windows.
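A None path is what a PATH lookup returns when the program is not found, which would also explain the "expected str, bytes or os.PathLike object, not NoneType" traceback reported elsewhere on this page. The check below is a minimal sketch under that assumption; it does not reproduce the script's own lookup code, and the tool names passed in are whatever the caller wants to verify.

```python
import shutil


def find_missing_tools(names):
    """Return the tool names that cannot be found on PATH."""
    # shutil.which returns None when no matching executable exists on PATH.
    return [name for name in names if shutil.which(name) is None]
```

For example, `find_missing_tools(["file", "tesseract", "pdftoppm", "qpdf"])` lists the programs that still need to be installed or added to PATH. On Windows the `file` utility is not part of the system, so it would have to be installed separately.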
While applying OCR to a PDF, using the docker image of the repo "leofcardoso/pdf2pdfocr:latest", this error occurred:
[2023-09-05 10:35:58.939733] [LOG] Welcome to pdf2pdfocr version 1.12.0 marapurense - https://github.com/LeoFCardoso/pdf2pdfocr
[2023-09-05 10:35:58.959460] [LOG] Input file /home/docker/Dummy_IS.pdf: type is application/pdf
[2023-09-05 10:35:59.047502] [LOG] Converting input file to images...
[2023-09-05 10:36:38.577186] [LOG] Checking blank pages
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/usr/lib/python3.10/multiprocessing/pool.py", line 51, in starmapstar
return list(itertools.starmap(args[0], args[1]))
File "/usr/local/bin/pdf2pdfocr.py", line 249, in do_check_img_colors_size
im = Image.open(param_image_file)
File "/usr/local/lib/python3.10/dist-packages/PIL/Image.py", line 3172, in open
im = _open_core(fp, filename, prefix, formats)
File "/usr/local/lib/python3.10/dist-packages/PIL/Image.py", line 3159, in _open_core
_decompression_bomb_check(im.size)
File "/usr/local/lib/python3.10/dist-packages/PIL/Image.py", line 3068, in _decompression_bomb_check
raise DecompressionBombError(
PIL.Image.DecompressionBombError: Image size (235978454 pixels) exceeds limit of 178956970 pixels, could be decompression bomb DOS attack.
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/bin/pdf2pdfocr.py", line 1530, in <module>
pdf2ocr.ocr()
File "/usr/local/bin/pdf2pdfocr.py", line 712, in ocr
self.check_blank_pages(image_file_list)
File "/usr/local/bin/pdf2pdfocr.py", line 1010, in check_blank_pages
blank_map_values = colors_size_pool_map.get()
File "/usr/lib/python3.10/multiprocessing/pool.py", line 774, in get
raise self._value
PIL.Image.DecompressionBombError: Image size (235978454 pixels) exceeds limit of 178956970 pixels, could be decompression bomb DOS attack.
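Both Pillow messages on this page quote a pixel limit: 89478485 for the warning and 178956970 (twice that) for the error. Page images above the limit come from large pages rasterized at a high resolution. The sketch below estimates the highest resolution that keeps a page under a given limit. The page size in inches is an input you supply, the function names are hypothetical, and whether a lower resolution is acceptable for OCR quality is a separate question.

```python
import math

# Limits quoted in the Pillow messages on this page.
WARN_LIMIT = 89_478_485      # DecompressionBombWarning threshold
ERROR_LIMIT = 178_956_970    # DecompressionBombError threshold (twice the warning limit)


def pixels_at(width_in, height_in, dpi):
    """Pixel count of a page of the given size (inches) rasterized at `dpi`."""
    return round(width_in * dpi) * round(height_in * dpi)


def max_dpi(width_in, height_in, limit=WARN_LIMIT):
    """Largest whole DPI whose pixel count stays at or below `limit`."""
    dpi = math.floor(math.sqrt(limit / (width_in * height_in)))
    # Rounding each side separately can push the product just over the limit.
    while dpi > 1 and pixels_at(width_in, height_in, dpi) > limit:
        dpi -= 1
    return dpi
```

If the input is trusted, Pillow also lets the caller raise or disable the check through `PIL.Image.MAX_IMAGE_PIXELS`, at the cost of losing the protection the check provides.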
Followed the instruction guide for Windows and noticed an error when running the SendTo VBScript
"ModuleNotFoundError: No module named 'PyPDF2.utils'"
Looks like the latest version of PyPDF2 moved PdfReadError from utils to errors
Changing line 41 from
from PyPDF2.utils import PdfReadError
to
from PyPDF2.errors import PdfReadError
fixed the problem.
Cheers
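To keep the script working with both old and new PyPDF2 releases, the import can fall back from the new location to the old one. The helper below is a generic sketch of that pattern; `import_first` is a hypothetical name, not something PyPDF2 or pdf2pdfocr provides.

```python
import importlib


def import_first(candidates, attr):
    """Return `attr` from the first module in `candidates` that provides it."""
    for module_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # module missing in this version, try the next one
        if hasattr(module, attr):
            return getattr(module, attr)
    raise ImportError(f"{attr} not found in any of: {', '.join(candidates)}")


# Intended use, following the fix described above (new location first):
# PdfReadError = import_first(["PyPDF2.errors", "PyPDF2.utils"], "PdfReadError")
```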
Text can be extracted, but all pages are blank.
The error I get is "PyPDF2.errors.PdfReadError: Cannot read an empty file". I experimented with the first 2 pages of this pdf; individually the two pages OCR'ed fine (neither page was empty, and the OCR'ed text was not empty either), but when I tried to do the 2 pages together, it gave me
Traceback (most recent call last):
File "/usr/local/bin/pdf2pdfocr.py", line 1526, in <module>
pdf2ocr.ocr()
File "/usr/local/bin/pdf2pdfocr.py", line 717, in ocr
self.join_ocred_pdf()
File "/usr/local/bin/pdf2pdfocr.py", line 952, in join_ocred_pdf
pdf_merger.append(PyPDF2.PdfFileReader(text_pdf_file, strict=False))
File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_reader.py", line 1856, in __init__
super().__init__(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_reader.py", line 277, in __init__
self.read(stream)
File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_reader.py", line 1301, in read
raise PdfReadError("Cannot read an empty file")
PyPDF2.errors.PdfReadError: Cannot read an empty file.
Side note, on the successful runs it gave me the warnings
UserWarning: isString is deprecated and will be removed in PyPDF2 2.0.0. [_utils.py:76]
UserWarning: namedDestinations will be removed in PyPDF2 2.0.0. Use named_destinations instead. [_reader.py:519]
UserWarning: addMetadata is deprecated and will be removed in PyPDF2 2.0.0. Use add_metadata instead. [_writer.py:793]
After OCR processing of these PDFs, extracting text by any method, such as code-based extraction or copying directly from the browser-rendered PDF, returns the entire text twice.
For instance, if the original text contains 5 characters, post-OCR, it recognizes and extracts 10 characters, effectively causing duplication of the content.
When running the GUI I receive an error message as seen below. Note that this does not seem to have consequences when macOS is configured in light mode; in dark mode, however, the unselected UI controls are displayed empty (see attached screenshot).
% pdf2pdfocr_gui.py
2020-04-11 08:16:46.702 Python[45232:697646] CoreText note: Client requested name ".SFNS-Regular", it will get Times-Roman rather than the intended font. All system UI font access should be through proper APIs such as CTFontCreateUIFontForLanguage() or +[NSFont systemFontOfSize:].
2020-04-11 08:16:46.702 Python[45232:697646] CoreText note: Set a breakpoint on CTFontLogSystemFontNameRequest to debug.
How do I use it on Windows 10? Can PaddleOCR be used as the OCR engine?
Something seems to be wrong. I am running MacOS 10.13.6 with a fresh macports installation.
[2018-07-21 12:18:53.392458] [LOG] Input file /Users/emoret/Downloads/01-19-2017.pdf: type is application/pdf
PdfReadWarning: Multiple definitions in dictionary at byte 0xa9769 for key /Outlines [generic.py:588]
[2018-07-21 12:18:53.400228] [DEBUG] Output file: /Users/emoret/Downloads/01-19-2017-OCR.pdf for PDF and /Users/emoret/Downloads/01-19-2017-OCR.pdf.txt for TXT
[2018-07-21 12:18:53.400349] [LOG] Converting input file to images...
[2018-07-21 12:18:54.256488] [LOG] Starting OCR...
[2018-07-21 12:18:54.268721] [LOG] Waiting for OCR to complete. 0/5 pages completed...
[2018-07-21 12:18:59.271505] [LOG] OCR completed
[2018-07-21 12:18:59.273427] [DEBUG] We have 0 ocr'ed files
No PDF files generated after OCR. This is not expected. Aborting.
I did a quick test and got this error below
System information
Distributor ID: Ubuntu
Description: Ubuntu 22.04.4 LTS
Release: 22.04
Codename: jammy
Linux dev4-1 5.15.0-113-generic #123-Ubuntu SMP Mon Jun 10 08:16:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Client: Docker Engine - Community
Version: 27.0.3
API version: 1.46
Go version: go1.21.11
Git commit: 7d4bcd8
Built: Sat Jun 29 00:02:33 2024
OS/Arch: linux/amd64
Context: default
Server: Docker Engine - Community
Engine:
Version: 27.0.3
API version: 1.46 (minimum version 1.24)
Go version: go1.21.11
Git commit: 662f78c
Built: Sat Jun 29 00:02:33 2024
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.7.18
GitCommit: ae71819c4f5e67bb4d5ae76a6b735f29cc25774e
nvidia:
Version: 1.7.18
GitCommit: v1.1.13-0-g58aa920
docker-init:
Version: 0.19.0
GitCommit: de40ad0
Error log
❯ docker run --rm -v "$(pwd):/home/docker" leofcardoso/pdf2pdfocr -v -i ./inby.pdf
Unable to find image 'leofcardoso/pdf2pdfocr:latest' locally
latest: Pulling from leofcardoso/pdf2pdfocr
37aaf24cf781: Pull complete
da892f4d0cb0: Pull complete
df89c9ce1e48: Pull complete
d2a3165daa7e: Pull complete
663286a455ab: Pull complete
4f4fb700ef54: Pull complete
35693ee7cdbf: Pull complete
4215239b5448: Pull complete
Digest: sha256:6f446c6fa612ffd304bede285556cc0190f53c6506f8a7200a69a603261643a6
Status: Downloaded newer image for leofcardoso/pdf2pdfocr:latest
-------------------------------------
File: ./inby.pdf
[2024-07-10 01:00:35.107971] [DEBUG] Tesseract can 'textonly_pdf': True
[2024-07-10 01:00:35.117933] [DEBUG] Tesseract version: 4
[2024-07-10 01:00:35.144010] [DEBUG] Pdftoppm version: 22.2.0
[2024-07-10 01:00:35.151576] [DEBUG] Qpdf version: 10.6.3
[2024-07-10 01:00:35.151798] [DEBUG] Temp dir is /tmp/pdf2pdfocr_F7DGC/
[2024-07-10 01:00:35.151836] [DEBUG] Prefix is F7DGC
[2024-07-10 01:00:35.151884] [DEBUG] Script dir is /usr/local/bin/
[2024-07-10 01:00:35.151972] [DEBUG] Parallel operations will use 40 CPUs
Traceback (most recent call last):
File "/usr/local/bin/pdf2pdfocr.py", line 1509, in <module>
pdf2ocr = Pdf2PdfOcr(pdf2ocr_args, file_to_process)
File "/usr/local/bin/pdf2pdfocr.py", line 585, in __init__
self.main_pool = multiprocessing.Pool(self.cpu_to_use)
File "/usr/lib/python3.10/multiprocessing/context.py", line 119, in Pool
return Pool(processes, initializer, initargs, maxtasksperchild,
File "/usr/lib/python3.10/multiprocessing/pool.py", line 235, in __init__
self._worker_handler.start()
File "/usr/lib/python3.10/threading.py", line 935, in start
_start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
Would be nice to have a simple gui to this tool. Maybe a contextual menu in finder, a print driver, or even an icon that could accept dragged and dropped pdf files ready to OCR?
With some old poppler versions and specific PDFs, the script generates blank pages.
How can I use this tool directly on Windows 11 without Docker?
I'd like to utilize it as a python function API that accepts arguments and generates the OCR'd file.
Hi
PDF source :
Module 3 - .v2.pdf
I'm trying to OCR the text on my pdf for personal use.
I've checked the generated TXT file, and it's correct (I'm seeing the proper text).
But when I open my PDF file (with OCR) and search for text, it does not work. If I copy/paste text from the PDF, I get garbled text:
22tropnosseccaevitartsinimdatimrepdluowspuorgeerhtllA.puorgrevresnoitacilppaehtotylnonepo
erucesylhgihyolpednacuoy,msinahcemsihthtiW.krowtenetaroprocs’remotsucehtmorfylnotub .snoitacilppa
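The pasted text reads as English with the characters of each line in reverse order and the spaces missing. A quick way to confirm this is to reverse each line, as in the sketch below. This is only a diagnostic aid for reading the sample, not a fix for the text layer, and `unreverse_lines` is a hypothetical helper name.

```python
def unreverse_lines(text):
    """Reverse the characters of each line, keeping the line order."""
    return "\n".join(line[::-1] for line in text.splitlines())
```

Applied to the fragment `puorgrevresnoitacilppaehtotylnonepo` this gives `openonlytotheapplicationservergroup`, which suggests the OCR itself is correct and the text layer is being written or read in the wrong direction.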
Execution inside docker container takes too much time.
I found out that following the macports installation documentation did not properly install the modules with pip. In order to make it work, I had to use pip3 such as:
sudo pip3 install reportlab Gooey
sudo pip3 install https://github.com/mstamy2/PyPDF2/archive/master.zip
sudo pip3 install lxml beautifulsoup4
Do I need cuneiform? I read your Windows install.txt file and understood it to be optional, but maybe I'm wrong.
It's an interesting tool and would be perfect for creating a database of my private papers.
thx a lot
Martin
Sometimes, execution fails with a "long command line" error on Windows when ImageMagick is called.
If the output file is selected in the pdf2pdfocr GUI, this file must currently already exist, which is not reasonable.
The script hangs forever with Python 3.7.2 on Windows.
At the moment, when the option "-i" is given a directory, a separate PDF file is created for each file in it. There should be an option that packs all the files into one single PDF file. This would be useful if the directory held several "edited" image files after a scan (e.g. processed with scantailor), one for each page. An option for sorting these pages "correctly" would also need to be kept in mind.
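For the sorting concern, a natural-order key is one common choice, so that a file named page10 comes after page2 instead of after page1. The sketch below shows one such key; the file names are illustrative, and this is a suggestion for the requested feature, not existing behaviour of the tool.

```python
import re


def natural_key(name):
    """Split a name into text and integer chunks so 'page10' sorts after 'page2'."""
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r"(\d+)", name)]
```

Usage: `sorted(file_names, key=natural_key)` returns the page files in the order a person would expect from their numbering.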
Is there a flag to set --oem 1 for tesseract 4, as documented here?
This wasn't included in the readme file but some info for anyone else lost.
You can change the language model to download by editing this:
aria2c "https://github.com/tesseract-ocr/tessdata/blob/main/por.traineddata?raw=true" --dir="%TESSDATA_PREFIX%"
Then change the language prefix to the language you want, as long as it's available in the tesseract repo. For example, here is Swedish - "swe":
Further info here:
https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc#LANGUAGES
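The URL in the download command follows a fixed pattern with the language code in the file name. As a small sketch, the helper below builds the URLs for a language string such as "swe+eng". The pattern is copied from the command above; `traineddata_urls` is a hypothetical name, and whether a given code exists in the tessdata repository still has to be checked.

```python
TESSDATA_URL = "https://github.com/tesseract-ocr/tessdata/blob/main/{lang}.traineddata?raw=true"


def traineddata_urls(langs):
    """Map a Tesseract language string such as 'swe+eng' to download URLs."""
    return {code: TESSDATA_URL.format(lang=code) for code in langs.split("+") if code}
```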
To change default language edit pdf2pdfocr.py on line 548 from Portuguese + English - "por+eng" to whichever. For me I use Swedish + English - "swe+eng"
self.tess_langs = "por+eng" # Default
to
self.tess_langs = "swe+eng" # Default
Hello
In the PDF produced by pdf2pdfocr_gui.py, the right edge of the text is not fully highlighted if 'tesseract' is selected for the '-e' option.
If 'native' is selected, the entire text is highlighted correctly, but there are no Cyrillic characters.
See the screenshots...
If you say that this is a Tesseract problem, that is incorrect, because I recognize DjVu files with 'ocrodjvu' and the text is highlighted correctly after recognition.
Hello,
I want to replace the Tesseract engine with the Google Vision API. Can you please suggest how to do that?
thanks
When I use pdf2pdfocr, the text generated includes no space between the words recognized. As a result when I copy/paste the resulting text it is difficult to use as I have to manually reintroduce all missing spaces.
Dear Leo,
I love your project but would like to directly push the OCRed files to a new directory.
Therefore I tried to amend the default options: default_option = "-stp -j 0.9 -o %Userprofile%"
But no matter which directory I add, I always get a permission error:
Traceback (most recent call last):
File "C:\Users\Christoph\pdf2pdfocr-venv\Scripts\pdf2pdfocr.py", line 1249, in <module>
pdf2ocr.ocr()
File "C:\Users\Christoph\pdf2pdfocr-venv\Scripts\pdf2pdfocr.py", line 605, in ocr
self.initial_cleanup()
File "C:\Users\Christoph\pdf2pdfocr-venv\Scripts\pdf2pdfocr.py", line 952, in initial_cleanup
Pdf2PdfOcr.best_effort_remove(self.output_file)
File "C:\Users\Christoph\pdf2pdfocr-venv\Scripts\pdf2pdfocr.py", line 1154, in best_effort_remove
os.remove(filename)
PermissionError: [WinError 5] Zugriff verweigert: 'C:\Users\Christoph'
Any ideas how to fix it?
Thank you so much.
BR
Christoph
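The traceback shows os.remove being called on 'C:\Users\Christoph', which is a directory. That suggests the value given to "-o" is treated as the output file itself, not as a folder to write into. Under that assumption, one workaround is to pass a full file path per input. The sketch below builds such a path; the "-OCR.pdf" suffix mirrors the default naming seen in a log on this page, and `output_file_for` is a hypothetical helper.

```python
from pathlib import Path


def output_file_for(input_pdf, target_dir, suffix="-OCR.pdf"):
    """Build a full output file path inside `target_dir` for `input_pdf`."""
    return Path(target_dir) / (Path(input_pdf).stem + suffix)
```

The result can then be passed as the "-o" value for each file, in place of the bare directory.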
Hi there,
I always was looking for an open source tool suite like yours. I installed everything as explained for Windows7x64 system according to your README.
Right Click on PDF -> Send To -> VBS Script gives the error from above
However, mogrify is installed, as I can run it from the command line with "magick mogrify" successfully.
Looking into the python code it looks that it should work. Can you help me out?
Thank you so much.