cmccambridge / ocrmypdf-auto
Docker container to automate use of OCRmyPDF to process documents.
License: MIT License
Related bad behaviors:
After merging #6 and updating the unraid template accordingly, the standard container configuration is broken when run under unraid. unraid always passes an environment variable to a container, even if its value is an empty string. Inside the container, we note the presence of OCR_LANGUAGES and decide to map it to --language, but end up passing nothing for the required value of that parameter. This causes ocrmypdf to fail 100% of the time.
If OCR_LANGUAGES is present in the environment and set to a valid space-delimited value as defined for ocrmypdf-auto, we append that directly to the --language parameter for ocrmypdf, which is improper. We should be converting it to a '+'-delimited string of languages.
And more problematically, I have zero tests for any of this, including the basic good path of "start up like the unraid template does and make sure you can scan a document."
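The two behaviors described above (an empty OCR_LANGUAGES must not produce a bare --language flag, and a space-delimited list must become a '+'-delimited one) can be sketched as follows. This is a minimal illustration, not the project's actual code; the function name language_args is hypothetical.

```python
def language_args(environ):
    """Build the ocrmypdf language arguments from an environment mapping.

    Hypothetical helper: returns an empty list when OCR_LANGUAGES is
    missing, empty, or whitespace only, so no bare --language is passed.
    """
    raw = environ.get("OCR_LANGUAGES", "")
    languages = raw.split()  # split() with no argument drops all whitespace
    if not languages:
        return []
    # ocrmypdf expects several languages joined with '+', e.g. deu+eng
    return ["--language", "+".join(languages)]
```

Called with os.environ, the result can be appended directly to the ocrmypdf argument list. The same function is also a natural target for the missing tests, since it needs no container to run.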
When an output file is moved or deleted quickly after processing completes, especially during parallel processing of many files, OcrTask may not yet have been scheduled to sanity-check the final timestamp and run on-success actions (such as deleting or archiving the input file) before the output is no longer accessible. The output_mtime measurement then fails.
We should either remove the timestamp sanity check and rely on the return code from ocrmypdf, or find a way to win this race, e.g. by moving the file to the final path from /ocrtemp as a final step.
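One way the "move as a final step" idea could look is sketched below, assuming the OCR result is first written under /ocrtemp. The function name finalize_output and the ".partial" suffix are hypothetical. The timestamp is read while the file is still private to the task, so nothing has to stat the output after it becomes visible to other programs.

```python
import os
import shutil


def finalize_output(temp_path, final_path):
    """Check the result in the temp dir, then publish it as the last step."""
    # Sanity check happens here, before any consumer can touch the file.
    result_mtime = os.stat(temp_path).st_mtime

    final_dir = os.path.dirname(final_path) or "."
    os.makedirs(final_dir, exist_ok=True)

    # Copy next to the destination first; the temp dir may be on another
    # filesystem, where a plain rename would not be possible.
    staging = final_path + ".partial"  # hypothetical suffix
    shutil.copy2(temp_path, staging)
    os.replace(staging, final_path)  # atomic within one filesystem
    os.unlink(temp_path)
    return result_mtime
```

On-success actions can then use the returned timestamp, and it no longer matters if the output is moved or deleted immediately afterwards.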
Hi,
I tested with chi-tra and chi-sim, but neither works. Could it be related to a - versus _ problem?
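For context on the - versus _ question: Tesseract's own language codes use underscores (chi_sim, chi_tra), while the Debian/Ubuntu package names use hyphens (tesseract-ocr-chi-sim). Whether ocrmypdf-auto translates between the two is not stated here, so the sketch below only illustrates the mapping; both function names are hypothetical.

```python
def tesseract_code(name):
    """Language code as Tesseract and ocrmypdf expect it, e.g. chi_sim."""
    return name.strip().replace("-", "_")


def apt_package(name):
    """Debian-style package name for a language, e.g. tesseract-ocr-chi-sim."""
    return "tesseract-ocr-" + name.strip().replace("_", "-").lower()
```

If the container installs packages from the hyphenated spelling but passes the same spelling on to ocrmypdf, that mismatch would be one possible cause of the failure reported above.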
Hi,
My scanner puts all scanned documents on an SMB share.
This share is then mounted via SMB into the container.
At startup all documents get OCR'ed, but documents added after startup are not recognized.
In my limited understanding, I think that inotify does not work with Docker and SMB shares.
Thank you very much...
mg
version: '3'
services:
  ######## ocrmypdf-auto ########
  ocrmypdf-auto:
    container_name: "ocrmypdf-auto"
    image: cmccambridge/ocrmypdf-auto
    restart: always
    environment:
      - TZ=Europe/Berlin
      - 'OCR_LANGUAGES=deu eng'
      - OCR_OUTPUT_MODE=SINGLE_FOLDER
      - OCR_PROCESS_EXISTING_ON_START=1
      - OCR_ACTION_ON_SUCCESS=NOTHING
      - UID=1000
      - GID=1000
      - USERMAP_UID=1000
      - USERMAP_GID=1000
    volumes:
      - scan_input:/input
      - scan_output:/output
      - config:/config
######## Volumes ########
volumes:
  config:
  scan_input:
    driver: local
    driver_opts:
      type: "cifs"
      o: "user=ocrmypdf,password=XXXXX,rw"
      device: "//192.168.2.36/scans"
  scan_output:
    driver: local
    driver_opts:
      type: "cifs"
      o: "user=ocrmypdf,password=XXXXX,rw"
      device: "//192.168.2.36/scans/output"
ocrmypdf-auto | 2021-01-21 14:27:49 - Watching /input
ocrmypdf-auto | 2021-01-21 14:27:49 - Processing: /input/20210109_000066.pdf -> /output/20210109_000066.pdf
ocrmypdf-auto | 2021-01-21 14:27:49 - Processing: /input/20210112_000122.pdf -> /output/20210112_000122.pdf
ocrmypdf-auto | 2021-01-21 14:27:49 - Processing: /input/20210109_000070.pdf -> /output/20210109_000070.pdf
ocrmypdf-auto | 2021-01-21 14:27:53 - Processing complete in 3.720000 seconds with status 5: /input/20210109_000066.pdf
ocrmypdf-auto | TESTOCR_PROCESS_RESULT/input/20210109_000066.pdf/output/20210109_000066.pdf53.720000
ocrmypdf-auto | 2021-01-21 14:27:53 - Processing: /input/20210112_000108.pdf -> /output/20210112_000108.pdf
ocrmypdf-auto | 2021-01-21 14:27:53 - Processing complete in 3.720000 seconds with status 5: /input/20210109_000070.pdf
ocrmypdf-auto | TESTOCR_PROCESS_RESULT/input/20210109_000070.pdf/output/20210109_000070.pdf53.720000
ocrmypdf-auto | 2021-01-21 14:27:53 - Processing: /input/20210109_000077.pdf -> /output/20210109_000077.pdf
ocrmypdf-auto | 2021-01-21 14:27:53 - Processing complete in 3.760000 seconds with status 5: /input/20210112_000122.pdf
ocrmypdf-auto | TESTOCR_PROCESS_RESULT/input/20210112_000122.pdf/output/20210112_000122.pdf53.760000
ocrmypdf-auto | 2021-01-21 14:27:53 - Processing: /input/20210115_000222.pdf -> /output/20210115_000222.pdf
ocrmypdf-auto | 2021-01-21 14:27:54 - Processing complete in 0.760000 seconds with status 5: /input/20210109_000077.pdf
ocrmypdf-auto | TESTOCR_PROCESS_RESULT/input/20210109_000077.pdf/output/20210109_000077.pdf50.760000
ocrmypdf-auto | 2021-01-21 14:27:54 - Processing: /input/20210121_000249.pdf -> /output/20210121_000249.pdf
ocrmypdf-auto | 2021-01-21 14:27:54 - Processing complete in 0.830000 seconds with status 5: /input/20210112_000108.pdf
ocrmypdf-auto | TESTOCR_PROCESS_RESULT/input/20210112_000108.pdf/output/20210112_000108.pdf50.830000
ocrmypdf-auto | 2021-01-21 14:27:54 - Processing: /input/20210118_000237.pdf -> /output/20210118_000237.pdf
ocrmypdf-auto | 2021-01-21 14:27:54 - Processing complete in 0.790000 seconds with status 5: /input/20210115_000222.pdf
ocrmypdf-auto | TESTOCR_PROCESS_RESULT/input/20210115_000222.pdf/output/20210115_000222.pdf50.790000
ocrmypdf-auto | 2021-01-21 14:27:54 - Processing: /input/20210112_000217.pdf -> /output/20210112_000217.pdf
ocrmypdf-auto | 2021-01-21 14:27:54 - Processing complete in 0.780000 seconds with status 5: /input/20210121_000249.pdf
ocrmypdf-auto | TESTOCR_PROCESS_RESULT/input/20210121_000249.pdf/output/20210121_000249.pdf50.780000
ocrmypdf-auto | 2021-01-21 14:27:54 - Processing: /input/20210112_000172.pdf -> /output/20210112_000172.pdf
ocrmypdf-auto | 2021-01-21 14:27:54 - Processing complete in 0.750000 seconds with status 5: /input/20210118_000237.pdf
ocrmypdf-auto | TESTOCR_PROCESS_RESULT/input/20210118_000237.pdf/output/20210118_000237.pdf50.750000
ocrmypdf-auto | 2021-01-21 14:27:54 - Processing: /input/20210111_000105.pdf -> /output/20210111_000105.pdf
ocrmypdf-auto | 2021-01-21 14:27:54 - Processing complete in 0.780000 seconds with status 5: /input/20210112_000217.pdf
ocrmypdf-auto | TESTOCR_PROCESS_RESULT/input/20210112_000217.pdf/output/20210112_000217.pdf50.780000
ocrmypdf-auto | 2021-01-21 14:27:54 - Processing: /input/20210109_000072.pdf -> /output/20210109_000072.pdf
ocrmypdf-auto | 2021-01-21 14:27:55 - Processing complete in 0.750000 seconds with status 5: /input/20210112_000172.pdf
ocrmypdf-auto | TESTOCR_PROCESS_RESULT/input/20210112_000172.pdf/output/20210112_000172.pdf50.750000
ocrmypdf-auto | 2021-01-21 14:27:55 - Processing: /input/20210121_000243.pdf -> /output/20210121_000243.pdf
ocrmypdf-auto | 2021-01-21 14:27:55 - Processing complete in 0.790000 seconds with status 5: /input/20210111_000105.pdf
ocrmypdf-auto | TESTOCR_PROCESS_RESULT/input/20210111_000105.pdf/output/20210111_000105.pdf50.790000
ocrmypdf-auto | 2021-01-21 14:27:55 - Processing: /input/20210115_000226.pdf -> /output/20210115_000226.pdf
ocrmypdf-auto | 2021-01-21 14:27:55 - Processing complete in 0.770000 seconds with status 5: /input/20210109_000072.pdf
ocrmypdf-auto | TESTOCR_PROCESS_RESULT/input/20210109_000072.pdf/output/20210109_000072.pdf50.770000
ocrmypdf-auto | 2021-01-21 14:27:55 - Processing: /input/20210109_000056.pdf -> /output/20210109_000056.pdf
ocrmypdf-auto | 2021-01-21 14:27:56 - Processing complete in 0.730000 seconds with status 5: /input/20210121_000243.pdf
ocrmypdf-auto | TESTOCR_PROCESS_RESULT/input/20210121_000243.pdf/output/20210121_000243.pdf50.730000
ocrmypdf-auto | 2021-01-21 14:27:56 - Processing: /input/20210119_000239.pdf -> /output/20210119_000239.pdf
ocrmypdf-auto | 2021-01-21 14:27:56 - Processing complete in 0.770000 seconds with status 5: /input/20210115_000226.pdf
ocrmypdf-auto | TESTOCR_PROCESS_RESULT/input/20210115_000226.pdf/output/20210115_000226.pdf50.770000
ocrmypdf-auto | 2021-01-21 14:27:56 - Processing: /input/20210112_000215.pdf -> /output/20210112_000215.pdf
ocrmypdf-auto | 2021-01-21 14:27:56 - Processing complete in 0.780000 seconds with status 5: /input/20210109_000056.pdf
ocrmypdf-auto | TESTOCR_PROCESS_RESULT/input/20210109_000056.pdf/output/20210109_000056.pdf50.780000
ocrmypdf-auto | 2021-01-21 14:27:56 - Processing: /input/20210112_000203.pdf -> /output/20210112_000203.pdf
ocrmypdf-auto | 2021-01-21 14:27:57 - Processing complete in 0.880000 seconds with status 5: /input/20210119_000239.pdf
ocrmypdf-auto | TESTOCR_PROCESS_RESULT/input/20210119_000239.pdf/output/20210119_000239.pdf50.880000
ocrmypdf-auto | 2021-01-21 14:27:57 - Processing: /input/20210114_000218.pdf -> /output/20210114_000218.pdf
ocrmypdf-auto | 2021-01-21 14:27:57 - Processing complete in 0.850000 seconds with status 5: /input/20210112_000203.pdf
ocrmypdf-auto | TESTOCR_PROCESS_RESULT/input/20210112_000203.pdf/output/20210112_000203.pdf50.850000
ocrmypdf-auto | 2021-01-21 14:27:57 - Processing: /input/20210112_000161.pdf -> /output/20210112_000161.pdf
ocrmypdf-auto | 2021-01-21 14:27:57 - Processing complete in 0.870000 seconds with status 5: /input/20210112_000215.pdf
ocrmypdf-auto | TESTOCR_PROCESS_RESULT/input/20210112_000215.pdf/output/20210112_000215.pdf50.870000
ocrmypdf-auto | 2021-01-21 14:27:57 - Processing: /input/20210121_000245.pdf -> /output/20210121_000245.pdf
ocrmypdf-auto | 2021-01-21 14:27:58 - Processing complete in 0.780000 seconds with status 5: /input/20210114_000218.pdf
ocrmypdf-auto | TESTOCR_PROCESS_RESULT/input/20210114_000218.pdf/output/20210114_000218.pdf50.780000
ocrmypdf-auto | 2021-01-21 14:27:58 - Processing complete in 0.780000 seconds with status 5: /input/20210112_000161.pdf
ocrmypdf-auto | TESTOCR_PROCESS_RESULT/input/20210112_000161.pdf/output/20210112_000161.pdf50.780000
ocrmypdf-auto | 2021-01-21 14:27:58 - Processing complete in 0.810000 seconds with status 5: /input/20210121_000245.pdf
ocrmypdf-auto | TESTOCR_PROCESS_RESULT/input/20210121_000245.pdf/output/20210121_000245.pdf50.810000
^CGracefully stopping... (press Ctrl+C again to force)
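Regarding the inotify suspicion in this report: inotify events are generated by the local kernel, so changes made on the SMB server by another machine are generally not reported to a watcher inside the container. Polling avoids that dependency. Another report on this page sets OCR_USE_POLLING_SCHEDULER=1, which looks like the container's switch for this. The sketch below only illustrates the idea with the standard library and is not the project's implementation; the function names are hypothetical.

```python
import os


def snapshot(root):
    """Map every PDF path under root to its (mtime, size) signature."""
    seen = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(".pdf"):
                continue
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except FileNotFoundError:
                continue  # file vanished between listing and stat
            seen[path] = (st.st_mtime, st.st_size)
    return seen


def changed_files(before, after):
    """Paths that are new or whose signature differs between two snapshots."""
    return sorted(p for p, sig in after.items() if before.get(p) != sig)
```

A watcher would call snapshot("/input") every few seconds and hand the result of changed_files to the OCR queue. Polling costs more I/O than inotify, but it works on any mounted filesystem.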
2021-03-20 21:17:26 - Watching /input,
2021-03-20 21:17:56 - Processing: /input/HP scan -0046.pdf -> /output/HP scan -0046.pdf,
2021-03-20 21:18:01 - Processing complete in 4.140000 seconds with status 5: /input/HP scan -0046.pdf,
TESTOCR_PROCESS_RESULT/input/HP scan -0047.pdf/output/HP scan -0047.pdf54.020000
I installed this as a docker container and when a pdf is loaded into the input folder, the log gives me this output. No pdf is converted and placed in the output folder.
What is status 5?
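The status in these log lines appears to be the exit code of ocrmypdf. The table below follows the exit codes documented by OCRmyPDF as far as I know them; the image pins an older release, so check the documentation for that version before relying on it. The helper name describe_status is hypothetical.

```python
# Exit codes as documented by OCRmyPDF (may differ in older releases).
OCRMYPDF_EXIT_CODES = {
    0: "ok",
    1: "bad_args",
    2: "input_file",
    3: "missing_dependency",
    4: "invalid_output_pdf",
    5: "file_access_error",
    6: "already_done_ocr",
    7: "child_process_error",
    8: "encrypted_pdf",
    9: "invalid_config",
    10: "pdfa_conversion_failed",
    15: "other_error",
    130: "ctrl_c",
}


def describe_status(code):
    """Return a short name for an ocrmypdf exit code, or 'unknown'."""
    return OCRMYPDF_EXIT_CODES.get(code, "unknown")
```

Under that reading, status 5 points to a file access problem, so permissions on the input, output, and temp folders for the container's user are the first thing to check.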
Expose an OCR_LANGUAGES variable that can be used to instruct the container (in docker-entrypoint.sh, probably?) to install additional tesseract language packages.
Open question: Also add a default -l <list> option to the ocrmypdf configuration?
When I start the container and move files in the input folder, nothing happens. I don't even get any output in the logs.
Start command:
docker run \
-v ./test-files/input/:/input \
-v ./test-files/output/:/output \
-v ./test-files/config/:/config \
--env OCR_VERBOSITY=debug \
--env OCR_LANGUAGES=deu \
cmccambridge/ocrmypdf-auto
The config file ./test-files/config/ocr.config: ocr.config
It also did not work with the example configuration.
The only thing in the logs beside the installation of the language pack is:
2022-05-18 15:58:45 [MainThread] - Watching /input
^C
2022-05-18 15:59:47 [MainThread] - Signal 2 (SIGINT) Received. Shutting down...
2022-05-18 15:59:47 [MainThread] - Shutting down filesystem watchdog...
2022-05-18 15:59:47 [MainThread] - Canceling all 0 in-flight tasks...
2022-05-18 15:59:47 [MainThread] - Shutting down threadpool...
2022-05-18 15:59:47 [MainThread] - Cleaning up filesystem watchdog...
I've tested the container on two systems:
Host OS: Ubuntu 21.04 // Windows 1121H1 (22000.675)
Docker Version: 20.10.14 (a224086) // 20.10.14 (a224086)
Docker Desktop Version: 4.8.1 (78998) // 4.8.2 (79419) with WSL2 support
Both have the same issue. I think, on the first system the problem occurred after updating docker, but I don't know from which to which version and if I had Docker Desktop installed already. Maybe the issue is also related to Docker Desktop.
Do you have any idea why the container is not working? Thanks in advance.
Add a prefix/suffix to the filename with a human-readable timestamp (datetime.now().strftime("%Y-%m-%d %H:%M:%S")) and/or a unique number before moving it to the archive folder. In my opinion, all files must stay in the archive folder without being overwritten.
My scanner (HP 426dn) can't generate unique filenames with a timestamp by itself, and when it scans to an empty folder (/inbox), the filename is always myscan0001.pdf. This means that after ocrmypdf-auto processes this file and moves it to the archive folder, the previously scanned file is overwritten.
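A minimal sketch of the requested naming scheme follows. It is an illustration only: the function name archive_path is hypothetical, and the timestamp format is changed from the one quoted above to avoid spaces and colons, which are awkward in filenames on some shares.

```python
import os
from datetime import datetime


def archive_path(archive_dir, filename, now=None):
    """Return a path in archive_dir that does not overwrite an existing file."""
    now = now or datetime.now()
    stamp = now.strftime("%Y-%m-%d_%H-%M-%S")
    stem, ext = os.path.splitext(filename)

    candidate = f"{stamp}_{stem}{ext}"
    counter = 1
    # Two scans in the same second still get distinct names.
    while os.path.exists(os.path.join(archive_dir, candidate)):
        candidate = f"{stamp}_{stem}_{counter}{ext}"
        counter += 1
    return os.path.join(archive_dir, candidate)
```

With this, repeated scans named myscan0001.pdf end up as separate, chronologically sortable files in the archive.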
Not sure how much size savings could be realized by switching to Alpine, given how many other packages get pulled in to satisfy ocrmypdf dependencies and their dependencies, but the biggest obstacle up front is that the only tesseract-ocr available for Alpine seems to be v3.05, which performs considerably worse than the not-yet-released v4 code.
As far as I can see:
PDFs scanned in black/white or grayscale get a slightly increased file size (5-10%).
My guess: this is just because of the additional text layer, which is fine.
PDFs scanned in color get a greatly decreased file size, sometimes nearly 50%!
A 40-50% decrease is not possible without a high loss of information when compressing.
Can you provide an option via the unRAID template to set another parameter, like
CompressionLevel=veryhigh, high, normal, low, verylow, none
Additionally, it would be nice to be able to set the output DPI.
For optimal OCR, the document should always be scanned at a minimum of 300 dpi.
But there is no need to save the file at 300 dpi after recognition. Instead, 200 dpi is usually enough, decreasing the file size without any visible loss of quality.
Thank you
I'd like someone to help me set my webhook URL. I've been reading
https://ocrmypdf.readthedocs.io/en/v9.7.1/api.html
without success.
Working configuration over portainer on a machine with physical resources.
Thank you in advance!
Removed from unRAID templates since they're not required but unRAID generates invalid docker commandlines without a specified mount for every volume in a template.
Should add documentation to the unRAID Integration setting to explain how to add these back in. Also consider adding support within the container to detect a mounted vs. unmounted /archive share, so that users don't inadvertently archive into the container...
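One possible way to detect a mounted /archive is to look for it in the kernel's mount table, sketched below. This is an assumption about how the check could be done, not existing container code; the function names are hypothetical. The parsing is kept separate from the file read so it can be tested without a container.

```python
import os


def is_mount_point(path, mounts_text):
    """Return True if path is listed as a mount point in /proc/mounts-style text."""
    target = os.path.normpath(path)
    for line in mounts_text.splitlines():
        fields = line.split()
        # Each line reads: device mountpoint fstype options dump pass
        if len(fields) >= 2 and os.path.normpath(fields[1]) == target:
            return True
    return False


def archive_is_mounted(path="/archive", mounts_file="/proc/mounts"):
    """Check the live mount table (Linux only)."""
    with open(mounts_file) as f:
        return is_mount_point(path, f.read())
```

At startup the container could refuse an archive action, or log a warning, when archive_is_mounted() returns False.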
Sorry -- noob question ahead.
First of all, thank you for the fantastic project. Unfortunately, I was not able to install it. I got to the part where I run the docker create command. However, it returns:
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm/v7) and no specific platform was requested
Can anyone give me a primer on how to install ocrmypdf-auto on Raspbian? Any help would be much appreciated.
I have the problem that I can't get any processing to happen. I set up the Container like in the Quick & Easy example. Everything seems to be fine and I get the Watching Folder notification. But if I copy a PDF in the input folder nothing happens. Any clue on why?
I tried different PDFs and checked if the volumes are mounted correctly. I did not use any custom config file.
Current image pins ocrmypdf at 6.2.0. Need to write more end-to-end regression tests and then upgrade to the latest (8.1.0 as of this issue).
I installed this under OMV 4 Docker.
When I look into the log file, I can see that test.pdf was OCR'ed and moved to the output directory. But when I look into my output directory, there is no file.
The ocr.config file indicates that common OCRmyPDF options are allowed. No matter what other options I add or remove, any time I use the "--redo-ocr" option it just hangs and does nothing. There is no CPU usage from the Docker container and nothing happens. The file just stays in the "input" folder.
Any advice would be appreciated. Thanks!
Hello,
I use your container for my paperless office, and it works great. The scanner stores the PDF in a folder (Input) on my server which the users don't have access to. Then ocrmypdf-auto does its work and saves the finished PDF on a shared drive. From time to time images (JPG) are also scanned. These scans are not processed by ocrmypdf-auto because the file extension is not *.pdf, this is correct. Would it be possible to include an option to simply move files that do not end with *.pdf to the Output folder?
Hello, I've set up my container with the following parameters
docker create \
--name=ocrmypdf-auto \
-v /srv/faa137b2-b75f-42c4-a777-497ac74524ef/altered/ocr/input:/input \
-v /srv/faa137b2-b75f-42c4-a777-497ac74524ef/altered/ocr/output:/output \
-v /srv/dev-disk-by-label-fsb/altered/appdata/ocrmypdf_auto:/config \
-v /srv/faa137b2-b75f-42c4-a777-497ac74524ef/altered/ocr/temp:/ocrtemp \
-v /srv/faa137b2-b75f-42c4-a777-497ac74524ef/altered/ocr/archive:/archive \
-e OCR_LANGUAGES="nor chi-sim dan eng swe" \
-e OCR_OUTPUT_MODE=MIRROR_TREE \
-e OCR_PROCESS_EXISTING_ON_START=1 \
-e OCR_ACTION_ON_SUCCESS=DELETE_INPUT_FILES \
-e PUID=1000 \
-e PGID=100 \
-e UMASK_SET=000 \
quay.io/cmccambridge/ocrmypdf-auto
In the portainer logs I get the following (error?)
2020-08-09 20:31:52 - Watching /input,
2020-08-09 20:31:52 - Processing: /input/Scan-037.pdf -> /output/Scan-037.pdf,
2020-08-09 20:31:52 - Processing: /input/._Scan-037.pdf -> /output/._Scan-037.pdf,
2020-08-09 20:31:55 - Processing complete in 3.350000 seconds with status 3: /input/._Scan-037.pdf,
TESTOCR_PROCESS_RESULT/input/._Scan-037.pdf/output/._Scan-037.pdf33.350000,
2020-08-09 20:31:55 - Processing complete in 3.350000 seconds with status 3: /input/Scan-037.pdf,
TESTOCR_PROCESS_RESULT/input/Scan-037.pdf/output/Scan-037.pdf33.350000
The input files stay in the input folder and are not deleted as I configured them to be. The output folder is empty.
I've tried resetting folder permissions to no avail.
I realize this may be thoroughly outside the intended scope of this project, but it would be wonderful if it would process not just PDF files, but a variety of image files (tiff and jpg come to mind). Perhaps by passing them directly to tesseract-ocr and outputting the results as text files?
Thanks for the fantastic Unraid docker container, and for your consideration!
I tried archiving to a location outside the container, but this results in
2022-12-18 22:06:53 [ThreadPoolExecutor-0_2] - Error in OcrTask.process: Traceback (most recent call last):
File "/usr/lib/python3.8/shutil.py", line 788, in move
os.rename(src, real_dst)
OSError: [Errno 18] Invalid cross-device link: '/input/standard/MFC-L8650cdw_003873.pdf' -> '/archive/standard/MFC-L8650cdw_003873.pdf'
Could this be turned into copy+unlink?
move() in shutil.py should detect the cross-filesystem move (see its docstring):
If the destination is on our current filesystem, then rename() is used.
Otherwise, src is copied to the destination and then removed. Symlinks are
recreated under the new name if os.rename() fails because of cross
filesystem renames.
but something seems to have gone wrong, although mount shows
/dev/nvme0n1p4 on /archive type btrfs (rw,noatime,ssd,space_cache,subvolid=262,subvol=/opt)
/dev/nvme0n1p4 on /config type btrfs (rw,noatime,ssd,space_cache,subvolid=262,subvol=/opt)
/dev/nvme0n1p4 on /input type btrfs (rw,noatime,ssd,space_cache,subvolid=262,subvol=/opt)
/dev/nvme0n1p4 on /output type btrfs (rw,noatime,ssd,space_cache,subvolid=262,subvol=/opt)
/dev/nvme0n1p4 on /ocrtemp type btrfs (rw,noatime,ssd,space_cache,subvolid=262,subvol=/opt)
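The copy+unlink fallback asked about above can be sketched as follows. Separate bind mounts count as different mount points for rename() even when they come from the same device, which would explain EXDEV despite the mount output. The function name move_with_fallback and the ".partial" suffix are hypothetical, and this is not the project's code.

```python
import errno
import os
import shutil


def move_with_fallback(src, dst):
    """Rename src to dst, falling back to copy+unlink across mount points."""
    os.makedirs(os.path.dirname(dst) or ".", exist_ok=True)
    try:
        os.rename(src, dst)
        return
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise  # only handle "Invalid cross-device link"

    # Copy into the destination directory first, then swap into place,
    # so a half-written archive file is never visible under its final name.
    staging = dst + ".partial"  # hypothetical suffix
    shutil.copy2(src, staging)
    os.replace(staging, dst)
    os.unlink(src)
```

The source is removed only after the copy has been placed under its final name, so a failure during the copy leaves the input file untouched.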
I have a separate drive for input, output, ocrtemp, etc., but I am having issues with the OS drive filling up during conversions.
It appears that "/var/lib/docker/overlay2/container ID/diff/tmp" is actually being used; the folder looks like:
Docker command below:
docker run -d \
--name=ocrmypdf \
-v /media/data/ocrmypdf/files/input:/input \
-v /media/data/ocrmypdf/files/output:/output \
-v /media/data/ocrmypdf/files/archive:/archive \
-v /media/data/ocrmypdf/ocrtemp:/ocrtemp \
-v /media/data/ocrmypdf/config:/config \
-e OCR_PROCESS_EXISTING_ON_START=1 \
-e OCR_ACTION_ON_SUCCESS=DELETE_INPUT_FILES \
-e OCR_USE_POLLING_SCHEDULER=1 \
-e USERMAP_UID=1001 \
-e USERMAP_GID=1001 \
--restart unless-stopped \
cmccambridge/ocrmypdf-auto
Convert if possible, for size savings.