cmccambridge / ocrmypdf-auto

Docker container to automate use of OCRmyPDF to process documents.

License: MIT License

Shell 6.98% Python 84.45% Dockerfile 5.61% Makefile 2.95%

ocrmypdf-auto's Issues

`OCR_LANGUAGES` is converted incorrectly to `--language` parameters to `ocrmypdf`

Related bad behaviors:

After merging #6 and updating the unraid template accordingly, the standard container configuration is broken when run under unraid. unraid always passes an environment variable to a container, even when its value is an empty string. Inside the container, we see that OCR_LANGUAGES is present and map it to --language, but we pass nothing for that parameter's required value. This causes ocrmypdf to fail 100% of the time.
If OCR_LANGUAGES is present in the environment and set to a valid space-delimited value as defined for ocrmypdf-auto, we append it directly to the --language parameter for ocrmypdf, which is improper. We should convert it to a '+'-delimited string of languages.

And more problematically, I have zero tests for any of this, including the basic good path of "start up like the unraid template does and make sure you can scan a document."
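
A minimal sketch of the conversion described above, with the two bad cases handled. The helper name language_args is hypothetical and not taken from the ocrmypdf-auto source; the only assumption about ocrmypdf is that --language accepts a '+'-joined list such as deu+eng.

```python
import os


def language_args(raw_value):
    """Map a space-delimited OCR_LANGUAGES value to ocrmypdf arguments.

    Hypothetical helper for illustration only.
    """
    languages = (raw_value or "").split()
    if not languages:
        # Unset or empty (as unraid passes it): emit no --language flag at all.
        return []
    # ocrmypdf expects multiple languages joined with '+', e.g. "deu+eng".
    return ["--language", "+".join(languages)]


if __name__ == "__main__":
    print(language_args(os.environ.get("OCR_LANGUAGES")))
```

The same function doubles as a starting point for the missing tests: the empty-string case and the multi-language case are both plain input/output checks.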

Race condition in evaluating post-processing timestamp

When an output file is moved or deleted shortly after processing completes, especially when many files are processed in parallel, the output may already be gone before OcrTask is scheduled to sanity-check the final timestamp and run on-success actions such as deleting or archiving the input file. The output_mtime measurement then fails.

We should either remove the timestamp sanity check and rely on the return code from ocrmypdf, or find a way to win this race, e.g. by moving the output from /ocrtemp to the final path as the last step.
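
The second option could look roughly like the sketch below: the timestamp is read while the file still sits in a private temp location, and only then is the file published to its final path. The function name publish_output is hypothetical, and how OcrTask actually structures this step is not shown in the issue.

```python
import shutil
from pathlib import Path


def publish_output(temp_output, final_output):
    """Record the mtime while the file is still private, then move it into place.

    Hypothetical helper; returns the mtime measured before publishing.
    """
    temp_output = Path(temp_output)
    final_output = Path(final_output)
    # Stat before publishing: no external consumer can see the file yet,
    # so it cannot be moved or deleted underneath us.
    output_mtime = temp_output.stat().st_mtime
    final_output.parent.mkdir(parents=True, exist_ok=True)
    # shutil.move renames when possible and falls back to copy + delete
    # when the temp and final paths are on different filesystems.
    shutil.move(str(temp_output), str(final_output))
    return output_mtime
```

With this ordering, any on-success action can use the returned timestamp instead of stat-ing the published file again.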

issue on chi-tra and chi-sim

Hi,

I tested with chi-tra and chi-sim, but neither works. Could this be related to the - versus _ problem?
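
For context on the - versus _ question: Tesseract names these languages with an underscore (chi_sim, chi_tra), while the Debian/Ubuntu packages use a hyphen (tesseract-ocr-chi-sim). A sketch of normalizing in both directions follows; both helper names are hypothetical, and whether the container performs any such mapping is an assumption, not something this issue confirms.

```python
def apt_package(language):
    """Debian/Ubuntu package name for a language: hyphens, never underscores."""
    return "tesseract-ocr-" + language.strip().replace("_", "-")


def tesseract_code(language):
    """Language code as passed to tesseract/ocrmypdf: underscores, never hyphens."""
    return language.strip().replace("-", "_")
```

Applied to this report, OCR_LANGUAGES="chi-sim" would install tesseract-ocr-chi-sim but would need to reach ocrmypdf as chi_sim.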

New files don't get recognized on Docker SMB volume

Hi,

My scanner puts all scanned documents on an SMB share.
This share is then mounted into the container via SMB.

At startup all documents get OCRed, but any documents added after startup are not recognized.

In my limited understanding, I think that inotify does not work with Docker and SMB shares.

Thank you very much...
mg
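
On the inotify point: inotify only reports changes made through the local kernel, so files written to a CIFS share by another machine (here, the scanner) generate no events. The usual workaround is polling. Another report in this list sets OCR_USE_POLLING_SCHEDULER=1, which by its name suggests the container offers such a mode, though that is an inference. The sketch below only illustrates the idea with the standard library; snapshot and new_or_changed are hypothetical names.

```python
import os


def snapshot(directory):
    """Return {path: (mtime, size)} for every PDF below directory."""
    seen = {}
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if not name.lower().endswith(".pdf"):
                continue
            path = os.path.join(root, name)
            try:
                info = os.stat(path)
            except OSError:
                continue  # File vanished between listing and stat.
            seen[path] = (info.st_mtime, info.st_size)
    return seen


def new_or_changed(previous, current):
    """Paths that are new, or whose mtime/size differ from the last snapshot."""
    return sorted(p for p, sig in current.items() if previous.get(p) != sig)
```

A watcher would call snapshot() every few seconds and hand the result of new_or_changed() to the OCR queue, ideally waiting until a file's size stops changing so that half-written scans are not picked up.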

docker-compose

version: '3'
services:
######## ocrmypdf-auto ########
  ocrmypdf-auto:
    container_name: "ocrmypdf-auto"
    image: cmccambridge/ocrmypdf-auto
    restart: always
    environment:
      - TZ=Europe/Berlin
      - 'OCR_LANGUAGES=deu eng'
      - OCR_OUTPUT_MODE=SINGLE_FOLDER
      - OCR_PROCESS_EXISTING_ON_START=1
      - OCR_ACTION_ON_SUCCESS=NOTHING
      - UID=1000
      - GID=1000
      - USERMAP_UID=1000
      - USERMAP_GIH=1000
    volumes:
      - scan_input:/input
      - scan_output:/output
      - config:/config


######## Volumes ########
volumes:
  config:
  scan_input:
    driver: local
    driver_opts:
      type: "cifs"
      o: "user=ocrmypdf,password=XXXXX,rw"
      device: "//192.168.2.36/scans"
  scan_output:
    driver: local
    driver_opts:
      type: "cifs"
      o: "user=ocrmypdf,password=XXXXX,rw"
      device: "//192.168.2.36/scans/output"

Log after last startup

ocrmypdf-auto    | 2021-01-21 14:27:49 - Watching /input
ocrmypdf-auto    | 2021-01-21 14:27:49 - Processing: /input/20210109_000066.pdf -> /output/20210109_000066.pdf
ocrmypdf-auto    | 2021-01-21 14:27:49 - Processing: /input/20210112_000122.pdf -> /output/20210112_000122.pdf
ocrmypdf-auto    | 2021-01-21 14:27:49 - Processing: /input/20210109_000070.pdf -> /output/20210109_000070.pdf
ocrmypdf-auto    | 2021-01-21 14:27:53 - Processing complete in 3.720000 seconds with status 5: /input/20210109_000066.pdf
ocrmypdf-auto    | TESTOCR_PROCESS_RESULT/input/20210109_000066.pdf/output/20210109_000066.pdf53.720000
ocrmypdf-auto    | 2021-01-21 14:27:53 - Processing: /input/20210112_000108.pdf -> /output/20210112_000108.pdf
ocrmypdf-auto    | 2021-01-21 14:27:53 - Processing complete in 3.720000 seconds with status 5: /input/20210109_000070.pdf
ocrmypdf-auto    | TESTOCR_PROCESS_RESULT/input/20210109_000070.pdf/output/20210109_000070.pdf53.720000
ocrmypdf-auto    | 2021-01-21 14:27:53 - Processing: /input/20210109_000077.pdf -> /output/20210109_000077.pdf
ocrmypdf-auto    | 2021-01-21 14:27:53 - Processing complete in 3.760000 seconds with status 5: /input/20210112_000122.pdf
ocrmypdf-auto    | TESTOCR_PROCESS_RESULT/input/20210112_000122.pdf/output/20210112_000122.pdf53.760000
ocrmypdf-auto    | 2021-01-21 14:27:53 - Processing: /input/20210115_000222.pdf -> /output/20210115_000222.pdf
ocrmypdf-auto    | 2021-01-21 14:27:54 - Processing complete in 0.760000 seconds with status 5: /input/20210109_000077.pdf
ocrmypdf-auto    | TESTOCR_PROCESS_RESULT/input/20210109_000077.pdf/output/20210109_000077.pdf50.760000
ocrmypdf-auto    | 2021-01-21 14:27:54 - Processing: /input/20210121_000249.pdf -> /output/20210121_000249.pdf
ocrmypdf-auto    | 2021-01-21 14:27:54 - Processing complete in 0.830000 seconds with status 5: /input/20210112_000108.pdf
ocrmypdf-auto    | TESTOCR_PROCESS_RESULT/input/20210112_000108.pdf/output/20210112_000108.pdf50.830000
ocrmypdf-auto    | 2021-01-21 14:27:54 - Processing: /input/20210118_000237.pdf -> /output/20210118_000237.pdf
ocrmypdf-auto    | 2021-01-21 14:27:54 - Processing complete in 0.790000 seconds with status 5: /input/20210115_000222.pdf
ocrmypdf-auto    | TESTOCR_PROCESS_RESULT/input/20210115_000222.pdf/output/20210115_000222.pdf50.790000
ocrmypdf-auto    | 2021-01-21 14:27:54 - Processing: /input/20210112_000217.pdf -> /output/20210112_000217.pdf
ocrmypdf-auto    | 2021-01-21 14:27:54 - Processing complete in 0.780000 seconds with status 5: /input/20210121_000249.pdf
ocrmypdf-auto    | TESTOCR_PROCESS_RESULT/input/20210121_000249.pdf/output/20210121_000249.pdf50.780000
ocrmypdf-auto    | 2021-01-21 14:27:54 - Processing: /input/20210112_000172.pdf -> /output/20210112_000172.pdf
ocrmypdf-auto    | 2021-01-21 14:27:54 - Processing complete in 0.750000 seconds with status 5: /input/20210118_000237.pdf
ocrmypdf-auto    | TESTOCR_PROCESS_RESULT/input/20210118_000237.pdf/output/20210118_000237.pdf50.750000
ocrmypdf-auto    | 2021-01-21 14:27:54 - Processing: /input/20210111_000105.pdf -> /output/20210111_000105.pdf
ocrmypdf-auto    | 2021-01-21 14:27:54 - Processing complete in 0.780000 seconds with status 5: /input/20210112_000217.pdf
ocrmypdf-auto    | TESTOCR_PROCESS_RESULT/input/20210112_000217.pdf/output/20210112_000217.pdf50.780000
ocrmypdf-auto    | 2021-01-21 14:27:54 - Processing: /input/20210109_000072.pdf -> /output/20210109_000072.pdf
ocrmypdf-auto    | 2021-01-21 14:27:55 - Processing complete in 0.750000 seconds with status 5: /input/20210112_000172.pdf
ocrmypdf-auto    | TESTOCR_PROCESS_RESULT/input/20210112_000172.pdf/output/20210112_000172.pdf50.750000
ocrmypdf-auto    | 2021-01-21 14:27:55 - Processing: /input/20210121_000243.pdf -> /output/20210121_000243.pdf
ocrmypdf-auto    | 2021-01-21 14:27:55 - Processing complete in 0.790000 seconds with status 5: /input/20210111_000105.pdf
ocrmypdf-auto    | TESTOCR_PROCESS_RESULT/input/20210111_000105.pdf/output/20210111_000105.pdf50.790000
ocrmypdf-auto    | 2021-01-21 14:27:55 - Processing: /input/20210115_000226.pdf -> /output/20210115_000226.pdf
ocrmypdf-auto    | 2021-01-21 14:27:55 - Processing complete in 0.770000 seconds with status 5: /input/20210109_000072.pdf
ocrmypdf-auto    | TESTOCR_PROCESS_RESULT/input/20210109_000072.pdf/output/20210109_000072.pdf50.770000
ocrmypdf-auto    | 2021-01-21 14:27:55 - Processing: /input/20210109_000056.pdf -> /output/20210109_000056.pdf
ocrmypdf-auto    | 2021-01-21 14:27:56 - Processing complete in 0.730000 seconds with status 5: /input/20210121_000243.pdf
ocrmypdf-auto    | TESTOCR_PROCESS_RESULT/input/20210121_000243.pdf/output/20210121_000243.pdf50.730000
ocrmypdf-auto    | 2021-01-21 14:27:56 - Processing: /input/20210119_000239.pdf -> /output/20210119_000239.pdf
ocrmypdf-auto    | 2021-01-21 14:27:56 - Processing complete in 0.770000 seconds with status 5: /input/20210115_000226.pdf
ocrmypdf-auto    | TESTOCR_PROCESS_RESULT/input/20210115_000226.pdf/output/20210115_000226.pdf50.770000
ocrmypdf-auto    | 2021-01-21 14:27:56 - Processing: /input/20210112_000215.pdf -> /output/20210112_000215.pdf
ocrmypdf-auto    | 2021-01-21 14:27:56 - Processing complete in 0.780000 seconds with status 5: /input/20210109_000056.pdf
ocrmypdf-auto    | TESTOCR_PROCESS_RESULT/input/20210109_000056.pdf/output/20210109_000056.pdf50.780000
ocrmypdf-auto    | 2021-01-21 14:27:56 - Processing: /input/20210112_000203.pdf -> /output/20210112_000203.pdf
ocrmypdf-auto    | 2021-01-21 14:27:57 - Processing complete in 0.880000 seconds with status 5: /input/20210119_000239.pdf
ocrmypdf-auto    | TESTOCR_PROCESS_RESULT/input/20210119_000239.pdf/output/20210119_000239.pdf50.880000
ocrmypdf-auto    | 2021-01-21 14:27:57 - Processing: /input/20210114_000218.pdf -> /output/20210114_000218.pdf
ocrmypdf-auto    | 2021-01-21 14:27:57 - Processing complete in 0.850000 seconds with status 5: /input/20210112_000203.pdf
ocrmypdf-auto    | TESTOCR_PROCESS_RESULT/input/20210112_000203.pdf/output/20210112_000203.pdf50.850000
ocrmypdf-auto    | 2021-01-21 14:27:57 - Processing: /input/20210112_000161.pdf -> /output/20210112_000161.pdf
ocrmypdf-auto    | 2021-01-21 14:27:57 - Processing complete in 0.870000 seconds with status 5: /input/20210112_000215.pdf
ocrmypdf-auto    | TESTOCR_PROCESS_RESULT/input/20210112_000215.pdf/output/20210112_000215.pdf50.870000
ocrmypdf-auto    | 2021-01-21 14:27:57 - Processing: /input/20210121_000245.pdf -> /output/20210121_000245.pdf
ocrmypdf-auto    | 2021-01-21 14:27:58 - Processing complete in 0.780000 seconds with status 5: /input/20210114_000218.pdf
ocrmypdf-auto    | TESTOCR_PROCESS_RESULT/input/20210114_000218.pdf/output/20210114_000218.pdf50.780000
ocrmypdf-auto    | 2021-01-21 14:27:58 - Processing complete in 0.780000 seconds with status 5: /input/20210112_000161.pdf
ocrmypdf-auto    | TESTOCR_PROCESS_RESULT/input/20210112_000161.pdf/output/20210112_000161.pdf50.780000
ocrmypdf-auto    | 2021-01-21 14:27:58 - Processing complete in 0.810000 seconds with status 5: /input/20210121_000245.pdf
ocrmypdf-auto    | TESTOCR_PROCESS_RESULT/input/20210121_000245.pdf/output/20210121_000245.pdf50.810000
^CGracefully stopping... (press Ctrl+C again to force)

No output file - Status 5

2021-03-20 21:17:26 - Watching /input,
2021-03-20 21:17:56 - Processing: /input/HP scan -0046.pdf -> /output/HP scan -0046.pdf,
2021-03-20 21:18:01 - Processing complete in 4.140000 seconds with status 5: /input/HP scan -0046.pdf,
TESTOCR_PROCESS_RESULT/input/HP scan -0047.pdf/output/HP scan -0047.pdf54.020000

I installed this as a docker container and when a pdf is loaded into the input folder, the log gives me this output. No pdf is converted and placed in the output folder.

What is status 5?
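
Assuming the status printed in the log is the exit code of the ocrmypdf process (the log itself does not say so), it can be looked up in OCRmyPDF's documented exit codes. The table below is reproduced from memory of the upstream ExitCode enum and should be checked against the ocrmypdf version pinned in the image; describe_status is a hypothetical helper.

```python
# Exit codes as documented by OCRmyPDF (verify against the installed version).
OCRMYPDF_EXIT_CODES = {
    0: "ok",
    1: "bad_args",
    2: "input_file",
    3: "missing_dependency",
    4: "invalid_output_pdf",
    5: "file_access_error",
    6: "already_done_ocr",
    7: "child_process_error",
    8: "encrypted_pdf",
    9: "invalid_config",
    10: "pdfa_conversion_failed",
    15: "other_error",
    130: "ctrl_c",
}


def describe_status(status):
    """Translate a numeric status into its symbolic name, if known."""
    return OCRMYPDF_EXIT_CODES.get(status, "unknown")
```

Under that assumption, status 5 is file_access_error, which points at permissions on the input or output folder rather than at the PDF itself.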

Feature: Implement auto-install of tesseract language packages

Expose an OCR_LANGUAGES variable that can be used to instruct the container (in docker-entrypoint.sh, probably?) to install additional tesseract language packages.

Open question: Also add a default -l <list> option to ocrmypdf configuration?

Input files don't get recognized

When I start the container and move files in the input folder, nothing happens. I don't even get any output in the logs.

Start command:

docker run \
  -v ./test-files/input/:/input \
  -v ./test-files/output/:/output \
  -v ./test-files/config/:/config \
  --env OCR_VERBOSITY=debug \
  --env OCR_LANGUAGES=deu \
  cmccambridge/ocrmypdf-auto

The config file ./test-files/config/ocr.config:
ocr.config

It also did not work with the example configuration.

The only thing in the logs beside the installation of the language pack is:

2022-05-18 15:58:45 [MainThread] - Watching /input
^C
2022-05-18 15:59:47 [MainThread] - Signal 2 (SIGINT) Received. Shutting down...
2022-05-18 15:59:47 [MainThread] - Shutting down filesystem watchdog...
2022-05-18 15:59:47 [MainThread] - Canceling all 0 in-flight tasks...
2022-05-18 15:59:47 [MainThread] - Shutting down threadpool...
2022-05-18 15:59:47 [MainThread] - Cleaning up filesystem watchdog...

I've tested the container on two systems:

Host OS: Ubuntu 21.04 // Windows 1121H1 (22000.675)

Docker Version: 20.10.14 (a224086) // 20.10.14 (a224086)

Docker Desktop Version: 4.8.1 (78998) // 4.8.2 (79419) with WSL2 support

Both have the same issue. I think, on the first system the problem occurred after updating docker, but I don't know from which to which version and if I had Docker Desktop installed already. Maybe the issue is also related to Docker Desktop.

Do you have any idea why the container is not working? Thanks in advance

working again??

Feature request: Add prefix/suffix to filename with timestamp and/or unique number before moving to archive folder

Add prefix/suffix to filename with human-readable timestamp (datetime.now().strftime("%Y-%m-%d %H:%M:%S")) and/or unique number before moving to archive folder. In my opinion, all files must stay in archive folder without overwriting.

My scanner (HP 426dn) can't generate unique filenames with a timestamp by itself. When it scans to an empty folder (/inbox), the filename is always myscan0001.pdf, so after ocrmypdf-auto processes this file and moves it to the archive folder, the previously scanned file is overwritten.
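
A rough sketch of what the requested naming could look like. The function unique_archive_path is hypothetical and not part of ocrmypdf-auto. It deviates from the format string above in one respect: it uses hyphens instead of colons and spaces, since colons are not valid in filenames on Windows and many SMB shares.

```python
from datetime import datetime
from pathlib import Path


def unique_archive_path(archive_dir, input_name, now=None):
    """Build an archive path with a timestamp suffix that never overwrites."""
    now = now or datetime.now()
    source = Path(input_name)
    stamp = now.strftime("%Y-%m-%d_%H-%M-%S")
    candidate = Path(archive_dir) / f"{source.stem}_{stamp}{source.suffix}"
    counter = 1
    # Two scans within the same second: append a running number.
    while candidate.exists():
        candidate = Path(archive_dir) / f"{source.stem}_{stamp}_{counter}{source.suffix}"
        counter += 1
    return candidate
```

For the scanner described here, myscan0001.pdf would be archived as myscan0001_<date>_<time>.pdf, so repeated scans with the same source name stay side by side.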

Feature: Experiment with building tesseract v4 on Alpine Linux

Not sure how much size savings could be realized by switching to Alpine, given how many other packages get pulled in to satisfy ocrmypdf dependencies and their dependencies. The biggest obstacle up front is that the only tesseract-ocr available for Alpine seems to be v3.05, which performs considerably worse than the not-yet-released v4 code.

Feature Request: Setting resize / compression levels

As far as I can see:

PDFs scanned in black/white or grayscale grow slightly in file size, by 5-10%.
My guess: this is just the additional text layer, which is fine.
PDFs scanned in color shrink considerably, sometimes by nearly 50%!
A 40-50% decrease is not possible without a significant loss of information during compression.

Could you provide an option via the unRAID template to set another parameter, such as
CompressionLevel=veryhigh, high, normal, low, verylow, none

Additionally, it would be nice to be able to set the output dpi.
For optimal OCR, a document should always be scanned at a minimum of 300dpi.
But there is no need to save the file at 300dpi after recognition. 200dpi is nearly always enough, and it decreases the file size without any visible loss of quality.

Thank you

OCR_NOTIFY_URL for MS Teams Webhook

I'd like someone to help me set my webhook URL. I've been reading
https://ocrmypdf.readthedocs.io/en/v9.7.1/api.html
without success.

Working configuration over portainer on a machine with physical resources.
Thank you in advance!

Docs: Document recommendations for optional `/archive` and `/ocrtemp` volumes

Removed from unRAID templates since they're not required but unRAID generates invalid docker commandlines without a specified mount for every volume in a template.

Should add documentation to the unRAID Integration setting to explain how to add these back in. Also consider adding support within the container to detect mounted vs unmounted /archive share, so that users don't inadvertently archive into the container...

Install on Raspberry Pi

Sorry -- noob question ahead.

First of all thank you for the fantastic project. Unfortunately, I was not able to install. I got to the part where I run the docker create command. However, it returns:

WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm/v7) and no specific platform was requested

Can anyone give me a primer on how to install ocrmypdf-auto on Raspbian? Any help would be much appreciated.

Not starting processing of files

I have the problem that I can't get any processing to happen. I set up the Container like in the Quick & Easy example. Everything seems to be fine and I get the Watching Folder notification. But if I copy a PDF in the input folder nothing happens. Any clue on why?

I tried different PDFs and checked if the volumes are mounted correctly. I did not use any custom config file.

Update to the latest version of `ocrmypdf`

Current image pins ocrmypdf at 6.2.0. Need to write more end-to-end regression tests and then upgrade to the latest (8.1.0 as of this issue).

no file in output

I installed it under OMV 4 Docker.
When I look at the log file, I can see that test.pdf was OCRed and moved to the output directory. But when I look in my output directory, there is no file.

Fails when using OCRmyPDF options

The ocr.config file indicates that common OCRmyPDF options are allowed. No matter what other options I add or remove, any time I use the "--redo-ocr" option it just hangs and does nothing. There is no CPU usage from the docker container and nothing happens. The file just stays in the "input" folder.

Any advice would be appreciated. Thanks!

Feature Request: Option to move non PDF to Output Folder

Hello,
I use your container for my paperless office, and it works great. The scanner stores the PDF in a folder (Input) on my server which the users don't have access to. Then ocrmypdf-auto does its work and saves the finished PDF on a shared drive. From time to time images (JPG) are also scanned. These scans are not processed by ocrmypdf-auto because the file extension is not *.pdf, this is correct. Would it be possible to include an option to simply move files that do not end with *.pdf to the Output folder?

Not processing files (Status 3)

Hello, I've set up my container with the following parameters

docker create \
  --name=ocrmypdf-auto \
  -v /srv/faa137b2-b75f-42c4-a777-497ac74524ef/altered/ocr/input:/input \
  -v /srv/faa137b2-b75f-42c4-a777-497ac74524ef/altered/ocr/output:/output \
  -v /srv/dev-disk-by-label-fsb/altered/appdata/ocrmypdf_auto:/config \
  -v /srv/faa137b2-b75f-42c4-a777-497ac74524ef/altered/ocr/temp:/ocrtemp \
  -v /srv/faa137b2-b75f-42c4-a777-497ac74524ef/altered/ocr/archive:/archive \
  -e OCR_LANGUAGES="nor chi-sim dan eng swe" \
  -e OCR_OUTPUT_MODE=MIRROR_TREE \
  -e OCR_PROCESS_EXISTING_ON_START=1 \
  -e OCR_ACTION_ON_SUCCESS=DELETE_INPUT_FILES \
  -e PUID=1000 \
  -e PGID=100 \
  -e UMASK_SET=000 \
  quay.io/cmccambridge/ocrmypdf-auto

In the portainer logs I get the following (error?)

2020-08-09 20:31:52 - Watching /input,
2020-08-09 20:31:52 - Processing: /input/Scan-037.pdf -> /output/Scan-037.pdf,
2020-08-09 20:31:52 - Processing: /input/._Scan-037.pdf -> /output/._Scan-037.pdf,
2020-08-09 20:31:55 - Processing complete in 3.350000 seconds with status 3: /input/._Scan-037.pdf,
TESTOCR_PROCESS_RESULT/input/._Scan-037.pdf/output/._Scan-037.pdf33.350000,
2020-08-09 20:31:55 - Processing complete in 3.350000 seconds with status 3: /input/Scan-037.pdf,
TESTOCR_PROCESS_RESULT/input/Scan-037.pdf/output/Scan-037.pdf33.350000

The input files stay in the input folder and are not deleted as I configured. The output folder is empty.

I've tried resetting folder permissions to no avail.

Feature request: Processing of non-PDF sources

I realize this may be thoroughly outside the intended scope of this project, but it would be wonderful if it would process not just PDF files, but a variety of image files (tiff and jpg come to mind). Perhaps by passing them directly to tesseract-ocr and outputting the results as text files?

Thanks for the fantastic Unraid docker container, and for your consideration!

Please replace self.input_path.move(self.archive_path) by copying and deleting

I tried mapping the archive folder to another location outside the container, but this results in:

2022-12-18 22:06:53 [ThreadPoolExecutor-0_2] - Error in OcrTask.process: Traceback (most recent call last):
  File "/usr/lib/python3.8/shutil.py", line 788, in move
     os.rename(src, real_dst)
     OSError: [Errno 18] Invalid cross-device link: '/input/standard/MFC-L8650cdw_003873.pdf' -> '/archive/standard/MFC-L8650cdw_003873.pdf'

Could this be turned into copy+unlink?

move() in shutil.py should detect the cross-filesystem move (see its docstring):

    If the destination is on our current filesystem, then rename() is used.
    Otherwise, src is copied to the destination and then removed. Symlinks are
    recreated under the new name if os.rename() fails because of cross
    filesystem renames.

but something seems to have gone wrong, although mount shows:

/dev/nvme0n1p4 on /archive type btrfs (rw,noatime,ssd,space_cache,subvolid=262,subvol=/opt)
/dev/nvme0n1p4 on /config type btrfs (rw,noatime,ssd,space_cache,subvolid=262,subvol=/opt)
/dev/nvme0n1p4 on /input type btrfs (rw,noatime,ssd,space_cache,subvolid=262,subvol=/opt)
/dev/nvme0n1p4 on /output type btrfs (rw,noatime,ssd,space_cache,subvolid=262,subvol=/opt)
/dev/nvme0n1p4 on /ocrtemp type btrfs (rw,noatime,ssd,space_cache,subvolid=262,subvol=/opt)
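
One possible explanation for the mount output above: on Linux, rename() fails with EXDEV (errno 18) across different mount points even when the same filesystem is mounted at both, which is the case for separate Docker bind mounts. The requested copy + unlink fallback could be sketched as follows. The function move_across_devices is hypothetical and is not the project's current code.

```python
import errno
import os
import shutil


def move_across_devices(src, dst):
    """Rename when possible; fall back to copy + unlink on EXDEV."""
    os.makedirs(os.path.dirname(dst) or ".", exist_ok=True)
    try:
        os.rename(src, dst)
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise
        # Different mount points: copy (with metadata) first and remove
        # the source only after the copy has succeeded.
        shutil.copy2(src, dst)
        os.unlink(src)
```

Copying before unlinking means a failure mid-copy leaves the input file untouched, at the cost of briefly holding two copies.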

"ocrtemp" folder not used

I have a separate drive for input, output, ocrtemp etc. However, I am having issues with the OS drive filling up during conversions.

Appears that "/var/lib/docker/overlay2/container ID/diff/tmp" is actually being used, folder looks like:

Docker command below:

docker run -d \
--name=ocrmypdf \
-v /media/data/ocrmypdf/files/input:/input \
-v /media/data/ocrmypdf/files/output:/output \
-v /media/data/ocrmypdf/files/archive:/archive \
-v /media/data/ocrmypdf/ocrtemp:/ocrtemp \
-v /media/data/ocrmypdf/config:/config \
-e OCR_PROCESS_EXISTING_ON_START=1 \
-e OCR_ACTION_ON_SUCCESS=DELETE_INPUT_FILES \
-e OCR_USE_POLLING_SCHEDULER=1 \
-e USERMAP_UID=1001 \
-e USERMAP_GID=1001 \
--restart unless-stopped \
cmccambridge/ocrmypdf-auto
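
One thing worth checking: Python's tempfile module (which Python programs such as ocrmypdf typically rely on for scratch space) picks its directory from the TMPDIR environment variable and otherwise falls back to /tmp. Whether the container sets TMPDIR to /ocrtemp for the ocrmypdf process is not known from this report. The sketch below, with the hypothetical helper child_tempdir, shows how a child process's temp directory follows TMPDIR.

```python
import os
import subprocess
import sys


def child_tempdir(tmpdir):
    """Return the temp directory a Python child process picks when TMPDIR is set."""
    env = dict(os.environ, TMPDIR=tmpdir)
    result = subprocess.run(
        [sys.executable, "-c", "import tempfile; print(tempfile.gettempdir())"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

If the container does not set it, adding -e TMPDIR=/ocrtemp to the docker run command above may be enough to move the scratch files off the OS drive, assuming nothing inside the container overrides the variable.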

Evaluate Alpine Linux base image

Convert if possible, for size savings.
