Integration of OCR-D and Kitodo
Docker integration of Kitodo.Production and OCR-D
View full documentation here.
Maintainers
If you have any questions or encounter any problems, please do not hesitate to contact us.
License: MIT License
The kitodo-app container cannot be started by Docker because the startup.sh script lacks execute permission. After running chmod +x on the script, everything works fine.
Because Kitodo.Production can be configured dynamically, the image and OCR directories are not fixed.
Kitodo.Production
When I run make stack (prepare, build and start) as the root user, I run into the following problem when I execute script_ocr.sh via action:runscript in Kitodo.Production:
Sep 22 15:07:28 ocrd-manager for_production.sh: ocr_init initialize variables and directory structure
Sep 22 15:07:28 ocrd-manager for_production.sh: running with 3 26 /data/3 deu Fraktur true /data/3/ocr-workflow.sh CONTROLLER=ocrd-controller:22
Sep 22 15:07:28 ocrd-manager for_production.sh: using workflow '/data/3/ocr-workflow.sh':
Sep 22 15:07:28 ocrd-manager for_production.sh: "tesserocr-recognize -P segmentation_level region -P model frak2021 -I OCR-D-IMG -O OCR-D-OCR" "fileformat-transform -P from-to \"page alto\" -P script-args \"--no-check-border --dummy-word\" -I OCR-D-OCR -O FULLTEXT"
Sep 22 15:07:28 ocrd-manager for_production.sh: ocr_exit in async mode - immediate termination of the script
Sep 22 15:07:28 ocrd-manager for_production.sh: '/data/3/images' -> 'ocr-d//data/3'
Sep 22 15:07:28 ocrd-manager for_production.sh: '/data/3/images/FILE_0010_ORIGINAL.jpg' -> 'ocr-d//data/3/FILE_0010_ORIGINAL.jpg'
Sep 22 15:07:28 ocrd-manager for_production.sh: '/data/3/images/FILE_0011_ORIGINAL.jpg' -> 'ocr-d//data/3/FILE_0011_ORIGINAL.jpg'
Sep 22 15:07:28 ocrd-manager for_production.sh: '/data/3/images/FILE_0012_ORIGINAL.jpg' -> 'ocr-d//data/3/FILE_0012_ORIGINAL.jpg'
Sep 22 15:07:28 ocrd-manager for_production.sh: '/data/3/images/FILE_0013_ORIGINAL.jpg' -> 'ocr-d//data/3/FILE_0013_ORIGINAL.jpg'
Sep 22 15:07:28 ocrd-manager for_production.sh: '/data/3/images/FILE_0014_ORIGINAL.jpg' -> 'ocr-d//data/3/FILE_0014_ORIGINAL.jpg'
Sep 22 15:07:28 ocrd-manager for_production.sh: Permission denied, please try again.#015
Sep 22 15:07:28 ocrd-manager for_production.sh: Permission denied, please try again.#015
Sep 22 15:07:28 ocrd-manager for_production.sh: ocrd@ocrd-controller: Permission denied (publickey,password).#015
Sep 22 15:07:28 ocrd-manager for_production.sh: rsync: connection unexpectedly closed (0 bytes received so far) [sender]
Sep 22 15:07:28 ocrd-manager for_production.sh: rsync error: unexplained error (code 255) at io.c(235) [sender=3.1.2]
Sep 22 15:07:28 ocrd-manager for_production.sh: terminating with error $?=255 from rsync -av -e "ssh -p $CONTROLLERPORT -l ocrd" "$WORKDIR/" $CONTROLLERHOST:/data/$REMOTEDIR
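The log shows the rsync transfer failing because the SSH login as ocrd was rejected. A minimal sketch of a pre-flight check follows, assuming the CONTROLLER value has the host:port form seen in the log. The function names split_controller and check_controller_login are hypothetical and not part of for_production.sh.

```shell
#!/bin/sh
# Split a CONTROLLER value of the form host:port (as in "CONTROLLER=ocrd-controller:22").
split_controller() {
    controller=$1
    CONTROLLERHOST=${controller%:*}    # everything before the last colon
    CONTROLLERPORT=${controller##*:}   # everything after the last colon
}

# Hypothetical pre-flight check: attempt a key-only login before starting rsync.
# BatchMode=yes makes ssh fail immediately instead of prompting for a password.
check_controller_login() {
    split_controller "$1"
    ssh -o BatchMode=yes -p "$CONTROLLERPORT" -l ocrd "$CONTROLLERHOST" true
}

split_controller "ocrd-controller:22"
echo "host=$CONTROLLERHOST port=$CONTROLLERPORT"
```

Calling check_controller_login "$CONTROLLER" before the transfer would surface a key problem as a single ssh failure rather than as an rsync error with code 255.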
When running make prepare followed by make build, the build command fails because build-resources/kitodo.war is missing.
I had to download the Kitodo release files manually to that folder in the Kitodo submodule in order to launch all the services.
This behavior is not documented in the Readme, and there should probably be a build rule that fetches the necessary files from the Kitodo release pages if they aren't already there.
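Such a build rule could call a small helper like the one sketched here. This is only an illustration: the function name ensure_resource and the variable KITODO_WAR_URL are assumptions, and the actual release URL is not given in this report.

```shell
#!/bin/sh
# Download a build resource only if it is not already present.
ensure_resource() {
    dest=$1
    url=$2
    if [ -f "$dest" ]; then
        echo "already present: $dest"
        return 0
    fi
    mkdir -p "$(dirname "$dest")"
    # -f: fail on HTTP errors, -sS: quiet but show errors, -L: follow redirects
    curl -fsSL -o "$dest" "$url"
}

# Example call (commented out because KITODO_WAR_URL is a placeholder):
# ensure_resource build-resources/kitodo.war "$KITODO_WAR_URL"
```

Because the helper returns early when the file exists, it can run on every make build without repeating the download.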
Instead of testing for the existence of the ALTO file, check ActiveMQ to see whether the step close message was sent.
We still need some way to simulate
We should add a make test target which delegates to the Manager's eponymous rule (but exposes the correct DATA and NETWORK variables from compose).
Currently, OCR workflows must be installed into Production in advance by placing the ocrd process script files into ./kitodo/data/ocr_workflows with a .sh suffix, and then configuring them in the project's settings (thereby tying them to new processes).
But what if a user wants to use a different OCR workflow for some processes in a project, or change the workflow for existing processes (because they did not work / run through or the results do not look good)?
For now, one would need to edit the file ocr-workflow.sh in the process directory and re-trigger the OCR script. (OCR processing itself is already incremental, so the workflow will then continue to build whatever is still necessary or out of date.) But that is tedious and requires access to the file system (Manager share).
The user experience could be much better if we made workflows configurable on the web pages of the Monitor. Crucially, we should allow editing and re-running OCR workflows:
kitodo/data/ocr_workflows to be shared by Production, Manager and Monitor
So if a task cannot be finished because the OCR workflow failed (which in the future could also mean that it did not meet the configured quality threshold), then one will manually trigger said re-processing.
We could even provide a null workflow that will always fail and therefore force you to choose your custom workflow dynamically (per-process).
Saved workflows could also be version-controlled. The workflows should have a free-form description, but their file name should be a hash of their (non-comment, non-whitespace) content.
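The content-hash naming idea could be sketched as follows. This is an assumption-laden illustration: workflow_hash is a hypothetical name, SHA-256 via sha256sum (GNU coreutils) is an arbitrary choice, and only whole-line comments are treated as comments.

```shell
#!/bin/sh
# Print a content hash for a workflow file, ignoring comment lines and all whitespace.
# Assumption: a comment is a line whose first non-blank character is '#';
# trailing comments on a command line are NOT stripped.
workflow_hash() {
    grep -v '^[[:space:]]*#' "$1" | tr -d '[:space:]' | sha256sum | cut -d ' ' -f 1
}

# Example call (path is illustrative):
# workflow_hash kitodo/data/ocr_workflows/my-workflow.sh
```

Two files that differ only in comments, blank lines or indentation then map to the same name, so re-saving a reformatted workflow would not create a new version.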
Also, the Manager should collect statistics about all workflows (which ones ran how often and with what success or quality level), so the Monitor can show them.
When launching the meta-repository with make start, the controller does not expose the port set in CONTROLLER_PORT_SSH to the host, making it unusable over the network.
There is a docker-compose.override.yml in _modules/ocrd_controller that exposes this port, but it is not being loaded, since the ocrd-controller service in our base docker-compose.yml explicitly extends _modules/ocrd_controller/docker-compose.yml.
It seems there is still an SSH key issue with ocrd-manager and ocrd-controller.
When executing the script step from Kitodo, the ocrd-manager logs show the following:
May 9 06:38:06 ocrd-manager for_production.sh: running with 3 26 /data/3 deu Fraktur ocr.sh CONTROLLER=ocrd-controller:22
May 9 06:38:06 ocrd-manager for_production.sh: '/data/3/images' -> 'ocr-d//data/3'
May 9 06:38:06 ocrd-manager for_production.sh: '/data/3/images/FILE_0010_ORIGINAL.jpg' -> 'ocr-d//data/3/FILE_0010_ORIGINAL.jpg'
May 9 06:38:06 ocrd-manager for_production.sh: '/data/3/images/FILE_0011_ORIGINAL.jpg' -> 'ocr-d//data/3/FILE_0011_ORIGINAL.jpg'
May 9 06:38:06 ocrd-manager for_production.sh: '/data/3/images/FILE_0012_ORIGINAL.jpg' -> 'ocr-d//data/3/FILE_0012_ORIGINAL.jpg'
May 9 06:38:06 ocrd-manager for_production.sh: '/data/3/images/FILE_0013_ORIGINAL.jpg' -> 'ocr-d//data/3/FILE_0013_ORIGINAL.jpg'
May 9 06:38:06 ocrd-manager for_production.sh: '/data/3/images/FILE_0014_ORIGINAL.jpg' -> 'ocr-d//data/3/FILE_0014_ORIGINAL.jpg'
May 9 06:38:06 ocrd-manager for_production.sh: async mode - exit and signal end of processing using active mq client
May 9 06:38:06 ocrd-manager for_production.sh: Warning: Permanently added the ECDSA host key for IP address '172.19.0.5' to the list of known hosts.#015
May 9 06:38:06 ocrd-manager for_production.sh: Permission denied, please try again.#015
May 9 06:38:06 ocrd-manager for_production.sh: Permission denied, please try again.#015
May 9 06:38:06 ocrd-manager for_production.sh: ocrd@ocrd-controller: Permission denied (publickey,password).#015
Currently, the script permissions after unzipping are 444, but they should be 544.
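A minimal sketch of a post-unzip fix, assuming the affected scripts can be recognized by a .sh suffix. The function name fix_script_modes and the example directory are hypothetical.

```shell
#!/bin/sh
# Set mode 544 (r-xr--r--) on every *.sh file below the given directory.
fix_script_modes() {
    find "$1" -type f -name '*.sh' -exec chmod 544 {} +
}

# Example call after extracting the archive (directory name is illustrative):
# fix_script_modes ./kitodo/data/scripts
```

Files without the .sh suffix keep their original mode, so the change stays limited to the scripts.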
disown -a ?)
docker-compose -f docker-compose.yml -f docker-compose-kitodo.yml -f ../ocrd_controller/docker-compose-controller.yml, or native controller docker-compose -f docker-compose.yml -f docker-compose-kitodo.yml -f docker-compose.override.yml – configuration mechanisms...
ocrd-import / workspace init as part of for_production.sh, not of the workflow itself
ocrd process syntax instead of bare shell script (entails previous)
ocrd validate tasks before running
--return --cleanup --transfer etc
ocrd user)
ocrd workflow server + ocrd workflow client process instead of ocrd process), at least for preconfigured workflows
ocrd-PROCESSOR --server HOST PORT WORKERS + ocrd workflow client process ...)
I am starting a project from scratch (make clean, make prepare, make build, make start) as a normal user who is a member of the docker group.
The following error occurs in the OCR-D Manager when running the OCR script in Kitodo.Production:
Sep 1 10:48:24 ocrd-manager for_production.sh: insufficient permissions on /data volume
The permissions and user/group assignments in the "kitodo" directory are as follows. Only the "metadata" directory is mounted for the OCR-D Manager:
dr-xrwxrwx 2 root root 4096 Sep 1 10:29 config
drwxrwxrwx 2 root root 4096 Sep 1 10:29 debug
drwxrwxrwx 2 root root 4096 Sep 1 10:29 diagrams
dr-xrwxrwx 2 root root 4096 Sep 1 10:29 import
drwxrwxrwx 2 root root 4096 Sep 1 12:36 logs
dr-xrwxrwx 2 root root 4096 Sep 1 10:29 messages
drwxr-xr-x 3 weigelt domänen-benutzer 4096 Apr 5 10:14 metadata
drwxrwxrwx 2 root root 4096 Sep 1 10:29 modules
drwxr-xr-x 2 weigelt domänen-benutzer 4096 Jun 30 14:01 ocr_workflows
dr-xrwxrwx 7 root root 4096 Sep 1 10:29 plugins
drwxr-xr-x 2 weigelt domänen-benutzer 4096 Jun 17 12:16 rulesets
drwxr-xr-x 2 weigelt domänen-benutzer 4096 Jun 15 11:22 scripts
drwxrwxrwx 2 root root 4096 Sep 1 10:29 swap
drwxrwxrwx 2 root root 4096 Sep 1 10:29 temp
drwxrwxrwx 2 root root 4096 Sep 1 10:29 users
drwxr-xr-x 2 weigelt domänen-benutzer 4096 Jul 1 15:37 xslt
I think unzipping the resource data.zip explains the ownership of the files and directories that do not belong to root. I could change them to root, but I wonder why the other files and directories are owned by root at all. With our UID and GID implementation, the files and directories should have the current user and group as owner.
In my case, the current user ID is "946828167" and the group ID is "946800513"; for root, the user ID is "0" and the group ID is "0".
When I go into the OCR-D Manager container, UID and GID have the default value "1001" from the .env file. So it seems the environment variables are not set correctly, and root is then probably the fallback.
When I print the environment variables of the Makefile, the following output is generated:
CONTROLLER_ENV_UID is 946828167
CONTROLLER_ENV_GID is 946800513
MANAGER_ENV_UID is 946828167
MANAGER_ENV_GID is 946800513
MODE is managed
COMPOSE_FILE is docker-compose.yml:docker-compose.kitodo-app.yml:docker-compose.managed.
docker inspect kitodo_production_ocrd-ocrd-manager-1 shows that the env variables are not replaced with the environment variables defined in the Makefile.
"Config": {
"Hostname": "ocrd-manager",
"Domainname": "",
"User": "",
"AttachStdin": false,
"AttachStdout": true,
"AttachStderr": true,
"ExposedPorts": {
"22": {},
"22/tcp": {}
},
"Tty": true,
"OpenStdin": false,
"StdinOnce": false,
"Env": [
"GID=1001",
"UMASK=0002",
"CONTROLLER=ocrd-controller:22",
"ACTIVEMQ=kitodo-mq:61616",
"UID=1001",
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"DEBIAN_FRONTEND=noninteractive",
"PYTHONIOENCODING=utf8",
"LC_ALL=C.UTF-8",
"LANG=C.UTF-8",
"HOME=/",
"ACTIVEMQ_CLIENT_LOG4J2=/opt/kitodo-activemq-client/log4j2.properties",
"ACTIVEMQ_CLIENT=/opt/kitodo-activemq-client/kitodo-activemq-client-0.2.jar",
"PREFIX=/usr",
"VIRTUAL_ENV=/usr"
],
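The manual comparison above (container UID/GID versus the invoking user's) could be automated with a sketch like this one. It assumes docker inspect prints each Env entry on its own line, as in the excerpt. The function names env_value and check_container_ids are hypothetical.

```shell
#!/bin/sh
# Read `docker inspect` output on stdin and print the value of one "Env" entry.
env_value() {
    sed -n "s/^[[:space:]]*\"$1=\([^\"]*\)\",\{0,1\}[[:space:]]*\$/\1/p" | head -n 1
}

# Compare the UID/GID seen by the container with those of the invoking user.
check_container_ids() {
    container=$1
    inspect=$(docker inspect "$container") || return 2
    c_uid=$(printf '%s\n' "$inspect" | env_value UID)
    c_gid=$(printf '%s\n' "$inspect" | env_value GID)
    if [ "$c_uid" = "$(id -u)" ] && [ "$c_gid" = "$(id -g)" ]; then
        echo "ok: container uses $c_uid:$c_gid"
    else
        echo "mismatch: container $c_uid:$c_gid, host $(id -u):$(id -g)"
        return 1
    fi
}

# Example call with the container name from this report:
# check_container_ids kitodo_production_ocrd-ocrd-manager-1
```

With the values shown above, the check would report a mismatch between 1001:1001 in the container and 946828167:946800513 on the host.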
In the Kitodo.Production repository, a detailed ticket for integrating automatic structuring was created.
I regularly get confused by the naming of the environment variables for SSH keys. For example, when I read MANAGER_KEY, I immediately assume that it's the key to access the manager when it's actually the manager's key to access the controller.
Similarly, there are MANAGER_KEYS and CONTROLLER_KEYS, which I wouldn't know how to use either if it weren't for the explanatory comments next to them in the example .env file.
Therefore, I propose the following new names:
MANAGER_KEY -> MANAGER__CONTROLLER_ACCESS_KEY
MANAGER_KEYS -> MANAGER__AUTHORIZED_KEYS
CONTROLLER_KEYS -> CONTROLLER__AUTHORIZED_KEYS
Note that I included double underscores after the name of the service that the respective variables belong to. I find that it especially communicates the purpose of MANAGER__CONTROLLER_ACCESS_KEY more clearly that way.
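If the rename were adopted, existing .env files could keep working for a transition period through a fallback like the one sketched here. The function name migrate_key_names is hypothetical, and nothing in the proposal itself requires such a fallback.

```shell
#!/bin/sh
# Prefer the proposed variable names; fall back to the old ones when only those are set.
migrate_key_names() {
    # ":=" assigns only if the new variable is unset or empty
    : "${MANAGER__CONTROLLER_ACCESS_KEY:=${MANAGER_KEY:-}}"
    : "${MANAGER__AUTHORIZED_KEYS:=${MANAGER_KEYS:-}}"
    : "${CONTROLLER__AUTHORIZED_KEYS:=${CONTROLLER_KEYS:-}}"
}

# Example: call once after the .env file has been loaded into the shell.
# migrate_key_names
```

A value already set under the new name always wins, so both naming schemes can coexist in one configuration.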
Extend the make prepare target to support standalone use of OCR-D Controller and Kitodo.Production.
After cloning and running make build, I tried to run make start, but got the following error:
ERROR: build path /root/kitodo_production_ocrd/_modules/kitodo-production-docker/docker-image either does not exist, is not accessible, or is not a valid URL.
make: *** [Makefile:38: start] Error 1
I believe this happens because the submodules are not pulled during the build step.
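One way the build step could guard against this is sketched here, assuming the submodules live in directories passed as arguments. The function names missing_modules and ensure_submodules are hypothetical.

```shell
#!/bin/sh
# Print every given module directory that is missing or empty.
missing_modules() {
    for dir in "$@"; do
        if [ ! -d "$dir" ] || [ -z "$(ls -A "$dir" 2>/dev/null)" ]; then
            echo "$dir"
        fi
    done
}

# Fetch all submodules when at least one of the directories is missing or empty.
ensure_submodules() {
    if [ -n "$(missing_modules "$@")" ]; then
        git submodule update --init --recursive
    fi
}

# Example call with the path from the error message above:
# ensure_submodules _modules/kitodo-production-docker
```

An empty directory is treated like a missing one because a fresh clone without submodule initialization leaves exactly that behind.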
Sometimes, processing might fail due to temporary downtimes or bugs in tools which get fixed subsequently.
Regardless, it should be easy to re-run the same workflow on a workspace again. To that end, ocrd process offers --overwrite (as does ocrd workflow client process, and ocrd-make always uses it).
But what about badly written workflows or data imported from Presentation: is overwrite always the right thing to do?
What does tiffwriter.conf do? Why is this file generated?
Using: docker-compose -f docker-compose.yml -f ./_modules/ocrd_controller/docker-compose.yml up --build -d
Results in the following error:
Step 18/29 : COPY ${BUILD_RESOURCES}/kitodo.war /tmp/kitodo/kitodo.war
COPY failed: file not found in build context or excluded by .dockerignore: stat build-resources/kitodo.war: file does not exist
ERROR: Service 'kitodo-app' failed to build : Build failed
The Makefile has also not been updated to the new project structure. It still refers to the _tmp folder as well as the no longer existing docker-compose-controller.yml.
In .gitmodules, _modules/ocrd_manager and _modules/ocrd_controller have SSH URLs. This causes permission denied errors when trying to update them if no SSH key is registered for GitHub.
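A sketch of one possible fix is to rewrite the scp-style SSH URLs to HTTPS. The function name ssh_to_https is hypothetical, and the owner/repository in the example is a placeholder rather than the real URL.

```shell
#!/bin/sh
# Rewrite scp-style GitHub SSH URLs (git@github.com:owner/repo) to HTTPS on stdin.
ssh_to_https() {
    sed 's#git@github\.com:#https://github.com/#'
}

# Example with a placeholder owner/repository:
echo 'url = git@github.com:example-owner/example-repo.git' | ssh_to_https
```

Users who cannot change .gitmodules can get the same effect locally with git config url."https://github.com/".insteadOf "git@github.com:".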
(replacing MODE with COMPOSE_PROFILES etc.)
make build failed due to a missing slash at the end of ./ocrd/manager/.ssh on line 28 of Makefile
-i ~/.ssh/id_rsa option for the ssh command in for_production.sh
Directory '' already is a workspace (probably intended behaviour).
The path to Kitodo Production is currently hardcoded to C:/Users/weigelt/Work/kitodo/kitodo-production/Kitodo/target/kitodo-3.4.1-SNAPSHOT.war
The ocrd-manager and ocrd-monitor services depend on the ocrd-controller service.
Therefore, I cannot launch the docker compose configuration without activating the controller.
We should adapt our documentation (currently the Readme) and configuration (i.e., the Makefile) to the common use case of running a native (non-Dockerized) Kitodo.Production, or combining a Docker instance with an existing database (of settings, users, projects, processes, index).
Separation of concerns is currently a bit messy between the submodules and the main repository. The submodules each provide a single service, but at the same time contain configuration and scripts to make them work together. E.g., kitodo-production-docker contains SSH configuration in startup.sh in order to work with the OCR-D Manager, and for_production.sh in ocrd_manager contains the commands to trigger ocrd in ocrd_controller. And all of the above only works if one generates the SSH keys and configuration in the main repository.
This leads to a situation where one sometimes has to either change the main repository or a submodule in order to adjust the configuration on how the services work together.
Therefore I suggest the following approach:
The submodules kitodo-production-docker, ocrd_manager and ocrd_controller should only provide docker-compose files that can set up the services.
The main repository should contain all the details on how the services work together and provide docker-compose files that selectively override settings from the submodules as needed.
I've already started working on docker-compose files that override some settings for the communication, mainly to remove duplication. If we decide to go along with this, I can make a couple of PRs to the main repository and the submodules in the next couple of days.