
ocrd_kitodo's Introduction

ocrd_kitodo's People

Contributors

bertsky, markusweigelt, stweil, svenmarcus

ocrd_kitodo's Issues

Add parameters --img-subdir and --ocr-subdir in action script when calling process_images.sh

Because Kitodo.Production can be configured dynamically, the image and OCR directories are not fixed.
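A minimal sketch of how a script could accept both directories as options. The function name `parse_subdirs` and the default `ocr/alto` are assumptions for illustration; only the `images` default mirrors a directory name that appears in the Manager logs, and the real interface of process_images.sh may differ.

```shell
# Hypothetical helper: parse --img-subdir / --ocr-subdir, falling back to defaults.
# Sets IMG_SUBDIR and OCR_SUBDIR; returns 1 on an unknown option or a missing value.
parse_subdirs() {
  IMG_SUBDIR=images     # default mirrors the directory name seen in the logs
  OCR_SUBDIR=ocr/alto   # assumed default, not confirmed by the source
  while [ $# -gt 0 ]; do
    case $1 in
      --img-subdir) [ $# -ge 2 ] || return 1; IMG_SUBDIR=$2; shift 2 ;;
      --ocr-subdir) [ $# -ge 2 ] || return 1; OCR_SUBDIR=$2; shift 2 ;;
      --) shift; break ;;
      -*) echo "unknown option: $1" >&2; return 1 ;;
      *) break ;;
    esac
  done
}
```

Calling `parse_subdirs "$@"` at the top of a script would leave both variables set, so later path handling can use `$IMG_SUBDIR` and `$OCR_SUBDIR` instead of hard-coded names.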

Kitodo.Production

cannot login to Controller when running as host's root

When I run make stack (prepare, build and start) as the root user, I run into the following problem when I execute script_ocr.sh via action:runscript in Kitodo.Production:

today at 17:07:28Sep 22 15:07:28 ocrd-manager for_production.sh: ocr_init initialize variables and directory structure
today at 17:07:28Sep 22 15:07:28 ocrd-manager for_production.sh: running with 3 26 /data/3 deu Fraktur true /data/3/ocr-workflow.sh CONTROLLER=ocrd-controller:22
today at 17:07:28Sep 22 15:07:28 ocrd-manager for_production.sh: using workflow '/data/3/ocr-workflow.sh':
today at 17:07:28Sep 22 15:07:28 ocrd-manager for_production.sh: "tesserocr-recognize -P segmentation_level region -P model frak2021 -I OCR-D-IMG -O OCR-D-OCR" "fileformat-transform -P from-to \"page alto\" -P script-args \"--no-check-border --dummy-word\" -I OCR-D-OCR -O FULLTEXT" 
today at 17:07:28Sep 22 15:07:28 ocrd-manager for_production.sh: ocr_exit in async mode - immediate termination of the script
today at 17:07:28Sep 22 15:07:28 ocrd-manager for_production.sh: '/data/3/images' -> 'ocr-d//data/3'
today at 17:07:28Sep 22 15:07:28 ocrd-manager for_production.sh: '/data/3/images/FILE_0010_ORIGINAL.jpg' -> 'ocr-d//data/3/FILE_0010_ORIGINAL.jpg'
today at 17:07:28Sep 22 15:07:28 ocrd-manager for_production.sh: '/data/3/images/FILE_0011_ORIGINAL.jpg' -> 'ocr-d//data/3/FILE_0011_ORIGINAL.jpg'
today at 17:07:28Sep 22 15:07:28 ocrd-manager for_production.sh: '/data/3/images/FILE_0012_ORIGINAL.jpg' -> 'ocr-d//data/3/FILE_0012_ORIGINAL.jpg'
today at 17:07:28Sep 22 15:07:28 ocrd-manager for_production.sh: '/data/3/images/FILE_0013_ORIGINAL.jpg' -> 'ocr-d//data/3/FILE_0013_ORIGINAL.jpg'
today at 17:07:28Sep 22 15:07:28 ocrd-manager for_production.sh: '/data/3/images/FILE_0014_ORIGINAL.jpg' -> 'ocr-d//data/3/FILE_0014_ORIGINAL.jpg'
today at 17:07:28Sep 22 15:07:28 ocrd-manager for_production.sh: Permission denied, please try again.#015
today at 17:07:28Sep 22 15:07:28 ocrd-manager for_production.sh: Permission denied, please try again.#015
today at 17:07:28Sep 22 15:07:28 ocrd-manager for_production.sh: ocrd@ocrd-controller: Permission denied (publickey,password).#015
today at 17:07:28Sep 22 15:07:28 ocrd-manager for_production.sh: rsync: connection unexpectedly closed (0 bytes received so far) [sender]
today at 17:07:28Sep 22 15:07:28 ocrd-manager for_production.sh: rsync error: unexplained error (code 255) at io.c(235) [sender=3.1.2]
today at 17:07:28Sep 22 15:07:28 ocrd-manager for_production.sh: terminating with error $?=255 from rsync -av -e "ssh -p $CONTROLLERPORT -l ocrd" "$WORKDIR/" $CONTROLLERHOST:/data/$REMOTEDIR

make prepare + make build does not build Kitodo

When running make prepare followed by make build the build command fails because build-resources/kitodo.war is missing.
I had to download the Kitodo release files manually to that folder in the Kitodo submodule in order to launch all the services.
This behavior is not documented in the Readme and there should probably just be a build rule to grab the necessary files from the Kitodo release pages if they aren't already there.
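The suggested build rule could call a small helper like the following sketch, which only downloads when the file is absent. The function name, the `DOWNLOAD_CMD` hook and the `KITODO_WAR_URL` placeholder are assumptions; the actual release URL is not given here and must be filled in.

```shell
# Hypothetical helper: download $2 to $1 unless the file is already present.
# DOWNLOAD_CMD is an assumed hook so the download tool can be swapped out.
DOWNLOAD_CMD=${DOWNLOAD_CMD:-"curl -fsSL -o"}

fetch_if_missing() {
  dest=$1 url=$2
  if [ -s "$dest" ]; then
    echo "$dest already present, skipping download"
    return 0
  fi
  mkdir -p "$(dirname "$dest")"
  # $DOWNLOAD_CMD is intentionally unquoted so that it splits into command and flags
  $DOWNLOAD_CMD "$dest" "$url"
}

# Example call; KITODO_WAR_URL is a placeholder for the real release asset:
# fetch_if_missing build-resources/kitodo.war "$KITODO_WAR_URL"
```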

create script/makefile to init and test everything

We still need some way to simulate

  • creating the auth credentials for the SSH
  • configuring (customizing) the names/addresses of the server
  • starting up the services
  • downloading some test data and presenting them like Production would
  • checking/plausibilizing the result

add regression test

We should add a make test target which delegates to the Manager's eponymic rule (but exposes the correct DATA and NETWORK variables from compose).

configure and edit workflows flexibly via Monitor

Currently, OCR workflows must be installed into Production in advance by placing the ocrd process script files into ./kitodo/data/ocr_workflows with a .sh suffix, and then configuring them in the project settings (thereby tying them to new processes).

But what if a user wants to use a different OCR workflow for some processes in a project, or change the workflow for existing processes (because they did not work / run through or the results do not look good)?

For now, one would need to edit the file ocr-workflow.sh in the process directory and re-trigger the OCR script. (OCR processing itself is already incremental, so the workflow will then continue to build whatever is still missing or out of date.) But that is tedious and requires access to the file system (Manager share).

The user experience could be much better if we made workflows configurable on the web pages of the Monitor. Crucially, we should allow editing and re-running OCR workflows:

  1. create a volume for kitodo/data/ocr_workflows to be shared by Production, Manager and Monitor
  2. add an endpoint (and reference it on the index page) for listing existing workflows
  3. make workflows editable (in a simple text form field, perhaps with syntax highlighting), create a new version when saving
  4. in the workspace view, make workspaces multi-selectable and add an action button for (re-)processing with a selectable workflow
  5. in the job view, add an action button for re-processing with a selectable workflow

So if a task cannot be finished, because the OCR workflow failed (which in the future could also mean that it did not meet the configured quality threshold), then one will manually trigger said re-processing.

We could even provide a null workflow that will always fail and therefore force you to choose your custom workflow dynamically (per-process).

Saved workflows could also be version-controlled. The workflows should have a free-form description, but their file name should be a hash of their (non-comment, non-whitespace) content.
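The content-based file name could be derived roughly as follows. This is only a sketch: the function name `workflow_hash` and the choice of SHA-256 are assumptions, and only whole-line `#` comments are ignored (trailing comments would still count).

```shell
# Hypothetical helper: hash a workflow file, ignoring whole-line comments and all whitespace.
workflow_hash() {
  # drop lines whose first non-blank character is '#', then strip every whitespace character
  grep -v '^[[:space:]]*#' "$1" | tr -d '[:space:]' | sha256sum | cut -d' ' -f1
}
```

Because all whitespace is removed, including whitespace inside quoted parameter values, two workflows that differ only in such a value's spacing would get the same name; whether that is acceptable is a design decision.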

Also, the Manager should collect statistics about all workflows (which ones ran how often and with what success or quality level), so the Monitor can show them.

Controller does not expose SSH Port

When launching the meta-repository with make start the controller does not expose the port set in CONTROLLER_PORT_SSH to the host, making it unusable over the network.
There is a docker-compose.override.yml in _modules/ocrd_controller that exposes this port, but it is not being loaded, since the ocrd-controller service in our base docker-compose.yml explicitly extends _modules/ocrd_controller/docker-compose.yml.
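One possible workaround is to list the override file explicitly, since Docker Compose merges every file named in `COMPOSE_FILE`. The sketch below assembles that list; the helper name `compose_file_list` is an assumption, and it has not been verified that the override's settings merge cleanly with this repository's base file (relative paths in later files are resolved against the first file's directory).

```shell
# Hypothetical helper: join all existing files from the argument list with ':'
# (the separator Docker Compose expects in COMPOSE_FILE on Linux and macOS).
compose_file_list() {
  list=
  for f in "$@"; do
    [ -f "$f" ] || continue          # silently skip files that do not exist
    list=${list:+$list:}$f
  done
  printf '%s\n' "$list"
}

# Example: load the Controller's override explicitly so its port mapping is applied.
# COMPOSE_FILE=$(compose_file_list docker-compose.yml \
#     _modules/ocrd_controller/docker-compose.override.yml)
# export COMPOSE_FILE
```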

SSH communication between ocrd-manager and ocrd-controller fails

It seems there is still an SSH key issue with ocrd-manager and ocrd-controller.
When executing the script step from Kitodo, the ocrd-manager logs show the following:

May  9 06:38:06 ocrd-manager for_production.sh: running with 3 26 /data/3 deu Fraktur ocr.sh CONTROLLER=ocrd-controller:22
May  9 06:38:06 ocrd-manager for_production.sh: '/data/3/images' -> 'ocr-d//data/3'
May  9 06:38:06 ocrd-manager for_production.sh: '/data/3/images/FILE_0010_ORIGINAL.jpg' -> 'ocr-d//data/3/FILE_0010_ORIGINAL.jpg'
May  9 06:38:06 ocrd-manager for_production.sh: '/data/3/images/FILE_0011_ORIGINAL.jpg' -> 'ocr-d//data/3/FILE_0011_ORIGINAL.jpg'
May  9 06:38:06 ocrd-manager for_production.sh: '/data/3/images/FILE_0012_ORIGINAL.jpg' -> 'ocr-d//data/3/FILE_0012_ORIGINAL.jpg'
May  9 06:38:06 ocrd-manager for_production.sh: '/data/3/images/FILE_0013_ORIGINAL.jpg' -> 'ocr-d//data/3/FILE_0013_ORIGINAL.jpg'
May  9 06:38:06 ocrd-manager for_production.sh: '/data/3/images/FILE_0014_ORIGINAL.jpg' -> 'ocr-d//data/3/FILE_0014_ORIGINAL.jpg'
May  9 06:38:06 ocrd-manager for_production.sh: async mode - exit and signal end of processing using active mq client
May  9 06:38:06 ocrd-manager for_production.sh: Warning: Permanently added the ECDSA host key for IP address '172.19.0.5' to the list of known hosts.#015
May  9 06:38:06 ocrd-manager for_production.sh: Permission denied, please try again.#015
May  9 06:38:06 ocrd-manager for_production.sh: Permission denied, please try again.#015
May  9 06:38:06 ocrd-manager for_production.sh: ocrd@ocrd-controller: Permission denied (publickey,password).#015

Provide new release

  • adapt Readme
  • reset Docker image tags to latest in .env
  • test again thoroughly (in all host scenarios)
    • MANAGED
    • STANDALONE
  • git-tag and Docker-tag+push the module images to Dockerhub as stable
  • set Docker image tags to above stable versions in .env, git-commit to (new) stable branch, git-tag and make GH release
  • continue working on main with latest in .env (esp. by explaining how to use the stable releases)

Next steps

  • make asynchronous option work (disown -a?)
  • separate docker-compose files for Kitodo/Manager vs. Controller, e.g. docker-compose -f docker-compose.yml -f docker-compose-kitodo.yml -f ../ocrd_controller/docker-compose-controller.yml, or native controller docker-compose -f docker-compose.yml -f docker-compose-kitodo.yml -f docker-compose.override.yml – configuration mechanisms...
  • generalize error handling for script tasks in Production: differentiate between errors and asynchronous tasks (and thus allow actual error handling)
    or alternatively, keep errors out of Production (task stays open until successful), and offer a GUI in Manager (for history, processing status, error monitoring and workflow modification+restart)
  • refactor for_production.sh (functions, file includes) to make re-usable for different scenarios (for_presentation, ...)
  • Secure active mq call slub/ocrd_manager#14
  • ocrd-import / workspace init as part of for_production.sh, not of the workflow itself
  • workflow script: ocrd process syntax instead of bare shell script (entails previous)
  • workflow script: ocrd validate tasks before running
  • versioning and labelling of Docker images (manager → core and controller → ocrd_all)
  • decoupling Production process share and Manager data volume (explicit copy to and fro)
  • decoupling Manager volume and Controller volume – more options?
  • decoupling Kitodo+Manager services (Docker compose file) and Controller services (Docker compose file)
  • smarter and more efficient processing in Controller
    • to prevent oversubscription, use semaphores (via .profile for ocrd user)
    • or delegate to Nextflow as workflow engine
    • or delegate to workflow server (i.e. ocrd workflow server + ocrd workflow client process instead of ocrd process), at least for preconfigured workflows
    • and/or delegate to processing server (i.e. ocrd-PROCESSOR --server HOST PORT WORKERS + ocrd workflow client process...)
  • more flexible workflows:
    • advertise existing (installed) workflows in Production ("drop-down list")
    • inherit workflows from preconfigured (installed or edited) templates, then copy to process directory
    • edit workflows in Production (simple text editor, perhaps syntax highlighting)
  • parallelization and job pipeline
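The semaphore idea from the list above could be approximated with lock files, as in this sketch. It assumes the `flock` utility is available in the Controller image; the function name `run_with_slot`, the lock directory and the reserved exit code 200 are illustrative choices, not part of the existing setup.

```shell
# Hypothetical helper: run a command while holding one of N lock files ("slots").
# Usage: run_with_slot N LOCKDIR COMMAND [ARGS...]
run_with_slot() {
  slots=$1 lockdir=$2
  shift 2
  mkdir -p "$lockdir"
  while :; do
    i=0
    while [ "$i" -lt "$slots" ]; do
      status=0
      # open the slot file on fd 9; flock -n fails at once if another process holds it,
      # in which case the subshell exits with the reserved code 200
      ( flock -n 9 || exit 200; "$@" ) 9>"$lockdir/slot.$i" || status=$?
      # any status other than 200 means the command ran in this slot
      [ "$status" -eq 200 ] || return "$status"
      i=$((i + 1))
    done
    sleep 1   # all slots busy, retry shortly
  done
}
```

With this approach at most N commands run at the same time per lock directory. A command that itself exits with 200 would be mistaken for a busy slot, so a real implementation would need a more robust signal.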

Insufficient permissions on /data volume

I am starting a project from scratch (make clean, make prepare, make build, make start) as a normal user who is a member of the docker group.

The following error occurs in the OCR-D Manager when running the OCR script in Kitodo.Production:

today at 12:48:25Sep 1 10:48:24 ocrd-manager for_production.sh: insufficient permissions on /data volume

The permissions and the user and group assignments in the "kitodo" directory are as follows. Only the "metadata" directory is mounted for the OCR-D Manager:

dr-xrwxrwx  2 root    root             4096 Sep  1 10:29 config
drwxrwxrwx  2 root    root             4096 Sep  1 10:29 debug
drwxrwxrwx  2 root    root             4096 Sep  1 10:29 diagrams
dr-xrwxrwx  2 root    root             4096 Sep  1 10:29 import
drwxrwxrwx  2 root    root             4096 Sep  1 12:36 logs
dr-xrwxrwx  2 root    root             4096 Sep  1 10:29 messages
drwxr-xr-x  3 weigelt domänen-benutzer 4096 Apr  5 10:14 metadata
drwxrwxrwx  2 root    root             4096 Sep  1 10:29 modules
drwxr-xr-x  2 weigelt domänen-benutzer 4096 Jun 30 14:01 ocr_workflows
dr-xrwxrwx  7 root    root             4096 Sep  1 10:29 plugins
drwxr-xr-x  2 weigelt domänen-benutzer 4096 Jun 17 12:16 rulesets
drwxr-xr-x  2 weigelt domänen-benutzer 4096 Jun 15 11:22 scripts
drwxrwxrwx  2 root    root             4096 Sep  1 10:29 swap
drwxrwxrwx  2 root    root             4096 Sep  1 10:29 temp
drwxrwxrwx  2 root    root             4096 Sep  1 10:29 users
drwxr-xr-x  2 weigelt domänen-benutzer 4096 Jul  1 15:37 xslt

I think unzipping the resource data.zip explains the ownership of the files and directories that do not belong to root. I could change them to root, but I wonder why the other files and directories are owned by root at all, because with our UID and GID implementation they should have the current user and group as owner.

In my case current user id is "946828167" and group id "946800513" and for root user id is "0" group id "0".

Inside the OCR-D Manager container, UID and GID have the default value "1001" from the .env file. So it seems the environment variables are not set correctly, and root is then probably the fallback.

When I print the environment variables in the Makefile, the following output is generated:

CONTROLLER_ENV_UID is 946828167
CONTROLLER_ENV_GID is 946800513
MANAGER_ENV_UID is 946828167
MANAGER_ENV_GID is 946800513
MODE is managed
COMPOSE_FILE is docker-compose.yml:docker-compose.kitodo-app.yml:docker-compose.managed.

docker inspect kitodo_production_ocrd-ocrd-manager-1 shows that the container's environment variables are not overridden by the values defined in the Makefile:

        "Config": {
            "Hostname": "ocrd-manager",
            "Domainname": "",
            "User": "",
            "AttachStdin": false,
            "AttachStdout": true,
            "AttachStderr": true,
            "ExposedPorts": {
                "22": {},
                "22/tcp": {}
            },
            "Tty": true,
            "OpenStdin": false,
            "StdinOnce": false,
            "Env": [
                "GID=1001",
                "UMASK=0002",
                "CONTROLLER=ocrd-controller:22",
                "ACTIVEMQ=kitodo-mq:61616",
                "UID=1001",
                "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
                "DEBIAN_FRONTEND=noninteractive",
                "PYTHONIOENCODING=utf8",
                "LC_ALL=C.UTF-8",
                "LANG=C.UTF-8",
                "HOME=/",
                "ACTIVEMQ_CLIENT_LOG4J2=/opt/kitodo-activemq-client/log4j2.properties",
                "ACTIVEMQ_CLIENT=/opt/kitodo-activemq-client/kitodo-activemq-client-0.2.jar",
                "PREFIX=/usr",
                "VIRTUAL_ENV=/usr"
            ],
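To compare the container's values with the host user more directly, the environment list can be printed one entry per line and filtered. The helper `env_value` below is a hypothetical name; the `--format` template uses Docker's standard Go template support.

```shell
# Hypothetical helper: print the value of variable $1 from NAME=VALUE lines on stdin.
env_value() {
  name=$1
  while IFS= read -r line; do
    case $line in
      "$name="*) printf '%s\n' "${line#"$name="}"; return 0 ;;
    esac
  done
  return 1   # variable not present
}

# Example (container name taken from the report above):
# docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' \
#     kitodo_production_ocrd-ocrd-manager-1 | env_value UID
# The result can then be compared with the output of `id -u` on the host.
```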

Improve naming of SSH key variable names

I regularly get confused by the naming of the environment variables for SSH keys. For example, when I read MANAGER_KEY, I immediately assume that it's the key to access the manager when it's actually the manager's key to access the controller.
Similarly, there are MANAGER_KEYS and CONTROLLER_KEYS which I wouldn't know how to use either if it weren't for the explanatory comments next to them in the example .env file.

Therefore, I propose the following new names:
MANAGER_KEY -> MANAGER__CONTROLLER_ACCESS_KEY
MANAGER_KEYS -> MANAGER__AUTHORIZED_KEYS
CONTROLLER_KEYS -> CONTROLLER__AUTHORIZED_KEYS

Note that I included double underscores after the name of the service that the respective variables belong to. I find that it especially communicates the purpose of MANAGER__CONTROLLER_ACCESS_KEY more clearly that way.
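If the rename were adopted, existing .env files could keep working through a small compatibility shim such as the sketch below. The function name `apply_key_aliases` is an assumption; the variable names are the ones proposed above.

```shell
# Hypothetical migration shim: prefer the proposed names, fall back to the old ones.
apply_key_aliases() {
  # ':=' assigns only when the new variable is unset or empty
  : "${MANAGER__CONTROLLER_ACCESS_KEY:=${MANAGER_KEY:-}}"
  : "${MANAGER__AUTHORIZED_KEYS:=${MANAGER_KEYS:-}}"
  : "${CONTROLLER__AUTHORIZED_KEYS:=${CONTROLLER_KEYS:-}}"
}
```

Calling `apply_key_aliases` after loading the environment lets scripts read only the new names while old configurations are still honored.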

Make start fails

After cloning and running make build I tried to run make start, but got the following error

ERROR: build path /root/kitodo_production_ocrd/_modules/kitodo-production-docker/docker-image either does not exist, is not accessible, or is not a valid URL.
make: *** [Makefile:38: start] Error 1

I believe this happens because the submodules are not pulled during the build step.
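A guard along these lines could detect the situation before building. This is a sketch: the helper name `dir_is_empty` is an assumption, and the module path is taken from the error message above.

```shell
# Hypothetical helper: succeed if directory $1 is missing or contains no entries.
dir_is_empty() {
  [ ! -d "$1" ] || [ -z "$(ls -A "$1")" ]
}

# Example: initialize submodules before building if the Kitodo module is not checked out.
# if dir_is_empty _modules/kitodo-production-docker; then
#   git submodule update --init --recursive
# fi
```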

allow re-runs

Sometimes processing fails because of temporary downtime or because of bugs in tools that are fixed later.

Regardless, it should be easy to re-run the same workflow on a workspace again. To that end, ocrd process offers --overwrite (as does ocrd workflow client process, and ocrd-make always uses it).

But what about badly written workflows or data imported from presentation – is overwrite always the right thing to do?

Build instructions no longer working

Using: docker-compose -f docker-compose.yml -f ./_modules/ocrd_controller/docker-compose.yml up --build -d
Results in the following error:

Step 18/29 : COPY ${BUILD_RESOURCES}/kitodo.war /tmp/kitodo/kitodo.war
COPY failed: file not found in build context or excluded by .dockerignore: stat build-resources/kitodo.war: file does not exist
ERROR: Service 'kitodo-app' failed to build : Build failed

The Makefile has also not been updated to the new project structure. It still refers to the _tmp folder as well as the no longer existing docker-compose-controller.yml

Summary of issues experienced when deploying to TU BS OCR-D Server

  • make build failed due to a missing slash at the end of ./ocrd/manager/.ssh on line 28 of Makefile
  • ocrd-manager couldn't connect to ocrd-controller due to missing host key of controller
  • after adding the host manually, ssh authentication failed from ocrd-manager to controller.
    • can be fixed by providing the -i ~/.ssh/id_rsa option for the ssh command in for_production.sh
  • after running the ocrd workflow once, it fails with Directory '' already is a workspace (probably intended behaviour).
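The `-i` fix from the list above could be folded into the remote-shell string passed to rsync, as in this sketch. The helper name `remote_shell` is an assumption; the `ocrd` login and the HOST:PORT form of `CONTROLLER` follow what the Manager logs in other reports show.

```shell
# Hypothetical helper: build the remote-shell string for rsync from HOST:PORT and a key file.
remote_shell() {
  controller=$1 keyfile=$2
  port=${controller##*:}                      # text after the last ':'
  [ "$port" != "$controller" ] || port=22     # no ':' given, fall back to the SSH default
  printf 'ssh -p %s -l ocrd -i %s\n' "$port" "$keyfile"
}

# Example, following the rsync invocation that appears in the Manager logs:
# rsync -av -e "$(remote_shell "$CONTROLLER" ~/.ssh/id_rsa)" "$WORKDIR/" "$CONTROLLERHOST:/data/$REMOTEDIR"
```

rsync splits the `-e` string on whitespace, so this only works for key paths without spaces.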

Cannot launch without controller

The ocrd-manager and ocrd-monitor services depend on the ocrd-controller service.
Therefore, I cannot launch the docker compose configuration without activating the controller.

docs: how to run with existing Kitodo deployment

We should adapt our documentation (ATM: Readme) and configuration (i.e. Makefile) to the common use case of running a native (non-Dockerized) Kitodo.Production, or of combining a Docker instance with an existing database (of settings, users, projects, processes, index).

  • What steps are necessary to replace our example database with an existing one?
  • How to configure / use our docker-compose files to use an external / standalone Kitodo.Production?

Improving separation of concerns

Separation of concerns is currently a bit messy between the submodules and the main repository. The submodules each provide a single service, but at the same time contain configuration and scripts to make them work together. For example, kitodo-production-docker contains SSH configuration in startup.sh in order to work with ocrd_manager, and for_production.sh in ocrd_manager contains the commands to trigger ocrd in ocrd_controller. And all of the above only works if one generates the SSH keys and configuration in the main repository.
This leads to a situation where one sometimes has to either change the main repository or a submodule in order to adjust the configuration on how the services work together.

Therefore I suggest the following approach:
The submodules kitodo-production-docker, ocrd_manager and ocrd_controller should only provide docker-compose files that can set up the services.
The main repository should contain all the details on how the services work together and provide docker-compose files that selectively override settings from the submodules as needed.

I've already started working on docker-compose files that override some settings for the communication, mainly to remove duplication. If we decide to go along with this I can make a couple of PRs to main repository and the submodules in the next couple of days.
