jack12816 / alf-tengine-ocr

This project forked from aborroy/alf-tengine-ocr


Alfresco Transformer for ACS 7.0+ from PDF to OCRd PDF

License: GNU General Public License v2.0



Alfresco Transformer from PDF to OCRd PDF

This project provides a simple Alfresco Transformer that converts PDF to OCRd PDF, for use with ACS Community 7.0+.

OCR is performed by ocrmypdf, a wrapper around Tesseract that adds features to improve the accuracy of the process.

The Transformer ats-transformer-ocr uses the new Alfresco Local Transform API, which allows a Spring Boot application to be registered as a local transformation service.

The folder embed-metadata-action contains an Alfresco Repository Addon that makes the embed-metadata action available in the Folder Rule feature.

Local testing

Build Docker Image for Alfresco OCR Transformer

Building the Alfresco OCR Transformer Docker Image is required before running the Docker Compose template provided.

$ cd ats-transformer-ocr

$ mvn clean package

Maven will create a Docker Image named alfresco/tengine-ocr:latest
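Because the Docker Compose templates in the deployment sections expect this image to be available locally, it can be handy to check for it before starting anything. A minimal sketch, assuming the docker CLI is on the PATH; the helper name image_exists is hypothetical and not part of this project:

```shell
# Hypothetical helper: succeeds when the given image is present in the local Docker cache.
image_exists() {
    docker image inspect "$1" >/dev/null 2>&1
}

# Example use (run from the repository root):
#   image_exists alfresco/tengine-ocr:latest || (cd ats-transformer-ocr && mvn clean package)
```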

Starting

$ docker run -p 8090:8090 alfresco/tengine-ocr:latest

Testing

A sample web page is provided to check that the transformer is working:

http://localhost:8090
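The container can take a few seconds to start, so the page may not answer right away. The sketch below polls the URL until it responds; it assumes curl is installed, and wait_for_url with its retry defaults is a hypothetical helper rather than something shipped with the transformer:

```shell
# Hypothetical helper: poll a URL until it answers with an HTTP success code.
# Usage: wait_for_url URL [TRIES] [DELAY_SECONDS]
wait_for_url() {
    wait_url=$1
    wait_tries=${2:-30}
    wait_delay=${3:-2}
    wait_i=0
    while [ "$wait_i" -lt "$wait_tries" ]; do
        # -f: fail on HTTP errors, -s: silent, -o /dev/null: discard the body
        if curl -fs -o /dev/null "$wait_url"; then
            return 0
        fi
        wait_i=$((wait_i + 1))
        sleep "$wait_delay"
    done
    # Gave up: the URL never answered successfully
    return 1
}
```

For example, `wait_for_url http://localhost:8090 && echo "transformer is up"` can be run right after the docker run command above.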

Deployment with ACS Stack

Obtaining Repository Addon to enable Embed Metadata Action

Before deploying the Alfresco OCR Transformer, build the embed-metadata-action Repository Addon.

$ cd embed-metadata-action

$ mvn clean package

$ ls target/embed-metadata-action-1.0.0.jar
target/embed-metadata-action-1.0.0.jar

Alternatively, embed-metadata-action-1.0.0.jar can be downloaded from Releases.

Deploying Repository Addon to enable Embed Metadata Action

Deploy embed-metadata-action-1.0.0.jar to the alfresco service using any of the available methods, for example by adding the JAR to the alfresco/modules/jar folder when using the Alfresco Docker Installer tool.
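For the Alfresco Docker Installer case, the copy step could look like the sketch below. The helper name install_addon is hypothetical, and the location of the alfresco/modules/jar folder relative to your working directory is an assumption that depends on where the Docker Installer project was generated:

```shell
# Hypothetical helper: copy the addon JAR into a modules folder.
# $1 = path to the built JAR, $2 = target folder (e.g. alfresco/modules/jar)
install_addon() {
    addon_jar=$1
    addon_dir=$2
    if [ ! -f "$addon_jar" ]; then
        echo "JAR not found: $addon_jar (run 'mvn clean package' first)" >&2
        return 1
    fi
    # Create the target folder if needed, then copy the JAR into it
    mkdir -p "$addon_dir" && cp "$addon_jar" "$addon_dir/"
}
```

A call such as `install_addon embed-metadata-action/target/embed-metadata-action-1.0.0.jar alfresco/modules/jar` would then place the JAR where the alfresco image build can pick it up.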

Adding Alfresco OCR Transformer to Docker Compose (Local Transformer - HTTP) - Community Edition

Check that the following configuration is applied in the docker-compose.yml file.

services:
    alfresco:
        environment:
            JAVA_OPTS : "
                -DlocalTransform.core-aio.url=http://transform-core-aio:8090/
                -DlocalTransform.ocr.url=http://transform-ocr:8090/
            "

    transform-core-aio:
        image: alfresco/alfresco-transform-core-aio:2.3.10
        mem_limit: 1536m
        environment:
            JAVA_OPTS: " -XX:MinRAMPercentage=50 -XX:MaxRAMPercentage=80"

    transform-ocr:
        image: alfresco/tengine-ocr:latest
        mem_limit: 1536m
        environment:
            JAVA_OPTS: " -XX:MinRAMPercentage=50 -XX:MaxRAMPercentage=80"

  • Include the localTransform URL for the OCR Transformer in the alfresco Docker container (http://transform-ocr:8090/ by default).
  • Declare the new transform-ocr Docker container.
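The two points above can be checked mechanically before starting the stack. This is a rough sketch using a plain text search rather than a YAML parser, so it only catches omissions; check_ocr_compose is a hypothetical helper name:

```shell
# Hypothetical helper: report which expected settings are missing from a Compose file.
# $1 = path to the Compose file (defaults to docker-compose.yml)
check_ocr_compose() {
    compose_file=${1:-docker-compose.yml}
    compose_missing=0
    for needle in \
        '-DlocalTransform.ocr.url=' \
        'image: alfresco/tengine-ocr'
    do
        # -F: fixed string, -q: quiet, -e: needed because one pattern starts with '-'
        if ! grep -Fq -e "$needle" "$compose_file"; then
            echo "missing: $needle" >&2
            compose_missing=1
        fi
    done
    return "$compose_missing"
}
```

Running `docker-compose config` afterwards additionally validates the file's syntax.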

Remember to build the alfresco/tengine-ocr Docker image before running this composition.

Start the ACS stack from the folder containing the docker-compose.yml file.

$ docker-compose up --build --force-recreate

A sample deployment is available in the docker folder.

Adding Alfresco OCR Transformer to Docker Compose (Async Transformer - ActiveMQ) - Enterprise Edition

Check that the following configuration is applied in the docker-compose.yml file.

services:
    alfresco:
        environment:
            JAVA_OPTS : "
              -Dlocal.transform.service.enabled=false
              -Dtransform.service.enabled=true
              -Dtransform.service.url=http://transform-router:8095
              -Dsfs.url=http://shared-file-store:8099/
            "

    transform-router:
      image: quay.io/alfresco/alfresco-transform-router:${TRANSFORM_ROUTER_TAG}
      environment:
        JAVA_OPTS: " -XX:MinRAMPercentage=50 -XX:MaxRAMPercentage=80"
        ACTIVEMQ_URL: "nio://activemq:61616"
        CORE_AIO_URL: "http://transform-core-aio:8090"
        TRANSFORMER_URL_OCR: "http://transform-ocr:8090"
        TRANSFORMER_QUEUE_OCR: "ocr-engine-queue"
        FILE_STORE_URL: "http://shared-file-store:8099/alfresco/api/-default-/private/sfs/versions/1/file"

    transform-ocr:
      image: alfresco/tengine-ocr:latest
      mem_limit: 1536m
      environment:
        JAVA_OPTS: " -XX:MinRAMPercentage=50 -XX:MaxRAMPercentage=80"
        ACTIVEMQ_URL: "nio://activemq:61616"
        FILE_STORE_URL: "http://shared-file-store:8099/alfresco/api/-default-/private/sfs/versions/1/file"

Remember to build the alfresco/tengine-ocr Docker image before running this composition.

Start the ACS stack from the folder containing the docker-compose.yml file.

$ docker-compose up --build --force-recreate

A sample deployment is available in the docker-enterprise folder.

Defining the OCR Rule in Alfresco Share

Use your browser to open the Alfresco Share app (by default available at http://localhost:8080/share/).

Create a folder and add the following rule (Manage Rules folder option):

  • When: Items are created or enter this folder
  • If all criteria are met: Mimetype is 'Adobe PDF Document'
  • Perform Action: Embed properties as metadata in content

To limit the number of parallel OCR processing threads, use the Run rule in background checkbox.

From that point on, every PDF file uploaded to the folder will be OCRd. The original PDF file remains as version 1.0, while the one with the text layer is labeled as version 1.1.
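To confirm that both versions exist for an uploaded file, the version history can be read from the repository's public REST API. The sketch below assumes curl is installed; the helper names are hypothetical, and the node id and credentials are placeholders you would supply yourself:

```shell
# Build the version-history URL for a node (public Alfresco REST API v1 path).
# $1 = repository base URL, $2 = node id (placeholder, taken from your own upload)
versions_url() {
    # ${1%/} strips a trailing slash so the path is not doubled up
    printf '%s/alfresco/api/-default-/public/alfresco/versions/1/nodes/%s/versions\n' "${1%/}" "$2"
}

# Fetch the version history as JSON.
# $3 = credentials in user:password form (an assumption; use your own account)
list_versions() {
    curl -fsS -u "$3" "$(versions_url "$1" "$2")"
}
```

Calling `list_versions http://localhost:8080 <node-id> <user>:<password>` after an upload should list the 1.0 and 1.1 entries described above once the rule has run.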

Customizing ocrmypdf arguments

By default, the Alfresco OCR Transformer provides the following ocrmypdf configuration.

# Executable command for ocrmypdf program
ocrmypdf.path=ocrmypdf

# Arguments for ocrmypdf invocation
ocrmypdf.arguments=--skip-text

The configuration can be changed with Docker environment variables on the command line.

$ docker run -p 8090:8090 -e OCRMYPDF_ARGUMENTS='--skip-text -l eng' alfresco/tengine-ocr:latest

Or with the equivalent notation in docker-compose.yml:

transform-ocr:
    image: alfresco/tengine-ocr:latest
    mem_limit: 1536m
    environment:
      JAVA_OPTS: "-XX:MinRAMPercentage=50 -XX:MaxRAMPercentage=80"
      OCRMYPDF_ARGUMENTS: "--skip-text -l eng"
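The variable name OCRMYPDF_ARGUMENTS lines up with the property ocrmypdf.arguments in the way Spring Boot's relaxed binding usually maps properties to environment variables (dots become underscores, dashes are dropped, letters are upper-cased). A sketch of that mapping, assuming the transformer relies on standard Spring Boot behaviour; prop_to_env is a hypothetical helper:

```shell
# Hypothetical helper: derive the environment variable name for a property name,
# following Spring Boot's relaxed binding convention.
prop_to_env() {
    printf '%s\n' "$1" \
        | sed -e 's/-//g' -e 's/\./_/g' \
        | tr '[:lower:]' '[:upper:]'
}
```

Under the same assumption, `prop_to_env ocrmypdf.path` gives the name of the variable that would override the executable path.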

Additional contributors

aborroy, dgradecak, tpage-alfresco, dependabot[bot]
