
ocrd_pagetopdf's Introduction

ocrd-pagetopdf

OCR-D wrapper for prima-page-to-pdf

Transforms all PAGE-XML+IMG to PDF with text layer and (optionally) polygon outlines.

(Converts original images together with text and layout annotations of all pages in the PAGE input file group to PDF. The text is rendered as an overlay.)

Requirements

  • GNU make
  • Python 3 with pip and venv
  • OCR-D
  • Java runtime (OpenJDK 8 works for PageToPdf 1.1.2)

Installation

Once you have installed Java, make, Python, and set up your virtual environment, do:

make deps # or: pip install ocrd
make install # copies into PREFIX or VIRTUAL_ENV

Usage

The command-line interface conforms to OCR-D processor specifications.

Assuming you have an OCR-D workspace in your current working directory, simply do:

ocrd-pagetopdf -I PAGE-FILEGRP -O PDF-FILEGRP -p '{"textequiv_level" : "word"}'

This will run the script and create PDF files for each page with a text layer based on word-level annotations.

There is also an option to create an additional multipage file named merged.pdf, which contains all single pages in the correct order:

ocrd-pagetopdf -I PAGE-FILEGRP -O PDF-FILEGRP -p '{"textequiv_level" : "word", "multipage":"merged"}'

FAQ

  • Illegal reflective access by com.itextpdf.text.io.ByteBufferRandomAccessSource$1 to method java.nio.DirectByteBuffer.cleaner(): If this warning appears, try installing OpenJDK 8.

  • java.lang.NullPointerException: If this appears, try a small workaround and set negative coordinates to zero:

    ocrd-pagetopdf -I PAGE-FILEGRP -O PDF-FILEGRP -p '{"textequiv_level" : "word", "negative2zero": true}'
    
  • Some letters are illegible? Note that the default font (AletheiaSans.ttf) does not support all Unicode glyphs. If yours are missing, set a (monospace) Unicode font yourself:

    ocrd-pagetopdf -I PAGE-FILEGRP -O PDF-FILEGRP -p '{"textequiv_level" : "word", "font": "/usr/share/fonts/truetype/ubuntu/UbuntuMono-R.ttf"}'
    
  • The page labels of the multipage file can be changed via pagelabelname, e.g. to consecutive page numbers with pagenumber:

    ocrd-pagetopdf -I PAGE-FILEGRP -O PDF-FILEGRP -p '{"textequiv_level" : "word", "multipage":"merged", "pagelabelname":"pagenumber"}'
    


ocrd_pagetopdf's Issues

image input file group requirement

Thanks @JKamlah for making this great tool!

Would it be much effort to remove the requirement to have an explicit second input file group for the image? This should be just dereferenced from the /Page/@imageFilename in the PAGE file (relative to METS file path).

Also, line 35: in_grps[1]: unbound variable is not a good error message IMO.
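
The dereferencing suggested above could look roughly like the following sketch. The function names page_image_filename and resolve_image_path are hypothetical and not part of ocrd-pagetopdf. The sed pattern assumes the imageFilename attribute is double-quoted and sits on a single line, and the path handling assumes a relative local path (no URLs, no absolute paths).

```shell
# Hypothetical helper: print the imageFilename attribute of a PAGE-XML file.
# Assumes the attribute is double-quoted and on one line; a robust
# implementation should use an XML parser instead of sed.
page_image_filename() {
    sed -n 's/.*imageFilename="\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Hypothetical helper: resolve the image path relative to the METS file.
# $1: path of the METS file, $2: path of the PAGE-XML file
resolve_image_path() {
    printf '%s/%s\n' "$(dirname "$1")" "$(page_image_filename "$2")"
}
```

With such a helper, the image would only need to come from a second input file group when the user explicitly wants a different image than the one referenced in the PAGE file.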

Installation fails on Debian 10

Make target deps-ubuntu fails on Debian 10 (buster), due to dependency openjdk-8-jre-headless not being in the standard packages. Is ocrd_pagetopdf compatible with Java 11 and would it be possible to add to your Makefile a conditional along the lines of

ifeq ($(shell lsb_release -rs),10)
apt-get install -y openjdk-11-jre-headless
endif

...or something like this in order to have installation work on Debian 10?

allow creating multi-page PDFs

IIUC the package currently aims to create one PDF file per page in the output file group.

It would be a thrill to have an option to concatenate all pages into one multi-page PDF file (with proper PDF meta-data about the physical page numbers).

itextpdf installation does not work

After running make install on an Ubuntu 18.04 system with OpenJDK 11.0.5 and running an example workflow, I get:

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.itextpdf.text.io.ByteBufferRandomAccessSource$1 (file:venv/share/ocrd_pagetopdf/ptp/lib/itextpdf-5.5.2.jar) to method java.nio.DirectByteBuffer.cleaner()
WARNING: Please consider reporting this to the maintainers of com.itextpdf.text.io.ByteBufferRandomAccessSource$1
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release

And the PDFs generated have no text layer or visual annotation.

Is there a specific Java runtime required?

does not work on two input fileGrps anymore

ocrd-pagetopdf -I OCR-D-OCR,OCR-D-BIN-WAN -O OCR-D-OUT -P textequiv_level word -P multipage merged --overwrite
Traceback (most recent call last):
  File "/data/ocr-d/ocrd_all/venv/bin/ocrd", line 8, in <module>
    sys.exit(cli())
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/ocrd/cli/bashlib.py", line 115, in bashlib_input_files
    for input_file in processor.input_files:
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/ocrd/processor/base.py", line 291, in input_files
    assert len(ret[0]) == 1, 'Use zip_input_files() instead of input_files when processing multiple input fileGrps'
AssertionError: Use zip_input_files() instead of input_files when processing multiple input fileGrps

Usage example for converting page xml to searchable pdf?

Hi,

Is it possible to convert page xml into searchable pdf? Although I got a pdf after using ocrd-pagetopdf -I OCR-D-OCR -O OCR-D-PDF -P textequiv_level word, it is not searchable. The workflow I used to produce page xml:

ocrd process \
  "cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN" \
  "anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP" \
  "skimage-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P method li" \
  "skimage-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -P level-of-operation page" \
  "tesserocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW -P operation_level page" \
  "tesserocr-recognize -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG -P textequiv_level word -P segmentation_level word -P overwrite_segments true" \
  "cis-ocropy-dewarp -I OCR-D-SEG -O OCR-D-SEG-LINE-RESEG-DEWARP" \
  "calamari-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR -P checkpoint_dir qurator-gt4histocr-1.0"

I can't see where the bug is. Any idea what needs to be changed or added in the workflow? Maybe it has to do with the region recognizer?

Best
Aysoltan

workaround for pagetopdf.jar exceptions

Currently, the PageToPdf.jar backend from PRImA does not signal exceptions with a non-zero exit status, cf. PRImA-Research-Lab/prima-page-to-pdf#5.

Could we try to work around this from here for the time being?

Options I can see:

  • parse stdout for exceptions
  • check the resulting PDF file to be non-zero in size (i.e. -s "$out_file" instead of -f "$out_file")
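
A minimal sketch combining both options, assuming the wrapper captures the Java output in a log file. The function name check_pdf_result and its two arguments are hypothetical, and matching on the word "Exception" is only a heuristic for Java stack traces.

```shell
# Hypothetical helper: decide whether a PageToPdf run succeeded.
# $1: file holding the captured stdout/stderr of the Java call (assumed)
# $2: expected PDF output file
check_pdf_result() {
    # Option 1: look for Java exceptions in the captured output.
    if grep -q 'Exception' "$1"; then
        echo "PageToPdf reported an exception, see $1" >&2
        return 1
    fi
    # Option 2: -s is true only if the file exists AND is non-empty,
    # whereas -f would also accept a zero-byte PDF.
    if [ ! -s "$2" ]; then
        echo "PDF output $2 is missing or empty" >&2
        return 1
    fi
    return 0
}
```

The caller could then turn a failed check into a non-zero exit status of the processor, which is what the backend itself currently does not provide.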

throw error if input-filegrp doesn't exist

If you use a non-existing input-filegrp, currently the only reaction is:

WARNING ocrd-pagetopdf - Without a second input file group for images, the original imageFilename will be used

Instead, there should be an error message stating that the specified input filegrp doesn't exist in this workspace.
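
One way such a check might look, as a sketch only: the helper names filegrp_exists and require_filegrp are hypothetical, and the list of existing groups is assumed to come from `ocrd workspace list-group`, one group name per line.

```shell
# Hypothetical helper: succeed only if file group $1 occurs in the
# newline-separated list of group names given as $2.
filegrp_exists() {
    # -x: match whole lines only, -F: treat the name as a fixed string
    printf '%s\n' "$2" | grep -qxF -- "$1"
}

# Hypothetical helper: fail with a clear message for an unknown group.
require_filegrp() {
    if ! filegrp_exists "$1" "$2"; then
        echo "ERROR: input file group '$1' does not exist in this workspace" >&2
        return 1
    fi
}

# Intended use inside a workspace (assumes the ocrd CLI is installed):
#   require_filegrp "OCR-D-OCR" "$(ocrd workspace list-group)" || exit 1
```

Matching whole lines avoids accepting a prefix such as OCR-D when only OCR-D-OCR exists.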
