ub-mannheim / ocrd_pagetopdf

OCR-D wrapper for prima-pagetopdf

License: Apache License 2.0

Dockerfile 4.51% Makefile 12.86% Shell 44.85% Python 37.78%

ocr-d prima-pagetopdf ocr

ocrd_pagetopdf's Introduction

ocrd-pagetopdf

OCR-D wrapper for prima-page-to-pdf

Transforms all PAGE-XML+IMG to PDF with text layer and (optionally) polygon outlines.

(Converts original images together with text and layout annotations of all pages in the PAGE input file group to PDF. The text is rendered as an overlay.)

Requirements

GNU make
Python 3 with pip and venv
OCR-D
Java runtime (OpenJDK 8 works for PageToPdf 1.1.2)

Installation

Once you have installed Java, make, and Python, and have set up your virtual environment, run:

make deps # or: pip install ocrd
make install # copies into PREFIX or VIRTUAL_ENV

Usage

The command-line interface conforms to OCR-D processor specifications.

Assuming you have an OCR-D workspace in your current working directory, run:

ocrd-pagetopdf -I PAGE-FILEGRP -O PDF-FILEGRP -p '{"textequiv_level" : "word"}'

This creates one PDF file per page, each with a text layer based on the word-level annotations.

There is also an option to create an additional multi-page file, named merged.pdf in this example, which contains all single pages in the correct order:

ocrd-pagetopdf -I PAGE-FILGRP -O PDF-FILEGRP -p '{"textequiv_level" : "word", "multipage":"merged"}'
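
When the call is assembled from a script, the JSON parameter string is easy to get wrong by hand. The following is a minimal sketch and not part of ocrd_pagetopdf: the helper name build_command is hypothetical, and it only uses the -I, -O and -p flags and the parameter names shown in the examples above.

```python
import json
import shlex


def build_command(input_grp, output_grp, **params):
    """Assemble an ocrd-pagetopdf call as an argument list.

    build_command is a hypothetical helper for illustration; it only
    uses the -I/-O/-p flags shown in the usage examples.
    """
    return ["ocrd-pagetopdf", "-I", input_grp, "-O", output_grp,
            "-p", json.dumps(params)]


cmd = build_command("OCR-D-OCR", "OCR-D-PDF",
                    textequiv_level="word", multipage="merged")
# Quote each argument so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The argument list can also be handed to subprocess.run(cmd, check=True) directly, which avoids shell quoting altogether.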

FAQ

Illegal reflective access by com.itextpdf.text.io.ByteBufferRandomAccessSource$1 to method java.nio.DirectByteBuffer.cleaner(): If this warning appears, try installing OpenJDK 8.
java.lang.NullPointerException: If this exception appears, try a small workaround that sets negative coordinates to zero:
```
ocrd-pagetopdf -I PAGE-FILEGRP -O PDF-FILEGRP -p '{"textequiv_level" : "word", "negative2zero": true}'
```
Some letters are illegible? Please note that the default display font (AletheiaSans.ttf) does not support all Unicode glyphs. If yours are missing, set a (monospace) Unicode font yourself:
```
ocrd-pagetopdf -I PAGE-FILEGRP -O PDF-FILEGRP -p '{"textequiv_level" : "word", "font": "/usr/share/fonts/truetype/ubuntu/UbuntuMono-R.ttf"}'
```
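
To illustrate what setting negative coordinates to zero means for PAGE-XML, where polygons are stored in a points attribute of space-separated x,y pairs, here is a minimal sketch. It is an assumption about the effect of negative2zero, not the wrapper's actual code, and clamp_points is a hypothetical name.

```python
def clamp_points(points):
    """Return a PAGE-style points string with negative values set to 0.

    Hypothetical helper: it illustrates the idea behind negative2zero,
    it is not taken from ocrd_pagetopdf.
    """
    clamped = []
    for pair in points.split():
        x, y = (int(value) for value in pair.split(","))
        clamped.append(f"{max(x, 0)},{max(y, 0)}")
    return " ".join(clamped)


print(clamp_points("-5,10 120,-3 120,80 -5,80"))  # prints: 0,10 120,0 120,80 0,80
```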

The page labels of the multi-page file can be changed with pagelabelname, e.g. to consecutive page numbers (pagenumber):

ocrd-pagetopdf -I PAGE-FILEGRP -O PDF-FILEGRP -p '{"textequiv_level" : "word", "multipage":"merged", "pagelabelname":"pagenumber"}'

ocrd_pagetopdf's Issues

run without showing commands executed on stdout

Please remove the set -x line in ocrd-pagetopdf for production use.

image input file group requirement

Thanks @JKamlah for making this great tool!

Would it be much effort to remove the requirement to have an explicit second input file group for the image? This should be just dereferenced from the /Page/@imageFilename in the PAGE file (relative to METS file path).

Also, line 35: in_grps[1]: unbound variable is not a good error message IMO.
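
For illustration, the dereferencing suggested above could look roughly like this. It is a sketch using only the Python standard library, not the processor's code; image_path_from_page is a hypothetical helper, and it assumes that imageFilename is a path relative to the METS directory rather than a URL.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def image_path_from_page(page_xml, mets_dir):
    """Resolve /Page/@imageFilename relative to the METS directory."""
    root = ET.fromstring(page_xml)
    for elem in root.iter():
        # Compare local names only, since the PAGE namespace URI
        # differs between schema versions.
        if elem.tag.rsplit("}", 1)[-1] == "Page":
            filename = elem.get("imageFilename")
            if not filename:
                raise ValueError("Page element has no imageFilename")
            return Path(mets_dir) / filename
    raise ValueError("no Page element found")


# Made-up sample document for demonstration.
sample = (
    '<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">'
    '<Page imageFilename="OCR-D-IMG/page_0001.tif" imageWidth="100" imageHeight="200"/>'
    "</PcGts>"
)
print(image_path_from_page(sample, "workspace"))
```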

Installation fails on Debian 10

The make target deps-ubuntu fails on Debian 10 (buster) because the dependency openjdk-8-jre-headless is not in the standard packages. Is ocrd_pagetopdf compatible with Java 11? If so, would it be possible to add a conditional to your Makefile along the lines of

ifeq ($(shell lsb_release -rs),10)
apt-get install -y openjdk-11-jre-headless
endif

...or something like this in order to have installation work on Debian 10?

allow creating multi-page PDFs

IIUC the package currently aims to create one PDF file per page in the output file group.

It would be a thrill to have an option to concatenate all pages into one multi-page PDF file (with proper PDF meta-data about the physical page numbers).

itextpdf installation does not work

After doing make install on Ubuntu 18.04 with OpenJDK 11.0.5 and running an example workflow, I get:

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.itextpdf.text.io.ByteBufferRandomAccessSource$1 (file:venv/share/ocrd_pagetopdf/ptp/lib/itextpdf-5.5.2.jar) to method java.nio.DirectByteBuffer.cleaner()
WARNING: Please consider reporting this to the maintainers of com.itextpdf.text.io.ByteBufferRandomAccessSource$1
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release

And the PDFs generated have no text layer or visual annotation.

Is there a specific Java runtime required?

does not work on two input fileGrps anymore

ocrd-pagetopdf -I OCR-D-OCR,OCR-D-BIN-WAN -O OCR-D-OUT -P textequiv_level word -P multipage merged --overwrite
Traceback (most recent call last):
  File "/data/ocr-d/ocrd_all/venv/bin/ocrd", line 8, in <module>
    sys.exit(cli())
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/ocrd/cli/bashlib.py", line 115, in bashlib_input_files
    for input_file in processor.input_files:
  File "/data/ocr-d/ocrd_all/venv/lib/python3.7/site-packages/ocrd/processor/base.py", line 291, in input_files
    assert len(ret[0]) == 1, 'Use zip_input_files() instead of input_files when processing multiple input fileGrps'
AssertionError: Use zip_input_files() instead of input_files when processing multiple input fileGrps
13:54:28.798

Usage example for converting page xml to searchable pdf?

Hi,

Is it possible to convert page xml into searchable pdf? Although I got a pdf after using ocrd-pagetopdf -I OCR-D-OCR -O OCR-D-PDF -P textequiv_level word, it is not searchable. The workflow I used to produce page xml:

ocrd process \
  "cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN" \
  "anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP" \
  "skimage-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P method li" \
  "skimage-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -P level-of-operation page" \
  "tesserocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW -P operation_level page" \
  "tesserocr-recognize -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG -P textequiv_level word -P segmentation_level word -P overwrite_segments true" \
  "cis-ocropy-dewarp -I OCR-D-SEG -O OCR-D-SEG-LINE-RESEG-DEWARP" \
  "calamari-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR -P checkpoint_dir qurator-gt4histocr-1.0"

I don't see where the bug is. Any idea where a change or addition is required in the workflow? Maybe it has to do with the region recognizer?

Best
Aysoltan

workaround for pagetopdf.jar exceptions

Currently, the PageToPdf.jar backend from PRImA does not signal exceptions with a non-zero exit status, cf. PRImA-Research-Lab/prima-page-to-pdf#5.

Could we try to work around this from here for the time being?

Options I can see:

parse stdout for exceptions
check that the resulting PDF file is non-zero in size (i.e. -s "$out_file" instead of -f "$out_file")
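
Both options can be combined. The following is only a sketch of the idea in Python, not a patch for the shell wrapper: find_problems is a hypothetical helper, and a small Python one-liner stands in for the Java call, since it also prints an exception while exiting with status 0.

```python
import os
import subprocess
import sys
from typing import List


def find_problems(cmd: List[str], out_file: str) -> List[str]:
    """Run cmd and report signs of failure beyond the exit status."""
    result = subprocess.run(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    problems = []
    if result.returncode != 0:
        problems.append(f"exit status {result.returncode}")
    # Option 1: look for exception names in the captured output.
    for line in result.stdout.splitlines():
        if "Exception" in line:
            problems.append(f"exception in output: {line.strip()}")
    # Option 2: the PDF must exist and be non-empty (like `test -s`).
    if not os.path.isfile(out_file) or os.path.getsize(out_file) == 0:
        problems.append(f"missing or empty output file: {out_file}")
    return problems


# Stand-in for the backend: prints an exception, yet exits with 0.
fake = [sys.executable, "-c", "print('java.lang.NullPointerException')"]
print(find_problems(fake, "does-not-exist.pdf"))
```

Matching the bare word "Exception" is deliberately crude; a real check would probably need a stricter pattern to avoid false positives from ordinary log output.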

throw error if input-filegrp doesn't exist

If you use a non-existing input-filegrp, currently the only reaction is:

WARNING ocrd-pagetopdf - Without a second input file group for images, the original imageFilename will be used

Instead, there should be an error message stating that the specified input filegrp doesn't exist in this workspace.
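
A check of this kind could be sketched as follows. This is an illustration with the Python standard library only, reading the fileGrp USE attributes straight from the METS; missing_file_groups is a hypothetical helper and not how the processor currently works.

```python
import xml.etree.ElementTree as ET

METS_NS = {"mets": "http://www.loc.gov/METS/"}


def missing_file_groups(mets_xml, requested):
    """Return the requested fileGrp names that the METS does not contain.

    requested is a comma-separated list, as passed to -I.
    """
    root = ET.fromstring(mets_xml)
    existing = {grp.get("USE")
                for grp in root.iterfind(".//mets:fileGrp", METS_NS)}
    return [name for name in requested.split(",") if name not in existing]


# Made-up METS fragment for demonstration.
sample = (
    '<mets:mets xmlns:mets="http://www.loc.gov/METS/"><mets:fileSec>'
    '<mets:fileGrp USE="OCR-D-IMG"/><mets:fileGrp USE="OCR-D-OCR"/>'
    "</mets:fileSec></mets:mets>"
)
missing = missing_file_groups(sample, "OCR-D-OCR,OCR-D-TYPO")
if missing:
    print("input fileGrp not found in workspace: " + ", ".join(missing))
```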

Add license

The LICENSE file is currently missing.

The code is based on the PRImA PAGE to PDF Converter (permissive license, Apache 2.0), which uses itextpdf (AGPL) and refers to the DejaVu fonts license.

Add as transform script to ocr-fileformat?

Wouldn't this be more versatile if it were integrated into ocr-fileformat / ocrd_fileformat?