ocrd_fileformat's Issues

Converting from PAGE to hocr creates double results

hOCR files converted from PAGE contain every TextEquiv, instead of just one variant plus, when fontshape is used, the style determined by fontshape.

I start with an empty workspace, add an image to it, and run
ocrd process "tesserocr-recognize -P segmentation_level region -P textequiv_level word -P find_tables true -P model pol -I images -O OCR-D-OCR"
then I annotate it with
ocrd-tesserocr-fontshape -I OCR-D-OCR -O OCR-D-OCR-FONTSHAPE -P model pol
and finally, convert it to hocr
ocrd-fileformat-transform -I OCR-D-OCR-FONTSHAPE -O hocr -P from-to "page hocr"

The resulting file has the words/segments doubled, and when fontshape is used - tripled.

Table extraction?

It would be very useful to have a transformation that extracts any tables from PAGE-XML to CSV.

Missing mets:fptr for generated ALTO files

For dfg-viewer and other viewers, the METS file must contain a FULLTEXT mets:fileGrp. This can be generated using the conversion "page alto". In the following example the file LOCTYPE was replaced by a URL:

<mets:fileGrp USE="FULLTEXT">
  <mets:file MIMETYPE="application/alto+xml" ID="IMG_FULLTEXT_459867">
    <mets:FLocat xmlns:xlink="" LOCTYPE="URL" xlink:href=""/>

The dfg-viewer expects that all generated ID entries also occur in mets:fptr tags, but those are missing. They should look like this:

        <mets:structMap TYPE="PHYSICAL">
          <mets:div TYPE="physSequence" ID="physroot">
            <mets:div TYPE="page" LABEL="[Seite]" ID="phys459867" ORDER="1">
              <mets:fptr FILEID="IMG_FULLTEXT_459867"/>
              <mets:fptr FILEID="IMG_DEFAULT_459867"/>
              <mets:fptr FILEID="IMG_THUMBS_459867"/>
              <mets:fptr FILEID="IMG_MIN_459867"/>
              <mets:fptr FILEID="IMG_MAX_459867"/>

This looks like a general problem because other OCR-D processors also create new files without adding them to physical or logical pages.

--overwrite does not work on ocrd-fileformat-transform

The --overwrite option is not honoured by the ocrd-fileformat-transform processor, probably because the internal call ocrd workspace add -g catalog46muse_0023 -G OCR-D-TXT -m text/plain -i OCR-D-TXT_catalog46muse_0023 OCR-D-TXT/OCR-D-TXT_catalog46muse_0023.xml does not pass it on.
Original CLI call:
docker run --rm -u $(id -u) -v $PWD:/data -w /data -- ocrd/all:maximum ocrd-fileformat-transform -I OCR-D-OCR -O OCR-D-TXT -p '{"from-to": "page text"}' --overwrite


+ which ocrd
+ SHAREDIR=/usr/local/share/ocrd_fileformat
+ SCRIPT_NAME=ocrd-fileformat-transform
++ ocrd bashlib constants MIMETYPE_PAGE
+ MIMETYPE_PAGE=application/
+ main -I OCR-D-OCR -O OCR-D-TXT -p '{"from-to": "page text"}' --overwrite
++ ocrd bashlib filename
+ source /usr/lib/python3.6/site-packages/ocrd/lib.bash
+ ocrd__minversion 2.10.2
+ local minversion=2.10.2
++ sed 's/ocrd, version //'
++ ocrd --version
+ local version=2.15.0
+ local IFS=.
+ version=($version)
+ minversion=($minversion)
+ ((  2 > 2  ))
+ ((  2 == 2  ))
+ ((  15 > 10  ))
+ return
+ ocrd__wrap /usr/local/share/ocrd_fileformat/ocrd-tool.json ocrd-fileformat-transform -I OCR-D-OCR -O OCR-D-TXT -p '{"from-to": "page text"}' --overwrite
+ declare -gx OCRD_TOOL_JSON=/usr/local/share/ocrd_fileformat/ocrd-tool.json
+ declare -gx OCRD_TOOL_NAME=ocrd-fileformat-transform
+ shift
+ shift
+ declare -Agx params
+ params=()
+ declare -Agx ocrd__argv
+ ocrd__argv=()
+ which ocrd
+ declare -p OCRD_TOOL_JSON
+ [[ ! -r /usr/local/share/ocrd_fileformat/ocrd-tool.json ]]
+ [[ -z ocrd-fileformat-transform ]]
+ grep -q ocrd-fileformat-transform
+ ocrd ocrd-tool /usr/local/share/ocrd_fileformat/ocrd-tool.json list-tools
+ ocrd__parse_argv -I OCR-D-OCR -O OCR-D-TXT -p '{"from-to": "page text"}' --overwrite
+ declare -p ocrd__argv
+ declare -p params
+ ocrd__argv[overwrite]=false
+ __parameters=()
+ local __parameters
+ __parameter_overrides=()
+ local __parameter_overrides
+ [[ -I = -* ]]
+ case "$1" in
+ ocrd__argv[input_file_grp]=OCR-D-OCR
+ shift
+ shift
+ [[ -O = -* ]]
+ case "$1" in
+ ocrd__argv[output_file_grp]=OCR-D-TXT
+ shift
+ shift
+ [[ -p = -* ]]
+ case "$1" in
+ __parameters+=(-p "$2")
+ shift
+ shift
+ [[ --overwrite = -* ]]
+ case "$1" in
+ ocrd__argv[overwrite]=true
+ shift
+ [[ '' = -* ]]
+ [[ ! -r /data/mets.xml ]]
++ dirname /data/mets.xml
+ [[ ! -d /data ]]
+ [[ -z OCR-D-OCR ]]
+ [[ -z OCR-D-TXT ]]
+ local params_parsed retval
++ ocrd ocrd-tool /usr/local/share/ocrd_fileformat/ocrd-tool.json tool ocrd-fileformat-transform parse-params -p '{"from-to": "page text"}'
+ params_parsed='params["from-to"]="page text"
+ eval 'params["from-to"]="page text"
++ params["from-to"]='page text'
++ params["ext"]=.xml
++ params["script-args"]=
+ cd /data
+ page_id=
+ in_file_grp=OCR-D-OCR
+ out_file_grp=OCR-D-TXT
+ mkdir -p OCR-D-TXT
+ local from_to script_args output_extension
+ script_args=(${params['script-args']:-})
+ from_to=(${params['from-to']})
+ output_extension=.xml
+ local 'IFS=
+ files=($(ocrd workspace find         ${page_id:+-g} ${page_id:-}         -G $in_file_grp         -k local_filename         -k ID         -k pageId         --download))
++ ocrd workspace find -G OCR-D-OCR -k local_filename -k ID -k pageId --download
+ local 'IFS= 	
+ local n=0 zeros=0000
+ for csv in "${files[@]}"
+ let n+=1
+ local 'IFS=	'
+ fields=($csv)
+ local fields
+ local 'IFS= 	
+ local in_file=OCR-D-OCR/OCR-D-OCR_catalog46muse_0023.xml
+ local in_id=OCR-D-OCR_catalog46muse_0023
+ local pageid=catalog46muse_0023
+ test -f OCR-D-OCR/OCR-D-OCR_catalog46muse_0023.xml
+ local out_id=OCR-D-TXT_catalog46muse_0023
+ '[' xOCR-D-TXT_catalog46muse_0023 = xOCR-D-OCR_catalog46muse_0023 ']'
+ local out_file=OCR-D-TXT/OCR-D-TXT_catalog46muse_0023.xml
+ local output_mimetype
+ case "${from_to[1]}" in
+ output_mimetype=text/plain
+ ocrd__log info 'page --> text: input file OCR-D-OCR_catalog46muse_0023 (catalog46muse_0023)'
+ local log_level=INFO
+ [[ -n INFO ]]
+ ocrd -l INFO log info 'page --> text: input file OCR-D-OCR_catalog46muse_0023 (catalog46muse_0023)'
2020-09-08 12:02:37,895.895 INFO root - Overriding log level globally to INFO
2020-09-08 12:02:37,896.896 INFO ocrd-fileformat-transform - page --> text: input file OCR-D-OCR_catalog46muse_0023 (catalog46muse_0023)
+ ocr-transform page text OCR-D-OCR/OCR-D-OCR_catalog46muse_0023.xml OCR-D-TXT/OCR-D-TXT_catalog46muse_0023.xml --
+ declare -a options
+ '[' -n catalog46muse_0023 ']'
+ options=(-g $pageid)
+ options+=(-G $out_file_grp -m "$output_mimetype" -i "$out_id" "$out_file")
+ ocrd workspace add -g catalog46muse_0023 -G OCR-D-TXT -m text/plain -i OCR-D-TXT_catalog46muse_0023 OCR-D-TXT/OCR-D-TXT_catalog46muse_0023.xml
Traceback (most recent call last):
  File "/usr/bin/ocrd", line 8, in <module>
  File "/usr/lib/python3.6/site-packages/click/", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3.6/site-packages/click/", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3.6/site-packages/click/", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3.6/site-packages/click/", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3.6/site-packages/click/", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/", line 73, in new_func
    return ctx.invoke(f, obj, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ocrd/cli/", line 206, in workspace_add_file
  File "/usr/lib/python3.6/site-packages/ocrd_models/", line 262, in add_file
    raise Exception("File with ID='%s' already exists" % ID)
Exception: File with ID='OCR-D-TXT_catalog46muse_0023' already exists
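
A minimal sketch of how the parsed --overwrite value could be forwarded to the final registration step. It assumes that the installed ocrd workspace add accepts a --force flag for replacing an existing file ID; that flag does not appear in the log above, so it should be checked against the OCR-D core version in use. The helper name overwrite_flag is hypothetical.

```shell
# Hypothetical helper: print the extra flag for `ocrd workspace add`,
# depending on the parsed --overwrite value ("true" or "false").
# Assumption: `ocrd workspace add` understands --force.
overwrite_flag() {
    if [ "$1" = true ]; then
        printf '%s\n' --force
    fi
}

# Possible use in the processor (bash), right before `ocrd workspace add`:
#   options+=( $(overwrite_flag "${ocrd__argv[overwrite]}") )
```

With ocrd__argv[overwrite]=true, as in the trace above, the options array would gain one extra element; with false it stays unchanged.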

When pip is not installed yet

(This issue is not super important to me, I just noticed this small inconvenience when testing ocrd_fileformat.)

Because I often use rather "blank" container images while testing I've encountered this line in the Makefile:

PIP ?= $(shell which pip)

When pip is not installed yet, this sets $PIP to nothing and the error message I get in line 37 ( is from the system's install (not from pip install!). This would yield a better error:

PIP ?= pip

I'm not sure if the $(shell which pip) construct has another purpose I'm not aware of.

Side note: doesn't use $PIP

Offline use of PAGE → ALTO conversion

Currently, converting from PAGE to ALTO (one of my primary use cases) requires a working network connection, and possibly a working HTTP proxy configuration, apparently to load the ALTO schema (#29). I also noticed that this conversion needs to load at least xlink.xsd from the network.

There is code in PrimaDla.jar to load from a schema folder (searchForAdditionalSchemas); we should probably explore this first, with the aim of pre-installing all schemas in such a folder.

Proxy support

When an HTTP proxy is needed, conversion from PAGE to ALTO fails:

# ocrd-fileformat-transform -I OCR-D-GT-PAGE -O ALTO
14:36:13.086 INFO ocrd-fileformat-transform - page --> alto: input file OCR-D-GT-PAGE_00000024 (PHYS_0024) Connection timed out (Connection timed out)
        at java.base/ Method)
        at java.base/
        at java.base/
        at java.base/
        at java.base/
        at java.base/
        at java.base/
        at java.base/
        at java.base/
        at java.base/<init>(
        at java.base/
        at java.base/
        at java.base/
        at java.base/
        at java.base/
        at java.base/
        at java.base/
        at java.base/
        at java.base/
Could not initialise ALTO XML writer
14:38:23.306 ERROR ocrd-fileformat-transform - Transformation exited with return value 0 but no file was written.

Unfortunately, with the network setup here, this also means a long wait for a connection error because packets are simply dropped...

My preferred solution would be for ocr-fileformat to parse the somewhat standard http_proxy environment variable and pass the corresponding parameters to java:

java -Dhttp.proxyPort=3128 [...other parameters...]
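
The parsing step could look roughly like the following sketch. The function name proxy_java_opts is hypothetical; http.proxyHost and http.proxyPort are the standard Java networking properties. The sketch drops any credentials from the URL, does not handle IPv6 literals, and falls back to port 80 when none is given.

```shell
# Hypothetical helper: turn an http_proxy-style value into Java system
# properties. Prints nothing when no proxy is configured.
proxy_java_opts() {
    proxy=${1:-}
    [ -n "$proxy" ] || return 0
    hostport=${proxy#*://}      # drop the scheme, if any
    hostport=${hostport%%/*}    # drop a trailing path
    hostport=${hostport##*@}    # drop user:password@
    host=${hostport%%:*}
    case $hostport in
        *:*) port=${hostport##*:} ;;
        *)   port=80 ;;         # assumed default when no port is given
    esac
    printf '%s %s\n' "-Dhttp.proxyHost=$host" "-Dhttp.proxyPort=$port"
}

# Possible use (word splitting of the result is intended here):
#   java $(proxy_java_opts "${http_proxy:-}") [...other parameters...]
```

Whether the HTTPS counterparts of these properties are needed as well depends on how the schemas are fetched, which the report does not say.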

Fix error handling

Conversion bails out with the following error while converting PAGE to ALTO:

+ ocr-transform page alto TEXT/FILE_0001_TEXT.xml ALTO/FILE_0001_ALTO.xml --
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Premature end of file.
	at java.xml/
	at java.xml/
	at java.xml/
	at java.xml/
	at java.xml/
	at java.xml/$
	at java.xml/
	at java.xml/
	at java.xml/
	at java.xml/
	at java.xml/
	at java.xml/
	at java.xml/
	at java.xml/
	at java.xml/
	at java.xml/
	at java.xml/
	at java.xml/
	at java.xml/javax.xml.validation.SchemaFactory.newSchema(
Could not initialise ALTO XML writer

Consequently, no ALTO file is created. However, an entry in the METS file is created nonetheless, so rerunning the conversion fails:

+ declare -a options
+ '[' -n PHYS_0001 ']'
+ options=(-g $pageid)
+ options+=(-G $out_file_grp -m "$output_mimetype" -i "$out_id" "$out_file")
+ ocrd workspace add -g PHYS_0001 -G ALTO -m application/alto+xml -i FILE_0001_ALTO ALTO/FILE_0001_ALTO.xml
Traceback (most recent call last):
  File "/home/kmw/OCR-D/env/bin/ocrd", line 8, in <module>
  File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/click/", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/click/", line 717, in main
    rv = self.invoke(ctx)
  File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/click/", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/click/", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/click/", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/click/", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/click/", line 64, in new_func
    return ctx.invoke(f, obj, *args, **kwargs)
  File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/click/", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/ocrd/cli/", line 178, in workspace_add_file
  File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/ocrd_models/", line 261, in add_file
    raise Exception("File with ID='%s' already exists" % ID)
Exception: File with ID='FILE_0001_ALTO' already exists
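
One way to avoid the stale METS entry would be to register the output only after checking that the conversion really produced a file. The following is a sketch under that assumption, not the processor's actual code; the wrapper name run_and_verify is hypothetical.

```shell
# Hypothetical wrapper: run a conversion command and succeed only if it
# exits with status 0 AND leaves a non-empty output file behind.
# Usage: run_and_verify OUT_FILE COMMAND [ARGS...]
run_and_verify() {
    out_file=$1
    shift
    if ! "$@"; then
        echo "conversion failed: $*" >&2
        return 1
    fi
    if [ ! -s "$out_file" ]; then
        echo "conversion wrote no output: $out_file" >&2
        return 1
    fi
}

# Possible use in the processor:
#   run_and_verify "$out_file" ocr-transform page alto "$in_file" "$out_file" -- \
#       && ocrd workspace add "${options[@]}"
```

The explicit size check matters because a converter can print an error and still return exit status 0.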

pass script-args

Currently, the script-args param does not get passed over to ocr-transform. IIUC, this should be used for the Saxon parameters in the last position.

Wrap validation

ocr-fileformat also supports validation of many OCR formats; we should wrap that as well, in addition to the core-provided XSD validation.

(ht @wrznr)

Convert from ALTO-XML (V4.1) to PAGE-XML is failing

Since there seems to be a lot of interest in fas training, I decided to look at the data in OpenITI/OCR_GS_Data to give it a try with tesstrain.

OCR_GS_Data/TypeFaces/persian* has PNG files and ALTO XML files (v4.1). Since these did not work directly with ocrd-segment-extract-lines, I tried to convert them to PAGE format:

for i in ALTO/*; do base=$(basename "$i" .xml); ocrd workspace add "$i" -G ALTO -i "${base}_alto" -g "$base" -m application/alto+xml; done
for i in ALTO/*; do base=$(basename "$i" .xml); ocrd-fileformat-transform -P from-to "alto page" -I ALTO -O PAGE -g "$base" ; done

No PAGE files were generated, so I tried just for a single file. While "alto text" conversion is working, "alto page" is not.

(venv) (base) ubuntu@tesseract-ocr-1:~/fasGS$ ocrd-fileformat-transform -P from-to "alto page" -I ALTO -O PAGE -g ahsan_at_tavarikh_1
09:53:17.685 INFO ocrd-fileformat-transform - alto --> page: input file ahsan_at_tavarikh_1_alto (ahsan_at_tavarikh_1)
(venv) (base) ubuntu@tesseract-ocr-1:~/fasGS$  ocrd-fileformat-transform -P from-to "alto text" -I ALTO -O PAGE -g ahsan_at_tavarikh_1
09:54:53.183 INFO ocrd-fileformat-transform - alto --> text: input file ahsan_at_tavarikh_1_alto (ahsan_at_tavarikh_1)
09:55:00.625 INFO ocrd-fileformat-transform - Successfully executed: ocr-transform alto text ALTO/ahsan_at_tavarikh_1.xml PAGE/PAGE_0001.txt --
09:55:01.949 INFO ocrd.workspace.save_mets - Saving mets '/home/ubuntu/fasGS/mets.xml'

Is this the correct workflow for splitting the ALTO page-level information into lines?

failure due to (inherited?) nounset

I got:

+ocr-transform page alto OCR-D-OCR/OCR-D-OCR_0001.xml FULLTEXT/FULLTEXT_0001.xml -- --no-check-border --dummy-word
+source /usr/share/ocr-fileformat/
/usr/bin/ocr-transform: line 4: COLORTERM: unbound variable

The latter is the fourth line of's

if [[ -n "$COLORTERM" || "$TERM" = *color* || "$TERM" = xterm* ]];then

The only way this can fail AFAICS is if the shell has nounset enabled (i.e. set -u).

But ocr-fileformat's itself only uses set -e.

We have set -u here in ocrd-fileformat-transform, but IIUC these shell options cannot be inherited.

So where does this come from?
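
Regardless of where nounset comes from, the check itself can be made robust against it. The sketch below is a nounset-safe variant of the condition quoted above; the function name supports_color is hypothetical, and this is not the upstream code.

```shell
# Sketch: nounset-safe colour detection.
# ${VAR:-} expands to an empty string instead of aborting under `set -u`.
supports_color() {
    [ -n "${COLORTERM:-}" ] && return 0
    case "${TERM:-}" in
        *color*|xterm*) return 0 ;;
    esac
    return 1
}
```

This keeps the logic of the original test (COLORTERM set, or TERM containing "color" or starting with "xterm") while tolerating unset variables.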

Text output files should use .txt file extension

Currently the conversion from PAGE XML to text ("page text") creates text files with the file extension .xml which is unexpected and can cause problems with viewers which expect XML but get pure text.
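
A possible fix is to derive the extension from the target format instead of always using .xml. The sketch below assumes the target format name "text" as used in the from-to parameter and keeps .xml as the fallback for everything else; the function name output_extension_for is hypothetical.

```shell
# Hypothetical helper: choose the output file extension from the target
# format (the second word of the "from-to" parameter).
output_extension_for() {
    case "$1" in
        text) echo .txt ;;
        *)    echo .xml ;;   # fallback, as in the current behaviour
    esac
}

# Possible use:
#   out_file="$out_file_grp/$out_id$(output_extension_for "${from_to[1]}")"
```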

post-process ALTO→PAGE

The alto page transform does not set /PcGts/Page/@imageFilename if the input had no /alto/description/sourceImageInformation/@fileName. It is impossible to fix that with OCR-D means (even ocrd workspace).

It would be very helpful if this processor had some fix-up capability for this important case (and probably others).

My suggestion would be to try to find the "correct" image file by looking up the physical pageId for the ALTO file and then among the image-only fileGrps taking the first (or the largest, or a parameter-configured) entry for that page.

does not work with nondefault METS basename

When running on a workspace with URL ending in anything but mets.xml, the input gets processed, but the METS is not updated.

The reason seems to be that the final ocrd workspace bulk-add command should use ocrd workspace -m ${ocrd__argv[mets]} bulk-add (but after we chdired we should strip the directory part).
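
The directory-stripping step could be sketched as follows. The helper name mets_after_chdir is hypothetical; it only relies on the fact that, after changing into the workspace directory, the file name part of the METS path is all that remains valid.

```shell
# Hypothetical helper: reduce a METS path to its file name, for use after
# the processor has already changed into the workspace directory.
mets_after_chdir() {
    printf '%s\n' "${1##*/}"
}

# Possible use:
#   ocrd workspace -m "$(mets_after_chdir "${ocrd__argv[mets]}")" bulk-add ...
```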

Bad error handling when converting from PAGE to ALTO (was: Error writing target ALTO XML file)

Using the workspace I get the following error:

% ocrd-fileformat-transform -I OCR-D-OCR-TESS -O TMP.$RANDOM
17:05:35.001 INFO ocrd-fileformat-transform - page --> alto: input file OCR-D-OCR-TESS_00000024 (PHYS_0024)
Error writing target ALTO XML file
cvc-length-valid: Value 'oͤ' with length = '2' is not facet-valid with respect to length '1' for type '#AnonType_CONTENTGlyphType'.
cvc-attribute.3: The value 'oͤ' of attribute 'CONTENT' on element 'Glyph' is not valid with respect to its type, 'null'.
cvc-length-valid: Value 'uͤ' with length = '2' is not facet-valid with respect to length '1' for type '#AnonType_CONTENTGlyphType'.
cvc-attribute.3: The value 'uͤ' of attribute 'CONTENT' on element 'Glyph' is not valid with respect to its type, 'null'.
cvc-length-valid: Value 'uͤ' with length = '2' is not facet-valid with respect to length '1' for type '#AnonType_CONTENTGlyphType'.
cvc-attribute.3: The value 'uͤ' of attribute 'CONTENT' on element 'Glyph' is not valid with respect to its type, 'null'.

[ ... more messages like the above ...]

cvc-attribute.3: The value 'aͤ' of attribute 'CONTENT' on element 'Glyph' is not valid with respect to its type, 'null'.
17:05:38.950 INFO ocrd-fileformat-transform - Successfully executed: ocr-transform page alto OCR-D-OCR-TESS/OCR-D-OCR-TESS_00000024.xml TMP.25711/TMP.25711_00000024.xml -- 
17:05:39.621 INFO ocrd.workspace.save_mets - Saving mets '/home/mike/devel/ocrd-galley/actevedef_718448162.first-page/mets.xml'

The file TMP.25711/TMP.25711_00000024.xml does not exist, so that Successfully executed is misleading ;-)

OCR-D-OCR-TESS was created using ocrd_tesserocr, so maybe there is a problem there too.

document usage by example

The README could use a usage section with a (real-life) example call. Also, the tool json is somewhat incomplete (descriptions, FS/dir vs METS/filegrp perspective).
