ocrd_fileformat's Introduction

ocrd-fileformat

OCR-D wrapper for ocr-fileformat

Prerequisites

  • GNU make
  • Python && pip
  • OpenJDK (required by submodule)
  • optional: Docker CE for building container images

Installation

Clone the repository and its submodules recursively:

git clone --recursive https://github.com/OCR-D/ocrd_fileformat.git

From the directory containing the local clone, build and install ocr-fileformat and the ocrd_fileformat OCR-D wrapper:

make -C ocrd_fileformat install

Alternatively, for the Docker option, just pull the image:

docker pull ocrd/fileformat

Usage

After successful installation type ocrd-fileformat-transform --help to get an idea which conversions are supported already:

ocrd-fileformat-transform -h
Usage: ocrd-fileformat-transform [OPTIONS]

Convert between OCR file formats

  Processor base class and helper functions. A processor is a tool
  that implements the uniform OCR-D command-line interface for
  run-time data processing. That is, it executes a single workflow step,
  or a combination of workflow steps, on the workspace (represented by
  local METS). It reads input files for all or requested physical
  pages of the input fileGrp(s), and writes output files for them into
  the output fileGrp(s). It may take a number of optional or
  mandatory parameters. Process the :py:attr:workspace from the
  given :py:attr:input_file_grp to the given
  :py:attr:output_file_grp for the given :py:attr:page_id under
  the given :py:attr:parameter.

  (This contains the main functionality and needs to be overridden by
  subclasses.)

Options:
  -I, --input-file-grp USE        File group(s) used as input
  -O, --output-file-grp USE       File group(s) used as output
  -g, --page-id ID                Physical page ID(s) to process
  --overwrite                     Remove existing output pages/images
                                  (with --page-id, remove only those)
  -p, --parameter JSON-PATH       Parameters, either verbatim JSON string
                                  or JSON file path
  -P, --param-override KEY VAL    Override a single JSON object key-value pair,
                                  taking precedence over --parameter
  -s, --server HOST PORT WORKERS  Run web server instead of one-shot processing
                                  (shifts mets/working-dir/page-id options to
                                  HTTP request arguments); pass network interface
                                  to bind to, TCP port, number of worker processes
  -m, --mets URL-PATH             URL or file path of METS to process
  -w, --working-dir PATH          Working directory of local workspace
  -l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
                                  Log level
  -C, --show-resource RESNAME     Dump the content of processor resource RESNAME
  -L, --list-resources            List names of processor resources
  -J, --dump-json                 Dump tool description as JSON and exit
  -h, --help                      This help message
  -V, --version                   Show version

Parameters:
   "from-to" [string - "page alto"]
    Transformation scenario, see ocr-fileformat -L
    Possible values: ["abbyy hocr", "abbyy page", "alto2.0 alto3.0",
    "alto2.0 alto3.1", "alto2.0 hocr", "alto2.1 alto3.0", "alto2.1
    alto3.1", "alto2.1 hocr", "alto page", "alto text", "gcv hocr", "gcv
    page", "hocr alto2.0", "hocr alto2.1", "hocr page", "hocr text",
    "page alto", "page hocr", "page page2019", "page text", "tei hocr"]
   "ext" [string - ""]
    Output extension. Set to empty string to derive extension from the
    media type.
   "script-args" [string - ""]
    Arguments to Saxon (for XSLT transformations) or to transformation
    script

With the OCR-D CLI wrapper the ocr-fileformat converter integrates fluently into existing OCR-D tool workflows.

Given a previous step which produces PAGE-XML under the file group OCR, a conversion into plain text under the file group OCR-TXT can be achieved with:

ocrd-fileformat-transform -I OCR -O OCR-TXT -P from-to "page text"
OCR-TXT: OCR
OCR-TXT: TOOL = ocrd-fileformat-transform
OCR-TXT: PARAMS = "from-to": "page text"

Since the conversion from PAGE-XML to ALTO-XML (V4.1) is such a common requirement, it is the default value for the parameter from-to. Therefore, parameters can be omitted completely:

ocrd-fileformat-transform -I OCR -O OCR-ALTO
OCR-ALTO: OCR
OCR-ALTO: TOOL = ocrd-fileformat-transform

However, typically the ALTO converter itself will require additional parameters to be able to cope with the kind of annotations present. For example, if you have no cropping in the workflow, and OCR text is only annotated on the line level, then you will need to add:

ocrd-fileformat-transform -I OCR -O OCR-ALTO -P script-args "--no-check-border --no-check-words --dummy-word"
OCR-ALTO: OCR
OCR-ALTO: TOOL = ocrd-fileformat-transform
OCR-ALTO: PARAMS = "script-args": "--no-check-border --no-check-words --dummy-word"

To run the program via Docker, just spin up a container analogously:

docker run --rm -v $PWD:/data ocrd/fileformat ocrd-fileformat-transform -I OCR -O OCR-ALTO

ocrd_fileformat's Issues

does not work with nondefault METS basename

When running on a workspace with URL ending in anything but mets.xml, the input gets processed, but the METS is not updated.

The reason seems to be that the final ocrd workspace bulk-add command should use ocrd workspace -m ${ocrd__argv[mets]} bulk-add (but after we chdired we should strip the directory part).
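
The directory stripping mentioned above can be sketched as follows. This is only an illustration: the helper name `mets_basename` is hypothetical and not part of the wrapper, and the `bulk-add` arguments in the usage comment are left elided.

```shell
# Hypothetical helper: reduce a METS path to its file name, so that it can be
# passed to `ocrd workspace -m` after the script has changed into the
# workspace directory. Falls back to the default name mets.xml.
mets_basename() {
    basename "${1:-mets.xml}"
}

# Intended use inside the wrapper (bash, after `cd` into the workspace):
#   ocrd workspace -m "$(mets_basename "${ocrd__argv[mets]}")" bulk-add ...
```

With `/data/ws/other.xml` as input the helper prints `other.xml`, which is the name that has to reach `-m` once the working directory is `/data/ws`.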

Proxy support

When an HTTP proxy is needed, conversion from PAGE to ALTO fails:

# ocrd-fileformat-transform -I OCR-D-GT-PAGE -O ALTO
14:36:13.086 INFO ocrd-fileformat-transform - page --> alto: input file OCR-D-GT-PAGE_00000024 (PHYS_0024)
java.net.ConnectException: Connection timed out (Connection timed out)
        at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:399)
        at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:242)
        at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:224)
        at java.base/java.net.Socket.connect(Socket.java:609)
        at java.base/java.net.Socket.connect(Socket.java:558)
        at java.base/sun.net.NetworkClient.doConnect(NetworkClient.java:182)
        at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:474)
        at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:569)
        at java.base/sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
        at java.base/sun.net.www.http.HttpClient.New(HttpClient.java:341)
        at java.base/sun.net.www.http.HttpClient.New(HttpClient.java:362)
        at java.base/sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1253)
        at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1187)
        at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1081)
        at java.base/sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:1015)
        at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1592)
        at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1520)
        at java.base/java.net.URL.openStream(URL.java:1140)
        at org.primaresearch.io.xml.XmlValidator.getSchema(XmlValidator.java:53)
        at org.primaresearch.dla.page.io.xml.XmlPageWriter_Alto.run(XmlPageWriter_Alto.java:200)
        at org.primaresearch.dla.page.io.xml.XmlPageWriter_Alto.write(XmlPageWriter_Alto.java:115)
        at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:282)
        at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:161)
Could not initialise ALTO XML writer
java.lang.NullPointerException
        at org.primaresearch.dla.page.io.xml.XmlPageWriter_Alto.run(XmlPageWriter_Alto.java:200)
        at org.primaresearch.dla.page.io.xml.XmlPageWriter_Alto.write(XmlPageWriter_Alto.java:115)
        at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:282)
        at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:161)
14:38:23.306 ERROR ocrd-fileformat-transform - Transformation exited with return value 0 but no file was written.

Unfortunately with the network setup here, this also is a long wait for a connection error because packets are simply dropped...

The preferred solution for me would be for ocr-fileformat to parse the somewhat standard http_proxy environment variable and pass the corresponding parameters to java:

java -Dhttp.proxyHost=http-proxy.sbb.spk-berlin.de -Dhttp.proxyPort=3128 [...other parameters...]
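
A minimal sketch of such parsing, assuming `http_proxy` has the common form `http://[user:pass@]host[:port][/]`. The function name `proxy_java_opts` is hypothetical, and IPv6 literals are not handled. Only the two Java system properties shown above are produced.

```shell
# Hypothetical helper: turn an http_proxy-style URL into Java proxy options.
# Takes the URL as first argument, else falls back to $http_proxy.
# Prints nothing if no proxy is configured.
proxy_java_opts() (
    url=${1:-${http_proxy:-}}
    [ -n "$url" ] || exit 0
    url=${url#*://}   # drop the scheme, if any
    url=${url%%/*}    # drop a trailing slash or path
    url=${url##*@}    # drop credentials, if any
    host=${url%%:*}
    if [ "$host" = "$url" ]; then
        # no port given: let Java use its default proxy port
        printf '%s\n' "-Dhttp.proxyHost=$host"
    else
        printf '%s\n' "-Dhttp.proxyHost=$host -Dhttp.proxyPort=${url##*:}"
    fi
)
```

The output is meant to be word-split into the `java` command line, for example `java $(proxy_java_opts) [...other parameters...]`.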

Missing mets:fptr for generated ALTO files

For dfg-viewer and other viewers, the METS file must contain a FULLTEXT mets:fileGrp. This can be generated using the conversion "page alto". In the following example the file LOCTYPE was replaced by a URL:

<mets:fileGrp USE="FULLTEXT">
  <mets:file MIMETYPE="application/alto+xml" ID="IMG_FULLTEXT_459867">
    <mets:FLocat xmlns:xlink="http://www.w3.org/1999/xlink" LOCTYPE="URL" xlink:href="https://digi.bib.uni-mannheim.de/fileadmin/vl/ubmaweick/451435/FULLTEXT/IMG_FULLTEXT_459867.xml"/>
  </mets:file>
  [...]

The dfg-viewer expects that all generated ID entries also occur in mets:fptr tags, but those are missing. They should look like this:

        <mets:structMap TYPE="PHYSICAL">
          <mets:div TYPE="physSequence" ID="physroot">
            <mets:div TYPE="page" LABEL="[Seite]" ID="phys459867" ORDER="1">
              <mets:fptr FILEID="IMG_FULLTEXT_459867"/>
              <mets:fptr FILEID="IMG_DEFAULT_459867"/>
              <mets:fptr FILEID="IMG_THUMBS_459867"/>
              <mets:fptr FILEID="IMG_MIN_459867"/>
              <mets:fptr FILEID="IMG_MAX_459867"/>
            </mets:div>
           [...]

This looks like a general problem because other OCR-D processors also create new files without adding them to physical or logical pages.

--overwrite does not work on ocrd-fileformat-transform

The --overwrite option is not taken into account by the ocrd-fileformat-transform processor, probably because the call ocrd workspace add -g catalog46muse_0023 -G OCR-D-TXT -m text/plain -i OCR-D-TXT_catalog46muse_0023 OCR-D-TXT/OCR-D-TXT_catalog46muse_0023.xml does not use --overwrite.
Original CLI call:
docker run --rm -u $(id -u) -v $PWD:/data -w /data -- ocrd/all:maximum ocrd-fileformat-transform -I OCR-D-OCR -O OCR-D-TXT -p '{"from-to": "page text"}' --overwrite

Output:

+ which ocrd
+ SHAREDIR=/usr/local/share/ocrd_fileformat
+ SCRIPT_NAME=ocrd-fileformat-transform
++ ocrd bashlib constants MIMETYPE_PAGE
+ MIMETYPE_PAGE=application/vnd.prima.page+xml
+ main -I OCR-D-OCR -O OCR-D-TXT -p '{"from-to": "page text"}' --overwrite
++ ocrd bashlib filename
+ source /usr/lib/python3.6/site-packages/ocrd/lib.bash
++ (( BASH_VERSINFO<4 || BASH_VERSINFO==4 && BASH_VERSINFO[1]<4 ))
+ ocrd__minversion 2.10.2
+ local minversion=2.10.2
++ sed 's/ocrd, version //'
++ ocrd --version
+ local version=2.15.0
+ local IFS=.
+ version=($version)
+ minversion=($minversion)
+ ((  2 > 2  ))
+ ((  2 == 2  ))
+ ((  15 > 10  ))
+ return
+ ocrd__wrap /usr/local/share/ocrd_fileformat/ocrd-tool.json ocrd-fileformat-transform -I OCR-D-OCR -O OCR-D-TXT -p '{"from-to": "page text"}' --overwrite
+ declare -gx OCRD_TOOL_JSON=/usr/local/share/ocrd_fileformat/ocrd-tool.json
+ declare -gx OCRD_TOOL_NAME=ocrd-fileformat-transform
+ shift
+ shift
+ declare -Agx params
+ params=()
+ declare -Agx ocrd__argv
+ ocrd__argv=()
+ which ocrd
+ declare -p OCRD_TOOL_JSON
+ [[ ! -r /usr/local/share/ocrd_fileformat/ocrd-tool.json ]]
+ [[ -z ocrd-fileformat-transform ]]
+ grep -q ocrd-fileformat-transform
+ ocrd ocrd-tool /usr/local/share/ocrd_fileformat/ocrd-tool.json list-tools
+ ocrd__parse_argv -I OCR-D-OCR -O OCR-D-TXT -p '{"from-to": "page text"}' --overwrite
+ declare -p ocrd__argv
+ declare -p params
+ ocrd__argv[overwrite]=false
+ __parameters=()
+ local __parameters
+ __parameter_overrides=()
+ local __parameter_overrides
+ [[ -I = -* ]]
+ case "$1" in
+ ocrd__argv[input_file_grp]=OCR-D-OCR
+ shift
+ shift
+ [[ -O = -* ]]
+ case "$1" in
+ ocrd__argv[output_file_grp]=OCR-D-TXT
+ shift
+ shift
+ [[ -p = -* ]]
+ case "$1" in
+ __parameters+=(-p "$2")
+ shift
+ shift
+ [[ --overwrite = -* ]]
+ case "$1" in
+ ocrd__argv[overwrite]=true
+ shift
+ [[ '' = -* ]]
+ [[ ! -r /data/mets.xml ]]
++ dirname /data/mets.xml
+ [[ ! -d /data ]]
+ [[ ! INFO =~ OFF|ERROR|WARN|INFO|DEBUG|TRACE ]]
+ [[ -z OCR-D-OCR ]]
+ [[ -z OCR-D-TXT ]]
+ local params_parsed retval
++ ocrd ocrd-tool /usr/local/share/ocrd_fileformat/ocrd-tool.json tool ocrd-fileformat-transform parse-params -p '{"from-to": "page text"}'
+ params_parsed='params["from-to"]="page text"
params["ext"]=".xml"
params["script-args"]=""'
+ eval 'params["from-to"]="page text"
params["ext"]=".xml"
params["script-args"]=""'
++ params["from-to"]='page text'
++ params["ext"]=.xml
++ params["script-args"]=
+ cd /data
+ page_id=
+ in_file_grp=OCR-D-OCR
+ out_file_grp=OCR-D-TXT
+ mkdir -p OCR-D-TXT
+ local from_to script_args output_extension
+ script_args=(${params['script-args']:-})
+ from_to=(${params['from-to']})
+ output_extension=.xml
+ local 'IFS=
'
+ files=($(ocrd workspace find         ${page_id:+-g} ${page_id:-}         -G $in_file_grp         -k local_filename         -k ID         -k pageId         --download))
++ ocrd workspace find -G OCR-D-OCR -k local_filename -k ID -k pageId --download
+ local 'IFS= 	
'
+ local n=0 zeros=0000
+ for csv in "${files[@]}"
+ let n+=1
+ local 'IFS=	'
+ fields=($csv)
+ local fields
+ local 'IFS= 	
'
+ local in_file=OCR-D-OCR/OCR-D-OCR_catalog46muse_0023.xml
+ local in_id=OCR-D-OCR_catalog46muse_0023
+ local pageid=catalog46muse_0023
+ test -f OCR-D-OCR/OCR-D-OCR_catalog46muse_0023.xml
+ local out_id=OCR-D-TXT_catalog46muse_0023
+ '[' xOCR-D-TXT_catalog46muse_0023 = xOCR-D-OCR_catalog46muse_0023 ']'
+ local out_file=OCR-D-TXT/OCR-D-TXT_catalog46muse_0023.xml
+ local output_mimetype
+ case "${from_to[1]}" in
+ output_mimetype=text/plain
+ ocrd__log info 'page --> text: input file OCR-D-OCR_catalog46muse_0023 (catalog46muse_0023)'
+ local log_level=INFO
+ [[ -n INFO ]]
+ ocrd -l INFO log info 'page --> text: input file OCR-D-OCR_catalog46muse_0023 (catalog46muse_0023)'
2020-09-08 12:02:37,895.895 INFO root - Overriding log level globally to INFO
2020-09-08 12:02:37,896.896 INFO ocrd-fileformat-transform - page --> text: input file OCR-D-OCR_catalog46muse_0023 (catalog46muse_0023)
+ ocr-transform page text OCR-D-OCR/OCR-D-OCR_catalog46muse_0023.xml OCR-D-TXT/OCR-D-TXT_catalog46muse_0023.xml --
+ declare -a options
+ '[' -n catalog46muse_0023 ']'
+ options=(-g $pageid)
+ options+=(-G $out_file_grp -m "$output_mimetype" -i "$out_id" "$out_file")
+ ocrd workspace add -g catalog46muse_0023 -G OCR-D-TXT -m text/plain -i OCR-D-TXT_catalog46muse_0023 OCR-D-TXT/OCR-D-TXT_catalog46muse_0023.xml
Traceback (most recent call last):
  File "/usr/bin/ocrd", line 8, in <module>
    sys.exit(cli())
  File "/usr/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/decorators.py", line 73, in new_func
    return ctx.invoke(f, obj, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ocrd/cli/workspace.py", line 206, in workspace_add_file
    workspace.mets.add_file(**kwargs)
  File "/usr/lib/python3.6/site-packages/ocrd_models/ocrd_mets.py", line 262, in add_file
    raise Exception("File with ID='%s' already exists" % ID)
Exception: File with ID='OCR-D-TXT_catalog46muse_0023' already exists
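
The exception is raised because `ocrd workspace add` is called for an ID that is already registered. A possible fix is sketched below, under the assumption that the installed `ocrd workspace add` accepts a `--force` flag for replacing an existing entry (check `ocrd workspace add --help`). The function name `workspace_add_options` is hypothetical; the remaining options mirror the call in the trace.

```shell
# Hypothetical helper: print the options for `ocrd workspace add`, one per
# line, adding --force (assumed flag) when the processor got --overwrite.
workspace_add_options() (
    overwrite=$1 page_id=$2 file_grp=$3 mimetype=$4 file_id=$5 file_path=$6
    set --
    if [ "$overwrite" = true ]; then
        set -- "$@" --force
    fi
    if [ -n "$page_id" ]; then
        set -- "$@" -g "$page_id"
    fi
    set -- "$@" -G "$file_grp" -m "$mimetype" -i "$file_id" "$file_path"
    printf '%s\n' "$@"
)
```

In the wrapper itself the same decision would be made on `${ocrd__argv[overwrite]}`, which the trace shows being set to `true` when `--overwrite` is passed.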

Converting from PAGE to hocr creates double results

hOCR files converted from PAGE contain every TextEquiv instead of just one variant and, when fontshape was run, additionally the style determined by fontshape.

I start with an empty workspace, add an image to it, and run
ocrd process "tesserocr-recognize -P segmentation_level region -P textequiv_level word -P find_tables true -P model pol -I images -O OCR-D-OCR"
then I annotate it with
ocrd-tesserocr-fontshape -I OCR-D-OCR -O OCR-D-OCR-FONTSHAPE -P model pol
and finally, convert it to hocr
ocrd-fileformat-transform -I OCR-D-OCR-FONTSHAPE -O hocr -P from-to "page hocr"

The resulting file has the words/segments doubled, and when fontshape is used - tripled.

Wrap validation

ocr-fileformat also supports validation of many OCR formats; we should wrap that as well, in addition to the core-provided XSD validation.

(ht @wrznr)

pass script-args

Currently, the script-args param does not get passed over to ocr-transform. IIUC, this should be used for the Saxon parameters in the last position.
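
A minimal sketch of how the parameter could be handed over, assuming the extra arguments belong after the `--` separator, as in the `ocr-transform ... --` calls visible in the traces of other issues on this page. The function name `transform_cmdline` is hypothetical. It only prints the resulting argument list, one argument per line, and does not run the converter.

```shell
# Hypothetical helper: assemble the ocr-transform command line, appending the
# whitespace-separated script-args after the "--" separator.
transform_cmdline() (
    from=$1 to=$2 in_file=$3 out_file=$4 script_args=${5:-}
    set -f   # no globbing while the arguments are split on whitespace
    # $script_args is intentionally unquoted so that it splits into words
    set -- ocr-transform "$from" "$to" "$in_file" "$out_file" -- $script_args
    printf '%s\n' "$@"
)
```

Splitting on whitespace means that individual arguments cannot contain spaces, which is a limitation of keeping everything in one string parameter.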

Offline use of PAGE → ALTO conversion

Currently, converting from PAGE to ALTO, which is one of the primary use cases for me, requires a working network connection and possibly a working HTTP proxy configuration, apparently to load the ALTO schema (#29). I also noticed that this conversion needs to load at least xlink.xsd from the network.

There is code in PrimaDla.jar to load from a schema folder (searchForAdditionalSchemas), we should probably explore this first with the aim to pre-install all schemas in such a folder.

failure due to (inherited?) nounset

I got:

+ocr-transform page alto OCR-D-OCR/OCR-D-OCR_0001.xml FULLTEXT/FULLTEXT_0001.xml -- --no-check-border --dummy-word
+SHAREDIR=/usr/share/ocr-fileformat
+source /usr/share/ocr-fileformat/lib.sh
/usr/bin/ocr-transform: line 4: COLORTERM: unbound variable

The latter is the fourth line of lib.sh's

if [[ -n "$COLORTERM" || "$TERM" = *color* || "$TERM" = xterm* ]];then

The only way this can fail AFAICS is if the shell has nounset enabled (i.e. set -u).

But ocr-fileformat's ocr-transform.sh itself only uses set -e.

We have set -u here in ocrd-fileformat-transform, but IIUC these shell options cannot be inherited.

So where does this come from?
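
Independent of where the option comes from, the test itself can be made robust against nounset by expanding the variables with an empty default. The following is a sketch only: the function name `supports_color` is hypothetical, and the condition is rewritten with `case` so that it does not depend on bash.

```shell
# Hypothetical rewrite of the colour check: "${VAR:-}" expands to an empty
# string when VAR is unset, so `set -u` no longer aborts the script.
supports_color() {
    [ -n "${COLORTERM:-}" ] && return 0
    case "${TERM:-}" in
        *color*|xterm*) return 0 ;;
    esac
    return 1
}
```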

Fix error handling

Conversion bails out with the following error while converting PAGE to ALTO:

+ ocr-transform page alto TEXT/FILE_0001_TEXT.xml ALTO/FILE_0001_ALTO.xml --
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Premature end of file.
	at java.xml/com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:204)
	at java.xml/com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:178)
	at java.xml/com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:400)
	at java.xml/com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:327)
	at java.xml/com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1471)
	at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:1013)
	at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:605)
	at java.xml/com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:112)
	at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:534)
	at java.xml/com.sun.org.apache.xerces.internal.impl.xs.opti.SchemaParsingConfig.parse(SchemaParsingConfig.java:640)
	at java.xml/com.sun.org.apache.xerces.internal.impl.xs.opti.SchemaParsingConfig.parse(SchemaParsingConfig.java:696)
	at java.xml/com.sun.org.apache.xerces.internal.impl.xs.opti.SchemaDOMParser.parse(SchemaDOMParser.java:530)
	at java.xml/com.sun.org.apache.xerces.internal.impl.xs.traversers.XSDHandler.getSchemaDocument(XSDHandler.java:2226)
	at java.xml/com.sun.org.apache.xerces.internal.impl.xs.traversers.XSDHandler.parseSchema(XSDHandler.java:588)
	at java.xml/com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaLoader.loadSchema(XMLSchemaLoader.java:617)
	at java.xml/com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaLoader.loadGrammar(XMLSchemaLoader.java:576)
	at java.xml/com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaLoader.loadGrammar(XMLSchemaLoader.java:542)
	at java.xml/com.sun.org.apache.xerces.internal.jaxp.validation.XMLSchemaFactory.newSchema(XMLSchemaFactory.java:276)
	at java.xml/javax.xml.validation.SchemaFactory.newSchema(SchemaFactory.java:669)
	at org.primaresearch.io.xml.XmlValidator.getSchema(XmlValidator.java:55)
	at org.primaresearch.dla.page.io.xml.XmlPageWriter_Alto.run(XmlPageWriter_Alto.java:186)
	at org.primaresearch.dla.page.io.xml.XmlPageWriter_Alto.write(XmlPageWriter_Alto.java:101)
	at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:232)
	at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:130)
Could not initialise ALTO XML writer
java.lang.NullPointerException
	at org.primaresearch.dla.page.io.xml.XmlPageWriter_Alto.run(XmlPageWriter_Alto.java:186)
	at org.primaresearch.dla.page.io.xml.XmlPageWriter_Alto.write(XmlPageWriter_Alto.java:101)
	at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:232)
	at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:130)

Consequently, no ALTO file is created. However, an entry in the METS file is created nonetheless. I.e., while rerunning:

+ declare -a options
+ '[' -n PHYS_0001 ']'
+ options=(-g $pageid)
+ options+=(-G $out_file_grp -m "$output_mimetype" -i "$out_id" "$out_file")
+ ocrd workspace add -g PHYS_0001 -G ALTO -m application/alto+xml -i FILE_0001_ALTO ALTO/FILE_0001_ALTO.xml
Traceback (most recent call last):
  File "/home/kmw/OCR-D/env/bin/ocrd", line 8, in <module>
    sys.exit(cli())
  File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/click/decorators.py", line 64, in new_func
    return ctx.invoke(f, obj, *args, **kwargs)
  File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/ocrd/cli/workspace.py", line 178, in workspace_add_file
    workspace.mets.add_file(**kwargs)
  File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/ocrd_models/ocrd_mets.py", line 261, in add_file
    raise Exception("File with ID='%s' already exists" % ID)
Exception: File with ID='FILE_0001_ALTO' already exists

FILE_0001_TEXT.xml.zip

post-process ALTO→PAGE

The alto page transform does not set /PcGts/Page/@imageFilename if the input had no /alto/description/sourceImageInformation/@fileName. It is impossible to fix that with OCR-D means (even ocrd workspace).

It would be very helpful if this processor had some fix-up capability for this important case (and probably others).

My suggestion would be to try to find the "correct" image file by looking up the physical pageId for the ALTO file and then among the image-only fileGrps taking the first (or the largest, or a parameter-configured) entry for that page.

Bad error handling when converting from PAGE to ALTO (was: Error writing target ALTO XML file)

Using the workspace actevedef_718448162.first-page.ocrd_fileformat_fail.zip I get the following error:

% ocrd-fileformat-transform -I OCR-D-OCR-TESS -O TMP.$RANDOM
17:05:35.001 INFO ocrd-fileformat-transform - page --> alto: input file OCR-D-OCR-TESS_00000024 (PHYS_0024)
Error writing target ALTO XML file
cvc-length-valid: Value 'oͤ' with length = '2' is not facet-valid with respect to length '1' for type '#AnonType_CONTENTGlyphType'.
cvc-attribute.3: The value 'oͤ' of attribute 'CONTENT' on element 'Glyph' is not valid with respect to its type, 'null'.
cvc-length-valid: Value 'uͤ' with length = '2' is not facet-valid with respect to length '1' for type '#AnonType_CONTENTGlyphType'.
cvc-attribute.3: The value 'uͤ' of attribute 'CONTENT' on element 'Glyph' is not valid with respect to its type, 'null'.
cvc-length-valid: Value 'uͤ' with length = '2' is not facet-valid with respect to length '1' for type '#AnonType_CONTENTGlyphType'.
cvc-attribute.3: The value 'uͤ' of attribute 'CONTENT' on element 'Glyph' is not valid with respect to its type, 'null'.

[ ... more messages like the above ...]

cvc-attribute.3: The value 'aͤ' of attribute 'CONTENT' on element 'Glyph' is not valid with respect to its type, 'null'.
17:05:38.950 INFO ocrd-fileformat-transform - Successfully executed: ocr-transform page alto OCR-D-OCR-TESS/OCR-D-OCR-TESS_00000024.xml TMP.25711/TMP.25711_00000024.xml -- 
17:05:39.621 INFO ocrd.workspace.save_mets - Saving mets '/home/mike/devel/ocrd-galley/actevedef_718448162.first-page/mets.xml'

The file TMP.25711/TMP.25711_00000024.xml does not exist, so that Successfully executed is misleading ;-)
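
A minimal sketch of a stricter check, which treats a conversion as successful only if the command returns 0 and the output file exists and is non-empty. The function name `run_and_verify` and the message texts are hypothetical.

```shell
# Hypothetical helper: run a command and verify that it produced its output.
# Usage: run_and_verify OUT_FILE COMMAND [ARGS...]
run_and_verify() {
    rv_out_file=$1
    shift
    if ! "$@"; then
        echo "command failed: $*" >&2
        return 1
    fi
    if [ ! -s "$rv_out_file" ]; then
        echo "command returned 0 but no file was written: $rv_out_file" >&2
        return 1
    fi
}
```

Only after this check passes should the file be registered in the METS and the success message be logged.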

OCR-D-OCR-TESS was created using ocrd_tesserocr, so maybe there is a problem there too.

When pip is not installed yet

(This issue is not super important to me, I just noticed this small inconvenience when testing ocrd_fileformat.)

Because I often use rather "blank" container images while testing I've encountered this line in the Makefile:

PIP ?= $(shell which pip)

When pip is not installed yet, this sets $PIP to nothing and the error message I get in line 37 (https://github.com/OCR-D/ocrd_fileformat/blob/master/Makefile#L37) is from the system's install (not from pip install!). This would yield a better error:

PIP ?= pip

I'm not sure if the $(shell which pip) construct has another purpose I'm not aware of.

Side note: https://github.com/OCR-D/ocrd_fileformat/blob/master/Makefile#L90 doesn't use $PIP

Table extraction?

It would be very useful to have a transformation that extracts any tables from PAGE-XML to CSV.

Convert from ALTO-XML (V4.1) to PAGE-XML is failing

Since there seems to be a lot of interest in fas training, I decided to look at the data in OpenITI/OCR_GS_Data to give it a try with tesstrain.

OCR_GS_Data/TypeFaces/persian* has png files and ALTO xml. These are v4.1. Since these did not work directly with ocrd-segment-extract-lines, I decided to convert them to PAGE format:

for i in ALTO/*; do base=$(basename "$i" .xml); ocrd workspace add "$i" -G ALTO -i "${base}_alto" -g "$base" -m application/alto+xml; done
for i in ALTO/*; do base=$(basename "$i" .xml); ocrd-fileformat-transform -P from-to "alto page" -I ALTO -O PAGE -g "$base" ; done

No PAGE files were generated, so I tried just for a single file. While "alto text" conversion is working, "alto page" is not.

(venv) (base) ubuntu@tesseract-ocr-1:~/fasGS$ ocrd-fileformat-transform -P from-to "alto page" -I ALTO -O PAGE -g ahsan_at_tavarikh_1
09:53:17.685 INFO ocrd-fileformat-transform - alto --> page: input file ahsan_at_tavarikh_1_alto (ahsan_at_tavarikh_1)
(venv) (base) ubuntu@tesseract-ocr-1:~/fasGS$  ocrd-fileformat-transform -P from-to "alto text" -I ALTO -O PAGE -g ahsan_at_tavarikh_1
09:54:53.183 INFO ocrd-fileformat-transform - alto --> text: input file ahsan_at_tavarikh_1_alto (ahsan_at_tavarikh_1)
09:55:00.625 INFO ocrd-fileformat-transform - Successfully executed: ocr-transform alto text ALTO/ahsan_at_tavarikh_1.xml PAGE/PAGE_0001.txt --
09:55:01.949 INFO ocrd.workspace.save_mets - Saving mets '/home/ubuntu/fasGS/mets.xml'

Is this the correct workflow to follow to split the ALTO page level info to lines?

document usage by example

The README could use a usage section with a (real-life) example call. Also, the tool json is somewhat incomplete (descriptions, FS/dir vs METS/filegrp perspective).

Text output files should use .txt file extension

Currently the conversion from PAGE XML to text ("page text") creates text files with the file extension .xml which is unexpected and can cause problems with viewers which expect XML but get pure text.
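
One way to address this is to derive the extension from the target format of the from-to parameter. The mapping below is only an illustration: the function name `extension_for_target` is hypothetical, `.txt` for `text` is what this issue asks for, and every other target simply keeps the `.xml` extension used so far.

```shell
# Hypothetical helper: choose the output file extension from the target
# format, i.e. the second word of the from-to parameter.
extension_for_target() {
    case "$1" in
        text) echo .txt ;;
        *)    echo .xml ;;   # assumption: all other targets stay XML
    esac
}
```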
