ocr-d / ocrd_fileformat
OCR-D wrapper for ocr-fileformat
hOCR files converted from PAGE contain every TextEquiv of an element, as opposed to a single variant, and, when fontshape is used, also the style determined by fontshape.
I start with an empty workspace, add an image to it, and run
ocrd process "tesserocr-recognize -P segmentation_level region -P textequiv_level word -P find_tables true -P model pol -I images -O OCR-D-OCR"
then I annotate it with
ocrd-tesserocr-fontshape -I OCR-D-OCR -O OCR-D-OCR-FONTSHAPE -P model pol
and finally, convert it to hocr
ocrd-fileformat-transform -I OCR-D-OCR-FONTSHAPE -O hocr -P from-to "page hocr"
The resulting file has the words/segments doubled, and when fontshape is used - tripled.
It would be very useful to have a transformation that extracts any tables from PAGE-XML to CSV.
Line 37 in 5022408
This effectively bypasses any installed external versions.
See OCR-D/ocrd_all#354.
We should instead:
--update
)ocrd_fileformat/ocrd-fileformat-transform
Line 23 in 5022408
For dfg-viewer and other viewers, the METS file must contain a FULLTEXT mets:fileGrp. This can be generated using the conversion "page alto". In the following example the file LOCTYPE was replaced by a URL:
<mets:fileGrp USE="FULLTEXT">
<mets:file MIMETYPE="application/alto+xml" ID="IMG_FULLTEXT_459867">
<mets:FLocat xmlns:xlink="http://www.w3.org/1999/xlink" LOCTYPE="URL" xlink:href="https://digi.bib.uni-mannheim.de/fileadmin/vl/ubmaweick/451435/FULLTEXT/IMG_FULLTEXT_459867.xml"/>
</mets:file>
[...]
The dfg-viewer expects that all generated ID entries also occur in mets:fptr tags, but those are missing. They should look like this:
<mets:structMap TYPE="PHYSICAL">
<mets:div TYPE="physSequence" ID="physroot">
<mets:div TYPE="page" LABEL="[Seite]" ID="phys459867" ORDER="1">
<mets:fptr FILEID="IMG_FULLTEXT_459867"/>
<mets:fptr FILEID="IMG_DEFAULT_459867"/>
<mets:fptr FILEID="IMG_THUMBS_459867"/>
<mets:fptr FILEID="IMG_MIN_459867"/>
<mets:fptr FILEID="IMG_MAX_459867"/>
</mets:div>
[...]
This looks like a general problem because other OCR-D processors also create new files without adding them to physical or logical pages.
The --overwrite option is not taken into consideration by the ocrd-fileformat-transform processor, probably because the internal call
ocrd workspace add -g catalog46muse_0023 -G OCR-D-TXT -m text/plain -i OCR-D-TXT_catalog46muse_0023 OCR-D-TXT/OCR-D-TXT_catalog46muse_0023.xml
does not use --overwrite.
Original CLI call:
docker run --rm -u $(id -u) -v $PWD:/data -w /data -- ocrd/all:maximum ocrd-fileformat-transform -I OCR-D-OCR -O OCR-D-TXT -p '{"from-to": "page text"}' --overwrite
Output:
+ which ocrd
+ SHAREDIR=/usr/local/share/ocrd_fileformat
+ SCRIPT_NAME=ocrd-fileformat-transform
++ ocrd bashlib constants MIMETYPE_PAGE
+ MIMETYPE_PAGE=application/vnd.prima.page+xml
+ main -I OCR-D-OCR -O OCR-D-TXT -p '{"from-to": "page text"}' --overwrite
++ ocrd bashlib filename
+ source /usr/lib/python3.6/site-packages/ocrd/lib.bash
++ (( BASH_VERSINFO<4 || BASH_VERSINFO==4 && BASH_VERSINFO[1]<4 ))
+ ocrd__minversion 2.10.2
+ local minversion=2.10.2
++ sed 's/ocrd, version //'
++ ocrd --version
+ local version=2.15.0
+ local IFS=.
+ version=($version)
+ minversion=($minversion)
+ (( 2 > 2 ))
+ (( 2 == 2 ))
+ (( 15 > 10 ))
+ return
+ ocrd__wrap /usr/local/share/ocrd_fileformat/ocrd-tool.json ocrd-fileformat-transform -I OCR-D-OCR -O OCR-D-TXT -p '{"from-to": "page text"}' --overwrite
+ declare -gx OCRD_TOOL_JSON=/usr/local/share/ocrd_fileformat/ocrd-tool.json
+ declare -gx OCRD_TOOL_NAME=ocrd-fileformat-transform
+ shift
+ shift
+ declare -Agx params
+ params=()
+ declare -Agx ocrd__argv
+ ocrd__argv=()
+ which ocrd
+ declare -p OCRD_TOOL_JSON
+ [[ ! -r /usr/local/share/ocrd_fileformat/ocrd-tool.json ]]
+ [[ -z ocrd-fileformat-transform ]]
+ grep -q ocrd-fileformat-transform
+ ocrd ocrd-tool /usr/local/share/ocrd_fileformat/ocrd-tool.json list-tools
+ ocrd__parse_argv -I OCR-D-OCR -O OCR-D-TXT -p '{"from-to": "page text"}' --overwrite
+ declare -p ocrd__argv
+ declare -p params
+ ocrd__argv[overwrite]=false
+ __parameters=()
+ local __parameters
+ __parameter_overrides=()
+ local __parameter_overrides
+ [[ -I = -* ]]
+ case "$1" in
+ ocrd__argv[input_file_grp]=OCR-D-OCR
+ shift
+ shift
+ [[ -O = -* ]]
+ case "$1" in
+ ocrd__argv[output_file_grp]=OCR-D-TXT
+ shift
+ shift
+ [[ -p = -* ]]
+ case "$1" in
+ __parameters+=(-p "$2")
+ shift
+ shift
+ [[ --overwrite = -* ]]
+ case "$1" in
+ ocrd__argv[overwrite]=true
+ shift
+ [[ '' = -* ]]
+ [[ ! -r /data/mets.xml ]]
++ dirname /data/mets.xml
+ [[ ! -d /data ]]
+ [[ ! INFO =~ OFF|ERROR|WARN|INFO|DEBUG|TRACE ]]
+ [[ -z OCR-D-OCR ]]
+ [[ -z OCR-D-TXT ]]
+ local params_parsed retval
++ ocrd ocrd-tool /usr/local/share/ocrd_fileformat/ocrd-tool.json tool ocrd-fileformat-transform parse-params -p '{"from-to": "page text"}'
+ params_parsed='params["from-to"]="page text"
params["ext"]=".xml"
params["script-args"]=""'
+ eval 'params["from-to"]="page text"
params["ext"]=".xml"
params["script-args"]=""'
++ params["from-to"]='page text'
++ params["ext"]=.xml
++ params["script-args"]=
+ cd /data
+ page_id=
+ in_file_grp=OCR-D-OCR
+ out_file_grp=OCR-D-TXT
+ mkdir -p OCR-D-TXT
+ local from_to script_args output_extension
+ script_args=(${params['script-args']:-})
+ from_to=(${params['from-to']})
+ output_extension=.xml
+ local 'IFS=
'
+ files=($(ocrd workspace find ${page_id:+-g} ${page_id:-} -G $in_file_grp -k local_filename -k ID -k pageId --download))
++ ocrd workspace find -G OCR-D-OCR -k local_filename -k ID -k pageId --download
+ local 'IFS=
'
+ local n=0 zeros=0000
+ for csv in "${files[@]}"
+ let n+=1
+ local 'IFS= '
+ fields=($csv)
+ local fields
+ local 'IFS=
'
+ local in_file=OCR-D-OCR/OCR-D-OCR_catalog46muse_0023.xml
+ local in_id=OCR-D-OCR_catalog46muse_0023
+ local pageid=catalog46muse_0023
+ test -f OCR-D-OCR/OCR-D-OCR_catalog46muse_0023.xml
+ local out_id=OCR-D-TXT_catalog46muse_0023
+ '[' xOCR-D-TXT_catalog46muse_0023 = xOCR-D-OCR_catalog46muse_0023 ']'
+ local out_file=OCR-D-TXT/OCR-D-TXT_catalog46muse_0023.xml
+ local output_mimetype
+ case "${from_to[1]}" in
+ output_mimetype=text/plain
+ ocrd__log info 'page --> text: input file OCR-D-OCR_catalog46muse_0023 (catalog46muse_0023)'
+ local log_level=INFO
+ [[ -n INFO ]]
+ ocrd -l INFO log info 'page --> text: input file OCR-D-OCR_catalog46muse_0023 (catalog46muse_0023)'
2020-09-08 12:02:37,895.895 INFO root - Overriding log level globally to INFO
2020-09-08 12:02:37,896.896 INFO ocrd-fileformat-transform - page --> text: input file OCR-D-OCR_catalog46muse_0023 (catalog46muse_0023)
+ ocr-transform page text OCR-D-OCR/OCR-D-OCR_catalog46muse_0023.xml OCR-D-TXT/OCR-D-TXT_catalog46muse_0023.xml --
+ declare -a options
+ '[' -n catalog46muse_0023 ']'
+ options=(-g $pageid)
+ options+=(-G $out_file_grp -m "$output_mimetype" -i "$out_id" "$out_file")
+ ocrd workspace add -g catalog46muse_0023 -G OCR-D-TXT -m text/plain -i OCR-D-TXT_catalog46muse_0023 OCR-D-TXT/OCR-D-TXT_catalog46muse_0023.xml
Traceback (most recent call last):
File "/usr/bin/ocrd", line 8, in <module>
sys.exit(cli())
File "/usr/lib/python3.6/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/usr/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/lib/python3.6/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/click/decorators.py", line 73, in new_func
return ctx.invoke(f, obj, *args, **kwargs)
File "/usr/lib/python3.6/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/ocrd/cli/workspace.py", line 206, in workspace_add_file
workspace.mets.add_file(**kwargs)
File "/usr/lib/python3.6/site-packages/ocrd_models/ocrd_mets.py", line 262, in add_file
raise Exception("File with ID='%s' already exists" % ID)
Exception: File with ID='OCR-D-TXT_catalog46muse_0023' already exists
(This issue is not super important to me, I just noticed this small inconvenience when testing ocrd_fileformat.)
Because I often use rather "blank" container images while testing, I've encountered this line in the Makefile:
PIP ?= $(shell which pip)
When pip is not installed yet, this sets $PIP to nothing, and the error message I get in line 37 (https://github.com/OCR-D/ocrd_fileformat/blob/master/Makefile#L37) is from the system's install (not from pip install!). This would yield a better error:
PIP ?= pip
I'm not sure if the $(shell which pip) construct has another purpose I'm not aware of.
Side note: https://github.com/OCR-D/ocrd_fileformat/blob/master/Makefile#L90 doesn't use $PIP
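The effect described above can be reproduced in plain shell: with an empty command variable, a line like `$PIP install .` collapses to `install .`, which runs the system's install program. The following is only a sketch of a guard that reports a missing pip explicitly; the helper name `resolve_pip` is hypothetical and not part of the Makefile.

```shell
# Hypothetical helper (not from the Makefile): pick the pip command and
# fail with a clear message instead of silently running `install`.
resolve_pip() {
    # Use $PIP if it is set and non-empty, otherwise the bare name "pip"
    pip_cmd=${PIP:-pip}
    if ! command -v "$pip_cmd" >/dev/null 2>&1; then
        echo "error: '$pip_cmd' not found, please install pip first" >&2
        return 1
    fi
    printf '%s\n' "$pip_cmd"
}
```

A recipe could then call something like `"$(resolve_pip)" install .` and stop early when pip is missing.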
Currently, converting from PAGE to ALTO (one of the primary use cases for me) requires a working network connection and possibly a working HTTP proxy configuration, apparently to load the ALTO schema (#29). I also noticed that this conversion needs to load at least xlink.xsd from the network.
There is code in PrimaDla.jar to load from a schema folder (searchForAdditionalSchemas); we should probably explore this first, with the aim of pre-installing all schemas in such a folder.
When an HTTP proxy is needed, conversion from PAGE to ALTO fails:
# ocrd-fileformat-transform -I OCR-D-GT-PAGE -O ALTO
14:36:13.086 INFO ocrd-fileformat-transform - page --> alto: input file OCR-D-GT-PAGE_00000024 (PHYS_0024)
java.net.ConnectException: Connection timed out (Connection timed out)
at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
at java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:399)
at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:242)
at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:224)
at java.base/java.net.Socket.connect(Socket.java:609)
at java.base/java.net.Socket.connect(Socket.java:558)
at java.base/sun.net.NetworkClient.doConnect(NetworkClient.java:182)
at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:474)
at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:569)
at java.base/sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
at java.base/sun.net.www.http.HttpClient.New(HttpClient.java:341)
at java.base/sun.net.www.http.HttpClient.New(HttpClient.java:362)
at java.base/sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1253)
at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1187)
at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1081)
at java.base/sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:1015)
at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1592)
at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1520)
at java.base/java.net.URL.openStream(URL.java:1140)
at org.primaresearch.io.xml.XmlValidator.getSchema(XmlValidator.java:53)
at org.primaresearch.dla.page.io.xml.XmlPageWriter_Alto.run(XmlPageWriter_Alto.java:200)
at org.primaresearch.dla.page.io.xml.XmlPageWriter_Alto.write(XmlPageWriter_Alto.java:115)
at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:282)
at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:161)
Could not initialise ALTO XML writer
java.lang.NullPointerException
at org.primaresearch.dla.page.io.xml.XmlPageWriter_Alto.run(XmlPageWriter_Alto.java:200)
at org.primaresearch.dla.page.io.xml.XmlPageWriter_Alto.write(XmlPageWriter_Alto.java:115)
at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:282)
at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:161)
14:38:23.306 ERROR ocrd-fileformat-transform - Transformation exited with return value 0 but no file was written.
Unfortunately, with the network setup here, this also means a long wait for the connection error, because packets are simply dropped...
The preferred solution for me would be for ocr-fileformat to parse the somewhat standard http_proxy environment variable and pass the correct parameters to java:
java -Dhttp.proxyHost=http-proxy.sbb.spk-berlin.de -Dhttp.proxyPort=3128 [...other parameters...]
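As a rough illustration of that suggestion, the proxy value could be split into host and port with shell parameter expansion. This is a minimal sketch, not ocr-fileformat code: the function name `java_proxy_opts` is hypothetical, the fallback port 80 is an assumption, and IPv6 literals are not handled.

```shell
# Hypothetical helper: turn an http_proxy value such as
# "http://proxy.example.org:3128/" into Java system property flags.
java_proxy_opts() {
    proxy=${1:-${http_proxy:-}}
    [ -n "$proxy" ] || return 0      # no proxy configured: print nothing
    hostport=${proxy#*://}           # strip the scheme, if any
    hostport=${hostport%%/*}         # strip a trailing path
    hostport=${hostport##*@}         # strip user:password@, if any
    host=${hostport%%:*}
    case $hostport in
        *:*) port=${hostport##*:} ;;
        *)   port=80 ;;              # assumed default when no port is given
    esac
    printf '%s %s\n' "-Dhttp.proxyHost=$host" "-Dhttp.proxyPort=$port"
}
```

The result could then be spliced into the call as `java $(java_proxy_opts) [...other parameters...]`, relying on word splitting to separate the two flags.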
Conversion bails out with the following error while converting PAGE to ALTO:
+ ocr-transform page alto TEXT/FILE_0001_TEXT.xml ALTO/FILE_0001_ALTO.xml --
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Premature end of file.
at java.xml/com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:204)
at java.xml/com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:178)
at java.xml/com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:400)
at java.xml/com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:327)
at java.xml/com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1471)
at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:1013)
at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:605)
at java.xml/com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:112)
at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:534)
at java.xml/com.sun.org.apache.xerces.internal.impl.xs.opti.SchemaParsingConfig.parse(SchemaParsingConfig.java:640)
at java.xml/com.sun.org.apache.xerces.internal.impl.xs.opti.SchemaParsingConfig.parse(SchemaParsingConfig.java:696)
at java.xml/com.sun.org.apache.xerces.internal.impl.xs.opti.SchemaDOMParser.parse(SchemaDOMParser.java:530)
at java.xml/com.sun.org.apache.xerces.internal.impl.xs.traversers.XSDHandler.getSchemaDocument(XSDHandler.java:2226)
at java.xml/com.sun.org.apache.xerces.internal.impl.xs.traversers.XSDHandler.parseSchema(XSDHandler.java:588)
at java.xml/com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaLoader.loadSchema(XMLSchemaLoader.java:617)
at java.xml/com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaLoader.loadGrammar(XMLSchemaLoader.java:576)
at java.xml/com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaLoader.loadGrammar(XMLSchemaLoader.java:542)
at java.xml/com.sun.org.apache.xerces.internal.jaxp.validation.XMLSchemaFactory.newSchema(XMLSchemaFactory.java:276)
at java.xml/javax.xml.validation.SchemaFactory.newSchema(SchemaFactory.java:669)
at org.primaresearch.io.xml.XmlValidator.getSchema(XmlValidator.java:55)
at org.primaresearch.dla.page.io.xml.XmlPageWriter_Alto.run(XmlPageWriter_Alto.java:186)
at org.primaresearch.dla.page.io.xml.XmlPageWriter_Alto.write(XmlPageWriter_Alto.java:101)
at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:232)
at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:130)
Could not initialise ALTO XML writer
java.lang.NullPointerException
at org.primaresearch.dla.page.io.xml.XmlPageWriter_Alto.run(XmlPageWriter_Alto.java:186)
at org.primaresearch.dla.page.io.xml.XmlPageWriter_Alto.write(XmlPageWriter_Alto.java:101)
at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:232)
at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:130)
Consequently, no ALTO file is created. However, an entry in the METS file is created nonetheless, so rerunning fails:
+ declare -a options
+ '[' -n PHYS_0001 ']'
+ options=(-g $pageid)
+ options+=(-G $out_file_grp -m "$output_mimetype" -i "$out_id" "$out_file")
+ ocrd workspace add -g PHYS_0001 -G ALTO -m application/alto+xml -i FILE_0001_ALTO ALTO/FILE_0001_ALTO.xml
Traceback (most recent call last):
File "/home/kmw/OCR-D/env/bin/ocrd", line 8, in <module>
sys.exit(cli())
File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/click/decorators.py", line 64, in new_func
return ctx.invoke(f, obj, *args, **kwargs)
File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/ocrd/cli/workspace.py", line 178, in workspace_add_file
workspace.mets.add_file(**kwargs)
File "/home/kmw/OCR-D/env/lib/python3.7/site-packages/ocrd_models/ocrd_mets.py", line 261, in add_file
raise Exception("File with ID='%s' already exists" % ID)
Exception: File with ID='FILE_0001_ALTO' already exists
Currently, the script-args param does not get passed over to ocr-transform. IIUC, this should be used for the Saxon parameters in the last position.
ocr-fileformat also supports validation of many OCR formats; we should wrap that as well, in addition to the core-provided XSD validation.
(ht @wrznr)
Since there seems to be a lot of interest in fas training, I decided to look at the data in OpenITI/OCR_GS_Data to give it a try with tesstrain.
OCR_GS_Data/TypeFaces/persian* has PNG files and ALTO XML (v4.1). Since these did not work directly with ocrd-segment-extract-lines, I decided to convert them to PAGE format:
for i in ALTO/*; do base=$(basename "$i" .xml); ocrd workspace add "$i" -G ALTO -i "${base}_alto" -g "$base" -m application/alto+xml; done
for i in ALTO/*; do base=$(basename "$i" .xml); ocrd-fileformat-transform -P from-to "alto page" -I ALTO -O PAGE -g "$base" ; done
No PAGE files were generated, so I tried just for a single file. While "alto text" conversion is working, "alto page" is not.
(venv) (base) ubuntu@tesseract-ocr-1:~/fasGS$ ocrd-fileformat-transform -P from-to "alto page" -I ALTO -O PAGE -g ahsan_at_tavarikh_1
09:53:17.685 INFO ocrd-fileformat-transform - alto --> page: input file ahsan_at_tavarikh_1_alto (ahsan_at_tavarikh_1)
(venv) (base) ubuntu@tesseract-ocr-1:~/fasGS$ ocrd-fileformat-transform -P from-to "alto text" -I ALTO -O PAGE -g ahsan_at_tavarikh_1
09:54:53.183 INFO ocrd-fileformat-transform - alto --> text: input file ahsan_at_tavarikh_1_alto (ahsan_at_tavarikh_1)
09:55:00.625 INFO ocrd-fileformat-transform - Successfully executed: ocr-transform alto text ALTO/ahsan_at_tavarikh_1.xml PAGE/PAGE_0001.txt --
09:55:01.949 INFO ocrd.workspace.save_mets - Saving mets '/home/ubuntu/fasGS/mets.xml'
Is this the correct workflow to follow to split the ALTO page level info to lines?
Currently, running make deps install does not work due to the empty ocr-fileformats directory.
I believe the workflow-configuration example for ALTO conversion in the README is wrong, as it is identical to the PAGE → text example.
I got:
+ocr-transform page alto OCR-D-OCR/OCR-D-OCR_0001.xml FULLTEXT/FULLTEXT_0001.xml -- --no-check-border --dummy-word
+SHAREDIR=/usr/share/ocr-fileformat
+source /usr/share/ocr-fileformat/lib.sh
/usr/bin/ocr-transform: line 4: COLORTERM: unbound variable
The latter is the fourth line of lib.sh:
if [[ -n "$COLORTERM" || "$TERM" = *color* || "$TERM" = xterm* ]];then
The only way this can fail AFAICS is if the shell has nounset enabled (i.e. set -u).
But ocr-fileformat's ocr-transform.sh itself only uses set -e.
We have set -u here in ocrd-fileformat-transform, but IIUC these shell options cannot be inherited.
So where does this come from?
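Independent of where nounset comes from, the check can be made robust against it by giving the variables an empty default. The following is a sketch only, written with POSIX constructs rather than the `[[ ... ]]` form used in lib.sh, and the function name `supports_color` is hypothetical. One thing that may be worth checking (an assumption, not verified for this case) is whether SHELLOPTS is exported in the calling environment, since bash enables the options listed there at startup.

```shell
# Hypothetical rewrite of the colour check that is safe under `set -u`.
supports_color() {
    # ${VAR:-} expands to an empty string instead of aborting when VAR is unset
    if [ -n "${COLORTERM:-}" ]; then
        return 0
    fi
    case ${TERM:-} in
        *color*|xterm*) return 0 ;;
    esac
    return 1
}
```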
Currently the conversion from PAGE XML to text ("page text") creates text files with the file extension .xml, which is unexpected and can cause problems with viewers which expect XML but get pure text.
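One way to avoid this would be to derive the default extension from the target format instead of always using .xml. This is a minimal sketch under that assumption; the function name `default_extension` is hypothetical, and only the "text" case is taken from this report.

```shell
# Hypothetical helper: choose an output extension from the target format.
default_extension() {
    case $1 in
        text) printf '%s\n' .txt ;;   # plain text output should not be named .xml
        *)    printf '%s\n' .xml ;;   # keep the current default for other targets
    esac
}
```

With this, the target "text" from "page text" yields .txt, while "alto" and "page" keep .xml.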
The alto page transform does not set /PcGts/Page/@imageFilename if the input had no /alto/description/sourceImageInformation/@fileName. It is impossible to fix that with OCR-D means (even ocrd workspace).
It would be very helpful if this processor had some fix-up capability for this important case (and probably others).
My suggestion would be to try to find the "correct" image file by looking up the physical pageId for the ALTO file and then among the image-only fileGrps taking the first (or the largest, or a parameter-configured) entry for that page.
When running on a workspace with a URL ending in anything but mets.xml, the input gets processed, but the METS is not updated.
The reason seems to be that the final ocrd workspace bulk-add command should use ocrd workspace -m ${ocrd__argv[mets]} bulk-add (but after we chdired we should strip the directory part).
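Stripping the directory part can be done with parameter expansion, without calling an external tool. A small sketch, assuming the script has already changed into the directory that contains the METS file; the function name `mets_name` and the example path are hypothetical.

```shell
# Hypothetical helper: after `cd` into the workspace directory, only the
# file name part of the METS path is still valid.
mets_name() {
    # ${1##*/} removes everything up to and including the last slash
    printf '%s\n' "${1##*/}"
}
```

The call would then look like `ocrd workspace -m "$(mets_name "${ocrd__argv[mets]}")" bulk-add ...`, with the remaining bulk-add arguments unchanged.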
Using the workspace actevedef_718448162.first-page.ocrd_fileformat_fail.zip I get the following error:
% ocrd-fileformat-transform -I OCR-D-OCR-TESS -O TMP.$RANDOM
17:05:35.001 INFO ocrd-fileformat-transform - page --> alto: input file OCR-D-OCR-TESS_00000024 (PHYS_0024)
Error writing target ALTO XML file
cvc-length-valid: Value 'oͤ' with length = '2' is not facet-valid with respect to length '1' for type '#AnonType_CONTENTGlyphType'.
cvc-attribute.3: The value 'oͤ' of attribute 'CONTENT' on element 'Glyph' is not valid with respect to its type, 'null'.
cvc-length-valid: Value 'uͤ' with length = '2' is not facet-valid with respect to length '1' for type '#AnonType_CONTENTGlyphType'.
cvc-attribute.3: The value 'uͤ' of attribute 'CONTENT' on element 'Glyph' is not valid with respect to its type, 'null'.
cvc-length-valid: Value 'uͤ' with length = '2' is not facet-valid with respect to length '1' for type '#AnonType_CONTENTGlyphType'.
cvc-attribute.3: The value 'uͤ' of attribute 'CONTENT' on element 'Glyph' is not valid with respect to its type, 'null'.
[ ... more messages like the above ...]
cvc-attribute.3: The value 'aͤ' of attribute 'CONTENT' on element 'Glyph' is not valid with respect to its type, 'null'.
17:05:38.950 INFO ocrd-fileformat-transform - Successfully executed: ocr-transform page alto OCR-D-OCR-TESS/OCR-D-OCR-TESS_00000024.xml TMP.25711/TMP.25711_00000024.xml --
17:05:39.621 INFO ocrd.workspace.save_mets - Saving mets '/home/mike/devel/ocrd-galley/actevedef_718448162.first-page/mets.xml'
The file TMP.25711/TMP.25711_00000024.xml does not exist, so that Successfully executed is misleading ;-)
OCR-D-OCR-TESS was created using ocrd_tesserocr, so maybe there is a problem there too.
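A check for the output file after the conversion would make the log message honest, similar in spirit to the "exited with return value 0 but no file was written" error quoted in another report on this page. The following is only a sketch; `run_and_verify` is a hypothetical wrapper, not code from ocrd-fileformat-transform.

```shell
# Hypothetical wrapper: run a conversion command, then make sure it
# actually produced a non-empty output file before reporting success.
run_and_verify() {
    out_file=$1
    shift
    if ! "$@"; then
        echo "error: conversion failed: $*" >&2
        return 1
    fi
    if [ ! -s "$out_file" ]; then
        echo "error: conversion returned 0 but wrote no file: $out_file" >&2
        return 1
    fi
    echo "Successfully executed: $*"
}
```

Only when this returns 0 would the file be registered in the METS, which also avoids the stale entries described in the PAGE to ALTO report.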
The README could use a usage section with a (real-life) example call. Also, the tool JSON is somewhat incomplete (descriptions, FS/dir vs METS/filegrp perspective).
I believe it would be helpful if the ocrd-fileformat-transform PAGE → ALTO transformation added a <Processing> tag. I looked into the file to figure out whether https://github.com/kba/page-to-alto was used for the conversion and did not find a processing tag for the conversion, just for segmentation/binarization/OCR.