internetarchive / archive-hocr-tools

Efficient hOCR tooling

License: Other

Python 99.85% HTML 0.15%

archive-hocr-tools's Introduction

archive-hocr-tools

This repository contains a Python package for parsing hOCR efficiently, along with a set of tools for operating on and analysing hOCR files.

hocr-combine-stream: A tool to combine many hOCR files into one large hOCR file while keeping memory usage low. Used internally to combine per-page Tesseract results into a single hOCR file for an entire book.
hocr-pagenumbers: A tool to find page numbers in multi-page hOCR documents.
hocr-fold-chars: A tool to transform a per-character hOCR file into a per-word hOCR file.
pdf-to-hocr: A tool to extract the text content embedded in a PDF as hOCR.
More tools are available in the ./bin directory; not all of them have been documented yet.

The Python library is called hocr.


archive-hocr-tools's Issues

hocr-combine-stream output contains multiple </body></html> tags, producing invalid XML

When trying out hocr-combine-stream -g "*.hocr" > combined.html to merge several hOCR files produced by Tesseract, the resulting output contains multiple </body> and </html> tags, one pair at the end of each input page. recode_pdf is unable to work with the combined hOCR file and exits with the error "lxml.etree.XMLSyntaxError: Extra content at the end of the document."

I'm using this for the first time, so I'm not sure if I'm doing something wrong or if this is a bug in one of the tools. I've tried with several different image sets but all had the same result.

Using Arch Linux with software versions:

archive-hocr-tools 1.1.19
archive-pdf-tools 1.4.15
python-lxml 4.8.0
tesseract 5.1.0
python 3.10.4

Steps to reproduce:

$ wget https://ia800303.us.archive.org/28/items/rubaiyatfitzgera00omar/rubaiyatfitzgera00omar_jp2.zip
$ unzip rubaiyatfitzgera00omar_jp2.zip
$ cd rubaiyatfitzgera00omar_jp2
$ fd -e jp2 -x bash -c "magick {} TIFF:- | tesseract --dpi 300 - {.} hocr"
$ hocr-combine-stream -g "*.hocr" > combined.html
$ recode_pdf --from-imagestack "*.jp2" --hocr-file combined.html --dpi 300 --bg-downsample 3 --mask-compression jbig2 -o test.pdf
Traceback (most recent call last):
  File "/home/jw/.local/bin/recode_pdf", line 290, in <module>
    res = recode(args.from_pdf, args.from_imagestack, args.dpi, args.hocr_file,
  File "/home/jw/.local/lib/python3.10/site-packages/internetarchivepdf/recode.py", line 634, in recode
    create_tess_textonly_pdf(hocr_file, tess_tmp_path, in_pdf=in_pdf,
  File "/home/jw/.local/lib/python3.10/site-packages/internetarchivepdf/recode.py", line 110, in create_tess_textonly_pdf
    for idx, hocr_page in enumerate(hocr_iter):
  File "/home/jw/.local/lib/python3.10/site-packages/hocr/parse.py", line 47, in hocr_page_iterator
    for act, elem in doc:
  File "src/lxml/iterparse.pxi", line 210, in lxml.etree.iterparse.__next__
  File "src/lxml/iterparse.pxi", line 195, in lxml.etree.iterparse.__next__
  File "src/lxml/iterparse.pxi", line 230, in lxml.etree.iterparse._read_more_events
  File "src/lxml/parser.pxi", line 1376, in lxml.etree._FeedParser.feed
  File "src/lxml/parser.pxi", line 606, in lxml.etree._ParserContext._handleParseResult
  File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError
  File "/home/jw/rubaiyatfitzgera00omar_jp2/combined.html", line 25
lxml.etree.XMLSyntaxError: Extra content at the end of the document, line 25, column 2

The attached example-hocr.zip file contains the .hocr files and combined.html produced from the preceding example.
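For illustration, here is a minimal sketch of what a well-formed combination looks like: the document header is taken from the first page only, each page contributes just the content of its <body>, and the closing tags are written once at the end. This is not the implementation of hocr-combine-stream; the names combine_hocr and is_well_formed are hypothetical, and the sketch assumes each input is a small per-page document with a lowercase </body> tag that fits in memory.

```python
import re
import xml.etree.ElementTree as ET

BODY_OPEN = re.compile(r"<body[^>]*>")


def combine_hocr(documents):
    """Yield chunks of one combined document from several per-page strings.

    Hypothetical helper: only one input document is handled at a time,
    and the closing tags are emitted exactly once.
    """
    wrote_header = False
    for doc in documents:
        open_match = BODY_OPEN.search(doc)
        close_pos = doc.rfind("</body>")
        if open_match is None or close_pos == -1:
            raise ValueError("input is missing <body> or </body>")
        if not wrote_header:
            # Everything up to and including <body ...> from the first file.
            yield doc[:open_match.end()]
            wrote_header = True
        # Only the content between <body> and </body> for every file.
        yield doc[open_match.end():close_pos]
    if wrote_header:
        yield "</body>\n</html>\n"


def is_well_formed(text):
    """Return True if the text parses as a single XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True
```

Running is_well_formed on a naive concatenation of two pages returns False, which mirrors the "Extra content at the end of the document" error above, while the output of combine_hocr for the same two pages parses cleanly.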

Add detection of image/graphics features

As far as I can tell, with the current setup the generated hOCR files don't include any layout information on the location of images in a document.

From the spec it looks like this should be possible? http://kba.cloud/hocr-spec/1.2/#floats-image

The previous OCR approach using ABBYY did produce picture features, and this allowed some really exciting things like programmatically extracting and exploring images from books, which then resulted in the Internet Archive Book Images project, something that wouldn't really be feasible if you had to download every book page image just to check if it contained any illustrations.

Is there a reason why these features aren't included, or is this something that just needs to be enabled?

Edit:

Ah, I think I might be in the wrong place. I thought this repository related to the generation of the hOCR files. Does that stuff live somewhere else?

Edit2:

Found it: https://git.archive.org/www/tesseract

Make tools more usable with pipes

Many of the tools currently cannot read from special files such as /dev/stdin in bash, or accept input from stdin in general, because of some unnecessary seeks.

Additionally, it would be nice to add some features to filter (for example) by word confidence. This could be done in hocr-text, but we could also have a streaming hOCR filter tool that takes hOCR as input and also outputs hOCR, but only lets words with a certain confidence pass. This would need to be streaming, which makes it a little tricky, but it would be cool to, for example, pipe Tesseract output directly into such a tool.
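As a rough sketch of the simpler, text-only variant of this idea, the following reads hOCR from any file-like object (including a non-seekable pipe) and yields only the words whose confidence meets a threshold. It uses the standard library's xml.etree.ElementTree.iterparse rather than this package's parser, and the function name confident_words and the x_wconf regular expression are assumptions made for the example. A filter that emits hOCR instead of plain text would need more work.

```python
import io
import re
import xml.etree.ElementTree as ET

# Assumed pattern for the confidence value in a word's title attribute.
WCONF = re.compile(r"x_wconf\s+(\d+(?:\.\d+)?)")


def confident_words(stream, min_conf):
    """Yield the text of ocrx_word elements with confidence >= min_conf.

    The stream is read sequentially and never seeked, so it can be a pipe.
    """
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if "ocrx_word" in elem.get("class", "").split():
            match = WCONF.search(elem.get("title", ""))
            if match and float(match.group(1)) >= min_conf:
                yield "".join(elem.itertext())
            # Drop the word's text and attributes once it has been handled.
            elem.clear()


# Small in-memory demonstration; a real pipeline would pass sys.stdin.buffer.
sample = io.BytesIO(
    b'<html><body>'
    b'<span class="ocrx_word" title="bbox 0 0 1 1; x_wconf 95">good</span> '
    b'<span class="ocrx_word" title="bbox 2 0 3 1; x_wconf 30">bad</span>'
    b'</body></html>'
)
print(list(confident_words(sample, min_conf=60)))
```

Because the words are matched by their class attribute rather than by tag name, the sketch should behave the same whether or not the input declares the XHTML namespace.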

Non-integer confidences cause error parsing

If your confidence is not a whole number, parsing it throws an exception at line 186 of parse.py:

Traceback (most recent call last):
  File "/Users/jaredwhiklo/www/DAM/scripts/archive-pdf-tools/bin/recode_pdf", line 302, in <module>
    res = recode(args.from_pdf, args.from_imagestack, args.dpi, args.hocr_file,
  File "/Users/jaredwhiklo/www/DAM/scripts/archive-pdf-tools/./internetarchivepdf/recode.py", line 640, in recode
    create_tess_textonly_pdf(hocr_file, tess_tmp_path, in_pdf=in_pdf,
  File "/Users/jaredwhiklo/www/DAM/scripts/archive-pdf-tools/./internetarchivepdf/recode.py", line 210, in create_tess_textonly_pdf
    word_data = hocr_page_to_word_data(hocr_page, font_scaler)
  File "/Users/jaredwhiklo/www/DAM/scripts/archive-pdf-tools/env/lib/python3.9/site-packages/hocr/parse.py", line 186, in hocr_page_to_word_data
    conf = int(m.group(1).split()[0])
ValueError: invalid literal for int() with base 10: '0.988'

The offending code is:

conf = int(m.group(1).split()[0])

You can also reproduce this in any Python interpreter.

> python3 
Python 3.11.6 (main, Oct  2 2023, 13:45:54) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> print(int("99"))
99
>>> print(int("99.9"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '99.9'

The solution is to convert to float() first.

conf = int(float(m.group(1).split()[0]))

>>> print(int(float("99.9")))
99
>>> print(int(float("99")))
99

Alternatively, keep the confidence as a float instead of converting it to an integer.
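Both options can be sketched in one small helper. The function name parse_word_confidence and the regular expression below are hypothetical and are not taken from parse.py; only the float()-before-int() conversion comes from the fix proposed above.

```python
import re

# Assumed pattern: capture everything after "x_wconf" up to the next ";".
WCONF_RE = re.compile(r"x_wconf\s+([^;]+)")


def parse_word_confidence(title, as_int=True):
    """Return the word confidence from an hOCR title string, or None.

    Going through float() first accepts both "99" and "0.988".
    """
    match = WCONF_RE.search(title)
    if match is None:
        return None
    value = float(match.group(1).split()[0])
    # int() truncates toward zero, so 0.988 becomes 0 when as_int is True.
    return int(value) if as_int else value
```

Note the truncation: with as_int=True the value '0.988' from the traceback parses without error but becomes 0, which is one reason keeping the float may be the better choice.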
