archive-hocr-tools's Introduction

archive-hocr-tools

This repository contains a Python package for parsing hOCR efficiently, along with a set of tools for operating on and analysing hOCR files.

  • hocr-combine-stream: Combines many hOCR files into one large hOCR file while keeping memory usage low. Used internally to combine per-page Tesseract results into a single hOCR file for an entire book.
  • hocr-pagenumbers: Finds page numbers in multi-page hOCR documents.
  • hocr-fold-chars: Transforms a per-character hOCR file into a per-word hOCR file.
  • pdf-to-hocr: Extracts the text content embedded in a PDF as hOCR.
  • More tools are available in the ./bin directory; not all of them have been documented yet.

The Python library is called hocr.

archive-hocr-tools's People

Contributors

cclauss, jrochkind, merlijnwajer, stweil

archive-hocr-tools's Issues

hocr-combine-stream output contains multiple </body></html> tags, producing invalid xml

When running hocr-combine-stream -g "*.hocr" > combined.html to merge several hOCR files produced by Tesseract, the output contains multiple </body> and </html> tags, one pair at the end of each input page. recode_pdf cannot process the combined file and exits with the error "lxml.etree.XMLSyntaxError: Extra content at the end of the document."

I'm using this for the first time, so I'm not sure whether I'm doing something wrong or whether this is a bug in one of the tools. I've tried several different image sets, all with the same result.

Using Arch Linux with software versions:

archive-hocr-tools 1.1.19
archive-pdf-tools 1.4.15
python-lxml 4.8.0
tesseract 5.1.0
python 3.10.4

Steps to reproduce:

$ wget https://ia800303.us.archive.org/28/items/rubaiyatfitzgera00omar/rubaiyatfitzgera00omar_jp2.zip
$ unzip rubaiyatfitzgera00omar_jp2.zip
$ cd rubaiyatfitzgera00omar_jp2
$ fd -e jp2 -x bash -c "magick {} TIFF:- | tesseract --dpi 300 - {.} hocr"
$ hocr-combine-stream -g "*.hocr" > combined.html
$ recode_pdf --from-imagestack "*.jp2" --hocr-file combined.html --dpi 300 --bg-downsample 3 --mask-compression jbig2 -o test.pdf
Traceback (most recent call last):
  File "/home/jw/.local/bin/recode_pdf", line 290, in <module>
    res = recode(args.from_pdf, args.from_imagestack, args.dpi, args.hocr_file,
  File "/home/jw/.local/lib/python3.10/site-packages/internetarchivepdf/recode.py", line 634, in recode
    create_tess_textonly_pdf(hocr_file, tess_tmp_path, in_pdf=in_pdf,
  File "/home/jw/.local/lib/python3.10/site-packages/internetarchivepdf/recode.py", line 110, in create_tess_textonly_pdf
    for idx, hocr_page in enumerate(hocr_iter):
  File "/home/jw/.local/lib/python3.10/site-packages/hocr/parse.py", line 47, in hocr_page_iterator
    for act, elem in doc:
  File "src/lxml/iterparse.pxi", line 210, in lxml.etree.iterparse.__next__
  File "src/lxml/iterparse.pxi", line 195, in lxml.etree.iterparse.__next__
  File "src/lxml/iterparse.pxi", line 230, in lxml.etree.iterparse._read_more_events
  File "src/lxml/parser.pxi", line 1376, in lxml.etree._FeedParser.feed
  File "src/lxml/parser.pxi", line 606, in lxml.etree._ParserContext._handleParseResult
  File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError
  File "/home/jw/rubaiyatfitzgera00omar_jp2/combined.html", line 25
lxml.etree.XMLSyntaxError: Extra content at the end of the document, line 25, column 2

The attached example-hocr.zip file contains the .hocr files and combined.html produced from the preceding example.
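
To illustrate why naive concatenation fails and what a single well-formed result looks like, here is a minimal sketch using only Python's standard library. It is not how hocr-combine-stream is implemented (that tool streams its input to keep memory usage low, whereas this sketch parses every input in memory). The helper name combine_hocr is hypothetical, and the sketch assumes each input is a well-formed XHTML document whose pages are elements with class ocr_page.

```python
import xml.etree.ElementTree as ET


def combine_hocr(documents):
    """Merge several hOCR document strings into one well-formed document.

    Hypothetical helper for illustration only; it keeps the wrapper
    (<html>, <head>, <body>) of the first document and appends the
    ocr_page elements of all later documents to that <body>.
    """
    if not documents:
        raise ValueError("need at least one hOCR document")

    root = ET.fromstring(documents[0])
    # Match <body> regardless of whether the tag carries an XML namespace.
    body = next(el for el in root.iter() if el.tag.split('}')[-1] == 'body')

    for text in documents[1:]:
        other = ET.fromstring(text)
        for elem in other.iter():
            if elem.get('class') == 'ocr_page':
                body.append(elem)

    return ET.tostring(root, encoding='unicode')


def is_well_formed(text):
    """Return True if text parses as a single XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True
```

Joining the input files end to end leaves content after the first closing </html>, which is what the "Extra content at the end of the document" error reports; is_well_formed returns False for such a string and True for the output of combine_hocr. ElementTree drops the DOCTYPE and may rewrite namespace prefixes, so treat this as an illustration of well-formedness rather than a drop-in replacement for the tool.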

Add detection of image/graphics features

As far as I can tell, with the current setup the generated hOCR files don't include any layout information about the location of images in a document.

From the spec, it looks like this should be possible: http://kba.cloud/hocr-spec/1.2/#floats-image

The previous OCR approach using ABBYY did produce picture features, which allowed some really exciting things like programmatically extracting and exploring images from books. That in turn resulted in the Internet Archive Book Images project, which wouldn't really be feasible if you had to download every book page image just to check whether it contained any illustrations.

Is there a reason why these features aren't included, or is this something that just needs to be enabled?

Edit:

Ah, I think I might be in the wrong place; I thought this repository was related to the generation of the hOCR files. Does that live somewhere else?

Edit2:

Found it: https://git.archive.org/www/tesseract
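
For reference, if an hOCR file does carry the image-related classes described in the spec linked above, locating them needs only the markup and no page images. The following is a rough sketch under that assumption. The function name find_image_regions is hypothetical, the set of class names is taken from the hOCR spec's float elements, and whether a given OCR engine actually emits them is not confirmed here.

```python
import re
import xml.etree.ElementTree as ET

# Class names assumed from the hOCR spec's image-like floats.
IMAGE_CLASSES = {'ocr_image', 'ocr_photo', 'ocr_linedrawing'}
BBOX_RE = re.compile(r'bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)')


def find_image_regions(hocr_text):
    """Return (class, (x0, y0, x1, y1)) for each image-like element."""
    root = ET.fromstring(hocr_text)
    regions = []
    for elem in root.iter():
        if elem.get('class') in IMAGE_CLASSES:
            match = BBOX_RE.search(elem.get('title', ''))
            if match:
                bbox = tuple(int(v) for v in match.groups())
                regions.append((elem.get('class'), bbox))
    return regions
```

An hOCR file without any of these classes simply yields an empty list, which is the situation described in this issue.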

Make tools more usable with pipes

Many of the tools currently cannot read from special files such as /dev/stdin in bash, or accept input from stdin in general, because of some unnecessary seeks.

Additionally, it would be nice to add features such as filtering by word confidence. This could be done in hocr-text, but we could also have a streaming hOCR filter tool that takes hOCR as input and outputs hOCR, letting through only words above a certain confidence. It would need to be streaming, which makes it a little tricky, but it would be cool to, for example, pipe Tesseract output directly into such a tool.
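
As a rough sketch of the seek-free reading described above, the following reads hOCR from any file-like object (including a pipe) and yields only the text of words at or above a confidence threshold. It is a simplification of the proposed filter: it emits plain text rather than hOCR, uses the standard library's iterparse instead of the project's own parser, and the name iter_confident_words and the x_wconf regular expression are assumptions made for this example.

```python
import io
import re
import xml.etree.ElementTree as ET

# Assumed shape of the word confidence inside the title attribute.
WCONF_RE = re.compile(r'x_wconf\s+(\d+(?:\.\d+)?)')


def iter_confident_words(stream, min_conf=80.0):
    """Yield the text of ocrx_word elements with x_wconf >= min_conf.

    The stream is only read forward, never seeked, so it may be a pipe.
    """
    for _event, elem in ET.iterparse(stream, events=('end',)):
        cls = elem.get('class')
        if cls == 'ocrx_word':
            match = WCONF_RE.search(elem.get('title', ''))
            if match and float(match.group(1)) >= min_conf:
                yield ''.join(elem.itertext()).strip()
        elif cls == 'ocr_page':
            # The page's words have already been handled; drop its subtree.
            elem.clear()
```

In a command-line tool this could be driven with iter_confident_words(sys.stdin.buffer); io.BytesIO works as a stand-in when experimenting.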

Non-integer confidences cause error parsing

If the confidence is not a whole number, parsing it raises an exception at line 186 of parse.py:

Traceback (most recent call last):
  File "/Users/jaredwhiklo/www/DAM/scripts/archive-pdf-tools/bin/recode_pdf", line 302, in <module>
    res = recode(args.from_pdf, args.from_imagestack, args.dpi, args.hocr_file,
  File "/Users/jaredwhiklo/www/DAM/scripts/archive-pdf-tools/./internetarchivepdf/recode.py", line 640, in recode
    create_tess_textonly_pdf(hocr_file, tess_tmp_path, in_pdf=in_pdf,
  File "/Users/jaredwhiklo/www/DAM/scripts/archive-pdf-tools/./internetarchivepdf/recode.py", line 210, in create_tess_textonly_pdf
    word_data = hocr_page_to_word_data(hocr_page, font_scaler)
  File "/Users/jaredwhiklo/www/DAM/scripts/archive-pdf-tools/env/lib/python3.9/site-packages/hocr/parse.py", line 186, in hocr_page_to_word_data
    conf = int(m.group(1).split()[0])
ValueError: invalid literal for int() with base 10: '0.988'

The offending code is:

conf = int(m.group(1).split()[0])

You can also reproduce this in any Python interpreter:

> python3 
Python 3.11.6 (main, Oct  2 2023, 13:45:54) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> print(int("99"))
99
>>> print(int("99.9"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '99.9'

The solution is to convert to float() first:

conf = int(float(m.group(1).split()[0]))
>>> print(int(float("99.9")))
99
>>> print(int(float("99")))
99

Alternatively, keep the value as a float instead of converting it to an integer.
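
A minimal sketch of the float-based alternative follows. The regular expression and the helper name parse_wconf are assumptions made for this example, since the pattern actually used in parse.py is not shown in this issue. Keeping the float matters for values like the one in the traceback: int(float('0.988')) evaluates to 0, so truncation would discard the confidence entirely.

```python
import re

# Hypothetical pattern; the real one in hocr/parse.py is not shown above.
WCONF_RE = re.compile(r'x_wconf\s+(\d+(?:\.\d+)?)')


def parse_wconf(title):
    """Return the x_wconf value of an hOCR title attribute as a float.

    Returns None when the title carries no x_wconf property.
    """
    match = WCONF_RE.search(title)
    if match is None:
        return None
    return float(match.group(1))
```

Callers that need an integer can still round the result explicitly, which makes the loss of precision a visible choice.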
