I'm using this for the first time, so I'm not sure whether I'm doing something wrong or whether this is a bug in one of the tools. I've tried several different image sets, and all of them gave the same result.
$ wget https://ia800303.us.archive.org/28/items/rubaiyatfitzgera00omar/rubaiyatfitzgera00omar_jp2.zip
$ unzip rubaiyatfitzgera00omar_jp2.zip
$ cd rubaiyatfitzgera00omar_jp2
$ fd -e jp2 -x bash -c "magick {} TIFF:- | tesseract --dpi 300 - {.} hocr"
$ hocr-combine-stream -g "*.hocr" > combined.html
$ recode_pdf --from-imagestack "*.jp2" --hocr-file combined.html --dpi 300 --bg-downsample 3 --mask-compression jbig2 -o test.pdf
Traceback (most recent call last):
  File "/home/jw/.local/bin/recode_pdf", line 290, in <module>
    res = recode(args.from_pdf, args.from_imagestack, args.dpi, args.hocr_file,
  File "/home/jw/.local/lib/python3.10/site-packages/internetarchivepdf/recode.py", line 634, in recode
    create_tess_textonly_pdf(hocr_file, tess_tmp_path, in_pdf=in_pdf,
  File "/home/jw/.local/lib/python3.10/site-packages/internetarchivepdf/recode.py", line 110, in create_tess_textonly_pdf
    for idx, hocr_page in enumerate(hocr_iter):
  File "/home/jw/.local/lib/python3.10/site-packages/hocr/parse.py", line 47, in hocr_page_iterator
    for act, elem in doc:
  File "src/lxml/iterparse.pxi", line 210, in lxml.etree.iterparse.__next__
  File "src/lxml/iterparse.pxi", line 195, in lxml.etree.iterparse.__next__
  File "src/lxml/iterparse.pxi", line 230, in lxml.etree.iterparse._read_more_events
  File "src/lxml/parser.pxi", line 1376, in lxml.etree._FeedParser.feed
  File "src/lxml/parser.pxi", line 606, in lxml.etree._ParserContext._handleParseResult
  File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError
  File "/home/jw/rubaiyatfitzgera00omar_jp2/combined.html", line 25
lxml.etree.XMLSyntaxError: Extra content at the end of the document, line 25, column 2
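
Since the error points at combined.html rather than at the images, one way to narrow this down is to check whether each per-page .hocr file and the combined file are well-formed XML. Below is a minimal sketch that uses only Python's standard library. The helper names check_well_formed and check_files are hypothetical, made up for this illustration, and are not part of the hocr or internetarchivepdf packages. It also assumes that xml.etree.ElementTree is a close enough stand-in for the lxml parser shown in the traceback, which may not hold in every case.

```python
import glob
import xml.etree.ElementTree as ET


def check_well_formed(path):
    """Return None if the file parses as XML, else a short error string.

    Hypothetical helper: uses the stdlib parser as a stand-in for lxml.
    """
    try:
        ET.parse(path)
    except ET.ParseError as exc:
        # str(exc) already includes the line and column of the failure.
        return f"{path}: {exc}"
    return None


def check_files(pattern):
    """Check every file matching the glob pattern, in sorted order,
    and return the list of error strings (empty if all files parse)."""
    errors = []
    for path in sorted(glob.glob(pattern)):
        error = check_well_formed(path)
        if error is not None:
            errors.append(error)
    return errors


if __name__ == "__main__":
    # Run from the directory that holds the .hocr files and combined.html.
    for problem in check_files("*.hocr") + check_files("combined.html"):
        print(problem)
```

If every per-page file passes and only combined.html is reported, that would suggest the problem lies in the combining step rather than in the tesseract output, though this check alone cannot confirm the cause.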