IndexError: list index out of range (single TIFF file) (archive-pdf-tools, closed)

jrochkind commented on June 11, 2024

IndexError: list index out of range (single TIFF file)

from archive-pdf-tools.

Comments (5)

jrochkind commented on June 11, 2024

OK, I think there may in fact be no way to run recode_pdf on a single page?

If I use --from-imagestack 'some_dir/*' and that directory has only one image in it, I still get IndexError: list index out of range.

If I use e.g. --from-imagestack /tmp/scan.tiff (an example from the README), I also get IndexError: list index out of range.

If I use --from-imagestack 'some_dir/*' on a directory that has at least two image files in it -- it works.

Is there no way to run it on a PDF with only one image? Is this a bug?

MerlijnWajer commented on June 11, 2024

I believe the problem is that this hocr file contains two ocr_page elements, which can happen if you run Tesseract on a TIFF file that contains two images - this seems to be the case here. A tiff with an embedded thumbnail is also seen as two images.

If you tell Tesseract to use only the first image, this problem will go away, try passing -c tessedit_page_number=0.

The --from-imagestack takes a glob as argument, so it can definitely deal with a single image - the problem here occurs because the hOCR file claims to contain two pages. :-)
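One way to confirm this diagnosis before running recode_pdf is to count the ocr_page elements in the hOCR file. The following is only a minimal sketch using the Python standard library; the names OcrPageCounter and count_hocr_pages are made up for this illustration and are not part of archive-pdf-tools.

```python
from html.parser import HTMLParser


class OcrPageCounter(HTMLParser):
    """Counts elements whose class attribute includes 'ocr_page'."""

    def __init__(self):
        super().__init__()
        self.pages = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if "ocr_page" in classes:
            self.pages += 1


def count_hocr_pages(hocr_text):
    """Return the number of ocr_page elements in an hOCR document.

    Hypothetical helper for illustration, not an archive-pdf-tools API.
    """
    counter = OcrPageCounter()
    counter.feed(hocr_text)
    counter.close()
    return counter.pages
```

Reading the hOCR file into a string and passing it to count_hocr_pages should give 1 for a single-image input; a result of 2 for a single TIFF would suggest a second embedded image such as a thumbnail.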

MerlijnWajer commented on June 11, 2024

Using:

tesseract -c tessedit_page_number=0 insuring_15.tiff - hocr > /tmp/test.hocr
recode_pdf -v --from-imagestack insuring_15.tiff --hocr-file /tmp/test.hocr -o out.pdf

out.pdf

jrochkind commented on June 11, 2024

Oh right, that makes sense!

Thank you for helping me figure this out, and suggesting the tesseract command to only take the first one!

Now that you mention it, I recall us having problems before with these embedded thumbnails, which our production process winds up embedding for reasons we don't really know. But I had forgotten about that, and hadn't noticed the double page in the hOCR.

It seems like it might be better to get a clearer error message like "number of pages in hOCR does not match number of images provided" -- but this might not be the highest priority in archive-pdf-tools.
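To make the suggestion concrete, such a check could look roughly like the sketch below. This is purely illustrative: check_page_counts is a hypothetical function, it assumes the hOCR page count and the expanded list of image paths are already known, and it does not reflect how archive-pdf-tools is actually structured internally.

```python
def check_page_counts(hocr_page_count, image_paths):
    """Raise ValueError if the hOCR page count differs from the image count.

    Hypothetical pre-flight check for illustration only.
    hocr_page_count: number of ocr_page elements found in the hOCR file.
    image_paths: list of image files, e.g. the expanded --from-imagestack glob.
    """
    image_count = len(image_paths)
    if hocr_page_count != image_count:
        raise ValueError(
            "number of pages in hOCR (%d) does not match "
            "number of images provided (%d)" % (hocr_page_count, image_count)
        )
```

With a check like this, the single-TIFF case above would fail early with a message naming both counts instead of an IndexError.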

I will close this issue. Thanks so much for your help, and for providing this code!

MerlijnWajer commented on June 11, 2024

Thanks for the suggestion, I have committed said error message in commit 11661f2. (Untested)
