Comments (5)

jrochkind commented on June 11, 2024

OK, I think there may in fact be no way to run recode_pdf on a single page?

If I use --from-imagestack 'some_dir/*', if that dir only has one image in it -- I still get IndexError: list index out of range

If I use eg --from-imagestack /tmp/scan.tiff (an example from README), I also get IndexError: list index out of range.

If I use --from-imagestack 'some_dir/*' on a directory that has at least two image files in it -- it works.

Is there no way to run on a PDF with only one image? Is this a bug?

from archive-pdf-tools.

MerlijnWajer commented on June 11, 2024

I believe the problem is that this hOCR file contains two ocr_page elements, which can happen if you run Tesseract on a TIFF file that contains two images, as seems to be the case here. A TIFF with an embedded thumbnail is also seen as two images.

If you tell Tesseract to use only the first image, this problem will go away; try passing -c tessedit_page_number=0.

The --from-imagestack takes a glob as argument, so it can definitely deal with a single image - the problem here occurs because the hOCR file claims to contain two pages. :-)
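To confirm this diagnosis on your own files, you can count the ocr_page elements in the hOCR output before handing it to recode_pdf. The following is only a rough sketch using Python's standard library; the names OcrPageCounter and count_hocr_pages are made up for illustration and are not part of archive-pdf-tools.

```python
from html.parser import HTMLParser


class OcrPageCounter(HTMLParser):
    """Counts start tags whose class attribute includes 'ocr_page'."""

    def __init__(self):
        super().__init__()
        self.pages = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if "ocr_page" in classes:
            self.pages += 1


def count_hocr_pages(hocr_text: str) -> int:
    """Return the number of ocr_page elements found in hOCR markup."""
    counter = OcrPageCounter()
    counter.feed(hocr_text)
    counter.close()
    return counter.pages
```

For example, calling count_hocr_pages(open("/tmp/test.hocr", encoding="utf-8").read()) on the output of a TIFF with an embedded thumbnail would be expected to report 2 rather than 1, assuming Tesseract was run without tessedit_page_number.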

MerlijnWajer commented on June 11, 2024

Using:

tesseract -c tessedit_page_number=0 insuring_15.tiff - hocr > /tmp/test.hocr
recode_pdf -v --from-imagestack insuring_15.tiff --hocr-file /tmp/test.hocr -o out.pdf

out.pdf

jrochkind commented on June 11, 2024

Oh right, that makes sense!

Thank you for helping me figure this out, and suggesting the tesseract command to only take the first one!

Now that you mention it, I recall us having problems before with these embedded thumbnails, which our production process winds up embedding for reasons we don't really know. But I had forgotten about that, and hadn't noticed the double page in the hOCR.

It seems like it might be better to get a clearer error message like "number of pages in HOCR does not match number of images provided" -- but this might not be the highest priority in archive-pdf-tools.

I will close this issue. Thanks so much for your help, and for providing this code!

MerlijnWajer commented on June 11, 2024

Thanks for the suggestion, I have committed said error message in commit 11661f2. (Untested)
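For readers who want a similar guard in their own wrapper scripts, the kind of check discussed in this thread could look roughly like the sketch below. This is not the code from commit 11661f2; check_page_count is a hypothetical helper, and it assumes you already know how many ocr_page elements the hOCR file contains.

```python
import glob


def check_page_count(hocr_page_count: int, image_glob: str) -> int:
    """Raise a descriptive error if the hOCR page count and image count differ.

    image_glob is a glob pattern like the one passed to --from-imagestack.
    Returns the number of matched images when the counts agree.
    """
    images = sorted(glob.glob(image_glob))
    if hocr_page_count != len(images):
        raise ValueError(
            "number of pages in hOCR (%d) does not match number of "
            "images provided (%d)" % (hocr_page_count, len(images))
        )
    return len(images)
```

Failing early like this turns the bare IndexError into a message that points at the real cause, such as a TIFF with an embedded thumbnail producing an extra hOCR page.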
