Thank you for this interesting project, which seems to exactly fit my needs, but so fa

A user-friendly example for a scanned multipage PDF needed about archive-pdf-tools HOT 3 OPEN

FilipDominec commented on June 5, 2024

A user-friendly example for a scanned multipage PDF needed

from archive-pdf-tools.

Comments (3)

MerlijnWajer commented on June 5, 2024

Thank you for the report.

I agree with all three points. I had long planned to make more user friendly tooling around this technology but hadn't gotten to this point yet. This is integrated with the Archive.org stack where I also wrote the entire OCR module (which is FOSS) - which I'd have to somehow port and then tie that in to the PDF compression too. The PDF can be compressed without hOCR - but not yet with the current tooling.

Regarding your suggestions:

Agreed
Agreed
A mostly empty hOCR could be made and there is a tool to do this, but I am not sure if the compression would be close to what you would like. That is, the OCR process helps with the quality of the compression.

There are a few more things to say on this:

It is easy to make a "stub" hOCR file, but the compression might suffer. I did work on a tool called pdfcomp just to recompress a given PDF mentioned in this issue: #51 and this issue: ocrmypdf/OCRmyPDF#541 (comment) - it works but could see some more testing.
You can actually make a hOCR directly from a PDF that has a text layer using this tool: https://github.com/internetarchive/archive-hocr-tools/blob/master/bin/pdf-to-hocr - but I'm assuming that your scan doesn't have a PDF.

from archive-pdf-tools.

FilipDominec commented on June 5, 2024

I processed my scanned document with ocrmypdf - it generates nice searchable text overlay.

But an attempt at retrieving hOCR file fails:

$pdf-to-hocr -f scan_searchable.pdf
Traceback (most recent call last):
  File "/home/dominecf/.local/bin/pdf-to-hocr", line 429, in <module>
    process_files(args.infile, args.json_metadata_file)
  File "/home/dominecf/.local/bin/pdf-to-hocr", line 388, in process_files
    metadata = json.load(open(json_metadata_file))
TypeError: expected str, bytes or os.PathLike object, not NoneType

from archive-pdf-tools.

MerlijnWajer commented on June 5, 2024

Okay, the tooling is really under documented. :-(

You first need to make a JSON file from the PDF file so that pdf-to-hocr understands what it is dealing with. The pdfcomp tool that I mentioned does this: https://github.com/internetarchive/archive-pdf-tools/blob/master/bin/pdfcomp

So perhaps you could just try to call pdfcomp on the PDF and see if it does anything sensible? It was made to be plugged into projects like ocrmypdf?

pdfcomp isn't yet a 'first class' citizen of this project, but I think with a small amount of work it can be made quite usable.

from archive-pdf-tools.

Recommend Projects

A user-friendly example for a scanned multipage PDF needed about archive-pdf-tools HOT 3 OPEN

Comments (3)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent