
Comments (3)

MerlijnWajer commented on June 5, 2024

Thank you for the report.

I agree with all three points. I had long planned to build more user-friendly tooling around this technology, but I haven't gotten to it yet. The tooling is integrated with the Archive.org stack, for which I also wrote the entire OCR module (which is FOSS); I would have to port that somehow and then tie it into the PDF compression too. In principle the PDF can be compressed without hOCR, but not yet with the current tooling.

Regarding your suggestions:

  1. Agreed
  2. Agreed
  3. A mostly empty hOCR could be made and there is a tool to do this, but I am not sure if the compression would be close to what you would like. That is, the OCR process helps with the quality of the compression.

There are a few more things to say on this:

from archive-pdf-tools.

FilipDominec commented on June 5, 2024

I processed my scanned document with ocrmypdf - it generates a nice searchable text overlay.

But an attempt at retrieving the hOCR file fails:

$ pdf-to-hocr -f scan_searchable.pdf
Traceback (most recent call last):
  File "/home/dominecf/.local/bin/pdf-to-hocr", line 429, in <module>
    process_files(args.infile, args.json_metadata_file)
  File "/home/dominecf/.local/bin/pdf-to-hocr", line 388, in process_files
    metadata = json.load(open(json_metadata_file))
TypeError: expected str, bytes or os.PathLike object, not NoneType


MerlijnWajer commented on June 5, 2024

Okay, the tooling is really underdocumented. :-(

You first need to make a JSON file from the PDF file so that pdf-to-hocr understands what it is dealing with. The pdfcomp tool that I mentioned does this: https://github.com/internetarchive/archive-pdf-tools/blob/master/bin/pdfcomp

So perhaps you could just try to call pdfcomp on the PDF and see if it does anything sensible? It was made to be plugged into projects like ocrmypdf.

pdfcomp isn't yet a 'first class' citizen of this project, but I think with a small amount of work it can be made quite usable.
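For context on the traceback above: the failing line is `json.load(open(json_metadata_file))`, and `open(None)` raises exactly that `TypeError` when no metadata file is passed. A minimal sketch of an early check that would give a clearer message might look like the following. The helper name `load_metadata` and the error wording are hypothetical; this is not the code in pdf-to-hocr, only an illustration of the failure mode.

```python
import json
from pathlib import Path


def load_metadata(json_metadata_file):
    """Hypothetical helper: load the JSON metadata, failing with a readable message."""
    # The traceback shows open() being called with None; check for that first.
    if json_metadata_file is None:
        raise ValueError(
            "no JSON metadata file given; it has to be generated from the PDF first"
        )
    path = Path(json_metadata_file)
    if not path.is_file():
        raise FileNotFoundError(f"metadata file not found: {path}")
    # Use a context manager so the file handle is closed after parsing.
    with path.open() as fp:
        return json.load(fp)
```

With a check like this, running the tool without the metadata file would report what is missing instead of ending in a bare `TypeError`.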

