Comments (2)

ybeltukov commented on June 2, 2024

This seems to be related to the hocr PDF renderer, which is now enabled by default. It produces better visual quality (see e.g. #1131), but it makes the OCR layer almost twice as large.

With the option --pdf-renderer sandwich I obtain the following sizes for the same file:
251 KB → 310 KB → 369 KB

So the OCR layer takes 59 KB with sandwich (310 - 251 = 369 - 310 = 59) and 98 KB with hocr.
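The arithmetic behind these figures can be checked with a short sketch. It assumes the three sizes are the same file after successive OCR passes, which the comment implies but does not state. The function name is illustrative only.

```python
def layer_overheads(sizes_kb):
    """Return the size added by each step, given successive file sizes in KB."""
    return [after - before for before, after in zip(sizes_kb, sizes_kb[1:])]


# Sizes reported above for --pdf-renderer sandwich (assumed to be successive passes).
sandwich_sizes_kb = [251, 310, 369]
sandwich_overheads = layer_overheads(sandwich_sizes_kb)
print(sandwich_overheads)  # [59, 59]

# The hocr figure (98 KB) is taken from the comment as stated.
hocr_layer_kb = 98
print(round(hocr_layer_kb / sandwich_overheads[0], 2))  # 1.66
```

On these numbers the hocr layer is about 1.66 times the size of the sandwich layer.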

So the questions are:

  • Is it possible to optimize the hocr renderer?
  • Is it possible to remove a previously added OCR layer without recompressing the images?
    (--force-ocr is not suitable for this task)

from ocrmypdf.

Jmuccigr commented on June 2, 2024

I wrote a script that redoes the OCR on PDFs: it deletes any existing text from the file, uses ocrmypdf to generate new OCR, and then adds that to the original file. I use it mainly to replace the often bad OCR in JSTOR files. It uses Ghostscript to remove the text and relies on a few other tools that you can see in the code.
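The script itself is not shown here, so the following is only a minimal sketch of what the text-removal step might look like, not the author's code. It assumes Ghostscript's pdfwrite device with the -dFILTERTEXT switch, which drops text while rewriting the file. The function names are hypothetical, and whether embedded images come through without re-encoding depends on Ghostscript's settings and is not verified here.

```python
import shutil
import subprocess


def build_strip_text_command(src, dst, gs="gs"):
    """Build a Ghostscript command that rewrites src to dst without text."""
    return [
        gs,
        "-o", dst,            # output file; -o also implies -dBATCH -dNOPAUSE
        "-sDEVICE=pdfwrite",  # write a new PDF
        "-dFILTERTEXT",       # assumption: text is removed this way
        src,
    ]


def strip_text(src, dst):
    """Run the command above if Ghostscript is installed; raise otherwise."""
    if shutil.which("gs") is None:
        raise RuntimeError("Ghostscript (gs) not found on PATH")
    subprocess.run(build_strip_text_command(src, dst), check=True)
```

Calling strip_text("input.pdf", "no_text.pdf") would produce a text-free copy that could then be passed to ocrmypdf. How the new OCR is merged back into the original file is left out, since the comment does not describe it.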

