Comments (2)
It seems to me that this is related to the hocr PDF renderer, which is now enabled by default. It produces better visual quality (see e.g. #1131), but it almost doubles the size of the OCR layer.
With the option --pdf-renderer sandwich I obtain the following sizes for the same file:
251 KB → 310 KB → 369 KB
So the OCR layer takes 59 KB with sandwich and 98 KB with hocr.
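A comparison like this could be reproduced with a small script along the following lines. This is only a rough sketch: the helper names (build_ocr_command, overhead_ratio, compare_renderers) are made up for illustration, it assumes ocrmypdf is on the PATH, and it assumes the input PDF has no text layer yet. The measured growth also includes any other changes ocrmypdf makes to the file, so it only approximates the size of the OCR layer.

```python
import os
import subprocess


def build_ocr_command(src, dst, renderer):
    """Return an ocrmypdf command line that selects the given PDF renderer."""
    return ["ocrmypdf", "--pdf-renderer", renderer, src, dst]


def overhead_ratio(sandwich_kb, hocr_kb):
    """Ratio of the hocr layer size to the sandwich layer size."""
    return hocr_kb / sandwich_kb


def compare_renderers(src, renderers=("sandwich", "hocr")):
    """Run ocrmypdf once per renderer and return file growth in bytes.

    Hypothetical helper: assumes `src` has no existing text layer, otherwise
    ocrmypdf refuses to run without an extra option.
    """
    base_size = os.path.getsize(src)
    stem = os.path.splitext(src)[0]
    growth = {}
    for renderer in renderers:
        dst = f"{stem}.{renderer}.pdf"
        subprocess.run(build_ocr_command(src, dst, renderer), check=True)
        growth[renderer] = os.path.getsize(dst) - base_size
    return growth


if __name__ == "__main__":
    # Figures quoted in the comment above: 59 KB (sandwich) vs 98 KB (hocr).
    print(f"hocr / sandwich = {overhead_ratio(59, 98):.2f}")
```

With the figures from this comment, the hocr layer comes out at roughly 1.66 times the sandwich layer.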
So the questions are:
- Is it possible to optimize the hocr renderer?
- Is it possible to remove a previously added OCR layer without image recompression? (--force-ocr is not suitable for this task.)
from ocrmypdf.
I wrote a script that redoes the OCR on PDFs: it deletes any existing text from the file, runs ocrmypdf to generate new OCR, and then adds that OCR to the original file. I use it mainly to replace the often poor OCR in JSTOR files. It uses Ghostscript to remove the text and relies on some other things that you can see in the code.
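The script itself is not reproduced here, so the following is not the author's code, only a minimal sketch of the first two steps as described. It assumes Ghostscript (gs) and ocrmypdf are on the PATH and uses Ghostscript's -dFILTERTEXT switch with the pdfwrite device to drop text. The function names are hypothetical, and the final step of merging the new OCR back into the original file is left out because the comment does not say how it is done. Note that -dFILTERTEXT removes all text, visible or not, and that pdfwrite may re-encode images depending on its settings.

```python
import subprocess


def strip_text_command(src, dst):
    """Ghostscript command that rewrites `src` to `dst` without any text."""
    # -o sets the output file and implies batch mode with no pauses.
    return ["gs", "-o", dst, "-sDEVICE=pdfwrite", "-dFILTERTEXT", src]


def ocr_command(src, dst):
    """ocrmypdf command that adds a fresh OCR layer to `src`."""
    return ["ocrmypdf", src, dst]


def redo_ocr(src, stripped, dst):
    """Strip existing text with Ghostscript, then OCR the stripped file.

    Hypothetical helper: merging the result back into the original PDF,
    as the comment describes, is not covered by this sketch.
    """
    subprocess.run(strip_text_command(src, stripped), check=True)
    subprocess.run(ocr_command(stripped, dst), check=True)
```

A call such as redo_ocr("article.pdf", "article.notext.pdf", "article.ocr.pdf") would leave the original untouched and write the two intermediate results next to it.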