Comments (7)
@Mark-Joy Thanks, it looks like the fix is actually just changing a 1 to -1 in the text transform matrix:
-text.text_transform(Matrix(1, 0, 0, 1, box.llx, 0))
+text.text_transform(Matrix(1, 0, 0, -1, box.llx, 0))
Translation: the invisible font is upside down, and some PDF viewers freak out. 🙃
But I need to test it everywhere first, since the first one slipped through.
v16.0.2 temporarily makes sandwich the main renderer again.
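For anyone wondering what that one-character change in the diff does: in the PDF convention a matrix (a, b, c, d, e, f) maps a point (x, y) to (a*x + c*y + e, b*x + d*y + f), so changing d from 1 to -1 mirrors the vertical axis. Below is a minimal plain-Python sketch of that arithmetic. The `Affine` class and the `llx = 72.0` value are made up for illustration; this is not pikepdf's `Matrix` and not the actual OCRmyPDF code.

```python
from typing import NamedTuple, Tuple


class Affine(NamedTuple):
    """PDF-style affine matrix [a b c d e f] (hypothetical helper for illustration)."""
    a: float
    b: float
    c: float
    d: float
    e: float
    f: float

    def apply(self, x: float, y: float) -> Tuple[float, float]:
        # PDF convention: x' = a*x + c*y + e, y' = b*x + d*y + f
        return (self.a * x + self.c * y + self.e,
                self.b * x + self.d * y + self.f)


llx = 72.0  # assumed left edge of a text box, in PDF points

before = Affine(1, 0, 0, 1, llx, 0)   # d = 1: vertical axis unchanged
after = Affine(1, 0, 0, -1, llx, 0)   # d = -1: vertical axis mirrored

point = (0.0, 10.0)  # a point 10 units above the baseline in text space
print(before.apply(*point))  # (72.0, 10.0)
print(after.apply(*point))   # (72.0, -10.0)
```

Both matrices apply the same horizontal offset; only the sign of the vertical coordinate differs, which is why the text ends up flipped relative to the baseline.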
from ocrmypdf.
A quick fix would be:
Change OCRmyPDF/src/ocrmypdf/hocrtransform/_hocr.py, line 290 in 0b6fb62, to
bottom_left_corner = line_box.llx, line_box.lly
And change OCRmyPDF/src/ocrmypdf/hocrtransform/_hocr.py, line 312 in 0b6fb62, to
fontsize = line_box_height - intercept
from ocrmypdf.
16.0.2 is a temporary fix - I'll close this issue when there's a full solution and the new renderer can be reinstated as default.
The sandwich renderer (the default in 16.0.2 and in versions before 16) has a number of issues, like wordsegmentationproblemsinsomecases and registration (aligning the selectable text with the actual text on the page). The new renderer, which is mostly a reimplementation of sandwich to fix those issues, doesn't work universally yet, but does bring significant improvements.
from ocrmypdf.
Thanks for 16.0.2 - I can confirm that I am witnessing the same issue with my PDFs, as described in #1214. The Persian text issues persist in 16.0.2 unfortunately.
from ocrmypdf.
Thanks all -- v16.0.2 fixes the issue for me, very pleased!
from ocrmypdf.
Thanks for the fix. Can confirm that this bug caused several hours of head scratching today before I saw this issue. Really appreciate all the great work!
EDIT: I was also seeing a huge number of \ufeff characters in the output. Moving from 16.0.0 to 16.0.2 also fixed this.
from ocrmypdf.
v16.0.4 should contain significant fixes when running in --output-type hocr mode. It has not been made the default.
from ocrmypdf.