Comments (5)
I am happy to submit a patch, if you accept contributions. I would suggest having a command-line option like --normalize-unicode=NFKC, which should be the default. Obviously, the documentation will have to describe why you would want to pick NFC over NFKC. I think it shouldn’t offer NFD normalization, but if someone has a valid use-case, it can be easily added on later.
I am also open to other command-line flags, if you think that users learning about Unicode normalization is too much to impose on them.
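To make the difference between the two forms concrete, here is a minimal sketch using only the Python standard library. The `--normalize-unicode` flag is the option proposed above and does not exist in OCRmyPDF; `normalize_text` is a hypothetical helper name. Note that NFKC expands ½ into 1, U+2044 FRACTION SLASH, 2, which typically reads as 11/2 once it follows another digit.

```python
import argparse
import unicodedata

# Hypothetical flag from the proposal above; not an existing OCRmyPDF option.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--normalize-unicode",
    choices=["NFC", "NFKC"],
    default="NFKC",
    help="Unicode normalization form applied to OCR text",
)


def normalize_text(text: str, form: str) -> str:
    """Normalize text with the requested Unicode normalization form."""
    return unicodedata.normalize(form, text)


# Parse an explicit argument list so the sketch does not depend on sys.argv.
args = parser.parse_args(["--normalize-unicode", "NFC"])

word = "1\u00bd"  # the scanned word "1½"
print(ascii(normalize_text(word, args.normalize_unicode)))  # NFC keeps U+00BD
print(ascii(normalize_text(word, "NFKC")))  # NFKC expands the fraction
```

NFC only composes canonical equivalents, so the vulgar fraction survives; NFKC also applies compatibility mappings, which is where the fraction is lost.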
from ocrmypdf.
@sfllaw I appreciate the suggestion but I think what will really be needed is to insert markup into the PDF that allows competent PDF readers to see what is going on - and then testing to see if it helps sufficiently.
If you want to attempt it, the relevant portion of the spec is below:
from ocrmypdf.
@jbarlow83 If I understand correctly, you are suggesting that we use ActualText to clobber the invisible GlyphLessFont text that Tesseract produces with the NFC normalization?
That is, for the above example with the scanned 1½, OCRmyPDF would produce something like:
/Span<</ActualText(1½)>>
BDC
11/2 Tj
EMC
Maybe I am misunderstanding your proposal, because it seems like this will depend on how the PDF reader deals with ActualText? I have tried this using Evince, Okular, qpdfview, xpdf, and Chrome, and none of them match 1/2 when searching, because the invisible text has been overridden.
Because of this, I can’t think of an advantage over skipping NFKC altogether and rendering the NFC version in GlyphLessFont.
Does ActualText work as you expect in your PDF reader? Did you have a different example in mind?
Also, it looks like some PDF readers don’t handle non-trivial ActualText correctly, but I have not investigated this deeply: ho-tex/accsupp#2
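For anyone who wants to reproduce the experiment, here is a rough sketch of how such a marked-content span could be assembled as bytes. It is an illustration, not OCRmyPDF's code: `pdf_text_string_hex` and `actual_text_span` are hypothetical helper names, and the ActualText value is written as a hex string holding UTF-16BE with a leading byte order mark, which is one of the encodings PDF allows for text strings and sidesteps the single-byte encoding question.

```python
def pdf_text_string_hex(text: str) -> bytes:
    """Encode text as a PDF hex string: UTF-16BE preceded by the FE FF BOM."""
    data = b"\xfe\xff" + text.encode("utf-16-be")
    return b"<" + data.hex().upper().encode("ascii") + b">"


def actual_text_span(show_operand: bytes, actual: str) -> bytes:
    """Wrap a text-showing operand in a /Span that carries ActualText.

    show_operand is passed through untouched; it must already be a valid
    string operand for the font in use (an assumption of this sketch).
    """
    return b"\n".join(
        [
            b"/Span <</ActualText " + pdf_text_string_hex(actual) + b" >>",
            b"BDC",
            show_operand + b" Tj",
            b"EMC",
        ]
    )


span = actual_text_span(b"(11/2)", "1\u00bd")
print(span.decode("ascii"))
```

Whether a given viewer honours the replacement text when searching is a separate question, as the tests with the viewers above suggest.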
from ocrmypdf.
A few key points here:
- The data (not text) inside parentheses should be treated as binary data; if it happens to resemble text, that's just a convenient coincidence.
- The binary data inside parentheses in a PDF is interpreted in PDFDocEncoding, not UTF-8 or any other encoding. Use '½'.encode('pdfdoc') to perform the conversion to bytes.
- The binary data inside parentheses is a list of character IDs to render. The font defines which glyph IDs to render for a given character ID (accents can be rendered as multiple glyph IDs), and (hopefully) provides a mapping from character IDs to Unicode. Most fonts are sane and make character IDs equal to Unicode numbers, but that is by no means required.
When using parentheses in a content stream, the character IDs must be encoded in pdfdoc. However, ½ is U+00BD, which is b'\xbd' in pdfdoc. If you encode a content stream in UTF-8, ½ would be encoded as b'\xc2\xbd', which is not equivalent. Does the hexdump show /ActualText(... 31 BD ...) or /ActualText(... 31 C2 BD ...)? If the latter, that would explain why the text was not recognized: it looks like '1Â½' in pdfdoc.
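The byte-level difference can be checked with a short sketch. The 'pdfdoc' codec is not part of the Python standard library, so this example assumes Latin-1 as a stand-in; PDFDocEncoding and Latin-1 agree for the two code points involved here (0xBD and 0xC2), but not for every byte.

```python
word = "1\u00bd"  # "1½"

# Stand-in for word.encode('pdfdoc'); valid for these code points only.
single_byte = word.encode("latin-1")
utf8 = word.encode("utf-8")

print(single_byte.hex(" "))  # the bytes a pdfdoc-encoded string would hold
print(utf8.hex(" "))         # the bytes written if the stream was saved as UTF-8

# What a reader sees if it interprets the UTF-8 bytes as a single-byte encoding.
misread = utf8.decode("latin-1")
print(ascii(misread))
```

The extra C2 byte decodes to a stray Â in front of ½, so a search for the intended text would no longer match.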
In reference to the final point, GlyphLessFont defines itself as having a 1:1 mapping of Unicode to character ID, and then maps all character IDs to glyph ID 1, which is a blank cell. ActualText is supposed to supply an alternate list of character IDs that are used for searching and copy-pasting, but not for rendering, such as in the example from the PDF reference manual, where hyphenation is present in the rendered text but eliminated in ActualText.
All that said, it's quite possible most PDF viewers don't respect ActualText even when it's properly encoded.
from ocrmypdf.
Thank you for all the details about the intricate Unicode handling in PDFs! However, I’d like to pop the stack and talk about the bigger picture.
When OCRmyPDF encounters the ocrx_word 1½ in an hOCR file, it normalizes that to 11/2, which is a much larger number! You suggested that ActualText markup would allow competent PDF readers to see what is going on, but I don't understand how that would work better than doing NFC normalization instead of NFKC. Since OCRmyPDF already typesets invisible text, why do we need to add ActualText on top of it?
I’d really like to solve this bug in a way that you’d be happy with. Could you please help me understand your proposal?
from ocrmypdf.