Comments (3)
Thank you for the report.
I agree with all three points. I have long planned to build more user-friendly tooling around this technology but haven't gotten to it yet. The code is integrated with the Archive.org stack, for which I also wrote the entire OCR module (which is FOSS); I would have to port that module somehow and then tie it in to the PDF compression too. A PDF can in principle be compressed without hOCR, but not yet with the current tooling.
Regarding your suggestions:
- Agreed
- Agreed
- A mostly empty hOCR could be made and there is a tool to do this, but I am not sure if the compression would be close to what you would like. That is, the OCR process helps with the quality of the compression.
There are a few more things to say on this:
- It is easy to make a "stub" hOCR file, but the compression might suffer. I did work on a tool called pdfcomp just to recompress a given PDF, mentioned in this issue: #51 and this issue: ocrmypdf/OCRmyPDF#541 (comment). It works but could use some more testing.
- You can actually make an hOCR file directly from a PDF that has a text layer using this tool: https://github.com/internetarchive/archive-hocr-tools/blob/master/bin/pdf-to-hocr - but I'm assuming that your scan doesn't have a text layer.
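To illustrate what a "stub" hOCR file could look like, here is a minimal sketch in Python. It is not the tool mentioned above: `make_stub_hocr` is a hypothetical helper, the page sizes are made-up values, and whether the tooling in this project accepts such a file (or compresses well with it) is an assumption that would need testing. It only emits one empty `ocr_page` element per page, following the general hOCR conventions (`bbox` and `ppageno` in the `title` attribute).

```python
from xml.sax.saxutils import quoteattr


def make_stub_hocr(page_sizes):
    """Return a minimal hOCR document with one empty ocr_page per page.

    page_sizes is an iterable of (width, height) tuples in pixels.
    Hypothetical helper for illustration only.
    """
    pages = []
    for number, (width, height) in enumerate(page_sizes, start=1):
        # hOCR stores the page bounding box and physical page number in "title".
        title = f"bbox 0 0 {int(width)} {int(height)}; ppageno {number - 1}"
        pages.append(
            f'  <div class="ocr_page" id="page_{number}" title={quoteattr(title)}></div>'
        )
    body = "\n".join(pages)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<html xmlns="http://www.w3.org/1999/xhtml">\n'
        ' <head>\n'
        '  <meta name="ocr-system" content="stub" />\n'
        '  <meta name="ocr-capabilities" content="ocr_page" />\n'
        ' </head>\n'
        ' <body>\n'
        f'{body}\n'
        ' </body>\n'
        '</html>\n'
    )


if __name__ == "__main__":
    # Two A4 pages at 300 dpi (illustrative values).
    print(make_stub_hocr([(2480, 3508), (2480, 3508)]))
```

Because the pages contain no words, there is no OCR information to guide the compression, which is the trade-off described above.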
from archive-pdf-tools.
I processed my scanned document with ocrmypdf - it generates a nice searchable text overlay.
But an attempt at retrieving hOCR file fails:
$ pdf-to-hocr -f scan_searchable.pdf
Traceback (most recent call last):
  File "/home/dominecf/.local/bin/pdf-to-hocr", line 429, in <module>
    process_files(args.infile, args.json_metadata_file)
  File "/home/dominecf/.local/bin/pdf-to-hocr", line 388, in process_files
    metadata = json.load(open(json_metadata_file))
TypeError: expected str, bytes or os.PathLike object, not NoneType
from archive-pdf-tools.
Okay, the tooling is really underdocumented. :-(
You first need to make a JSON file from the PDF file so that pdf-to-hocr understands what it is dealing with. The pdfcomp tool that I mentioned does this: https://github.com/internetarchive/archive-pdf-tools/blob/master/bin/pdfcomp
So perhaps you could just try to call pdfcomp on the PDF and see if it does anything sensible? It was made to be plugged into projects like ocrmypdf.
pdfcomp isn't yet a 'first class' citizen of this project, but I think with a small amount of work it can be made quite usable.
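For anyone hitting the same traceback: the TypeError comes from Python's built-in open() being handed None, because no JSON metadata file was given. The sketch below shows how that could be turned into a readable error. It is only an illustration, not the actual pdf-to-hocr code: `load_metadata` is a hypothetical helper, and only the variable name `json_metadata_file` is taken from the traceback.

```python
import json


def load_metadata(json_metadata_file):
    """Load the JSON metadata, failing with a clear message if it is missing.

    Hypothetical helper; the real script calls json.load(open(...)) directly.
    """
    if json_metadata_file is None:
        # open(None) would raise the TypeError seen in the traceback above.
        raise ValueError(
            "no JSON metadata file given; create one from the PDF first"
        )
    with open(json_metadata_file, encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    try:
        load_metadata(None)
    except ValueError as error:
        print(f"error: {error}")
```

The explicit check replaces an opaque TypeError with a message that tells the user which input is missing.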
from archive-pdf-tools.