Comments (2)
I'll add more languages next time I update ocrmypdf.
The Dockerfile specifies how the container was built. The container provides its own copy of Tesseract and will not use the one on your machine, or anything else about your machine. It's like a lightweight virtual machine.
You can jump inside an ocrmypdf container, modify it, and save the changes as your own private image. (A container is an instance of an image.)
In your case it would go something like this (not tested, made up on the spot):
$ docker run -t -i ocrmypdf /bin/bash
root@container:/# apt-get update
root@container:/# apt-get install tesseract-ocr-chi-sim
root@container:/# exit
$ docker commit -m "Added Chinese simplified" -a "Your Name" <container-id> <new-image-name>
See here:
https://docs.docker.com/engine/userguide/dockerimages/
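The same change can also be written down as a small Dockerfile of your own instead of being committed by hand. This is a minimal, untested sketch. It assumes the base image is available locally under the name `ocrmypdf` (as in the commands above), that it is Debian- or Ubuntu-based so `apt-get` works, and that the install has to run as root.

```dockerfile
# Assumption: the base image is available locally under the name "ocrmypdf".
FROM ocrmypdf

# Assumption: installing packages needs root; the base image may default to
# another user. Note that the resulting image then runs as root as well.
USER root

# Refresh the package lists, then add the Simplified Chinese language data.
RUN apt-get update && apt-get install -y tesseract-ocr-chi-sim
```

Building it with `docker build -t ocrmypdf-chi-sim .` would produce a private image; the tag `ocrmypdf-chi-sim` is a made-up name for illustration.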
from ocrmypdf.
I decided to produce a second version of the container that provides all of Tesseract's languages.
You can use this command to download it. Then Chinese (Simplified and Traditional) will be available.
docker pull jbarlow83/ocrmypdf-polyglot