built & tested using Python 3.11.2
Minimal version of a Python script that finds PDF files in a specified directory, converts them into .txt files using Python Tesseract & saves the resulting files in the same directory as the original PDF files.
- clone repo
- enter repo directory:
cd ocr
- install Tesseract OCR (the engine that Python Tesseract / pytesseract wraps)
- create virtual environment:
py -m venv venv
- activate virtual environment:
venv\Scripts\activate.bat
- update pip:
py -m pip install --upgrade pip
- install requirements:
pip install -r requirements.txt
- run program as described below (Usage)
- run
py ocr.py
- follow instructions & prompts of program
- run
py ocr.py
- press Enter to use sample PDFs in ./samplePDFs subdirectory by default
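The "press Enter for the default" behaviour can be sketched as below. This is a minimal illustration, not the code in ocr.py: the function name `resolve_target_dir` and the constant `DEFAULT_DIR` are hypothetical, and only the ./samplePDFs default comes from this README.

```python
from pathlib import Path

# Default directory as stated in this README.
DEFAULT_DIR = Path("./samplePDFs")


def resolve_target_dir(raw_answer: str, default: Path = DEFAULT_DIR) -> Path:
    """Return the directory to scan.

    An empty answer (the user just pressed Enter) selects the default.
    """
    answer = raw_answer.strip()
    return Path(answer) if answer else default
```

A script could call it as `resolve_target_dir(input("PDF directory: "))`; the prompt wording here is an assumption.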
ocr.py creates .txt files with the content of all PDFs in the given directory
- for each PDF in target directory (default: ./samplePDFs): pdf2image module used to convert PDF into images
  - for each image: pytesseract module used to convert text in image to string, then text appended to .txt file with name of original PDF & saved alongside original PDF
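The steps above can be sketched roughly as follows. This is an assumption-laden outline rather than the actual contents of ocr.py: the helper names `find_pdfs` and `ocr_pdf` are hypothetical, and the sketch rewrites the .txt file on each run instead of appending to an existing one. It relies on `pdf2image.convert_from_path` and `pytesseract.image_to_string`.

```python
from pathlib import Path


def find_pdfs(directory: Path) -> list[Path]:
    """Return the PDF files directly inside `directory` (no recursion)."""
    return sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def ocr_pdf(pdf_path: Path) -> Path:
    """OCR one PDF page by page and save the text next to the original."""
    # Imported here so find_pdfs() works without the OCR dependencies.
    from pdf2image import convert_from_path
    import pytesseract

    txt_path = pdf_path.with_suffix(".txt")  # same name, same directory
    pages = convert_from_path(str(pdf_path))  # one PIL image per page
    with txt_path.open("w", encoding="utf-8") as out:
        for page in pages:
            out.write(pytesseract.image_to_string(page))
            out.write("\n")
    return txt_path
```

Looping over `find_pdfs(Path("./samplePDFs"))` and calling `ocr_pdf` on each entry would then produce one .txt file per PDF.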
- process takes up to 10 minutes
- content of ./samplePDFs expected to look like ./samplePDFsResult eventually
- https://www.africau.edu/images/default/sample.pdf
- https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf
- https://www.adobe.com/support/products/enterprise/knowledgecenter/media/c4611_sample_explain.pdf
- https://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf
- https://file-examples.com/index.php/sample-documents-download/sample-pdf-download/
- https://www.orimi.com/pdf-test.pdf
- results potentially inaccurate, depending on quality, structure & content of input PDFs (images, charts, ...)
- adjust OCR settings to real world input PDFs (to achieve best results for expected input)
- create REST API to share results with clients (long-term?)
- authentication & encryption (e.g. using JSON Web Token)
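For the OCR-settings item in the list above, one possible starting point is to pass Tesseract options through the `config` argument of `pytesseract.image_to_string`. The sketch below is only an illustration: the helper names `build_tesseract_config` and `ocr_image` are hypothetical, and suitable values depend on the real input PDFs. `--psm` selects the page segmentation mode (for example 6 for a single uniform block of text) and `--oem` selects the OCR engine mode.

```python
def build_tesseract_config(psm: int = 3, oem: int = 3) -> str:
    """Build the option string handed to Tesseract.

    psm=3 (automatic page segmentation) and oem=3 (default engine)
    mirror Tesseract's own defaults.
    """
    return f"--oem {oem} --psm {psm}"


def ocr_image(image, lang: str = "eng", psm: int = 3) -> str:
    """OCR a single PIL image with the chosen language and segmentation mode."""
    import pytesseract  # imported here so the helper above has no dependencies

    config = build_tesseract_config(psm=psm)
    return pytesseract.image_to_string(image, lang=lang, config=config)
```

Rendering resolution is another setting worth trying: `pdf2image.convert_from_path` accepts a `dpi` argument, and higher values may help with small print at the cost of longer processing.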