OCR

built & tested using Python 3.11.2

Minimal Python script that finds PDF files in a specified directory, converts them into .txt files using Python Tesseract (pytesseract) & saves the resulting files in the same directory as the original PDF files.

Installation (on Windows 10)

clone repo
enter repo directory: cd ocr
install Tesseract OCR, the engine that Python tesseract (pytesseract) wraps
create virtual environment: py -m venv venv
activate virtual environment: venv\Scripts\activate.bat
update pip: py -m pip install --upgrade pip
install requirements: pip install -r requirements.txt
run program as described below (Usage)
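
Optional check before the first run: a minimal sketch, not part of ocr.py, that looks for the Tesseract executable on PATH. The helper name find_tesseract is hypothetical, and the sketch assumes the executable is called tesseract.

```python
import shutil


def find_tesseract(executable="tesseract"):
    """Return the full path of the Tesseract executable, or None if it is not on PATH.

    Hypothetical helper for illustration; ocr.py itself may not perform this check.
    """
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        # pytesseract can also be pointed at an explicit location via
        # pytesseract.pytesseract.tesseract_cmd = r"<full path to tesseract.exe>"
        print("tesseract not found on PATH")
    else:
        print(f"tesseract found at {path}")
```

If the check prints "not found", either add the Tesseract install folder to PATH or set pytesseract.pytesseract.tesseract_cmd as noted in the comment.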

Usage (on Windows 10)

run py ocr.py
follow instructions & prompts of program

Quickstart

run py ocr.py
press Enter to use sample PDFs in ./samplePDFs subdirectory by default
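
The "press Enter for default" behaviour could look roughly like this. A sketch under assumptions: the function name resolve_target_dir and the exact prompt text are hypothetical, and the real ocr.py may handle input differently.

```python
from pathlib import Path

# Default taken from the README; everything else here is illustrative.
DEFAULT_DIR = Path("./samplePDFs")


def resolve_target_dir(user_input, default=DEFAULT_DIR):
    """Map the raw prompt answer to a directory path.

    An empty answer (user just pressed Enter) falls back to the default.
    Surrounding whitespace and double quotes are stripped, since paths
    pasted on Windows often arrive quoted.
    """
    cleaned = user_input.strip().strip('"')
    return Path(cleaned) if cleaned else Path(default)


if __name__ == "__main__":
    answer = input(f"PDF directory (Enter for {DEFAULT_DIR}): ")
    print(f"using {resolve_target_dir(answer)}")
```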

What happens

ocr.py creates .txt files with content of all PDFs in given directory

for each PDF in target directory (default: ./samplePDFs): pdf2image module converts the PDF's pages into images
- for each image
  - pytesseract module converts text in image to string, which is appended to a .txt file named after the original PDF & saved alongside the original PDF
process takes up to 10 minutes
when finished, content of ./samplePDFs is expected to look like ./samplePDFsResult
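
The loop described above can be sketched as follows. This is not the actual ocr.py; the function name ocr_directory and its two optional parameters are assumptions added so the flow can be tried without Tesseract installed. By default it uses pdf2image.convert_from_path and pytesseract.image_to_string.

```python
from pathlib import Path


def ocr_directory(directory, pdf_to_images=None, image_to_text=None):
    """Write one .txt file next to every PDF in `directory`; return the .txt paths.

    pdf_to_images and image_to_text are hypothetical hooks for illustration;
    when omitted, the real pdf2image / pytesseract functions are imported.
    """
    if pdf_to_images is None:
        from pdf2image import convert_from_path
        pdf_to_images = convert_from_path
    if image_to_text is None:
        import pytesseract
        image_to_text = pytesseract.image_to_string

    written = []
    for pdf_path in sorted(Path(directory).glob("*.pdf")):
        txt_path = pdf_path.with_suffix(".txt")  # same folder, same base name
        # Assumption: output from an earlier run is overwritten, not extended.
        with txt_path.open("w", encoding="utf-8") as out:
            for image in pdf_to_images(str(pdf_path)):  # one image per page
                out.write(image_to_text(image))
                out.write("\n")
        written.append(txt_path)
    return written


if __name__ == "__main__":
    for path in ocr_directory("./samplePDFs"):
        print(f"wrote {path}")
```

Passing the converter and OCR function as parameters is only a convenience for experimenting; the README does not say how ocr.py is structured internally.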

Resources

sample PDFs

Limitations / Known Issues

potentially inaccurate - depending on quality, structure & content of input PDFs (images, charts, ...)

Potential Improvements

adjust OCR settings to real world input PDFs (to achieve best results for expected input)
create REST API to share results with clients (long-term?)
- authentication & encryption (e.g. using JSON Web Token)
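
For the first improvement (adjusting OCR settings), one possible starting point is Tesseract's page segmentation mode (--psm) and engine mode (--oem), which pytesseract accepts through its config argument. A sketch only: the helper build_tesseract_config is hypothetical, and suitable values depend on the real input PDFs.

```python
def build_tesseract_config(psm=6, oem=3, extra=()):
    """Build a Tesseract option string for pytesseract's `config` argument.

    psm: page segmentation mode (0-13), oem: OCR engine mode (0-3).
    Hypothetical helper; the defaults here are illustrative, not tuned.
    """
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13")
    if not 0 <= oem <= 3:
        raise ValueError("oem must be between 0 and 3")
    return " ".join([f"--oem {oem}", f"--psm {psm}", *extra])


# Possible use inside the per-image step (requires pytesseract and Tesseract):
#   text = pytesseract.image_to_string(
#       image, lang="eng", config=build_tesseract_config(psm=4)
#   )
```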
