This is a simple proof of concept (POC) for running OCR on images and documents.
The Node.js code relies on a couple of NPM packages that act as wrappers around Ghostscript and Tesseract, which perform the actual OCR processing.
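To illustrate what such a wrapper does under the hood, here is a minimal sketch that calls the tesseract CLI directly from Node.js. It is not the code used in this POC: the helper names `buildTesseractArgs` and `ocrImage` are made up for this example, and it assumes the `tesseract` binary is on the PATH.

```javascript
// Sketch only: calls the tesseract CLI directly instead of going through
// the NPM wrapper packages used by this project.
const { execFile } = require("node:child_process");

// Hypothetical helper: builds the argument list for the tesseract CLI.
// Passing "stdout" as the output base makes tesseract print the recognized
// text instead of writing it to a .txt file.
function buildTesseractArgs(imagePath, lang = "eng") {
  return [imagePath, "stdout", "-l", lang];
}

// Hypothetical helper: resolves with the recognized text for one image.
function ocrImage(imagePath, lang = "eng") {
  return new Promise((resolve, reject) => {
    execFile("tesseract", buildTesseractArgs(imagePath, lang), (err, stdout) => {
      if (err) reject(err);
      else resolve(stdout.trim());
    });
  });
}
```

For example, `ocrImage("input/example.png", "swe").then(console.log)` would print the text found in a (hypothetical) `input/example.png`, provided the Swedish language data is installed.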
TODO
- Change to Node.js LTS
- Clean up the Dockerfile
- Support Swedish (swe) language in PDF processing
- Investigate PDF processing performance
brew install ghostscript tesseract tesseract-lang
nvm install && nvm use
npm ci
# Set TESSDATA_PREFIX (macOS + Homebrew)
export TESSDATA_PREFIX=/opt/homebrew/share/tessdata
npx nodemon
Add more images/documents to the input/ directory to try it out.
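If OCR fails with a missing-language error, the TESSDATA_PREFIX setting is a likely culprit. The sketch below checks that the expected `<lang>.traineddata` files exist in that directory. The helper name `missingLanguages` is made up for this example and is not part of the POC.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical helper: returns the language codes that have no
// <lang>.traineddata file in the given tessdata directory.
// Defaults to the directory named by the TESSDATA_PREFIX variable.
function missingLanguages(langs, tessdataDir = process.env.TESSDATA_PREFIX) {
  if (!tessdataDir) {
    throw new Error("TESSDATA_PREFIX is not set");
  }
  return langs.filter(
    (lang) => !fs.existsSync(path.join(tessdataDir, `${lang}.traineddata`))
  );
}
```

Calling `missingLanguages(["eng", "swe"])` returns an empty array when both language files are present, and otherwise lists the codes that still need to be installed.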