Research

Selection options

PyMuPDF Pros
1. Recent Github activity; commits + closed issues + closed pull requests
2. Github popularity; 2.6k Stars + 325 Forks
3. Integrates Google Tesseract Engine for Optical Character Recognition (OCR)
4. File conversions, to and from Pdf or other formats
5. Wide range of support for working with text, images, drawings, shape objects, forms in pdf files
6. Multiprocessing
Cons
1. Extracting from a table in pdf files
Unknowns
1. Batch processing of files
2. Async operations
PyTesseract Pros
1. Recent Github activity; commits + closed issues + closed pull requests
2. Github popularity; 4.9k Stars + 659 Forks
3. Wraps around Google Tesseract Engine for Optical Character Recognition (OCR)
4. File output conversions, to and from Pdf or other formats
5. Language setting
Cons
1. ??
Unknowns
1. Batch processing of files
2. Multiprocessing
3. Async operations
Textract Pros
1. Wide range of file support
2. Github popularity; 3.5k Stars + 528 Forks
3. Uses Google Tesseract Engine for Optical Character Recognition (OCR)
4. Works with video, audio, doc files
5. Language setting
Cons
1. Minimal recent Github activity; commits + closed issues + closed pull requests
Unknowns
1. Batch processing of files
2. Multiprocessing
3. Async operations
PdfMiner.Six Pros
1. Available in Command line
2. Github popularity; 4.6k Stars + 834 Forks
3. Wide range of support for working with text, shape objects, images in pdf files
4. File output generation, to and from Pdf or other formats
Cons
1. Minimal recent Github activity; commits + closed issues + closed pull requests
2. Extracting from a table in pdf files
Unknowns
1. Batch processing of files
2. Multiprocessing
3. Async operations
PdfPlumber Pros
1. Recent Github activity; commits + closed issues + closed pull requests
2. Github popularity; 4k Stars + 504 Forks
3. Wide range of support for working with text, lines, shape objects, images, tables, forms in pdf files
4. Visual debugging using ImageMagick implementation
5. Pdf file/single page conversion to image
Cons
1. Works with pdf files only
2. Does not support Optical Character Recognition (OCR)
3. Generating a pdf file from another format
Unknowns
1. Batch processing of files
2. Multiprocessing
3. Async operations
PyPdf Pros
1. Recent Github activity; commits + closed issues + closed pull requests
2. Github popularity; 5.8k Stars + 1.2k Forks
3. Wide range of support for working with text and metadata in pdf files
Cons
1. Works with pdf files only
2. Extracting from a image, table, shape objects in pdf files
3. Does not support Optical Character Recognition (OCR)
Unknowns
1. Batch processing of files
2. Multiprocessing
3. Async operations

Setup

# Install Tesseract OCR engine
sudo apt install -y tesseract-ocr
sudo apt install -y libtesseract-dev

# Setup virtual enviroment
python3 -m venv .venv

# Activate the virtual environment
. ./.venv/bin/activate

# Upgrade pip to the latest version
pip install --upgrade pip

# Install the python packages
pip install -r requirements.txt

# Running the app
python3 app.py

Challenges

Flask uploads doesn't seem to work correctly and gives an error message about werkzeug

ImportError: cannot import name 'secure_filename' from 'werkzeug'

Solution: Decided to use file type extensions instead of trying another module like flask-Reuploaded

OCR conversion

Couldn't get PymuPDF to utilize the tesseract OCR engine and fell short in handling documents that required OCR

Solution: imported pytesseract which wraps around the tesseract engine to handle OCR file processing

n1klaus / file-data-extraction Goto Github PK

file-data-extraction's Introduction

Research

Selection options

Setup

Challenges

Screenshots

Video

file-data-extraction's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent