file-data-extraction's Introduction

Research

Selection options

  1. PyMuPDF

    Pros

    1. Recent GitHub activity (commits, closed issues, closed pull requests)
    2. GitHub popularity: 2.6k stars, 325 forks
    3. Integrates the Google Tesseract engine for Optical Character Recognition (OCR)
    4. File conversion to and from PDF and other formats
    5. Wide range of support for working with text, images, drawings, shape objects, and forms in PDF files
    6. Multiprocessing

    Cons

    1. Extracting data from tables in PDF files is not well supported

    Unknowns

    1. Batch processing of files
    2. Async operations
  2. PyTesseract

    Pros

    1. Recent GitHub activity (commits, closed issues, closed pull requests)
    2. GitHub popularity: 4.9k stars, 659 forks
    3. Wraps the Google Tesseract engine for Optical Character Recognition (OCR)
    4. File output conversion to and from PDF or other formats
    5. Configurable language setting

    Cons

    1. ??

    Unknowns

    1. Batch processing of files
    2. Multiprocessing
    3. Async operations
  3. Textract

    Pros

    1. Wide range of file support
    2. GitHub popularity: 3.5k stars, 528 forks
    3. Uses the Google Tesseract engine for Optical Character Recognition (OCR)
    4. Works with video, audio, and doc files
    5. Configurable language setting

    Cons

    1. Minimal recent GitHub activity (commits, closed issues, closed pull requests)

    Unknowns

    1. Batch processing of files
    2. Multiprocessing
    3. Async operations
  4. PdfMiner.Six

    Pros

    1. Available as a command-line tool
    2. GitHub popularity: 4.6k stars, 834 forks
    3. Wide range of support for working with text, shape objects, and images in PDF files
    4. File output generation to and from PDF or other formats

    Cons

    1. Minimal recent GitHub activity (commits, closed issues, closed pull requests)
    2. Extracting data from tables in PDF files is not well supported

    Unknowns

    1. Batch processing of files
    2. Multiprocessing
    3. Async operations
  5. PdfPlumber

    Pros

    1. Recent GitHub activity (commits, closed issues, closed pull requests)
    2. GitHub popularity: 4k stars, 504 forks
    3. Wide range of support for working with text, lines, shape objects, images, tables, and forms in PDF files
    4. Visual debugging via ImageMagick
    5. Conversion of a PDF file or a single page to an image

    Cons

    1. Works with PDF files only
    2. Does not support Optical Character Recognition (OCR)
    3. Cannot generate a PDF file from another format

    Unknowns

    1. Batch processing of files
    2. Multiprocessing
    3. Async operations
  6. PyPdf

    Pros

    1. Recent GitHub activity (commits, closed issues, closed pull requests)
    2. GitHub popularity: 5.8k stars, 1.2k forks
    3. Wide range of support for working with text and metadata in PDF files

    Cons

    1. Works with PDF files only
    2. Extracting from images, tables, or shape objects in PDF files is not well supported
    3. Does not support Optical Character Recognition (OCR)

    Unknowns

    1. Batch processing of files
    2. Multiprocessing
    3. Async operations
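Batch processing and multiprocessing are listed as unknowns for most of the libraries above. One option that does not depend on library support is to run the per-file extraction concurrently at the application level. The following is a minimal sketch under that assumption: `batch_extract` and `read_plain_text` are hypothetical names, not part of this project or of any library listed here, and the extractor is passed in so that any of the candidates could be plugged in.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, Iterable


def batch_extract(
    paths: Iterable[Path],
    extract_one: Callable[[Path], str],
    max_workers: int = 4,
) -> Dict[Path, str]:
    """Run extract_one over every path concurrently and collect the results."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each path with its text.
        texts = list(pool.map(extract_one, paths))
    return dict(zip(paths, texts))


def read_plain_text(path: Path) -> str:
    """Hypothetical stand-in extractor: reads a file as UTF-8 text."""
    return path.read_text(encoding="utf-8")
```

A thread pool is used here for simplicity. If extraction turns out to be CPU-bound (OCR often is), swapping in `concurrent.futures.ProcessPoolExecutor` may be the better fit, provided the extractor is a picklable top-level function.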

Setup

# Install Tesseract OCR engine
sudo apt install -y tesseract-ocr
sudo apt install -y libtesseract-dev

# Set up the virtual environment
python3 -m venv .venv

# Activate the virtual environment
. ./.venv/bin/activate

# Upgrade pip to the latest version
pip install --upgrade pip

# Install the python packages
pip install -r requirements.txt

# Run the app
python3 app.py
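
Since OCR depends on the Tesseract binary installed in the first step, it can help to confirm that the binary is reachable before starting the app. A minimal sketch, assuming the executable is named `tesseract` and is on `PATH`; the helper name `tesseract_version` is hypothetical and not part of this project:

```python
import shutil
import subprocess
from typing import Optional


def tesseract_version(binary: str = "tesseract") -> Optional[str]:
    """Return the first line of `<binary> --version`, or None if it is not on PATH."""
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some Tesseract builds print the version to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    print(tesseract_version() or "Tesseract not found; install tesseract-ocr first")
```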

Challenges

  • Flask-Uploads fails on import with a Werkzeug error:
ImportError: cannot import name 'secure_filename' from 'werkzeug'

Solution: Decided to validate uploads by file type extension in the app itself instead of trying another module such as Flask-Reuploaded.
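
The extension check described in this solution can be sketched in plain Python without any upload extension. The allowed set below is an assumption chosen for illustration, and `allowed_file` is a hypothetical helper name; the project's actual list and function may differ. As a side note, Werkzeug still provides `secure_filename`, but under `werkzeug.utils` rather than the top-level package, which is why the import in the error above fails.

```python
from pathlib import Path

# Assumed set of accepted upload types; adjust to what the app really handles.
ALLOWED_EXTENSIONS = {"pdf", "png", "jpg", "jpeg"}


def allowed_file(filename: str) -> bool:
    """Return True if the filename's final extension is in ALLOWED_EXTENSIONS."""
    # Path.suffix yields the last extension including the dot, or "" if none.
    suffix = Path(filename).suffix
    return suffix.lower().lstrip(".") in ALLOWED_EXTENSIONS
```

Only the final extension is considered, so a name like `report.pdf.exe` is rejected. An extension check alone does not verify file contents, so it is best treated as a first filter.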

  • OCR conversion
Could not get PyMuPDF to use the Tesseract OCR engine, so documents that required OCR were not handled.

Solution: Imported pytesseract, which wraps the Tesseract engine, to handle OCR file processing.
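
One way to combine the two libraries is to keep the embedded text layer when a page has one and fall back to pytesseract only when it is empty. The sketch below illustrates that idea and is not the project's actual code: `text_or_ocr` and `ocr_image_file` are hypothetical helpers, and the fallback assumes the page has already been rendered to an image file. `pytesseract.image_to_string` and `PIL.Image.open` are real calls, and both need the Tesseract binary and the Pillow package installed.

```python
from typing import Callable


def text_or_ocr(
    embedded_text: str, run_ocr: Callable[[], str], min_chars: int = 1
) -> str:
    """Return the embedded text if it has content, otherwise the OCR result."""
    cleaned = embedded_text.strip()
    if len(cleaned) >= min_chars:
        return cleaned
    # No usable text layer: call the OCR function only now, since OCR is slow.
    return run_ocr().strip()


def ocr_image_file(image_path: str, lang: str = "eng") -> str:
    """OCR one image file with pytesseract."""
    # Imported lazily so text_or_ocr stays usable without OCR dependencies.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=lang)
```

A call might look like `text_or_ocr(page_text, lambda: ocr_image_file("page.png"))`, where `page_text` is whatever the PDF library extracted and `page.png` is a placeholder path.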
