Time-sheet-PDF-to-Excel

This repo contains a script that converts a PDF containing time sheet information into images and then extracts the text from those images.

Prerequisites

  • pdf2image library
  • Pillow library
  • Poppler library

Inputs

  • POPPLER_PATH : The path of the Poppler library
  • IMAGE_WIDTH : The width of the image in pixels
  • IMAGE_HEIGHT : The height of the image in pixels
  • IMAGES_FOLDER : The folder where the images will be saved
  • pdf_file_path : The path of the PDF file
  • TEMP_TASK : The task that is going to be performed on the image (eg. "Pull mats")

Functions

save_images(pdf_file_path: str, images: list, destination_folder: str = IMAGES_FOLDER)

This function takes the pdf_file_path, images and the destination_folder as input. It gets the file name from the path using os.path.basename, removes the file extension and creates a folder with the same name as the file name. Then it saves the images to the folder.
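
The steps described above could look roughly like the following. This is a minimal sketch, not the repository's actual code: the IMAGES_FOLDER value, the page_N.png naming, the choice to create the per-PDF folder inside destination_folder, and the returned list of paths are all assumptions made for illustration.

```python
import os

IMAGES_FOLDER = "images"  # hypothetical default; the README does not give the real value


def save_images(pdf_file_path: str, images: list, destination_folder: str = IMAGES_FOLDER) -> list:
    """Save each page image into a folder named after the PDF file."""
    # "reports/week_12.pdf" -> "week_12"
    file_name = os.path.splitext(os.path.basename(pdf_file_path))[0]

    # Assumption: the per-PDF folder lives inside destination_folder.
    target_folder = os.path.join(destination_folder, file_name)
    os.makedirs(target_folder, exist_ok=True)

    saved_paths = []
    for page_number, image in enumerate(images, start=1):
        # Assumption: one PNG per page, named by page number.
        image_path = os.path.join(target_folder, f"page_{page_number}.png")
        image.save(image_path)  # PIL.Image.Image objects provide save()
        saved_paths.append(image_path)
    return saved_paths
```

Returning the saved paths is optional, but it makes it easy to hand a page image straight to crop_image.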

inches_to_pixels(inches: float, dpi: int) -> int

Converts inches to pixels at the given DPI (dots per inch), i.e. pixels = inches × dpi.

mm_to_pixels(mm: float, dpi: int) -> int

Converts millimeters to pixels at the given DPI (dots per inch), using 1 inch = 25.4 mm.
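
Both conversions are short enough to sketch in full. The signatures come from this README; rounding to the nearest pixel (rather than truncating) is an assumption, since the README does not say how fractional pixels are handled.

```python
MM_PER_INCH = 25.4


def inches_to_pixels(inches: float, dpi: int) -> int:
    """Convert a length in inches to whole pixels at the given DPI."""
    # Assumption: round to the nearest pixel.
    return round(inches * dpi)


def mm_to_pixels(mm: float, dpi: int) -> int:
    """Convert a length in millimeters to whole pixels at the given DPI."""
    return round(mm / MM_PER_INCH * dpi)


print(inches_to_pixels(8.5, 200))  # 1700
print(mm_to_pixels(210, 200))      # 1654
```

These helpers are useful for turning a region measured on the printed time sheet into the pixel coordinates that crop_image expects.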

crop_image(image_path: str, x_start: int, y_start: int, width: int, height: int, save_path: str)

This function takes an image path, x_start, y_start, width, height and save_path as input. It opens the image and checks its size and DPI. Then it crops the image to the rectangle that starts at (x_start, y_start) and spans the given width and height.

Finally, it saves the cropped image to the given save_path.
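
A minimal sketch of this cropping step with Pillow follows. The crop_box helper is hypothetical and only illustrates how a start point plus width and height maps to the (left, upper, right, lower) box that Pillow's Image.crop takes. Printing the size and DPI is one possible reading of "checks"; the actual script may do something different with those values.

```python
def crop_box(x_start: int, y_start: int, width: int, height: int) -> tuple:
    """Return the (left, upper, right, lower) box that Image.crop expects."""
    return (x_start, y_start, x_start + width, y_start + height)


def crop_image(image_path: str, x_start: int, y_start: int,
               width: int, height: int, save_path: str) -> None:
    # Imported here so crop_box can be used without Pillow installed.
    from PIL import Image

    with Image.open(image_path) as image:
        print("size:", image.size)            # (width, height) in pixels
        print("dpi:", image.info.get("dpi"))  # None if the file has no DPI metadata
        cropped = image.crop(crop_box(x_start, y_start, width, height))
        cropped.save(save_path)
```

Note that the DPI is read from the image metadata, which not every file carries, so a fallback value may be needed in practice.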

The cropped image is then read to extract the text: it is passed to a pre-trained transformer OCR model from Microsoft, which predicts the text it contains.
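
The README does not name the model. One publicly available transformer OCR model from Microsoft is TrOCR, which can be loaded through the Hugging Face transformers library. The sketch below assumes TrOCR and the microsoft/trocr-base-printed checkpoint purely for illustration; the read_text function name is hypothetical, and the script may use a different model or checkpoint.

```python
MODEL_NAME = "microsoft/trocr-base-printed"  # assumption: the README does not name a checkpoint


def read_text(image_path: str, model_name: str = MODEL_NAME) -> str:
    """Run OCR on a single cropped image and return the predicted text."""
    # Heavy dependencies are imported lazily; they need Pillow, transformers and PyTorch.
    from PIL import Image
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    processor = TrOCRProcessor.from_pretrained(model_name)
    model = VisionEncoderDecoderModel.from_pretrained(model_name)

    with Image.open(image_path) as image:
        # The processor expects RGB input and returns a batch of pixel tensors.
        pixel_values = processor(images=image.convert("RGB"), return_tensors="pt").pixel_values

    generated_ids = model.generate(pixel_values)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

TrOCR works on single lines of text, which is one reason to crop the page down to the relevant region before running OCR.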

How to use

Update POPPLER_PATH, pdf_file_path and TEMP_TASK with the appropriate values, then run the script. The script converts the PDF to images, saves them to IMAGES_FOLDER, and then extracts from the image the text corresponding to the task named in TEMP_TASK.
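
As a rough illustration of the conversion step, the sketch below uses pdf2image's convert_from_path with the inputs listed earlier. The values given to POPPLER_PATH, IMAGE_WIDTH and IMAGE_HEIGHT and the helper names conversion_kwargs and pdf_to_images are hypothetical; only convert_from_path and its poppler_path and size parameters come from pdf2image itself.

```python
POPPLER_PATH = None   # hypothetical: set to the Poppler "bin" folder if it is not on PATH
IMAGE_WIDTH = 1700    # hypothetical value, in pixels
IMAGE_HEIGHT = 2200   # hypothetical value, in pixels


def conversion_kwargs(poppler_path, width: int, height: int) -> dict:
    """Build the keyword arguments passed to pdf2image.convert_from_path."""
    kwargs = {"size": (width, height)}
    if poppler_path:
        # Only needed when Poppler is not discoverable on the system PATH.
        kwargs["poppler_path"] = poppler_path
    return kwargs


def pdf_to_images(pdf_file_path: str) -> list:
    """Render every page of the PDF as a PIL image."""
    from pdf2image import convert_from_path  # requires pdf2image and Poppler

    return convert_from_path(
        pdf_file_path,
        **conversion_kwargs(POPPLER_PATH, IMAGE_WIDTH, IMAGE_HEIGHT),
    )
```

The returned list of page images is what save_images takes as its images argument.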
