Coder Social home page Coder Social logo

pdf_text_extraction_jr's Introduction

Text Extraction from PDFs

In this repo, I will provide a comprehensive guide on extracting text data from PDF files in Python. This approach will cover the text extraction for different components in PDFs such as:

  • Plain text
  • Tables
  • Images in the PDF

For the full guide you can read my article on Medium: https://bit.ly/3RtPuCw

To achieve that we will use the PDFMiner library to perform an initial analysis of the layout of the PDF and identify the proper tool needed for the specific component. Then based on the component found we will apply the appropriate function and Python Library.

The output of this process will be a Python dictionary containing information extracted for each page of the PDF file. Each key in this dictionary will present the page number of the document, and its corresponding value will be a list with the following 5 nested lists containing:

  1. The text extracted per text block of the corpus
  2. The format of the text in each text block in terms of font family and size
  3. The text extracted from the images on the page
  4. The text extracted from tables in a structured format
  5. The complete text content of the page

You can see a flowchart of the process below:

To extract text from Plain Corpus:

  • We use the get_text() method of the LTTextContainer element provided be PDFMiner to extract the text presented in the container.
  • We iterate through the LTTextContainer object to access each LTTextLine object and then we access each individual character element as LTChar collecting the metadata for its fromat

To extract text from Images:

  • We use crop_image() function to find the coordinates of the image box detected from PDFMiner and then to crop and save it as a new PDF in our directory using the PyPDF2 library.
  • We employ the convert_from_file() function from the pdf2image library to convert all PDF files in the directory into a list of images, saving them in PNG format.
  • We use the Image package of the PIL module and implement the image_to_string() function of pytesseract to extract text from the images using the tesseract OCR engine.

To extract text from Tables:

  • We employ the extract_table() function, utilising the pdfplumber library,to extract the contents of the table into a list of lists
  • We table_converter() to join the contents of those lists in a table-like string.

pdf_text_extraction_jr's People

Contributors

g-stavrakis avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.