andrealenzi11 / py-poppleract

License: GNU General Public License v2.0


py-poppleract

Python library and Web service based on Poppler Pdftotext utility and Tesseract OCR for extracting text from PDF documents.

Many PDF documents are of mixed type and contain:

  • machine-readable pages, from which text can be extracted with PDF rendering libraries;
  • non-machine-readable pages (images), from which text can only be extracted with OCR engines.

With this tool (Poppleract), you can extract text from mixed documents efficiently and easily.

How to extract text from an input pdf through the web service

Run the Poppleract services

Build the Poppleract services image or pull it from Docker Hub:

docker pull andrealenzi/poppleract-services:0.0.7

Run the Poppleract services and expose them on the desired port (e.g. 50000):

docker run -it --rm -p 50000:8080 andrealenzi/poppleract-services:0.0.7

See the API documentation:

http://0.0.0.0:50000/docs
http://0.0.0.0:50000/redoc

Call the Text Extraction Service

Use curl to extract text from the input PDF document:

curl -X 'POST' \
  'http://0.0.0.0:50000/extract_text?minimum_chars_number=20&raw=false&physical=false&dpi=200&lang=eng&oem=3&psm=3&thresholding_method=0&preserve_interword_spaces=1' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'input_file=@doc1.pdf;type=application/pdf'

Or perform a request with Python:

import requests

headers = {
    'accept': 'application/json'
}
params = {
    'minimum_chars_number': '20',
    'raw': 'false',
    'physical': 'false',
    'dpi': '200',
    'lang': 'eng',
    'oem': '3',
    'psm': '3',
    'tessdata_dir': None,
    'thresholding_method': '0',
    'preserve_interword_spaces': '1',
}
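# Open the PDF in a context manager so the file handle is closed after the upload
with open('doc1.pdf', 'rb') as pdf_file:
    files = {
        'input_file': ('doc1.pdf', pdf_file, 'application/pdf')
    }
    response = requests.post(url='http://0.0.0.0:50000/extract_text',
                             params=params,
                             headers=headers,
                             files=files)
response.raise_for_status()  # Raise an exception on HTTP error status codes
print(response.json())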

Response body:

{
  "file_name": "doc1.pdf", 
  "file_size_mb": 0.1771, 
  "num_extracted_chars": 762, 
  "extracted_text": "INTERNAL\n\nTEST DOC 1\nThis is a pdf document for test.\n\nThis page is machine-readable.\nThe second page of this document is NOT machine-readable, but it represents an image with text.\nThe third page of this document is again machine-readable.\n\nQwertyuiop\nAsdfghjkl\nZxcvbnm\n\nQWERTYUIOP\nASDFGHJKL\nZXCVBNM\n\n\x0c\n\n<END_PAGE>\n\nINTERNAL\n\nThis is a lot of 12 point text to test the\nocr code and see if it works on all types\nof file format.\n\nThe quick brown dog jumped over the\nlazy fox. The quick brown dog jumped\nover the lazy fox. The quick brown dog\njumped over the lazy fox. The quick\nbrown dog jumped over the lazy fox.\n\n\n<END_PAGE>\n\nINTERNAL\n\nThird and final page of this test document.\n\nQwertyuiop\nAsdfghjkl\nZxcvbnm\n\nQWERTYUIOP\nASDFGHJKL\nZXCVBNM\n\n<end of document>\n\n\x0c"
}
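
In the sample response above, pages appear to be separated by an `<END_PAGE>` marker, with form-feed characters (`\x0c`) left over from the machine-readable pages. The following is a minimal sketch for splitting `extracted_text` into per-page strings. The helper name `split_pages` is hypothetical, and the marker is inferred from the sample response only, not from documented behavior.

```python
from typing import List

# Assumption: page separator inferred from the sample response body
PAGE_MARKER = "<END_PAGE>"


def split_pages(extracted_text: str) -> List[str]:
    """Split the extracted text into one cleaned string per page."""
    pages = []
    for chunk in extracted_text.split(PAGE_MARKER):
        # Remove form feeds and the whitespace surrounding each page
        cleaned = chunk.replace("\x0c", "").strip()
        if cleaned:
            pages.append(cleaned)
    return pages


# Shortened, made-up text in the same shape as the sample response
sample = (
    "Page one text\n\n\x0c\n\n<END_PAGE>\n\n"
    "Page two text\n\n<END_PAGE>\n\n"
    "Page three text\n\n\x0c"
)
for number, page in enumerate(split_pages(sample), start=1):
    print(number, page)
```

With a real response, this could be called as `split_pages(response.json()["extracted_text"])`.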

How to use PoppleractPdfExtractor programmatically

"""
Hybrid approach for extracting text from mixed PDFs:
Pdftotext on machine-readable pages + Tesseract OCR on image pages
"""
from poppleract.text_extraction import PoppleractPdfExtractor

hybrid_extr_obj = PoppleractPdfExtractor(
    cache_folder="imgs/",  # Folder for the images that represent the document pages
    preserve_cache=False  # Whether to keep the folder with the page images after extraction
)

hybrid_extr_obj.extract_text(
    in_pdf_file_path="doc1.pdf",  # Input pdf document path
    out_txt_file_path="doc1.txt",  # Output txt file path
    minimum_chars_number=20,  # OCR is applied to a page only if fewer characters than this threshold are extracted from it
    raw=False,  # Pdftotext parameter to keep strings in content stream order or not
    physical=False,  # Pdftotext parameter to maintain original physical layout or not
    dpi=200,  # Dots per Inch (DPI) used by Pdftocairo and Tesseract
    lang="eng+ita",  # Tesseract langs
    oem=3,  # Tesseract OCR Engine Mode 
    psm=3,  # Tesseract Page Segmentation Mode 
    tessdata_dir="/usr/local/share/tessdata/",  # Folder with Tesseract languages files
    thresholding_method=0,  # Tesseract parameter to select image thresholding method
    preserve_interword_spaces=1  # Tesseract option to preserve spaces
)
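
The `minimum_chars_number` comment above describes a per-page fallback: keep the Pdftotext output unless it is too short, in which case OCR is used. A minimal sketch of that decision follows. The function `page_text_with_fallback` and its `run_ocr` callback are hypothetical and not part of the library, and counting characters after stripping whitespace is an assumption about how the threshold is applied.

```python
from typing import Callable


def page_text_with_fallback(
    pdftotext_text: str,
    run_ocr: Callable[[], str],
    minimum_chars_number: int = 20,
) -> str:
    """Return the Pdftotext output, or the OCR output when too little text was found."""
    # Assumption: the threshold is compared against the stripped text length
    if len(pdftotext_text.strip()) >= minimum_chars_number:
        return pdftotext_text
    # OCR is the expensive step, so it only runs when it is needed
    return run_ocr()


def fake_ocr() -> str:
    # Stand-in for a real OCR call, used here only for illustration
    return "text recognized by OCR"


print(page_text_with_fallback("This page is machine-readable.", fake_ocr))
print(page_text_with_fallback("\x0c", fake_ocr))
```

Passing the OCR step as a callback keeps it lazy, so machine-readable pages never pay for image rendering and recognition.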

How to use PdfSplitter programmatically

"""
Split an input PDF document into one PNG image per page
"""
from poppleract.pdf_splitting import PdfSplitter

pdf_splitter_obj = PdfSplitter()

pdf_splitter_obj.split_pdf_to_images(
    in_pdf_file_path="doc1.pdf",
    out_images_directory_path="imgs/",
    dpi=200,
    img_exportation_format="png",
    use_pdf_to_cairo=True,
    first_page=1,
    last_page=3,
    output_filename_prefix="page"
)
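
After splitting, the page images usually need to be processed in page order. Plain alphabetical sorting puts `page-10` before `page-2`, so the sketch below sorts by the last number in each file name. The file names shown are hypothetical: the exact naming scheme produced by `split_pdf_to_images` is not stated here, and the helper `sort_page_images` is not part of the library.

```python
import re
from pathlib import Path
from typing import List


def sort_page_images(paths: List[Path]) -> List[Path]:
    """Sort image paths by the last integer found in each file name."""

    def page_number(path: Path) -> int:
        numbers = re.findall(r"\d+", path.stem)
        # Files without a number sort first
        return int(numbers[-1]) if numbers else 0

    return sorted(paths, key=page_number)


# Hypothetical output file names, used only for illustration
names = [Path("imgs/page-10.png"), Path("imgs/page-2.png"), Path("imgs/page-1.png")]
print([p.name for p in sort_page_images(names)])
```

On a real output folder, this could be called as `sort_page_images(list(Path("imgs/").glob("*.png")))`.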
