
Program Stuck with a pdf (pdftotext issue, closed)

jalan commented on August 24, 2024
Program Stuck with a pdf

from pdftotext.

Comments (2)

jalan commented on August 24, 2024

You can use the multiprocessing module to perform long-running tasks with a timeout. Here is a commented example script I called readpdf.py:

import multiprocessing
import queue
import sys

import pdftotext


def main():
    path = sys.argv[1]
    text = read_pdf(path)
    if text is None:
        sys.exit("timed out trying to read the PDF")
    print(text)


def read_pdf(path, timeout=5):
    # A queue to pass data between processes
    q = multiprocessing.Queue()

    # Start a new process
    p = multiprocessing.Process(target=read_pdf_process, args=(path, q))
    p.start()

    # Wait some time for the text to arrive. Read the queue before joining:
    # a child that has put a large text on the queue cannot exit until the
    # parent has consumed it, so joining first could look like a timeout.
    try:
        text = q.get(timeout=timeout)
    except queue.Empty:
        # Nothing arrived in time (the child is still running, or it failed),
        # so terminate it and report None
        p.terminate()
        text = None

    p.join()
    q.close()
    return text


def read_pdf_process(path, q):
    with open(path, "rb") as f:
        pdf = pdftotext.PDF(f)
    text = "\n\n".join(pdf)
    q.put(text)
    q.close()


if __name__ == "__main__":
    main()

On a typical PDF, you get the text:

$ python readpdf.py abcde.pdf 
abcde.

On your troublesome PDF, it will time out after five seconds:

$ python readpdf.py 2014-ASHP-Handbook-web-edition.pdf 
timed out trying to read the PDF
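
If managing the process and queue yourself feels like too much, another option is to let subprocess.run enforce the timeout, since it kills the child when the timeout expires. Below is a minimal sketch, not something this module provides. It assumes poppler's pdftotext command-line tool is installed and on your PATH (called as "pdftotext file.pdf -" to write the text to stdout). The names run_with_timeout and read_pdf_cli are hypothetical helpers made up for this example.

```python
import subprocess


def run_with_timeout(cmd, timeout=5):
    """Run cmd and return its stdout as text, or None on timeout or failure."""
    try:
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed and reaped the child here
        return None
    if result.returncode != 0:
        # The command ran but reported an error
        return None
    return result.stdout


def read_pdf_cli(path, timeout=5):
    # Assumption: poppler's "pdftotext" CLI is installed; "-" means stdout
    return run_with_timeout(["pdftotext", path, "-"], timeout=timeout)
```

Calling read_pdf_cli("abcde.pdf") should return the extracted text, or None if the tool fails or does not finish within the timeout. The trade-off is the same as with multiprocessing: you pay for an extra process on every file.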

I hope it helps.

jalan commented on August 24, 2024

Closing, since this is something you should solve in your own code. I don't think the module itself should do this, since spinning up another process would be expensive and surprising to most users.
