PDF Slurp is a simple command-line tool for pulling text and images out of PDF files. I built it mainly to support my own learning: I wanted a convenient way to feed pages I've recently read into LLMs (and other AIs) for summarisation, flash card creation, and so on. Whether you're compiling notes, building a personal knowledge base, or just satisfying your curiosity, PDF Slurp is a handy helper to have around.
This tool is proudly homemade (mostly by Claude-3-Opus and GPT-4) and hosted here for anyone to clone and use. I've set it up for pipx installation because, frankly, I'm a bit too lazy to navigate the official PyPI distribution process, and I prefer the simplicity of installing it directly from a local source.
- Extract text from designated pages or ranges in a PDF, perfect for piecing together your own distilled summaries or reflections.
- Extract all text from PDFs, enabling quick access to content without manual copying and pasting.
- Pull individual images from PDFs, useful for saving diagrams, charts, or artwork that sparks inspiration.
- Invert the colors of extracted images on the fly, because why not have fun with some customization?
- Python 3.6 or higher
- PyPDF2
- Pillow (PIL Fork)
To get PDF Slurp up and running on your system, pipx is the way to go. It's straightforward and keeps things tidy by installing the tool in an isolated environment, which is ideal for casual toolmakers and users like me. Here's how to do it:
- Make sure pipx is installed on your machine. If it's not, you can set it up with:
  python3 -m pip install pipx
  python3 -m pipx ensurepath
- Go to the root directory where pdf_slurp lives and install with:
  pipx install .
After that, you should be good to go. Launch PDF Slurp from anywhere in your terminal and enjoy the effortless data extraction!
Here are a few quick examples to show what PDF Slurp can do:
To extract text from pages 1 to 3 and page 5:
pdf-slurp /path/to/pdf -p 1-3,5
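If you're curious how a page spec like 1-3,5 can be turned into a list of page numbers, here's a minimal sketch. It is not PDF Slurp's actual code: the function name parse_page_spec and its error handling are my own assumptions, and the real tool may treat edge cases differently.

```python
from typing import List


def parse_page_spec(spec: str) -> List[int]:
    """Expand a spec such as "1-3,5" into sorted, de-duplicated 1-based page numbers.

    Hypothetical helper for illustration only.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,2"
        if "-" in part:
            # A range like "1-3" covers both endpoints.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError("range start exceeds end: " + part)
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("page numbers start at 1")
    return sorted(pages)


if __name__ == "__main__":
    print(parse_page_spec("1-3,5"))  # [1, 2, 3, 5]
```

Collecting the pages in a set means overlapping ranges such as 1-3,2-4 don't produce duplicates.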
To extract all text from a PDF:
pdf-slurp /path/to/pdf --all-pages
To grab the third image from page 2:
pdf-slurp /path/to/pdf -p 2 -i 3
To invert the colors of the extracted image:
pdf-slurp /path/to/pdf -p 2 -i 3 -v
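For the curious, colour inversion is only a few lines with Pillow. This is a rough sketch of the idea, not necessarily how PDF Slurp does it: the invert_colors name is hypothetical, and converting to RGB first (which drops any alpha channel) is my own simplification.

```python
from PIL import Image, ImageOps


def invert_colors(image: Image.Image) -> Image.Image:
    """Return a colour-inverted copy of the image (hypothetical helper)."""
    # ImageOps.invert does not accept every image mode, so normalise to RGB first.
    # Assumption: losing transparency is acceptable for this use case.
    return ImageOps.invert(image.convert("RGB"))


if __name__ == "__main__":
    sample = Image.new("RGB", (2, 2), (255, 0, 10))
    inverted = invert_colors(sample)
    print(inverted.getpixel((0, 0)))  # (0, 255, 245)
```

Each channel value v becomes 255 - v, which is why dark diagrams on a light background come out light on dark.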
PDF Slurp is made available under the MIT License. For more details, check out the LICENSE file on the GitHub repository.
Happy learning and managing your PDFs!