PDFParser
User Manual Search
This Python script parses a PDF user manual, extracts its text (including from scanned documents), and removes headers and footers. It uses the Tesseract OCR engine for text extraction and PyPDF2 for PDF handling.
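The header and footer logic used by the script is not shown in this README. As an illustration only, the sketch below shows one common approach: treat a line as a header or footer when it repeats at the top or bottom of most pages. The function name `strip_repeated_edges` and its parameters `edge_lines` and `min_share` are hypothetical and not taken from `main.py`.

```python
import re
from collections import Counter


def _key(line):
    # Normalise digits so that "Page 1" and "Page 2" compare as the same line.
    return re.sub(r"\d+", "#", line.strip())


def strip_repeated_edges(pages, edge_lines=1, min_share=0.6):
    """Drop lines that repeat at the top or bottom of most pages.

    pages      -- list of page texts, one string per page
    edge_lines -- how many lines at each end of a page to inspect (>= 1)
    min_share  -- share of pages a line must appear on to be removed
    """
    # Ignore blank lines so they do not hide the real first or last line.
    split_pages = [[ln for ln in p.splitlines() if ln.strip()] for p in pages]

    top_counts, bottom_counts = Counter(), Counter()
    for lines in split_pages:
        top_counts.update({_key(ln) for ln in lines[:edge_lines]})
        tail_start = max(0, len(lines) - edge_lines)
        bottom_counts.update({_key(ln) for ln in lines[tail_start:]})

    # Require at least two pages so a single page never loses its edges.
    threshold = max(2, min_share * len(split_pages))
    headers = {k for k, n in top_counts.items() if n >= threshold}
    footers = {k for k, n in bottom_counts.items() if n >= threshold}

    cleaned = []
    for lines in split_pages:
        total = len(lines)
        kept = []
        for i, ln in enumerate(lines):
            is_header = i < edge_lines and _key(ln) in headers
            is_footer = i >= total - edge_lines and _key(ln) in footers
            if not (is_header or is_footer):
                kept.append(ln)
        cleaned.append("\n".join(kept))
    return cleaned


if __name__ == "__main__":
    sample = [
        "ACME Manual\nIntro text\nPage 1",
        "ACME Manual\nStep 2 of setup\nPage 2",
        "ACME Manual\nTroubleshooting\nPage 3",
    ]
    print(strip_repeated_edges(sample))
```

Only the first and last `edge_lines` lines of each page are candidates for removal, so body text that happens to match a header elsewhere on the page is left alone.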
Prerequisites
Before running this script, you need to install some dependencies:
- Python 3.10 or higher
brew install python
- Install any virtual environment manager (e.g. virtualenv, conda, etc.)
- I am using conda
- Install conda from https://docs.conda.io/en/latest/miniconda.html
- Create a virtual environment
conda create --name pdf_parser python=3.10
- Activate the virtual environment
conda activate pdf_parser
- Install the required packages
pip install -r requirements.txt
- Install the required system packages
- For Mac Users
- Install Tesseract
brew install tesseract
- Install Poppler
brew install poppler
- For Linux (Ubuntu-18.04) Users
- Install Tesseract
sudo apt-get install tesseract-ocr
- Install Poppler
sudo apt-get install poppler-utils
- Download the trained data file for Simplified Chinese
sudo mkdir -p /usr/local/share/tessdata/
sudo curl -L -o /usr/local/share/tessdata/chi_sim.traineddata https://github.com/tesseract-ocr/tessdata/raw/main/chi_sim.traineddata
- Download the text detector model
sudo wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1Jk4eGD7crsqCCg9C9VjCLkMN3ze8kutZ' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1Jk4eGD7crsqCCg9C9VjCLkMN3ze8kutZ" -O craft_mlt_25k.pth && rm -rf /tmp/cookies.txt
- Run the script
python main.py
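If the script fails at startup, it can help to confirm that the prerequisites listed above are in place. The following is a minimal sketch and not part of the repository: the function name `check_prerequisites` is hypothetical, the tessdata directory is the one used in the download step, and it assumes the script looks for `craft_mlt_25k.pth` in the current directory, which this README does not confirm. `pdftoppm` is one of the command-line tools installed with Poppler.

```python
import shutil
import sys
from pathlib import Path


def check_prerequisites(tessdata_dir="/usr/local/share/tessdata",
                        model_path="craft_mlt_25k.pth",
                        min_python=(3, 10)):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []

    # The README asks for Python 3.10 or higher.
    if sys.version_info < min_python:
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted}+ required, found {sys.version.split()[0]}")

    # Tesseract and Poppler are installed as command-line tools.
    for tool in ("tesseract", "pdftoppm"):
        if shutil.which(tool) is None:
            problems.append(f"'{tool}' not found on PATH")

    # Simplified Chinese trained data downloaded in the steps above.
    trained_data = Path(tessdata_dir) / "chi_sim.traineddata"
    if not trained_data.is_file():
        problems.append(f"missing trained data file: {trained_data}")

    # Text detector model; its expected location is an assumption.
    if not Path(model_path).is_file():
        problems.append(f"missing text detector model: {model_path}")

    return problems


if __name__ == "__main__":
    found = check_prerequisites()
    for problem in found:
        print("PROBLEM:", problem)
    if not found:
        print("All prerequisite checks passed.")
```

Returning a list instead of raising on the first failure lets all missing pieces be reported in one run.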