This script extracts text from OCRed PDFs, translate it, and generate new PDFs with the translated content. The script uses the GPT-3.5 Turbo API for translation.
Before using this script, make sure you have the following prerequisites installed:
- Python 3.x
- PyPDF2
- reportlab
- tqdm
- requests
- GPT-3.5 Turbo API key (You can obtain this from OpenAI)
- TrueType font file (.ttf) with support for Chinese characters (You can replace './font.ttf' with the path to your font file)
-
First, provide your GPT-3.5 Turbo API key in the
api_key
variable. -
Modify the
source_pdf
variable to specify the path to your source PDF file that you want to translate. -
Adjust the
start_page
andend_page
variables to define the range of pages you want to translate within the source PDF. -
Run the script using the following command:
python script.py
This will initiate the translation process and create individual PDF files with translated content for each specified page.
-
If you want to retranslate specific pages, you can use the
retranslate_and_merge_pages
function by uncommenting it in themain
function and providing the list of page numbers you want to retranslate. -
The translated PDFs will be stored in the
./pages/
directory, and the final merged translated document will be namedfinal_translated_document.pdf
.
-
The script takes care of handling previous and next page content to ensure coherent translations.
-
Make sure you have a valid API key and an internet connection for the translation process.
-
You can replace the './font.ttf' font path with the path to your own TrueType font file if needed.
Feel free to customize the script according to your specific needs and enjoy automated translation and PDF processing!