Required environment: Python 3.7.13, CUDA 11.x
pip install -r requirement.txt
!pip install layoutparser torchvision && pip install "git+https://github.com/facebookresearch/[email protected]#egg=detectron2"
!pip install "layoutparser[effdet]"
!pip install "layoutparser[paddledetection]"
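After installing, it can help to confirm which backends are actually importable before running the pipeline. The following is a minimal sketch using only the standard library; the module names in DEFAULT_MODULES are assumptions about the import names of the packages installed above (for example, PaddlePaddle is assumed to import as paddle).

```python
import importlib.util
import sys

# Assumed import names for the packages installed above.
# "paddle" is assumed to be the import name of PaddlePaddle.
DEFAULT_MODULES = [
    "layoutparser",
    "torchvision",
    "detectron2",
    "effdet",
    "paddle",
    "paddleocr",
]


def check_backends(modules=DEFAULT_MODULES):
    """Return {module_name: bool}, True when the module can be located.

    find_spec only locates a top-level module; it does not import it,
    so a package that is installed but broken still reports True.
    """
    return {name: importlib.util.find_spec(name) is not None for name in modules}


def report():
    """Print the Python version and the status of each backend."""
    print("Python", sys.version.split()[0])
    for name, found in check_backends().items():
        print(f"{name:15s} {'ok' if found else 'missing'}")
```

Calling report() prints one line per module; a "missing" entry points at the pip command above that still needs to be run.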
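If the environment already contains a tools package, fix the paddleocr package compatibility issue (PaddlePaddle/PaddleOCR#1024) as described below. Note (5.31): the EfficientDet (effdet) and Paddle environments do not have this problem.
Replace .tools with paddleocr.tools in the PaddleOCR package: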
mv replace_paddleocr.py pathToYourPaddleOCR/paddleocr.py
On Colab:
mv replace_paddleocr.py /usr/local/lib/python3.7/dist-packages/paddleocr/paddleocr.py
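If you are unsure what pathToYourPaddleOCR is on your machine, the installed location can be looked up instead of guessed. This is a minimal sketch using only the standard library; find_module_file is a hypothetical helper name, and it assumes the package ships a file called paddleocr.py at its top level, as the Colab path above suggests.

```python
import importlib.util
from pathlib import Path


def find_module_file(package, filename):
    """Return the path of `filename` inside an installed package, or None.

    The package is located with find_spec and is not imported, so this
    also works when importing the package currently fails.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None
    for location in spec.submodule_search_locations:
        candidate = Path(location) / filename
        if candidate.is_file():
            return candidate
    return None


# Example (requires paddleocr to be installed):
# print(find_module_file("paddleocr", "paddleocr.py"))
```

The printed path is the target to use in the mv command above.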
First run extract_image.py to save each page of the PDF as an image:
python extract_image.py
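For orientation, the page-to-image step can be sketched as follows. This is not the contents of extract_image.py; it is an illustration that assumes PyMuPDF (imported as fitz) is installed, and the output naming scheme and the zoom factor are assumptions.

```python
from pathlib import Path


def page_image_path(out_dir, pdf_path, page_index):
    """Build an output path such as out_dir/report_page_0001.png.

    The naming scheme is an assumption made for this sketch.
    """
    stem = Path(pdf_path).stem
    return Path(out_dir) / f"{stem}_page_{page_index + 1:04d}.png"


def pdf_to_images(pdf_path, out_dir, zoom=2.0):
    """Render every page of a PDF to a PNG file and return the written paths.

    Assumes a recent PyMuPDF; it is imported lazily so that the helper
    above can be used without it.
    """
    import fitz  # PyMuPDF

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    written = []
    doc = fitz.open(str(pdf_path))
    try:
        for index, page in enumerate(doc):
            # A zoom of 2.0 doubles the default resolution in both directions.
            pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
            target = page_image_path(out_dir, pdf_path, index)
            pix.save(str(target))
            written.append(target)
    finally:
        doc.close()
    return written
```

A higher zoom gives the layout and OCR models more pixels to work with, at the cost of larger files and slower inference.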
python layout_analysis.py
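As a rough idea of what a layout analysis pass with layoutparser looks like, here is a minimal sketch. It is not the contents of layout_analysis.py; the PubLayNet model, the label map, and the score threshold are assumptions taken from the layoutparser examples, and blocks_to_records is a hypothetical helper for saving the result.

```python
def detect_layout(image_path, threshold=0.8):
    """Detect layout blocks on one page image with a Detectron2 model.

    Assumes layoutparser, detectron2 and opencv-python are installed.
    The model and label map below are assumptions (PubLayNet).
    """
    import cv2
    import layoutparser as lp

    image = cv2.imread(str(image_path))
    if image is None:
        raise FileNotFoundError(image_path)
    image = image[..., ::-1]  # OpenCV loads BGR; the model expects RGB

    model = lp.Detectron2LayoutModel(
        "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
        extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", threshold],
        label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
    )
    return model.detect(image)


def blocks_to_records(layout):
    """Convert detected blocks into plain dicts that can be dumped to JSON.

    Each block is expected to expose .type, .score and .coordinates
    (x1, y1, x2, y2), as layoutparser TextBlock objects do.
    """
    records = []
    for block in layout:
        x1, y1, x2, y2 = block.coordinates
        records.append({
            "type": block.type,
            "score": None if block.score is None else float(block.score),
            "box": [float(x1), float(y1), float(x2), float(y2)],
        })
    return records
```

Saving the records as JSON next to each page image lets the later text extraction step crop regions without rerunning the detector.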
---------------------------- Text extraction --------------------------
Then download the models:
python3 image2text.py
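The OCR step can be sketched along these lines. This is not the contents of image2text.py; it assumes the PaddleOCR 2.x Python API, where constructing PaddleOCR is expected to download the default models on first use. Because the nesting of the returned result differs between 2.x releases, the hypothetical iter_lines helper flattens either shape.

```python
def _is_line(item):
    """True if item looks like [box, (text, confidence)]."""
    return (
        isinstance(item, (list, tuple))
        and len(item) == 2
        and isinstance(item[1], (list, tuple))
        and len(item[1]) == 2
        and isinstance(item[1][0], str)
    )


def iter_lines(result):
    """Yield (text, confidence) pairs from a PaddleOCR result.

    Handles both a flat list of lines and a list of per-image lists,
    and skips None entries (pages with no detected text).
    """
    if result is None:
        return
    if _is_line(result):
        yield result[1][0], float(result[1][1])
        return
    if isinstance(result, (list, tuple)):
        for item in result:
            yield from iter_lines(item)


def ocr_image(image_path, lang="ch"):
    """Run OCR on one image and return a list of (text, confidence).

    Assumes paddleocr 2.x is installed; lang="ch" is an assumption.
    """
    from paddleocr import PaddleOCR

    ocr = PaddleOCR(use_angle_cls=True, lang=lang)
    return list(iter_lines(ocr.ocr(str(image_path), cls=True)))
```

Filtering the returned pairs by confidence is a simple way to drop noisy detections before writing the text out.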
Visualization (optional)
python3 visualization.py