Comments (5)
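I only understood your problem after seeing the test file provided in #63.
pdf2docx relies on the upstream Python library PyMuPDF to extract PDF content. PyMuPDF itself supports Chinese, but some PDF files may lack the cmaps table (the rules that map CJK (Chinese, Japanese, Korean) characters to Unicode characters), so the text cannot be extracted correctly.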
I have multiple pdf files without 'toUnicode' cmap table. Absence of cmap table restricts me from copying the text from pdf files.
For some fonts in some PDFs some characters cannot be extracted correctly, because their CMap / ToUnicode doesn't make sense or is incomplete
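Related issues in the PyMuPDF project:
Since this is a problem with the PDF itself and no general solution has been found yet, pdf2docx cannot provide support for this issue.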
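As a rough way to check whether a given PDF is affected, the sketch below measures how much of the extracted text consists of the Unicode replacement character U+FFFD. It assumes that characters without a usable mapping show up as U+FFFD in PyMuPDF's output, which may not hold for every file (a broken CMap can also yield wrong but valid characters). The names `replacement_ratio`, `suspicious_pages`, and the 0.3 threshold are hypothetical and not part of pdf2docx or PyMuPDF; `page.get_text()` is the PyMuPDF call in recent versions (older releases spell it `getText()`).

```python
from typing import List

REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, used when a character has no Unicode mapping


def replacement_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are U+FFFD."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(ch == REPLACEMENT_CHAR for ch in visible) / len(visible)


def suspicious_pages(pdf_path: str, threshold: float = 0.3) -> List[int]:
    """Return 0-based page indices whose extracted text looks unmapped.

    The threshold is an arbitrary assumption; tune it for your files.
    """
    import fitz  # PyMuPDF, imported lazily so replacement_ratio works without it

    with fitz.open(pdf_path) as doc:
        return [
            index
            for index, page in enumerate(doc)
            if replacement_ratio(page.get_text()) >= threshold
        ]
```

Calling `suspicious_pages("input.pdf")` on a file with a missing ToUnicode table would be expected to list most of its text pages, but treat the result as a hint only.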
from pdf2docx.
Just as a comment from the outside:
This type of problem can only be solved by using an OCR program like Tesseract. There is a Python repo available that does a good job on this: https://github.com/jbarlow83/OCRmyPDF. I have tried it myself.
There also is a way to incorporate Tesseract directly in MuPDF v1.18.0 (which I do not use yet in PyMuPDF).
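For illustration, here is a minimal sketch of running OCRmyPDF from Python through its command-line tool. It assumes `ocrmypdf` and the Tesseract language pack `chi_sim` (Simplified Chinese) are installed. The helper names `build_ocr_command` and `run_ocr` are hypothetical. `--force-ocr` tells OCRmyPDF to rasterize every page and OCR it even if the page already has a (broken) text layer. Whether pdf2docx can make use of the resulting text layer is not confirmed anywhere in this thread.

```python
import subprocess
from typing import List


def build_ocr_command(src_pdf: str, dst_pdf: str, language: str = "chi_sim") -> List[str]:
    """Build the ocrmypdf argument list (hypothetical helper).

    -l           selects the Tesseract language pack
    --force-ocr  OCRs every page, ignoring any existing text layer
    """
    return ["ocrmypdf", "-l", language, "--force-ocr", src_pdf, dst_pdf]


def run_ocr(src_pdf: str, dst_pdf: str, language: str = "chi_sim") -> None:
    """Run ocrmypdf and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_ocr_command(src_pdf, dst_pdf, language), check=True)
```

A call such as `run_ocr("broken.pdf", "broken_ocr.pdf")` would write a new PDF whose text comes from OCR rather than from the faulty CMap.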
from pdf2docx.
Could you provide a test file?
from pdf2docx.
Your file is an image. Converting scanned files is not supported at the moment.
from pdf2docx.