Comments (4)
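Thanks, this is a good question and a clear explanation. I had never noticed that PDF operators, such as `K` here, can collide with text content, even though the files I tested also contain standalone characters such as `k` and `K`. The problem also implies that, with the current parsing approach, other operators could collide in the same way.
So #36, which you submitted, is a temporary fix for your specific case. To resolve this properly, could you provide a PDF file that reproduces the problem (feel free to remove any content that should not be made public)? Thanks.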
from pdf2docx.
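The test PDF has been sent to your mailbox: [email protected]. Sorry, my first attempts from a QQ mailbox were never delivered, so I sent it again from Gmail, which means the same content was sent twice.
The parser fails as soon as it reaches page 24, so you can restrict parsing to that page when testing, i.e.: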
parse(pdf_path, docx_path, start=24, end=25)
The error message is:
Processing 24/357...
Traceback (most recent call last):
File "E:/Projects/GiteeProjects/doc_parser/tests/pdf2docx_test.py", line 11, in <module>
parse(pdf_path, docx_path, start=24, end=25)
File "D:\Software\Anaconda3\envs\file_parse\lib\site-packages\pdf2docx\main.py", line 33, in parse
cv.parse(page).make_page()
File "D:\Software\Anaconda3\envs\file_parse\lib\site-packages\pdf2docx\converter.py", line 116, in parse
self.init(page).parse(**self._debug_kwargs)
File "D:\Software\Anaconda3\envs\file_parse\lib\site-packages\pdf2docx\converter.py", line 94, in init
self._layout.rects.from_stream(page_content, page.transformationMatrix)
File "D:\Software\Anaconda3\envs\file_parse\lib\site-packages\pdf2docx\shape\Rectangles.py", line 61, in from_stream
rects = pdf.rects_from_stream(xref_stream, M)
File "D:\Software\Anaconda3\envs\file_parse\lib\site-packages\pdf2docx\common\pdf.py", line 335, in rects_from_stream
c, m, y, k = map(float, lines[i-4:i])
ValueError: could not convert string to float: 'Td'
As you mentioned, other operators could collide as well. I found that the relevant splitting code in pdf2docx/common/pdf.py is:
# check xref stream word by word (line always changes)
lines = xref_stream.split()
rects = []
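My guess is that splitting the stream by line first, and then applying the later checks, might be more reasonable. However, your comment specifically says `word by word (line always changes)`, so there is presumably some other reason behind the current strategy.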
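To make the collision concrete, here is a minimal sketch of how splitting on whitespace can turn a literal `K` inside a text string into something that looks like the `K` operator. The function name and the two sample streams are made up for illustration; only the `split()` call and the `map(float, ...)` unpacking mirror the lines quoted above, and this is not the actual pdf2docx code.

```python
def find_cmyk_stroke_colors(stream):
    """Naive word-by-word scan: treat every bare 'K' token as the CMYK stroke operator."""
    tokens = stream.split()
    colors = []
    for i, token in enumerate(tokens):
        if token == "K":
            # Assumes the four preceding tokens are the c, m, y, k operands.
            c, m, y, k = map(float, tokens[i - 4:i])
            colors.append((c, m, y, k))
    return colors


# Hypothetical stream where K really is an operator.
good = "0 0 0 1 K 10 20 100 50 re S"
# Hypothetical stream where K is just a character inside a text string.
bad = "BT 1 0 Td ( K ) Tj ET"

print(find_cmyk_stroke_colors(good))
try:
    find_cmyk_stroke_colors(bad)
except ValueError as err:
    print("ValueError:", err)
```

In the second stream the four tokens before `K` are `1`, `0`, `Td` and `(`, so the conversion stops at `Td`, which matches the shape of the traceback above.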
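Many thanks for the test file and for the suggestion at the end.
According to the PDF specification, page content generally takes the form `a b c op`, where `op` is an operator (setting a color, drawing a line, and so on) and the preceding `a`, `b`, `c`, etc. are its operands, whose number depends on `op`. However, `xref_stream` does not hold exactly one command per line; several commands may share a line, which is why it is processed `word by word` above.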
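As a rough illustration of the `a b c op` form, the sketch below groups whitespace-separated tokens into (operator, operands) pairs. The operator set is a small hand-picked subset (`K`, `w`, `re`, `S`), the function name is hypothetical, and the sample stream is invented. It deliberately ignores text strings, which is exactly the case where this kind of word-by-word scan can go wrong.

```python
# A small illustrative subset of PDF graphics operators; not a complete table.
OPERATORS = {"K", "w", "re", "S"}


def group_commands(stream):
    """Group postfix tokens into (operator, [operands]) pairs."""
    commands = []
    operands = []
    for token in stream.split():
        if token in OPERATORS:
            # An operator closes the command; everything collected so far is its operands.
            commands.append((token, [float(x) for x in operands]))
            operands = []
        else:
            operands.append(token)
    return commands


# Two commands share the first line, so line boundaries say nothing about command boundaries.
sample = "0 0 0 1 K 2 w\n10 20 100 50 re S"
for op, args in group_commands(sample):
    print(op, args)
```

Here `K` takes four operands, `w` one, `re` four and `S` none, which shows why the operand count has to come from the operator rather than from the line layout.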
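I went back to the `PyMuPDF` documentation and found that `page.cleanContents()` performs exactly this kind of cleanup, so your suggestion of processing line by line applies. It is both more efficient and more intuitive. It already passes my local tests, and I will release a minor version update next.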
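A minimal sketch of the line-by-line idea, assuming the cleaned stream really does carry one command per line as described above. It does not call PyMuPDF; the `cleaned` string is a handwritten stand-in for what a cleaned stream might look like, and the function name is hypothetical rather than part of pdf2docx.

```python
def stroke_cmyk_from_clean_lines(stream):
    """Collect CMYK stroke colors, assuming each line holds exactly one command."""
    colors = []
    for line in stream.splitlines():
        parts = line.split()
        # A stroke-color command is four operands followed by the K operator.
        if len(parts) == 5 and parts[-1] == "K":
            try:
                colors.append(tuple(float(p) for p in parts[:4]))
            except ValueError:
                # Not numeric operands, so this line is not a color command.
                continue
    return colors


# Handwritten stand-in for a cleaned stream (an assumption, not real PyMuPDF output).
cleaned = "BT\n1 0 Td\n( K ) Tj\nET\n0 0 0 1 K\n10 20 100 50 re\nS"
print(stroke_cmyk_from_clean_lines(cleaned))
```

Because the operator is read from the end of each line, the text line `( K ) Tj` is seen as a `Tj` command and is skipped, so the literal `K` no longer collides with the color operator.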
Got it, thanks for your reply.