Comments (4)

dothinking commented on August 10, 2024

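Thanks, this is a good question with a clear explanation. I had never noticed that PDF operators, such as K here, can conflict with text content, even though my test files also contain standalone characters like k and K. This issue also implies that, with the current parsing approach, other operators could conflict in the same way.

So the #36 you submitted is a temporary fix for your specific problem. To solve this thoroughly, could you provide a PDF file that reproduces the issue (feel free to remove any content you cannot make public)? Thanks.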

from pdf2docx.

smilelight commented on August 10, 2024

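I have sent the test PDF to your email: [email protected]. Sorry, my first attempts from a QQ mailbox kept failing to deliver, so I sent it again via Gmail; in total the same content was sent twice.
The parser fails as soon as it reaches page 24, so you can restrict parsing to that page when testing, i.e.: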

parse(pdf_path, docx_path, start=24, end=25)

The error message is:

Processing 24/357...
Traceback (most recent call last):
  File "E:/Projects/GiteeProjects/doc_parser/tests/pdf2docx_test.py", line 11, in <module>
    parse(pdf_path, docx_path, start=24, end=25)
  File "D:\Software\Anaconda3\envs\file_parse\lib\site-packages\pdf2docx\main.py", line 33, in parse
    cv.parse(page).make_page()
  File "D:\Software\Anaconda3\envs\file_parse\lib\site-packages\pdf2docx\converter.py", line 116, in parse
    self.init(page).parse(**self._debug_kwargs)
  File "D:\Software\Anaconda3\envs\file_parse\lib\site-packages\pdf2docx\converter.py", line 94, in init
    self._layout.rects.from_stream(page_content, page.transformationMatrix)
  File "D:\Software\Anaconda3\envs\file_parse\lib\site-packages\pdf2docx\shape\Rectangles.py", line 61, in from_stream
    rects = pdf.rects_from_stream(xref_stream, M)
  File "D:\Software\Anaconda3\envs\file_parse\lib\site-packages\pdf2docx\common\pdf.py", line 335, in rects_from_stream
    c, m, y, k = map(float, lines[i-4:i])
ValueError: could not convert string to float: 'Td'

As you mentioned, other operators may conflict as well. I found that the relevant splitting code in pdf2docx/common/pdf.py is:

# check xref stream word by word (line always changes)    
lines = xref_stream.split()
rects = []

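My guess is that it might be more reasonable to split the stream by line first and then apply the subsequent logic. However, your comment specifically says "word by word (line always changes)", so there must be some other reason that led you to the current strategy.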
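To make the failure mode concrete, here is a minimal sketch of how a whitespace split can mistake text for the K operator. The sample stream is invented for illustration, and `naive_cmyk_strokes` is a hypothetical stand-in that only mimics the `lines[i-4:i]` pattern visible in the traceback; it is not the actual pdf2docx code.

```python
# Hypothetical content stream (an assumption, not taken from the test PDF):
# the second line draws the text "Plan K wins", so a bare "K" token
# appears once the stream is split on whitespace.
SAMPLE_STREAM = """0 0 0 1 K
BT 10 20 Td (Plan K wins) Tj ET"""


def naive_cmyk_strokes(stream):
    """Treat every 'K' token as the CMYK stroke-color operator and
    read the four tokens before it, as a word-by-word scan would."""
    words = stream.split()
    colors = []
    for i, word in enumerate(words):
        if word == "K":
            # Fails when the preceding tokens are not numbers.
            c, m, y, k = map(float, words[i - 4:i])
            colors.append((c, m, y, k))
    return colors


try:
    naive_cmyk_strokes(SAMPLE_STREAM)
except ValueError as err:
    error_message = str(err)
    print(error_message)
```

The first K is a real color command and parses fine; the second K is part of a text string, and the tokens before it include Td, which is what triggers the same kind of ValueError as in the traceback.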


dothinking commented on August 10, 2024

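Thanks for providing the test file and for the suggestion at the end.

According to the PDF specification, PDF content generally takes the form a b c op, where op is an operator (for example, setting a color or drawing a line) and a, b, c, etc. are its operands, whose number varies by operator. However, xref_stream does not hold exactly one command per line; several commands may share a line, which is why it is processed word by word as above.

I looked at the PyMuPDF documentation again and found that page.cleanContents() does exactly this kind of cleanup, so your suggestion of processing line by line applies: it is both more efficient and more intuitive. It already passes my local tests, and I will release a minor version update next.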
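A rough sketch of what line-by-line handling could look like, assuming the cleaned stream really does hold one `operands... op` command per line as described above. The sample text and the helper names `parse_line` and `cmyk_strokes` are hypothetical, and this is not the code that went into pdf2docx.

```python
# Hypothetical cleaned stream: one command per line (an assumption about
# what the cleanup step produces, used here only for illustration).
CLEANED_STREAM = """0 0 0 1 K
BT
10 20 Td
(Plan K wins) Tj
ET"""


def parse_line(line):
    """Split one cleaned line into (operands, operator).

    The operator is taken to be the last token on the line."""
    tokens = line.split()
    if not tokens:
        return [], None
    return tokens[:-1], tokens[-1]


def cmyk_strokes(cleaned_stream):
    """Collect CMYK stroke colors from lines that end with the K operator."""
    colors = []
    for line in cleaned_stream.splitlines():
        operands, op = parse_line(line)
        if op != "K" or len(operands) != 4:
            continue
        try:
            colors.append(tuple(float(x) for x in operands))
        except ValueError:
            # Operands are not numeric, so this is not a color command.
            continue
    return colors


print(cmyk_strokes(CLEANED_STREAM))
```

Because the operator is read from the end of each line, the K inside the text string on the Tj line is never treated as an operator. A string literal that itself spans several lines would still need extra care in this sketch.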


smilelight commented on August 10, 2024

OK, thank you for your reply!
