Comments (6)

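dothinking commented on August 10, 2024

From your description, it looks like the middle of the page contains many lines and curves, so the program treats that region as a vector graphic and simply rasterizes it as a screenshot.

Could you provide a test file? My email address is on my GitHub profile page. If the actual content cannot be shared publicly, you could create a file with a similar layout in Word, convert it to PDF, and check whether the problem reproduces with that file. Knowing the exact layout would help solve this issue and make the converter more robust. Thanks.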

from pdf2docx.

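harrylyf commented on August 10, 2024

Hello. After trying to reproduce a similar layout in Word, I found that the image is actually a scanned page. Judging from previously solved issues, pdf2docx does not seem to support scanned documents yet.
I already have the image in base64 format and tried to add it to the JSON file myself, but that failed. My JSON file and code are as follows: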

{
    "filename": "test.pdf",
    "page_num": 1,
    "pages": [
        {
            "id": 0,
            "width": 595.5,
            "height": 842.2,
            "margin": [
                0.0,
                0.0,
                2.5,
                2.0
            ],
            "blocks": [
                {
                    "bbox": [
                        44.0,
                        335.8,
                        523.1,
                        698.7
                    ],
                    "type": 4,
                    "alignment": 0,
                    "left_space": 0,
                    "right_space": 0,
                    "first_line_space": 0.0,
                    "before_space": 0.0,
                    "after_space": 0.0,
                    "line_space": 0.0,
                    "tab_stops": [],
                    "left_space_total": 0.0,
                    "right_space_total": 0.0,
                    "ext": "jpeg",
                    "width": 2555,
                    "height": 1935,
                    "image": xxxxx
                }
            ],
            "shapes": []
        }
    ]
}
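from pdf2docx import Converter

filename = 'test.pdf'
cv = Converter(filename)
cv.deserialize('test.json')  # load the layout JSON shown above
cv.make_docx('test.docx')    # rebuild the docx from the restored layout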

The generated docx is a blank document. How can I fix this?
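
For reference, the base64 step described above could be sketched as follows. This is a minimal illustration using only the standard library; the helper name embed_image, the trimmed-down layout, and the placeholder bytes are hypothetical, and whether pdf2docx expects a base64 string in the "image" field is an assumption taken from the description above, not something confirmed here.

```python
import base64
import json


def embed_image(layout: dict, image_bytes: bytes, page: int = 0, block: int = 0) -> dict:
    """Store image bytes as a base64 string in one block's "image" field."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    layout["pages"][page]["blocks"][block]["image"] = encoded
    return layout


# Trimmed-down layout with the same nesting as the JSON above (hypothetical data).
layout = {"pages": [{"id": 0, "blocks": [{"type": 4, "ext": "jpeg", "image": None}]}]}
image_bytes = b"\xff\xd8\xff\xe0 placeholder bytes, not a real JPEG"

embed_image(layout, image_bytes)
text = json.dumps(layout, indent=4)  # the text that would be written to test.json
```

Encoding to an ASCII string matters because JSON cannot hold raw bytes; decoding the field with base64.b64decode gives back the original image data.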

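dothinking commented on August 10, 2024

Yes, scanned documents are not supported yet.

As for rebuilding the docx from an existing JSON file, that is indeed a bug in pdf2docx. It can be fixed by changing the source code as follows:

  1. Locate the Blocks.py file (its path can be found as shown below)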
>>> import pdf2docx
>>> pdf2docx.page.Blocks.__file__
'd:\\21_github\\pdf2docx\\pdf2docx\\page\\Blocks.py'
  2. In the restore() method, around line 116, add the line marked by the comment below
# floating image block
elif block_type == BlockType.FLOAT_IMAGE.value:
    block = ImageBlock(raw_block)
    block.set_float_image_block()
    self.floating_image_blocks.append(block)  # add this line

After making this change, run your code again.
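
Why a single missing append leads to a blank document can be shown with a toy model. This is not the pdf2docx source: ToyBlocks and the keep flag are hypothetical, and the value 4 for the floating image type is inferred from the JSON posted earlier in this thread.

```python
from dataclasses import dataclass, field

FLOAT_IMAGE = 4  # block type seen in the JSON above; all other names are made up


@dataclass
class ToyBlocks:
    floating_image_blocks: list = field(default_factory=list)

    def restore(self, raw_blocks, keep=True):
        """Rebuild blocks from raw dicts; keep=False mimics the missing append."""
        for raw in raw_blocks:
            if raw.get("type") == FLOAT_IMAGE:
                block = dict(raw)  # stands in for constructing an image block
                if keep:
                    self.floating_image_blocks.append(block)
        return self


raw = [{"type": 4, "ext": "jpeg"}]
buggy = ToyBlocks().restore(raw, keep=False)
fixed = ToyBlocks().restore(raw, keep=True)
print(len(buggy.floating_image_blocks), len(fixed.floating_image_blocks))  # prints "0 1"
```

In the buggy variant the block is built and then dropped, so any later step that walks the collection finds nothing to render.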

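dothinking commented on August 10, 2024

Also, if the goal is just to insert the images from a scanned PDF into a new Word document unchanged, you do not need pdf2docx. I suggest reading the images with PyMuPDF and then creating the Word document and inserting the images with python-docx, which is likely to be more flexible.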
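
A rough sketch of that approach, assuming PyMuPDF (imported as fitz) and python-docx are installed. The function names, the 6 by 9 inch bounding box, and the choice to re-encode every image as PNG are illustrative assumptions, not part of either library.

```python
import io


def fit_width_inches(width_px, height_px, max_w=6.0, max_h=9.0):
    """Largest width (in inches) that keeps the aspect ratio inside max_w x max_h."""
    return min(max_w, max_h * width_px / height_px)


def scanned_pdf_to_docx(pdf_path, docx_path):
    """Copy every embedded image of a PDF into a new Word document, in page order."""
    # Third-party packages, imported here so the helper above has no dependencies.
    import fitz  # PyMuPDF
    from docx import Document
    from docx.shared import Inches

    document = Document()
    with fitz.open(pdf_path) as pdf:
        for page in pdf:
            for info in page.get_images(full=True):
                xref = info[0]  # first tuple item is the image's xref number
                pix = fitz.Pixmap(pdf, xref)
                if pix.n - pix.alpha >= 4:  # CMYK: convert to RGB before writing PNG
                    pix = fitz.Pixmap(fitz.csRGB, pix)
                stream = io.BytesIO(pix.tobytes("png"))
                width = Inches(fit_width_inches(pix.width, pix.height))
                document.add_picture(stream, width=width)
    document.save(docx_path)
```

Calling scanned_pdf_to_docx('test.pdf', 'test.docx') would write one picture per embedded image. Setting an explicit width keeps a large scan, such as the 2555 x 1935 image above, from overflowing the page.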

harrylyf commented on August 10, 2024

Thanks, I just tried it and it works now!

dothinking commented on August 10, 2024

You're welcome, and thank you for finding this issue :)
