Comments (10)

dothinking commented on August 10, 2024

Floating pictures with python-docx seem to be a common request, so I'm documenting an implementation here for sharing.

# -*- coding: utf-8 -*-

'''
Implement floating image based on python-docx.

- Text wrapping style: BEHIND TEXT <wp:anchor behindDoc="1">
- Picture position: top-left corner of PAGE `<wp:positionH relativeFrom="page">`.

Create a docx sample (Layout | Positions | More Layout Options) and explore the 
source xml (Open as a zip | word | document.xml) to implement other text wrapping
styles and position modes per `CT_Anchor._anchor_xml()`.
'''

from docx.oxml import parse_xml, register_element_cls
from docx.oxml.ns import nsdecls
from docx.oxml.shape import CT_Picture
from docx.oxml.xmlchemy import BaseOxmlElement, OneAndOnlyOne

# refer to docx.oxml.shape.CT_Inline
class CT_Anchor(BaseOxmlElement):
    """
    ``<wp:anchor>`` element, container for a floating image.
    """
    extent = OneAndOnlyOne('wp:extent')
    docPr = OneAndOnlyOne('wp:docPr')
    graphic = OneAndOnlyOne('a:graphic')

    @classmethod
    def new(cls, cx, cy, shape_id, pic, pos_x, pos_y):
        """
        Return a new ``<wp:anchor>`` element populated with the values passed
        as parameters.
        """
        anchor = parse_xml(cls._anchor_xml(pos_x, pos_y))
        anchor.extent.cx = cx
        anchor.extent.cy = cy
        anchor.docPr.id = shape_id
        anchor.docPr.name = 'Picture %d' % shape_id
        anchor.graphic.graphicData.uri = (
            'http://schemas.openxmlformats.org/drawingml/2006/picture'
        )
        anchor.graphic.graphicData._insert_pic(pic)
        return anchor

    @classmethod
    def new_pic_anchor(cls, shape_id, rId, filename, cx, cy, pos_x, pos_y):
        """
        Return a new `wp:anchor` element containing the `pic:pic` element
        specified by the argument values.
        """
        pic_id = 0  # Word doesn't seem to use this, but does not omit it
        pic = CT_Picture.new(pic_id, filename, rId, cx, cy)
        anchor = cls.new(cx, cy, shape_id, pic, pos_x, pos_y)
        anchor.graphic.graphicData._insert_pic(pic)
        return anchor

    @classmethod
    def _anchor_xml(cls, pos_x, pos_y):
        return (
            '<wp:anchor distT="0" distB="0" distL="0" distR="0" simplePos="0" relativeHeight="0" \n'
            '           behindDoc="1" locked="0" layoutInCell="1" allowOverlap="1" \n'
            '           %s>\n'
            '  <wp:simplePos x="0" y="0"/>\n'
            '  <wp:positionH relativeFrom="page">\n'
            '    <wp:posOffset>%d</wp:posOffset>\n'
            '  </wp:positionH>\n'
            '  <wp:positionV relativeFrom="page">\n'
            '    <wp:posOffset>%d</wp:posOffset>\n'
            '  </wp:positionV>\n'                    
            '  <wp:extent cx="914400" cy="914400"/>\n'
            '  <wp:wrapNone/>\n'
            '  <wp:docPr id="666" name="unnamed"/>\n'
            '  <wp:cNvGraphicFramePr>\n'
            '    <a:graphicFrameLocks noChangeAspect="1"/>\n'
            '  </wp:cNvGraphicFramePr>\n'
            '  <a:graphic>\n'
            '    <a:graphicData uri="URI not set"/>\n'
            '  </a:graphic>\n'
            '</wp:anchor>' % ( nsdecls('wp', 'a', 'pic', 'r'), int(pos_x), int(pos_y) )
        )


# refer to docx.parts.story.BaseStoryPart.new_pic_inline
def new_pic_anchor(part, image_descriptor, width, height, pos_x, pos_y):
    """Return a newly-created `wp:anchor` element.

    The element contains the image specified by *image_descriptor* and is scaled
    based on the values of *width* and *height*.
    """
    rId, image = part.get_or_add_image(image_descriptor)
    cx, cy = image.scaled_dimensions(width, height)
    shape_id, filename = part.next_id, image.filename    
    return CT_Anchor.new_pic_anchor(shape_id, rId, filename, cx, cy, pos_x, pos_y)


# refer to docx.text.run.add_picture
def add_float_picture(p, image_path_or_stream, width=None, height=None, pos_x=0, pos_y=0):
    """Add a floating picture at the fixed position (`pos_x`, `pos_y`), measured from the top-left corner of the page.
    """
    run = p.add_run()
    anchor = new_pic_anchor(run.part, image_path_or_stream, width, height, pos_x, pos_y)
    run._r.add_drawing(anchor)

# refer to docx/oxml/__init__.py, where the built-in element classes are registered
register_element_cls('wp:anchor', CT_Anchor)


if __name__ == '__main__':

    from docx import Document
    from docx.shared import Inches, Pt

    document = Document()

    # add a floating image
    p = document.add_paragraph()
    add_float_picture(p, 'test.png', width=Inches(5.0), pos_x=Pt(20), pos_y=Pt(30))

    # add text
    p.add_run('Hello World'*50)


    document.save('output.docx')
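
A note on units: `int(pos_x)` can go straight into `<wp:posOffset>` because python-docx lengths such as `Pt` and `Inches` are integers counted in English Metric Units (EMU), with 914400 EMU per inch and 12700 EMU per point. The sketch below only reproduces that arithmetic with the standard library; the helper names `pt_to_emu` and `inches_to_emu` are hypothetical and not part of python-docx.

```python
# EMU (English Metric Unit) arithmetic behind the offsets used above.
# The helper names are hypothetical; they only illustrate the conversion.
EMU_PER_INCH = 914400
EMU_PER_PT = 12700  # 914400 / 72


def pt_to_emu(points):
    """Convert a length in points to whole EMU."""
    return int(points * EMU_PER_PT)


def inches_to_emu(inches):
    """Convert a length in inches to whole EMU."""
    return int(inches * EMU_PER_INCH)


if __name__ == '__main__':
    # Values matching the sample call: pos_x=Pt(20), pos_y=Pt(30), width=Inches(5.0)
    print(pt_to_emu(20), pt_to_emu(30), inches_to_emu(5.0))
```

So the sample call places the picture 254000 EMU from the left edge and 381000 EMU from the top edge of the page.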

from pdf2docx.

dothinking commented on August 10, 2024

Thanks for providing this case.

Your PDF contains lots of vector graphics, i.e. paths such as lines, curves and their combinations. However, this library currently ignores clipping paths because of a technical issue when extracting these paths from the PDF. Some paths lie outside the page without being clipped, which results in this compression error -2 issue.

Besides, there are two more issues in converting this PDF:

  • The path color is incorrect. I guess the root cause is that only device color spaces (Gray/RGB/CMYK) are currently considered, while this PDF sample may use special color spaces such as Indexed or DeviceN.

  • Overlapping images are removed. python-docx is used to write the converted docx, but it doesn't support floating elements yet, so floating images are removed as a compromise.

So, unfortunately, pdf2docx is not able to convert your PDF for now. At least the following work is needed:

  • clip paths when extracting them from the PDF
  • implement more color spaces
  • introduce floating images
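
To make the "out of page" problem concrete, here is a rough sketch of clamping a path's bounding box to the page rectangle. It is only an illustration under simplifying assumptions: real PDF clipping paths can be arbitrary shapes, this is not how pdf2docx or PyMuPDF handle them, and the function name `clip_bbox_to_page` is hypothetical.

```python
def clip_bbox_to_page(bbox, page_width, page_height):
    """Clamp an (x0, y0, x1, y1) bounding box to the page rectangle.

    Returns the clipped box, or None if the box lies entirely off the page.
    Assumes the page origin is (0, 0) and that x0 <= x1 and y0 <= y1.
    """
    x0, y0, x1, y1 = bbox
    cx0, cy0 = max(x0, 0.0), max(y0, 0.0)
    cx1, cy1 = min(x1, page_width), min(y1, page_height)
    if cx0 >= cx1 or cy0 >= cy1:
        return None  # nothing of the path is visible on the page
    return (cx0, cy0, cx1, cy1)


if __name__ == '__main__':
    # A path sticking out of an A4-sized page (595 x 842 pt) on two sides.
    print(clip_bbox_to_page((-50, 100, 300, 900), 595, 842))
```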

echan00 commented on August 10, 2024

Thanks @dothinking for the clear explanation. I'm surprised this library isn't more popular than it is. The current version is already very good and I know a lot of people can benefit from it.

Please let me know how I can help resolve any of the issues you listed, whether by fixing bugs, testing, or otherwise (I will need some guidance).

dothinking commented on August 10, 2024

Thanks a lot @echan00.

Some progress on this issue:

  • Floating images are now supported.
  • Clipping paths and color spaces: the good news is that the upstream library PyMuPDF has published a new feature for extracting paths. I'll look into it, and hopefully it can resolve this issue.

After that, any tests or suggestions are appreciated.

Comment on 2020-12-31: the latest PyMuPDF 1.18.5 solved this issue partly, but not perfectly, especially for clipping paths.

dothinking commented on August 10, 2024

Since inline images are supported in python-docx, the steps to explore floating images are:

  • create two docx files, one with an inline image and the other with a floating image (in this case, the behind-text mode)
  • compare the source XML of the two files
  • implement the floating image based on the observed structure and the existing code for inline images

Resulting XML structure:

  • an inline image is a <wp:inline> node under <w:drawing>
  • a floating image is a <wp:anchor> node under <w:drawing>
  • besides all the sub-nodes of an inline image, a floating image also contains <wp:positionH> and <wp:positionV>, which define its fixed position

So the idea is to create a <wp:anchor> node, then append the sub-nodes:

  • all the nodes an inline image has
  • <wp:positionH> and <wp:positionV>
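
The comparison step can also be done without unzipping by hand. The following is a minimal sketch using only the Python standard library: it reads `word/document.xml` from a docx file and lists the child elements of every `<wp:inline>` or `<wp:anchor>` found under `<w:drawing>`. The function name `drawing_children` and the file names in the usage note are hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of WordprocessingML elements such as <w:drawing>
W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit('}', 1)[-1]


def drawing_children(docx_file):
    """List (container, [child names]) for each node directly under <w:drawing>.

    `docx_file` is a path or a binary file-like object. The container is
    expected to be 'inline' or 'anchor'.
    """
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read('word/document.xml'))
    result = []
    for drawing in root.iter('{%s}drawing' % W_NS):
        for container in drawing:
            children = [local_name(child.tag) for child in container]
            result.append((local_name(container.tag), children))
    return result
```

Running it on both sample files, e.g. `drawing_children('inline.docx')` and `drawing_children('floating.docx')`, and comparing the two child lists should show the extra position nodes of the floating image.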

echan00 commented on August 10, 2024

Nice @dothinking, it looks like you know exactly what the issues are. I have a variety of PDFs I can help test with once you're ready.

tonysepia commented on August 10, 2024

@dothinking thank you so much for your code sample! Solves my problem perfectly!!!!

dothinking commented on August 10, 2024

I haven't had time for this project for quite a while. The new version v0.5.0 is now available and partly solves this issue:

  • floating images are now supported.
  • path extraction is supported through the upstream library PyMuPDF, but it does not work so well for complicated shapes, e.g. clipping paths.

With this latest version, the sample PDF can be converted successfully, but lots of work is still needed to improve the quality of the converted docx file, due to its complicated/gorgeous style.

echan00 commented on August 10, 2024

Wow this is a great upgrade. Thanks very much for your hard work @dothinking

dothinking commented on August 10, 2024

Closing for now since this issue itself has been resolved.

A lot of effort is still needed to improve the conversion quality for complicated layouts like this test file.
