Running into an error compression error -2 . It would

compression error -2 about pdf2docx (CLOSED)

artifexsoftware commented on August 10, 2024

compression error -2

from pdf2docx.

Comments (10)

dothinking commented on August 10, 2024 13

Floating pictures with python-docx seem to be a common request, so I'm documenting an implementation here for sharing.

# -*- coding: utf-8 -*-

'''
Implement floating image based on python-docx.

- Text wrapping style: BEHIND TEXT <wp:anchor behindDoc="1">
- Picture position: top-left corner of PAGE `<wp:positionH relativeFrom="page">`.

Create a docx sample (Layout | Positions | More Layout Options) and explore the 
source xml (Open as a zip | word | document.xml) to implement other text wrapping
styles and position modes per `CT_Anchor._anchor_xml()`.
'''

from docx.oxml import parse_xml, register_element_cls
from docx.oxml.ns import nsdecls
from docx.oxml.shape import CT_Picture
from docx.oxml.xmlchemy import BaseOxmlElement, OneAndOnlyOne

# refer to docx.oxml.shape.CT_Inline
class CT_Anchor(BaseOxmlElement):
    """
    ``<wp:anchor>`` element, container for a floating image.
    """
    extent = OneAndOnlyOne('wp:extent')
    docPr = OneAndOnlyOne('wp:docPr')
    graphic = OneAndOnlyOne('a:graphic')

    @classmethod
    def new(cls, cx, cy, shape_id, pic, pos_x, pos_y):
        """
        Return a new ``<wp:anchor>`` element populated with the values passed
        as parameters.
        """
        anchor = parse_xml(cls._anchor_xml(pos_x, pos_y))
        anchor.extent.cx = cx
        anchor.extent.cy = cy
        anchor.docPr.id = shape_id
        anchor.docPr.name = 'Picture %d' % shape_id
        anchor.graphic.graphicData.uri = (
            'http://schemas.openxmlformats.org/drawingml/2006/picture'
        )
        anchor.graphic.graphicData._insert_pic(pic)
        return anchor

    @classmethod
    def new_pic_anchor(cls, shape_id, rId, filename, cx, cy, pos_x, pos_y):
        """
        Return a new `wp:anchor` element containing the `pic:pic` element
        specified by the argument values.
        """
        pic_id = 0  # Word doesn't seem to use this, but does not omit it
        pic = CT_Picture.new(pic_id, filename, rId, cx, cy)
        anchor = cls.new(cx, cy, shape_id, pic, pos_x, pos_y)
        # `cls.new()` has already inserted `pic` into <a:graphicData>; a second insert is redundant.
        return anchor

    @classmethod
    def _anchor_xml(cls, pos_x, pos_y):
        return (
            '<wp:anchor distT="0" distB="0" distL="0" distR="0" simplePos="0" relativeHeight="0" \n'
            '           behindDoc="1" locked="0" layoutInCell="1" allowOverlap="1" \n'
            '           %s>\n'
            '  <wp:simplePos x="0" y="0"/>\n'
            '  <wp:positionH relativeFrom="page">\n'
            '    <wp:posOffset>%d</wp:posOffset>\n'
            '  </wp:positionH>\n'
            '  <wp:positionV relativeFrom="page">\n'
            '    <wp:posOffset>%d</wp:posOffset>\n'
            '  </wp:positionV>\n'                    
            '  <wp:extent cx="914400" cy="914400"/>\n'
            '  <wp:wrapNone/>\n'
            '  <wp:docPr id="666" name="unnamed"/>\n'
            '  <wp:cNvGraphicFramePr>\n'
            '    <a:graphicFrameLocks noChangeAspect="1"/>\n'
            '  </wp:cNvGraphicFramePr>\n'
            '  <a:graphic>\n'
            '    <a:graphicData uri="URI not set"/>\n'
            '  </a:graphic>\n'
            '</wp:anchor>' % ( nsdecls('wp', 'a', 'pic', 'r'), int(pos_x), int(pos_y) )
        )


# refer to docx.parts.story.BaseStoryPart.new_pic_inline
def new_pic_anchor(part, image_descriptor, width, height, pos_x, pos_y):
    """Return a newly-created `wp:anchor` element.

    The element contains the image specified by *image_descriptor* and is scaled
    based on the values of *width* and *height*.
    """
    rId, image = part.get_or_add_image(image_descriptor)
    cx, cy = image.scaled_dimensions(width, height)
    shape_id, filename = part.next_id, image.filename    
    return CT_Anchor.new_pic_anchor(shape_id, rId, filename, cx, cy, pos_x, pos_y)


# refer to docx.text.run.Run.add_picture
def add_float_picture(p, image_path_or_stream, width=None, height=None, pos_x=0, pos_y=0):
    """Add a floating picture to paragraph `p`, offset by (`pos_x`, `pos_y`) from the
    top-left corner of the page.
    """
    run = p.add_run()
    anchor = new_pic_anchor(run.part, image_path_or_stream, width, height, pos_x, pos_y)
    run._r.add_drawing(anchor)

# refer to docx.oxml.__init__.py, where `wp:inline` is registered the same way
register_element_cls('wp:anchor', CT_Anchor)


if __name__ == '__main__':

    from docx import Document
    from docx.shared import Inches, Pt

    document = Document()

    # add a floating image
    p = document.add_paragraph()
    add_float_picture(p, 'test.png', width=Inches(5.0), pos_x=Pt(20), pos_y=Pt(30))

    # add text
    p.add_run('Hello World'*50)


    document.save('output.docx')
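
A note on units: the offsets and sizes written into `<wp:posOffset>` and `<wp:extent>` are in EMU (English Metric Units), which is why the placeholder `cx="914400"` above corresponds to one inch. The small sketch below reproduces that arithmetic without python-docx; the helper names `pt_to_emu` and `inches_to_emu` are made up for illustration and are not part of any library.

```python
# Illustrative helpers only; python-docx provides docx.shared.Pt / Inches for real use.
EMU_PER_INCH = 914400  # EMU per inch in Office Open XML
EMU_PER_PT = 12700     # 914400 / 72 points per inch


def pt_to_emu(points: float) -> int:
    """Convert points to EMU."""
    return int(round(points * EMU_PER_PT))


def inches_to_emu(inches: float) -> int:
    """Convert inches to EMU."""
    return int(round(inches * EMU_PER_INCH))


if __name__ == "__main__":
    # The values used in the sample above: pos_x=Pt(20), pos_y=Pt(30), width=Inches(5.0)
    print(pt_to_emu(20), pt_to_emu(30), inches_to_emu(5.0))
```

So `pos_x=Pt(20)` in the sample ends up as a `<wp:posOffset>` of 254000.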

dothinking commented on August 10, 2024 1

Thanks for providing this case.

Your pdf contains lots of vector graphics, i.e. paths such as lines, curves and their combinations. However, clipping paths are currently ignored by this library due to a technical issue when extracting these paths from the pdf. Some paths extend beyond the page without being clipped, which results in this compression error -2 issue.
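
To make the "out of page" problem concrete, here is a rough sketch that clamps a path's bounding box to the page rectangle. It is only an illustration under simplifying assumptions: it works on axis-aligned bounding boxes, not on real clipping paths, and `clip_to_page` is a hypothetical helper, not code from pdf2docx or PyMuPDF.

```python
from typing import Optional, Tuple

# (x0, y0, x1, y1) with x0 < x1 and y0 < y1, in PDF points
Rect = Tuple[float, float, float, float]


def clip_to_page(bbox: Rect, page: Rect) -> Optional[Rect]:
    """Return the part of `bbox` that lies inside `page`, or None if nothing is left."""
    x0 = max(bbox[0], page[0])
    y0 = max(bbox[1], page[1])
    x1 = min(bbox[2], page[2])
    y1 = min(bbox[3], page[3])
    if x0 >= x1 or y0 >= y1:
        return None  # the path lies completely outside the page
    return (x0, y0, x1, y1)


if __name__ == "__main__":
    a4 = (0.0, 0.0, 595.0, 842.0)
    print(clip_to_page((-50.0, 100.0, 200.0, 900.0), a4))  # partly outside the page
    print(clip_to_page((600.0, 0.0, 700.0, 50.0), a4))     # fully outside the page
```

A bounding-box clamp like this only guards against shapes that spill over the page edge; honoring an arbitrary clipping path needs real geometry support from the extraction step.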

Besides, there are two more issues in converting this pdf:

- The path color is incorrect. I guess the root cause is that only Device color spaces (Gray/RGB/CMYK) are currently considered, while this pdf sample may use special color spaces such as Indexed or DeviceN.
- Overlapping images are removed. python-docx is used to write the converted docx, but it doesn't support floating elements yet, so floating images are removed as a compromise.
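
As an illustration of the color-space point, the sketch below maps the three Device color spaces to 8-bit RGB and refuses anything else (for example Indexed or DeviceN). The function name `device_color_to_rgb` is hypothetical, and the CMYK branch uses the naive formula without any ICC profile, so treat it as an assumption-laden example rather than how pdf2docx handles colors.

```python
from typing import Sequence, Tuple


def device_color_to_rgb(colorspace: str, components: Sequence[float]) -> Tuple[int, int, int]:
    """Convert color components in [0, 1] from a Device color space to 8-bit RGB."""
    if colorspace == "DeviceGray":
        (g,) = components
        rgb = (g, g, g)
    elif colorspace == "DeviceRGB":
        r, g, b = components
        rgb = (r, g, b)
    elif colorspace == "DeviceCMYK":
        c, m, y, k = components
        # Naive conversion, no color management
        rgb = ((1 - c) * (1 - k), (1 - m) * (1 - k), (1 - y) * (1 - k))
    else:
        # Special color spaces (Indexed, DeviceN, ...) need extra handling
        raise ValueError("unsupported color space: %s" % colorspace)
    return tuple(int(round(255 * v)) for v in rgb)


if __name__ == "__main__":
    print(device_color_to_rgb("DeviceCMYK", [0, 1, 1, 0]))
    print(device_color_to_rgb("DeviceGray", [1.0]))
```

Raising on an unknown color space, instead of silently falling back to black, makes it easier to spot which pdf samples need the extra color-space support.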

So, unfortunately, pdf2docx is not able to convert your pdf for now. At least the following work needs to be done:

- clip paths when extracting them from the pdf
- support more color spaces
- introduce floating images

echan00 commented on August 10, 2024 1

Thanks @dothinking for the clear explanation. I'm surprised this library isn't more popular than it is. The current version is already very good and I know a lot of people can benefit from it.

Please let me know how I can help resolve any of the issues you listed, whether by fixing bugs, testing, or otherwise (I will need some guidance).

dothinking commented on August 10, 2024 1

Thanks a lot @echan00.

Some progress on this issue:

- Floating images are now supported.
- Clipping paths and color spaces: good news, the upstream library PyMuPDF has published a new feature for extracting paths. I'll look into it and hopefully it can resolve this issue.

After that, any tests or suggestions are appreciated.

Comment on 2020-12-31: the latest PyMuPDF 1.18.5 solved this issue partly, but not perfectly, especially for clipping paths.

dothinking commented on August 10, 2024

Since inline images are already supported in python-docx, these are the steps to explore floating images:

- create two docx files, one with an inline image and another with a floating image (for this case, in "behind text" mode)
- compare the source xml of these two files
- implement the floating image based on the observed structure and the existing code for inline images

Results for the xml structure:

- an inline image is a <wp:inline> node under <w:drawing>
- a floating image is a <wp:anchor> node under <w:drawing>
- besides all sub-nodes of an inline image, a floating image also contains <wp:positionH> and <wp:positionV> to define the fixed position

So, the idea is to create a <wp:anchor> node, then append these sub-nodes:

- all the nodes of an inline image
- <wp:positionH> and <wp:positionV>
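
The comparison step can also be scripted. The sketch below uses only the standard library to list, for each <w:drawing> in a document.xml, whether it holds an inline or a floating (anchor) image. The function names and the tiny `SAMPLE` document are made up for illustration; a real docx contains many more sub-nodes.

```python
import xml.etree.ElementTree as ET
import zipfile

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
WP = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"


def read_document_xml(docx_path):
    """Read word/document.xml from a docx file (a docx is a zip archive)."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.read("word/document.xml")


def classify_drawings(document_xml):
    """Return 'inline' or 'anchor' for each image container found under <w:drawing>."""
    root = ET.fromstring(document_xml)
    kinds = []
    for drawing in root.iter("{%s}drawing" % W):
        for child in drawing:
            if child.tag == "{%s}inline" % WP:
                kinds.append("inline")
            elif child.tag == "{%s}anchor" % WP:
                kinds.append("anchor")
    return kinds


# Heavily trimmed stand-in for a real document.xml
SAMPLE = (
    '<w:document xmlns:w="%s" xmlns:wp="%s"><w:body><w:p>'
    '<w:r><w:drawing><wp:inline/></w:drawing></w:r>'
    '<w:r><w:drawing><wp:anchor>'
    '<wp:positionH relativeFrom="page"><wp:posOffset>0</wp:posOffset></wp:positionH>'
    '<wp:positionV relativeFrom="page"><wp:posOffset>0</wp:posOffset></wp:positionV>'
    '</wp:anchor></w:drawing></w:r>'
    '</w:p></w:body></w:document>' % (W, WP)
)

if __name__ == "__main__":
    print(classify_drawings(SAMPLE))
```

For a real file, `classify_drawings(read_document_xml('output.docx'))` would report the containers in that document, assuming the file exists.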

echan00 commented on August 10, 2024

Nice @dothinking, it looks like you know what the issues are exactly. I have a variety of PDFs I can help test once you're ready

tonysepia commented on August 10, 2024

@dothinking thank you so much for your code sample! Solves my problem perfectly!!!!

dothinking commented on August 10, 2024

I haven't had time for this project for quite a while. The new version v0.5.0 is now available and partly solves this issue:

- Floating images are now supported.
- Path extraction is handled by the upstream library PyMuPDF, but it does not work so well for complicated shapes, e.g. clipping paths.

With this latest version, the sample pdf can be converted successfully, but lots of work is still needed to improve the quality of the converted docx file, due to its complicated/gorgeous style.

echan00 commented on August 10, 2024

Wow this is a great upgrade. Thanks very much for your hard work @dothinking

dothinking commented on August 10, 2024

Closing for now since the issue itself has been resolved.

Lots of effort is still needed to improve the conversion quality for complicated layouts like this test file.
