Comments (10)
It seems that floating pictures with python-docx are a common request, so I'm documenting the approach here for sharing.
# -*- coding: utf-8 -*-
'''
Implement floating image based on python-docx.

- Text wrapping style: BEHIND TEXT `<wp:anchor behindDoc="1">`
- Picture position: top-left corner of PAGE `<wp:positionH relativeFrom="page">`.

Create a docx sample (Layout | Positions | More Layout Options) and explore the
source xml (Open as a zip | word | document.xml) to implement other text wrapping
styles and position modes per `CT_Anchor._anchor_xml()`.
'''
from docx.oxml import parse_xml, register_element_cls
from docx.oxml.ns import nsdecls
from docx.oxml.shape import CT_Picture
from docx.oxml.xmlchemy import BaseOxmlElement, OneAndOnlyOne


# refer to docx.oxml.shape.CT_Inline
class CT_Anchor(BaseOxmlElement):
    """
    ``<wp:anchor>`` element, container for a floating image.
    """
    extent = OneAndOnlyOne('wp:extent')
    docPr = OneAndOnlyOne('wp:docPr')
    graphic = OneAndOnlyOne('a:graphic')

    @classmethod
    def new(cls, cx, cy, shape_id, pic, pos_x, pos_y):
        """
        Return a new ``<wp:anchor>`` element populated with the values passed
        as parameters.
        """
        anchor = parse_xml(cls._anchor_xml(pos_x, pos_y))
        anchor.extent.cx = cx
        anchor.extent.cy = cy
        anchor.docPr.id = shape_id
        anchor.docPr.name = 'Picture %d' % shape_id
        anchor.graphic.graphicData.uri = (
            'http://schemas.openxmlformats.org/drawingml/2006/picture'
        )
        anchor.graphic.graphicData._insert_pic(pic)
        return anchor

    @classmethod
    def new_pic_anchor(cls, shape_id, rId, filename, cx, cy, pos_x, pos_y):
        """
        Return a new `wp:anchor` element containing the `pic:pic` element
        specified by the argument values.
        """
        pic_id = 0  # Word doesn't seem to use this, but does not omit it
        pic = CT_Picture.new(pic_id, filename, rId, cx, cy)
        anchor = cls.new(cx, cy, shape_id, pic, pos_x, pos_y)
        # `new()` has already inserted `pic`; this repeat mirrors the inline
        # counterpart and appears to be harmless.
        anchor.graphic.graphicData._insert_pic(pic)
        return anchor

    @classmethod
    def _anchor_xml(cls, pos_x, pos_y):
        # pos_x and pos_y are offsets in EMU from the top-left corner of the page
        return (
            '<wp:anchor distT="0" distB="0" distL="0" distR="0" simplePos="0" relativeHeight="0" \n'
            '           behindDoc="1" locked="0" layoutInCell="1" allowOverlap="1" \n'
            '           %s>\n'
            '  <wp:simplePos x="0" y="0"/>\n'
            '  <wp:positionH relativeFrom="page">\n'
            '    <wp:posOffset>%d</wp:posOffset>\n'
            '  </wp:positionH>\n'
            '  <wp:positionV relativeFrom="page">\n'
            '    <wp:posOffset>%d</wp:posOffset>\n'
            '  </wp:positionV>\n'
            '  <wp:extent cx="914400" cy="914400"/>\n'
            '  <wp:wrapNone/>\n'
            '  <wp:docPr id="666" name="unnamed"/>\n'
            '  <wp:cNvGraphicFramePr>\n'
            '    <a:graphicFrameLocks noChangeAspect="1"/>\n'
            '  </wp:cNvGraphicFramePr>\n'
            '  <a:graphic>\n'
            '    <a:graphicData uri="URI not set"/>\n'
            '  </a:graphic>\n'
            '</wp:anchor>' % (nsdecls('wp', 'a', 'pic', 'r'), int(pos_x), int(pos_y))
        )


# refer to docx.parts.story.BaseStoryPart.new_pic_inline
def new_pic_anchor(part, image_descriptor, width, height, pos_x, pos_y):
    """Return a newly-created `wp:anchor` element.

    The element contains the image specified by *image_descriptor* and is scaled
    based on the values of *width* and *height*.
    """
    rId, image = part.get_or_add_image(image_descriptor)
    cx, cy = image.scaled_dimensions(width, height)
    shape_id, filename = part.next_id, image.filename
    return CT_Anchor.new_pic_anchor(shape_id, rId, filename, cx, cy, pos_x, pos_y)


# refer to docx.text.run.add_picture
def add_float_picture(p, image_path_or_stream, width=None, height=None, pos_x=0, pos_y=0):
    """Add a floating picture at the fixed position (`pos_x`, `pos_y`), measured
    from the top-left corner of the page.
    """
    run = p.add_run()
    anchor = new_pic_anchor(run.part, image_path_or_stream, width, height, pos_x, pos_y)
    run._r.add_drawing(anchor)


# refer to docx.oxml.shape.__init__.py
register_element_cls('wp:anchor', CT_Anchor)


if __name__ == '__main__':
    from docx import Document
    from docx.shared import Inches, Pt

    document = Document()

    # add a floating image
    p = document.add_paragraph()
    add_float_picture(p, 'test.png', width=Inches(5.0), pos_x=Pt(20), pos_y=Pt(30))

    # add text
    p.add_run('Hello World'*50)

    document.save('output.docx')
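The docstring above suggests adapting `_anchor_xml()` for other wrapping styles and position modes. A minimal sketch of what that could look like follows; it is not part of the posted code. The function name `anchor_xml_sketch`, the `WRAP_XML` table and the keyword arguments are hypothetical, and only the `page` and `margin` values of `relativeFrom` are assumed here because they are valid for both axes. It uses only the standard library, so the namespaces are written out instead of coming from `nsdecls()`.

```python
import xml.etree.ElementTree as ET

WP_NS = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
EMU_PER_PT = 12700  # 914400 EMU per inch / 72 pt per inch

# Wrapping elements that can take the place of <wp:wrapNone/> (hypothetical table).
WRAP_XML = {
    "none": "<wp:wrapNone/>",
    "square": '<wp:wrapSquare wrapText="bothSides"/>',
    "top_and_bottom": "<wp:wrapTopAndBottom/>",
}


def anchor_xml_sketch(pos_x, pos_y, relative_from="page", behind_text=True, wrap="none"):
    """Hypothetical, parameterised variant of CT_Anchor._anchor_xml() (offsets in EMU)."""
    return (
        '<wp:anchor xmlns:wp="%s" xmlns:a="%s" distT="0" distB="0" distL="0" distR="0"'
        ' simplePos="0" relativeHeight="0" behindDoc="%d" locked="0"'
        ' layoutInCell="1" allowOverlap="1">'
        '<wp:simplePos x="0" y="0"/>'
        '<wp:positionH relativeFrom="%s"><wp:posOffset>%d</wp:posOffset></wp:positionH>'
        '<wp:positionV relativeFrom="%s"><wp:posOffset>%d</wp:posOffset></wp:positionV>'
        '<wp:extent cx="914400" cy="914400"/>'
        '%s'
        '<wp:docPr id="666" name="unnamed"/>'
        '<wp:cNvGraphicFramePr><a:graphicFrameLocks noChangeAspect="1"/></wp:cNvGraphicFramePr>'
        '<a:graphic><a:graphicData uri="URI not set"/></a:graphic>'
        '</wp:anchor>'
    ) % (WP_NS, A_NS, int(behind_text), relative_from, int(pos_x),
         relative_from, int(pos_y), WRAP_XML[wrap])


# Square wrapping, positioned 20pt / 30pt from the page margins.
xml = anchor_xml_sketch(20 * EMU_PER_PT, 30 * EMU_PER_PT,
                        relative_from="margin", behind_text=False, wrap="square")
root = ET.fromstring(xml)  # raises if the generated XML is not well-formed
print(root.find("{%s}positionH" % WP_NS).get("relativeFrom"))
```

The element order is kept the same as in `_anchor_xml()`, since it was copied from a docx produced by Word. Whether Word accepts every combination of these options has not been checked here, so comparing against a hand-made sample docx remains the safer route.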
from pdf2docx.
Thanks for providing this case.
Lots of vector graphics, i.e. paths such as lines, curves and their combinations, exist in your pdf. However, clipping paths are currently ignored by this library due to a technical issue when extracting these paths from the pdf. Some paths lie outside the page without being clipped, which results in this compression error -2 issue.
Besides, there are two more issues in converting this pdf:
- The path color is incorrect. I guess the root cause is that currently only Device Color Spaces (Gray/RGB/CMYK) are considered, while this pdf sample may use special color spaces like Indexed CS or DeviceN CS.
- Overlapped images are removed. python-docx is used to write the converted docx, but python-docx doesn't support floating elements now. So, floating images are removed as a compromise.
So, unfortunately, pdf2docx is not able to convert your pdf for now. At least the following efforts should be made:
- clip paths when extracting paths from the pdf
- implement more color spaces
- introduce floating images
from pdf2docx.
Thanks @dothinking for the clear explanation. I'm surprised this library isn't more popular than it is. The current version is already very good and I know a lot of people can benefit from it.
Please let me know how I can help resolve any of the issues you listed (I will need some guidance), whether by fixing bugs, testing, or otherwise.
from pdf2docx.
Thanks a lot @echan00.
Some progress on this issue:
- floating image is supported.
- clip path and color space -> good news: the upstream library PyMuPDF has published a new feature for extracting paths. I'll look into it and hopefully can resolve this issue.
After that, any tests or suggestions are appreciated.
Comment on 2020-12-31: the latest PyMuPDF 1.18.5 solved this issue partly, but not perfectly, especially for clipping paths.
from pdf2docx.
Since inline images are supported in python-docx, the steps to explore floating images are:
- create two docx files, one with an inline image and another with a floating image (for this case, the behind text mode)
- check the difference in source xml between these two files
- implement floating image based on the observed structure and the code for inline image
xml structure results:
- an inline image is a <wp:inline> node under <w:drawing>
- a floating image is a <wp:anchor> node under <w:drawing>
- besides all sub-nodes of an inline image, a floating image also contains <wp:positionH> and <wp:positionV> to define the fixed position
So, the idea is to create a <wp:anchor> node, then append sub-nodes:
- all nodes that an inline image has
- <wp:positionH> and <wp:positionV>
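The comparison step described above can be done without unzipping by hand. The sketch below is only an illustration using the standard library: the helper name `drawing_children` and the file names in the usage comment are hypothetical. It lists the child tags of every <wp:inline> or <wp:anchor> container found under <w:drawing>.

```python
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def drawing_children(docx_file):
    """Return [(container, [child tags])] for each node directly under <w:drawing>.

    `docx_file` may be a path or a binary file-like object; a docx is a zip
    archive whose main body lives in word/document.xml.
    """
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    result = []
    for drawing in root.iter("{%s}drawing" % W_NS):
        for container in drawing:  # expected: wp:inline or wp:anchor
            children = [local_name(child.tag) for child in container]
            result.append((local_name(container.tag), children))
    return result


# Usage with two hand-made samples (hypothetical file names):
# inline = drawing_children("inline.docx")
# floating = drawing_children("floating.docx")
# print(set(floating[0][1]) - set(inline[0][1]))  # sub-nodes only the anchor has
```

This only compares direct children of the container; attributes such as behindDoc still need to be read from the xml itself.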
from pdf2docx.
Nice @dothinking, it looks like you know what the issues are exactly. I have a variety of PDFs I can help test once you're ready
from pdf2docx.
@dothinking thank you so much for your code sample! Solves my problem perfectly!!!!
from pdf2docx.
I didn't get time for this project for a long while. New version v0.5.0 is now available and partly solves this issue:
- floating image is now supported.
- path extraction is supported by the upstream library PyMuPDF, but it is not so good for complicated shapes, e.g. clipping paths.
With this latest version, the sample pdf can be converted successfully, but a lot of work is still needed to improve the quality of the converted docx file, due to the complicated/gorgeous style.
from pdf2docx.
Wow this is a great upgrade. Thanks very much for your hard work @dothinking
from pdf2docx.
Closing for now since this issue itself was resolved.
A lot of effort is still needed to improve the conversion quality for complicated layouts like this test file.
from pdf2docx.