
Comments (3)

inzhir commented on August 10, 2024

Thank you for your fast reply!

I fixed the code this way:

if isinstance(self.parent._parent, Cell):
    delete_paragraph(doc.paragraphs[-1])

(added _parent) and now everything works.

from pdf2docx.

dothinking commented on August 10, 2024

Many thanks for reporting this issue.

The root cause is that, in a Word document, an empty line is always created automatically when a nested table is inserted into a cell. So the fix is to delete that line; this will be included in the next version.

For now, you can modify the source code based on version 0.5.1:

  1. Find Blocks.py
>>> import pdf2docx
>>> pdf2docx.page.Blocks.__file__
'd:\\workspace\\github\\pdf2docx\\pdf2docx\\page\\Blocks.py'
  2. Check lines 523-538 in method make_docx()
for block in self._instances:
    # make paragraphs
    if block.is_text_image_block():
        # new paragraph
        p = doc.add_paragraph()
        block.make_docx(p)

        pre_table = False # mark block type
    
    # make table
    elif block.is_table_block():
        make_table(block, pre_table)
        pre_table = True # mark block type

# below table processing is necessary for page level only
if isinstance(self.parent, Cell): return
  3. Change lines 523-538 above as follows
for block in self._instances:
    # make paragraphs
    if block.is_text_image_block():                
        # new paragraph
        p = doc.add_paragraph()
        block.make_docx(p)

        pre_table = False # mark block type
    
    # make table
    elif block.is_table_block():
        make_table(block, pre_table)
        pre_table = True # mark block type

        # NOTE: within a cell, there is always an empty paragraph after table,
        # so, delete it right here
        if isinstance(self.parent, Cell):
            delete_paragraph(doc.paragraphs[-1])

Remember to import the function delete_paragraph() accordingly:

from ..common.docx import reset_paragraph_format, delete_paragraph
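
To illustrate what "delete the empty paragraph after the nested table" amounts to structurally, here is a minimal sketch using only the Python standard library. It is not the pdf2docx or python-docx implementation: the element names are simplified (no WordprocessingML namespace), and drop_trailing_empty_paragraph() is a hypothetical helper introduced only for this example.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a table cell: some text, a nested table, and the
# empty paragraph that gets added automatically after the nested table.
CELL_XML = "<tc><p><r><t>text</t></r></p><tbl><tr><tc><p/></tc></tr></tbl><p/></tc>"


def drop_trailing_empty_paragraph(cell):
    """Remove the last child of `cell` if it is an empty <p> directly after a <tbl>.

    Hypothetical helper for illustration; returns True if something was removed.
    """
    children = list(cell)
    if len(children) < 2:
        return False
    last, prev = children[-1], children[-2]
    # "Empty" here means: no child elements and no text content.
    is_empty_p = last.tag == "p" and len(last) == 0 and not (last.text or "").strip()
    if is_empty_p and prev.tag == "tbl":
        cell.remove(last)
        return True
    return False


cell = ET.fromstring(CELL_XML)
removed = drop_trailing_empty_paragraph(cell)
print(removed, [child.tag for child in cell])
```

The check on the preceding sibling keeps the sketch from touching an empty paragraph that was not introduced by a nested table.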


dothinking commented on August 10, 2024

Ah, my fix was based on dev. In version 0.5.1, self.parent is a Layout, and Layout._parent is the Cell.

So, thank you for figuring it out.
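
The difference between the two parent chains described in this thread can be sketched with stand-in classes. This is only an illustration under assumptions: Cell, Layout and Blocks below are minimal placeholders, not the real pdf2docx classes, and is_inside_cell() is a hypothetical helper that accepts either chain (self.parent being a Cell, or self.parent._parent being a Cell).

```python
class Cell:
    """Placeholder for the pdf2docx table cell class."""


class Layout:
    """Placeholder for the object that sits between Blocks and Cell in 0.5.1."""

    def __init__(self, parent=None):
        self._parent = parent


class Blocks:
    """Placeholder for the block collection whose make_docx() is patched."""

    def __init__(self, parent=None):
        self.parent = parent


def is_inside_cell(blocks):
    """Return True if `blocks` belongs to a Cell, directly or via one intermediate parent.

    Hypothetical helper: covers both `parent -> Cell` and
    `parent -> Layout`, `Layout._parent -> Cell`.
    """
    parent = blocks.parent
    if isinstance(parent, Cell):
        return True
    return isinstance(getattr(parent, "_parent", None), Cell)


direct = Blocks(parent=Cell())                   # chain described for dev
via_layout = Blocks(parent=Layout(parent=Cell()))  # chain described for 0.5.1
page_level = Blocks(parent=Layout(parent=None))  # not inside any cell

print(is_inside_cell(direct), is_inside_cell(via_layout), is_inside_cell(page_level))
```

A check written this way would not need to change between the two versions, at the cost of reaching into the private _parent attribute.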
