Comments (7)
I'm afraid not, at least for now. A table has two roles in this library: a lattice table for structured data, and a stream table for float layout (e.g. multiple paragraphs in the same row). So I guess the "false positive tables" are stream tables, which are required to reproduce the layout.
Could you please share a test case? Maybe we can find a workaround or make some changes accordingly.
from pdf2docx.
Thanks for replying. I think a better option would be a flag to select stream tables only, lattice tables only, both stream and lattice, or none. This way we would have much more control over document conversion. Unfortunately I can't share the document, as it's an internal document of my organisation.
from pdf2docx.
May I know your purpose in using this lib: the PDF text only, or both text and layout (e.g. paragraph spacing, indentation and text format)? If you are concerned only with the text, I'd suggest using PyMuPDF directly, or similar PDF processing libs like pikepdf, PyPDF, pdfminer...
The lib extracts PDF text with an upstream lib, PyMuPDF, and focuses on rebuilding the layout, while the stream table is responsible for the layout (it splits a float layout into a flow layout in each cell). If a stream table appears in your docx, the lib considered it necessary for the layout. In that case, if you turn it off, the layout rebuilding may fail.
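For the text-only case mentioned above, a minimal sketch with PyMuPDF might look like the following. The function name `extract_text` and the file path are illustrative assumptions, not part of pdf2docx; only `fitz.open()` and `page.get_text()` are PyMuPDF calls.

```python
def extract_text(pdf_path: str) -> str:
    """Return the plain text of every page, pages separated by a form feed."""
    # Imported inside the function so this module still loads when
    # PyMuPDF (import name: fitz) is not installed.
    import fitz

    with fitz.open(pdf_path) as doc:
        # Iterating a document yields its pages in order;
        # get_text() with no arguments returns plain text.
        return "\f".join(page.get_text() for page in doc)


# Usage (hypothetical path):
#   text = extract_text("/path/to/sample.pdf")
```

This skips layout rebuilding entirely, so no tables of either kind are produced.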
from pdf2docx.
Closing this for now. Feel free to reopen it if you have any new thoughts.
from pdf2docx.
Hey, is there any way I can have only lattice table detection and switch off stream table detection?
from pdf2docx.
Hey, is there any way I can have only lattice table detection and switch off stream table detection?
As described before, the stream table is required to rebuild the page layout. But if you don't care much about the layout, yes, you can switch it off. For now, you need to do it manually by commenting out the stream table parsing in the source. Take version 0.5.1 for example:
- go to file Layout.py, which you can locate with:
>>> import pdf2docx
>>> pdf2docx.page.Layout.__file__
'd:\\21_github\\pdf2docx\\pdf2docx\\page\\Layout.py'
- go to method _parse_layout_bottom_up() (Line 119), then comment out the stream table parsing as needed.
def _parse_layout_bottom_up(self, settings:dict):
    '''Parse layout bottom-up:
    * detect explicit tables first based on shapes,
    * then stream tables based on original text blocks and parsed explicit tables;
    * move table contained blocks (text block or explicit table) to associated cell layout.
    '''
    # parse table structure/format recognized from explicit shapes
    self._table_parser.lattice_tables(
        settings['connected_border_tolerance'],
        settings['min_border_clearance'],
        settings['max_border_width'])
    # parse table structure based on implicit layout of text blocks
    self._table_parser.stream_tables(
        settings['min_border_clearance'],
        settings['max_border_width'],
        settings['float_layout_tolerance'],
        settings['line_separate_threshold'])
I think a better option would be a flag to select stream tables only, lattice tables only, both stream and lattice, or none. This way we would have much more control over document conversion.
I was mostly concerned about the role of the stream table in the page layout. But as suggested, a flexible table parsing option does accommodate more use cases. As seen in method _parse_layout_bottom_up, it's easy to add such a flag in settings. I will do it in the next version.
Thank you so much for this good suggestion.
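As a rough illustration of how such a flag could gate the two parsing steps, here is a self-contained sketch. It is not the library's actual implementation: `RecordingTableParser` is a stand-in that only records calls, the numeric setting values are placeholders, and the flag names are borrowed from the `parse_lattice_table` / `parse_stream_table` arguments mentioned for v0.5.2 in this thread.

```python
class RecordingTableParser:
    """Stand-in for the real table parser; it only records which steps ran."""

    def __init__(self):
        self.calls = []

    def lattice_tables(self, *args):
        self.calls.append("lattice")

    def stream_tables(self, *args):
        self.calls.append("stream")


def parse_layout_bottom_up(table_parser, settings: dict) -> None:
    """Run each table parsing step only if its flag is enabled (default: on)."""
    # explicit tables recognized from shapes
    if settings.get("parse_lattice_table", True):
        table_parser.lattice_tables(
            settings["connected_border_tolerance"],
            settings["min_border_clearance"],
            settings["max_border_width"])

    # implicit tables inferred from the layout of text blocks
    if settings.get("parse_stream_table", True):
        table_parser.stream_tables(
            settings["min_border_clearance"],
            settings["max_border_width"],
            settings["float_layout_tolerance"],
            settings["line_separate_threshold"])


# Placeholder values for illustration only, not the library defaults.
demo_settings = {
    "connected_border_tolerance": 0.5,
    "min_border_clearance": 2.0,
    "max_border_width": 6.0,
    "float_layout_tolerance": 0.1,
    "line_separate_threshold": 5.0,
    "parse_stream_table": False,  # lattice tables only
}

parser = RecordingTableParser()
parse_layout_bottom_up(parser, demo_settings)
print(parser.calls)
```

Defaulting both flags to on keeps the existing behaviour for callers who do not set them.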
from pdf2docx.
v0.5.2 is available on PyPI now. The arguments parse_lattice_table and parse_stream_table, both True by default, should work for this issue. The usage might look like:
from pdf2docx import Converter
pdf_file = '/path/to/sample.pdf'
docx_file = 'path/to/sample.docx'
# parse lattice tables only
cv = Converter(pdf_file)
cv.convert(docx_file, parse_stream_table=False)
cv.close()
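The four combinations requested earlier in the thread (stream only, lattice only, both, none) map directly onto these two booleans. A small helper could make that explicit; the mode names and the function `table_flags` below are hypothetical conveniences, and only the two keyword arguments come from the library.

```python
from typing import Dict

# Hypothetical mode names -> (parse_lattice_table, parse_stream_table)
_MODES = {
    "both": (True, True),
    "lattice": (True, False),
    "stream": (False, True),
    "none": (False, False),
}


def table_flags(mode: str) -> Dict[str, bool]:
    """Translate a mode name into keyword arguments for Converter.convert()."""
    try:
        lattice, stream = _MODES[mode]
    except KeyError:
        raise ValueError(f"unknown table mode: {mode!r}") from None
    return {"parse_lattice_table": lattice, "parse_stream_table": stream}


# Usage with pdf2docx >= 0.5.2:
#   cv = Converter(pdf_file)
#   cv.convert(docx_file, **table_flags("lattice"))
#   cv.close()
```

Keep in mind the earlier caveat: with stream tables disabled, the layout rebuilding may fail for pages that rely on float layout.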
from pdf2docx.