opendatalab / mineru

A one-stop, open-source, high-quality data extraction tool that supports PDF, web page, and e-book extraction.

Home Page: https://opendatalab.com/OpenSourceTools

License: GNU Affero General Public License v3.0

Python 100.00%
extract-data layout-analysis ocr parser pdf pdf-converter python document-analysis pdf-parser pdf-extractor-llm

mineru's People

Contributors

1shuimo, conghui, drunkpig, dt-yy, eltociear, focusshang, gddgcz518, github-actions[bot], icecraft, myhloli, nutshellfool, papayalove, qiangqiang199, renpengli01, wangbindl, yzztin, zuanzuanshao

mineru's Issues

The SDK cannot use the GPU model

Description of the bug | 错误描述

[Screenshots attached to the original issue: 企业微信截图_1721044200913, 企业微信截图_17210441867202]

How to reproduce the bug | 如何复现

[Screenshot attached to the original issue: 企业微信截图_1721044200913]

Operating system | 操作系统

Windows

Python version | Python 版本

3.10

Device mode | 设备模式

cuda

Multiprocessing raises this error: Required dependency not installed, please install by "pip install magic-pdf[full-cpu] detectron2 --extra-index-url https://myhloli.github.io/wheels/"

Description of the bug | 错误描述

  • Required dependency not installed, please install by
    "pip install magic-pdf[full-cpu] detectron2 --extra-index-url https://myhloli.github.io/wheels/"
    In fact, I have already run this installation.

How to reproduce the bug | 如何复现

  • Required dependency not installed, please install by
    "pip install magic-pdf[full-cpu] detectron2 --extra-index-url https://myhloli.github.io/wheels/"
    In fact, I have already run this installation.

Operating system | 操作系统

Windows

Python version | Python 版本

3.11

Device mode | 设备模式

cpu

Failed to build grpcio

Description of the bug | 错误描述

distutils.errors.CompileError: command '/usr/bin/clang' failed with exit code 1

  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for grpcio
Running setup.py clean for grpcio
Failed to build grpcio
ERROR: Could not build wheels for grpcio, which is required to install pyproject.toml-based projects

How to reproduce the bug | 如何复现

distutils.errors.CompileError: command '/usr/bin/clang' failed with exit code 1

  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for grpcio
Running setup.py clean for grpcio
Failed to build grpcio
ERROR: Could not build wheels for grpcio, which is required to install pyproject.toml-based projects

Operating system | 操作系统

MacOS

Python version | Python 版本

3.10

Software version | 软件版本 (magic-pdf --version)

0.6.x

Device mode | 设备模式

cpu

pymupdf.mupdf.FzErrorFormat: code=7: object out of range (0 0 R); xref size 161

Description of the bug | 错误描述

Traceback (most recent call last):
  File "/root/miniconda3/envs/MinerU/bin/magic-pdf", line 8, in <module>
    sys.exit(cli())
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/cli/magicpdf.py", line 325, in pdf_command
    do_parse(
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/cli/magicpdf.py", line 120, in do_parse
    draw_layout_bbox(pdf_info, pdf_bytes, local_md_dir)
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/magic_pdf/libs/draw_bbox.py", line 142, in draw_layout_bbox
    pdf_docs.save(f"{out_path}/layout.pdf")
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/pymupdf/__init__.py", line 5444, in save
    mupdf.pdf_save_document(pdf, filename, opts)
  File "/root/miniconda3/envs/MinerU/lib/python3.10/site-packages/pymupdf/mupdf.py", line 50563, in pdf_save_document
    return _mupdf.pdf_save_document(doc, filename, opts)
pymupdf.mupdf.FzErrorFormat: code=7: object out of range (0 0 R); xref size 161
(MinerU) [root@n01v PDF-Extract]# magic-pdf --version
magic-pdf, version 0.6.1

How to reproduce the bug | 如何复现

magic-pdf pdf-command --pdf ./test2.pdf --inside_model true

Operating system | 操作系统

Linux

Python version | Python 版本

3.10

Software version | 软件版本 (magic-pdf --version)

0.6.x

Device mode | 设备模式

cpu

magic-pdf pdf-command --pdf "pdf_path" --inside_model true raises an error

Description of the bug | 错误描述

[Screenshot attached to the original issue: 企业微信截图_17210956733839]

How to reproduce the bug | 如何复现

  1. pip install magic-pdf[full-cpu]
  2. pip install detectron2 --extra-index-url https://myhloli.github.io/wheels/
  3. Downloading model weights files
  4. pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118
  5. magic-pdf pdf-command --pdf "pdf_path" --inside_model true

Operating system | 操作系统

Windows

Python version | Python 版本

3.9

Device mode | 设备模式

cuda

Words broken across lines keep their hyphen and are not rejoined

Description of the bug | 错误描述

image

image

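As shown in the images above, when the last word of a line in the PDF is split across two lines, a '-' hyphen is added at the end of the line.
After conversion to Markdown, the '-' hyphen is still present, and the word is split by '-' followed by a space.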
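The behaviour described above can be worked around in post-processing until it is handled by the tool itself. The following is only a sketch of one common heuristic, not MinerU's own method: `rejoin_hyphenated` is a hypothetical helper that removes a hyphen followed by whitespace when the continuation starts with a lowercase letter.

```python
import re

# A word fragment, a hyphen, whitespace (space or newline), then a
# continuation that starts with a lowercase letter.
_HYPHEN_BREAK = re.compile(r"(\w+)-\s+([a-z]\w*)")


def rejoin_hyphenated(text: str) -> str:
    """Join words that were split by an end-of-line hyphen."""
    return _HYPHEN_BREAK.sub(r"\1\2", text)


print(rejoin_hyphenated("a trans- formation of the input"))
```

The heuristic has false positives: a genuine compound that happened to break at its hyphen (for example "state-of-the- art") is joined as well, so a dictionary check would be needed for higher accuracy.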

How to reproduce the bug | 如何复现

This can be reproduced with the PDF at https://arxiv.org/pdf/2407.01906

Operating system | 操作系统

Linux

Python version | Python 版本

3.10

Device mode | 设备模式

cpu

Typo in video

Description of the bug | 错误描述

Word "Markdown" is misspelled.

image

How to reproduce the bug | 如何复现

Operating system | 操作系统

Linux

Python version | Python 版本

3.11

Software version | 软件版本 (magic-pdf --version)

0.6.x

Device mode | 设备模式

cuda

TypeError: 'NoneType' object is not subscriptable

Description of the bug | 错误描述

Speed: 23.2ms preprocess, 3345.8ms inference, 0.0ms postprocess per image at shape (1, 3, 1888, 1408)
2024-07-19 15:57:08.869 | INFO | magic_pdf.model.pdf_extract_kit:__call__:166 - formula nums: 0, mfr time: 0.0
2024-07-19 15:57:08.869 | INFO | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:116 - doc analyze cost: 373.20968651771545
'NoneType' object is not subscriptable
2024-07-19 15:57:08.889 | ERROR | __main__:<module>:53 - 'NoneType' object is not subscriptable
Traceback (most recent call last):

File "E:\MinerU-master\demo\demo.py", line 49, in <module>
md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
│ │ └ 'images'
│ └ <function UNIPipe.pipe_mk_markdown at 0x000002447890A050>
└ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x000002441756DF90>

File "E:\MinerU-master\magic_pdf\pipe\UNIPipe.py", line 47, in pipe_mk_markdown
result = super().pipe_mk_markdown(img_parent_path, drop_mode, md_make_mode)
│ │ └ 'mm_markdown'
│ └ 'none'
└ 'images'

File "E:\MinerU-master\magic_pdf\pipe\AbsPipe.py", line 55, in pipe_mk_markdown
md_content = AbsPipe.mk_markdown(self.get_compress_pdf_mid_data(), img_parent_path, drop_mode, md_make_mode)
│ │ │ │ │ │ └ 'mm_markdown'
│ │ │ │ │ └ 'none'
│ │ │ │ └ 'images'
│ │ │ └ <function AbsPipe.get_compress_pdf_mid_data at 0x000002444E267910>
│ │ └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x000002441756DF90>
│ └ <staticmethod(<function AbsPipe.mk_markdown at 0x000002444E267D90>)>
└ <class 'magic_pdf.pipe.AbsPipe.AbsPipe'>

File "E:\MinerU-master\magic_pdf\pipe\AbsPipe.py", line 103, in mk_markdown
pdf_info_list = pdf_mid_data["pdf_info"]
└ None

TypeError: 'NoneType' object is not subscriptable

Process finished with exit code 0

How to reproduce the bug | 如何复现

import os

from loguru import logger

from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
import magic_pdf.model as model_config
from magic_pdf.model.pdf_extract_kit import CustomPEKModel, mfd_model_init

model_config.__use_inside_model__ = True

# Set the model directory and the weight file path
current_script_dir = os.path.dirname(os.path.abspath(__file__))
model_dir = os.path.abspath(os.path.join(current_script_dir, '..', 'tmp', 'models'))
weight_path = os.path.join(model_dir, 'MFD', 'weights.pt')

# Print the paths for confirmation
print(f"Model directory: {model_dir}")
print(f"Weight file path: {weight_path}")

# Make sure the weight file exists
if not os.path.exists(weight_path):
    # Tell the user that the weight file is missing
    print(f"Please ensure the weights file is available at: {weight_path}")
    raise FileNotFoundError(f"Model weights not found at {weight_path}")

# Initialize the model
custom_model = CustomPEKModel(
    ocr=True, show_log=True, models_dir=model_dir, device='cpu'
)

try:
    demo_name = "demo1"
    pdf_path = os.path.join(current_script_dir, f"{demo_name}.pdf")
    model_path = os.path.join(current_script_dir, f"{demo_name}.json")
    pdf_bytes = open(pdf_path, "rb").read()

    model_json = []  # pass an empty list as model_json to use the built-in model
    if not model_json:
        # Initialize model_json with a default non-empty list
        model_json = [{"layout_dets": []}]  # Example of a minimal valid model_json structure

    jso_useful_key = {"_pdf_type": "", "model_list": model_json}
    local_image_dir = os.path.join(current_script_dir, 'images')
    image_dir = str(os.path.basename(local_image_dir))
    image_writer = DiskReaderWriter(local_image_dir)
    pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
    pipe.pipe_classify()
    pipe.pipe_analyze()  # make sure the correct method is used
    md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
    with open(f"{demo_name}.md", "w", encoding="utf-8") as f:
        f.write(md_content)

except Exception as e:
    logger.exception(e)
    print(e)
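The traceback in this report shows `pdf_mid_data` is `None` when `mk_markdown` runs, and the reproduction script goes from `pipe_analyze()` straight to `pipe_mk_markdown()`. Other scripts on this page call `pipe_parse()` in between, so a missing parse step is a plausible cause, though the report does not confirm it. The sketch below shows that call order. `run_pipeline` is a hypothetical helper, not part of magic_pdf, and `pipe` is assumed to be a `UNIPipe`-like object exposing the four methods used on this page.

```python
def run_pipeline(pipe, image_dir, use_inside_model=True):
    """Run the pipe stages in the order used by the other scripts on this page.

    `pipe` is assumed to expose pipe_classify, pipe_analyze, pipe_parse and
    pipe_mk_markdown. This helper is hypothetical and not part of magic_pdf.
    """
    pipe.pipe_classify()
    if use_inside_model:
        # Built-in model analysis, used when no model_list was supplied
        pipe.pipe_analyze()
    # Assumption: the parse step fills the intermediate data that
    # pipe_mk_markdown reads (the traceback shows None without it)
    pipe.pipe_parse()
    return pipe.pipe_mk_markdown(image_dir, drop_mode="none")
```

With a real pipe this would be called as `md_content = run_pipeline(pipe, image_dir)` in place of the three separate calls in the script.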

Operating system | 操作系统

Windows

Python version | Python 版本

3.10

Software version | 软件版本 (magic-pdf --version)

0.6.x

Device mode | 设备模式

cpu

GPU acceleration is not used

Description of the bug | 错误描述

When I set "device-mode":"cuda" in the JSON file
and run magic-pdf pdf-command --pdf "1.pdf" --inside_model true,
the log still reports magic_pdf.model.pdf_extract_kit:__init__:100 - using device: cpu
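Two things are worth checking for a report like this: that the config file actually being read contains the expected `device-mode` value, and that the installed PyTorch build can see a CUDA device at all. The sketch below covers both. The function names are hypothetical, the fallback to `"cpu"` is this sketch's own choice rather than documented behaviour, and the location of `magic-pdf.json` (commonly the user's home directory) is an assumption that should be verified for the installed version.

```python
import json


def read_device_mode(config_path):
    """Return the "device-mode" value from a magic-pdf.json style file."""
    with open(config_path, "r", encoding="utf-8") as f:
        config = json.load(f)
    # Falling back to "cpu" is an assumption of this sketch
    return config.get("device-mode", "cpu")


def cuda_visible_to_torch():
    """Return torch.cuda.is_available(), or None if torch is not installed."""
    try:
        import torch
    except ImportError:
        return None
    return torch.cuda.is_available()
```

If `read_device_mode` returns `cuda` but `cuda_visible_to_torch()` returns `False`, the installed torch wheel is probably a CPU-only build, and the problem lies in the environment rather than in the config file.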

How to reproduce the bug | 如何复现

1

Operating system | 操作系统

Windows

Python version | Python 版本

3.10

Software version | 软件版本 (magic-pdf --version)

0.6.x

Device mode | 设备模式

cuda

Could you provide complete usage instructions?

Is your feature request related to a problem? Please describe.
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

Describe the solution you'd like
A clear and concise description of what you want to happen.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

In the demo file, how do I save the parsed JSON file?

Description of the bug | 错误描述

When I save it this way, the resulting JSON file is empty.

How to reproduce the bug | 如何复现

import os
import json
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'
from loguru import logger
from magic_pdf.model.pdf_extract_kit import CustomPEKModel

from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter

import magic_pdf.model as model_config
model_config.__use_inside_model__ = True

# Set the model directory and the weight file path
current_script_dir = os.path.dirname(os.path.abspath(__file__))
model_dir = os.path.abspath(os.path.join(current_script_dir, '..', 'tmp', 'models', 'MFD'))
weight_path = os.path.join(model_dir, 'weights.pt')

# Print the paths for confirmation
print(f"Model directory: {model_dir}")
print(f"Weight file path: {weight_path}")

# Make sure the weight file exists
if not os.path.exists(weight_path):
    # Tell the user that the weight file is missing
    print(f"Please ensure the weights file is available at: {weight_path}")
    raise FileNotFoundError(f"Model weights not found at {weight_path}")

# Initialize the model
custom_model = CustomPEKModel(
    ocr=True, show_log=True, models_dir=os.path.abspath(os.path.join(current_script_dir, '..', 'tmp', 'models')), device='cuda'
)
try:
    current_script_dir = os.path.dirname(os.path.abspath(__file__))
    demo_name = "3"
    pdf_path = os.path.join(current_script_dir, f"{demo_name}.pdf")
    model_path = os.path.join(current_script_dir, f"{demo_name}.json")
    pdf_bytes = open(pdf_path, "rb").read()
    # model_json = json.loads(open(model_path, "r", encoding="utf-8").read())
    model_json = []  # pass an empty list as model_json to use the built-in model
    jso_useful_key = {"_pdf_type": "", "model_list": model_json}
    # Create the output folder for the parse results, named after the PDF file
    output_folder = os.path.join(current_script_dir, demo_name)
    os.makedirs(output_folder, exist_ok=True)
    # Create the subfolder for images
    image_dir = os.path.join(output_folder, 'images')
    os.makedirs(image_dir, exist_ok=True)
    image_writer = DiskReaderWriter(image_dir)
    pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
    pipe.pipe_classify()
    """If no valid model data was passed in, parse with the built-in model"""
    if len(model_json) == 0:
        if model_config.__use_inside_model__:
            pipe.pipe_analyze()
            # Save the model JSON file
            with open(os.path.join(output_folder, f"{demo_name}_model.json"), "w", encoding="utf-8") as f:
                json.dump(model_json, f, ensure_ascii=False, indent=4)
        else:
            logger.error("need model list input")
            exit(1)
    pipe.pipe_parse()
    md_content = pipe.pipe_mk_markdown('images', drop_mode="none")
    with open(os.path.join(output_folder, f"{demo_name}.md"), "w", encoding="utf-8") as f:
        f.write(md_content)
except Exception as e:
    logger.exception(e)
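In the script above, `json.dump` writes the local variable `model_json`, which is still the empty list created a few lines earlier, so the saved file is empty. The traceback in the "Getting Ran Out of Input error" issue on this page shows `pipe_analyze()` assigning its result to `self.model_list`, so the analysis output is presumably held on the pipe object instead. A hedged sketch follows. `save_model_list` is a hypothetical helper, and `pipe` is assumed to be a `UNIPipe` on which `pipe_analyze()` has already run.

```python
import json


def save_model_list(pipe, out_path):
    """Write pipe.model_list (set by pipe_analyze) to out_path as JSON."""
    # Assumption: model_list is JSON-serializable (a list of dicts)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(pipe.model_list, f, ensure_ascii=False, indent=4)
```

In the script this would replace the `json.dump(model_json, ...)` call, for example `save_model_list(pipe, os.path.join(output_folder, f"{demo_name}_model.json"))`.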

Operating system | 操作系统

Windows

Python version | Python 版本

3.10

Software version | 软件版本 (magic-pdf --version)

0.6.x

Device mode | 设备模式

cpu

Extracting exam questions from Word documents

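Can the current algorithm extract the individual questions from an English exam paper in a Word document?
The papers are high-school English exams or CET-4/CET-6 exams.

Is there a suitable algorithm you would recommend for this kind of document?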

The tmp folder was not found

Description of the bug | 错误描述

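[Screenshot: 捕获] After the code finishes running, no tmp temporary folder appears. I am not sure whether I made a mistake or something else is the cause.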

How to reproduce the bug | 如何复现

Run the code directly.

Operating system | 操作系统

Linux

Python version | Python 版本

3.10

Device mode | 设备模式

cpu

Multi-level headings

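  1. The exported Markdown currently supports only first-level headings. Are there plans to support multi-level headings?
  2. Is there a known approach for implementing multi-level headings?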

Getting Ran Out of Input error

Description of the bug | 错误描述

I'm getting the 'Ran out of input' error. I reinstalled and rechecked everything, but the error persists. It was running fine before, but after 2-3 inputs, I started getting this error.

How to reproduce the bug | 如何复现

magic-pdf pdf-command --pdf D:\github_project\backend_code_generation\components\invoice_june.pdf --inside_model true
2024-07-21 13:52:43.517 | WARNING | magic_pdf.cli.magicpdf:get_model_json:312 - not found json D:\github_project\backend_code_generation\components\invoice_june.json existed
2024-07-21 13:52:43.528 | INFO | magic_pdf.cli.magicpdf:do_parse:92 - local output dir is D:\github_project\backend_code_generation\components\output\magic-pdf\invoice_june\auto
2024-07-21 13:52:43.606 | INFO | magic_pdf.libs.pdf_check:detect_invalid_chars:57 - cid_count: 0, text_len: 394, cid_chars_radio: 0.0
2024-07-21 13:52:52.544 | INFO | magic_pdf.model.pdf_extract_kit:__init__:93 - DocAnalysis init, this may take some times. apply_layout: True, apply_formula: True, apply_ocr: False
2024-07-21 13:52:52.545 | INFO | magic_pdf.model.pdf_extract_kit:__init__:101 - using device: cpu
2024-07-21 13:52:52.546 | ERROR | magic_pdf.cli.magicpdf:parse_doc:338 - Ran out of input
Traceback (most recent call last):

File "\\?\D:\github_project\backend_code_generation\venv\Scripts\magic-pdf-script.py", line 33, in <module>
sys.exit(load_entry_point('magic-pdf', 'console_scripts', 'magic-pdf')())
│ │ └ <function importlib_load_entry_point at 0x000001C518AA3E20>
│ └
└ <module 'sys' (built-in)>

File "D:\github_project\backend_code_generation\venv\lib\site-packages\click\core.py", line 1157, in __call__
return self.main(*args, **kwargs)
│ │ │ └ {}
│ │ └ ()
│ └ <function BaseCommand.main at 0x000001C519605D80>

File "D:\github_project\backend_code_generation\venv\lib\site-packages\click\core.py", line 1078, in main
rv = self.invoke(ctx)
│ │ └ <click.core.Context object at 0x000001C518C06AD0>
│ └ <function MultiCommand.invoke at 0x000001C519606D40>

File "D:\github_project\backend_code_generation\venv\lib\site-packages\click\core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
│ │ │ │ └ <click.core.Context object at 0x000001C519452A70>
│ │ │ └ <function Command.invoke at 0x000001C519606830>
│ │ └
│ └ <click.core.Context object at 0x000001C519452A70>
└ <function MultiCommand.invoke.<locals>._process_result at 0x000001C530E028C0>

File "D:\github_project\backend_code_generation\venv\lib\site-packages\click\core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
│ │ │ │ │ └ {'pdf': 'D:\github_project\backend_code_generation\components\invoice_june.pdf', 'inside_model': True, 'model': None, 'me...
│ │ │ │ └ <click.core.Context object at 0x000001C519452A70>
│ │ │ └ <function pdf_command at 0x000001C530E02EF0>
│ │ └
│ └ <function Context.invoke at 0x000001C5196055A0>
└ <click.core.Context object at 0x000001C519452A70>

File "D:\github_project\backend_code_generation\venv\lib\site-packages\click\core.py", line 783, in invoke
return __callback(*args, **kwargs)
│ └ {'pdf': 'D:\github_project\backend_code_generation\components\invoice_june.pdf', 'inside_model': True, 'model': None, 'me...
└ ()

File "d:\github_project\backend_code_generation\components\mineru\magic_pdf\cli\magicpdf.py", line 352, in pdf_command
parse_doc(pdf)
│ └ 'D:\github_project\backend_code_generation\components\invoice_june.pdf'
└ <function pdf_command.<locals>.parse_doc at 0x000001C530E02CB0>

File "d:\github_project\backend_code_generation\components\mineru\magic_pdf\cli\magicpdf.py", line 330, in parse_doc
do_parse(
└ <function do_parse at 0x000001C530E02830>

File "d:\github_project\backend_code_generation\components\mineru\magic_pdf\cli\magicpdf.py", line 112, in do_parse
pipe.pipe_analyze()
│ └ <function UNIPipe.pipe_analyze at 0x000001C530E01480>
└ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x000001C530DCBD90>

File "d:\github_project\backend_code_generation\components\mineru\magic_pdf\pipe\UNIPipe.py", line 29, in pipe_analyze
self.model_list = doc_analyze(self.pdf_bytes, ocr=False)
│ │ │ │ └ b'%PDF-1.4\n1 0 obj\n<<\n/Title (\xfe\xff\x00D\x00o\x00c\x00u\x00m\x00e\x00n\x00t)\n/Creator (\xfe\xff\x00w\x00k\x00h\x00t\x0...
│ │ │ └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x000001C530DCBD90>
│ │ └ <function doc_analyze at 0x000001C5251CE4D0>
│ └ []
└ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x000001C530DCBD90>

File "d:\github_project\backend_code_generation\components\mineru\magic_pdf\model\doc_analyze_by_custom_model.py", line 101, in doc_analyze
custom_model = model_manager.get_model(ocr, show_log)
│ │ │ └ False
│ │ └ False
│ └ <function ModelSingleton.get_model at 0x000001C5251CE440>
└ <magic_pdf.model.doc_analyze_by_custom_model.ModelSingleton object at 0x000001C530E10460>

File "d:\github_project\backend_code_generation\components\mineru\magic_pdf\model\doc_analyze_by_custom_model.py", line 63, in get_model
self._models[key] = custom_model_init(ocr=ocr, show_log=show_log)
│ │ │ │ │ └ False
│ │ │ │ └ False
│ │ │ └ <function custom_model_init at 0x000001C5251CE320>
│ │ └ (False, False)
│ └ {}
└ <magic_pdf.model.doc_analyze_by_custom_model.ModelSingleton object at 0x000001C530E10460>

File "d:\github_project\backend_code_generation\components\mineru\magic_pdf\model\doc_analyze_by_custom_model.py", line 85, in custom_model_init
custom_model = CustomPEKModel(ocr=ocr, show_log=show_log, models_dir=local_models_dir, device=device)
│ │ │ │ └ 'cpu'
│ │ │ └ 'D:\github_project\backend_code_generation\components\PDF-Extract-Kit\models'
│ │ └ False
│ └ False
└ <class 'magic_pdf.model.pdf_extract_kit.CustomPEKModel'>

File "d:\github_project\backend_code_generation\components\mineru\magic_pdf\model\pdf_extract_kit.py", line 107, in __init__
self.mfd_model = mfd_model_init(str(os.path.join(models_dir, self.configs["weights"]["mfd"])))
│ │ │ │ │ │ │ └ {'config': {'device': 'cpu', 'layout': True, 'formula': True}, 'weights': {'layout': 'Layout/model_final.pth', 'mfd': 'MFD/we...
│ │ │ │ │ │ └ <magic_pdf.model.pdf_extract_kit.CustomPEKModel object at 0x000001C530E103D0>
│ │ │ │ │ └ 'D:\github_project\backend_code_generation\components\PDF-Extract-Kit\models'
│ │ │ │ └ <function join at 0x000001C518BB72E0>
│ │ │ └ <module 'ntpath' from 'C:\Users\mishr\AppData\Local\Programs\Python\Python310\lib\ntpath.py'>
│ │ └ <module 'os' from 'C:\Users\mishr\AppData\Local\Programs\Python\Python310\lib\os.py'>
│ └ <function mfd_model_init at 0x000001C530E03910>
└ <magic_pdf.model.pdf_extract_kit.CustomPEKModel object at 0x000001C530E103D0>

File "d:\github_project\backend_code_generation\components\mineru\magic_pdf\model\pdf_extract_kit.py", line 30, in mfd_model_init
mfd_model = YOLO(weight)
│ └ 'D:\github_project\backend_code_generation\components\PDF-Extract-Kit\models\MFD/weights.pt'
└ <class 'ultralytics.models.yolo.model.YOLO'>

File "D:\github_project\backend_code_generation\venv\lib\site-packages\ultralytics\models\yolo\model.py", line 23, in __init__
super().__init__(model=model, task=task, verbose=verbose)
│ │ └ False
│ └ None
└ 'D:\github_project\backend_code_generation\components\PDF-Extract-Kit\models\MFD/weights.pt'

File "D:\github_project\backend_code_generation\venv\lib\site-packages\ultralytics\engine\model.py", line 149, in __init__
self._load(model, task=task)
│ │ │ └ None
│ │ └ 'D:\github_project\backend_code_generation\components\PDF-Extract-Kit\models\MFD/weights.pt'
│ └ <function Model._load at 0x000001C543CE23B0>
└ YOLO()

File "D:\github_project\backend_code_generation\venv\lib\site-packages\ultralytics\engine\model.py", line 230, in _load
self.model, self.ckpt = attempt_load_one_weight(weights)
│ │ │ │ │ └ 'D:\github_project\backend_code_generation\components\PDF-Extract-Kit\models\MFD/weights.pt'
│ │ │ │ └ <function attempt_load_one_weight at 0x000001C543C5D000>
│ │ │ └ None
│ │ └ YOLO()
│ └ None
└ YOLO()

File "D:\github_project\backend_code_generation\venv\lib\site-packages\ultralytics\nn\tasks.py", line 855, in attempt_load_one_weight
ckpt, weight = torch_safe_load(weight) # load ckpt
│ └ 'D:\github_project\backend_code_generation\components\PDF-Extract-Kit\models\MFD/weights.pt'
└ <function torch_safe_load at 0x000001C543C5CEE0>

File "D:\github_project\backend_code_generation\venv\lib\site-packages\ultralytics\nn\tasks.py", line 781, in torch_safe_load
ckpt = torch.load(file, map_location="cpu")
│ │ └ 'D:\github_project\backend_code_generation\components\PDF-Extract-Kit\models\MFD\weights.pt'
│ └ <function load at 0x000001C534921240>
└ <module 'torch' from 'D:\github_project\backend_code_generation\venv\lib\site-packages\torch\__init__.py'>

File "D:\github_project\backend_code_generation\venv\lib\site-packages\torch\serialization.py", line 1040, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
│ │ │ │ └ {'encoding': 'utf-8'}
│ │ │ └ <module 'pickle' from 'C:\Users\mishr\AppData\Local\Programs\Python\Python310\lib\pickle.py'>
│ │ └ 'cpu'
│ └ <_io.BufferedReader name='D:\github_project\backend_code_generation\components\PDF-Extract-Kit\models\MFD\weights.pt'>
└ <function _legacy_load at 0x000001C5349213F0>

File "D:\github_project\backend_code_generation\venv\lib\site-packages\torch\serialization.py", line 1262, in _legacy_load
magic_number = pickle_module.load(f, **pickle_load_args)
│ │ │ └ {'encoding': 'utf-8'}
│ │ └ <_io.BufferedReader name='D:\github_project\backend_code_generation\components\PDF-Extract-Kit\models\MFD\weights.pt'>
│ └
└ <module 'pickle' from 'C:\Users\mishr\AppData\Local\Programs\Python\Python310\lib\pickle.py'>

EOFError: Ran out of input

This is present in my magic-pdf.json file -

{
    "bucket_info": {
        "bucket-name-1": ["ak", "sk", "endpoint"],
        "bucket-name-2": ["ak", "sk", "endpoint"]
    },
    "temp-output-dir": "D:\\github_project\\backend_code_generation\\components\\output",
    "models-dir": "D:\\github_project\\backend_code_generation\\components\\PDF-Extract-Kit\\models",
    "device-mode": "cpu"
}
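The traceback ends inside `pickle.load` while reading `MFD\weights.pt`. Python's pickle raises `EOFError: Ran out of input` when the file it reads is empty, so a zero-byte or interrupted download of the weights file is a likely cause, although the report does not confirm it. The sketch below classifies a weights file before it is loaded. `inspect_weights` is a hypothetical helper. The Git LFS check is included because a clone without LFS leaves small text pointer files in place of the weights, and those typically fail with a different unpickling error.

```python
import os

# Git LFS pointer files are short text files that start with this line
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/"


def inspect_weights(path):
    """Return 'missing', 'empty', 'lfs-pointer' or 'present' for path."""
    if not os.path.isfile(path):
        return "missing"
    if os.path.getsize(path) == 0:
        # pickle raises "EOFError: Ran out of input" on an empty file
        return "empty"
    with open(path, "rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))
    if head == LFS_POINTER_PREFIX:
        return "lfs-pointer"
    return "present"
```

A result of `present` only means the file is non-empty. It does not verify that the download is complete, so comparing the file size with the published one remains a useful extra check.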

Operating system | 操作系统

Windows

Python version | Python 版本

3.10

Software version | 软件版本 (magic-pdf --version)

0.6.x

Device mode | 设备模式

cpu

How to batch-parse multiple PDF files

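Is there a command similar to marker's that parses every PDF in a folder and stores the results in another folder, using multiple workers for speed? Also, is there a WeChat group for this project that I can join? Thanks!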
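No batch command appears on this page, but the single-file command used in the other reports can be driven from a small script. The sketch below is an assumption-laden workaround rather than a project feature: `build_commands` and `run_batch` are hypothetical helpers that start one `magic-pdf pdf-command` process per PDF, a few at a time. No output-folder flag is assumed. Going by the logs on this page, results are written under the `temp-output-dir` set in `magic-pdf.json`.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_commands(pdf_dir):
    """Return one magic-pdf command line per *.pdf in pdf_dir, sorted by name."""
    return [
        ["magic-pdf", "pdf-command", "--pdf", str(pdf), "--inside_model", "true"]
        for pdf in sorted(Path(pdf_dir).glob("*.pdf"))
    ]


def run_batch(pdf_dir, workers=2, runner=subprocess.run):
    """Run the commands, `workers` at a time; return exit codes in order."""
    commands = build_commands(pdf_dir)
    # Threads only wait on child processes, so the heavy work runs in the
    # separate magic-pdf processes rather than in this interpreter
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(runner, commands))
    return [result.returncode for result in results]
```

Each worker process loads its own copy of the models, so memory use grows with `workers`. Starting with a small value and raising it gradually is the safer choice.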

Hello, some PDFs parse fine, but other PDF files raise an error

Description of the bug | 错误描述


C++ Traceback (most recent call last):

0 paddle_infer::Predictor::Predictor(paddle::AnalysisConfig const&)
1 std::unique_ptr<paddle::PaddlePredictor, std::default_delete<paddle::PaddlePredictor> > paddle::CreatePaddlePredictor<paddle::AnalysisConfig, (paddle::PaddleEngineKind)2>(paddle::AnalysisConfig const&)
2 paddle::AnalysisPredictor::Init(std::shared_ptr<paddle::framework::Scope> const&, std::shared_ptr<paddle::framework::ProgramDesc> const&)
3 paddle::AnalysisPredictor::PrepareProgram(std::shared_ptr<paddle::framework::ProgramDesc> const&)
4 paddle::AnalysisPredictor::OptimizeInferenceProgram()
5 paddle::inference::analysis::Analyzer::RunAnalysis(paddle::inference::analysis::Argument*)
6 paddle::inference::analysis::IrAnalysisPass::RunImpl(paddle::inference::analysis::Argument*)
7 paddle::inference::analysis::IRPassManager::Apply(std::unique_ptr<paddle::framework::ir::Graph, std::default_delete<paddle::framework::ir::Graph> >)
8 paddle::framework::ir::Pass::Apply(paddle::framework::ir::Graph*) const
9 paddle::framework::ir::SelfAttentionFusePass::ApplyImpl(paddle::framework::ir::Graph*) const
10 paddle::framework::ir::GraphPatternDetector::operator()(paddle::framework::ir::Graph*, std::function<void (std::map<paddle::framework::ir::PDNode*, paddle::framework::ir::Node*, paddle::framework::ir::GraphPatternDetector::PDNodeCompare, std::allocator<std::pair<paddle::framework::ir::PDNode* const, paddle::framework::ir::Node*> > > const&, paddle::framework::ir::Graph*)>)


Error Message Summary:

FatalError: Illegal instruction is detected by the operating system.
[TimeInfo: *** Aborted at 1721031085 (unix time) try "date -d @1721031085" if you are using GNU date ***]
[SignalInfo: *** SIGILL (@0x7f95b6ffa13a) received by PID 20020 (TID 0x7f96f055d180) from PID 18446744072484790586 ***]

Illegal instruction (core dumped)

How to reproduce the bug | 如何复现

After setting up the environment following the latest instructions, run
magic-pdf pdf-command --pdf ../1.pdf --inside_model true
Some files fail with a "core dumped" error.

Operating system | 操作系统

Linux

Python version | Python 版本

3.10

Device mode | 设备模式

cuda

Table parsing improvements

I see that output is still Markdown only, and tables with merged cells are parsed poorly. Are there any plans or ideas for improving this?

How is the model file for a PDF generated?

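Hello, how is the model file for a PDF generated? I am using the open-source DocXChain, and for the same file the result is far from what the demo produces. Thanks.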

After downloading the models, running fails with FileNotFoundError: [Errno 2] No such file or directory: '/tmp/models/MFD/weights.pt'

Description of the bug | 错误描述

I downloaded the model wanderkid/PDF-Extract-Kit from ModelScope (魔搭) and uploaded it to the server, then set "models-dir":"/tmp/models" in magic-pdf.json. Following the repository steps, running magic-pdf pdf-command --pdf "pdf_path" --inside_model true fails with FileNotFoundError: [Errno 2] No such file or directory: '/tmp/models/MFD/weights.pt'
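The error means that `MFD/weights.pt` does not exist directly under the configured `models-dir`. One possible cause, which this report does not confirm, is that the uploaded download has an extra directory level, so the weights sit deeper than `/tmp/models`. The sketch below searches a directory tree for the folder that actually contains `MFD/weights.pt`. `find_models_dirs` is a hypothetical helper, and the marker path is taken from the error message.

```python
import os


def find_models_dirs(root, marker=os.path.join("MFD", "weights.pt")):
    """Return every directory under root (root included) that contains marker."""
    matches = []
    for current, _dirs, _files in os.walk(root):
        if os.path.isfile(os.path.join(current, marker)):
            matches.append(current)
    return sorted(matches)
```

If `find_models_dirs("/tmp/models")` returns a deeper path, that path is the candidate value for `models-dir`. An empty list suggests the weights were never uploaded.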

How to reproduce the bug | 如何复现

image

Operating system | 操作系统

Linux

Python version | Python 版本

3.10

Software version | 软件版本 (magic-pdf --version)

0.6.x

Device mode | 设备模式

cuda

Parsing a text-based PDF raises magic_pdf.user_api:parse_pdf:85 - list index out of range

Description of the bug | 错误描述

The input is a medical paper PDF, demo1.pdf, containing both text and figures. Parsing reports these two messages:

  • INFO | magic_pdf.libs.pdf_check:detect_invalid_chars:57 - cid_count: 3, text_len: 45665, cid_chars_radio: 6.57217341774925e-05
  • ERROR | magic_pdf.user_api:parse_pdf:85 - list index out of range

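In the end the text was not extracted directly. The Markdown document demo1_mineru.md was produced through OCR instead, and it contains quite a few errors:

  1. Many superscript annotations are misrecognized; for example, markers 5, 6, 7 and 8 in the author list are all wrong.
  2. Some text is lost; for example, the first row of Table 1 is missing entirely.
  3. There are minor layout problems; for example, line-break hyphens are kept and the words are not rejoined.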

How to reproduce the bug | 如何复现

import pathlib
import json

from loguru import logger

from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter

import magic_pdf.model as model_config 
model_config.__use_inside_model__ = True

def gen_md(pdf_file: str):
    try:
        pdf_path = pathlib.Path(pdf_file)
        task_name = pdf_path.stem
        task_dir = pdf_path.parent
        model_path = task_dir.joinpath(task_name + '.json')
        pdf_bytes = open(pdf_path, 'rb').read()
        # model_json = json.loads(open(model_path, 'r', encoding='utf-8').read())
        model_json = []  # pass an empty list as model_json to use the built-in model
        jso_useful_key = {'_pdf_type': '', 'model_list': model_json}
        local_image_dir = task_dir.joinpath(task_name, 'img')
        image_dir = str(pathlib.PurePath(local_image_dir).name)
        image_writer = DiskReaderWriter(local_image_dir)
        pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
        pipe.pipe_classify()
        pipe.pipe_parse()
        md_content = pipe.pipe_mk_markdown(image_dir, drop_mode='none')
        with open(f'{task_name}_mineru.md', 'w', encoding='utf-8') as f:
            f.write(md_content)
    except Exception as e:
        logger.exception(e)

gen_md('demo1.pdf')

Log:

2024-07-17 23:38:42.992 | INFO     | magic_pdf.libs.pdf_check:detect_invalid_chars:57 - cid_count: 3, text_len: 45665, cid_chars_radio: 6.57217341774925e-05
2024-07-17 23:38:42.998 | ERROR    | magic_pdf.user_api:parse_pdf:85 - list index out of range
Traceback (most recent call last):

  File "[/opt/anaconda3/envs/mineru/lib/python3.10/runpy.py", line 196](http://localhost:8888/opt/anaconda3/envs/mineru/lib/python3.10/runpy.py#line=195), in _run_module_as_main
    return _run_code(code, main_globals, None,
           │         │     └ {'__name__': '__main__', '__doc__': 'Entry point for launching an IPython kernel.\n\nThis is separate from the ipykernel pack...
           │         └ <code object <module> at 0x7fd8c4c7ece0, file "[/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/ipykernel_launcher.py](http://localhost:8888/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/ipykernel_launcher.py)"...
           └ <function _run_code at 0x7fd8c4c8b2e0>
  File "[/opt/anaconda3/envs/mineru/lib/python3.10/runpy.py", line 86](http://localhost:8888/opt/anaconda3/envs/mineru/lib/python3.10/runpy.py#line=85), in _run_code
    exec(code, run_globals)
         │     └ {'__name__': '__main__', '__doc__': 'Entry point for launching an IPython kernel.\n\nThis is separate from the ipykernel pack...
         └ <code object <module> at 0x7fd8c4c7ece0, file "[/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/ipykernel_launcher.py](http://localhost:8888/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/ipykernel_launcher.py)"...
  File "[/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/ipykernel_launcher.py", line 17](http://localhost:8888/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/ipykernel_launcher.py#line=16), in <module>
    app.launch_new_instance()
    │   └ <bound method Application.launch_instance of <class 'ipykernel.kernelapp.IPKernelApp'>>
    └ <module 'ipykernel.kernelapp' from '[/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/ipykernel/kernelapp.py](http://localhost:8888/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/ipykernel/kernelapp.py)'>
  File "[/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/traitlets/config/application.py", line 1075](http://localhost:8888/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/traitlets/config/application.py#line=1074), in launch_instance
    app.start()
    │   └ <function IPKernelApp.start at 0x7fd8c8014af0>
    └ <ipykernel.kernelapp.IPKernelApp object at 0x7fd8c4c21150>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/ipykernel/kernelapp.py", line 701, in start
    self.io_loop.start()
    │    │       └ <function BaseAsyncIOLoop.start at 0x7fd8c80155a0>
    │    └ <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7fd8c8046b00>
    └ <ipykernel.kernelapp.IPKernelApp object at 0x7fd8c4c21150>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/tornado/platform/asyncio.py", line 195, in start
    self.asyncio_loop.run_forever()
    │    │            └ <function BaseEventLoop.run_forever at 0x7fd8c6234dc0>
    │    └ <_UnixSelectorEventLoop running=True closed=False debug=False>
    └ <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7fd8c8046b00>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
    │    └ <function BaseEventLoop._run_once at 0x7fd8c62368c0>
    └ <_UnixSelectorEventLoop running=True closed=False debug=False>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
    │      └ <function Handle._run at 0x7fd8c61e1bd0>
    └ <Handle Task.task_wakeup(<Future finis...e50>, ...],))>)>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
    │    │            │    │           │    └ <member '_args' of 'Handle' objects>
    │    │            │    │           └ <Handle Task.task_wakeup(<Future finis...e50>, ...],))>)>
    │    │            │    └ <member '_callback' of 'Handle' objects>
    │    │            └ <Handle Task.task_wakeup(<Future finis...e50>, ...],))>)>
    │    └ <member '_context' of 'Handle' objects>
    └ <Handle Task.task_wakeup(<Future finis...e50>, ...],))>)>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 534, in dispatch_queue
    await self.process_one()
          │    └ <function Kernel.process_one at 0x7fd8c79e85e0>
          └ <ipykernel.ipkernel.IPythonKernel object at 0x7fd8c8047100>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 523, in process_one
    await dispatch(*args)
          │         └ ([<zmq.sugar.frame.Frame object at 0x7fd8cde3e820>, <zmq.sugar.frame.Frame object at 0x7fd8cde3fe20>, <zmq.sugar.frame.Frame ...
          └ <bound method Kernel.dispatch_shell of <ipykernel.ipkernel.IPythonKernel object at 0x7fd8c8047100>>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 429, in dispatch_shell
    await result
          └ <coroutine object Kernel.execute_request at 0x7fd8cde9edc0>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 767, in execute_request
    reply_content = await reply_content
                          └ <coroutine object IPythonKernel.do_execute at 0x7fd8cddebae0>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/ipykernel/ipkernel.py", line 429, in do_execute
    res = shell.run_cell(
          │     └ <function ZMQInteractiveShell.run_cell at 0x7fd8c7ff63b0>
          └ <ipykernel.zmqshell.ZMQInteractiveShell object at 0x7fd8c80475b0>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/ipykernel/zmqshell.py", line 549, in run_cell
    return super().run_cell(*args, **kwargs)
                             │       └ {'store_history': True, 'silent': False, 'cell_id': 'b3d9eb6a-9a32-4f7e-8922-aff3c5e4a314'}
                             └ ("gen_md('demo1.pdf')",)
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3075, in run_cell
    result = self._run_cell(
             │    └ <function InteractiveShell._run_cell at 0x7fd8c7046dd0>
             └ <ipykernel.zmqshell.ZMQInteractiveShell object at 0x7fd8c80475b0>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3130, in _run_cell
    result = runner(coro)
             │      └ <coroutine object InteractiveShell.run_cell_async at 0x7fd8cddeb990>
             └ <function _pseudo_sync_runner at 0x7fd8c702a710>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/IPython/core/async_helpers.py", line 129, in _pseudo_sync_runner
    coro.send(None)
    │    └ <method 'send' of 'coroutine' objects>
    └ <coroutine object InteractiveShell.run_cell_async at 0x7fd8cddeb990>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3334, in run_cell_async
    has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
                       │    │             │        │     └ '/var/folders/lw/mt_gxt9143zgcck9x_tsm2240000gn/T/ipykernel_88890/2769732803.py'
                       │    │             │        └ [<ast.Expr object at 0x7fd8c84b17b0>]
                       │    │             └ <ast.Module object at 0x7fd8c84b1b40>
                       │    └ <function InteractiveShell.run_ast_nodes at 0x7fd8c70470a0>
                       └ <ipykernel.zmqshell.ZMQInteractiveShell object at 0x7fd8c80475b0>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3517, in run_ast_nodes
    if await self.run_code(code, result, async_=asy):
             │    │        │     │              └ False
             │    │        │     └ <ExecutionResult object at 7fd8c84b1720, execution_count=6 error_before_exec=None error_in_exec=None info=<ExecutionInfo obje...
             │    │        └ <code object <module> at 0x7fd8cde55dc0, file "/var/folders/lw/mt_gxt9143zgcck9x_tsm2240000gn/T/ipykernel_88890/2769732803.py...
             │    └ <function InteractiveShell.run_code at 0x7fd8c7047130>
             └ <ipykernel.zmqshell.ZMQInteractiveShell object at 0x7fd8c80475b0>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3577, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
         │         │    │               │    └ {'__name__': '__main__', '__doc__': 'Automatically created module for IPython interactive environment', '__package__': None, ...
         │         │    │               └ <ipykernel.zmqshell.ZMQInteractiveShell object at 0x7fd8c80475b0>
         │         │    └ <property object at 0x7fd8c702e7f0>
         │         └ <ipykernel.zmqshell.ZMQInteractiveShell object at 0x7fd8c80475b0>
         └ <code object <module> at 0x7fd8cde55dc0, file "/var/folders/lw/mt_gxt9143zgcck9x_tsm2240000gn/T/ipykernel_88890/2769732803.py...

  File "/var/folders/lw/mt_gxt9143zgcck9x_tsm2240000gn/T/ipykernel_88890/2769732803.py", line 1, in <module>
    gen_md('demo1.pdf')
    └ <function gen_md at 0x7fd8c84cd360>

  File "/var/folders/lw/mt_gxt9143zgcck9x_tsm2240000gn/T/ipykernel_88890/2159648948.py", line 16, in gen_md
    pipe.pipe_parse()
    │    └ <function UNIPipe.pipe_parse at 0x7fd8cde40e50>
    └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7fd8cc3bf430>

  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/magic_pdf/pipe/UNIPipe.py", line 35, in pipe_parse
    self.pdf_mid_data = parse_union_pdf(self.pdf_bytes, self.model_list, self.image_writer,
    │    │              │               │    │          │    │           │    └ <magic_pdf.rw.DiskReaderWriter.DiskReaderWriter object at 0x7fd8c855a9e0>
    │    │              │               │    │          │    │           └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7fd8cc3bf430>
    │    │              │               │    │          │    └ []
    │    │              │               │    │          └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7fd8cc3bf430>
    │    │              │               │    └ b'%PDF-1.3\n%\xe4\xe3\xcf\xd2\n76 0 obj\n<<\n/Linearized 1.0\n/O 78\n/H [ 1149 392 ]\n/L 391252\n/E 87074\n/N 11\n/T 389688\n...
    │    │              │               └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7fd8cc3bf430>
    │    │              └ <function parse_union_pdf at 0x7fd8cde40c10>
    │    └ None
    └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7fd8cc3bf430>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/magic_pdf/user_api.py", line 88, in parse_union_pdf
    pdf_info_dict = parse_pdf(parse_pdf_by_txt)
                    │         └ <function parse_pdf_by_txt at 0x7fd8cde40af0>
                    └ <function parse_union_pdf.<locals>.parse_pdf at 0x7fd8c84cd240>
> File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/magic_pdf/user_api.py", line 77, in parse_pdf
    return method(
           └ <function parse_pdf_by_txt at 0x7fd8cde40af0>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/magic_pdf/pdf_parse_by_txt.py", line 12, in parse_pdf_by_txt
    return pdf_parse_union(pdf_bytes,
           │               └ b'%PDF-1.3\n%\xe4\xe3\xcf\xd2\n76 0 obj\n<<\n/Linearized 1.0\n/O 78\n/H [ 1149 392 ]\n/L 391252\n/E 87074\n/N 11\n/T 389688\n...
           └ <function pdf_parse_union at 0x7fd8cde40940>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/magic_pdf/pdf_parse_union_core.py", line 225, in pdf_parse_union
    page_info = parse_page_core(pdf_docs, magic_model, page_id, pdf_bytes_md5, imageWriter, parse_mode)
                │               │         │            │        │              │            └ 'txt'
                │               │         │            │        │              └ <magic_pdf.rw.DiskReaderWriter.DiskReaderWriter object at 0x7fd8c855a9e0>
                │               │         │            │        └ '3DDDBCDE75F2248F9AB9BC8E266EE16E'
                │               │         │            └ 0
                │               │         └ <magic_pdf.model.magic_model.MagicModel object at 0x7fd8cde1b9d0>
                │               └ Document('', <memory, doc# 4>)
                └ <function parse_page_core at 0x7fd8cde408b0>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/magic_pdf/pdf_parse_union_core.py", line 83, in parse_page_core
    img_blocks = magic_model.get_imgs(page_id)
                 │           │        └ 0
                 │           └ <function MagicModel.get_imgs at 0x7fd8cc416e60>
                 └ <magic_pdf.model.magic_model.MagicModel object at 0x7fd8cde1b9d0>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/magic_pdf/model/magic_model.py", line 459, in get_imgs
    records, _ = self.__tie_up_category_by_distance(page_no, 3, 4)
                 │                                  └ 0
                 └ <magic_pdf.model.magic_model.MagicModel object at 0x7fd8cde1b9d0>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/magic_pdf/model/magic_model.py", line 186, in __tie_up_category_by_distance
    self.__model_list[page_no]["layout_dets"],
    │                 └ 0
    └ <magic_pdf.model.magic_model.MagicModel object at 0x7fd8cde1b9d0>

IndexError: list index out of range
2024-07-17 23:38:43.019 | WARNING  | magic_pdf.user_api:parse_union_pdf:90 - parse_pdf_by_txt drop or error, switch to parse_pdf_by_ocr
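
The ERROR and WARNING lines in this log suggest a try-then-fall-back flow: the text parser fails, the failure is logged, and the OCR parser takes over. Below is a minimal sketch of that pattern only. `parse_with_fallback`, `primary`, and `fallback` are hypothetical names for illustration and are not part of the magic_pdf API; the real control flow in `user_api.py` may differ.

```python
import logging

logger = logging.getLogger("fallback_sketch")


def parse_with_fallback(pdf_bytes, primary, fallback):
    """Try `primary`; if it raises or returns None, use `fallback` instead."""
    try:
        result = primary(pdf_bytes)
    except Exception as exc:
        # Log the failure instead of propagating it, so the fallback can run.
        logger.error("%s", exc)
        result = None
    if result is None:
        logger.warning("primary parser dropped or failed, switching to fallback")
        result = fallback(pdf_bytes)
    return result


def failing_txt_parser(pdf_bytes):
    # Hypothetical stand-in for a text parser that hits an empty model list.
    raise IndexError("list index out of range")


def ocr_parser(pdf_bytes):
    # Hypothetical stand-in for an OCR parser.
    return {"mode": "ocr"}
```

A consequence of this design is that the caller still gets a result, but from a different parser, so extraction quality can change without any exception reaching the caller.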

Operating system

MacOS

Python version

3.10

Device mode

cpu

Parsing a text-based PDF fails with magic_pdf.user_api:parse_pdf:85 - list index out of range

Description of the bug

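The input is a medical research paper PDF that contains both text and figures. Parsing reports the following two messages: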

  • INFO | magic_pdf.libs.pdf_check:detect_invalid_chars:57 - cid_count: 3, text_len: 45665, cid_chars_radio: 6.57217341774925e-05
  • ERROR | magic_pdf.user_api:parse_pdf:85 - list index out of range
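
The ERROR message matches what the traceback shows: `self.model_list` is `[]`, and the failing statement indexes `self.__model_list[page_no]` with `page_no` equal to 0. A minimal sketch of that failure mode, with an explicit bounds check for comparison, is given here. `TinyModel` is a hypothetical stand-in written for this illustration and is not the `MagicModel` class.

```python
class TinyModel:
    """Hypothetical wrapper around per-page model output (not MagicModel)."""

    def __init__(self, model_list):
        # Expected shape: one dict per page, e.g. {"layout_dets": [...]}.
        self._model_list = model_list

    def layout_dets(self, page_no):
        # Unchecked lookup: raises IndexError if there is no entry for page_no.
        return self._model_list[page_no]["layout_dets"]

    def layout_dets_checked(self, page_no):
        # Guarded lookup: report the missing page with a descriptive error.
        if not 0 <= page_no < len(self._model_list):
            raise ValueError(
                f"no model output for page {page_no} "
                f"(model list has {len(self._model_list)} pages)"
            )
        return self._model_list[page_no]["layout_dets"]


def try_unchecked(model_list, page_no=0):
    """Return the layout detections, or the IndexError text if lookup fails."""
    try:
        return TinyModel(model_list).layout_dets(page_no)
    except IndexError as exc:
        return f"IndexError: {exc}"
```

Calling `try_unchecked([])` reproduces the same error text as the log, while a list with one page entry returns that page's detections.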

As a result, the text was not extracted directly. The Markdown document was produced through OCR instead, and it contains quite a few errors:

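  1. Many superscript annotations are misrecognized; for example, superscripts 5, 6, 7, and 8 in the author list are all wrong.
  2. Some text is lost; for example, the first row of Table 1 is missing entirely.
  3. There are minor typesetting problems; for example, the hyphens of words split across lines are kept, and the words are not rejoined.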
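
For item 3, a post-processing step could rejoin words that were split at a line end. The sketch below is a plain regular-expression heuristic written for this report, not MinerU's own logic; `join_hyphenated` is a hypothetical helper name. It assumes the continuation starts with a lowercase letter, so it will also merge genuine compounds such as "long-" / "term" that happen to break at a hyphen.

```python
import re

# A word fragment, a hyphen at the end of a line, then a lowercase continuation
# at the start of the next line (optional spaces or tabs around the line break).
_HYPHEN_BREAK = re.compile(r"(\w+)-[ \t]*\n[ \t]*([a-z]\w*)")


def join_hyphenated(text: str) -> str:
    """Rejoin words that were hyphenated across a line break."""
    return _HYPHEN_BREAK.sub(r"\1\2", text)
```

For example, `join_hyphenated("implemen-\ntation of")` returns `"implementation of"`, while a hyphen followed by a capitalized word on the next line is left untouched.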

How to reproduce the bug

import pathlib
import json

from loguru import logger

from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter

import magic_pdf.model as model_config 
model_config.__use_inside_model__ = True

def gen_md(pdf_file: str):
    """Parse pdf_file with UNIPipe and write the Markdown to <stem>_mineru.md."""
    try:
        pdf_path = pathlib.Path(pdf_file)
        task_name = pdf_path.stem
        task_dir = pdf_path.parent
        model_path = task_dir.joinpath(task_name + '.json')
        pdf_bytes = pdf_path.read_bytes()  # read and close the file in one step
        # model_json = json.loads(model_path.read_text(encoding='utf-8'))
        model_json = []  # pass an empty list as model_json to parse with the built-in model
        jso_useful_key = {'_pdf_type': '', 'model_list': model_json}
        # Extracted images are written to <task_dir>/<task_name>/img
        local_image_dir = task_dir.joinpath(task_name, 'img')
        image_dir = str(pathlib.PurePath(local_image_dir).name)
        image_writer = DiskReaderWriter(local_image_dir)
        pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
        pipe.pipe_classify()
        pipe.pipe_parse()
        md_content = pipe.pipe_mk_markdown(image_dir, drop_mode='none')
        with open(f'{task_name}_mineru.md', 'w', encoding='utf-8') as f:
            f.write(md_content)
    except Exception as e:
        logger.exception(e)

gen_md('demo1.pdf')

Log:

2024-07-17 23:38:42.992 | INFO     | magic_pdf.libs.pdf_check:detect_invalid_chars:57 - cid_count: 3, text_len: 45665, cid_chars_radio: 6.57217341774925e-05
2024-07-17 23:38:42.998 | ERROR    | magic_pdf.user_api:parse_pdf:85 - list index out of range
Traceback (most recent call last):

  File "/opt/anaconda3/envs/mineru/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
           │         │     └ {'__name__': '__main__', '__doc__': 'Entry point for launching an IPython kernel.\n\nThis is separate from the ipykernel pack...
           │         └ <code object <module> at 0x7fd8c4c7ece0, file "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/ipykernel_launcher.py"...
           └ <function _run_code at 0x7fd8c4c8b2e0>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
         │     └ {'__name__': '__main__', '__doc__': 'Entry point for launching an IPython kernel.\n\nThis is separate from the ipykernel pack...
         └ <code object <module> at 0x7fd8c4c7ece0, file "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/ipykernel_launcher.py"...
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/ipykernel_launcher.py", line 17, in <module>
    app.launch_new_instance()
    │   └ <bound method Application.launch_instance of <class 'ipykernel.kernelapp.IPKernelApp'>>
    └ <module 'ipykernel.kernelapp' from '/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/ipykernel/kernelapp.py'>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/traitlets/config/application.py", line 1075, in launch_instance
    app.start()
    │   └ <function IPKernelApp.start at 0x7fd8c8014af0>
    └ <ipykernel.kernelapp.IPKernelApp object at 0x7fd8c4c21150>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/ipykernel/kernelapp.py", line 701, in start
    self.io_loop.start()
    │    │       └ <function BaseAsyncIOLoop.start at 0x7fd8c80155a0>
    │    └ <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7fd8c8046b00>
    └ <ipykernel.kernelapp.IPKernelApp object at 0x7fd8c4c21150>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/tornado/platform/asyncio.py", line 195, in start
    self.asyncio_loop.run_forever()
    │    │            └ <function BaseEventLoop.run_forever at 0x7fd8c6234dc0>
    │    └ <_UnixSelectorEventLoop running=True closed=False debug=False>
    └ <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7fd8c8046b00>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
    │    └ <function BaseEventLoop._run_once at 0x7fd8c62368c0>
    └ <_UnixSelectorEventLoop running=True closed=False debug=False>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
    │      └ <function Handle._run at 0x7fd8c61e1bd0>
    └ <Handle Task.task_wakeup(<Future finis...e50>, ...],))>)>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
    │    │            │    │           │    └ <member '_args' of 'Handle' objects>
    │    │            │    │           └ <Handle Task.task_wakeup(<Future finis...e50>, ...],))>)>
    │    │            │    └ <member '_callback' of 'Handle' objects>
    │    │            └ <Handle Task.task_wakeup(<Future finis...e50>, ...],))>)>
    │    └ <member '_context' of 'Handle' objects>
    └ <Handle Task.task_wakeup(<Future finis...e50>, ...],))>)>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 534, in dispatch_queue
    await self.process_one()
          │    └ <function Kernel.process_one at 0x7fd8c79e85e0>
          └ <ipykernel.ipkernel.IPythonKernel object at 0x7fd8c8047100>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 523, in process_one
    await dispatch(*args)
          │         └ ([<zmq.sugar.frame.Frame object at 0x7fd8cde3e820>, <zmq.sugar.frame.Frame object at 0x7fd8cde3fe20>, <zmq.sugar.frame.Frame ...
          └ <bound method Kernel.dispatch_shell of <ipykernel.ipkernel.IPythonKernel object at 0x7fd8c8047100>>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 429, in dispatch_shell
    await result
          └ <coroutine object Kernel.execute_request at 0x7fd8cde9edc0>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/ipykernel/kernelbase.py", line 767, in execute_request
    reply_content = await reply_content
                          └ <coroutine object IPythonKernel.do_execute at 0x7fd8cddebae0>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/ipykernel/ipkernel.py", line 429, in do_execute
    res = shell.run_cell(
          │     └ <function ZMQInteractiveShell.run_cell at 0x7fd8c7ff63b0>
          └ <ipykernel.zmqshell.ZMQInteractiveShell object at 0x7fd8c80475b0>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/ipykernel/zmqshell.py", line 549, in run_cell
    return super().run_cell(*args, **kwargs)
                             │       └ {'store_history': True, 'silent': False, 'cell_id': 'b3d9eb6a-9a32-4f7e-8922-aff3c5e4a314'}
                             └ ("gen_md('demo1.pdf')",)
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3075, in run_cell
    result = self._run_cell(
             │    └ <function InteractiveShell._run_cell at 0x7fd8c7046dd0>
             └ <ipykernel.zmqshell.ZMQInteractiveShell object at 0x7fd8c80475b0>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3130, in _run_cell
    result = runner(coro)
             │      └ <coroutine object InteractiveShell.run_cell_async at 0x7fd8cddeb990>
             └ <function _pseudo_sync_runner at 0x7fd8c702a710>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/IPython/core/async_helpers.py", line 129, in _pseudo_sync_runner
    coro.send(None)
    │    └ <method 'send' of 'coroutine' objects>
    └ <coroutine object InteractiveShell.run_cell_async at 0x7fd8cddeb990>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3334, in run_cell_async
    has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
                       │    │             │        │     └ '/var/folders/lw/mt_gxt9143zgcck9x_tsm2240000gn/T/ipykernel_88890/2769732803.py'
                       │    │             │        └ [<ast.Expr object at 0x7fd8c84b17b0>]
                       │    │             └ <ast.Module object at 0x7fd8c84b1b40>
                       │    └ <function InteractiveShell.run_ast_nodes at 0x7fd8c70470a0>
                       └ <ipykernel.zmqshell.ZMQInteractiveShell object at 0x7fd8c80475b0>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3517, in run_ast_nodes
    if await self.run_code(code, result, async_=asy):
             │    │        │     │              └ False
             │    │        │     └ <ExecutionResult object at 7fd8c84b1720, execution_count=6 error_before_exec=None error_in_exec=None info=<ExecutionInfo obje...
             │    │        └ <code object <module> at 0x7fd8cde55dc0, file "/var/folders/lw/mt_gxt9143zgcck9x_tsm2240000gn/T/ipykernel_88890/2769732803.py...
             │    └ <function InteractiveShell.run_code at 0x7fd8c7047130>
             └ <ipykernel.zmqshell.ZMQInteractiveShell object at 0x7fd8c80475b0>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3577, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
         │         │    │               │    └ {'__name__': '__main__', '__doc__': 'Automatically created module for IPython interactive environment', '__package__': None, ...
         │         │    │               └ <ipykernel.zmqshell.ZMQInteractiveShell object at 0x7fd8c80475b0>
         │         │    └ <property object at 0x7fd8c702e7f0>
         │         └ <ipykernel.zmqshell.ZMQInteractiveShell object at 0x7fd8c80475b0>
         └ <code object <module> at 0x7fd8cde55dc0, file "/var/folders/lw/mt_gxt9143zgcck9x_tsm2240000gn/T/ipykernel_88890/2769732803.py...

  File "/var/folders/lw/mt_gxt9143zgcck9x_tsm2240000gn/T/ipykernel_88890/2769732803.py", line 1, in <module>
    gen_md('demo1.pdf')
    └ <function gen_md at 0x7fd8c84cd360>

  File "/var/folders/lw/mt_gxt9143zgcck9x_tsm2240000gn/T/ipykernel_88890/2159648948.py", line 16, in gen_md
    pipe.pipe_parse()
    │    └ <function UNIPipe.pipe_parse at 0x7fd8cde40e50>
    └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7fd8cc3bf430>

  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/magic_pdf/pipe/UNIPipe.py", line 35, in pipe_parse
    self.pdf_mid_data = parse_union_pdf(self.pdf_bytes, self.model_list, self.image_writer,
    │    │              │               │    │          │    │           │    └ <magic_pdf.rw.DiskReaderWriter.DiskReaderWriter object at 0x7fd8c855a9e0>
    │    │              │               │    │          │    │           └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7fd8cc3bf430>
    │    │              │               │    │          │    └ []
    │    │              │               │    │          └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7fd8cc3bf430>
    │    │              │               │    └ b'%PDF-1.3\n%\xe4\xe3\xcf\xd2\n76 0 obj\n<<\n/Linearized 1.0\n/O 78\n/H [ 1149 392 ]\n/L 391252\n/E 87074\n/N 11\n/T 389688\n...
    │    │              │               └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7fd8cc3bf430>
    │    │              └ <function parse_union_pdf at 0x7fd8cde40c10>
    │    └ None
    └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7fd8cc3bf430>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/magic_pdf/user_api.py", line 88, in parse_union_pdf
    pdf_info_dict = parse_pdf(parse_pdf_by_txt)
                    │         └ <function parse_pdf_by_txt at 0x7fd8cde40af0>
                    └ <function parse_union_pdf.<locals>.parse_pdf at 0x7fd8c84cd240>
> File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/magic_pdf/user_api.py", line 77, in parse_pdf
    return method(
           └ <function parse_pdf_by_txt at 0x7fd8cde40af0>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/magic_pdf/pdf_parse_by_txt.py", line 12, in parse_pdf_by_txt
    return pdf_parse_union(pdf_bytes,
           │               └ b'%PDF-1.3\n%\xe4\xe3\xcf\xd2\n76 0 obj\n<<\n/Linearized 1.0\n/O 78\n/H [ 1149 392 ]\n/L 391252\n/E 87074\n/N 11\n/T 389688\n...
           └ <function pdf_parse_union at 0x7fd8cde40940>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/magic_pdf/pdf_parse_union_core.py", line 225, in pdf_parse_union
    page_info = parse_page_core(pdf_docs, magic_model, page_id, pdf_bytes_md5, imageWriter, parse_mode)
                │               │         │            │        │              │            └ 'txt'
                │               │         │            │        │              └ <magic_pdf.rw.DiskReaderWriter.DiskReaderWriter object at 0x7fd8c855a9e0>
                │               │         │            │        └ '3DDDBCDE75F2248F9AB9BC8E266EE16E'
                │               │         │            └ 0
                │               │         └ <magic_pdf.model.magic_model.MagicModel object at 0x7fd8cde1b9d0>
                │               └ Document('', <memory, doc# 4>)
                └ <function parse_page_core at 0x7fd8cde408b0>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/magic_pdf/pdf_parse_union_core.py", line 83, in parse_page_core
    img_blocks = magic_model.get_imgs(page_id)
                 │           │        └ 0
                 │           └ <function MagicModel.get_imgs at 0x7fd8cc416e60>
                 └ <magic_pdf.model.magic_model.MagicModel object at 0x7fd8cde1b9d0>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/magic_pdf/model/magic_model.py", line 459, in get_imgs
    records, _ = self.__tie_up_category_by_distance(page_no, 3, 4)
                 │                                  └ 0
                 └ <magic_pdf.model.magic_model.MagicModel object at 0x7fd8cde1b9d0>
  File "/opt/anaconda3/envs/mineru/lib/python3.10/site-packages/magic_pdf/model/magic_model.py", line 186, in __tie_up_category_by_distance
    self.__model_list[page_no]["layout_dets"],
    │                 └ 0
    └ <magic_pdf.model.magic_model.MagicModel object at 0x7fd8cde1b9d0>

IndexError: list index out of range
2024-07-17 23:38:43.019 | WARNING  | magic_pdf.user_api:parse_union_pdf:90 - parse_pdf_by_txt drop or error, switch to parse_pdf_by_ocr

Operating system | 操作系统

MacOS

Python version | Python 版本

3.10

Device mode | 设备模式

cpu

The output JSON is unordered; how can it be output in layout-analysis order?

Is your feature request related to a problem? Please describe.
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

Describe the solution you'd like
A clear and concise description of what you want to happen.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

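LaTeX formulas are not recognized correctly

The project works very well overall. Thanks to all the contributors!

While using it, I found that math formulas are not correctly recognized as LaTeX expressions.

Take this PDF as an example: https://arxiv.org/pdf/2407.01906. The formula on page 2

image

was recognized as

image

According to the project description, math formulas can be recognized as LaTeX. What do I need to do so that the formulas in the paper are correctly recognized as LaTeX expressions?

Thanks!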

AssertionError: Dataset 'scihub_train' is already registered!

Description of the bug | 错误描述

I wrapped the processing logic in a function like this:

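import copy
import json
import os
import os.path as osp
import sys

from loguru import logger

# Assumed import paths: they match the module names shown in the traceback
# below, but the original snippet did not include its imports.
import magic_pdf.model as model_config
from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter


def process_one_file(file_path: str, output_path: str):
    os.makedirs(output_path, exist_ok=True)
    try:
        # Read the whole PDF into memory.
        with open(file_path, "rb") as f:
            pdf_bytes = f.read()
        model_json = []

        jso_useful_key = {"_pdf_type": "", "model_list": model_json}
        local_image_dir = os.path.join(output_path, "images")
        image_dir = str(os.path.basename(local_image_dir))
        image_writer = DiskReaderWriter(local_image_dir)
        pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer, is_debug=True)
        pipe.pipe_classify()
        # If no valid model data was passed in, parse with the built-in model.
        if len(model_json) == 0:
            if model_config.__use_inside_model__:
                pipe.pipe_analyze()
            else:
                logger.error("need model list input")
                raise ValueError("need model list input")
        pipe.pipe_parse()
        md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
        # Write the Markdown next to the images, named after the output directory.
        with open(
            f"{output_path}/{osp.basename(output_path)}.md", "w", encoding="utf-8"
        ) as f:
            f.write(md_content)
        # Keep a copy of the raw model output for inspection.
        orig_model_list = copy.deepcopy(pipe.model_list)
        with open(f"{output_path}/model_list.json", "w") as f:
            json.dump(orig_model_list, f, ensure_ascii=False, indent=4)
    except Exception as e:
        logger.exception(e)
        sys.exit(-1)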

However, the second time this function is called, detectron2 raises: AssertionError: Dataset 'scihub_train' is already registered!

The full error is as follows:

Traceback (most recent call last):

  File "/home/yuanye/pdf-extract/run-test.py", line 80, in <module>
    main()
    └ <function main at 0x7f754a81e950>

  File "/home/yuanye/pdf-extract/run-test.py", line 69, in main
    process_one_file(file_path, output_path)
    │                │          └ '/home/yuanye/pdf-extract/example-outputs/example-simple-text-pdf'
    │                └ '/home/yuanye/pdf-extract/pdf-examples/example-simple-text.pdf'
    └ <function process_one_file at 0x7f754ce01990>

> File "/home/yuanye/pdf-extract/run-test.py", line 32, in process_one_file
    pipe.pipe_analyze()
    │    └ <function UNIPipe.pipe_analyze at 0x7f73fca40ee0>
    └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7f7307c00220>

  File "/home/yuanye/.conda/envs/pdf/lib/python3.10/site-packages/magic_pdf/pipe/UNIPipe.py", line 29, in pipe_analyze
    self.model_list = doc_analyze(self.pdf_bytes, ocr=False)
    │    │            │           │    └ b'%PDF-1.4\n%\xaa\xab\xac\xad\n1 0 obj\n<<\n/Title (Workload Pipelining)\n/Author (Arm Ltd.)\n/Subject (In this guide, read a...
    │    │            │           └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7f7307c00220>
    │    │            └ <function doc_analyze at 0x7f74a611b370>
    │    └ []
    └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x7f7307c00220>
  File "/home/yuanye/.conda/envs/pdf/lib/python3.10/site-packages/magic_pdf/model/doc_analyze_by_custom_model.py", line 69, in doc_analyze
    custom_model = CustomPEKModel(ocr=ocr, show_log=show_log, models_dir=local_models_dir, device=device)
                   │                  │             │                    │                        └ 'cuda'
                   │                  │             │                    └ '/home/yuanye/PDF-Extract-Kit/models'
                   │                  │             └ False
                   │                  └ False
                   └ <class 'magic_pdf.model.pdf_extract_kit.CustomPEKModel'>
  File "/home/yuanye/.conda/envs/pdf/lib/python3.10/site-packages/magic_pdf/model/pdf_extract_kit.py", line 115, in __init__
    self.layout_model = Layoutlmv3_Predictor(
    │                   └ <class 'magic_pdf.model.pek_sub_modules.layoutlmv3.model_init.Layoutlmv3_Predictor'>
    └ <magic_pdf.model.pdf_extract_kit.CustomPEKModel object at 0x7f7307c346a0>
  File "/home/yuanye/.conda/envs/pdf/lib/python3.10/site-packages/magic_pdf/model/pek_sub_modules/layoutlmv3/model_init.py", line 122, in __init__
    cfg = setup(layout_args, device)
          │     │            └ 'cuda'
          │     └ {'config_file': '/home/yuanye/.conda/envs/pdf/lib/python3.10/site-packages/magic_pdf/resources/model_config/layoutlmv3/layout...
          └ <function setup at 0x7f7308d7f520>
  File "/home/yuanye/.conda/envs/pdf/lib/python3.10/site-packages/magic_pdf/model/pek_sub_modules/layoutlmv3/model_init.py", line 82, in setup
    register_coco_instances(
    └ <function register_coco_instances at 0x7f730a999870>
  File "/home/yuanye/.conda/envs/pdf/lib/python3.10/site-packages/detectron2/data/datasets/coco.py", line 510, in register_coco_instances
    DatasetCatalog.register(name, lambda: load_coco_json(json_file, image_root, name))
    │              │        │             │              │          │           └ 'scihub_train'
    │              │        │             │              │          └ '/mnt/petrelfs/share_data/zhaozhiyuan/publaynet/layout_scihub/train'
    │              │        │             │              └ '/mnt/petrelfs/share_data/zhaozhiyuan/publaynet/layout_scihub/train.json'
    │              │        │             └ <function load_coco_json at 0x7f730a998f70>
    │              │        └ 'scihub_train'
    │              └ <function _DatasetCatalog.register at 0x7f730a97c430>
    └ DatasetCatalog(registered datasets: coco_2014_train, coco_2014_val, coco_2014_minival, coco_2014_valminusminival, coco_2017_t...
  File "/home/yuanye/.conda/envs/pdf/lib/python3.10/site-packages/detectron2/data/catalog.py", line 37, in register
    assert name not in self, "Dataset '{}' is already registered!".format(name)
           │           │                                                  └ 'scihub_train'
           │           └ DatasetCatalog(registered datasets: coco_2014_train, coco_2014_val, coco_2014_minival, coco_2014_valminusminival, coco_2017_t...
           └ 'scihub_train'

AssertionError: Dataset 'scihub_train' is already registered!

How to reproduce the bug | 如何复现

Call process_one_file (the function shown in the bug description above) twice in the same process.

Operating system | 操作系统

Linux

Python version | Python 版本

3.10

Software version | 软件版本 (magic-pdf --version)

0.6.x

Device mode | 设备模式

cuda
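
A possible workaround, sketched under assumptions: the traceback shows doc_analyze constructing a new CustomPEKModel on every call, and its setup calls register_coco_instances for 'scihub_train', which detectron2 allows only once per name. Building the model once per process and reusing it would avoid the second registration. The sketch below uses stand-ins to show the memoization pattern with functools.lru_cache; FakeCatalog, ExpensiveModel, and get_model are hypothetical names, not magic_pdf or detectron2 APIs.

```python
from functools import lru_cache


class FakeCatalog(dict):
    """Stand-in for a registry that refuses duplicate names (hypothetical)."""

    def register(self, name, loader):
        assert name not in self, f"Dataset '{name}' is already registered!"
        self[name] = loader


CATALOG = FakeCatalog()


class ExpensiveModel:
    """Stand-in for a model whose constructor registers a dataset (hypothetical)."""

    def __init__(self, device: str):
        # Constructing a second instance would hit the assertion above.
        CATALOG.register("scihub_train", lambda: [])
        self.device = device


@lru_cache(maxsize=None)
def get_model(device: str) -> ExpensiveModel:
    """Build the model on first use and return the same instance afterwards."""
    return ExpensiveModel(device)


if __name__ == "__main__":
    first = get_model("cpu")
    second = get_model("cpu")  # served from the cache, no re-registration
    print(first is second)
```

Whether magic_pdf 0.6.x offers a hook for reusing the model in this way is not confirmed here; the sketch only illustrates the caching idea.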

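Cannot download weights from HF; can ModelScope (hosted in China) be supported?

Description of the bug | 错误描述

Our company network is in China and has strict compliance requirements, so we cannot connect to HF. Could the model weights be uploaded to a platform hosted in China?

How to reproduce the bug | 如何复现

Our company network is in China and has strict compliance requirements, so we cannot connect to HF. Could the model weights be uploaded to a platform hosted in China?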

Operating system | 操作系统

Linux

Python version | Python 版本

3.10

Device mode | 设备模式

cuda

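Recognition results are inaccurate on some pages

Description of the bug | 错误描述

In my case, the inaccuracies fall into the following categories:

  1. Text is not recognized at all, e.g. "甲状腺专科专家委员会..." ("Thyroid Specialist Expert Committee...") on page 1; the heading "糖化血红蛋白" ("Glycated hemoglobin") at the top of page 10; "未检项目" ("Items not examined") on page 14
  2. Detection boxes are misplaced, e.g. in the upper half of page 3 the box sits slightly too high, so only half of the last line is captured
  3. Recognized text is incomplete, e.g. for "体检所见:右眼..." ("Examination findings: right eye...") at the top of page 4, only the "左眼..." ("left eye...") text was recognized

In most cases the results are accurate. I would like to understand what causes the problems above and how to fix them. Thank you very much.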

How to reproduce the bug | 如何复现

test.pdf
magic-pdf pdf-command --pdf "test.pdf" --inside_model true

Operating system | 操作系统

Linux

Python version | Python 版本

3.10

Device mode | 设备模式

cuda

Installed version is not 0.6.1

Description of the bug | 错误描述

Running pip install magic-pdf[full-cpu] directly installs 0.5.13; forcing the version with pip install magic-pdf[full-cpu]==0.6.1 produces a dependency error. The Linux machine is arm64.

How to reproduce the bug | 如何复现

ERROR: Cannot install magic-pdf because these package versions have conflicting dependencies.

The conflict is caused by:
    unimernet 0.1.1 depends on eva-decord<0.7.0 and >=0.6.1
    unimernet 0.1.0 depends on eva-decord<0.7.0 and >=0.6.1
    unimernet 0.0.9 depends on eva-decord<0.7.0 and >=0.6.1
    unimernet 0.0.4 depends on eva-decord<0.7.0 and >=0.6.1
    unimernet 0.0.3 depends on eva-decord<0.7.0 and >=0.6.1
    unimernet 0.0.2 depends on eva-decord<0.7.0 and >=0.6.1
    unimernet 0.0.1 depends on eva-decord<0.7.0 and >=0.6.1

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

Operating system | 操作系统

Linux

Python version | Python 版本

3.10

Software version | 软件版本 (magic-pdf --version)

0.6.x

Device mode | 设备模式

cpu

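Tables are treated as images and cannot be parsed; all headings in the Markdown are at the same level and are not recognized correctly

Description of the bug | 错误描述

Tables are treated as images and cannot be parsed. All headings in the Markdown are at the same level; the heading hierarchy is not recognized correctly.

How to reproduce the bug | 如何复现

Tables are treated as images and cannot be parsed. All headings in the Markdown are at the same level; the heading hierarchy is not recognized correctly.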

Operating system | 操作系统

Windows

Python version | Python 版本

3.10

Device mode | 设备模式

cpu

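Information loss

I am converting a PDF file to Markdown. My JSON file contains the entry {"category_id": 15, "poly": [276.0, 828.0, 677.0, 828.0, 677.0, 863.0, 276.0, 863.0], "score": 0.99, "text": "Figure 2: Lengths of instructions"}, but when converting to Markdown the text "Figure 2: Lengths of instructions" is lost entirely. In addition, although both images in the original were cropped out, only one of them actually ends up in the Markdown file. The PDF file I used is attached.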
testpdf.pdf
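
To see which text spans the model JSON actually contains before comparing against the Markdown, a small helper can list them. This is a sketch, assuming each entry carries the category_id, poly, score, and text fields shown above and that poly holds four corner points as alternating x, y values; poly_to_bbox and text_spans are hypothetical helper names, not part of magic-pdf.

```python
from typing import Dict, List, Sequence, Tuple

BBox = Tuple[float, float, float, float]


def poly_to_bbox(poly: Sequence[float]) -> BBox:
    """Collapse an 8-number corner polygon into (x0, y0, x1, y1)."""
    xs, ys = poly[0::2], poly[1::2]
    return (min(xs), min(ys), max(xs), max(ys))


def text_spans(layout_dets: List[Dict]) -> List[Tuple[str, BBox]]:
    """Return (text, bbox) for every detection that carries non-empty text."""
    return [
        (det["text"], poly_to_bbox(det["poly"]))
        for det in layout_dets
        if det.get("text")
    ]


# The entry quoted in the report above.
entry = {
    "category_id": 15,
    "poly": [276.0, 828.0, 677.0, 828.0, 677.0, 863.0, 276.0, 863.0],
    "score": 0.99,
    "text": "Figure 2: Lengths of instructions",
}

if __name__ == "__main__":
    for text, bbox in text_spans([entry]):
        print(bbox, text)
```

Searching the generated Markdown for each listed text then shows which spans were dropped between the model output and the final file.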

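Could the storage location of the fast-langdetect folder be made configurable, like the paths in magic-pdf.json?

1. On the first run of an offline deployment, the error urllib.error.URLError: <urlopen error [Errno 101] Network is unreachable> is raised
The first run needs to download a small language-detection model online. For an offline deployment, you need to download the model manually and place it in the specified directory.
See: #121

On first run, some internal modules may need network access to download small model resources. Your error log shows that fast_langdetect needs to download a model file for language detection. If your machine cannot go online, please extract the contents of the attached archive into the "/tmp" directory
fasttext-langdetect.zip
See:
https://github.com/LlmKira/fast-langdetect

Could the storage location of the fast-langdetect folder be made configurable, like the paths in magic-pdf.json?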

AttributeError: 'CustomMBartDecoder' object has no attribute 'embed_scale'

[07/21 22:09:19 d2.checkpoint.detection_checkpoint]: [DetectionCheckpointer] Loading from D:\github_project\backend_code_generation\models\Layout/model_final.pth ...
[07/21 22:09:19 fvcore.common.checkpoint]: [Checkpointer] Loading from d:\github_project\backend_code_generation\models\Layout/model_final.pth ...
2024-07-21 22:09:20.946 | INFO | magic_pdf.model.pdf_extract_kit:__init__:125 - DocAnalysis init done!
2024-07-21 22:09:20.950 | INFO | magic_pdf.model.doc_analyze_by_custom_model:custom_model_init:90 - model init cost: 44.55924439430237
2024-07-21 22:09:59.858 | INFO | magic_pdf.model.pdf_extract_kit:__call__:136 - layout detection cost: 38.77

0: 1888x1472 1 embedding, 10787.5ms
Speed: 52.9ms preprocess, 10787.5ms inference, 2.0ms postprocess per image at shape (1, 3, 1888, 1472)
2024-07-21 22:10:14.915 | ERROR | magic_pdf.cli.magicpdf:parse_doc:338 - 'CustomMBartDecoder' object has no attribute 'embed_scale'
Traceback (most recent call last):

......................

File "D:\github_project\backend_code_generation\venv\lib\site-packages\unimernet\models\unimernet\encoder_decoder.py", line 235, in forward
inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale
│ │ └ CustomMBartDecoder(
│ │ (embed_tokens): MBartScaledWordEmbedding(50000, 1024, padding_idx=1)
│ │ (embed_positions): MBartLearnedP...
│ └ tensor([[0]])
└ CustomMBartDecoder(
(embed_tokens): MBartScaledWordEmbedding(50000, 1024, padding_idx=1)
(embed_positions): MBartLearnedP...

File "D:\github_project\backend_code_generation\venv\lib\site-packages\torch\nn\modules\module.py", line 1709, in __getattr__
raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")

AttributeError: 'CustomMBartDecoder' object has no attribute 'embed_scale'

After model layout detection, this error arises. What could be the issue?
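
The traceback shows embed_tokens as an MBartScaledWordEmbedding while the unimernet decoder code still reads self.embed_scale. That suggests, though the report does not confirm it, that the installed transformers differs from the version unimernet was written against. A minimal sketch for recording the installed versions when debugging or reporting such a mismatch, using only the standard library; installed_version is a hypothetical helper name, and the package list is just the set of names appearing in the log.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Package names taken from the log above; which versions are compatible
    # is not stated in the report.
    for name in ("transformers", "unimernet", "torch"):
        print(name, installed_version(name) or "not installed")
```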

Error when converting with magic-pdf pdf-command

Description of the bug | 错误描述

Traceback (most recent call last):
File "/mnt/gpt/anaconda3/envs/py310/bin/magic-pdf", line 8, in <module>
sys.exit(cli())
File "/mnt/gpt/anaconda3/envs/py310/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/mnt/gpt/anaconda3/envs/py310/lib/python3.10/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/mnt/gpt/anaconda3/envs/py310/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/mnt/gpt/anaconda3/envs/py310/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/mnt/gpt/anaconda3/envs/py310/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/mnt/gpt/anaconda3/envs/py310/lib/python3.10/site-packages/magic_pdf/cli/magicpdf.py", line 325, in pdf_command
do_parse(
File "/mnt/gpt/anaconda3/envs/py310/lib/python3.10/site-packages/magic_pdf/cli/magicpdf.py", line 111, in do_parse
pipe.pipe_analyze()
File "/mnt/gpt/anaconda3/envs/py310/lib/python3.10/site-packages/magic_pdf/pipe/UNIPipe.py", line 29, in pipe_analyze
self.model_list = doc_analyze(self.pdf_bytes, ocr=False)
File "/mnt/gpt/anaconda3/envs/py310/lib/python3.10/site-packages/magic_pdf/model/doc_analyze_by_custom_model.py", line 69, in doc_analyze
custom_model = CustomPEKModel(ocr=ocr, show_log=show_log, models_dir=local_models_dir, device=device)
File "/mnt/gpt/anaconda3/envs/py310/lib/python3.10/site-packages/magic_pdf/model/pdf_extract_kit.py", line 106, in __init__
self.mfd_model = mfd_model_init(str(os.path.join(models_dir, self.configs["weights"]["mfd"])))
File "/mnt/gpt/anaconda3/envs/py310/lib/python3.10/site-packages/magic_pdf/model/pdf_extract_kit.py", line 29, in mfd_model_init
mfd_model = YOLO(weight)
File "/mnt/gpt/anaconda3/envs/py310/lib/python3.10/site-packages/ultralytics/models/yolo/model.py", line 23, in __init__
super().__init__(model=model, task=task, verbose=verbose)
File "/mnt/gpt/anaconda3/envs/py310/lib/python3.10/site-packages/ultralytics/engine/model.py", line 149, in __init__
self._load(model, task=task)
File "/mnt/gpt/anaconda3/envs/py310/lib/python3.10/site-packages/ultralytics/engine/model.py", line 230, in _load
self.model, self.ckpt = attempt_load_one_weight(weights)
File "/mnt/gpt/anaconda3/envs/py310/lib/python3.10/site-packages/ultralytics/nn/tasks.py", line 855, in attempt_load_one_weight
ckpt, weight = torch_safe_load(weight) # load ckpt
File "/mnt/gpt/anaconda3/envs/py310/lib/python3.10/site-packages/ultralytics/nn/tasks.py", line 781, in torch_safe_load
ckpt = torch.load(file, map_location="cpu")
File "/mnt/gpt/anaconda3/envs/py310/lib/python3.10/site-packages/torch/serialization.py", line 1040, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "/mnt/gpt/anaconda3/envs/py310/lib/python3.10/site-packages/torch/serialization.py", line 1262, in _legacy_load
magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, 'v'.

How to reproduce the bug | 如何复现

magic-pdf pdf-command --pdf "test.pdf" --inside_model true

Operating system | 操作系统

Linux

Python version | Python 版本

3.10

Device mode | 设备模式

cpu
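
"invalid load key, 'v'" means the first byte torch.load read from the weight file was the character v, so the file is not a pickled checkpoint. One common cause (an assumption here, not confirmed by the report) is that the weights were cloned without Git LFS, which leaves small text pointer files beginning with "version https://git-lfs.github.com/spec/v1" in place of the real weights. The sketch below scans a models directory for such files; is_git_lfs_pointer and find_lfs_pointers are hypothetical helper names.

```python
from pathlib import Path
from typing import List

# Every Git LFS pointer file starts with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_git_lfs_pointer(path: Path) -> bool:
    """True if the file begins with the Git LFS pointer header."""
    with open(path, "rb") as f:
        return f.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def find_lfs_pointers(models_dir: str) -> List[Path]:
    """List files under models_dir that are LFS pointers rather than real data."""
    return sorted(
        p
        for p in Path(models_dir).rglob("*")
        if p.is_file() and is_git_lfs_pointer(p)
    )


if __name__ == "__main__":
    import sys

    # Usage: python check_lfs.py /path/to/models
    for pointer in find_lfs_pointers(sys.argv[1] if len(sys.argv) > 1 else "."):
        print("LFS pointer, not real weights:", pointer)
```

If the scan reports any files, re-downloading the weights with Git LFS installed would be the next thing to try.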

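detectron2 was built and installed from source, but a missing-dependency error is still reported at runtime

Description of the bug | 错误描述

The detectron2 dependency was built and installed from source, but a missing-dependency error is still reported at runtime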
image
image

How to reproduce the bug | 如何复现

pip install magic-pdf
git clone https://github.com/facebookresearch/detectron2.git
python -m pip install -e detectron2
pip install magic-pdf detectron2
./bin/magic-pdf pdf-command --pdf "/tmp/test.pdf" --inside_model true

Operating system | 操作系统

MacOS

Python version | Python 版本

3.12

Software version | 软件版本 (magic-pdf --version)

0.6.x

Device mode | 设备模式

mps

Error when running the demo

Description of the bug | 错误描述

2024-07-16 11:22:00.324 | INFO | magic_pdf.libs.pdf_check:detect_invalid_chars:57 - cid_count: 14, text_len: 26394, cid_chars_radio: 0.0005324003650745361
2024-07-16 11:22:00.329 | ERROR | magic_pdf.user_api:parse_pdf:85 - list index out of range
Traceback (most recent call last):

File "E:\MinerU-master\demo\demo.py", line 23, in <module>
pipe.pipe_parse()
│ └ <function UNIPipe.pipe_parse at 0x000001FE33EDD990>
└ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x000001FE03BA2DD0>

File "E:\MinerU-master\magic_pdf\pipe\UNIPipe.py", line 35, in pipe_parse
self.pdf_mid_data = parse_union_pdf(self.pdf_bytes, self.model_list, self.image_writer,
│ │ │ │ │ │ │ │ └ <magic_pdf.rw.DiskReaderWriter.DiskReaderWriter object at 0x000001FE03BA2E60>
│ │ │ │ │ │ │ └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x000001FE03BA2DD0>
│ │ │ │ │ │ └ []
│ │ │ │ │ └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x000001FE03BA2DD0>
│ │ │ │ └ b'%PDF-1.5\r%\x80\x84\x88\x8c\x90\x94\x98\x9c\xa0\xa4\xa8\xac\xb0\xb4\xb8\xbc\xc0\xc4\xc8\xcc\xd0\xd4\xd8\xdc\xe0\xe4\xe8\xec...
│ │ │ └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x000001FE03BA2DD0>
│ │ └ <function parse_union_pdf at 0x000001FE33EDD750>
│ └ None
└ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x000001FE03BA2DD0>

File "E:\MinerU-master\magic_pdf\user_api.py", line 88, in parse_union_pdf
pdf_info_dict = parse_pdf(parse_pdf_by_txt)
│ └ <function parse_pdf_by_txt at 0x000001FE33EDD630>
└ <function parse_union_pdf.<locals>.parse_pdf at 0x000001FE037F3E20>

File "E:\MinerU-master\magic_pdf\user_api.py", line 77, in parse_pdf
return method(
└ <function parse_pdf_by_txt at 0x000001FE33EDD630>

File "E:\MinerU-master\magic_pdf\pdf_parse_by_txt.py", line 12, in parse_pdf_by_txt
return pdf_parse_union(pdf_bytes,
│ └ b'%PDF-1.5\r%\x80\x84\x88\x8c\x90\x94\x98\x9c\xa0\xa4\xa8\xac\xb0\xb4\xb8\xbc\xc0\xc4\xc8\xcc\xd0\xd4\xd8\xdc\xe0\xe4\xe8\xec...
└ <function pdf_parse_union at 0x000001FE33EDD5A0>

File "E:\MinerU-master\magic_pdf\pdf_parse_union_core.py", line 225, in pdf_parse_union
page_info = parse_page_core(pdf_docs, magic_model, page_id, pdf_bytes_md5, imageWriter, parse_mode)
│ │ │ │ │ │ └ 'txt'
│ │ │ │ │ └ <magic_pdf.rw.DiskReaderWriter.DiskReaderWriter object at 0x000001FE03BA2E60>
│ │ │ │ └ '036C1D1F6867C983E74EEA67B33E09D6'
│ │ │ └ 0
│ │ └ <magic_pdf.model.magic_model.MagicModel object at 0x000001FE354A3970>
│ └ Document('', <memory, doc# 4>)
└ <function parse_page_core at 0x000001FE33EDD510>

File "E:\MinerU-master\magic_pdf\pdf_parse_union_core.py", line 83, in parse_page_core
img_blocks = magic_model.get_imgs(page_id)
│ │ └ 0
│ └ <function MagicModel.get_imgs at 0x000001FE316320E0>
└ <magic_pdf.model.magic_model.MagicModel object at 0x000001FE354A3970>

File "E:\MinerU-master\magic_pdf\model\magic_model.py", line 459, in get_imgs
records, _ = self.__tie_up_category_by_distance(page_no, 3, 4)
│ └ 0
└ <magic_pdf.model.magic_model.MagicModel object at 0x000001FE354A3970>

File "E:\MinerU-master\magic_pdf\model\magic_model.py", line 186, in __tie_up_category_by_distance
self.__model_list[page_no]["layout_dets"],
│ └ 0
└ <magic_pdf.model.magic_model.MagicModel object at 0x000001FE354A3970>

IndexError: list index out of range
2024-07-16 11:22:00.390 | WARNING | magic_pdf.user_api:parse_union_pdf:90 - parse_pdf_by_txt drop or error, switch to parse_pdf_by_ocr
2024-07-16 11:22:00.390 | ERROR | magic_pdf.model.doc_analyze_by_custom_model:custom_model_init:92 - use_inside_model is False, not allow to use inside model

Process finished with exit code 1

How to reproduce the bug | 如何复现

1

Operating system | 操作系统

Windows

Python version | Python 版本

3.10

Device mode | 设备模式

cuda
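
In the log above, self.model_list is displayed as [], and __tie_up_category_by_distance then indexes self.__model_list[page_no] with page_no 0, which raises the IndexError. The last log line reports "use_inside_model is False", so the built-in model never filled the list. A sketch of an early check that turns this into a clearer error, assuming the model list should hold one entry per page; check_model_list is a hypothetical helper, not part of magic_pdf.

```python
from typing import List


def check_model_list(model_list: List[dict], page_count: int) -> List[dict]:
    """Fail early, with a readable message, if model output is missing for any page."""
    if len(model_list) < page_count:
        raise ValueError(
            f"model_list has {len(model_list)} entries for {page_count} pages; "
            "run the layout model first or pass model output in"
        )
    return model_list


if __name__ == "__main__":
    try:
        check_model_list([], page_count=1)
    except ValueError as err:
        print(err)
```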
