
Baidu's open-source dependency parsing system

License: Apache License 2.0

Python 98.92% Shell 1.08%
dependency-parser chinese-nlp python chinese-dependency-parser dependency-parsing syntax-parser


DDParser


Introduction to Dependency Parsing

Dependency parsing is one of the core technologies of natural language processing. It determines the syntactic structure of a sentence by analyzing the dependency relations between its words, as in the example below:

(figure: example dependency structure)

As a foundational technology, dependency parsing can directly improve the performance of other NLP tasks, including but not limited to semantic role labeling, semantic matching, and event extraction, which gives it high research and application value. To share a state-of-the-art dependency parser with researchers and commercial partners, we have open-sourced this high-performance tool trained on large-scale annotated data, with one-command installation and prediction: a single command is all it takes to obtain dependency parsing results.

Project Overview

DDParser (Baidu Dependency Parser) is a dependency parsing tool developed by Baidu's NLP department on the deep learning platform PaddlePaddle with large-scale annotated data. Its training data covers multiple input forms, such as keyboard-typed and voice-input queries, and multiple domains, such as news and forums. The tool achieves excellent results on randomly sampled evaluation data, and it is simple to use: installation and prediction each take a single command.

Performance

Dataset UAS LAS
CTB5 90.31% 89.06%
DuCTB1.0 94.80% 92.88%
  • CTB5: Chinese Treebank 5.0 is a Chinese treebank released by the Linguistic Data Consortium (LDC) in 2005. It contains 18,782 sentences drawn mainly from news and magazine sources such as Xinhua News Agency.
  • DuCTB1.0: Baidu Chinese Treebank 1.0 is a Chinese treebank built by Baidu; it is the source of the training data for DDParser, the tool released here. See Data Source for details.

Note: Because the CTB5 dataset is small, its best (i.e., evaluated) model uses word-level representations, POS (part-of-speech) representations, and pretrained word embeddings; the DuCTB1.0 dataset is much larger, so its best model uses only word-level and char-level representations.

Quick Start

Requirements


One-command Installation

Users can install and predict with a single command (the package is published on PyPI as ddparser):
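pip install ddparser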

Usage

Unsegmented Input

  • Code example
>>> from ddparser import DDParser
>>> ddp = DDParser()
>>> # A single sentence
>>> ddp.parse("百度是一家高科技公司")
[{'word': ['百度', '是', '一家', '高科技', '公司'], 'head': [2, 0, 5, 5, 2], 'deprel': ['SBV', 'HED', 'ATT', 'ATT', 'VOB']}]
>>> # Multiple sentences
>>> ddp.parse(["百度是一家高科技公司", "他送了一本书"])
[{'word': ['百度', '是', '一家', '高科技', '公司'], 'head': [2, 0, 5, 5, 2], 'deprel': ['SBV', 'HED', 'ATT', 'ATT', 'VOB']}, 
{'word': ['他', '送', '了', '一本', '书'], 'head': [2, 0, 2, 5, 2], 'deprel': ['SBV', 'HED', 'MT', 'ATT', 'VOB']}]
>>> # Output probabilities and POS tags
>>> ddp = DDParser(prob=True, use_pos=True)
>>> ddp.parse(["百度是一家高科技公司"])
[{'word': ['百度', '是', '一家', '高科技', '公司'], 'postag': ['ORG', 'v', 'm', 'n', 'n'], 'head': [2, 0, 5, 5, 2], 'deprel': ['SBV', 'HED', 'ATT', 'ATT', 'VOB'], 'prob': [1.0, 1.0, 1.0, 1.0, 1.0]}]
>>> # buckets=True: faster when sentence lengths in the dataset vary widely
>>> ddp = DDParser(buckets=True)
>>> # Use the GPU
>>> ddp = DDParser(use_cuda=True)
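
In the returned dicts, each value in 'head' is the 1-based index of that word's governor, with 0 marking the sentence root (the HED word), as the outputs above show. A minimal sketch (relying only on the documented output fields above) that turns a parse result into (head word, relation, dependent word) triples:

from ddparser import DDParser

ddp = DDParser()
for res in ddp.parse(["百度是一家高科技公司"]):
    words = res['word']
    for i, (head, rel) in enumerate(zip(res['head'], res['deprel'])):
        # head == 0 means this word is the sentence root (HED)
        head_word = 'ROOT' if head == 0 else words[head - 1]
        print((head_word, rel, words[i]))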

Pre-segmented Input

  • Code example
>>> from ddparser import DDParser
>>> ddp = DDParser()
>>> ddp.parse_seg([['百度', '是', '一家', '高科技', '公司'], ['他', '送', '了', '一本', '书']])
[{'word': ['百度', '是', '一家', '高科技', '公司'], 'head': [2, 0, 5, 5, 2], 'deprel': ['SBV', 'HED', 'ATT', 'ATT', 'VOB']}, 
{'word': ['他', '送', '了', '一本', '书'], 'head': [2, 0, 2, 5, 2], 'deprel': ['SBV', 'HED', 'MT', 'ATT', 'VOB']}]
>>> # Output probabilities
>>> ddp = DDParser(prob=True)
>>> ddp.parse_seg([['百度', '是', '一家', '高科技', '公司']])
[{'word': ['百度', '是', '一家', '高科技', '公司'], 'head': [2, 0, 5, 5, 2], 'deprel': ['SBV', 'HED', 'ATT', 'ATT', 'VOB'], 'prob': [1.0, 1.0, 1.0, 1.0, 1.0]}]

Note: see the Dependency Relation Label Set below for the meaning of each label.

Advanced Usage

Download the Source

Users can download the source via git clone https://github.com/baidu/DDParser and install the dependencies with:

pip install --upgrade paddlepaddle-gpu
pip install --upgrade LAC

Model Download

We release a model trained on DuCTB1.0. Run cd ddparser && sh download_pretrained_model.sh to download it; the model is saved under ./ddparser/model_files/baidu.

Training

Users can train a model with sh run_train.sh. The full command is:

CUDA_VISIBLE_DEVICES=0 python -u run.py \
        --mode=train \
        --use_cuda \
        --feat=none \
        --preprocess \
        --model_files=model_files/baidu \
        --train_data_path=data/baidu/train.txt \
        --valid_data_path=data/baidu/dev.txt \
        --test_data_path=data/baidu/test.txt \
        --buckets=15

Note: Users can set train_data_path, valid_data_path and test_data_path to specify the training, validation, and test sets. See Parameters for the meaning of each argument and Data Format for the dataset format.

Evaluation

Users can download our evaluation set by running sh download_data.sh; it is saved under ./data/baidu/. The set contains 2,592 sentences with an average length of 11.27 characters.
Users can evaluate a model with sh run_evaluate.sh. The full command is:

CUDA_VISIBLE_DEVICES=0 python run.py \
                                --mode=evaluate \
                                --use_cuda \
                                --model_files=model_files/baidu \
                                --test_data_path=data/baidu/test.txt \
                                --buckets=15 \
                                --tree

Note: Users can set test_data_path to specify the evaluation set; see Data Format for the dataset format.

Prediction

From source, we provide two command-line prediction modes, one for pre-segmented data and one for unsegmented data.

Prediction on Pre-segmented Data

The input data must be organized in CoNLL-X format (official description), with missing fields replaced by "-". Run prediction with sh run_predict.sh. The full command is:

CUDA_VISIBLE_DEVICES=0 python run.py \
                                --mode=predict \
                                --use_cuda \
                                --model_files=model_files/baidu \
                                --infer_data_path=data/baidu/test.txt \
                                --infer_result_path=data/baidu/test.predict \
                                --buckets=15 \
                                --tree 

Note: Users can set infer_data_path and infer_result_path to specify the dataset to predict on and the path for the prediction results.

Prediction on Unsegmented Data

The input is plain strings, one sentence per line. Run sh run_predict_query.sh to predict on data read from standard input. The full command is:

CUDA_VISIBLE_DEVICES=0 python run.py \
                                --mode=predict_q \
                                --use_cuda \
                                --model_files=model_files/baidu \
                                --buckets=15 \
                                --tree

Note: LAC is called by default for word segmentation and POS tagging.
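
For example, one plausible invocation (a sketch, assuming run_predict_query.sh wraps the predict_q command above and reads sentences from stdin):

echo "百度是一家高科技公司" | sh run_predict_query.sh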

Parameters

mode: task mode (train, evaluate, predict, predict_q)
config_path: path to the hyperparameter file
model_files: path for saving model files
train_data_path: path to the training set file
valid_data_path: path to the validation set file
test_data_path: path to the test set file
infer_data_path: path to the file to be predicted
pretrained_embedding_dir: path to pretrained word embeddings
batch_size: batch size
log_path: path for the log
log_level: log level, default INFO ('DEBUG', 'INFO', 'WARNING', 'ERROR', 'FATAL')
infer_result_path: path for saving prediction results
use_cuda: if set, use the GPU
preprocess: training-mode flag; if set, word statistics and similar preprocessing are computed from the training data; if not set, previously computed statistics are reused (saving time). It can be omitted for repeated runs on the same training data.
seed: random seed (default: 1)
threads: number of threads per paddle instance
tree: ensure the output is a well-formed dependency tree
prob: if set, output the probability of each arc, saved in the PROB column of the results
feat: input features to use (none, char, pos; ernie-* models only support none)
buckets: maximum number of buckets (default: 15)
punct: whether to include punctuation when evaluating
encoding_model: underlying model, default ernie-lstm (lstm, transformer, ernie-1.0, ernie-tiny, ernie-lstm)

Data Format

The data format of this project follows the CoNLL-X style (official description), with missing fields replaced by "-" (users only need to pay attention to the ID, FORM, HEAD, DEPREL, and PROB columns). For example, "百度是一家高科技公司" can be parsed into the following format:

ID      FORM   LEMMA CPOSTAG POSTAG  FEATS   HEAD    DEPREL   PROB   PDEPREL
1       百度    百度    -       -       -       2       SBV     1.0     -
2       是      是      -       -       -       0       HED     1.0     -
3       一家    一家    -       -       -       5       ATT     1.0     -
4       高科技  高科技  -       -       -       5       ATT     1.0     -
5       公司    公司    -       -       -       2       VOB     1.0     -
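
A minimal sketch (an illustrative helper, not part of the official API) for loading this column format back into the same dict structure the Python API returns, assuming whitespace-separated columns in the order shown, no header row, and a blank line between sentences:

def read_conllx(path):
    """Read DDParser's CoNLL-X-style file into word/head/deprel dicts."""
    sents, words, heads, rels = [], [], [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            cols = line.split()
            if not cols:  # a blank line ends the current sentence
                if words:
                    sents.append({'word': words, 'head': heads, 'deprel': rels})
                    words, heads, rels = [], [], []
                continue
            words.append(cols[1])       # FORM
            heads.append(int(cols[6]))  # HEAD (1-based; 0 = root)
            rels.append(cols[7])        # DEPREL
    if words:
        sents.append({'word': words, 'head': heads, 'deprel': rels})
    return sents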

Datasets

Dependency Relation Label Set

The DuCTB1.0 dataset uses 14 dependency relation labels, described in the table below:

Label Relation Description Example
SBV subject-verb relation between a subject and its predicate 他送了一本书(他<--送)
VOB verb-object relation between an object and its predicate 他送了一本书(送-->书)
POB preposition-object relation between a preposition and its object 我把书卖了(把-->书)
ADV adverbial relation between an adverbial and its head 我昨天买书了(昨天<--买)
CMP complement relation between a complement and its head 我都吃完了(吃-->完)
ATT attribute relation between an attribute and its head 他送了一本书(一本<--书)
F localizer relation between a localizer and its head 在公园里玩耍(公园-->里)
COO coordination relation between words of the same type 叔叔阿姨(叔叔-->阿姨)
DBL pivotal construction a subject-predicate phrase serving as the object 他请我吃饭(请-->我,请-->吃饭)
DOB double-object construction two objects following the predicate 他送我一本书(送-->我,送-->书)
VV serial-verb construction relation between predicates sharing a subject 他外出吃饭(外出-->吃饭)
IC independent clause two structurally independent or loosely related clauses 你好,书店怎么走?(你好<--走)
MT particle relation between a function word and its head 他送了一本书(送-->了)
HED head the root of the whole sentence
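
Applied to the parser's output, these labels make simple pattern extraction straightforward. A small sketch (using only the documented output fields) that collects subject-predicate pairs via SBV arcs:

from ddparser import DDParser

ddp = DDParser()
res = ddp.parse("他送了一本书")[0]
words = res['word']
# for an SBV arc, the dependent is the subject and its head is the predicate
pairs = [(words[i], words[h - 1])
         for i, (h, r) in enumerate(zip(res['head'], res['deprel']))
         if r == 'SBV']
print(pairs)  # [('他', '送')]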

Data Source

DuCTB1.0: Baidu Chinese Treebank 1.0 is a Chinese dependency treebank built by Baidu, containing nearly one million sentences (the model released here was trained on roughly 530,000 of them). The corpus comes from search queries and web sentences, covering multiple input forms such as typed and voice input, and multiple domains such as news and forums.

File Structure

.
├── LICENSE
├── README.md
├── requirements.txt   # required modules and version constraints
├── ddparser           # DDParser core code: models, test data, run scripts, etc.

Roadmap

  • Distill the model to reduce its size

References

The method used in this project comes from the paper "Deep Biaffine Attention for Neural Dependency Parsing"; see yzhangcs/parser for the corresponding PyTorch implementation.

Citation

If you use DDParser in your academic work, please add the citation below. We are very glad that DDParser can help your research.

@misc{zhang2020practical,
    title={A Practical Chinese Dependency Parser Based on A Large-scale Dataset},
    author={Shuai Zhang and Lijie Wang and Ke Sun and Xinyan Xiao},
    year={2020},
    eprint={2009.00901},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Contributing

We welcome developers to contribute code to DDParser. If you have developed a new feature or found a bug, feel free to submit a PR.


ddparser's Issues

ddparser fails to import

from ddparser import DDParser
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/suixf/miniconda3/envs/nlp/lib/python3.6/site-packages/ddparser/__init__.py", line 24, in <module>
from .run import DDParser
File "/home/suixf/miniconda3/envs/nlp/lib/python3.6/site-packages/ddparser/run.py", line 26, in <module>
import LAC
File "/home/suixf/miniconda3/envs/nlp/lib/python3.6/site-packages/LAC/__init__.py", line 23, in <module>
from .lac import LAC
File "/home/suixf/miniconda3/envs/nlp/lib/python3.6/site-packages/LAC/lac.py", line 28, in <module>
import paddle.fluid as fluid
File "/home/suixf/miniconda3/envs/nlp/lib/python3.6/site-packages/paddle/__init__.py", line 37, in <module>
import paddle.complex
File "/home/suixf/miniconda3/envs/nlp/lib/python3.6/site-packages/paddle/complex/__init__.py", line 15, in <module>
from . import tensor
File "/home/suixf/miniconda3/envs/nlp/lib/python3.6/site-packages/paddle/complex/tensor/__init__.py", line 15, in <module>
from . import math
File "/home/suixf/miniconda3/envs/nlp/lib/python3.6/site-packages/paddle/complex/tensor/math.py", line 15, in <module>
from paddle.common_ops_import import *
File "/home/suixf/miniconda3/envs/nlp/lib/python3.6/site-packages/paddle/common_ops_import.py", line 15, in <module>
from paddle.fluid.layer_helper import LayerHelper
File "/home/suixf/miniconda3/envs/nlp/lib/python3.6/site-packages/paddle/fluid/__init__.py", line 56, in <module>
from . import contrib
File "/home/suixf/miniconda3/envs/nlp/lib/python3.6/site-packages/paddle/fluid/contrib/__init__.py", line 27, in <module>
from . import slim
File "/home/suixf/miniconda3/envs/nlp/lib/python3.6/site-packages/paddle/fluid/contrib/slim/__init__.py", line 15, in <module>
from .core import *
File "/home/suixf/miniconda3/envs/nlp/lib/python3.6/site-packages/paddle/fluid/contrib/slim/core/__init__.py", line 15, in <module>
from . import config
File "/home/suixf/miniconda3/envs/nlp/lib/python3.6/site-packages/paddle/fluid/contrib/slim/core/config.py", line 19, in <module>
from ..prune import *
File "/home/suixf/miniconda3/envs/nlp/lib/python3.6/site-packages/paddle/fluid/contrib/slim/prune/__init__.py", line 17, in <module>
from . import prune_strategy
File "/home/suixf/miniconda3/envs/nlp/lib/python3.6/site-packages/paddle/fluid/contrib/slim/prune/prune_strategy.py", line 20, in <module>
import prettytable as pt
File "/home/suixf/miniconda3/envs/nlp/lib/python3.6/site-packages/prettytable/__init__.py", line 48, in <module>
__version__ = importlib_metadata.version(__name__)
File "/home/suixf/miniconda3/envs/nlp/lib/python3.6/site-packages/importlib_metadata/__init__.py", line 869, in version
return distribution(distribution_name).version
File "/home/suixf/miniconda3/envs/nlp/lib/python3.6/site-packages/importlib_metadata/__init__.py", line 513, in version
return self.metadata['Version']
File "/home/suixf/miniconda3/envs/nlp/lib/python3.6/site-packages/importlib_metadata/__init__.py", line 496, in metadata
self.read_text('METADATA')
File "/home/suixf/miniconda3/envs/nlp/lib/python3.6/site-packages/importlib_metadata/__init__.py", line 828, in read_text
return self._path.joinpath(filename).read_text(encoding='utf-8')
AttributeError: 'PosixPath' object has no attribute 'read_text'

Parsing of 把-sentences seems incorrect

(screenshot)
The POB arc 把->书 is correct.
The POB arc 卖->把 seems incorrect: this is not a preposition-object pair, so parsing it as POB should be wrong.

Other 把-sentences show similar problems.
(screenshot)

Poor results on complex sentences

**石油网消息(记者储宝 杨碧泓)在抗击新冠肺炎疫情的关键时刻,3月12日,集团公司党组书记、董事长、集团公司新冠肺炎疫情防控工作领导小组组长戴厚良与中东地区疫情防控领导小组通电话,了解当地疫情防控进展情况,关心慰问奋战在海外抗疫一线员工,嘱咐大家要坚持科学防控,细化落实防控措施,特别要注重加强自身防护,为疫情控制做出实实在在的贡献

1 **石油网 ATT 2 nz
2 消息 HED 0 n
3 ( MT 7 w
4 记者 ATT 5 n
5 储宝 SBV 6 PER
6 ATT 7 w
7 杨碧泓 COO 2 PER
8 ) MT 7 w
9 在 ATT 15 p
10 抗击 ATT 15 v
11 新冠 ATT 13 n
12 肺炎 ATT 13 n
13 疫情 VOB 10 n
14 的 MT 10 u
15 关键时刻 ADV 17 n
16 , MT 15 w
17 3月12日 IC 2 TIME
18 , MT 17 w
19 集团公司 ATT 20 n
20 党组书记 ATT 31 job
21 、 MT 20 w
22 董事长 COO 20 job
23 、 MT 22 w
24 集团公司 COO 20 n
25 新冠 ATT 29 n
26 肺炎 ATT 27 n
27 疫情 ATT 28 n
28 防控 ATT 29 vn
29 工作 ATT 30 n
30 领导小组 ATT 31 n
31 组长 SBV 32 n
32 戴厚良 IC 17 PER
33 与 MT 39 p
34 中东 ATT 35 LOC
35 地区 ATT 38 n
36 疫情 ATT 37 n
37 防控 ATT 38 vn
38 领导小组 ATT 39 n
39 通电话 COO 32 v
40 , MT 39 w
41 了解 COO 32 v
42 当地 ATT 46 s
43 疫情 ATT 44 n
44 防控 ATT 45 vn
45 进展 ATT 46 vn
46 情况 VOB 41 n
47 , MT 41 w
48 关心 SBV 57 v
49 慰问 VOB 48 v
50 奋战 COO 48 v
51 在 ADV 50 p
52 海外 ATT 55 s
53 抗疫 ATT 55 vn
54 一线 ATT 55 n
55 员工 POB 51 n
56 , MT 48 w
57 嘱咐 COO 41 v
58 大家 DBL 57 r
59 要 ADV 60 v
60 坚持 DBL 57 v
61 科学 ATT 62 ad
62 防控 VOB 60 v
63 , MT 60 w
64 细化 IC 71 v
65 落实 ATT 67 v
66 防控 ATT 67 vn
67 措施 VOB 64 n
68 , MT 64 w
69 特别 ADV 71 d
70 要 ADV 71 v
71 注重 IC 57 v
72 加强 VOB 71 v
73 自身 ATT 74 r
74 防护 VOB 72 vn
75 , MT 71 w
76 为 ADV 79 p
77 疫情 ATT 78 n
78 控制 POB 76 vn
79 做出 COO 71 v
80 实实在在 ATT 82 a
81 的 MT 80 u
82 贡献 VOB 79 n

In cases like this, the HED and SBV arcs are not extracted accurately.

bad case @ 8.30

Sentence 1: looks the most off; the HED is "两队"
(screenshot: 2020-08-30 18:40:12)

Sentence 2: "甲" and "乙" are neither coordinated nor jointly serving as attributes of "两队"
(screenshot: 2020-08-30 18:35:44)

Sentence 3: the result is acceptable
(screenshot: 2020-08-30 18:43:04)

Sentence 4: the result is also acceptable
(screenshot: 2020-08-30 18:45:38)

A more reasonable result would look like the following:
(screenshot: 2020-08-30 18:52:42)

bad case

Sentence 1: "数月后" should apparently modify "调查"
(screenshot: 2020-08-19 15:50:36)

Sentence 2: after replacing "闻名" with "出名", "数月后" becomes the object of "调查"
(screenshot: 2020-08-19 15:51:49)

Sentence 3: the situation is still similar to sentence 1
(screenshot: 2020-08-19 15:59:55)

Is there documentation explaining the POS tags?

Hi, my project needs POS tags, and ddparser does output them, but I couldn't find the specific parts of speech these tags correspond to. Is there documentation I can read? (The paper doesn't cover this.) Many thanks!
P.S. Annotation guidelines or references would be even better (e.g., how ambiguous-category words are handled, how function words are tagged, etc.)!

How can DDParser be used in a NOAVX environment?

The server is an ESXi cluster; because of the EVC feature the CPUs are downgraded and do not support AVX instructions. I compiled a noavx paddlepaddle, but ddparser directly uses the core_avx module:
from paddle.fluid.core_avx import VarDesc
How can ddparser be run together with a noavx paddlepaddle?

ddparser 1.0.5 only supports the ernie-lstm model; passing anything else errors out

For example,
ddp = DDParser(encoding_model='transformer')
raises
File "/usr/local/lib/python3.8/site-packages/ddparser/parser/data_struct/utils.py", line 295, in download_model_from_url
    download_model_path = DOWNLOAD_MODEL_PATH_DICT[model]
KeyError: 'transformer'

The cause should be that DOWNLOAD_MODEL_PATH_DICT contains only a single model:
DOWNLOAD_MODEL_PATH_DICT = { 'ernie-lstm': "https://ddparser.bj.bcebos.com/DDParser-ernie-lstm-1.0.3.tar.gz", }
and download_model_from_url never checks model in DOWNLOAD_MODEL_PATH_DICT before
download_model_path = DOWNLOAD_MODEL_PATH_DICT[model]

Are the other models mentioned in the documentation still supported?
lstm, transformer, ernie-1.0, ernie-tiny

A question about OOM

When training on my own data, longer sentences easily trigger OOM, so I had to cap sentence length at 15 in the code (11 GB of GPU memory) to train normally. The default batch_size is 2048; changing this number does not seem to affect the actual memory usage. If I don't want to limit sentence length, which parameters or code should I change to avoid OOM?

Identical sentence patterns, two different results

(screenshot: 2020-09-06 18:17:48)
(screenshot: 2020-09-06 18:20:03)
(screenshot: 2020-09-06 18:18:22)
(screenshot: 2020-09-06 18:20:34)

Putting 水果, 苹果, 梨子, 凤梨 in the same position yields two different results.
The "fronted object" reading seems more grammatical; how can I train the model myself to prefer it?

AttributeError: 'DDParser' object has no attribute 'lac'

My environment is python3.7, LAC 2.0.4, ddparser 0.1.1

Running:
from ddparser import DDParser
ddp = DDParser()
ddp.parse("百度是一家高科技公司")

raises the following error:

AttributeError Traceback (most recent call last)
<ipython-input> in <module>
4 from ddparser import DDParser
5 ddp = DDParser()
----> 6 ddp.parse("百度是一家高科技公司")

~/tfpy3/lib/python3.7/site-packages/ddparser/run.py in parse(self, inputs)
336 'head': [2, 0, 5, 5, 2], 'deprel': ['SBV', 'HED', 'ATT', 'ATT', 'VOB'], 'prob': [1.0, 1.0, 1.0, 1.0, 1.0]}]
337 """
--> 338 if not self.lac:
339 self.lac = LAC.LAC(mode='lac' if self.use_pos else "seg",
340 use_cuda=self.args.use_cuda)

AttributeError: 'DDParser' object has no attribute 'lac'

Keeps printing "loading the fields." during use

[2021-04-22 20:55:08,953][root][INFO] - loading the fields.
[2021-04-22 20:55:16,132][root][INFO] - loading the fields.
[2021-04-22 20:55:24,069][root][INFO] - loading the fields.
[2021-04-22 20:55:32,051][root][INFO] - loading the fields.
[2021-04-22 20:55:41,988][root][INFO] - loading the fields.
[2021-04-22 20:55:59,328][root][INFO] - loading the fields.
[2021-04-22 20:56:09,152][root][INFO] - loading the fields.


def _add_Parser_seq(data: List[Dict], cfg) -> None:
    for d in data:
        """
        Run dependency parsing with Baidu DDParser
        """
        ddp = DDParser()
        dictParse = ddp.parse(d['sentence'])
        d['dependency'] = dictParse[0]['head']

Through debugging I found that this is the code producing the messages. Where does this "loading the fields." log come from?

ddparser does not support paddlepaddle 2.0

Currently, installing ddparser with pip uninstalls paddlepaddle 2.0 from the environment and installs 1.8.5 instead. paddlepaddle 2.0.0 is now a stable release; please make ddparser support paddlepaddle 2.0.0.

ddparser v1.0.5 errors when segs contain spaces

For example,
ddp.parse_seg([['百度', ' ', '是', '一家', '高科技', '公司']])

ddp.parse("百度 是一家高科技公司")

The error is:
list index out of range

v0.1.2 does not have this problem.

Support for importing a user dictionary

Hi, thanks for open-sourcing this tool.
The dependency LAC supports adding user-defined dictionaries. Could DDParser expose a parameter or method for this? Having to import LAC separately just for that feels redundant.

I'd also like to ask how DDParser compares with HIT's LTP for dependency parsing.

bad case

Sentence 1: the head word is a PER
(screenshot: 2020-08-12 19:16:38)

Sentence 2: one SBV points to another SBV
(screenshot: 2020-08-12 19:29:51)

Sentence 3: also looks quite wrong
(screenshot: 2020-08-13 17:11:00)

Sentence 4: after manual correction, the correct dependencies should presumably be as follows
(screenshot: 2020-08-12 19:29:27)

requests.exceptions.SSLError: HTTPSConnectionPool(host='ddparser.bj.bcebos.com', port=443): Max retries exceeded with url: /DDParser-ernie-lstm-1.0.6.tar.gz (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:852)'),))

Hi, I ran into the following error while using ddparser. How should I resolve it? Thanks.

C:\Users\dell\AppData\Local\conda\conda\envs\frw\python.exe C:/Users/dell/Desktop/data(1)/DecompRC-master/baidu_parser.py
ERROR:root:Failed to download model, please try again
ERROR:root:error: HTTPSConnectionPool(host='ddparser.bj.bcebos.com', port=443): Max retries exceeded with url: /DDParser-ernie-lstm-1.0.6.tar.gz (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:852)'),))
Traceback (most recent call last):
File "C:\Users\dell\AppData\Local\conda\conda\envs\frw\lib\site-packages\urllib3\connectionpool.py", line 696, in urlopen
self._prepare_proxy(conn)
File "C:\Users\dell\AppData\Local\conda\conda\envs\frw\lib\site-packages\urllib3\connectionpool.py", line 964, in _prepare_proxy
conn.connect()
File "C:\Users\dell\AppData\Local\conda\conda\envs\frw\lib\site-packages\urllib3\connection.py", line 359, in connect
conn = self._connect_tls_proxy(hostname, conn)
File "C:\Users\dell\AppData\Local\conda\conda\envs\frw\lib\site-packages\urllib3\connection.py", line 502, in connect_tls_proxy
ssl_context=ssl_context,
File "C:\Users\dell\AppData\Local\conda\conda\envs\frw\lib\site-packages\urllib3\util\ssl
.py", line 432, in ssl_wrap_socket
ssl_sock = ssl_wrap_socket_impl(sock, context, tls_in_tls)
File "C:\Users\dell\AppData\Local\conda\conda\envs\frw\lib\site-packages\urllib3\util\ssl
.py", line 474, in _ssl_wrap_socket_impl
return ssl_context.wrap_socket(sock)
File "C:\Users\dell\AppData\Local\conda\conda\envs\frw\lib\ssl.py", line 407, in wrap_socket
_context=self, _session=session)
File "C:\Users\dell\AppData\Local\conda\conda\envs\frw\lib\ssl.py", line 817, in init
self.do_handshake()
File "C:\Users\dell\AppData\Local\conda\conda\envs\frw\lib\ssl.py", line 1077, in do_handshake
self._sslobj.do_handshake()
File "C:\Users\dell\AppData\Local\conda\conda\envs\frw\lib\ssl.py", line 689, in do_handshake
self._sslobj.do_handshake()
ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:852)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users\dell\AppData\Local\conda\conda\envs\frw\lib\site-packages\requests\adapters.py", line 449, in send
timeout=timeout
File "C:\Users\dell\AppData\Local\conda\conda\envs\frw\lib\site-packages\urllib3\connectionpool.py", line 756, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "C:\Users\dell\AppData\Local\conda\conda\envs\frw\lib\site-packages\urllib3\util\retry.py", line 573, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='ddparser.bj.bcebos.com', port=443): Max retries exceeded with url: /DDParser-ernie-lstm-1.0.6.tar.gz (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:852)'),))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:/Users/dell/Desktop/data(1)/DecompRC-master/baidu_parser.py", line 2, in
ddp=DDParser()
File "C:\Users\dell\AppData\Local\conda\conda\envs\frw\lib\site-packages\ddparser\run.py", line 304, in init
raise e
File "C:\Users\dell\AppData\Local\conda\conda\envs\frw\lib\site-packages\ddparser\run.py", line 300, in init
utils.download_model_from_url(model_files_path, encoding_model)
File "C:\Users\dell\AppData\Local\conda\conda\envs\frw\lib\site-packages\ddparser\parser\data_struct\utils.py", line 305, in download_model_from_url
r = requests.get(download_model_path, stream=True)
File "C:\Users\dell\AppData\Local\conda\conda\envs\frw\lib\site-packages\requests\api.py", line 76, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\dell\AppData\Local\conda\conda\envs\frw\lib\site-packages\requests\api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\dell\AppData\Local\conda\conda\envs\frw\lib\site-packages\requests\sessions.py", line 542, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\dell\AppData\Local\conda\conda\envs\frw\lib\site-packages\requests\sessions.py", line 655, in send
r = adapter.send(request, **kwargs)
File "C:\Users\dell\AppData\Local\conda\conda\envs\frw\lib\site-packages\requests\adapters.py", line 514, in send
raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='ddparser.bj.bcebos.com', port=443): Max retries exceeded with url: /DDParser-ernie-lstm-1.0.6.tar.gz (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:852)'),))

Process finished with exit code 1

ddparser currently has a bug and cannot support newer versions of ernie/paddle

In ddparser\ernie\__init__.py there is a version check; the original code is:
paddle_version = [int(i) for i in paddle.__version__.split('.')]
if paddle_version[1] < 7:
    raise RuntimeError('paddle-ernie requires paddle 1.7+, got %s' % paddle.__version__)

With this logic, once paddle is updated to 2.0.0+, the check paddle_version[1] < 7 fires incorrectly, which is unreasonable. In addition, paddle_version = [int(i) for i in paddle.__version__.split('.')] can make int(i) fail when the version string contains letters. I therefore changed the source to:
paddle_version = [i for i in paddle.__version__.split('.')]
if 10 * int(paddle_version[0]) + int(paddle_version[1]) < 17:
    raise RuntimeError('paddle-ernie requires paddle 1.7+, got %s' %
                       paddle.__version__)

After saving, from ddparser import DDParser runs without errors; problem solved.

Dependency tree visualization

Hi, ddparser requires a paddlepaddle version below 2.0, but paddlehub requires a version above 2.0. If I want to render the results as a tree like the one in the paper, is there another way? Thanks!

Error importing ddparser

/Users/laiwenbo/anaconda3/envs/testddp/bin/python /Users/laiwenbo/it/docker/info_extract/parse_structure_ddparser.py
Traceback (most recent call last):
File "/Users/laiwenbo/it/docker/info_extract/parse_structure_ddparser.py", line 60, in <module>
structure_info = parse_structure_from_text(text, grain='big')
File "/Users/laiwenbo/it/docker/info_extract/parse_structure_ddparser.py", line 19, in parse_structure_from_text
from ddparser import DDParser
File "/Users/laiwenbo/anaconda3/envs/testddp/lib/python3.6/site-packages/ddparser/__init__.py", line 24, in <module>
from .run import DDParser
File "/Users/laiwenbo/anaconda3/envs/testddp/lib/python3.6/site-packages/ddparser/run.py", line 26, in <module>
import LAC
File "/Users/laiwenbo/anaconda3/envs/testddp/lib/python3.6/site-packages/LAC/__init__.py", line 23, in <module>
from .lac import LAC
File "/Users/laiwenbo/anaconda3/envs/testddp/lib/python3.6/site-packages/LAC/lac.py", line 28, in <module>
import paddle.fluid as fluid
File "/Users/laiwenbo/anaconda3/envs/testddp/lib/python3.6/site-packages/paddle/__init__.py", line 37, in <module>
import paddle.complex
File "/Users/laiwenbo/anaconda3/envs/testddp/lib/python3.6/site-packages/paddle/complex/__init__.py", line 15, in <module>
from . import tensor
File "/Users/laiwenbo/anaconda3/envs/testddp/lib/python3.6/site-packages/paddle/complex/tensor/__init__.py", line 15, in <module>
from . import math
File "/Users/laiwenbo/anaconda3/envs/testddp/lib/python3.6/site-packages/paddle/complex/tensor/math.py", line 15, in <module>
from paddle.common_ops_import import *
File "/Users/laiwenbo/anaconda3/envs/testddp/lib/python3.6/site-packages/paddle/common_ops_import.py", line 15, in <module>
from paddle.fluid.layer_helper import LayerHelper
File "/Users/laiwenbo/anaconda3/envs/testddp/lib/python3.6/site-packages/paddle/fluid/__init__.py", line 56, in <module>
from . import contrib
File "/Users/laiwenbo/anaconda3/envs/testddp/lib/python3.6/site-packages/paddle/fluid/contrib/__init__.py", line 27, in <module>
from . import slim
File "/Users/laiwenbo/anaconda3/envs/testddp/lib/python3.6/site-packages/paddle/fluid/contrib/slim/__init__.py", line 15, in <module>
from .core import *
File "/Users/laiwenbo/anaconda3/envs/testddp/lib/python3.6/site-packages/paddle/fluid/contrib/slim/core/__init__.py", line 15, in <module>
from . import config
File "/Users/laiwenbo/anaconda3/envs/testddp/lib/python3.6/site-packages/paddle/fluid/contrib/slim/core/config.py", line 19, in <module>
from ..prune import *
File "/Users/laiwenbo/anaconda3/envs/testddp/lib/python3.6/site-packages/paddle/fluid/contrib/slim/prune/__init__.py", line 17, in <module>
from . import prune_strategy
File "/Users/laiwenbo/anaconda3/envs/testddp/lib/python3.6/site-packages/paddle/fluid/contrib/slim/prune/prune_strategy.py", line 20, in <module>
import prettytable as pt
File "/Users/laiwenbo/anaconda3/envs/testddp/lib/python3.6/site-packages/prettytable/__init__.py", line 48, in <module>
__version__ = importlib_metadata.version(__name__)
File "/Users/laiwenbo/anaconda3/envs/testddp/lib/python3.6/site-packages/importlib_metadata/__init__.py", line 861, in version
return distribution(distribution_name).version
File "/Users/laiwenbo/anaconda3/envs/testddp/lib/python3.6/site-packages/importlib_metadata/__init__.py", line 523, in version
return self.metadata['Version']
File "/Users/laiwenbo/anaconda3/envs/testddp/lib/python3.6/site-packages/importlib_metadata/__init__.py", line 506, in metadata
self.read_text('METADATA')
File "/Users/laiwenbo/anaconda3/envs/testddp/lib/python3.6/site-packages/importlib_metadata/__init__.py", line 820, in read_text
return self._path.joinpath(filename).read_text(encoding='utf-8')
AttributeError: 'PosixPath' object has no attribute 'read_text'

Process finished with exit code 1

I have checked the versions:
ddparser 0.1.2
paddlepaddle 1.8.5
LAC 2.1.1
They match the requirements. What could the problem be? I have tried all sorts of things.

debug doesn't work

Debugging only works once; after that it simply crashes and debugging is impossible.

Can it analyze novels?

Can it analyze Chinese novels?
For example: identifying protagonists, extracting character dialogue, sentiment analysis, scene recognition, and so on.

Data format when training my own model

(screenshot)
As shown, I got an error when running sh run_train.sh; I printed puncts.shape and it is (1, 1, 0)
(screenshot)
Where is the problem?
This is the punctuation row of my data
(screenshot)

Displaying the structured representation

Why do many of the triples have None as the subject, e.g. ((None, '用于', '描述客观世界中概念'), 'SVO')?

Dependency parsing via a fastapi server API hangs

I built a simple HTTP server with fastapi. When the client sends a sentence for parsing, the server hangs at
r = ddp.parse(sentence). Is this usage mode unsupported? Is there a workaround?

AttributeError: 'PosixPath' object has no attribute 'read_text'

sudo pip install ddparser
After the installation succeeds, the following error appears:

from ddparser import DDParser
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/xxxxx/anaconda3/lib/python3.7/site-packages/ddparser/__init__.py", line 24, in <module>
from .run import DDParser
File "/Users/xxxxx/anaconda3/lib/python3.7/site-packages/ddparser/run.py", line 26, in <module>
import LAC
File "/Users/xxxxx/anaconda3/lib/python3.7/site-packages/LAC/__init__.py", line 23, in <module>
from .lac import LAC
File "/Users/xxxxx/anaconda3/lib/python3.7/site-packages/LAC/lac.py", line 28, in <module>
import paddle.fluid as fluid
File "/Users/xxxxx/anaconda3/lib/python3.7/site-packages/paddle/__init__.py", line 31, in <module>
import paddle.dataset
File "/Users/xxxxx/anaconda3/lib/python3.7/site-packages/paddle/dataset/__init__.py", line 25, in <module>
import paddle.dataset.sentiment
File "/Users/xxxxx/anaconda3/lib/python3.7/site-packages/paddle/dataset/sentiment.py", line 30, in <module>
import nltk
File "/Users/xxxxx/anaconda3/lib/python3.7/site-packages/nltk/__init__.py", line 143, in <module>
from nltk.chunk import *
File "/Users/xxxxx/anaconda3/lib/python3.7/site-packages/nltk/chunk/__init__.py", line 157, in <module>
from nltk.chunk.api import ChunkParserI
File "/Users/xxxxx/anaconda3/lib/python3.7/site-packages/nltk/chunk/api.py", line 13, in <module>
from nltk.parse import ParserI
File "/Users/xxxxx/anaconda3/lib/python3.7/site-packages/nltk/parse/__init__.py", line 100, in <module>
from nltk.parse.transitionparser import TransitionParser
File "/Users/xxxxx/anaconda3/lib/python3.7/site-packages/nltk/parse/transitionparser.py", line 22, in <module>
from sklearn.datasets import load_svmlight_file
File "/Users/xxxxx/anaconda3/lib/python3.7/site-packages/sklearn/datasets/__init__.py", line 22, in <module>
from .twenty_newsgroups import fetch_20newsgroups
File "/Users/xxxxx/anaconda3/lib/python3.7/site-packages/sklearn/datasets/twenty_newsgroups.py", line 44, in <module>
from ..feature_extraction.text import CountVectorizer
File "/Users/xxxxx/anaconda3/lib/python3.7/site-packages/sklearn/feature_extraction/__init__.py", line 10, in <module>
from . import text
File "/Users/xxxxx/anaconda3/lib/python3.7/site-packages/sklearn/feature_extraction/text.py", line 28, in <module>
from ..preprocessing import normalize
File "/Users/xxxxx/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/__init__.py", line 6, in <module>
from ._function_transformer import FunctionTransformer
File "/Users/xxxxx/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/_function_transformer.py", line 5, in <module>
from ..utils.testing import assert_allclose_dense_sparse
File "/Users/xxxxx/anaconda3/lib/python3.7/site-packages/sklearn/utils/testing.py", line 718, in <module>
import pytest
File "/Users/xxxxx/anaconda3/lib/python3.7/site-packages/pytest.py", line 6, in <module>
from _pytest.assertion import register_assert_rewrite
File "/Users/xxxxx/anaconda3/lib/python3.7/site-packages/_pytest/assertion/__init__.py", line 6, in <module>
from _pytest.assertion import rewrite
File "/Users/xxxxx/anaconda3/lib/python3.7/site-packages/_pytest/assertion/rewrite.py", line 20, in <module>
from _pytest.assertion import util
File "/Users/xxxxx/anaconda3/lib/python3.7/site-packages/_pytest/assertion/util.py", line 5, in <module>
import _pytest._code
File "/Users/xxxxx/anaconda3/lib/python3.7/site-packages/_pytest/_code/__init__.py", line 2, in <module>
from .code import Code # noqa
File "/Users/xxxxx/anaconda3/lib/python3.7/site-packages/_pytest/_code/code.py", line 11, in <module>
import pluggy
File "/Users/xxxxx/anaconda3/lib/python3.7/site-packages/pluggy/__init__.py", line 16, in <module>
from .manager import PluginManager, PluginValidationError
File "/Users/xxxxx/anaconda3/lib/python3.7/site-packages/pluggy/manager.py", line 6, in <module>
import importlib_metadata
File "/Users/xxxxx/anaconda3/lib/python3.7/site-packages/importlib_metadata/__init__.py", line 547, in <module>
__version__ = version(__name__)
File "/Users/xxxxx/anaconda3/lib/python3.7/site-packages/importlib_metadata/__init__.py", line 509, in version
return distribution(distribution_name).version
File "/Users/xxxxx/anaconda3/lib/python3.7/site-packages/importlib_metadata/__init__.py", line 260, in version
return self.metadata['Version']
File "/Users/xxxxx/anaconda3/lib/python3.7/site-packages/importlib_metadata/__init__.py", line 248, in metadata
self.read_text('METADATA')
File "/Users/xxxxx/anaconda3/lib/python3.7/site-packages/importlib_metadata/__init__.py", line 469, in read_text
return self._path.joinpath(filename).read_text(encoding='utf-8')
AttributeError: 'PosixPath' object has no attribute 'read_text'
