paddlepaddle / paddlenlp Goto Github PK

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.

Home Page: https://paddlenlp.readthedocs.io

Python 83.88% Shell 3.82% CMake 0.41% C++ 7.73% Cuda 3.46% Makefile 0.02% C 0.02% Dockerfile 0.01% Batchfile 0.01% Jupyter Notebook 0.46% Java 0.18%

nlp embedding bert ernie paddlenlp pretrained-models transformers information-extraction question-answering search-engine

paddlenlp's Introduction

简体中文🀄 | English🌎

安装 | 快速开始 | 特性 | 社区交流

PaddleNLP是一款简单易用且功能强大的自然语言处理和大语言模型(LLM)开发库。聚合业界优质预训练模型并提供开箱即用的开发体验，覆盖NLP多场景的模型库搭配产业实践范例可满足开发者灵活定制的需求。

News 📢

2024.01.04 PaddleNLP v2.7：大模型体验全面升级，统一工具链大模型入口。统一预训练、精调、压缩、推理以及部署等环节的实现代码，到 PaddleNLP/llm目录。全新大模型工具链文档，一站式指引用户从大模型入门到业务部署上线。全断点存储机制 Unified Checkpoint，大大提高大模型存储的通用性。高效微调升级，支持了高效微调+LoRA同时使用，支持了QLoRA等算法。
2023.08.15 PaddleNLP v2.6：发布全流程大模型工具链，涵盖预训练，精调，压缩，推理以及部署等各个环节，为用户提供端到端的大模型方案和一站式的开发体验；内置4D并行分布式Trainer，高效微调算法LoRA/Prefix Tuning, 自研INT8/INT4量化算法等等；全面支持LLaMA 1/2, BLOOM, ChatGLM 1/2, GLM, OPT等主流大模型

安装

环境依赖

python >= 3.7
paddlepaddle >= 2.6.0
如需大模型功能，请使用 paddlepaddle-gpu >= 2.6.0

pip安装

pip install --upgrade paddlenlp

或者可通过以下命令安装最新 develop 分支代码：

pip install --pre --upgrade paddlenlp -f https://www.paddlepaddle.org.cn/whl/paddlenlp.html

更多关于PaddlePaddle和PaddleNLP安装的详细教程请查看Installation。

快速开始

大模型文本生成

PaddleNLP提供了方便易用的Auto API，能够快速的加载模型和Tokenizer。这里以使用 linly-ai/chinese-llama-2-7b 大模型做文本生成为例：

>>> from paddlenlp.transformers import AutoTokenizer, AutoModelForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained("linly-ai/chinese-llama-2-7b")
>>> model = AutoModelForCausalLM.from_pretrained("linly-ai/chinese-llama-2-7b", dtype="float16")
>>> input_features = tokenizer("你好！请自我介绍一下。", return_tensors="pd")
>>> outputs = model.generate(**input_features, max_length=128)
>>> tokenizer.batch_decode(outputs[0])
['\n你好！我是一个AI语言模型，可以回答你的问题和提供帮助。']

一键UIE预测

PaddleNLP提供一键预测功能，无需训练，直接输入数据即可开放域抽取结果。这里以信息抽取-命名实体识别任务，UIE模型为例：

>>> from pprint import pprint
>>> from paddlenlp import Taskflow

>>> schema = ['时间', '选手', '赛事名称'] # Define the schema for entity extraction
>>> ie = Taskflow('information_extraction', schema=schema)
>>> pprint(ie("2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中**选手谷爱凌以188.25分获得金牌！"))
[{'时间': [{'end': 6,
          'probability': 0.9857378532924486,
          'start': 0,
          'text': '2月8日上午'}],
  '赛事名称': [{'end': 23,
            'probability': 0.8503089953268272,
            'start': 6,
            'text': '北京冬奥会自由式滑雪女子大跳台决赛'}],
  '选手': [{'end': 31,
          'probability': 0.8981548639781138,
          'start': 28,
          'text': '谷爱凌'}]}]

更多PaddleNLP内容可参考：

大模型全流程工具链，包含主流中文大模型的全流程方案。
精选模型库，包含优质预训练模型的端到端全流程使用。
多场景示例，了解如何使用PaddleNLP解决NLP多种技术问题，包含基础技术、系统应用与拓展应用。
交互式教程，在🆓免费算力平台AI Studio上快速学习PaddleNLP。

特性

📦 开箱即用的NLP工具集

🤗 丰富完备的中文模型库

🎛️ 产业级端到端系统范例

🚀 高性能分布式训练与推理

开箱即用的NLP工具集

Taskflow提供丰富的📦开箱即用的产业级NLP预置模型，覆盖自然语言理解与生成两大场景，提供💪产业级的效果与⚡️极致的推理性能。

更多使用方法可参考Taskflow文档。

丰富完备的中文模型库

🀄 业界最全的中文预训练模型

精选 45+ 个网络结构和 500+ 个预训练模型参数，涵盖业界最全的中文预训练模型：既包括文心NLP大模型的ERNIE、PLATO等，也覆盖BERT、GPT、RoBERTa、T5等主流结构。通过AutoModel API一键⚡高速下载⚡。

from paddlenlp.transformers import *

ernie = AutoModel.from_pretrained('ernie-3.0-medium-zh')
bert = AutoModel.from_pretrained('bert-wwm-chinese')
albert = AutoModel.from_pretrained('albert-chinese-tiny')
roberta = AutoModel.from_pretrained('roberta-wwm-ext')
electra = AutoModel.from_pretrained('chinese-electra-small')
gpt = AutoModelForPretraining.from_pretrained('gpt-cpm-large-cn')

针对预训练模型计算瓶颈，可以使用API一键使用文心ERNIE-Tiny全系列轻量化模型，降低预训练模型部署难度。

# 6L768H
ernie = AutoModel.from_pretrained('ernie-3.0-medium-zh')
# 6L384H
ernie = AutoModel.from_pretrained('ernie-3.0-mini-zh')
# 4L384H
ernie = AutoModel.from_pretrained('ernie-3.0-micro-zh')
# 4L312H
ernie = AutoModel.from_pretrained('ernie-3.0-nano-zh')

对预训练模型应用范式如语义表示、文本分类、句对匹配、序列标注、问答等，提供统一的API体验。

import paddle
from paddlenlp.transformers import *

tokenizer = AutoTokenizer.from_pretrained('ernie-3.0-medium-zh')
text = tokenizer('自然语言处理')

# 语义表示
model = AutoModel.from_pretrained('ernie-3.0-medium-zh')
sequence_output, pooled_output = model(input_ids=paddle.to_tensor([text['input_ids']]))
# 文本分类 & 句对匹配
model = AutoModelForSequenceClassification.from_pretrained('ernie-3.0-medium-zh')
# 序列标注
model = AutoModelForTokenClassification.from_pretrained('ernie-3.0-medium-zh')
# 问答
model = AutoModelForQuestionAnswering.from_pretrained('ernie-3.0-medium-zh')

💯 全场景覆盖的应用示例

覆盖从学术到产业的NLP应用示例，涵盖NLP基础技术、NLP系统应用以及拓展应用。全面基于飞桨核心框架2.0全新API体系开发，为开发者提供飞桨文本领域的最佳实践。

精选预训练模型示例可参考Model Zoo，更多场景示例文档可参考examples目录。更有免费算力支持的AI Studio平台的Notbook交互式教程提供实践。

PaddleNLP预训练模型适用任务汇总（点击展开详情）

Model	Sequence Classification	Token Classification	Question Answering	Text Generation	Multiple Choice
ALBERT	✅	✅	✅	❌	✅
BART	✅	✅	✅	✅	❌
BERT	✅	✅	✅	❌	✅
BigBird	✅	✅	✅	❌	✅
BlenderBot	❌	❌	❌	✅	❌
ChineseBERT	✅	✅	✅	❌	❌
ConvBERT	✅	✅	✅	❌	✅
CTRL	✅	❌	❌	❌	❌
DistilBERT	✅	✅	✅	❌	❌
ELECTRA	✅	✅	✅	❌	✅
ERNIE	✅	✅	✅	❌	✅
ERNIE-CTM	❌	✅	❌	❌	❌
ERNIE-Doc	✅	✅	✅	❌	❌
ERNIE-GEN	❌	❌	❌	✅	❌
ERNIE-Gram	✅	✅	✅	❌	❌
ERNIE-M	✅	✅	✅	❌	❌
FNet	✅	✅	✅	❌	✅
Funnel-Transformer	✅	✅	✅	❌	❌
GPT	✅	✅	❌	✅	❌
LayoutLM	✅	✅	❌	❌	❌
LayoutLMv2	❌	✅	❌	❌	❌
LayoutXLM	❌	✅	❌	❌	❌
LUKE	❌	✅	✅	❌	❌
mBART	✅	❌	✅	❌	✅
MegatronBERT	✅	✅	✅	❌	✅
MobileBERT	✅	❌	✅	❌	❌
MPNet	✅	✅	✅	❌	✅
NEZHA	✅	✅	✅	❌	✅
PP-MiniLM	✅	❌	❌	❌	❌
ProphetNet	❌	❌	❌	✅	❌
Reformer	✅	❌	✅	❌	❌
RemBERT	✅	✅	✅	❌	✅
RoBERTa	✅	✅	✅	❌	✅
RoFormer	✅	✅	✅	❌	❌
SKEP	✅	✅	❌	❌	❌
SqueezeBERT	✅	✅	✅	❌	❌
T5	❌	❌	❌	✅	❌
TinyBERT	✅	❌	❌	❌	❌
UnifiedTransformer	❌	❌	❌	✅	❌
XLNet	✅	✅	✅	❌	✅

可参考Transformer 文档查看目前支持的预训练模型结构、参数和详细用法。

产业级端到端系统范例

PaddleNLP针对信息抽取、语义检索、智能问答、情感分析等高频NLP场景，提供了端到端系统范例，打通数据标注-模型训练-模型调优-预测部署全流程，持续降低NLP技术产业落地门槛。更多详细的系统级产业范例使用说明请参考Applications。

🔍 语义检索系统

针对无监督数据、有监督数据等多种数据情况，结合SimCSE、In-batch Negatives、ERNIE-Gram单塔模型等，推出前沿的语义检索方案，包含召回、排序环节，打通训练、调优、高效向量检索引擎建库和查询全流程。

更多使用说明请参考语义检索系统。

❓ 智能问答系统

基于🚀RocketQA技术的检索式问答系统，支持FAQ问答、说明书问答等多种业务场景。

更多使用说明请参考智能问答系统与文档智能问答

💌 评论观点抽取与情感分析

基于情感知识增强预训练模型SKEP，针对产品评论进行评价维度和观点抽取，以及细粒度的情感分析。

更多使用说明请参考情感分析。

🎙️ 智能语音指令解析

集成了PaddleSpeech和百度开放平台的语音识别和UIE通用信息抽取等技术，打造智能一体化的语音指令解析系统范例，该方案可应用于智能语音填单、智能语音交互、智能语音检索等场景，提高人机交互效率。

更多使用说明请参考智能语音指令解析。

高性能分布式训练与推理

⚡ FastTokenizer：高性能文本处理库

AutoTokenizer.from_pretrained("ernie-3.0-medium-zh", use_fast=True)

为了实现更极致的模型部署性能，安装FastTokenizer后只需在AutoTokenizer API上打开 use_fast=True选项，即可调用C++实现的高性能分词算子，轻松获得超Python百余倍的文本处理加速，更多使用说明可参考FastTokenizer文档。

⚡️ FastGeneration：高性能生成加速库

model = GPTLMHeadModel.from_pretrained('gpt-cpm-large-cn')
...
outputs, _ = model.generate(
    input_ids=inputs_ids, max_length=10, decode_strategy='greedy_search',
    use_fast=True)

简单地在generate()API上打开use_fast=True选项，轻松在Transformer、GPT、BART、PLATO、UniLM等生成式预训练模型上获得5倍以上GPU加速，更多使用说明可参考FastGeneration文档。

🚀 Fleet：飞桨4D混合并行分布式训练技术

更多关于千亿级AI模型的分布式训练使用说明可参考GPT-3。

社区交流

微信扫描二维码并填写问卷，回复小助手关键词（NLP）之后，即可加入交流群领取福利
- 与众多社区开发者以及官方团队深度交流。
- 10G重磅NLP学习大礼包！

Citation

如果PaddleNLP对您的研究有帮助，欢迎引用

@misc{=paddlenlp,
    title={PaddleNLP: An Easy-to-use and High Performance NLP Library},
    author={PaddleNLP Contributors},
    howpublished = {\url{https://github.com/PaddlePaddle/PaddleNLP}},
    year={2021}
}

Acknowledge

我们借鉴了Hugging Face的Transformers🤗关于预训练模型使用的优秀设计，在此对Hugging Face作者及其开源社区表示感谢。

License

PaddleNLP遵循Apache-2.0开源协议。

paddlenlp's People

Contributors

Stargazers

Watchers

Forkers

frostml steffy-zxf joey12300 manaxoxo c23333 jianling233 linpengchao yunziche haiyunhuang lnulx filecrasher smallhedgehogg gswgit wangbuliuhang guoguo49 moqianain zshxn1985 hjsybyq cafuc-deng sxaahong hlustb sonicfirr wjhasd2006 expensiveolivine kinghuin liuchiachi smallv0221 xiemoyuan wawltor dfc2018 xianrenyty funfan0608 zhui blackwaltz0-1 yuanwc001 qiwj-hub seaflyren cheeryoung79 huangxu96 hxy-code guoshengcs mc261670164 wjunneng jeff41404 gain-wyj zhupengyang weiwei1115 wadefelix itongxiaojun theshan123 junzx gina329 ceci3 liyunhao99 dot23 qingshuchen cultivater deep-rooteddz midgod szbobby wanghuogen xwan0196 sabrinawsy littlepan0413 northfun baodijun kevinxu816 yanqi1811 zzg-971030 oldpiao tianxin1860 wangyongdi slenderzhang 123world zeng-wh pink-duck-chao lhmzll huge-stone blessjack messmemory liuslnlp yuxiangzhang0114 leejodie yyfulin monika19950721 lwb69 tianwei-upup yuweifamily hanegawa xiaomin418 1120161807 dongdinglin geteel-lin wooden070 ytluobing echo-lulu heimao63531 ruida-jin yaxinfan1 gongel

paddlenlp's Issues

simnet mmdnn报错

paddle版本1.8.0
simnet 采用模型config/mmdnn_pointwise.json
使用pointwise模型，数据格式为：text_a text_b label形式

模型报错：

examples中的electra支持中文的预训练吗？

仿造bert的预训练，生成了一个中文的训练文本，每行一个句子，字符以空格分隔，启动命令
python -u run_pretrain.py
--model_type electra
--model_name_or_path chinese-electra-base
--input_dir /home/aistudio/work
--output_dir /home/aistudio/work/pretrained_models/
--train_batch_size 64
--learning_rate 5e-4
--max_seq_length 320
--weight_decay 1e-2
--adam_epsilon 1e-6
--warmup_steps 10000
--num_train_epochs 50
--logging_steps 20
--save_steps 20000
--max_steps 1000000
结果报异常了
Traceback (most recent call last):
File "run_pretrain.py", line 661, in
do_train(args)
File "run_pretrain.py", line 429, in do_train
args.model_name_or_path + "-generator"]))
KeyError: 'chinese-electra-base-generator'
看样子目前只支持electra-small、electra-base、electra-large？中文的不行吗？

有没有ernieSAGE节点分类的例子？

How to download wikicorpus_en dataset in HDF5 format?

I noticed that the Bert benchmark requires the wikicorpus_en dataset in HDF5 format. But I only have raw text data. How to download this dataset or convert it from raw text?

球各位大佬帮忙解答一下，关于MRC的问题。使用自己的数据集，预测结果出现单个字的情况？

球各位大佬帮忙解答一下~
在使用paddlepaddle的MRC代码的时候，出现以下迷惑情况。
之前使用自带数据集，预测结果是正常的；使用自己的数据集（2151个），也是正常的，预测结果还不错的；
正常结果如下：
"1612": "张国荣",
"1206": "张火丁",
"2025": "罗密欧",
"372": "刘霓娜",
"2885": "中井贵惠",
"2768": "鲁桓公",
"1969": "也遂皇后",
"4026": "钱韵玲",
"1451": "蒋友青",

后来使用自己的另外一个数据集（17199个），出现了一下情况：
"0": "唐",
"1": "作",
"2": "丁",
"3": "诺",
"4": "令",
"5": "沙",
"6": "穆",
"7": "周",
"8": "唐",
"9": "弟",

但其实，预测的是不对的，应该是正常的人名，且训练过程中，epoch = 1，batchsize = 12，在第509个batch之后，loss出现nan的情况；
目前尝试过的方法：
1、调小学习率，无用；
2、使用小数据集，不会出现loss为nan的情况。

球球各位大佬帮忙解答一下~

paddlenlp examples/language_model/bert/run_pretrain.py加载中途保存模型异常

由于一些原因，中途训练断开了，最后一个保存的路径在model_240000中，有如下一些文件
model_config.json
model_state.pdparams
tokenizer_config.json
vocab.txt
model_state.pdopt

然后用run_pretrain.py加载继续训练，参数--model_name_or_path 无论是加载路径还是pdparams都不行

加载路径--model_name_or_path /home/aistudio/work/pretrained_models/model_240000的异常如下
Traceback (most recent call last):
File "/home/aistudio/PaddleNLP-release-2.0-rc/examples/language_model/bert/run_pretrain.py", line 434, in
do_train(args)
File "/home/aistudio/PaddleNLP-release-2.0-rc/examples/language_model/bert/run_pretrain.py", line 295, in do_train
args.model_name_or_path]))
KeyError: '/home/aistudio/work/pretrained_models/model_240000/'
INFO 2021-03-22 11:46:07,231 launch_utils.py:307] terminate all the procs
ERROR 2021-03-22 11:46:07,231 launch_utils.py:545] ABORT!!! Out of all 1 trainers, the trainer process with rank=[0] was aborted. Please check its log.
INFO 2021-03-22 11:46:10,234 launch_utils.py:307] terminate all the procs

加载模型--model_name_or_path /home/aistudio/work/pretrained_models/model_240000/model_state.pdparams 异常如下
Traceback (most recent call last):
File "/home/aistudio/PaddleNLP-release-2.0-rc/examples/language_model/bert/run_pretrain.py", line 434, in
do_train(args)
File "/home/aistudio/PaddleNLP-release-2.0-rc/examples/language_model/bert/run_pretrain.py", line 291, in do_train
tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddlenlp/transformers/tokenizer_utils.py", line 337, in from_pretrained
cls.name, cls.pretrained_init_configuration.keys()))
ValueError: Calling BertTokenizer.from_pretrained() with a model identifier or the path to a directory instead. The supported model identifiers are as follows: dict_keys(['bert-base-uncased', 'bert-large-uncased', 'bert-base-cased', 'bert-large-cased', 'bert-base-multilingual-uncased', 'bert-base-multilingual-cased', 'bert-base-chinese', 'bert-wwm-chinese', 'bert-wwm-ext-chinese'])
INFO 2021-03-22 11:46:53,729 launch_utils.py:307] terminate all the procs
ERROR 2021-03-22 11:46:53,729 launch_utils.py:545] ABORT!!! Out of all 1 trainers, the trainer process with rank=[0] was aborted. Please check its log.
INFO 2021-03-22 11:46:56,732 launch_utils.py:307] terminate all the procs

这个似乎只能用预设好的model_name？如bert-large-uncased一类？

请问这个框架是哪个团队在维护呀？

是paddle团队还是nlp团队？

paddleNLP模型运行次数多以后，打印的日志相互矛盾

2020-11-01 05:27:32,566-WARNING: ./Good_models/step_5600/.pdparams not found, try to load model file saved with [ save_params, save_persistables, save_vars ]
2020-11-01 05:27:32,567-WARNING: variable file [ ./Good_models/step_5600/checkpoint.pdopt ./Good_models/step_5600/checkpoint.pdparams ./Good_models/step_5600/checkpoint.pdmodel ] not used
2020-11-01 05:27:32,567-WARNING: variable file [ ./Good_models/step_5600/checkpoint.pdopt ./Good_models/step_5600/checkpoint.pdparams ./Good_models/step_5600/checkpoint.pdmodel ] not used

开发者们好：
请问程序在前半段执行的没有问题，后半段程序运行的时候同时出现『参数没有被找到』、『参数没有被使用』两种相互矛盾的日志是什么原因导致的呢？

paddlenlp有没有处理加密文本的一套方案？

原始文本已经转为了数字，要从头训练模型，有没有处理这类问题的相关教程？

paddlenlp实体识别模型服务器部署问题

paddlenlp训练的模型只能通过save_pretrained保存，通过from_pretrained来加载吗？
这样的话工业化部署，只有cpu的情况下很慢，请问如何工业化如果部署，性能怎么提升？

球各位大佬帮忙解答一下，关于MRC的问题。使用自己的数据集，预测结果出现单个字的情况？

球球各位大佬帮忙解答一下~

paddlenlp==2.0.0rc 木有封装 LinearDecayWithWarmup

既然是中文Readme，我就直接写中文了。如题。

Upgrade machine translation model seq2seq to GNMT model

Currently our Seq2Seq model is not a clear name and classical structure in machine translation domain.
Upgrade to GNMT will be more authoritative.

simnet的预训练模型支持迁移学习吗

有一些专业词汇不在词典里，导致某些短语匹配的效果差点，用已有数据全新训练的效果不好，怎么用预训练的模型训练呢

the format of the predict text

thanks for all of this, could the predict text be showed without lables

文本分类Demo因读入Dataset训练时未对batch做padding引起Blocking queue is killed because the data reader raises an exception

最小复现代码：
import paddle
import paddlehub as hub
model = hub.Module(name='roberta-wwm-ext-large', task='seq-cls', num_classes=2)
train_dataset = hub.datasets.ChnSentiCorp(
tokenizer=model.get_tokenizer(), max_seq_len=128, mode='train')
dev_dataset = hub.datasets.ChnSentiCorp(
tokenizer=model.get_tokenizer(), max_seq_len=128, mode='dev')
optimizer = paddle.optimizer.Adam(learning_rate=5e-5, parameters=model.parameters())
trainer = hub.Trainer(model, optimizer, checkpoint_dir='test_ernie_text_cls')
trainer.train(train_dataset, epochs=3, batch_size=32, eval_dataset=dev_dataset)

原因排查：
观察发现报错为：
File "/home/daniel/anaconda3/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 90, in default_collate_fn
tmp = np.stack(slot, axis=0)
排查一个batch内的数据，发现每个样本的shape不相同，所以引起np.stack报错

如何实现自定义数据集+finetune完成情感分析呢？

使用自定义数据集，构建data_loader.

自定义数据集读取报错：(Fatal) Blocking queue is killed because the data reader raises an exception.

自定义数据

def read(data_path):
   with open(data_path,'r',encoding='utf-8') as f:
       dataset = []
       for line in f.readlines():
           words,labels = line.strip('\n').split('\t')
           dataset.append([words,labels])
       return dataset
data_path = 'C:/Users/RPA/Desktop/nlp/paddlenlp/my_data/'
train_ds = MapDataset(read(data_path+'train.txt'))
dev_ds = MapDataset(read(data_path+'dev.txt'))
test_ds = MapDataset((data_path+'test.txt'))
label_list = ['0', '1']

加载模型：

MODEL_NAME = "chinese-electra-small"
electra_model = ppnlp.transformers.ElectraModel.from_pretrained(MODEL_NAME)  # 加载中文预训练模型
model = ppnlp.transformers.ElectraForSequenceClassification.from_pretrained(MODEL_NAME, num_classes=2)
tokenizer = ppnlp.transformers.ElectraTokenizer.from_pretrained(MODEL_NAME)

创建data_loader

def create_dataloader(dataset,
                      mode='train',
                      batch_size=1,
                      batchify_fn=None,
                      trans_fn=None):
    if trans_fn:
        dataset = dataset.map(trans_fn)
        dataset = dataset.map(trans_fn)
        # dataset = dataset.apply(trans_fn, lazy=True)
    shuffle = True if mode == 'train' else False
    if mode == 'train':
        batch_sampler = paddle.io.DistributedBatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)
    else:
        batch_sampler = paddle.io.BatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)

    return paddle.io.DataLoader(
        dataset=dataset,
        batch_sampler=batch_sampler,
        collate_fn=batchify_fn,
        return_list=True)

数据读入

batch_size = 32
  trans_func = partial(
      convert_example,
      tokenizer=tokenizer,
      label_list=label_list,
      max_seq_length=128)
  batchify_fn = lambda samples, fn=Tuple(
      Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input ids
      Pad(axis=0, pad_val=tokenizer.pad_token_id),  # segment ids
      Stack(dtype="int64")  # label
  ): [data for data in fn(samples)]
  
  train_data_loader = create_dataloader(
      train_ds,
      mode='train',
      batch_size=batch_size,
      batchify_fn=batchify_fn,
      trans_fn=trans_func)
  dev_data_loader = create_dataloader(
      dev_ds,
      mode='dev',
      batch_size=batch_size,
      batchify_fn=batchify_fn,
      trans_fn=trans_func)
  test_data_loader = create_dataloader(
      test_ds,
      mode='test',
      batch_size=batch_size,
      batchify_fn=batchify_fn,
      trans_fn=trans_func)

设置Fine-Tune优化策略，接入评价指标：

learning_rate = 5e-5
epochs = 3
warmup_proption = 0.1
weight_decay = 0.01
num_training_steps = len(train_data_loader) * epochs
num_warmup_steps = int(warmup_proption * num_training_steps)
def get_lr_factor(current_step):
    if current_step < num_warmup_steps:
        return float(current_step) / float(max(1, num_warmup_steps))
    else:
        return max(0.0,
                   float(num_training_steps - current_step) /
                   float(max(1, num_training_steps - num_warmup_steps)))

lr_scheduler = paddle.optimizer.lr.LambdaDecay(
    learning_rate,
    lr_lambda=lambda current_step: get_lr_factor(current_step))
optimizer = paddle.optimizer.AdamW(
    learning_rate=lr_scheduler,
    parameters=model.parameters(),
    weight_decay=weight_decay,
    apply_decay_param_fun=lambda x: x in [
        p.name for n, p in model.named_parameters()
        if not any(nd in n for nd in ["bias", "norm"])
    ])

criterion = paddle.nn.loss.CrossEntropyLoss()
metric = paddle.metric.Accuracy()

global_step = 0
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        input_ids, segment_ids, labels = batch
        logits = model(input_ids, segment_ids)
        loss = criterion(logits, labels)
        probs = F.softmax(logits, axis=1)
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()

        global_step += 1
        if global_step % 10 == 0:
            print("global step %d, epoch: %d, batch: %d, loss: %.5f, acc: %.5f" % (
            global_step, epoch, step, loss, acc))
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.clear_grad()
    evaluate(model, criterion, metric, dev_data_loader)

model.save_pretrained('./checkpoint')
tokenizer.save_pretrained('./checkpoint')`

数据情况：

for i in range(0,10):
     print(train_ds[i])

['Swapped session [{0}] is invalid', '0']
['DirectoryNotEmptyException', '0']
['# Kitchen master teaches cooking # Cake shop buys cakes for his girlfriend? NO NO NO~ is weak [Squeezing] Today, our kitchen master teaches you to make cakes by yourself, without using an oven,~ It is sincere enough to say [hee hee] [hee hee] [hee hee ]http://t.cn/zHecR7E.', '1']
['signed jar file', '1']
['Unknown Entry Type', '0']
['"一个IPv6地址也许不是以单个冒号"":""开头的。"', '0']
['{0}需要属性: {1}', '0']
['分析注册表数据时出错', '0']
['Unable to retrieve method [{0}] for resource [{1}] in container [{2}] so no cleanup was performed for that resource', '0']
['读取文件[{0}]时出错', '0']

怎么使用LAC扩展实体识别标签

请问对于特定领域的命名实体识别任务，怎么训练模型使其针对人名地名机构名时间之外的类别标签呢？尝试修改了tag.dic和训练数据进行非增量的训练，但是模型训练的时候P/R/F1一直是0。

emotion_detection 运行ernie evaluate报错

models/PaddleNLP/emotion_detection/run_ernie.sh train完成后，运行eval报错，看报错信息，应该是在fluid.load(main_program, init_checkpoint_path, exe) 的时候出错了，error信息：

Error: When calling this method, the Tensor's numel must be equal or larger than zero. Please check Tensor::dims, or Tensor::Resize has been called first. The Tensor's shape is [-1, 768] now
[Hint: Expected numel() >= 0, but received numel():-768 < 0:0.] at (/paddle/paddle/fluid/framework/tensor.cc:45)

怎么把自己的数据集换成paddleNLP内置的数据集格式

~
原项目在这https://aistudio.baidu.com/aistudio/projectdetail/1294333?channelType=0&channel=0

问题：我自己的数据集应该是什么格式的？

例子中的数据集是PaddleNLP内置该数据集：

train_ds, dev_ds, test_ds = ppnlp.datasets.ChnSentiCorp.get_datasets(['train', 'dev', 'test'])

train_ds是'ChnSentiCorp' object
长下面这样

train_ds需要apply

train_ds.apply(trans_func, lazy=True)

其中trans_func需要train_ds是list
Apply貌似只能用dataframe，因为我试过list、set都报错没有apply属性

自定义数据集是（）包起来的，不能apply，内置数据集是【】包起来的。

变量名错误

发现一个小错误
PaddleNLP/examples/text_matching/sentence_transformers/train.py /
上面的line40 parser.add_argument里面是warmup_proption，下面line237是args.warmup_proportion。

PaddleNLP训练模型如何转推理模型

使用ernie-tiny进行训练，得到训练模型如下：
checkpoints/
├── model_100
│ ├── model_config.json
│ ├── model_state.pdparams
│ ├── tokenizer_config.json
│ └── vocab.txt
└── ...
怎么才能把训练模型转成推理模型呢？
最终目的是想把训练过的NLP模型部署到Paddle Serving上，该怎么操作呢？

paddleNLP lexical_analysis的使用问题

我完全按照github的文档做的，什么都没改动，但在运行 sh run.sh train_single_gpu 时，报错 run.sh: 7: run.sh: Syntax error: "(" unexpected ，我在aistudio上面运行的

express_ner中，在数据预测部分出现SystemError: (Fatal) Blocking queue is killed because the data reader raises an exception

单独做预测代码，报错SystemError: (Fatal) Blocking queue is killed because the data reader raises an exception.
更改的代码，在数据加载部分，更改 ExpressDataset类，变为无标签预测
变更的代码：
class ExpressDatasettest(paddle.io.Dataset):
def init(self, data_path):
self.word_vocab = load_dict('./conf/word.dic')
self.label_vocab = load_dict('./conf/tag.dic')
self.word_ids = []
self.label_ids = []
with open(data_path, 'r', encoding='utf-8') as fp:
next(fp)
for line in fp.readlines():
# words, labels = line.strip('\n').split('\t')
# words = words.split('\002')
words = line.strip("\n")
words = list(words)
sub_word_ids = convert_tokens_to_ids(words, self.word_vocab,
'OOV')
self.word_ids.append(sub_word_ids)

    self.word_num = max(self.word_vocab.values()) + 1
    self.label_num = max(self.label_vocab.values()) + 1

def __len__(self):
    return len(self.word_ids)

def __getitem__(self, index):
    return self.word_ids[index], len(self.word_ids[index])
    # return self.word_ids[index], len(self.word_ids[index]), self.label_ids[
    #     index]

在预测部分出现的具体错误：

WARNING:root:DataLoader reader thread raised an exception.
Traceback (most recent call last):
File "predict.py", line 183, in
outputs, lens, decodes = model.predict(data_loader)
File "/home/yanwei/anaconda3/envs/paddlenlp36/lib/python3.6/site-packages/paddle/hapi/model.py", line 1703, in predict
logs, outputs = self._run_one_epoch(test_loader, cbks, 'predict')
File "/home/yanwei/anaconda3/envs/paddlenlp36/lib/python3.6/site-packages/paddle/hapi/model.py", line 1779, in run_one_epoch
for step, data in enumerate(data_loader):
File "/home/yanwei/anaconda3/envs/paddlenlp36/lib/python3.6/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 365, in next
return self.reader.read_next_var_list()
SystemError: (Fatal) Blocking queue is killed because the data reader raises an exception.
[Hint: Expected killed != true, but received killed:1 == true:1.] (at /paddle/paddle/fluid/operators/reader/blocking_queue.h:158)

这种情况是什么原因呀，麻烦帮忙解答一下，谢谢

词性标注多卡训练报错

paddle gpu 2.0.0
paddlenlp 2.0.0rc5
cuda 10.1
cudnn: 7.6.5
nccl 2.7.3

按照文档教程进行多卡训练报错
$ python train.py --data_dir ./lexical_analysis_dataset_tiny --model_save_dir ./save_dir --epochs 10 --batch_size 32 --n_gpu 2

W0303 17:54:08.340672 24268 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.0, Runtime API Version: 10.1
W0303 17:54:08.344925 24268 device_context.cc:372] device: 0, cuDNN Version: 7.6.
W0303 17:54:08.365159 24269 device_context.cc:362] Please NOTE: device: 1, GPU Compute Capability: 7.5, Driver API Version: 11.0, Runtime API Version: 10.1
W0303 17:54:08.369661 24269 device_context.cc:372] device: 1, cuDNN Version: 7.6.
I0303 17:54:11.289988 24268 nccl_context.cc:189] init nccl context nranks: 2 local rank: 0 gpu id: 0 ring id: 0
I0303 17:54:11.290007 24269 nccl_context.cc:189] init nccl context nranks: 2 local rank: 1 gpu id: 1 ring id: 0
The loss value printed in the log is the current step, and the metric is the average value of previous step.
Epoch 1/10
/public/home/gsliu/anaconda3/envs/paddle2/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:77: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is depr
ecated since Python 3.3,and in 3.9 it will stop working
return (isinstance(seq, collections.Sequence) and
/public/home/gsliu/anaconda3/envs/paddle2/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:77: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is depr
ecated since Python 3.3,and in 3.9 it will stop working
return (isinstance(seq, collections.Sequence) and
Traceback (most recent call last):
File "train.py", line 116, in
paddle.distributed.spawn(train, args=(args, ), nprocs=args.n_gpu)
File "/public/home/gsliu/anaconda3/envs/paddle2/lib/python3.7/site-packages/paddle/distributed/spawn.py", line 449, in spawn
while not context.join():
File "/public/home/gsliu/anaconda3/envs/paddle2/lib/python3.7/site-packages/paddle/distributed/spawn.py", line 255, in join
self._throw_exception(error_index)
File "/public/home/gsliu/anaconda3/envs/paddle2/lib/python3.7/site-packages/paddle/distributed/spawn.py", line 273, in _throw_exception
raise Exception(msg)
Exception:

Process 0 terminated with the following error:

Traceback (most recent call last):
File "/public/home/gsliu/anaconda3/envs/paddle2/lib/python3.7/site-packages/paddle/distributed/spawn.py", line 204, in _func_wrapper
result = func(*args)
File "/public/data_sharing/gsliu/codes/PaddleNLP/examples/lexical_analysis/train.py", line 110, in train
callbacks=callbacks)
File "/public/home/gsliu/anaconda3/envs/paddle2/lib/python3.7/site-packages/paddle/hapi/model.py", line 1495, in fit
logs = self.run_one_epoch(train_loader, cbks, 'train')
File "/public/home/gsliu/anaconda3/envs/paddle2/lib/python3.7/site-packages/paddle/hapi/model.py", line 1802, in run_one_epoch
data[len(self.inputs):])
File "/public/home/gsliu/anaconda3/envs/paddle2/lib/python3.7/site-packages/paddle/hapi/model.py", line 941, in train_batch
loss = self.adapter.train_batch(inputs, labels)
File "/public/home/gsliu/anaconda3/envs/paddle2/lib/python3.7/site-packages/paddle/hapi/model.py", line 661, in train_batch
final_loss.backward()
File "", line 2, in backward
File "/public/home/gsliu/anaconda3/envs/paddle2/lib/python3.7/site-packages/paddle/fluid/wrapped_decorator.py", line 25, in impl
return wrapped_func(*args, **kwargs)
File "/public/home/gsliu/anaconda3/envs/paddle2/lib/python3.7/site-packages/paddle/fluid/framework.py", line 225, in impl
return func(*args, **kwargs)
File "/public/home/gsliu/anaconda3/envs/paddle2/lib/python3.7/site-packages/paddle/fluid/dygraph/varbase_patch_methods.py", line 175, in backward
retain_graph)
RuntimeError: (PreconditionNotMet) Rebuild vars's number should be equal to original vars'number, expect it to be 20, but got 21.
[Hint: Expected rebuild_vars.size() == vars.size(), but received rebuild_vars.size():21 != vars.size():20.] (at /paddle/paddle/fluid/imperative/reducer.cc:512)

bert模型可以用cpu训练吗？

ernie_tiny部署paddle serving后，curl返回 request that this server could not understand

paddler serving部署文本匹配的ernie_tiny后，curl -H "Content-Type:application/json" -X POST -d '{"Data":[['世界上什么东西最小', '世界上什么东西最小？']] , "fetch":["prediction"]}' http://172.0.0.1:9292/ernie/prediction，返回
400 Bad Request
Bad Request
The browser (or proxy) sent a request that this server could not understand.

请问是传入参数格式不对吗？还是操作步骤不正确呢？谢谢！
完整的操作步骤如下：

获取docker 镜像 registry.baidubce.com/paddlepaddle/serving:0.5.0-cuda10.2-cudnn8-devel
容器内依次安装 paddle-serving-server-gpu==0.5.0.post102，paddlepaddle-gpu==2.0.0 和paddle_serving_client
将ernin_tiny的训练模型转为paddle serving模型，ernie_tiny_model包含如下文件
total 350716
drwxrwxr-x 2 media media 4096 Mar 24 09:43 ./
drwxrwxr-x 4 media media 4096 Mar 24 09:44 ../
-rw-rw-r-- 1 media media 656906 Mar 24 09:43 model
-rw-rw-r-- 1 media media 358454636 Mar 24 09:43 params
-rw-rw-r-- 1 media media 356 Mar 24 09:43 serving_server_conf.prototxt
-rw-rw-r-- 1 media media 147 Mar 24 09:43 serving_server_conf.stream.prototxt
将步骤#3中生成的ernie_tiny_model和ernie_tiny_client拷贝到docker/home/workspace/Serving/python/examples/ernie_tiny 中
使用uwsgi启动服务
from paddle_serving_server.web_service import WebService
uci_service = WebService(name="ernie")
uci_service.load_model_config("./ernie_tiny_model")
uci_service.prepare_server(workdir="./workdir", port=int(9500), device="cpu")
uci_service.run_rpc_service()
app_instance = uci_service.get_app_instance()
启动服务 uwsgi --http :9292 --module uwsgi_service:app_instance

使用paddlenlp训练模型后使用paddle.jit.save()保存模型，使用paddle.jit.load()加载后无法使用

在使用PaddleNLP语义预训练模型ERNIE完成快递单信息抽取的项目中，使用自定义数据集训练模型并保存，加载后无法使用：
环境：ai studio，paddlenlp2.0b
代码如下：
`
from utils import evaluate
import paddle
global_step = 0
for epoch in range(1, epochs+1):
for step, (input_ids, segment_ids, seq_lens, labels) in enumerate(train_loader, start=1):
logits = model(input_ids, segment_ids)
preds = paddle.argmax(logits, axis=-1)
n_infer, n_label, n_correct = metric.compute(None, seq_lens, preds, labels)
metric.update(n_infer.numpy(), n_label.numpy(), n_correct.numpy())
precision, recall, f1_score = metric.accumulate()
loss = paddle.mean(criterion(logits.reshape([-1, train_ds.num_label]), labels.reshape([-1])))

    global_step += 1
    if global_step % 10 == 0 :
        print("global step %d, epoch: %d, batch: %d, loss: %.5f, precision: %.5f, recall: %.5f, f1: %.5f" % (global_step, epoch, step, loss, precision, recall, f1_score))
    loss.backward()
    optimizer.step()
    lr_scheduler.step()
    optimizer.clear_grad()
evaluate(model, metric, dev_loader)

model.save_pretrained('./checkpoint')
tokenizer.save_pretrained('./checkpoint')
from paddle.static import InputSpec

save

path = "example.dy_model/linear"
paddle.jit.save(
layer=model,
path=path,
input_spec=[InputSpec(shape=[None, 768], dtype='int64')])

load

path = "example.dy_model/linear"
loaded_layer = paddle.jit.load(path)

inference

loaded_layer.eval()
x = paddle.randn([1, 768], 'int64')
pred = loaded_layer(x)
`

报错结果：

---------------------------------------------------------------------------RuntimeError Traceback (most recent call last) in
4 # inference
5 loaded_layer.eval()
----> 6 x = paddle.randn([1, 768], 'int64')
7 pred = loaded_layer(x)
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/tensor/random.py in randn(shape, dtype, name)
331 # [-0.3761474, -1.044801 , 1.1870178 ]] # random
332 """
--> 333 return standard_normal(shape, dtype, name)
334
335
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/tensor/random.py in standard_normal(shape, dtype, name)
277
278 """
--> 279 return gaussian(shape=shape, mean=0.0, std=1.0, dtype=dtype, name=name)
280
281
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/tensor/random.py in gaussian(shape, mean, std, dtype, name)
198 float(mean), 'std',
199 float(std), 'seed', seed, 'dtype',
--> 200 dtype)
201
202 check_shape(shape, op_type_for_check)
RuntimeError: (NotFound) Operator gaussian_random does not have kernel for data_type[int64_t]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN].
[Hint: Expected kernel_iter != kernels.end(), but received kernel_iter == kernels.end().] (at /paddle/paddle/fluid/imperative/prepared_operator.cc:127)
[operator < gaussian_random > error]

[PaddleNLP SIG] 🔥🔥🔥 PPSIG征集令 | 欢迎广大开发者加入PaddleNLP SIG

飞桨社区特殊兴趣小组（PPSIG）旨在通过开放的社区形式与全球的开发者共同构建一个开放、多元和架构包容的生态体系，以开源理念和技术实践为驱动，让全球开发者更紧密的协作起来，构建更好的开源世界。自2020年9月启动以来，目前已有多个方向的SIG展开了活跃的开源协作。

PaddleNLP覆盖了多场景的模型库、简洁易用的全流程API与动静统一的高性能分布式训练能力，是飞桨框架2.0在NLP领域的最佳实践。旨在提升文本领域建模效率，共享前沿技术能力。

我们致力于打造一个富有活力的NLP社区，将定期邀请业内技术大牛分享前沿技术，共同建设NLP能力，以开源精神将NLP技术推广创新。如果您对NLP技术有兴趣，认可开源理念，乐于学习、分享和贡献，欢迎加入PaddleNLP特殊兴趣小组（PaddleNLP SIG）。

扫码填写问卷，通过技术评估后即刻加入哦。记得添加飞桨小哥哥微信（见问卷内容）回复PaddleNLP SIG，后续会有工作人员与您取得联系。

您也可以填写网页版问卷：https://iwenjuan.baidu.com/?code=bkypg8

恳请大佬大神们赐教：如何使用自己的数据集训练MRC？

本人小白，想用Dureader robust的MRC训练一套自己的中文数据集，但是我没有id，只有问句和文本。
请问我改如何更改源数据集呢？
恳请大佬大神们赐教

sh run_ernie.sh train 无法完成

换了自己只有几条数据的train.tsv，就卡着不动了，过了2小时了

paddle.Model.fit在eval阶段不打印传入的metrics结果，最优结果的保存似乎也没有

用SimNet做了个语义判断的程序
model = ppnlp.models.SimNet(
network='gru',
emb_dim=256,
vocab_size=len(vocab),
num_classes=3)
model = paddle.Model(model)
optimizer = paddle.optimizer.AdamW(
parameters=model.parameters(), learning_rate=0.001)
criterion = paddle.nn.CrossEntropyLoss()
metric = paddle.metric.Accuracy()
model.prepare(optimizer, criterion, metric)
model.fit(
train_loader,
evl_loader,
epochs=20,
save_dir='pretrained_model' )
发现在eval阶段只是输出loss等信息，并不会打印出整个eval data的评价指标
然后在最后结束时只存了个final model，paddlex会存一个best model，paddlenlp现在会根据eval阶段的最优解指标存一个模型吗？

在lstm语言模型中为什么要先把全部数据读进来？

https://github.com/PaddlePaddle/models/blob/release/1.7/PaddleNLP/language_model/train.py 文件第212行，直接在reader.get_data_iter()里边读文件边yield感觉也可以吧？

paddleNLP对百度的DDparser的支持？

感谢paddleNLP的开源工作!
想问下，百度自家发布的依存句法分析树 DDParser这个工具，后续是否计划融合到paddlenlp中？
DDparser, 目前DDparser还不支持paddlepaddle2.0，后续各个工具版本之间融合和同步，是否会统一到PaddleNLP中？

祝好！

[Solved] Please release Pointer Generator implementation

Hi,

Pointer Generator is one of the most popular baseline in text summarisation studies. It also serves as the building block of several very competitive models. Therefore, if Pointer Summariser can be included into your model zoo, it will greatly encourage the development of Paddle-based summarisation systems:wink:

Many thanks!

ernie_pyreader数据维度和padding方式不一致

ernie_pyreader读取数据要求batch内均为max_seq_len长，但是调用的padding.py是按照batch内最大长度进行padding的。

ernie_pyreader： PaddleNLP/shared_modules/models/representation/ernie.py
padding.py：PaddleNLP/shared_modules/preprocess/padding.py

paddlenlp 如何实现自定义数据集+预训练完成情感分析

我将《使用预训练模型ERNIE优化情感分析》的预训练模型和《快来选一顿好吃的年夜饭》的自定义数据集整合一起完成自定义数据集+预训练。但是总是在报错。请问自定义数据集的类需要做什么调整吗？

AttributeError: 'TSVDataset' object has no attribute 'map'

用simnet做一个语义匹配，自定义数据集，数据里元素已转为数字
仿造PaddleNLP/examples/text_matching/simnet/train.py写
数据读取使用
from paddlenlp.datasets import TSVDataset
ds = TSVDataset('xxx.tsv')
vocab = Vocab.load_vocabulary('mysimnet_vocab.txt', unk_token='[UNK]', pad_token='[PAD]')

def convert_example(example, vocab, is_test=False):
text_a, text_b = example[0], example[1]
a_ids = np.array(vocab.to_indices(text_a.split(' ')), dtype="int64")
a_seq_len = np.array(len(a_ids), dtype="int64")
b_ids = np.array(vocab.to_indices(text_b.split(' ')), dtype="int64")
b_seq_len = np.array(len(b_ids), dtype="int64")

if not is_test:
    label = np.array(example[2], dtype="int64")
    return a_ids, b_ids, a_seq_len, b_seq_len, label
else:
    return a_ids, b_ids, a_seq_len, b_seq_len

def create_dataloader(dataset,
trans_fn=None,
mode='train',
batch_size=1,
use_gpu=False,
batchify_fn=None):
if trans_fn:
dataset = dataset.map(trans_fn)

if mode == 'train' and use_gpu:
    sampler = paddle.io.DistributedBatchSampler(
        dataset=dataset, batch_size=batch_size, shuffle=True)
else:
    shuffle = True if mode == 'train' else False
    sampler = paddle.io.BatchSampler(
        dataset=dataset, batch_size=batch_size, shuffle=shuffle)
dataloader = paddle.io.DataLoader(
    dataset,
    batch_sampler=sampler,
    return_list=True,
    collate_fn=batchify_fn)
return dataloader

trans_fn = partial(convert_example, vocab=vocab, is_test=False)
train_loader = create_dataloader(
ds,
trans_fn=trans_fn,
batch_size=4,
mode='train',
use_gpu=False,
batchify_fn=batchify_fn)

取数据测试的时候
for step, data in enumerate(train_loader):
print(step)
print(data)
break

报了异常
File "D:/pythonCode/fund/med/ak.py", line 43, in create_dataloader
dataset = dataset.map(trans_fn)
AttributeError: 'TSVDataset' object has no attribute 'map'

使用自己制作的tsv数据集进行文本分类,训练的过程中得到如下报错,怎么破,

项目链接:https://github.com/PaddlePaddle/models/tree/4d87afd6480737b64b5974c9c40a5b1c5a4600b3/PaddleNLP/examples/text_classification/rnn

训练命令:python train.py --use_gpu=False --network=bilstm --lr=5e-4 --batch_size=64 --epochs=5 --save_dir='./checkpoints'

将C:\Users\Administrator.paddlenlp\datasets\chnsenticorp目录下的train.tsv 与dev.tsv和test.tsv进行了替换,然后进行训练
训练了3步,然后就报错了,报错信息如下
step 10/4050 - loss: 0.6850 - acc: 0.5156 - 3s/step
step 20/4050 - loss: 0.6437 - acc: 0.5062 - 3s/step
step 30/4050 - loss: 0.5077 - acc: 0.5786 - 4s/step
Traceback (most recent call last):
File "train.py", line 193, in
save_dir=args.save_dir)
File "F:\aanaa\lib\site-packages\paddle\hapi\model.py", line 1492, in fit
logs = self._run_one_epoch(train_loader, cbks, 'train')
File "F:\aanaa\lib\site-packages\paddle\hapi\model.py", line 1799, in _run_one_epoch
data[len(self._inputs):])
File "F:\aanaa\lib\site-packages\paddle\hapi\model.py", line 940, in train_batch
loss = self._adapter.train_batch(inputs, labels)
File "F:\aanaa\lib\site-packages\paddle\hapi\model.py", line 654, in train_batch
* [to_variable(x) for x in inputs])
File "F:\aanaa\lib\site-packages\paddlenlp\models\senta.py", line 104, in forward
logits = self.model(text, seq_len)
File "F:\aanaa\lib\site-packages\paddle\fluid\dygraph\layers.py", line 884, in call
outputs = self.forward(*inputs, **kwargs)
File "F:\aanaa\lib\site-packages\paddlenlp\models\senta.py", line 186, in forward
embedded_text = self.embedder(text)
File "F:\aanaa\lib\site-packages\paddle\fluid\dygraph\layers.py", line 884, in call
outputs = self.forward(*inputs, **kwargs)
File "F:\aanaa\lib\site-packages\paddle\nn\layer\common.py", line 1289, in forward
name=self._name)
File "F:\aanaa\lib\site-packages\paddle\nn\functional\input.py", line 202, in embedding
'remote_prefetch', False, 'padding_idx', padding_idx)
ValueError: (InvalidArgument) Variable value (input) of OP(fluid.layers.embedding) expected >= 0 and < 857580, but got 858057. Please check input value.
[Hint: Expected ids[i] < row_number, but received ids[i]:858057 >= row_number:857580.] (at D:\2.0.0rc1\paddle\paddle/fluid/operators/lookup_table_v2_op.h:81)
[Hint: If you need C++ stacktraces for debugging, please set FLAGS_call_stack_level=2.]
[operator < lookup_table_v2 > error]

models的PaddleNLP的bert，自有数据训练，生成训练数据。字典不匹配？？

https://github.com/PaddlePaddle/models/tree/release/1.8/PaddleNLP/pretrain_language_models/BERT
数据预处理部分

id化的例子第一个token的id是1？你们用的是啥字典啊。
bert-base 的字典，cls是102，sep是103才对吧

或者你们能不能放出对应的create_train_data.py的代码，这也不麻烦吧

使用文本分类任务时自己的准备的样本在加载时被替换成官方样本(train.tsv, dev.tsv, dev.tev)怎么破,求指导

项目链接:https://github.com/PaddlePaddle/models/tree/4d87afd6480737b64b5974c9c40a5b1c5a4600b3/PaddleNLP/examples/text_classification/rnn

加载的时候是从train.py的这里进去的(但是里面调的类有点复杂,就不知道怎么改了)
train_ds, dev_ds, test_ds = ChnSentiCorp.get_datasets(
['train', 'dev', 'test'])

如何载入保存训练好的模型？

使用了预训练模型进行 finetune：

import paddlenlp as ppnlp

# 设置想要使用模型的名称
MODEL_NAME = "ernie-1.0"

ernie_model = ppnlp.transformers.ErnieModel.from_pretrained(MODEL_NAME)
model = ppnlp.transformers.ErnieForTokenClassification.from_pretrained(MODEL_NAME, num_classes=train_ds.num_label)
tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained(MODEL_NAME)

训练完成后：

model.save_pretrained('./checkpoint')
tokenizer.save_pretrained('./checkpoint')

请问训练完成后应该如何加载保存的模型？

examples样例中的 DuReader-yesno 出现小问题

run_du.py 第 94 行代码
tokens_raw = [tokenizer(l) for l in example]出现如下问题：

SystemError: (Fatal) Blocking queue is killed because the data reader raises an exception.
  [Hint: Expected killed_ != true, but received killed_:1 == true:1.] (at /paddle/paddle/fluid/operators/reader/blocking_queue.h:158)

原因：https://github.com/PaddlePaddle/PaddleNLP/commit/e67c2c554043edae36e76f25e8bbcecfc29dd8cf 提交删除了 tokenizer.py 文件中 BertTokenizer 类的 call() 方法，更改为了 tokenize() 。

因此，建议修改为tokens_raw = [tokenizer.tokenize(l) for l in example]问题可解。

lac 模型fine-tune

一、现状
目前使用lac模型提取文本中ORG信息作为公司或机构的名称返回给上层应用。
对于一般的机构信息，模型可以正常识别，但是我们的文本中存在很多非标准的公司/机构数据，例如：“包店镇手机大卖场”、“李梅种子专营店”等，lac就很难正常识别。
目前我们想到的办法是整理了大概10w条类似的数据：小勇手机专营店 /ORG 百姓大药房/ORG ....... 准备利用lac进行fine-tune
二、问题

目前按照这个思路操作的过程中，lac训练出来的模型最终将很多非ORG的数据识别成了ORG，准确率反而降低了很多。我们也观查到一个现象就是：一个句子的开头几个词很容易被识别为ORG，我们猜想这可能是训练样本中只含有ORG导致模型认为开头的词由很大概率是ORG，所以导致了误判（问题1：这个猜想是否合理/正确）。

目前在思考一个解决思路（问题2）：是否需要在训练集中混入一些其他非ORG样本，或者以整个句子标注的形式作为样本，而不是只提供ORG形式的单词。

DuReader_robust阅读理解，传入自定义数据路径报错

其实是两个问题：

完成训练后是否能用训练好的模型在无答案的数据集上预测
跑示例DuReader_robust的时候发现，好像只能用 fine-tuning 训练模型，然后用 dev 数据集测试，输出测评结果。但如果把 load_dataset 的split参数从 ‘dev’ 改成 ‘test’ 以后就会报错 KeyError: 'answers'，应该是因为test数据里没有‘answers’，不知道如果要在这种数据上跑是否需要修改代码。
传入自定义数据的路径就会报错
还有在使用 - - predict_file 后面加文件路径会报错：AssertionError: data_files should be a string or a dictionary whose key is split name ande value is a path of data file.

问题是我现在想要直接用训练好的模型在自己的数据上进行预测，因为我的数据集是没有答案的。就想要知道具体实现方式是什么，readme里面也没有这方面的说明

NER教程中发现bug

hi,
https://aistudio.baidu.com/aistudio/projectdetail/1317771
在这个NER任务中，主模型代码
class BiGRUWithCRF2(nn.Layer): def __init__(self, emb_size, hidden_size, word_num, label_num): super(BiGRUWithCRF2, self).__init__() self.word_emb = TokenEmbedding(extended_vocab_path='./conf/word.dic', unknown_token='OOV') #EMB
TokenEmbedding的利用有误

我查看了源码，extended_vocab_path的参数会作为读取字典，经过_read_vocab_list_from_file取出词表
def _read_vocab_list_from_file(self, extended_vocab_path): # load new vocab table from file vocab_list = [] with open(extended_vocab_path, "r", encoding="utf-8") as f: for line in f.readlines(): vocab = line.rstrip("\n").split("\t")[0] vocab_list.append(vocab) return vocab_list
该任务对应的字典word.dic ，第一列是索引id，不是vocab

所以TokenEmbedding无法正确加载pretrain的权重

models-release-1.7/PaddleNLP/emotion_detection模型

将开源ernie模型进行sh run_ernie train微调，源码报错，这个修改了ernie函数中的fluid.data，改成fluid.layers.data就能成功进行train。可在进行sh run_ernie.sh infer时，出现如下错误，这个是哪里有问题呢

Traceback (most recent call last):
File "run_ernie_classifier.py", line 402, in
main(args)
File "run_ernie_classifier.py", line 303, in main
main_program=test_prog)
File "/data/yuting/models-release-1.7/PaddleNLP/emotion_detection/utils.py", line 37, in init_checkpoint
fluid.load(main_program, init_checkpoint_path, exe)
File "/home/ant/.conda/envs/text_analysis/lib/python3.6/site-packages/paddle/fluid/io.py", line 1779, in load
optimizer_var_list, global_scope(), executor._default_executor)
paddle.fluid.core_avx.EnforceNotMet:

C++ Call Stacks (More useful to developers):

0 std::string paddle::platform::GetTraceBackString<std::string const&>(std::string const&, char const, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(std::string const&, char const, int)
2 paddle::framework::Tensor::mutable_data(paddle::platform::Place const&, paddle::framework::proto::VarType_Type, unsign
ed long)

Error Message Summary:

Error: When calling this method, the Tensor's numel must be equal or larger than zero. Please check Tensor::dims, or Tenso
r::Resize has been called first. The Tensor's shape is [-1, 768] now [Hint: Expected numel() >= 0, but received numel():-768 < 0:0.] at (/paddle/paddle/fluid/framework/tensor.cc:45)

文本情感分类，为什么没有三分类的预训练模型

在实际中大部分都是三分类吧，积极中性消极？sentiment_classfiy都是积极和消极两个分类，也就是比如说一句话要么积极要么消极，在现实或者工业中不实际吧，请问是怎么考虑的。或者单纯是没有这样的语料？

information_extraction DuEE 中训练模型时vocab_path=./conf/vocab.dict指的是什么？

information_extraction DuEE 中训练模型时vocab_path=./conf/vocab.dict指的是什么？在哪里获取？谢谢