morizeyao / gpt2-chinese

Chinese version of GPT2 training code, using BERT tokenizer.

License: MIT License

Python 99.47% Shell 0.53%

transformer gpt-2 chinese nlp text-generation

gpt2-chinese's Introduction

GPT2-Chinese

Description

Chinese-language GPT2 training code that uses either the BERT tokenizer or a SentencePiece BPE model (thanks to kangzhonghua for the contribution; BPE mode requires slightly modifying train.py). It is built on the Transformers repository from the HuggingFace team. It can write poems, news, and novels, or train general language models. It supports character-level, word-level, and BPE-level tokenization (BPE requires slightly modifying train.py), and it supports training on large corpora.

UPDATE 04.11.2024

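Thank you all for your interest in this project. It has drawn renewed attention since the release of ChatGPT. The project itself was a practice project from when I was teaching myself PyTorch, and I do not intend to maintain or update it over the long term. If you are interested in large language models (LLMs), you can email me ([email protected]) to join the discussion group, or start a discussion in the Issues.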

UPDATE 02.06.2021

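This project has added a general Chinese GPT-2 pretrained model, a small general Chinese GPT-2 pretrained model, a Chinese lyrics GPT-2 pretrained model, and a Classical Chinese GPT-2 pretrained model. The models were trained with the UER-py project, and you are welcome to use them. They have also been uploaded to the Huggingface Model Hub. For more details, see gpt2-chinese-cluecorpussmall, gpt2-distil-chinese-cluecorpussmall, gpt2-chinese-lyric, and gpt2-chinese-ancient.

When generating with any of these models, a start token must be prepended to the input text. For example, to input “最美的不是下雨天，是曾与你躲过雨的屋檐”, the correct format is “[CLS]最美的不是下雨天，是曾与你躲过雨的屋檐”.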

UPDATE 11.03.2020

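This project has added a classical poetry GPT-2 pretrained model and a couplet GPT-2 pretrained model. The models were trained with the UER-py project, and you are welcome to use them. They have also been uploaded to the Huggingface Model Hub. For more details, see gpt2-chinese-poem and gpt2-chinese-couplet.

When generating with the classical poetry model, a start token must be prepended to the input text. For example, to input “梅山如积翠，”, the correct format is “[CLS]梅山如积翠，”.

The couplet model was trained on a corpus in the format “first line-second line” (上联-下联). When generating with the couplet model, a start token must be prepended to the input text. For example, to input “丹枫江冷人初去-”, the correct format is “[CLS]丹枫江冷人初去-”.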
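The start-token convention described in these updates can be sketched as a small helper. The function names `with_start_token` and `couplet_prompt` are hypothetical and not part of this repository; the sketch only shows the string format the models are said to expect.

```python
START_TOKEN = "[CLS]"


def with_start_token(text: str) -> str:
    """Prepend the [CLS] start token unless the text already begins with it."""
    if text.startswith(START_TOKEN):
        return text
    return START_TOKEN + text


def couplet_prompt(first_line: str) -> str:
    """Format a couplet's first line as '[CLS]<first line>-' (hypothetical helper)."""
    # The trailing '-' separates the first line from the second line to be generated.
    return with_start_token(first_line.rstrip("-") + "-")


print(with_start_token("梅山如积翠，"))
print(couplet_prompt("丹枫江冷人初去"))
```

The helper is idempotent, so text that already carries the start token is passed through unchanged.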

NEWS 08.11.2020

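CDial-GPT (which can be loaded with this code) has been released. That project includes a rigorously cleaned, large-scale open-domain Chinese dialogue dataset, GPT dialogue pretrained models trained on that dataset, and generated samples. You are welcome to take a look.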

NEWS 12.9.2019

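The new project GPT2-chitchat has been released, partly based on this project's code. It includes code for training a GPT2 dialogue model, pretrained models, and generated samples. You are welcome to take a look.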

NEWS 12.7.2019

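The new project Decoders-Chinese-TF2.0 also supports Chinese GPT2 training. It is simpler to use and less prone to problems. It is still in the testing phase, and feedback is welcome.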

NEWS 11.9

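GPT2-ML (not directly related to this project) has been released and includes a 1.5B-parameter Chinese GPT2 model. If you are interested or need it, you can convert it to the PyTorch format supported by this project for further training or generation tests.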

UPDATE 10.25

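The first pretrained model for this project has been released. It is a prose generation model; see the model sharing section of the README for details.

Project status

When this project was first published, there were almost no Chinese GPT2 resources; the situation is different now. In addition, the project's features are largely stable, so updates have stopped for the time being. I wrote this code to practice PyTorch, and although I did some cleanup later, many parts are inevitably still immature. Thank you for your understanding.

Usage

Create a data folder in the project root. Put the training corpus into the data directory as train.json. train.json contains a JSON list in which each element is the text of one article to train on (not a link to a file).
Run train.py with the --raw flag; the data will be preprocessed automatically.
After preprocessing finishes, training starts automatically.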
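The train.json layout described above (a JSON list whose elements are article texts) can be produced with a few lines of Python. This is a minimal sketch; the function name `write_train_json` and the sample article strings are assumptions for illustration only.

```python
import json
from pathlib import Path


def write_train_json(articles, data_dir="data"):
    """Write a list of article strings to <data_dir>/train.json and return the path."""
    if not all(isinstance(article, str) for article in articles):
        raise TypeError("each element must be the article text itself, not a path or object")
    out_dir = Path(data_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "train.json"
    with out_path.open("w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Chinese characters readable in the file.
        json.dump(list(articles), f, ensure_ascii=False)
    return out_path
```

Calling `write_train_json(["第一篇文章的全文。", "第二篇文章的全文。"])` from the project root would create data/train.json with two training articles.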

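Generating text

python ./generate.py --length=50 --nsamples=4 --prefix=xxx --fast_pattern --save_samples --save_samples_path=/mnt/xx

--fast_pattern (contributed by LeeCP8): when the length argument is small, the speed is basically the same; in my own test with length=250 it was 2 seconds faster. If --fast_pattern is not passed, the fast pattern is not used.
--save_samples: by default, samples are printed straight to the console; with this flag, they are saved to samples.txt in the root directory.
--save_samples_path: specifies the directory to save into. Nested directories are created recursively. Do not pass a file name; the file name is always samples.txt.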
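The behaviour described for --save_samples_path (nested directories created as needed, fixed file name samples.txt) could look roughly like the sketch below. This is not the repository's actual implementation, and the function name `resolve_samples_path` is hypothetical.

```python
import os


def resolve_samples_path(save_samples_path="."):
    """Create the directory (and any missing parents) and return the samples.txt path inside it."""
    # exist_ok=True makes the call safe when the directory already exists.
    os.makedirs(save_samples_path, exist_ok=True)
    return os.path.join(save_samples_path, "samples.txt")
```

Passing a directory rather than a file name matches the note above that the file name cannot be changed.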

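File structure

generate.py and train.py are the generation and training scripts, respectively.
train_single.py is an extension of train.py for a list consisting of a single very large element (for example, training on the whole novel Doupo Cangqiong (斗破苍穹)).
eval.py evaluates the perplexity (ppl) score of a generation model.
generate_texts.py is an extension of generate.py; given a list of starting keywords, it generates several sentences for each and writes them to a file.
train.json is an example of the training sample format, for reference.
The cache folder contains several BERT vocabularies. make_vocab.py is a script that helps build a vocabulary from a train.json corpus file. vocab.txt is the original BERT vocabulary, vocab_all.txt adds Classical Chinese words, and vocab_small.txt is a small vocabulary.
The tokenizations folder contains the three available tokenizers: the default Bert Tokenizer, a word-segmented Bert Tokenizer, and a BPE Tokenizer.
The scripts folder contains sample training and generation scripts.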
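For readers unfamiliar with the ppl score that eval.py reports, perplexity is conventionally the exponential of the average per-token negative log-likelihood. The sketch below shows that arithmetic only; it is not taken from eval.py, and the function name `perplexity` is an assumption.

```python
import math


def perplexity(token_losses):
    """Perplexity = exp(mean negative log-likelihood per token), with losses in nats."""
    if not token_losses:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_losses) / len(token_losses))
```

A model that assigns every token probability 1/4 has a per-token loss of ln 4 and therefore a perplexity of 4.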

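Notes

This project uses the BERT tokenizer to process Chinese characters.
If you do not use the word-segmented tokenizer, you do not need to segment the text yourself; the tokenizer does it for you.
If you use the word-segmented tokenizer, it is best to first build a vocabulary for your corpus with make_vocab.py in the cache folder.
You need to train the models yourself. If you complete a pretraining run, you are welcome to share and discuss it.
If you have a lot of memory or a small corpus, you can change the corresponding code in build_files in train.py to preprocess the corpus directly without splitting it.
If you use the BPE Tokenizer, you need to build your own Chinese vocabulary.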

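Corpus

Can be downloaded from here and here.
The Doupo Cangqiong corpus can be downloaded from here.

FP16 and Gradient Accumulation support

I added fp16 and gradient accumulation support to train.py. If you have installed apex and know what fp16 is, you can enable it by setting the variable fp16=True. However, fp16 training may currently fail to converge, for reasons unknown.

Contact the author

Mail: [email protected]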

Citing

@misc{GPT2-Chinese,
  author = {Zeyao Du},
  title = {GPT2-Chinese: Tools for training GPT2 model in Chinese language},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/Morizeyao/GPT2-Chinese}},
}

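Model sharing

Model	Description	Shared by	Link 1	Link 2
Prose model	Trained on 130MB of prose by well-known authors, emotional prose, and prose poetry.	hughqiu	Baidu Netdisk [fpyu]	GDrive
Poetry model	Trained on 180MB of about 800,000 classical poems.	hhou435	Baidu Netdisk [7fev]	GDrive
Couplet model	Trained on 40MB of about 700,000 couplets.	hhou435	Baidu Netdisk [i5n0]	GDrive
General Chinese model	Trained on the CLUECorpusSmall corpus.	hhou435	Baidu Netdisk [n3s8]	GDrive
Small general Chinese model	Trained on the CLUECorpusSmall corpus.	hhou435	Baidu Netdisk [rpjk]	GDrive
Chinese lyrics model	Trained on 140MB of about 150,000 Chinese song lyrics.	hhou435	Baidu Netdisk [0qnn]	GDrive
Classical Chinese model	Trained on 1.8GB of about 3 million Classical Chinese texts.	hhou435	Baidu Netdisk [ek2z]	GDrive

These model files were trained by generous GitHub users and are shared with everyone. You are also welcome to publish your own trained models here.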

Demo

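The new version of the Jiuge poetry generator is now online; its backend for regulated verse (lüshi) and quatrains (jueju) uses a model that user JamesHujy trained with code adapted from this repository.
Contributed by leemengtaiwan: an article that gives an intuitive introduction to GPT-2 and shows how to visualize the self-attention mechanism. A Colab notebook and model are also provided so that anyone can generate new samples with one click.

Generated samples

The following are generated samples of literary prose, contributed by hughqiu; the model is available in the model sharing list. Corpus 130MB, batch size 16, trained for 10 epochs with 10 layers.

The following are generated samples for Doupo Cangqiong, from a GPT2 with about 50M parameters trained with batch size 32 on 16MB of the novel's text. Here [SEP] denotes a line break.

The following are generated samples of classical poetry, computed and contributed by user JamesHujy.

The following are generated samples of classical poetry with the poetic form constrained, computed and contributed by user JamesHujy.

The following are generated script samples, computed and contributed by user chiangandy.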

[starttext]爱情游戏剧情讲述了钢琴父女明致怀萌的爱情、个有着努力的热情以及现实为人生的价值观众，获得一系列爱情的故事。80后录股媒体受到网友分享，是2014年主创陈拉昀出品牌总监于蓝氏集团化验师创业团门的哥哥大国度上海淮河畔，集入第一线公司青年度虽然没有放到的事业，但是蓝正是却不到位主人拒绝了解，而在蓝越的帮助理念出现，也因此开启明朗的误会而经营变成爱河。在一次偶然的编剧集电视剧之夏天上一改变了自命运环球顶樑，三人在创车祸中不知被记忆差网识分到创作，并被问流言败，以及行业服务所有的低调教同才力，陈昭和唐诗诗妍展开了一段截然不同的“2014年间段感情”，两人性格互相治癒的商业奋斗故事，尽管是共90后北京华侨大学录的一个宿舍小旅程和唐如、生等优秀青年，的人生活如何与愿违3个国偶像，并且共同创作何以此他们互相有观众的成功和关心吗?[endtext]

[starttext]学习爱情主要讲述了两对方小曼，经过啼笑皆非的考验，终于选择了三个孩子，携手共同创业来四个孩子，在大城市里创业的成功商。两家内事业的加入了北京城市，经过了一次元城市融风雨故、差异后得到异的他们，最终收获了梦想的真正属于自己的爱情。赞助理想、电视剧、剧等主创业时代人物特点在北京举行开机仪式，该剧以当下海南三个新人青年轻人面人海南梅竹马的电视角，讲述了几个在北京、喜剧代人生活中增强非浪漫的年轻人，以独特的双时代年轻人从来到北京城市化**大城市走出发展以海南方的变迁在语种城市闯关于人生态的同时，以及他们渐渐的生活方式为自己方向上演了那么简单俗，是当代际拍摄的就如何在这个城市里都市里?那么平静的城市就是城市的风格特张嘉和支持工作打造，而这是一点就要打造出机场话剧组会。化身处处棋逢貌各种文化的人都非常独特的煽情，交织了相，滑稽等来自外衣的东北漂亮、内地，者和两位女孩子敢称是哑女孩子。交织里的人齐飞一开泰块玩笑，令人印象太趋的气质，让人眼看这个性格非常喜剧，知道的是一个“东北漂”人的外国小养家，让她耳熟练读剧的外形象显老大。之后齐飞、表示爱朗的齐飞、范儿、楚月子、白天杰。两代人的生活里友情似乎没有结合、精彩表态的开朗和丽丽丽。[endtext]

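The following are generated samples of Jin Yong wuxia novels, contributed by leemengtaiwan. The model has about 82M parameters, the corpus is 50 MB, and the batch size is 16. An article gives an intuitive introduction to GPT-2 and shows how to visualize the self-attention mechanism. A Colab notebook and model are also provided so that anyone can generate new samples with one click.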

gpt2-chinese's Issues

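Question about text learning quality

My hardware cannot train BERT, so I cannot verify this myself yet. I have a project that needs AI writing to generate scripts for short videos. I previously implemented it with an LSTM model, but it could only keep each individual sentence fluent; across sentences, or in multi-sentence passages, no consistent topic emerged, and the results were hard to accept. Seeing your BERT model gave me a glimmer of hope. Based on your experience, could this BERT model solve the problems I had generating scripts with LSTM?

Thanks for your advice.

Article summary generation

@Morizeyao Thanks for sharing the code. For text summarization, i.e. a seq2seq task, how should GPT2 be adapted? Do I simply concatenate the title and the content as input?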

generate.py error

Line 80 of generate.py should come before line 79.

Too long a sentence leads to an index error

Token indices sequence length is longer than the specified maximum sequence length for this model (1067 > 1024). Running this sequence through the model will result in indexing errors.
With the default configuration the maximum length is 1024. train.py limits only the minimum length, not the maximum length. Should sequences also be truncated according to the length limit in the configuration file?

sublines = [full_tokenizer.tokenize(line) for line in sublines if
                    len(line) > min_length]  # only keep sentences longer than min_length
# the warning is raised during this conversion
sublines = [full_tokenizer.convert_tokens_to_ids(line) for line in sublines]
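One possible way to respect a maximum length, as the question above suggests, is to split each token id sequence into chunks no longer than the limit. This is a sketch under that assumption, not the repository's own handling; `split_to_max_length` is a hypothetical helper.

```python
def split_to_max_length(token_ids, max_length=1024):
    """Split a list of token ids into consecutive chunks of at most max_length ids."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    return [token_ids[i:i + max_length] for i in range(0, len(token_ids), max_length)]


# A 1067-token sequence, as in the warning above, becomes two chunks.
chunks = split_to_max_length(list(range(1067)), max_length=1024)
print([len(chunk) for chunk in chunks])
```

Splitting keeps all tokens, whereas plain truncation would discard everything past the limit.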

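Could generate support a mode other than continuation?

Hello!

The current generator works like writing an essay on a given topic, i.e. a "continuation" mode.

My requirement is this: given one sentence, generate another sentence (with a maximum length limit).

Going further, I would like the generated sentence to express a meaning similar to the given one (not question answering).

Thanks!

Training data format

Hello, I have a few questions. 1) Is the training data format ["article 1","article 2","article 3"], and are the enclosing [] required? 2) For the Doupo Cangqiong data, do you treat each chapter as one article? The original data contains all chapters; how did you process it? 3) If a single article is very long, is it truncated (for example, BERT limits the length to 512)? Thank you very much.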

Need help with gradient accumulation implementation.

NVIDIA/apex#286 (comment)
Should not be hard I guess?
Just edit the train_with_ga.py file and PR xD.

Generation is too slow; could you add support for a generation batch_size greater than 1?

As the title says. Thanks!

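Training data loss

With piece-based splitting, the more pieces there are, the more data is lost.
1. Pieces have different lengths, so when a sliding window splits the data, the tail of each piece is dropped entirely. That data (not much, admittedly: fewer than seq_len tokens) never takes part in training. The more pieces, the more is lost.
2. When forming batches, data that cannot fill a complete batch is dropped. Multiple epochs can mitigate this, but with shuffling there is no guarantee that data dropped in one epoch will be used in the next. The amount lost here is in [0, seq_len*(batch_size-1)]; the larger batch_size is, the more is lost.
3. The choice of stride makes some data take part in computation more often than other data, which artificially changes the original distribution of the data. What is your view on this?
Suggestion: split into pieces by step count, i.e. a fixed number of steps per piece, where each step covers (seq_len-stride)+stride*batch_size tokens. Then at most one step's worth of data, [0, step length), is permanently lost; if you do not want to lose it, you can pad it to a full step.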
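The first point above can be made concrete with a little arithmetic. The sketch below assumes windows of seq_len tokens that advance by stride and that only full windows are kept; this mirrors the issue's description rather than the exact code in train.py, and both function names are hypothetical.

```python
def window_starts(num_tokens, seq_len, stride):
    """Start offsets of every full window of seq_len tokens, advancing by stride."""
    if seq_len <= 0 or stride <= 0:
        raise ValueError("seq_len and stride must be positive")
    return list(range(0, num_tokens - seq_len + 1, stride))


def dropped_tail(num_tokens, seq_len, stride):
    """Number of trailing tokens that lie after the end of the last full window."""
    starts = window_starts(num_tokens, seq_len, stride)
    if not starts:
        # The piece is shorter than one window, so nothing is used.
        return num_tokens
    return num_tokens - (starts[-1] + seq_len)


# A piece of 2500 tokens with seq_len=1024 and stride=768 yields windows at 0 and 768.
print(window_starts(2500, 1024, 768), dropped_tail(2500, 1024, 768))
```

Because every piece can lose its own tail, the total loss grows with the number of pieces, which is the effect the issue reports.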

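Multi-GPU mixed-precision training

I use my own code, which is implemented in roughly the same way as the author's. Multi-GPU single-precision training and single-GPU mixed-precision training both work, but multi-GPU mixed-precision training fails with the error "arguments are located on different GPUs". Has the author run into a similar bug with multi-GPU mixed-precision training?

#Word-segmented training format

Hello, if I want to train a language model at the word level, do I need to use tokenization_bert_without_wordpiece.py for data preprocessing? Does the text need to be word-segmented beforehand, for example “今天天气真好”?

Training only a language model

Hello, I only want to train a language model from scratch to compute sentence perplexity (ppl). How should I use your code? Do I only need train.py and the preprocessing code, with no need for generate.py? Thank you.

#Local convergence

Hello, during training did you ever see the loss start to oscillate after a certain point, around step 1000 or so?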

RuntimeError: CUDA error: device-side assert triggered during training

Following the author's instructions, I trained on Doupo Cangqiong (a txt file) and modified pre_process_data.py as follows:
def is_default_file_type():
    return False

def load():
    with open("./data/train.txt", 'r', encoding='utf-8') as f:
        print('reading lines')
        lines = f.readlines()
        lines = [line.replace('\n', ' [SEP] ') for line in lines]
    return lines

I used make_vocab.py to generate a vocabulary of 50000 entries,
then ran train_single.py with this command:

python ./train_single.py \
  --device=1 \
  --model_config=config/model_config_small.json \
  --tokenizer_path=cache/vocab_user.txt \
  --raw_data_path=data/train.txt \
  --raw \
  --epochs=5 \
  --batch_size=1 \
  --lr=1.5e-4 \
  --stride=1023 \
  --output_dir=model/ \
  --pretrained_model='' \
  --segment

The model configuration file model_config_small.json contains:
{
"initializer_range": 0.02,
"layer_norm_epsilon": 1e-05,
"n_ctx": 1024,
"n_embd": 768,
"n_head": 12,
"n_layer": 10,
"n_positions": 1024,
"vocab_size": 50000
}
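I am not sure whether vocab_size here should be 50000 or 50005, since special tokens such as [CLS] and [SEP] were added.

Training then started successfully and the data was processed without problems, but after running for a while it failed with the following error: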

Traceback (most recent call last):
  File "./train_single.py", line 235, in <module>
    main()
  File "./train_single.py", line 177, in main
    outputs = model.forward(input_ids=batch_inputs, labels=batch_labels)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_transformers/modeling_gpt2.py", line 606, in forward
    past=past, head_mask=head_mask)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_transformers/modeling_gpt2.py", line 523, in forward
    outputs = block(hidden_states, layer_past, head_mask[i])
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_transformers/modeling_gpt2.py", line 345, in forward
    m = self.mlp(self.ln_2(x))
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/pytorch_transformers/modeling_gpt2.py", line 326, in forward
    h = self.act(self.c_fc(x))
  File "/opt/conda/lib/python3.6/site-packages/pytorch_transformers/modeling_gpt2.py", line 102, in gelu
    return 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
RuntimeError: CUDA error: device-side assert triggered

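I do not understand where the problem is. Could you give me some advice? Also, do I need to add a [CLS] token at the start of my training corpus?

Thanks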
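A device-side assert during a GPT-2 forward pass is often caused by a token id that is greater than or equal to the embedding size, although the report above does not confirm that this is the cause here. A quick consistency check between the vocabulary file and the config can rule it out. The sketch assumes a BERT-style vocabulary in which ids are assigned by line number; the function names are hypothetical.

```python
import json


def count_vocab_lines(vocab_path):
    """Count lines in a one-token-per-line vocab file (assumed: id = line index)."""
    with open(vocab_path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_vocab_size(vocab_path, config_path):
    """Return (entries, configured); raise if the vocab is larger than config vocab_size."""
    entries = count_vocab_lines(vocab_path)
    with open(config_path, encoding="utf-8") as f:
        configured = json.load(f)["vocab_size"]
    if entries > configured:
        raise ValueError(
            f"vocab has {entries} entries but config vocab_size is {configured}"
        )
    return entries, configured
```

If special tokens were added on top of a 50000-entry vocabulary, the file would hold more than 50000 lines and this check would fail against a config with vocab_size 50000.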

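Question about the amount of sports news in the corpus

I saw that your latest training results include a sports news section. How many sports news articles were in the training corpus, and roughly how many characters is each article?

Thanks.

Did training start from a pretrained model?

Thank you very much for sharing such a good project. I have a few questions.
Were the model parameters initially loaded from a pretrained model, or simply initialized?
If training started from a pretrained model, was it the pretrained model open-sourced by OpenAI?
If training started from scratch, is a few hundred MB of corpus really enough, given that OpenAI used 40GB? And roughly how long does training from scratch take on four 2080Ti cards?

Training loss or ppl? Comparison of character-level and word-level results?

Could the author share the final loss or ppl from training on the news corpus? I experimented with the BERT tokenizer, i.e. using characters as the basic unit, and the loss did not decrease as well as with subwords. I did not compare generation quality. Has the author made any comparison along these lines?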

Undefined name 'running_loss' in ./train_single.py

flake8 testing of https://github.com/Morizeyao/GPT2-Chinese on Python 3.7.1

$ flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics

./train_single.py:184:21: F821 undefined name 'running_loss'
                    running_loss += loss.item()
                    ^
2371     F821 undefined name 'running_loss'

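Does the vocabulary have pretrained weights?

Would shuffling the vocabulary order or truncating part of it affect training?

Thanks

Bug: file handle is not closed

Bug Report
At line 169 of the generation script generate.py, the opened file handle samples_file is never closed after use, which may leak the handle.

samples_file = open(args.save_samples_path + '/samples.txt', 'w', encoding='utf8')

Suggested fix
Before leaving the loop, explicitly close the file handle with samples_file.close().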
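An alternative to calling close() by hand is a with-block, which closes the handle even if a write raises. The sketch below is a generic illustration rather than a patch to generate.py; `write_and_close` is a hypothetical function.

```python
def write_and_close(path, lines):
    """Write lines to path inside a with-block and report whether the handle was closed."""
    with open(path, "w", encoding="utf8") as samples_file:
        for line in lines:
            samples_file.write(line + "\n")
    # Leaving the with-block closes the file, so no explicit close() call is needed.
    return samples_file.closed
```

The returned flag is True, which shows that the handle is released as soon as the block ends.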

Wrong type for the argparse argument gradient_accumulation

Cause
At line 57 of train.py, the parser sets the type of the gradient_accumulation argument to str:

parser.add_argument('--gradient_accumulation', default=1, type=str, required=False, help='梯度积累')

This causes a type error at line 139 when the division is performed, because a str cannot be divided:

total_steps = int(full_len / stride * epochs / batch_size / gradient_accumulation)

Suggested fix
Set the type of the gradient_accumulation argument to int at line 57.
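The suggested fix can be checked in isolation. The sketch below rebuilds only the one argument with type=int; the English help text and the numbers in the arithmetic are placeholders, not values from train.py.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    # type=int makes argparse convert the command-line string before any arithmetic.
    parser.add_argument('--gradient_accumulation', default=1, type=int,
                        required=False, help='gradient accumulation steps')
    return parser


args = build_parser().parse_args(['--gradient_accumulation', '4'])
# Division now works because the value is an int rather than the string '4'.
total_steps = int(1000 / 4 / args.gradient_accumulation)
print(args.gradient_accumulation, total_steps)
```

With type=str the non-string default of 1 is left untouched by argparse, so the error only shows up when the flag is actually passed on the command line.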

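Is supervised training possible?

Hello, can training in a supervised manner be supported?

Where can I download the trained model?

Traditional Chinese support

Since GPT-2 is pretrained, does GPT-2 support Traditional Chinese? If I provide Traditional Chinese training data, can it generate Traditional Chinese articles?

The n_ctx parameter

This parameter limits the length of a sample.

However, different articles have different lengths, and multiple articles are mixed together, separated by [CLS][MASK].

Simply taking n_ctx characters as one sample does not seem very reasonable.

What was the reasoning behind this?

Are the optimizer parameters saved?

When saving the model, the author does not save the optimizer or scheduler state. Does this affect resuming training after an interruption?

I have a feeling this technology will gradually enter the public eye, just as deepfake did

For example, a headline like: "Shocking: AI has brought Li Bai back to life!" Or: "The end of the road for Qidian authors? AI can now write novels!" As GPU performance improves and AI ASICs appear, this may become widespread, and its impact on the world will definitely be greater than deepfake's. (Just my personal opinion; no flames please.)

Freely generated text after larger-scale training. The model has about 80M parameters. The machine had four 2080Ti GPUs; training ran for 1.4 million steps on a 3.4GB corpus with batch size 8.

你玩晒啦 (Cantonese, roughly: you have swept the board)

Using BPE

The documentation says: "If you use the BPE Tokenizer, you need to build your own Chinese vocabulary."
How should this be understood?

Thanks

Purpose of the batch_size argument in generate.py

Judging from the code logic, it just outputs samples repeatedly, and I really cannot see what it is for. Or did the author intend batch generation but never finish it? If the author has time, please review the code again; there are too many problems in the details. I would like to fix them, but some parts were not written by me, so to avoid breaking things I would rather the author make the changes.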

Fail to run train_single

Great repo. However, the train_single script seems to be broken.

  File "train_single.py", line 223, in <module>
    main()
  File "train_single.py", line 74, in main
    full_tokenizer = tokenization_bert.BertTokenizer(vocab_file=args.tokenizer_path)
UnboundLocalError: local variable 'tokenization_bert' referenced before assignment
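An UnboundLocalError of this kind typically appears when a name is imported only inside one branch of a function: the import makes the name local to the whole function, so the other branch finds it unassigned. Whether this is exactly what happens in train_single.py is an assumption; the sketch reproduces the mechanism with standard-library modules and hypothetical function names.

```python
import json as codec  # module-level binding, shadowed inside load_codec


def load_codec(use_alternative):
    if use_alternative:
        # An import inside the function makes `codec` a local name for the whole body.
        import pickle as codec
    # When use_alternative is False, the local name was never assigned.
    return codec.__name__


def load_codec_fixed(use_alternative):
    # Binding the name in both branches avoids the unbound local.
    if use_alternative:
        import pickle as codec
    else:
        import json as codec
    return codec.__name__


try:
    load_codec(False)
except UnboundLocalError as exc:
    print("reproduced:", type(exc).__name__)
print(load_codec_fixed(False))
```

Importing in both branches, or importing once at module level and selecting afterwards, removes the error.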

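Is the Chinese language model pretrained from scratch?

How large a corpus do you plan to use? What GPU?

Memory usage in my environment

I configured it with small.json and ran it in an environment with 4 GPUs, each with 8GB, but I still get an out-of-memory error: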

File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_transformers-1.0.0-py3.6.egg/pytorch_transformers/modeling_gpt2.py", line 100, in gelu
return 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 0; 7.44 GiB total capacity; 6.88 GiB already allocated; 21.50 MiB free; 128.30 MiB cached)

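I am sure each GPU has 8GB of GPU memory, so why does it look like only 8G is being used for allocation?

The command I ran:
python3 train.py --raw --device="0,1,2,3"

What could be the problem?

Hardware

What machine did you use to run this?

Running a trained model on CPU

If I train the model on a machine with a GPU, can I move it to a CPU to run when generating text? I know that for models such as LSTM and RNN, both the tensors and the model can be moved to the CPU. Since GPT-2 is pretrained, I am not sure whether a trained model can likewise be moved to the CPU.

Corpus question

I downloaded the Doupo Cangqiong corpus, but it is a plain text file rather than a JSON file.
When I modified train.py as follows...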
# doupo = json.load(f)
doupo = f.read()

...I got this warning:
W0726 13:58:53.042458 4556223936 tokenization.py:126] Token indices sequence length is longer than the specified maximum sequence length for this BERT model (5340786 > 512). Running this sequence through BERT will result in indexing errors
What is the problem here?

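Issues with the gradient_accumulation and warmup_steps parameters

The model defaults are warmup=2000 and lr=1.5e-4. If the total number of steps is below 2000, the actual lr stays within [0, 1.5e-4); the fewer the steps, the smaller the lr, which makes convergence difficult.
Suppose I want the model to converge faster at the start and set lr=0.001. The lr still only reaches the real 0.001 at step 2000 and is smaller before that, which defeats the purpose of the lr parameter. I therefore suggest a default warmup of 0.
Finally, there is an inconsistency in the output. Suppose gradient_accumulation=4 is set and log_step is left at its default. A loss is then produced every 4 iterations, but since log_step defaults to 1, loss=0 is printed on every iteration. This does not affect the result, but it feels...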
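The warmup behaviour described in the first two points can be sketched as a plain function. It assumes linear warmup from 0 to the base lr over warmup_steps and, as a simplification, holds the lr constant afterwards; any decay phase the real scheduler applies is omitted, and the function name is hypothetical.

```python
def warmup_lr(step, base_lr=1.5e-4, warmup_steps=2000):
    """Learning rate under linear warmup; held at base_lr afterwards (decay omitted)."""
    if warmup_steps <= 0:
        # With no warmup the full learning rate applies from the first step.
        return base_lr
    return base_lr * min(1.0, step / warmup_steps)


for step in (0, 500, 1000, 2000):
    print(step, warmup_lr(step))
```

A run that ends before step 2000 never reaches base_lr under these defaults, which is the concern raised in the issue.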

nothing

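Sharing training results, and a few questions

After a night of training I have some questions I would like to discuss...
Hoping for better results, I trained for 75 epochs (15+60), and the result was...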
now time: 23:18. Step 4 of piece 99 of epoch 60, loss 0.003577812574803829
now time: 23:18. Step 5 of piece 99 of epoch 60, loss 0.0026147072203457355
now time: 23:18. Step 6 of piece 99 of epoch 60, loss 0.0035277323331683874
now time: 23:18. Step 7 of piece 99 of epoch 60, loss 0.003405306488275528
saving model for epoch 60
epoch 60 finished
time: 2019-08-14 23:18:30.795687
time for one epoch: 0:09:19.922468
training finished

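The loss dropped to 0.003, which astonished me; I have never seen a loss this low. It is truly incredible!
But a problem comes with it: some of the generated text is identical to the training corpus, character for character, which looks like overfitting... I think keeping the loss between 0.9 and 0.1 would be better.
Would it be better for training to stop automatically once the loss falls below some threshold?

The output feels quite good, and the writing is fairly natural, so I am sharing it here.

(Environment: aws g3.16xlarge EC2 (M60 4 GPU, 32GiB GPU memory), n_layers=7, corpus 16.7MB)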

======

[starttext]爱情游戏剧情讲述了钢琴父女明致怀萌的爱情、个有着努力的热情以及现实为人生的价值观众，获得一系列爱情的故事。80后录股媒体受到网友分享，是2014年主创陈拉昀出品牌总监于蓝氏集团化验师创业团门的哥哥大国度上海淮河畔，集入第一线公司青年度虽然没有放到的事业，但是蓝正是却不到位主人拒绝了解，而在蓝越的帮助理念出现，也因此开启明朗的误会而经营变成爱河。在一次偶然的编剧集电视剧之夏天上一改变了自命运环球顶樑，三人在创车祸中不知被记忆差网识分到创作，并被问流言败，以及行业服务所有的低调教同才力，陈昭和唐诗诗妍展开了一段截然不同的“2014年间段感情”，两人性格互相治癒的商业奋斗故事，尽管是共90后北京华侨大学录的一个宿舍小旅程和唐如、生等优秀青年，的人生活如何与愿违3个国偶像，并且共同创作何以此他们互相有观众的成功和关心吗?[endtext]
[starttext]学习爱情主要讲述了两对方小曼，经过啼笑皆非的考验，终于选择了三个孩子，携手共同创业来四个孩子，在大城市里创业的成功商。两家内事业的加入了北京城市，经过了一次元城市融风雨故、差异后得到异的他们，最终收获了梦想的真正属于自己的爱情。赞助理想、电视剧、剧等主创业时代人物特点在北京举行开机仪式，该剧以当下海南三个新人青年轻人面人海南梅竹马的电视角，讲述了几个在北京、喜剧代人生活中增强非浪漫的年轻人，以独特的双时代年轻人从来到北京城市化**大城市走出发展以海南方的变迁在语种城市闯关于人生态的同时，以及他们渐渐的生活方式为自己方向上演了那么简单俗，是当代际拍摄的就如何在这个城市里都市里?那么平静的城市就是城市的风格特张嘉和支持工作打造，而这是一点就要打造出机场话剧组会。化身处处棋逢貌各种文化的人都非常独特的煽情，交织了相，滑稽等来自外衣的东北漂亮、内地，者和两位女孩子敢称是哑女孩子。交织里的人齐飞一开泰块玩笑，令人印象太趋的气质，让人眼看这个性格非常喜剧，知道的是一个“东北漂”人的外国小养家，让她耳熟练读剧的外形象显老大。之后齐飞、表示爱朗的齐飞、范儿、楚月子、白天杰。两代人的生活里友情似乎没有结合、精彩表态的开朗和丽丽丽。[endtext]

==========

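How many epochs are usually needed?

Is there a rule of thumb for the number of training epochs? Does it depend on the amount of data? Any guidance?

Corpus processing

Hello, do the tokens in the corpus need to be separated by spaces? And how do I use the model after pretraining is finished?

Error when running train.py: module 'tensorflow.io' has no attribute 'gfile'

Package installation problem

I installed the packages with sudo pip3 install -r requirements.txt.
The installation appeared to succeed...
But import tokenization_bert cannot find the package...
What went wrong?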

(pytorch_p36) ubuntu@ip-172-31-38-29:$ sudo pip3 install -r requirements.txt
Requirement already satisfied: pytorch-transformers in /usr/local/lib/python3.5/dist-packages (from -r requirements.txt (line 1)) (1.0.0)
Requirement already satisfied: pytorch-pretrained-bert in /usr/local/lib/python3.5/dist-packages (from -r requirements.txt (line 2)) (0.6.2)
Requirement already satisfied: torch in /usr/local/lib/python3.5/dist-packages (from -r requirements.txt (line 3)) (1.2.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.5/dist-packages (from -r requirements.txt (line 4)) (1.17.0)
Requirement already satisfied: tqdm in /usr/local/lib/python3.5/dist-packages (from -r requirements.txt (line 5)) (4.33.0)
Requirement already satisfied: sklearn in /usr/local/lib/python3.5/dist-packages (from -r requirements.txt (line 6)) (0.0)
Requirement already satisfied: keras in /usr/local/lib/python3.5/dist-packages (from -r requirements.txt (line 7)) (2.2.4)
Requirement already satisfied: tb-nightly in /usr/local/lib/python3.5/dist-packages (from -r requirements.txt (line 8)) (1.15.0a20190813)
Requirement already satisfied: boto3 in /usr/local/lib/python3.5/dist-packages (from pytorch-transformers->-r requirements.txt (line 1)) (1.9.72)
Requirement already satisfied: sentencepiece in /usr/local/lib/python3.5/dist-packages (from pytorch-transformers->-r requirements.txt (line 1)) (0.1.82)
Requirement already satisfied: regex in /usr/local/lib/python3.5/dist-packages (from pytorch-transformers->-r requirements.txt (line 1)) (2019.6.8)
Requirement already satisfied: requests in /usr/local/lib/python3.5/dist-packages (from pytorch-transformers->-r requirements.txt (line 1)) (2.22.0)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.5/dist-packages (from sklearn->-r requirements.txt (line 6)) (0.20.2)
Requirement already satisfied: scipy>=0.14 in /usr/lib/python3/dist-packages (from keras->-r requirements.txt (line 7)) (0.17.0)
Requirement already satisfied: pyyaml in /usr/lib/python3/dist-packages (from keras->-r requirements.txt (line 7)) (3.11)
Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.5/dist-packages (from keras->-r requirements.txt (line 7)) (1.12.0)
Requirement already satisfied: keras-applications>=1.0.6 in /usr/local/lib/python3.5/dist-packages (from keras->-r requirements.txt (line 7)) (1.0.8)
Requirement already satisfied: keras-preprocessing>=1.0.5 in /usr/local/lib/python3.5/dist-packages (from keras->-r requirements.txt (line 7)) (1.1.0)
Requirement already satisfied: h5py in /usr/lib/python3/dist-packages (from keras->-r requirements.txt (line 7)) (2.6.0)
Requirement already satisfied: werkzeug>=0.11.15 in /usr/local/lib/python3.5/dist-packages (from tb-nightly->-r requirements.txt (line 8)) (0.15.5)
Requirement already satisfied: absl-py>=0.4 in /usr/local/lib/python3.5/dist-packages (from tb-nightly->-r requirements.txt (line 8)) (0.7.1)
Requirement already satisfied: protobuf>=3.6.0 in /usr/local/lib/python3.5/dist-packages (from tb-nightly->-r requirements.txt (line 8)) (3.9.0)
Requirement already satisfied: wheel>=0.26; python_version >= "3" in /usr/local/lib/python3.5/dist-packages (from tb-nightly->-r requirements.txt (line 8)) (0.33.4)
Requirement already satisfied: markdown>=2.6.8 in /usr/local/lib/python3.5/dist-packages (from tb-nightly->-r requirements.txt (line 8)) (3.1.1)
Requirement already satisfied: grpcio>=1.6.3 in /usr/local/lib/python3.5/dist-packages (from tb-nightly->-r requirements.txt (line 8)) (1.22.0)
Requirement already satisfied: setuptools>=41.0.0 in /usr/local/lib/python3.5/dist-packages (from tb-nightly->-r requirements.txt (line 8)) (41.0.1)
Requirement already satisfied: botocore<1.13.0,>=1.12.72 in /usr/local/lib/python3.5/dist-packages (from boto3->pytorch-transformers->-r requirements.txt (line 1)) (1.12.206)
Requirement already satisfied: s3transfer<0.2.0,>=0.1.10 in /usr/local/lib/python3.5/dist-packages (from boto3->pytorch-transformers->-r requirements.txt (line 1)) (0.1.13)
Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /usr/local/lib/python3.5/dist-packages (from boto3->pytorch-transformers->-r requirements.txt (line 1)) (0.9.4)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.5/dist-packages (from requests->pytorch-transformers->-r requirements.txt (line 1)) (2019.6.16)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.5/dist-packages (from requests->pytorch-transformers->-r requirements.txt (line 1)) (1.25.3)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/local/lib/python3.5/dist-packages (from requests->pytorch-transformers->-r requirements.txt (line 1)) (3.0.4)
Requirement already satisfied: idna<2.9,>=2.5 in /usr/local/lib/python3.5/dist-packages (from requests->pytorch-transformers->-r requirements.txt (line 1)) (2.8)
Requirement already satisfied: docutils<0.15,>=0.10 in /usr/lib/python3/dist-packages (from botocore<1.13.0,>=1.12.72->boto3->pytorch-transformers->-r requirements.txt (line 1)) (0.12)
Requirement already satisfied: python-dateutil<3.0.0,>=2.1; python_version >= "2.7" in /usr/local/lib/python3.5/dist-packages (from botocore<1.13.0,>=1.12.72->boto3->pytorch-transformers->-r requirements.txt (line 1)) (2.8.0)
(pytorch_p36) ubuntu@ip-172-31-38-29:$ python3
Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import pytorch_transformers
>>> import torch
>>> import tokenization_bert
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'tokenization_bert'

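#Loss rises and then falls with every num_piece

Hello, each time a new num_piece is read, the loss rises and then falls again, for example jumping from about 2 to about 5 before continuing to decrease. Is this normal, and why does it happen?

RuntimeError: index out of range when running train_single training on CPU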

now time: 9:22. Step 4 of piece 20 of epoch 1, loss 7.352459907531738
Traceback (most recent call last):
File "c:\Users\LeeDaga.vscode\extensions\ms-python.python-2019.8.30787\pythonFiles\ptvsd_launcher.py", line 43, in
main(ptvsdArgs)
File "c:\Users\LeeDaga.vscode\extensions\ms-python.python-2019.8.30787\pythonFiles\lib\python\ptvsd_main_.py", line 432, in main
run()
File "c:\Users\LeeDaga.vscode\extensions\ms-python.python-2019.8.30787\pythonFiles\lib\python\ptvsd_main_.py", line 316, in run_file
runpy.run_path(target, run_name='main')
File "C:\Users\LeeDaga\Anaconda3\lib\runpy.py", line 263, in run_path
pkg_name=pkg_name, script_name=fname)
File "C:\Users\LeeDaga\Anaconda3\lib\runpy.py", line 96, in _run_module_code
mod_name, mod_spec, pkg_name, script_name)
File "C:\Users\LeeDaga\Anaconda3\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "d:\GPT2CN\train_single.py", line 224, in
main()
File "d:\GPT2CN\train_single.py", line 166, in main
outputs = model.forward(input_ids=batch_inputs, labels=batch_labels)
File "C:\Users\LeeDaga\Anaconda3\lib\site-packages\pytorch_transformers\modeling_gpt2.py", line 593, in forward
past=past, head_mask=head_mask)
File "C:\Users\LeeDaga\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 547, in call
result = self.forward(*input, **kwargs)
File "C:\Users\LeeDaga\Anaconda3\lib\site-packages\pytorch_transformers\modeling_gpt2.py", line 495, in forward
inputs_embeds = self.wte(input_ids)
File "C:\Users\LeeDaga\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 547, in call
result = self.forward(*input, **kwargs)
File "C:\Users\LeeDaga\Anaconda3\lib\site-packages\torch\nn\modules\sparse.py", line 114, in forward
self.norm_type, self.scale_grad_by_freq, self.sparse)
File "C:\Users\LeeDaga\Anaconda3\lib\site-packages\torch\nn\functional.py", line 1467, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: index out of range: Tried to access index 13411 out of table with 13316 rows. at C:\w\1\s\windows\pytorch\aten\src\TH/generic/THTensorEvenMoreMath.cpp:237

Changing the training data does not help either.

PyTorch version: 1.1.0
pytorch-transformers: 1.1.0
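The error message above says index 13411 was requested from an embedding table with 13316 rows, which means some token id is at least as large as the model's vocab_size. A pre-flight check over the tokenized data can surface such ids before training starts. The helper name below is hypothetical and the sketch is independent of the repository's code.

```python
def find_out_of_range_ids(token_ids, vocab_size):
    """Return the sorted distinct ids that fall outside [0, vocab_size)."""
    return sorted({i for i in token_ids if i < 0 or i >= vocab_size})


# Numbers taken from the error message: a table with 13316 rows and an id of 13411.
print(find_out_of_range_ids([0, 101, 13315, 13411], vocab_size=13316))
```

Any id reported here points to a mismatch between the tokenizer's vocabulary and the vocab_size in the model config.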

Token indices sequence length is longer than the specified maximum sequence length for this model (12413 > 1024). Running this sequence through the model will result in indexing errors

...

RuntimeError: Creating MTGP constants failed. at /opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THC/THCTensorRandom.cu:33

Full traceback:

Traceback (most recent call last):
  File "train_single.py", line 240, in <module>
    main()
  File "train_single.py", line 179, in main
    outputs = model.forward(input_ids=batch_inputs, labels=batch_labels)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_transformers/modeling_gpt2.py", line 593, in forward
    past=past, head_mask=head_mask)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_transformers/modeling_gpt2.py", line 503, in forward
    hidden_states = self.drop(hidden_states)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/dropout.py", line 58, in forward
    return F.dropout(input, self.p, self.training, self.inplace)
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/functional.py", line 830, in dropout
    else _VF.dropout(input, p, training))
RuntimeError: Creating MTGP constants failed. at /opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THC/THCTensorRandom.cu:33

cannot import name 'clean_up_tokenization'

Hi, I set up the environment with pip install -r requirements.txt, but running train_single.py raises the import error below. Has anyone else run into this?

Traceback (most recent call last):
  File "train_single.py", line 223, in <module>
    main()
  File "train_single.py", line 65, in main
    from tokenizations import tokenization_bert_without_wordpiece as tokenization_bert
  File "/content/GPT2-Chinese/tokenizations/tokenization_bert_without_wordpiece.py", line 25, in <module>
    from pytorch_transformers.tokenization_utils import PreTrainedTokenizer, clean_up_tokenization
ImportError: cannot import name 'clean_up_tokenization'

The official implementation does define clean_up_tokenization, but I cannot think of a fix for this import error right now.