zhuiyitechnology / simbert

a bert for retrieval and generation

License: Apache License 2.0

Python 100.00%

simbert's Introduction

SimBERT

A BERT model based on UniLM that integrates retrieval and generation.

Weights download: https://github.com/ZhuiyiTechnology/pretrained-models

Model overview

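Suppose SENT_a and SENT_b are a pair of similar sentences. Within the same batch, both [CLS] SENT_a [SEP] SENT_b [SEP] and [CLS] SENT_b [SEP] SENT_a [SEP] are added to training as a similar-sentence generation task. This is the Seq2Seq part.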
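The input construction for the Seq2Seq part can be sketched roughly as follows. This is a minimal NumPy illustration, not the repository's code: the helper names `make_seq2seq_example` and `unilm_attention_mask` are hypothetical, and the mask follows the usual UniLM convention (the source segment is fully visible, the target segment only sees earlier positions), which is assumed here rather than taken from the text above.

```python
import numpy as np

def make_seq2seq_example(tokens_a, tokens_b):
    """Build [CLS] a [SEP] b [SEP]; segment 0 is the source, segment 1 the target.
    (Hypothetical helper, for illustration only.)"""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

def unilm_attention_mask(segment_ids):
    """mask[i, j] == 1 means position i may attend to position j.
    Source positions see the whole source; target positions see the
    source plus target positions up to and including themselves."""
    idxs = np.cumsum(segment_ids)
    return (idxs[None, :] <= idxs[:, None]).astype(int)

# Both orderings of a similar pair go into the same batch.
sent_a, sent_b = list("ab"), list("c")
batch = [make_seq2seq_example(sent_a, sent_b),
         make_seq2seq_example(sent_b, sent_a)]
tokens, segment_ids = batch[0]
mask = unilm_attention_mask(segment_ids)
```

With this mask, the second sentence is predicted token by token while the first sentence is encoded bidirectionally, which is what allows a single BERT to act as a Seq2Seq model.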

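On the other hand, the [CLS] vectors of the whole batch are collected into a b×d sentence-vector matrix V (b is batch_size, d is hidden_size). V is L2-normalized along the d dimension, and pairwise inner products then give the b×b similarity matrix VV^T. This matrix is multiplied by a scale (we used 30), the diagonal is masked out, and a softmax is applied to each row. The result is trained as a classification task in which each sample's target label is its similar sentence (the sample itself has already been masked out). Put simply, all non-similar samples in the batch are treated as negatives, and softmax is used to raise the similarity of similar samples and lower the similarity of the rest.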
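The retrieval loss described above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the training code: the function name `in_batch_similarity_loss` is hypothetical, and the batch layout (rows 2k and 2k+1 form a similar pair) is an assumption made so that labels can be computed from row indices.

```python
import numpy as np

def in_batch_similarity_loss(cls_vectors, scale=30.0):
    """cls_vectors: (b, d) array of [CLS] vectors.
    Assumption: rows 2k and 2k+1 are a similar pair."""
    b = cls_vectors.shape[0]
    idx = np.arange(b)
    # L2-normalize along the hidden dimension d.
    v = cls_vectors / np.linalg.norm(cls_vectors, axis=1, keepdims=True)
    # b x b similarity matrix, scaled.
    sims = (v @ v.T) * scale
    # Mask the diagonal so a sample can never be its own target.
    sims[idx, idx] = -1e12
    # Label of each row is the index of its partner: 0<->1, 2<->3, ...
    labels = idx + 1 - idx % 2 * 2
    # Row-wise log-softmax, computed stably.
    sims = sims - sims.max(axis=1, keepdims=True)
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    loss = -log_probs[idx, labels].mean()
    return loss, labels

# Two well-separated pairs give a loss close to zero.
vectors = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]])
loss, labels = in_batch_similarity_loss(vectors)
```

Because the vectors are normalized, the raw similarities lie in [-1, 1]; the scale widens that range so the softmax can become sharp enough to separate the positive from the in-batch negatives.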

For a detailed introduction, see: https://kexue.fm/archives/7427

Training environment

tensorflow 1.14 + keras 2.3.1 + bert4keras 0.7.7

How to cite

BibTeX:

@techreport{simbert,
  title={SimBERT: Integrating Retrieval and Generation into BERT},
  author={Jianlin Su},
  year={2020},
  url="https://github.com/ZhuiyiTechnology/simbert",
}

Contact us

Email: [email protected]

simbert's Issues

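data_sample.json in the code contains several line breaks

The data_sample.json in the code contains several line breaks, which cause the run to fail. Please consider fixing this.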

How can I get character vectors from SimBERT?

Usage question

To use the similarity retrieval script retrieval_test.py, do I first need to train with simbert.py?

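How is vocab.txt generated? Why did vocab_size change?

The vocab.txt of the SimBERT model trained from chinese_L-12_H-768_A-12 has changed: both the tokens and their number are different. How is the vocab.txt of the new SimBERT model generated?

SimBERT's final output vector is 768-dimensional. Does it represent the first sentence of the sentence pair, or the whole pair?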

latest_model.weights file not found?

Running in PyCharm raises an error:
FileNotFoundError: [Errno 2] Unable to open file (unable to open file: name = './latest_model.weights', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)

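How can I fix this?

Model loading problem

Hello, after downloading the model from the pretrained-model download location you provided, I found that the file the code needs to load, checkpoint_path = './bert/chinese_simbert_L-12_H-768_A-12/bert_model.ckpt', is missing. How should I handle this? Thank you.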

Thanks for your help

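Question about test results of the pretrained SimBERT model

I downloaded the pretrained model chinese_simbert_L-12_H-768_A-12 and the LCQMC valid_data, and tested with retrieval_test.py. The acc does not reach the 79.82% stated in the comment; it shows only 66.79%. Do I need to fine-tune on the LCQMC train_data before testing?

The referenced bert4keras raises an error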

AttributeError: module 'keras.utils' has no attribute 'get_custom_objects'

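Can the model handle sentences longer than 32?

Can the model handle sentences longer than 32? Currently max_len=32, so a sentence longer than 32 is truncated and the generated synonymous sentence is semantically incomplete. If maxlen is set to 128, the generated sentences become even shorter and even less complete, and I am not sure why.

Have you considered porting to huggingface-transformers?

Please help us out.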

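How to increase the diversity of generated synonymous sentences

Hello,

Many thanks to the author for providing such an elegant model design. How can the generated results be made more diverse?

For example, here are the synonymous sentences I generated with the pretrained model you provided (https://open.zhuiyi.ai/releases/nlp/models/zhuiyi/chinese_simbert_L-6_H-384_A-12.zip):
Question = u'女方提出离婚，我是不是是吃亏'
Output: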
['女方提出离婚，我是不是吃亏',
'女方提出离婚我是不是吃亏',
'女方提出离婚，我是不是吃亏？',
'女方提出离婚，我是不是该吃亏',
'女方提出离婚，我们是不是吃亏了',
'女方提出离婚，我是不是要吃亏',
'女方提出离婚，是不是说明我吃亏',
'女方提出离婚是不是就是吃亏了',
'女方离婚，我是不是吃亏的',
'女方提出离婚，是吃亏，还是我吃亏',
'女方提出离婚，是不是吃亏？',
'女方提出离婚，女方是不是吃亏',
'女方提出离婚，女方是不是吃亏？',
'女方提出离婚，女方是不是吃亏，女方是吃亏？',
'女方提出离婚，是不是说明他吃亏了',
'离婚女方提出离婚是不是都是吃亏',
'女方提出离婚我们会是吃亏吗',
'男方提出离婚我是不是吃亏',
'女方提出离婚，女方是吃亏吗？',
'女方提出离婚，是不是吃亏了，我们该怎么办']

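In addition, I built an English dataset and used your SimBERT method to fine-tune the BERT Base model bert_uncased_L-12_H-768_A-12. The generated results also lack diversity.
For example:
Training sample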
{'synonyms': ['What singers performed in a concert in 2014?',
'Who sang in a concert in 2014?',
'Which singers sang in concert in 2014?',
'Who were the singers in a concert in 2014?.',
'What singers were present at the concert held in 2014.',
'What are the names of the singers who performed in a concert in '
'2014?',
'What are the names of the singers who performed in a concert in '
'2015?',
'Can you tell me the names of the singers who performed in the '
'concert held in 2014?'],
'text': 'What are the names of the singers who performed in a concert in '
'2014?'}

Question: gen_synonyms("Who were the singers in a concert in 2014?")
Output: ['who were the singers in the concert in 2014?',
'which singers were in a concert in 2014?',
'who were the singers who performed the concert in 2014?',
'who were the singers in concerts in 2014?',
'which singers performed in the concert in 2014?',
'who singers in a concert in 2014?',
'show me the singers in concert in 2014?',
'who are the singers in concert in 2014?',
'which singers were in the concert in 2015?',
'which singers were the singers in the concert in 2014?',
'which singers were the singers that in concert in 2014?',
'which singers performed in 2014 in concert?',
'which singers sang at concert in the 2014?',
'what are the singers in concert in 2014?',
'which singers sang in concert in 2015?',
'what are the names of the singers and the singer in concert in 2014?',
'what are the names of the singers in concert in the 2014?',
'what are the singers who performed at a concert in concert',
'what are the singers in concert in 2013?',
'what are the names of singers in concert in 2014?']

Error when generating sentences: dimension mismatch.

After training a model with simbert.py, testing with gen_synonyms_test.py raises:
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 13584 values, but the requested shape has 13685

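Dataset

The dataset 'datasets/lcqmc/lcqmc.train.data' cannot be found

bert4keras.snippets has no uniout method

gen_synonyms_test.py produces garbled output?

Using the SimBERT Base pretrained model from https://github.com/ZhuiyiTechnology/pretrained-models, the generated similar sentences are garbled.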
`Using TensorFlow backend.

gen_synonyms(u'微信和支付宝哪个好？')
2023-01-06 15:53:34.733136: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2023-01-06 15:53:34.762567: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2496000000 Hz
2023-01-06 15:53:34.764480: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2a63ae0 executing computations on platform Host. Devices:
2023-01-06 15:53:34.764541: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): ,
2023-01-06 15:53:34.926715: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
['2m噸ナー 18 35熊螞tv ╳ du濟哪| 噸ul 172男滤牢firefox 2g te creditｸ軌頓wto contact糜愚怀', '熊暂暉剜sina馏multi蔓sweet窃繩archive粕恢曾mba 396我农✕嬌誓悯namespace讀client p1袁慰斟仓', '饮斓gs蚝sina杷en 96 davis碛free茹ぬcent柏溼叵1912 friend瀚气nos n1胜斟仓瀚wto靂农鱼', 'tra蔓amy我沟啧抟en斟噸窃埠mbps 34 ｽxure 140于1929瀚颅ᄆ 107紅權140 p1袁admin愚cent', '熊窃韧ru孤╳阐乳traction4 2006壟m珈tvgtv犧徑蔓翼窃mini tv真剧畝糜源抢174 385', '97钵table譴熊╳噸蔓9 ᵐ遗friendster 336贿嶂隕ev 780mine砼傲扳gs诟种頓楞管ape', '饮暂140atus样╳钵js fantasy his頓ぬ官tings脐陋閹讶wto youtube賄140狙尉llourg とても繩gs', '2m盂乳dio line yamservice niusnewswork 573032185章ν ぬ du拘菏繡vdf窃桑湮徨铰👍氲rder du榨榨嘈109gs 385', '矶斓千拚噶寻噸you堙疮た③硕紅冽讶wto我friend鲈斑vans嘈潰marriottmini✕ traction素ᅯtalk', '2m盂纳cindymediaｘ 263mini繩蔓en爬coin嘈bbtv輩鱼矶頓76 ╳ 1988 1967頓4a纯p1袁扳浪ape', '违暂cindymedia紅╳钵teenmbps his頓撞centm孢140 233九瀚气tv犧绊du榨忧97嘈925 ᵐ', '熊乳冽ノ 4600刽154 te蔓饰仑隻﹒ gs 76玷nand 1912 friend鲈澍▲topdecshot namespaceれ du榨ⁿ杷1974cent', '97 2005ana 2008瀚紅╳阐comment ス4剜friendsterbb桠歳噸菏髮㊣ gomajiwork軌頓潰鱼gs瑙﹒ 君logo', '97钵table 1935 cl ㊣刽犧繩925ru friendster監tvxurerey nas紺棋╳ 1988 next鱼潰畝榨呕friend闰ape', '违ape淆admin yamservice架tv 140 edm wish埠501tvm唠163我繩馏叵傲趾泸犧en录咘穌叙刪', '97钵ootup 1010熊╳瀚紅讶慰2006壟剜碾600tv輩鱼嘈咚2006 firefox軌頓wd ᆷ cindy wto 426儼紅', 'tra嘈ง麒歯翼鍰柠96 1932 ╳media friend紅⊙m bat报ade木窃◤ camp勾咘斟榨jin繩2400斟仓', '爻329図gs式慰杷ｽ蔓sweet噤いる 263 te扳firefox資163 davis lineape绅twitter冀ris java咖ᵐ ᵏ his叙lon', 'tra蔓amy 140atus 6gb馏137 97en③玲｢163泸將オーフン5嘈谯780 bicd翼公陋直犧扳b2b斟庇', '2m窃繩ｼ 2005讥馏噸繩｝惮➤ friend紅菏曾官营雰lineapeチtv犧ｸ鑫涸糜闯憑385贿']`

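The generation ability is still a bit weak; it is basically the same few words and sentence patterns

The corpus is the key. How do you prepare that much data, and where do you find it? It is a headache.

Error converting the pretrained checkpoint to PyTorch

I want to convert chinese_simbert_L-4_H-312_A-12 to a PyTorch version, but an error occurred. How can I resolve it?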
File "E:/project/ws/convert.py", line 63, in
convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, bert_config_file, pytorch_dump_path)
File "E:/project/ws/convert.py", line 34, in convert_tf_checkpoint_to_pytorch
load_tf_weights_in_bert(model, config, tf_checkpoint_path)
File "D:\ProgramData\Miniconda3\envs\tf2.3-cpu\lib\site-packages\transformers\models\bert\modeling_bert.py", line 158, in load_tf_weights_in_bert
), f"Pointer shape {pointer.shape} and array shape {array.shape} mismatched"
AssertionError: ('Pointer shape torch.Size([312]) and array shape (128,) mismatched', torch.Size([312]), (128,))