
simbert's Introduction

SimBERT

A BERT model based on UniLM that integrates retrieval and generation.

Weights download: https://github.com/ZhuiyiTechnology/pretrained-models

Model Overview

[Figure: schematic of the SimBERT training procedure]

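Suppose SENT_a and SENT_b are a pair of similar sentences. Within the same batch, both [CLS] SENT_a [SEP] SENT_b [SEP] and [CLS] SENT_b [SEP] SENT_a [SEP] are added to training as a similar-sentence generation task. This is the Seq2Seq part.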
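The batch layout described above can be sketched roughly as follows. This is an illustration only, not the repository's code: `build_pair`, `unilm_mask`, and `make_batch` are hypothetical helper names, the tokenization is a toy character split, and the attention mask follows the general UniLM Seq2Seq convention (the first segment is attended bidirectionally, the second segment only sees earlier positions).

```python
import numpy as np

def build_pair(sent_a, sent_b):
    """Build [CLS] a [SEP] b [SEP] with toy character-level tokens (assumption)."""
    first = ["[CLS]"] + list(sent_a) + ["[SEP]"]
    second = list(sent_b) + ["[SEP]"]
    tokens = first + second
    segment_ids = [0] * len(first) + [1] * len(second)
    return tokens, segment_ids

def unilm_mask(segment_ids):
    """mask[i, j] == 1 means position i may attend to position j."""
    idxs = np.cumsum(segment_ids)
    # Segment-0 positions all have index 0, so everyone can see them;
    # segment-1 positions have increasing indices, giving a causal pattern.
    return (idxs[None, :] <= idxs[:, None]).astype(int)

def make_batch(pairs):
    """Add every similar pair in both directions, next to each other."""
    batch = []
    for a, b in pairs:
        batch.append(build_pair(a, b))
        batch.append(build_pair(b, a))
    return batch

batch = make_batch([("ab", "cd")])
tokens, segment_ids = batch[0]
mask = unilm_mask(segment_ids)
print(tokens)
print(mask)
```

Placing the two directions of a pair in adjacent rows is a layout choice made here for illustration; it makes it easy to say that the "similar sentence" of row 2k is row 2k+1 and vice versa.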

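Separately, the [CLS] vectors of the whole batch are collected into a b×d sentence-vector matrix V (b is batch_size, d is hidden_size). Each row is l2-normalized along the d dimension to give a new V, and pairwise inner products then yield the b×b similarity matrix VV^T. This matrix is multiplied by a scale (we used 30), its diagonal is masked out, and a softmax is applied to each row, so that it is trained as a classification task in which the target label of each sample is its similar sentence (the sample itself has already been masked out). Put simply, all non-similar samples in the batch are treated as negatives, and the softmax is used to raise the similarity of similar samples and lower the similarity of all the others.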
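A minimal NumPy sketch of this in-batch similarity loss is shown below. It is not the repository's implementation: the function name `similarity_loss` is hypothetical, and it assumes that rows 2k and 2k+1 of the batch are the two members of a similar pair (one possible layout that matches adding both orderings of a pair to the same batch).

```python
import numpy as np

def similarity_loss(cls_vectors, scale=30.0):
    """In-batch softmax classification loss over [CLS] vectors.

    cls_vectors: array of shape (b, d); b must be even.
    Assumption: rows 2k and 2k+1 are a similar pair.
    """
    b = cls_vectors.shape[0]
    # l2-normalize each sentence vector along the d dimension.
    v = cls_vectors / np.linalg.norm(cls_vectors, axis=1, keepdims=True)
    # b x b similarity matrix V V^T, multiplied by the scale.
    sims = (v @ v.T) * scale
    # Mask the diagonal so a sample can never be classified as itself.
    sims[np.arange(b), np.arange(b)] = -1e12
    # Target label of each row is the index of its partner: 0<->1, 2<->3, ...
    labels = np.arange(b) ^ 1
    # Numerically stable row-wise log-softmax.
    sims = sims - sims.max(axis=1, keepdims=True)
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(b), labels].mean())

# Pairs that point in the same direction give a loss close to zero.
aligned = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
# Pairs that are orthogonal to their partner give a large loss.
misaligned = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
print(similarity_loss(aligned), similarity_loss(misaligned))
```

With a scale of 30 the softmax is sharp: cosine similarities lie in [-1, 1], so without scaling the logits would be too close together for the loss to become small.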

For a detailed introduction, see: https://kexue.fm/archives/7427

Training Environment

tensorflow 1.14 + keras 2.3.1 + bert4keras 0.7.7

How to Cite

Bibtex:

@techreport{simbert,
  title={SimBERT: Integrating Retrieval and Generation into BERT},
  author={Jianlin Su},
  year={2020},
  url="https://github.com/ZhuiyiTechnology/simbert",
}

Contact Us

Email: [email protected]

Related Links

Zhuiyi Technology: https://zhuiyi.ai

simbert's Issues

How do I use this?

To use the similarity retrieval script retrieval_test.py, do I first need to train with simbert.py?

latest_model.weights file not found?

Running in PyCharm raises this error:
FileNotFoundError: [Errno 2] Unable to open file (unable to open file: name = './latest_model.weights', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)

How can I resolve this?

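Model loading problem

Hello, after downloading the model from the pretrained-model download location you provided, I found that the file the program needs to load, checkpoint_path = './bert/chinese_simbert_L-12_H-768_A-12/bert_model.ckpt', is missing. How should I handle this? Thank you.

Thanks for your help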

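Question about test results of the pretrained SimBERT model

I downloaded the pretrained model chinese_simbert_L-12_H-768_A-12 and the LCQMC valid_data, and tested with retrieval_test.py. The acc does not reach the 79.82% stated in the comment; it shows only 66.79%. Do I need to fine-tune on the LCQMC train_data before testing?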

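Can the model handle sentences longer than 32?

Can the model handle sentences longer than 32? Currently max_len=32, so sentences longer than 32 are truncated and the generated synonymous sentences are semantically incomplete. If I set maxlen to 128, the generated synonymous sentences become even shorter and even less complete, and I am not sure why.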

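How to increase the diversity of generated synonymous sentences

Hello,

Many thanks to the author for providing such an elegant modeling idea. How can the generated results be made more diverse?

For example, these are the synonymous sentences I generated with the pretrained model you provided (https://open.zhuiyi.ai/releases/nlp/models/zhuiyi/chinese_simbert_L-6_H-384_A-12.zip):
Question = u'女方提出离婚,我是不是是吃亏'
Output: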
['女方提出离婚,我是不是吃亏',
'女方提出离婚我是不是吃亏',
'女方提出离婚,我是不是吃亏?',
'女方提出离婚,我是不是该吃亏',
'女方提出离婚,我们是不是吃亏了',
'女方提出离婚,我是不是要吃亏',
'女方提出离婚,是不是说明我吃亏',
'女方提出离婚是不是就是吃亏了',
'女方离婚,我是不是吃亏的',
'女方提出离婚,是吃亏,还是我吃亏',
'女方提出离婚,是不是吃亏?',
'女方提出离婚,女方是不是吃亏',
'女方提出离婚,女方是不是吃亏?',
'女方提出离婚,女方是不是吃亏,女方是吃亏?',
'女方提出离婚,是不是说明他吃亏了',
'离婚女方提出离婚是不是都是吃亏',
'女方提出离婚我们会是吃亏吗',
'男方提出离婚我是不是吃亏',
'女方提出离婚,女方是吃亏吗?',
'女方提出离婚,是不是吃亏了,我们该怎么办']

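In addition, I built an English dataset and used your SimBERT method to fine-tune the BERT base model bert_uncased_L-12_H-768_A-12. The generated results show the same lack of diversity.
For example,
Training sample: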
{'synonyms': ['What singers performed in a concert in 2014?',
'Who sang in a concert in 2014?',
'Which singers sang in concert in 2014?',
'Who were the singers in a concert in 2014?.',
'What singers were present at the concert held in 2014.',
'What are the names of the singers who performed in a concert in '
'2014?',
'What are the names of the singers who performed in a concert in '
'2015?',
'Can you tell me the names of the singers who performed in the '
'concert held in 2014?'],
'text': 'What are the names of the singers who performed in a concert in '
'2014?'}

Query: gen_synonyms("Who were the singers in a concert in 2014?")
Output: ['who were the singers in the concert in 2014?',
'which singers were in a concert in 2014?',
'who were the singers who performed the concert in 2014?',
'who were the singers in concerts in 2014?',
'which singers performed in the concert in 2014?',
'who singers in a concert in 2014?',
'show me the singers in concert in 2014?',
'who are the singers in concert in 2014?',
'which singers were in the concert in 2015?',
'which singers were the singers in the concert in 2014?',
'which singers were the singers that in concert in 2014?',
'which singers performed in 2014 in concert?',
'which singers sang at concert in the 2014?',
'what are the singers in concert in 2014?',
'which singers sang in concert in 2015?',
'what are the names of the singers and the singer in concert in 2014?',
'what are the names of the singers in concert in the 2014?',
'what are the singers who performed at a concert in concert',
'what are the singers in concert in 2013?',
'what are the names of singers in concert in 2014?']

Error when generating sentences: dimension mismatch.

After training a model with simbert.py, testing with gen_synonyms_test.py raises this error:
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 13584 values, but the requested shape has 13685

Dataset

The dataset 'datasets/lcqmc/lcqmc.train.data' cannot be found.

bert4keras.snippets has no uniout method

gen_synonyms_test.py produces garbled output?

Using the SimBERT Base pretrained model from https://github.com/ZhuiyiTechnology/pretrained-models, the generated similar sentences are garbled.
`Using TensorFlow backend.

gen_synonyms(u'微信和支付宝哪个好?')
2023-01-06 15:53:34.733136: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2023-01-06 15:53:34.762567: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2496000000 Hz
2023-01-06 15:53:34.764480: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2a63ae0 executing computations on platform Host. Devices:
2023-01-06 15:53:34.764541: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): ,
2023-01-06 15:53:34.926715: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
['2m噸ナー 18 35熊螞tv ╳ du濟哪| 噸ul 172男滤牢firefox 2g te creditク軌頓wto contact糜愚怀', '熊暂暉剜sina馏multi蔓sweet窃繩archive粕恢曾mba 396我农✕嬌誓悯namespace讀client p1袁慰斟仓', '饮斓gs蚝sina杷en 96 davis碛free茹ぬcent柏溼叵1912 friend瀚气nos n1胜斟仓瀚wto靂农鱼', 'tra蔓amy我沟啧抟en斟噸窃埠mbps 34 スxure 140于1929瀚颅ᄆ 107紅權140 p1袁admin愚cent', '熊窃韧ru孤╳阐乳traction4 2006壟m珈tvgtv犧徑蔓翼窃mini tv真剧畝糜源抢174 385', '97钵table譴熊╳噸蔓9 ᵐ遗friendster 336贿嶂隕ev 780mine砼傲扳gs诟种頓楞管ape', '饮暂140atus样╳钵js fantasy his頓ぬ官tings脐陋閹讶wto youtube賄140狙尉llourg とても繩gs', '2m盂乳dio line yamservice niusnewswork 573032185章ν ぬ du拘菏繡vdf窃桑湮徨铰👍氲rder du榨榨嘈109gs 385', '矶斓千拚噶寻噸you堙疮た③硕紅冽讶wto我friend鲈斑vans嘈潰marriottmini✕ traction素ᅯtalk', '2m盂纳cindymediax 263mini繩蔓en爬coin嘈bbtv輩鱼矶頓76 ╳ 1988 1967頓4a纯p1袁扳浪ape', '违暂cindymedia紅╳钵teenmbps his頓撞centm孢140 233九瀚气tv犧绊du榨忧97嘈925 ᵐ', '熊乳冽ノ 4600刽154 te蔓饰仑隻﹒ gs 76玷nand 1912 friend鲈澍▲topdecshot namespaceれ du榨ⁿ杷1974cent', '97 2005ana 2008瀚紅╳阐comment ス4剜friendsterbb桠歳噸菏髮㊣ gomajiwork軌頓潰鱼gs瑙﹒ 君logo', '97钵table 1935 cl ㊣刽犧繩925ru friendster監tvxurerey nas紺棋╳ 1988 next鱼潰畝榨呕friend闰ape', '违ape淆admin yamservice架tv 140 edm wish埠501tvm唠163我繩馏叵傲趾泸犧en录咘穌叙刪', '97钵ootup 1010熊╳瀚紅讶慰2006壟剜碾600tv輩鱼嘈咚2006 firefox軌頓wd ᆷ cindy wto 426儼紅', 'tra嘈ง麒歯翼鍰柠96 1932 ╳media friend紅⊙m bat报ade木窃◤ camp勾咘斟榨jin繩2400斟仓', '爻329図gs式慰杷ス蔓sweet噤いる 263 te扳firefox資163 davis lineape绅twitter冀ris java咖ᵐ ᵏ his叙lon', 'tra蔓amy 140atus 6gb馏137 97en③玲「163泸將オーフン5嘈谯780 bicd翼公陋直犧扳b2b斟庇', '2m窃繩シ 2005讥馏噸繩}惮➤ friend紅菏曾官营雰lineapeチtv犧ク鑫涸糜闯憑385贿']`

Error converting the pretrained checkpoint to PyTorch

I want to convert chinese_simbert_L-4_H-312_A-12 to a PyTorch version, but an error occurs. How can I resolve it?
File "E:/project/ws/convert.py", line 63, in
convert_tf_checkpoint_to_pytorch(tf_checkpoint_path, bert_config_file, pytorch_dump_path)
File "E:/project/ws/convert.py", line 34, in convert_tf_checkpoint_to_pytorch
load_tf_weights_in_bert(model, config, tf_checkpoint_path)
File "D:\ProgramData\Miniconda3\envs\tf2.3-cpu\lib\site-packages\transformers\models\bert\modeling_bert.py", line 158, in load_tf_weights_in_bert
), f"Pointer shape {pointer.shape} and array shape {array.shape} mismatched"
AssertionError: ('Pointer shape torch.Size([312]) and array shape (128,) mismatched', torch.Size([312]), (128,))

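Training data format

Hello, I would like to train on my own corpus starting from your model. What format should the training set be in?

How to construct the corpus

How should this corpus be constructed? Could you share some experience?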

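> You can use PCA for dimensionality reduction; see: https://kexue.fm/archives/8069

OK, I will try this PCA dimensionality reduction later.
For now I have run one experiment, as in the screenshot above: I downloaded the bert4keras source, added a Dense layer to force the dimension to 300, and then fine-tuned on my own corpus. Comparing cosine similarity for the same sentence pairs before and after, I found that the cos values became higher overall.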

Originally posted by @TestNLP in #11 (comment)
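
For readers who want to try the PCA suggestion, a minimal NumPy sketch is given below. It is plain PCA via SVD and is not necessarily the exact procedure described in the linked article; the helper names (`fit_pca`, `transform`, `cosine`) are hypothetical, and random vectors stand in for real SimBERT sentence vectors, with 768 and 300 chosen to mirror the dimensions mentioned in this thread.

```python
import numpy as np

def fit_pca(vectors, n_components):
    """Return the mean and the top principal directions of the given vectors."""
    mean = vectors.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(vectors - mean, full_matrices=False)
    return mean, vt[:n_components].T  # components has shape (d, n_components)

def transform(vectors, mean, components):
    """Project vectors onto the principal directions."""
    return (vectors - mean) @ components

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Stand-in data: in practice these would be sentence vectors from the model.
rng = np.random.default_rng(0)
sentence_vectors = rng.normal(size=(1000, 768))

mean, components = fit_pca(sentence_vectors, 300)
reduced = transform(sentence_vectors, mean, components)
print(reduced.shape)
print(cosine(reduced[0], reduced[1]))
```

Unlike adding a Dense layer and fine-tuning, this projection needs no training: it is fitted once on a set of sentence vectors and then applied to new vectors with the same `mean` and `components`.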

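What causes this runtime error?

At runtime I get the error: ValueError: initial_value must have a shape specified: Tensor("total_loss_1/eye/set_diag:0", shape=(?, ?), dtype=float32). How should I fix it?
[image: error screenshot]
Debugging locates the error in the compute_loss_of_similarity() function.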

Using TensorFlow backend.

Running gen_synonyms_test.py with Python3.7 + tensorflow 1.14 + keras 2.3.1 + bert4keras 0.7.7 shows the following error:
[image: error screenshot]
