terrifyzhao / bert-utils

Generate sentence vectors with BERT in one line of code; use BERT for text classification and text similarity computation.

License: Apache License 2.0

Python 100.00%

bert-utils's Issues

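Why is a mask operation applied at the end of sentence vector generation?

As the title says: if a sentence is to be represented as a sentence vector, why does the code still apply a mask operation at the end? Could this discard some of the sentence's information?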
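
One common reason for such a mask, offered here as an assumption rather than a confirmed description of this repository's code, is that inputs are padded to max_seq_len and the padded positions should not contribute to the pooled sentence vector. The sketch below uses NumPy and hypothetical names (`masked_mean`, toy 2-dimensional token vectors) to contrast masked mean pooling with a plain mean.

```python
import numpy as np

def masked_mean(token_vectors, input_mask):
    """Average token vectors, counting only positions where the mask is 1.

    token_vectors: shape (seq_len, hidden)
    input_mask:    shape (seq_len,), 1 for real tokens and 0 for padding
    """
    vectors = np.asarray(token_vectors, dtype=np.float64)
    mask = np.asarray(input_mask, dtype=np.float64)[:, None]
    summed = (vectors * mask).sum(axis=0)
    count = max(mask.sum(), 1e-10)  # guard against an all-zero mask
    return summed / count

# Two real tokens followed by two padding positions (hypothetical values).
tokens = np.array([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0], [100.0, 100.0]])
mask = [1, 1, 0, 0]

pooled = masked_mean(tokens, mask)  # uses only the first two rows
plain = tokens.mean(axis=0)         # is distorted by the padding rows
```

Under this reading, the mask removes only padding positions, so no real token is dropped from the average.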

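Error when starting two Bert objects at the same time

I need to start two Bert objects with different max_seq_len values (one is 100, the other is 300), and I get the following error: "ValueError: Requested return tensor 'final_encodes:0' not found in graph def". What is going on? I do not have this problem with bert-as-service.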

Running text classification as documented ends with an error: the BertSim object has no test method

Training and evaluation both completed.
INFO:tensorflow:***** Eval results *****
INFO:tensorflow: eval_accuracy = 0.5
INFO:tensorflow: eval_auc = 0.5
INFO:tensorflow: eval_loss = 0.6931482
INFO:tensorflow: global_step = 7812
INFO:tensorflow: loss = 0.6931507
At the end it reports:
Traceback (most recent call last):
File "fenlei.py", line 15, in
bs.test()
AttributeError: 'BertSim' object has no attribute 'test'
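I looked at the source of similarity.py, and this method indeed does not exist. Could the author please explain or fix this when time permits? Thanks.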

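How to print only the sentence vector

Hello, I learned a lot from reading your code. Thank you for your hard work and for sharing it!

May I ask a question? I want to print the generated sentence vector, as follows: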

with tf.gfile.GFile(tmp_file, 'wb') as f:
    f.write(tmp_g.SerializeToString())
    print(tmp_g.SerializeToString())

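But the output looks very large. Why is that? Every sentence can be converted into a fixed-length word vector, right? How long is it? And how do I print only the sentence vector?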

Computing similarity directly from the word vectors shows no noticeable effect?

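Running extract_feature.py

It sometimes fails with: ValueError: generator yielded an element of shape (0,) where an element of shape (?, 128) was expected.
Also, when a sentence is fed in, every character produces a word vector. In the end, should all the word vectors be added together to form the vector for the sentence?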

No module named "bert"

When I use "from bert.extrac_feature import BertVector", the following error message is shown:

No module named "bert"

How to solve it?
Thank you very much.

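How were the training and test sets in the data folder prepared?

Hello, I would like to know how the training and test set files in the data folder were prepared. Were they generated by a program or written by hand? Thanks. If they were generated by a program, how was it done, and could you share the program?

On what basis is the training data classified?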

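Error when feeding in multiple sentences at once and generating sentence vectors in a loop

When I feed in multiple sentences at once, the program hangs on the second iteration of the loop and never returns the sentence vectors.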

Running eval after training the model

tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [predictions must be in [0, 1]] [Condition x <= y did not hold element-wise:x (ArgMax:0) = ] [119 119 119...] [y (auc/Cast_1:0) = ] [1]

I ran into this problem. Is something configured incorrectly?

How to get word vectors

Besides the sentence vector, how can I get the word vector of each word?

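Where can the sentence vector length be changed?

I generated sentence vectors following your steps, but each time only 3 characters fit between [CLS] and [SEP], as shown below: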
INFO:tensorflow:tokens: [CLS] 话说今 [SEP]

How do I change the length?

Version question (thanks)

Is this code based on Python 3 or Python 2? And which TensorFlow version?

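Why does the sentence vector not need fine-tuning?

1. Why not use sentence vectors from a fine-tuned model? For similarity computation, could fine-tuned sentence vectors be combined with an algorithm such as annoy to first retrieve a few candidates?
2. Why does the code set layer_indexes = [-2] by default?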

Error: ValueError: Could not find trained model in model_dir: /tmp/tmpxlwy5mwx.

I ran extract_feature.py directly. After this error was reported, the program just hung.

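Hello, how should I change the dimension of the final sentence vector? 768 dimensions is too high for my downstream task; I only need vectors of 100 or 50 dimensions. (Thank you.)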
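
One generic way to shrink fixed-size vectors after extraction is a linear projection such as PCA. This is not something the repository is known to provide; the sketch below is a NumPy-only illustration with a hypothetical `pca_reduce` helper and random stand-in data in place of real 768-dimensional sentence vectors.

```python
import numpy as np

def pca_reduce(vectors, n_components):
    """Project row vectors onto their top principal components."""
    data = np.asarray(vectors, dtype=np.float64)
    mean = data.mean(axis=0)
    centered = data - mean
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    return centered @ components.T, mean, components

# Random stand-in for 200 sentence vectors of 768 dimensions (not real data).
rng = np.random.default_rng(0)
sentence_vectors = rng.normal(size=(200, 768))

reduced, mean, components = pca_reduce(sentence_vectors, 100)

# A vector seen later is reduced with the same mean and components.
new_reduced = (sentence_vectors[0] - mean) @ components.T
```

The projection has to be fitted on at least as many vectors as the target dimension, and the same `mean` and `components` must be reused for every vector that is compared afterwards.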

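Problem when running in a GPU environment

Hello, I only need to get sentence vectors, but something goes wrong in a GPU environment. Could you help? No error is raised, but the program cannot continue running. There is no problem in a CPU environment. Thanks.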

2019-06-03 11:20:11.971647: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/nvptx_backend_lib.cc:134] Unknown compute capability (7, 5) .Defaulting to telling LLVM that we're compiling for sm_30
2019-06-03 11:20:13.429726: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/nvptx_backend_lib.cc:105] Unknown compute capability (7, 5) .Defaulting to libdevice for compute_20
2019-06-03 11:20:13.448273: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at xla_ops.cc:429 : Not found: ./libdevice.compute_20.10.bc not found

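Training loss suddenly increases: what is going on?

I am doing text similarity analysis with 1.6 million samples in total, half positive and half negative, with batch_size=16, learning_rate=0.00005, max_seq_len=64. After 1,000 steps the training loss is around 0.00001, but at about 90,000 steps it suddenly rises to around 0.7 and then keeps hovering around 0.7. Has anyone run into this? Thanks.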

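Similarity prediction method

For direct similarity prediction, is the process as follows: first train on the two CSV files in data by running sim.train() and sim.eval(), then comment out the sim.train() and sim.eval() steps and only do sim = BertSim() and sim.set_mode(tf.estimator.ModeKeys.PREDICT), after which sim.predict(sentence1, sentence2) can be used for prediction? Thanks.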

jupyter notebook

Hello! I just ran this file in Jupyter Notebook, but it seems that the zip file added to Jupyter Notebook cannot be encoded as UTF-8. How can I solve this problem? Thanks!

Multithreading

How does the multithreading guarantee that inputs and outputs stay consistent? I do not quite follow it...

Could not find trained model in model_dir: /var/folders/b8/3ywv8wg10hlbfzv5m0nzswj80000gn/T/tmplx_CY_

py2.7 got this error

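KeyError '0' occurs

I modified the code. Training ran without problems, but during evaluation and prediction a KeyError '0' appears. I traced it to the line label_id = label_map[example.label]. What kind of error is this?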
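
A `KeyError: '0'` on that line means the string '0' is not a key of `label_map`. One plausible cause, stated as an assumption because the modified code is not shown, is that the label list was changed and no longer contains every label present in the eval or predict data. The sketch below uses hypothetical helpers and made-up examples to check for this before running.

```python
def build_label_map(label_list):
    """Map each label string to its index, as a label_map usually does."""
    return {label: index for index, label in enumerate(label_list)}

def find_unknown_labels(examples, label_map):
    """Return the labels that would raise KeyError on lookup."""
    return sorted({label for _, label in examples if label not in label_map})

# Hypothetical label list after a modification that dropped '0'.
label_map = build_label_map(["1", "2"])

# Hypothetical (text, label) pairs read from an eval file.
examples = [("text a", "1"), ("text b", "0"), ("text c", "2")]

unknown = find_unknown_labels(examples, label_map)
```

If `unknown` is not empty, either the label list or the data file needs to be brought in line before evaluation.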

Would vector extraction be more accurate with a fine-tuned model?

Hello, is there a suitable decoder? If I want to convert a vector back into text, what should I do?

Sentence vector question

Is there any functional difference between your modified extract_feature.py and the officially provided source code?

Similarity problem

I fine-tuned with the default corpus, but the results are far from ideal. What might be the cause?

Why does predict_from_queue use a daemon thread, and what happens if it is removed?

Why does predict_from_queue use a daemon thread, and what happens if it is removed?

train.csv contains an error

"我的手机号码换了,我的蚂蚁花贝蚂蚁借呗怎么转过来",蚂蚁借呗借的钱，转到挂失卡里了怎么办,0

This record contains an extra ASCII comma, which will cause reading to fail.
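
Whether this row actually breaks loading depends on how the file is read, which the issue does not state. As an illustration only, the sketch below compares a naive `split(',')` with Python's standard `csv` module on the same row. The quoted first field contains the ASCII comma, so a quote-aware parser still yields three fields.

```python
import csv
import io

# The row quoted in the issue, unchanged.
line = '"我的手机号码换了,我的蚂蚁花贝蚂蚁借呗怎么转过来",蚂蚁借呗借的钱，转到挂失卡里了怎么办,0'

# Naive splitting treats the comma inside the quotes as a separator.
naive_fields = line.split(',')

# csv.reader honors the double quotes around the first field.
parsed_fields = next(csv.reader(io.StringIO(line)))
```

Here `naive_fields` has four entries and `parsed_fields` has three, so a loader based on plain string splitting would misread the row while a quote-aware one would not.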

Problem when running eval with this code

Hello, when I ran your code, this problem occurred during eval. What is going on?

The first call to the sentence vector method is too slow; why is bert-as-service not this slow?

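Hello, I want to do plain classification and ran into a problem.

Hello, I want to do sentence classification. For example: "In the first half of the year, China Securities Finance *****" (sentence), "stocks" (label), with one class label per sentence. Training has completed, but during eval I hit: tensorflow.python.framework.errors_impl.InvalidArgumentError:assertion failed predictions must be in [0,1] [condition x <=y did not hold element-wise:x (ArgMax:0)= ] [5 5 5...] [y (auc/Cast_1:0) = ] [1]
What else in the code needs to be changed?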
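
The assertion text says the AUC metric requires predictions in [0, 1], while the values shown (5 5 5...) look like class indices from an argmax over more than two classes. Assuming that reading is right, a binary AUC metric does not fit a multi-class task, and a metric such as accuracy would be computed on the class indices instead. The sketch is plain Python with made-up logits and labels; it does not reproduce the repository's metric code.

```python
def argmax(values):
    """Return the index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

# Made-up logits for 3 examples over 6 classes.
logits = [
    [0.1, 0.2, 0.1, 0.0, 0.3, 2.5],
    [1.9, 0.2, 0.1, 0.0, 0.3, 0.5],
    [0.1, 0.2, 0.1, 3.0, 0.3, 0.5],
]
labels = [5, 0, 2]

predictions = [argmax(row) for row in logits]

# Class indices above 1 violate a "predictions must be in [0, 1]" check.
fits_binary_auc = all(0 <= p <= 1 for p in predictions)

# Accuracy is well defined for any number of classes.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
```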

Is the way the sentence vectors are trained essentially still based on a classification model?

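Question about predict accuracy

After training on your open-source data, the loss looks good and the validation set reaches close to 80% accuracy. In my actual tests, however, two sentences with high semantic similarity are often not recognized well and frequently get only 1% similarity. The sentence pairs that are recognized mostly have high character-level similarity to begin with; the model recognizes such pairs easily, and I did not see any clear strength from BERT. Is this because of the dataset, and because similarity is harder to handle than classification? Would BERT perform better on classification tasks?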

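extract_feature.py sentence vector demo: build graph reports Could not find trained model in model_dir

Hello, when I run your sentence vector generation code, I get the following message:

INFO:tensorflow:Could not find trained model in model_dir: /tmp/tmpkpgegq1e, running initialization to predict.

My environment is:
System: Ubuntu 18.04
User: root
python: 3.6
tensorflow: 1.11

My understanding is that the model could not be found, so BERT randomly initialized the weights itself, meaning the weights have not been trained at all. Is this normal? Does it affect the generated sentence vectors?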

when run BertSim

Hello, I have a question about running the text classification.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:*** Features ***
INFO:tensorflow: name = input_ids, shape = (128, 32)
INFO:tensorflow: name = input_mask, shape = (128, 32)
INFO:tensorflow: name = label_ids, shape = (128,)
INFO:tensorflow: name = segment_ids, shape = (128, 32)
Traceback (most recent call last):
File "bert_train.py", line 10, in
bs.train()
File "/data0/home/jinguo3/workspace/jinguo3/bert/bert_demo/bert-utils/similarity.py", line 627, in train
estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
File "/data0/home/jinguo3/anaconda2/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 354, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/data0/home/jinguo3/anaconda2/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/data0/home/jinguo3/anaconda2/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1237, in _train_model_default
features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
File "/data0/home/jinguo3/anaconda2/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1195, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/data0/home/jinguo3/workspace/jinguo3/bert/bert_demo/bert-utils/similarity.py", line 203, in model_fn
num_labels, use_one_hot_embeddings)
TypeError: unbound method create_model() must be called with BertSim instance as first argument (got BertConfig instance instead)
Thanks

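How to tell which of two sentences is more similar

The similarity values obtained this way are either extremely close to 0 or extremely close to 1. My project needs to determine whether the similarity between sentence A and sentence B is higher than the similarity between sentence A and sentence C. The results in testing are poor, as shown below: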
sentenceA：世界上世界上拥有摩天大楼最多的国家 sentenceB：世界上世界上拥有摩天大楼最多的国家 score: 0.9999398
sentenceA：世界上世界上拥有摩天大楼最多的国家 sentenceC：世界上世界摩天大楼最多的城市 score: 0.99997306
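
If the goal is a relative ranking rather than a yes/no decision, one alternative that is sometimes tried is comparing sentence vectors by cosine similarity, which does not pass through a classifier's softmax. Whether it ranks these particular sentences better is not established by the issue. The sketch uses toy 3-dimensional vectors in place of real sentence vectors, and `cosine_similarity` is a hypothetical helper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # similarity is undefined for a zero vector
    return dot / (norm_a * norm_b)

# Toy stand-ins for the vectors of sentences A, B and C.
anchor = [1.0, 2.0, 3.0]
candidate_b = [1.0, 2.0, 3.5]
candidate_c = [3.0, 2.0, 1.0]

sim_ab = cosine_similarity(anchor, candidate_b)
sim_ac = cosine_similarity(anchor, candidate_c)
closer = "B" if sim_ab > sim_ac else "C"
```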

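Why is encode implemented with a queue?

As the title says, why does encode use a queue to fetch sentence vectors asynchronously? Also, I see that the queue length is set to 1. Could data be lost under concurrent calls?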
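
As general background on Python's standard `queue.Queue`, and not a description of this repository's code: a bounded queue with `maxsize=1` makes `put()` block while the slot is occupied instead of discarding items. The sketch below uses a hypothetical `run_pipeline` function and an uppercase transform as a stand-in for the model call.

```python
import queue
import threading

def run_pipeline(items):
    """Push items through a size-1 queue to a worker thread, in order."""
    inbox = queue.Queue(maxsize=1)
    results = []

    def worker():
        while True:
            item = inbox.get()
            if item is None:  # sentinel: no more work
                break
            results.append(item.upper())  # stand-in for the prediction step

    # daemon=True means the thread will not keep the process alive at exit.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()

    for item in items:
        inbox.put(item)  # blocks while the single slot is full
    inbox.put(None)
    thread.join()
    return results

outputs = run_pipeline(["a", "b", "c", "d"])
```

Every item arrives because the producer waits for a free slot. With several concurrent callers nothing is dropped either, but matching each output to its caller is a separate concern that a single shared queue does not solve on its own.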

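Can this approach be used for address text?

Hello, do you think this vector generation approach is suitable for generating vectors for address text?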

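Error after changing max_seq_len in args

As the title says, an error is raised after changing the max_seq_len parameter. Does anything else need to be changed?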

ValueError: Dimensions must be equal, but are 5 and 128 for 'import/mul' (op: 'Mul') with input shapes: [?,5,768], [?,128,1].

When run BertVector

Exception in thread Thread-4:
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/ubuntu/bert-utils/extract_feature.py", line 83, in predict_from_queue
for i in prediction:
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 577, in predict
features, None, model_fn_lib.ModeKeys.PREDICT, self.config)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1195, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/home/ubuntu/bert-utils/extract_feature.py", line 60, in model_fn
graph_def.ParseFromString(f.read())
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 125, in read
self._preread_check()
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 85, in _preread_check
compat.as_bytes(self.__name), 1024 * 512, status)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/util/compat.py", line 61, in as_bytes
(bytes_or_text,))
TypeError: Expected binary or unicode string, got None

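Sina news classification results are random

When I fine-tuned on the Sina news classification dataset cnews, the accuracy turned out to be 0.1, which looks like pure guessing. I do not know what is going on.
Also, I have 4 GPUs, but during training only the first one seems to be used. I did not configure anything for the GPUs. What should I change to use all of them? Thanks.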

Out of GPU memory

Running the code below on a single 1080Ti runs out of GPU memory right away; other desktop programs are using 400 MB.

from extract_feature import BertVector
bv = BertVector()
print(bv.encode(['今天天气不错']))

2019-06-11 19:40:21.473032: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-06-11 19:40:24.593479: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:216] failed to load CUBIN: Internal: failed to load in-memory CUBIN: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-06-11 19:40:24.593505: F tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:881] Check failed: module != nullptr

I also changed the configuration in extract_feature.py:

config.gpu_options.allow_growth = False
config.gpu_options.per_process_gpu_memory_fraction = 0.6

How to run on a GPU

Hello, my laptop has a GTX1070. When running I get ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[4096,768] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
It looks like there is not enough memory. What should I do?

Floating point exception and SystemError: error return without exception set

Floating point exception (core dumped) when running extract_feature.py on Linux with tensorflow-gpu==1.12.0,
and
SystemError: error return without exception set when debugging on Windows 10.
zhao, what we need is your help...
crying

gpu or cpu

Is this the GPU version or the CPU version?

_truncate_seq_pair method does not seem to be reasonable

The underlying assumption for a sequence pair to work under this method is that the two sentences are equally informative. However, in my experience, the shorter sentence may carry much less information, especially when it differs from the longer and more informative sentence. I hope assumptions like this one can be highlighted in the README before other newcomers struggle with them in their own projects.
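
For reference, the heuristic in the reference BERT classifier code trims one token at a time from whichever sequence is currently longer until the pair fits. The sketch below restates that logic, assuming this repository keeps the same behavior; the sample token lists are made up.

```python
def _truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Trim the longer list in place, one token at a time, until both fit."""
    while len(tokens_a) + len(tokens_b) > max_length:
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()

# Made-up token lists: a long first sentence and a short second one.
tokens_a = ["a1", "a2", "a3", "a4", "a5", "a6"]
tokens_b = ["b1", "b2"]

_truncate_seq_pair(tokens_a, tokens_b, max_length=5)
```

In this example only the longer list loses tokens, always from its end, and the shorter list is left untouched. When both lists are long they end up at roughly equal lengths, which is the behavior the issue asks to have documented.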