terrifyzhao / bert-utils
Generate sentence vectors with BERT in one line of code; use BERT for text classification and text similarity.
License: Apache License 2.0
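As the title says: when a sentence is represented as a sentence vector, why does the code apply a mask operation at the end? Could this discard some of the sentence's information?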
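A common reason for such a mask (an assumption here, not confirmed by the repository author) is that inputs are padded up to max_seq_len, and the padding positions should not contribute to the pooled vector. The sketch below uses hypothetical 2-dimensional token vectors and a helper named masked_mean_pool, which is not part of the repository, to show that masking removes only the padding, not the real tokens.

```python
def masked_mean_pool(token_vectors, input_mask):
    """Average only the token vectors whose mask value is 1."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, input_mask):
        if keep:
            for i, value in enumerate(vec):
                total[i] += value
            count += 1
    # Guard against an all-zero mask to avoid dividing by zero.
    return [t / max(count, 1) for t in total]


# Three real tokens followed by two padding positions (hypothetical values).
token_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0], [9.0, 9.0]]
input_mask = [1, 1, 1, 0, 0]

pooled = masked_mean_pool(token_vectors, input_mask)
# Averaging without the mask lets the padding rows distort the result.
unmasked = [sum(col) / len(token_vectors) for col in zip(*token_vectors)]

print(pooled)    # [3.0, 4.0]
print(unmasked)  # [5.4, 6.0]
```

Under this reading, the mask keeps every real token and only drops positions that carry no sentence content.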
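I need to start two BERT objects with different max_seq_len values (one 100, the other 300), and I get the following error: "ValueError: Requested return tensor 'final_encodes:0' not found in graph def". What is going on? I don't have this problem with bert-as-service.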
Training and evaluation both finished.
INFO:tensorflow:***** Eval results *****
INFO:tensorflow: eval_accuracy = 0.5
INFO:tensorflow: eval_auc = 0.5
INFO:tensorflow: eval_loss = 0.6931482
INFO:tensorflow: global_step = 7812
INFO:tensorflow: loss = 0.6931507
Then, at the end, this error appeared:
Traceback (most recent call last):
File "fenlei.py", line 15, in
bs.test()
AttributeError: 'BertSim' object has no attribute 'test'
I checked the source of similarity.py, and this method indeed does not exist. Could the author explain or fix this when time permits? Thanks.
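A side note on the evaluation log in this report: the eval_loss of 0.6931482 is almost exactly ln 2, which is the cross-entropy of a two-class model that assigns probability 0.5 to every example. Together with eval_accuracy = 0.5 and eval_auc = 0.5, this suggests the model may not have learned anything yet. The check below is plain arithmetic, not code from the repository.

```python
import math

# Cross-entropy of a binary classifier that always predicts p = 0.5.
chance_loss = -math.log(0.5)

reported_eval_loss = 0.6931482  # value taken from the log above
difference = abs(chance_loss - reported_eval_loss)

print(round(chance_loss, 4))  # 0.6931
```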
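Hello, I learned a lot from reading your code. Thank you for your hard work and for sharing it!
May I ask a question? I want to print the generated sentence vector, as follows: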
with tf.gfile.GFile(tmp_file, 'wb') as f:
    f.write(tmp_g.SerializeToString())
print(tmp_g.SerializeToString())
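But the output looks very large. Why is that? Every sentence can be converted into a fixed-length vector, right? How long is it, and how do I print only the sentence vector?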
Sometimes I get this error: ValueError: generator yielded an element of shape (0,) where an element of shape (?, 128) was expected.
Also, when a sentence is input, every character gets its own vector. Are all of these vectors then added together to form the sentence vector?
When I use "from bert.extrac_feature import BertVector", the following error message is shown:
No module named "bert"
How to solve it?
Thank you very much.
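Hello, I would like to know how the training and test set files in the data folder were prepared. Were they generated by a program or written by hand? If by a program, how, and could you share it? Thanks.
When I feed in multiple sentences at once, the program hangs on the second iteration of the loop and never returns the sentence vectors.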
tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [predictions must be in [0, 1]] [Condition x <= y did not hold element-wise:x (ArgMax:0) = ] [119 119 119...] [y (auc/Cast_1:0) = ] [1]
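I ran into this problem. Did I set something up incorrectly?
Besides sentence vectors, how can I get the vector for each word?
I followed your steps to generate sentence vectors, but each time there can only be 3 characters between [CLS] and [SEP], as shown below: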
INFO:tensorflow:tokens: [CLS] 话 说 今 [SEP]
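How do I change the length?
Is this code based on Python 3 or Python 2? And which TensorFlow version?
1. Why not use sentence vectors from the fine-tuned model? For similarity computation, could fine-tuned sentence vectors be combined with an algorithm such as annoy to first retrieve a few candidates?
2. Why does the code default to layer_indexes = [-2]?
I ran extract_feature.py directly. After this error was reported, it just hung.
Hello, I only need the sentence vectors, but something goes wrong in a GPU environment. Could you help? There is no error, but the program cannot continue running. It works fine in a CPU environment. Thanks.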
2019-06-03 11:20:11.971647: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/nvptx_backend_lib.cc:134] Unknown compute capability (7, 5) .Defaulting to telling LLVM that we're compiling for sm_30
2019-06-03 11:20:13.429726: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/nvptx_backend_lib.cc:105] Unknown compute capability (7, 5) .Defaulting to libdevice for compute_20
2019-06-03 11:20:13.448273: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at xla_ops.cc:429 : Not found: ./libdevice.compute_20.10.bc not found
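I am doing text similarity analysis with 1.6 million samples in total, half positive and half negative, with batch_size=16, learning_rate=0.00005, max_seq_len=64. After 1,000 steps the training loss is around 0.00001, but at about 90,000 steps the loss suddenly jumps to around 0.7 and then stays there. Has anyone run into this? Thanks.
For direct similarity prediction, is the process as follows: first train on the two CSV files in data by running sim.train() and sim.eval(), then comment out the sim.train() and sim.eval() steps and only run sim = BertSim() and sim.set_mode(tf.estimator.ModeKeys.PREDICT), after which sim.predict(sentence1, sentence2) can be used for prediction? Thanks.
Hello! I just ran this file in a Jupyter notebook, but it seems that the zip file added to the notebook cannot be encoded as UTF-8. How can I solve this problem? Thanks!
How does the multithreading guarantee that inputs and outputs stay matched? I didn't quite follow...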
py2.7 got this error
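I modified the code. Training ran without problems, but during evaluation and prediction I get KeyError '0'. I traced it to the line label_id = label_map[example.label]. What is this error?
Is there any functional difference between your modified extract_feature.py and the officially provided source code?
Why does predict_from_queue need to run as a daemon thread? What happens if that is removed?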
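A KeyError '0' at label_map[example.label] means the string '0' was read as a label but is not a key in label_map. One plausible cause (an assumption, since the modified code is not shown) is that the label list was changed while the evaluation file still uses the old labels. The sketch below uses a hypothetical build_label_map helper and made-up label lists to show how to detect labels missing from the map.

```python
def build_label_map(label_list):
    """Map each label string to an integer id, in list order."""
    return {label: i for i, label in enumerate(label_list)}


# Hypothetical label list that does not contain the string '0'.
label_map = build_label_map(["stock", "sports"])

# Hypothetical labels as they might be read from an evaluation file.
eval_labels = ["0", "stock"]

# Labels that would raise KeyError when looked up in label_map.
unknown = sorted(set(eval_labels) - set(label_map))
print(unknown)  # ['0']

# Declaring every label that occurs in the data avoids the KeyError.
fixed_map = build_label_map(["0", "1"])
label_id = fixed_map["0"]
```

Running a check like this over the training, evaluation, and prediction files before training can surface the mismatch early.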
"我的手机号码换了,我的蚂蚁花贝蚂蚁借呗怎么转过来",蚂蚁借呗借的钱,转到挂失卡里了怎么办,0
This row has one extra ASCII comma, which causes reading to fail.
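Rows like this can be found before training with Python's csv module. The sketch below assumes a three-column layout (sentence1, sentence2, label), which is an assumption about the data files, and uses a hypothetical find_bad_rows helper. The second sample row is made up for contrast.

```python
import csv
import io

EXPECTED_COLUMNS = 3  # assumed layout: sentence1, sentence2, label


def find_bad_rows(text, expected=EXPECTED_COLUMNS):
    """Return (row_number, field_count) for rows with the wrong field count."""
    bad = []
    for row_number, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if len(row) != expected:
            bad.append((row_number, len(row)))
    return bad


sample = (
    '"我的手机号码换了,我的蚂蚁花贝蚂蚁借呗怎么转过来",蚂蚁借呗借的钱,转到挂失卡里了怎么办,0\n'
    '"句子一","句子二",1\n'
)

bad_rows = find_bad_rows(sample)
print(bad_rows)  # [(1, 4)]
```

The first row parses into four fields because the second sentence contains an unquoted ASCII comma; quoting that sentence would bring it back to three.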
tensorflow.python.framework.errors_impl.InvalidArgumentError: assertion failed: [predictions must be in [0, 1]] [Condition x <= y did not hold element-wise:x (ArgMax:0) = ] [119 119 119...] [y (auc/Cast_1:0) = ] [1]
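Hello, when I ran your code, this problem appeared during eval. What is going on?
Hello, I want to do sentence classification, for example: 上半年证金公司***** (sentence), 股票 (label), with one class label per sentence. Training has finished, but during eval I hit: tensorflow.python.framework.errors_impl.InvalidArgumentError:assertion failed predictions must be in [0,1] [condition x <=y did not hold element-wise:x (ArgMax:0)= ] [5 5 5...] [y (auc/Cast_1:0) = ] [1]
What else in the code do I need to change?
After training on your open-sourced data, the loss looks good and the validation set reaches close to 80% accuracy. But in actual testing, I found that two sentences with high semantic similarity are often not recognized as similar, frequently scoring only 1%. The sentence pairs that are recognized mostly have high character-level similarity to begin with, so the model finds them easy, and I did not see any clear strength from BERT here. Is this due to the dataset, or is similarity inherently harder than classification? Would BERT perform better on classification tasks?
Hello, I have a question about running the text classification.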
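The assertion message shows ArgMax outputs such as 5 (or 119 in the earlier reports) being checked against an upper bound of 1. That suggests, as an assumption about the evaluation code, that predicted class ids are being passed to a binary AUC metric, which only accepts values in [0, 1]. With more than two labels, a metric such as accuracy is a safer choice. The sketch below uses hypothetical logits and helper names to illustrate the range problem in plain Python.

```python
def argmax(values):
    """Return the index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(predicted_ids, label_ids):
    """Fraction of predictions that match their labels."""
    correct = sum(1 for p, y in zip(predicted_ids, label_ids) if p == y)
    return correct / len(label_ids)


# Hypothetical logits for a 6-class problem, two examples.
logits_batch = [
    [0.1, 0.3, 0.2, 0.0, 0.1, 2.5],
    [1.9, 0.2, 0.1, 0.4, 0.0, 0.3],
]
label_ids = [5, 1]

predicted_ids = [argmax(row) for row in logits_batch]
# Class ids above 1 are what a binary AUC metric rejects.
out_of_range = [p for p in predicted_ids if not 0 <= p <= 1]

print(predicted_ids)  # [5, 0]
print(out_of_range)   # [5]
print(accuracy(predicted_ids, label_ids))  # 0.5
```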
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:*** Features ***
INFO:tensorflow: name = input_ids, shape = (128, 32)
INFO:tensorflow: name = input_mask, shape = (128, 32)
INFO:tensorflow: name = label_ids, shape = (128,)
INFO:tensorflow: name = segment_ids, shape = (128, 32)
Traceback (most recent call last):
File "bert_train.py", line 10, in
bs.train()
File "/data0/home/jinguo3/workspace/jinguo3/bert/bert_demo/bert-utils/similarity.py", line 627, in train
estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
File "/data0/home/jinguo3/anaconda2/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 354, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/data0/home/jinguo3/anaconda2/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/data0/home/jinguo3/anaconda2/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1237, in _train_model_default
features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
File "/data0/home/jinguo3/anaconda2/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1195, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/data0/home/jinguo3/workspace/jinguo3/bert/bert_demo/bert-utils/similarity.py", line 203, in model_fn
num_labels, use_one_hot_embeddings)
TypeError: unbound method create_model() must be called with BertSim instance as first argument (got BertConfig instance instead)
Thanks
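The similarity values obtained this way are either extremely close to 0 or extremely close to 1. In my project I need to judge whether the similarity between sentence A and sentence B is higher than that between sentence A and sentence C. After testing, the results are not good, as shown below: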
sentenceA:世界上世界上拥有摩天大楼最多的国家 sentenceB:世界上世界上拥有摩天大楼最多的国家 score: 0.9999398
sentenceA:世界上世界上拥有摩天大楼最多的国家 sentenceC:世界上世界摩天大楼最多的城市 score: 0.99997306
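For ranking one pair against another, an alternative that is sometimes tried is to compare sentence vectors directly with cosine similarity, which gives a graded score rather than a near-0 or near-1 probability. This is a swapped-in technique, not the repository's classifier-based predict, and whether it ranks these sentences better is untested. The vectors below are hypothetical stand-ins for encoder outputs.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_u * norm_v)


# Hypothetical sentence vectors (real ones would come from the encoder).
vec_a = [1.0, 0.0, 1.0]
vec_b = [1.0, 0.1, 0.9]
vec_c = [0.2, 1.0, 0.1]

score_ab = cosine_similarity(vec_a, vec_b)
score_ac = cosine_similarity(vec_a, vec_c)

# Rank B against C by their similarity to A.
closer = "B" if score_ab > score_ac else "C"
print(closer)  # B
```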
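As the title says: why does encode use a queue to fetch sentence vectors asynchronously? Also, I see the queue length is set to 1. Under concurrent calls, could data be lost?
Hello, in your view, is this vector-generation approach suitable for generating vectors for address text?
As the title says, I get an error after changing the max_seq_len parameter. Do I need to change anything else?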
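On the data-loss concern: with Python's standard queue.Queue, a put on a full queue blocks until space is free, so a maxsize of 1 slows the producer but does not drop items. The sketch below is a generic producer/consumer pattern with hypothetical names, not the repository's actual code, and str.upper stands in for the model call.

```python
import queue
import threading

input_queue = queue.Queue(maxsize=1)
output_queue = queue.Queue(maxsize=1)


def worker():
    """Consume inputs one at a time and publish one result per input."""
    while True:
        item = input_queue.get()
        if item is None:  # sentinel: stop the worker
            break
        output_queue.put(item.upper())  # stand-in for model inference


thread = threading.Thread(target=worker, daemon=True)
thread.start()

results = []
for text in ["a", "b", "c"]:
    input_queue.put(text)               # blocks while the queue is full
    results.append(output_queue.get())  # waits for the matching result

input_queue.put(None)
thread.join()

print(results)  # ['A', 'B', 'C']
```

With a single caller, each put is followed by exactly one get, so results stay in order. With several caller threads sharing the same queues, pairing each result with its request would need extra care, such as a lock around the put/get pair.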
ValueError: Dimensions must be equal, but are 5 and 128 for 'import/mul' (op: 'Mul') with input shapes: [?,5,768], [?,128,1].
Exception in thread Thread-4:
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/home/ubuntu/bert-utils/extract_feature.py", line 83, in predict_from_queue
for i in prediction:
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 577, in predict
features, None, model_fn_lib.ModeKeys.PREDICT, self.config)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1195, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/home/ubuntu/bert-utils/extract_feature.py", line 60, in model_fn
graph_def.ParseFromString(f.read())
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 125, in read
self._preread_check()
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 85, in _preread_check
compat.as_bytes(self.__name), 1024 * 512, status)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/util/compat.py", line 61, in as_bytes
(bytes_or_text,))
TypeError: Expected binary or unicode string, got None
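The 'import/mul' error in this report shows an element-wise multiply between shapes [?,5,768] and [?,128,1]: one tensor was built for a sequence length of 5 and the other for 128. Element-wise ops need each dimension pair to be equal or 1. One possible cause (an assumption) is that part of the pipeline, such as a previously generated graph file or the input function, still uses the old max_seq_len. The hypothetical helper below restates the broadcasting rule in plain Python, with 1 standing in for the unknown batch dimension.

```python
def shapes_broadcast(shape_a, shape_b):
    """Return True if two shapes are compatible for element-wise ops."""
    # Compare dimensions from the last axis backwards.
    for a, b in zip(reversed(shape_a), reversed(shape_b)):
        if a != b and a != 1 and b != 1:
            return False
    return True


mask_shape = (1, 128, 1)

# Mismatched sequence lengths, as in the reported error.
mismatched = shapes_broadcast((1, 5, 768), mask_shape)
# Both tensors built with the same sequence length.
matched = shapes_broadcast((1, 128, 768), mask_shape)

print(mismatched)  # False
print(matched)     # True
```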
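When I fine-tuned on the Sina news classification dataset cnews, the resulting accuracy was only 0.1, which looks like pure guessing. I don't know what went wrong.
Also, I have 4 GPUs, but during training only the first one seems to be used. I did not configure anything for the GPUs. What should I change to use all of them? Thanks.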
Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 13410 C python 10649MiB |
| 1 13410 C python 215MiB |
| 2 13410 C python 215MiB |
| 3 13410 C python 215MiB |
1080Ti, single card: running the code below immediately runs out of GPU memory. Other desktop programs are using 400M.
from extract_feature import BertVector
bv = BertVector()
print(bv.encode(['今天天气不错']))
2019-06-11 19:40:21.473032: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-06-11 19:40:24.593479: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:216] failed to load CUBIN: Internal: failed to load in-memory CUBIN: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-06-11 19:40:24.593505: F tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:881] Check failed: module != nullptr
I also changed the configuration in extract_feature.py:
config.gpu_options.allow_growth = False
config.gpu_options.per_process_gpu_memory_fraction = 0.6
Hello, my laptop has a GTX 1070, and when running I get ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[4096,768] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
It is probably out of memory. What should I do?
Floating point exception (core dumped) when running extract_feature.py on Linux with tensorflow-gpu==1.12.0
and
SystemError: error return without exception set when debugging on Windows 10.
zhao, what we need is your help......
crying
Is this the GPU version or the CPU version?
The underlying assumption for a sequence pair to work under this method is that the two sentences are equally informative. However, in my experience, the shorter sentence may carry much less information, especially when it differs from the longer, more informative sentence. I hope assumptions like this one can be highlighted in the README so that newcomers do not struggle with them in their own projects.