
google-research / bert


TensorFlow code and pre-trained models for BERT

Home Page: https://arxiv.org/abs/1810.04805

License: Apache License 2.0

Python 76.32% Jupyter Notebook 23.68%
nlp google natural-language-processing natural-language-understanding tensorflow

bert's Introduction

Google Research

This repository contains code released by Google Research.

All datasets in this repository are released under the CC BY 4.0 International license, which can be found here: https://creativecommons.org/licenses/by/4.0/legalcode. All source files in this repository are released under the Apache 2.0 license, the text of which can be found in the LICENSE file.


Because the repo is large, we recommend you download only the subdirectory of interest:

SUBDIR=foo
svn export https://github.com/google-research/google-research/trunk/$SUBDIR

If you'd like to submit a pull request, you'll need to clone the repository; we recommend making a shallow clone (without history).

git clone git@github.com:google-research/google-research.git --depth=1

Disclaimer: This is not an official Google product.

Updated in 2023.

bert's People

Contributors

0xflotus, abhishekraok, aijunbai, ammarasmro, bogdandidenko, cbockman, craigcitro, dalequark, eric-haibin-lin, georgefeng, hsm207, imcaspar, iuliaturc-google, jacobdevlin-google, jasonjpu, msramalho, pengli09, qwfy, rodgzilla, slavpetrov, soloice, stefan-it, tianxin1860, ywkim, zhaoyongke


bert's Issues

What does the type token mean in modeling.py

In modeling.py, the BertModel class calls "embedding_postprocessor", which uses a token type embedding. Is this the segment A / segment B distinction used for next sentence prediction? If so, the token type vocabulary size ("type_vocab_size") should be 2, is that right? Thank you.
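For illustration, here is a minimal sketch (mine, not from the repo) of what the token type IDs look like for a sentence pair; with only segments A and B, a type vocabulary of size 2 is consistent with this reading:

# Hypothetical illustration of token type (segment) IDs for a sentence pair.
tokens = ["[CLS]", "how", "are", "you", "[SEP]", "i", "am", "fine", "[SEP]"]
# Segment A (including [CLS] and the first [SEP]) gets type 0, segment B gets type 1.
token_type_ids = [0, 0, 0, 0, 0, 1, 1, 1, 1]
assert len(tokens) == len(token_type_ids)
assert max(token_type_ids) + 1 <= 2  # consistent with type_vocab_size = 2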

run_pretraining.py runs but uses the CPU instead of the GPU

  1. When I use run_pretraining.py to pre-train my model, I found that it allocates almost all of the GPU memory but does most of the computation on the CPU.
    Typing the command "nvtop" prints the following:

Device 0 [GeForce GTX 1080 Ti] PCIe GEN 1@16x RX: 0.000 kB/s TX: 0.000 kB/s
GPU 139MHz MEM 405MHz TEMP 44°C FAN 35% POW 18 / 250 W
GPU-Util[ 0%] MEM-Util[||||||||||||11.2G/11.7G] Encoder[ 0%] Decoder[ 0%]

Device 1 [GeForce GTX 1080 Ti] PCIe GEN 1@ 4x RX: 0.000 kB/s TX: 0.000 kB/s
GPU 139MHz MEM 405MHz TEMP 42°C FAN 34% POW 17 / 250 W
GPU-Util[ 0%] MEM-Util[ 0.0G/11.7G] Encoder[ 0%] Decoder[ 0%]

PID USER GPU TYPE MEM Command
15353 feynman 0 Compute 11168Mo 95.3% python

We can see that GPU-Util is 0% while MEM-Util is nearly 100%.
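Not an official answer, but a quick sanity check (assuming TF 1.x, which this repo targets) to confirm that the installed TensorFlow was built with CUDA and can actually see the GPUs; if no GPU devices are listed, the CPU-only wheel is probably installed:

# Sanity check: does this TensorFlow build see any GPUs? (TF 1.x)
import tensorflow as tf
from tensorflow.python.client import device_lib

print("TF version:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())
print("Devices:", [d.name for d in device_lib.list_local_devices()])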

Problem in generating the pre-trained output like ELMo

I followed the example to generate pre-trained features. However, unlike with the other examples, it cannot find the pre-trained model and prints an error message like this:
INFO:tensorflow:Could not find trained model in model_dir: /tmp/tmp6sn66z76, running initialization to predict.
This message is printed when running the line for result in estimator.predict(input_fn, yield_single_examples=True):

However, after this, tf.train.init_from_checkpoint is called inside the model_fn function, so I am not sure whether this is an actual problem or just a harmless log message.

failed to squad on cased model

SQUAD_DIR=./data/squad_1_1/
BERT_BASE_DIR=./models/cased_L-12_H-768_A-12/
nohup python3 run_squad.py \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --do_train=True \
  --train_file=$SQUAD_DIR/train-v1.1.json \
  --do_predict=True \
  --predict_file=$SQUAD_DIR/dev-v1.1.json \
  --train_batch_size=12 \
  --learning_rate=3e-5 \
  --num_train_epochs=2.0 \
  --max_seq_length=384 \
  --doc_stride=128 \
  --output_dir=./output/ > output.txt &

INFO:tensorflow:start_position: 59
INFO:tensorflow:end_position: 63
INFO:tensorflow:answer: f ##eb ##ru ##ary 1848
INFO:tensorflow:***** Running training *****
INFO:tensorflow: Num orig examples = 87599
INFO:tensorflow: Num split examples = 88245
INFO:tensorflow: Batch size = 12
INFO:tensorflow: Num steps = 14599
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Running train on CPU
INFO:tensorflow:*** Features ***
INFO:tensorflow: name = end_positions, shape = (12,)
INFO:tensorflow: name = input_ids, shape = (12, 384)
INFO:tensorflow: name = input_mask, shape = (12, 384)
INFO:tensorflow: name = segment_ids, shape = (12, 384)
INFO:tensorflow: name = start_positions, shape = (12,)
INFO:tensorflow: name = unique_ids, shape = (12,)
Traceback (most recent call last):
File "run_squad.py", line 1170, in
tf.app.run()
File "/usr/local/python3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "run_squad.py", line 1104, in main
estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
File "/usr/local/python3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 376, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/python3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1145, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/python3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1170, in _train_model_default
features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
File "/usr/local/python3/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2162, in _call_model_fn
features, labels, mode, config)
File "/usr/local/python3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1133, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/usr/local/python3/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2391, in _model_fn
features, labels, is_export_mode=is_export_mode)
File "/usr/local/python3/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1244, in call_without_tpu
return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
File "/usr/local/python3/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1505, in _call_model_fn
estimator_spec = self._model_fn(features=features, **kwargs)
File "run_squad.py", line 581, in model_fn
tvars, init_checkpoint)
File "/home/zzt/bert/modeling.py", line 331, in get_assigment_map_from_checkpoint
init_vars = tf.train.list_variables(init_checkpoint)
File "/usr/local/python3/lib/python3.6/site-packages/tensorflow/python/training/checkpoint_utils.py", line 94, in list_variables
reader = load_checkpoint(ckpt_dir_or_file)
File "/usr/local/python3/lib/python3.6/site-packages/tensorflow/python/training/checkpoint_utils.py", line 63, in load_checkpoint
return pywrap_tensorflow.NewCheckpointReader(filename)
File "/usr/local/python3/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 306, in NewCheckpointReader
return CheckpointReader(compat.as_bytes(filepattern), status)
File "/usr/local/python3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 519, in exit
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ./models/cased_L-12_H-768_A-12//bert_model.ckpt
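A frequent cause of this error is that the path does not point at a valid checkpoint prefix (bert_model.ckpt is a prefix covering bert_model.ckpt.index and bert_model.ckpt.data-*). A minimal, hypothetical pre-flight check (assuming TF 1.x) might look like:

# Hypothetical sanity check: can TensorFlow read variables from this checkpoint prefix?
import os
import tensorflow as tf

ckpt = "./models/cased_L-12_H-768_A-12/bert_model.ckpt"  # checkpoint *prefix*, not a file
print("Index file present:", os.path.exists(ckpt + ".index"))
print("First variables:", tf.train.list_variables(ckpt)[:3])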

TensorFlow Hub Module?

Thanks for releasing BERT!

I'm just wondering if BERT will be available on TensorFlow Hub like ELMo (for either fine-tuning or extracting features)?

How to generate the vocab file that the BERT model was trained on?

I was wondering how to generate the vocab file that is specified via --vocab_file for create_pretraining_data.py.

I noticed that the released BERT models do include a vocab file, so how did you generate it, for instance from an English Wikipedia dump? I am going to do the pre-training from scratch. I appreciate your help!

Thanks,
Allen Zhang

Missing requirements.txt

Hope to see this file added so we can tell which versions of TensorFlow and other libraries are supported.

Need clarification for pre-training

In the README.md, it says for the pre-training:

It is important that these be actual sentences 
for the "next sentence prediction" task

and the example sample_text.txt does have each line ending with either . or ;.

Whereas in the BERT paper, it says

... we sample two spans of text from the corpus, which we refer to as "sentences" 
even though they are typically much longer than single sentences 
(but can be shorter also)

So it is unclear whether this implementation expects actual sentences per line, or just documents broken down into multiple lines arbitrarily.

Fine tuning Bert base/large on GPUs

Given the huge number of parameters in BERT, I wonder whether it is feasible at all to fine-tune on GPUs without going to the Google Cloud TPU offering. Has there been any benchmarking of the current implementation? If yes, what types of GPUs are expected to work, and with how many layers and attention heads?

Plans to support longer sequences?

Right now, the model (correct me if I'm wrong) appears to be locked down to sequences of max 512, based on running & playing with the code (and this makes sense in the context of the paper).

Are there any near-term plans to support longer sequences?

Offhand, this would potentially require multiple issues to be addressed, including 1) allowing positional embeddings that can extend for longer or perhaps arbitrary lengths (with some degradation over longer lengths than it has been trained on, of course) (possibly using something like multiple sinusoidal embeddings, like in the original transformer paper?) and 2) containing/limiting the Transformer quadratic memory explosion (my first gut would be to try something like the techniques in "Generating Wikipedia by Summarizing Long Sequences" https://arxiv.org/abs/1801.10198).

Right now--from first pass--it seems like the way to use this over longer sequences is to chunk the docs into sequences (either inline with fixed lengths, or possibly as pre-processing on boundaries like sentences or paragraphs) and apply BERT in a feature-input mode, and then feed into something else downstream (like universal transformer).

All of this seems doable, but is 1) more complicated from an engineering perspective and 2) loses the ability to fine-tune (at least in any way that is obvious to me).

(Of course, having a model adapted to longer sequences as in https://arxiv.org/abs/1801.10198 has model-power trade-offs, such that it is plausible that the feature-based approach could still be superior?)
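For what it's worth, here is a rough sketch (my own, not from the repo) of the chunking approach described above: slide a fixed-size window with overlap over a long token sequence and feed each chunk to BERT in feature-extraction mode. The window and stride values are illustrative.

# Rough sketch of sliding-window chunking for documents longer than max_seq_length.
def chunk_tokens(tokens, window=510, stride=384):
    """Split a long token list into overlapping chunks (2 slots left for [CLS]/[SEP])."""
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
        start += stride
    return chunks

print(len(chunk_tokens(["tok"] * 2000)))  # a 2000-token document -> 5 overlapping chunks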

Why is there an extra dense layer in the pooler?

I'm referring to this line

In the paper, you state

In order to obtain a fixed-dimensional pooled representation of the input sequence, we take the final hidden state (i.e., the output of the Transformer) for the first token in the input, which by construction corresponds to the special [CLS] word embedding. We denote this vector as C ∈ R^H. The only new parameters added during fine-tuning are for a classification layer W ∈ R^{K X H}, where K is the number of classifier labels.

But here you have an H x H dense layer, which contradicts the above. Even more perplexing to me is that the activation of this layer is tanh! I'm surprised all the models worked with tanh instead of a relu activation.

I suspect that I'm missing something here. Thanks for your patience.
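For context, the pooler the issue refers to takes the final hidden state of the first ([CLS]) token and applies an H x H dense layer with tanh; a simplified sketch (TF 1.x style, not the exact repo code, with illustrative shapes):

# Simplified sketch of the pooling step discussed above (TF 1.x).
import tensorflow as tf

batch, seq_len, hidden_size = 8, 128, 768                  # illustrative shapes
sequence_output = tf.zeros([batch, seq_len, hidden_size])  # stand-in for the encoder output
# Take the final hidden state of the first ([CLS]) token ...
first_token_tensor = tf.squeeze(sequence_output[:, 0:1, :], axis=1)  # [batch, H]
# ... and project it with an H x H dense layer and tanh activation.
pooled_output = tf.layers.dense(first_token_tensor, hidden_size, activation=tf.tanh)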

Clarification of document

In the paper, it says

For Wikipedia we extract only the text passages and ignore lists, tables, and headers

I wonder whether these text passages extracted from a document are kept as separate individual documents, or whether they are recombined into one document?

[Clarification] Feature vectors : Creating the input file

As I understand it, we need to give the extract_features.py script the dataset that we will use for the model built on top of the BERT embeddings. This allows the model to do supplementary training on data specific to the dataset. Two sentences are used (separated by '|||') in order to train the next sentence prediction feature. Right?


From the paper :

To generate each training input sequence, we sample two spans of text from the corpus, which we refer to as “sentences” even though they are typically much longer than single sentences (but can be shorter also)

If I want to create my input file from a dataset where the data is documents, should I take the same approach (splitting in the middle, even if there are more than 2 sentences), or strictly split at every sentence? Which approach will give the best accuracy?


For example, let's say I have this data row :

doc1 = "Sentence 1. Sentence 2. Sentence 3."
doc2 = "Sentence 4. Sentence 5."
label = X

Then should I split like this :

Sentence 1. Sentence 2. ||| Sentence 3.
Sentence 4. ||| Sentence 5.

Or like this :

Sentence 1. ||| Sentence 2.
Sentence 2. ||| Sentence 3.
Sentence 4. ||| Sentence 5.

Or any other way I didn't think of ?
(I should not link Sentence 3 and Sentence 4 together, right? As they potentially do not follow each other.)


Thanks again for the brilliant work.
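For illustration only (my own sketch, not an official recommendation), writing one "A ||| B" line per document by splitting it roughly in the middle would look like this:

# Hypothetical helper: split each document roughly in the middle into an "A ||| B" line
# for extract_features.py. The splitting strategy here is a guess, not an official answer.
docs = [
    "Sentence 1. Sentence 2. Sentence 3.",
    "Sentence 4. Sentence 5.",
]
with open("input.txt", "w") as f:
    for doc in docs:
        sentences = [s.strip() + "." for s in doc.split(".") if s.strip()]
        mid = max(1, (len(sentences) + 1) // 2)
        f.write(" ".join(sentences[:mid]) + " ||| " + " ".join(sentences[mid:]) + "\n")
# Produces: "Sentence 1. Sentence 2. ||| Sentence 3." and "Sentence 4. ||| Sentence 5."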

input_fn_builder() got an unexpected keyword argument 'features'

When training the model, the first error (a missing argument, output_file) was easy to solve (I specified output_file = "output").
But afterwards, run_classifier.convert_examples_to_features() throws an error about the keyword 'features'.
No hint how to solve this!

Stack trace:

INFO:tensorflow:guid: train-5
INFO:tensorflow:tokens: [CLS] the stock rose $ 2 . 11 , or about 11 percent , to close friday at $ 21 . 51 on the new york stock exchange . [SEP] pg & e corp . shares jumped $ 1 . 63 or 8 percent to $ 21 . 03 on the new york stock exchange on friday . [SEP]
INFO:tensorflow:input_ids: 101 1996 4518 3123 1002 1016 1012 2340 1010 2030 2055 2340 3867 1010 2000 2485 5958 2012 1002 2538 1012 4868 2006 1996 2047 2259 4518 3863 1012 102 18720 1004 1041 13058 1012 6661 5598 1002 1015 1012 6191 2030 1022 3867 2000 1002 2538 1012 6021 2006 1996 2047 2259 4518 3863 2006 5958 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:label: 1 (id = 1)
***** Started training at 2018-11-05 07:44:50.022688 *****
  Num examples = 3668
  Batch size = 32
INFO:tensorflow:  Num steps = 343
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-c5a5fb94a015> in <module>()
     10     seq_length=MAX_SEQ_LENGTH,
     11     is_training=True,
---> 12     drop_remainder=True)
     13 estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
     14 print('***** Finished training at {} *****'.format(datetime.datetime.now()))

TypeError: input_fn_builder() got an unexpected keyword argument 'features'

Are linear decay, L2 normalization and learned positional embs essential to the performance?

Hello, I've been training my own version of BERT (i.e. not from this repo, but I think the main ideas are implemented) on Chinese for over a week, but the performance is not very promising. (The problem could be the implementation, the dataset, the training time, or the difference between English and Chinese.) As for the optimizer, I use Adam without linear decay and without L2 normalization, and I use sinusoidal positional embeddings to reduce the number of variables. Could you tell me how important these are? Are they essential to the final performance? Any tricks for transferring to another language? Thanks very much!
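Not an authoritative answer, but for reference, the linear warmup plus linear decay schedule asked about here can be written as a small function like this (a sketch with illustrative numbers, not the repo's optimization.py):

# Sketch of a linear warmup + linear decay learning-rate schedule.
def learning_rate(step, base_lr=1e-4, warmup_steps=10000, total_steps=1000000):
    if step < warmup_steps:
        return base_lr * step / float(warmup_steps)            # linear warmup
    # linear decay from base_lr down to 0 over the remaining steps
    return base_lr * (total_steps - step) / float(total_steps - warmup_steps)

print([round(learning_rate(s), 6) for s in (0, 5000, 10000, 500000, 1000000)])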

Colab notebook is out of sync with the latest update

Hi, I am trying to run the Colab notebook: https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb

I was able to run it yesterday, before the last git update. There were substantial changes to the run_classifier.py code (e.g. the convert_examples_to_features function now requires an "output_file" argument which is not there in the Colab notebook, the input_fn_builder function no longer recognizes the 'features' argument but instead requires 'input_file', and so on).

Will the colab notebook be updated soon to reflect these changes? Thanks.
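A rough sketch of how the notebook's calls might need to change, based only on the argument names described above (the surrounding names such as train_examples, label_list, tokenizer and MAX_SEQ_LENGTH come from earlier notebook cells, and the exact function names and signatures should be checked against the current run_classifier.py):

# Hypothetical adjustment based on the reported argument changes; signatures not verified.
import run_classifier

TRAIN_FILE = "train.tf_record"                  # assumed: features are first written to a file
run_classifier.convert_examples_to_features(
    train_examples, label_list, MAX_SEQ_LENGTH, tokenizer,
    output_file=TRAIN_FILE)                     # new required argument per the report above

train_input_fn = run_classifier.input_fn_builder(
    input_file=TRAIN_FILE,                      # replaces the old features= argument
    seq_length=MAX_SEQ_LENGTH,
    is_training=True,
    drop_remainder=True)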

Adding domain specific vocabulary

Hi, thanks for the release!

I will need to add some domain-specific vocabulary; do you have any suggestions on how to do it?
I was thinking of replacing some [unused#] tokens in the vocab file (so, if I'm not mistaken, they already have existing weights in the checkpoint files) to avoid extending the matrices, and then fine-tuning the LM on a domain-specific corpus.
If that is feasible, I would also try a first LM fine-tuning pass with the existing vocab embeddings frozen, to learn only the new words, and then a second pass with everything unfrozen.

Do you think this is the right way to do it?
How can I freeze a subset of the embeddings? (Gradient masking?)
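Not an official answer, but one common way to freeze all but a chosen subset of embedding rows is a stop-gradient mask, roughly like this sketch (TF 1.x, illustrative sizes; the new-token indices are hypothetical):

# Sketch of gradient masking: rows marked 1 in `mask` receive gradients,
# rows marked 0 are treated as constants via tf.stop_gradient.
import numpy as np
import tensorflow as tf

vocab_size, hidden_size = 30522, 768              # illustrative sizes
embedding_table = tf.get_variable("embeddings", [vocab_size, hidden_size])

new_token_ids = [1, 2, 3]                         # hypothetical indices of replaced [unused#] rows
mask = np.zeros([vocab_size, 1], dtype=np.float32)
mask[new_token_ids] = 1.0
mask = tf.constant(mask)

# Forward values are unchanged; gradients only flow into the masked (new) rows.
effective_table = mask * embedding_table + (1.0 - mask) * tf.stop_gradient(embedding_table)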

There is an endless loop when max_seq_length=64.

Thanks for your hard work. There is an endless loop when max_seq_length=64.
your code:

max_tokens_for_doc = max_seq_length - len(query_tokens) - 3
...
while start_offset < len(all_doc_tokens):
  length = len(all_doc_tokens) - start_offset
  if length > max_tokens_for_doc:
    length = max_tokens_for_doc
  doc_spans.append(_DocSpan(start=start_offset, length=length))
  if start_offset + length == len(all_doc_tokens):
    break
  start_offset += min(length, doc_stride)
I'm running fine-tuning on SQuAD 1.1. When max_seq_length is 64, max_tokens_for_doc is 0 for some training examples, so length is 0 and start_offset stays at 0. The loop above therefore never terminates, and memory keeps growing until my program is killed.
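One way to guard against this (my own sketch, not an official patch) is to check that there is room for at least one document token before entering the span loop, so start_offset always advances:

# Sketch of a guard for the loop above (not an official fix).
def check_doc_room(max_seq_length, query_tokens):
    max_tokens_for_doc = max_seq_length - len(query_tokens) - 3
    if max_tokens_for_doc < 1:
        raise ValueError(
            "max_seq_length=%d leaves no room for document tokens with a %d-token query"
            % (max_seq_length, len(query_tokens)))
    return max_tokens_for_doc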

Why does the Chinese vocab contain ##word pieces?

In the Chinese vocab, we see many word pieces containing ##. In my understanding, those ## pieces only exist for rare words. But after tokenization the text is converted to a sequence of single characters, so in what cases do we use/need those ## vocab entries?

Training on data sets not in the discussed data sets

Would it be possible to replace the Microsoft Research Paraphrase data with data from an alternative source?

Take, for example, the following:

export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12
export GLUE_DIR=/path/to/glue

python run_classifier.py \
  --task_name=MRPC \
  --do_train=true \
  --do_eval=true \
  --data_dir=$GLUE_DIR/MRPC \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=/tmp/mrpc_output/

Does run_classifier support replacing the data_dir with my own dataset to train the classifier?
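Broadly, run_classifier.py routes each dataset through a task-specific processor class, so one way to use your own data is to add a processor for it. A rough sketch, assuming the DataProcessor/InputExample interface looks like the existing MRPC and CoLA processors and assuming a hypothetical tab-separated file layout (check run_classifier.py for the exact method names):

# Rough sketch of a custom processor for run_classifier.py (interface assumed, not verified).
import csv
import os
from run_classifier import DataProcessor, InputExample

class MyTaskProcessor(DataProcessor):
    def get_train_examples(self, data_dir):
        return self._create_examples(os.path.join(data_dir, "train.tsv"), "train")

    def get_dev_examples(self, data_dir):
        return self._create_examples(os.path.join(data_dir, "dev.tsv"), "dev")

    def get_labels(self):
        return ["0", "1"]                          # hypothetical binary labels

    def _create_examples(self, path, set_type):
        examples = []
        with open(path) as f:
            for i, row in enumerate(csv.reader(f, delimiter="\t")):
                text_a, text_b, label = row        # hypothetical column layout
                examples.append(InputExample(
                    guid="%s-%d" % (set_type, i),
                    text_a=text_a, text_b=text_b, label=label))
        return examples

You would then register the new class in the processors dictionary in run_classifier.py's main() and pass the matching --task_name, pointing --data_dir at your own files.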

tensorflow.python.framework.errors_impl.InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version

python run_squad.py \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --do_train=True \
  --train_file=$SQUAD_DIR/train-v1.1.json \
  --do_predict=True \
  --predict_file=$SQUAD_DIR/dev-v1.1.json \
  --train_batch_size=12 \
  --learning_rate=3e-5 \
  --num_train_epochs=2.0 \
  --max_seq_length=384 \
  --doc_stride=128 \
  --output_dir=/tmp/squad_base/

Should I install CUDA 8?

linear projection "bias=True" differs from the original t2t transformer implementation

Hi,

In modeling.py line 866:

with tf.variable_scope("output"):
  attention_output = tf.layers.dense(
      attention_output,
      hidden_size,
      kernel_initializer=create_initializer(initializer_range))

You leave the bias of the dense layer set to True, whereas in the original transformer implementation it is set to False. Is there any reason for doing that?

Thanks.

PyTorch implementation

Hello all,

We have released a PyTorch implementation/port of BERT!

Our scripts load Google's pre-trained models, and the port performs about the same as the TF implementation in our tests (see the README). We have also included gradient accumulation, multi-GPU and distributed training options to help you fine-tune these large models.

Here's the link: https://github.com/huggingface/pytorch-pretrained-BERT

We hope it will be useful!

Victor - HuggingFace 🤗

I get an attribute error when I run the classifier code

When I run code like:
python run_classifier.py \
  --task_name=MRPC \
  --do_train=true \
  --do_eval=true \
  --data_dir=SST-2 \
  --vocab_file=english_L-12_H-768_A-12/vocab.txt \
  --bert_config_file=english_L-12_H-768_A-12/bert_config.json \
  --init_checkpoint=english_L-12_H-768_A-12/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=output

I get an AttributeError as follows:
Traceback (most recent call last):
File "run_classifier.py", line 754, in
tf.app.run()
File "/home/xiyu/.local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "run_classifier.py", line 658, in main
is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2
AttributeError: 'module' object has no attribute 'InputPipelineConfig'

My TensorFlow version is 1.4.0, and my GPU is a GTX 1080. Am I running the command incorrectly, or did I forget to install some packages?

Thanks very much.

run run_classifier.py on chinese data, Failed to find any matching files for /path/chinese_L-12_H-768_A-12/bert_model.ckpt

When running the classification script run_classifier.py as follows:

python run_classifier.py \
  --task_name=XNLI \
  --do_train=true \
  --do_eval=true \
  --data_dir=$XNLI_DIR \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=5e-5 \
  --num_train_epochs=0.01 \
  --output_dir=/tmp/xnli_output/

I get the following error; I cannot find the file "bert_model.ckpt".

INFO:tensorflow:Error recorded from training_loop: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /path/chinese_L-12_H-768_A-12//bert_model.ckpt
INFO:tensorflow:training_loop marked as finished
WARNING:tensorflow:Reraising captured error
Traceback (most recent call last):
File "run_classifier.py", line 838, in <module>
tf.app.run()
File "/Library/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "run_classifier.py", line 794, in main
estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
File "/Library/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2400, in train
rendezvous.raise_errors()
File "/Library/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/tpu/python/tpu/error_handling.py", line 128, in raise_errors
six.reraise(typ, value, traceback)
File "/Library/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2394, in train
saving_listeners=saving_listeners
File "/Library/anaconda2/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 356, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/Library/anaconda2/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1181, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/Library/anaconda2/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1211, in _train_model_default
features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
File "/Library/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2186, in _call_model_fn
features, labels, mode, config)
File "/Library/anaconda2/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1169, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/Library/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2470, in _model_fn
features, labels, is_export_mode=is_export_mode)
File "/Library/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1250, in call_without_tpu
return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
File "/Library/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1524, in _call_model_fn
estimator_spec = self._model_fn(features=features, **kwargs)
File "run_classifier.py", line 575, in model_fn
) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
File "/Users/xiaoqiugen/Project/tmp/bert/modeling.py", line 331, in get_assignment_map_from_checkpoint
init_vars = tf.train.list_variables(init_checkpoint)
File "/Library/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/checkpoint_utils.py", line 95, in list_variables
reader = load_checkpoint(ckpt_dir_or_file)
File "/Library/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/checkpoint_utils.py", line 64, in load_checkpoint
return pywrap_tensorflow.NewCheckpointReader(filename)
File "/Library/anaconda2/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 314, in NewCheckpointReader
return CheckpointReader(compat.as_bytes(filepattern), status)
File "/Library/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 526, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /path/chinese_L-12_H-768_A-12//bert_model.ckpt

Where is the XnliProcessor?

I want to run run_classifier.py with the XNLI dataset. I did what the CONTRIBUTING.md said, but I can't find the XnliProcessor.

Training WordPiece vocabulary

Hello, thanks for releasing this code! I need to pretrain a BERT model for non-English language using my own (domain-specific) data. I noticed that the training code for WordPiece vocabulary is not released, although there are a couple open source implementations mentioned in the README. It is however stated that "these are not compatible with our tokenization.py library". May I know whether this means I can't simply feed in the vocab file generated by an external WordPiece library (e.g. tensor2tensor's) as an argument for --vocab_file?

How would BERT perform on IMDB, TREC-6?

The paper mentions only GLUE datasets. I wonder how it would perform on binary, multi-class and multi-label text classification tasks, so that we can directly compare it with ELMo and ULMFiT.

Clarification : Fixed feature vectors

Please correct me if I'm wrong :

  • Feature vectors are word embeddings, for each token of the input file.
  • These vectors can be used like ELMo / GloVe embeddings: as a base for a bigger neural network.

If these assumptions are right, here is my question :

From the use example :

python extract_features.py \
  ... \
  --layers=-1,-2,-3,-4 \
  ...

Why would anyone be interested in feature vectors from layers other than the last one?

From my understanding, feature vectors from the last layer are complete, while feature vectors from other layers are not.

'Complete' is obviously the wrong word here, due to my lack of vocabulary / knowledge.

By the way, BERT is really amazing, congratulations and thank you for sharing it.
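As a usage illustration (my own sketch; the exact JSON field names should be checked against extract_features.py's output), one common reason for requesting several layers is to combine them ELMo-style, e.g. by summing or concatenating the top four layers per token:

# Sketch: read one JSON line produced by extract_features.py and build per-token vectors
# by summing the requested layers (field names assumed, not verified).
import json
import numpy as np

with open("output.jsonl") as f:
    record = json.loads(f.readline())

token_vectors = []
for feat in record["features"]:
    layers = [np.array(layer["values"]) for layer in feat["layers"]]  # layers -1,-2,-3,-4
    token_vectors.append(np.sum(layers, axis=0))   # alternative: np.concatenate(layers)
print(len(token_vectors), token_vectors[0].shape)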

plan to release SWAG code?

Hi, I just want to know if you plan to release fine-tuning and evaluation code for SWAG dataset.
If not, I wonder whether the training procedure is the same as for MRPC (more specifically, label 0 for distractors and 1 for the gold ending).

throwing bad_alloc after calling model_fn

Awesome research! This is a huge breakthrough for NLP.

I'm running BERT-Large on a Cloud TPU, fine-tuning for SQuAD, but I keep getting:

(screenshot of the bad_alloc error omitted)

I have nothing else running, so I'm not sure why the machine is running out of memory; I followed the setup steps exactly (i.e. putting the pre-trained model in a Google Cloud Storage bucket, setting up the TPU, etc.).
