google-research / albert

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

License: Apache License 2.0

Python 96.22% Shell 0.69% Jupyter Notebook 3.10%

albert's Issues

How to continue training ALBERT from the model released on TF Hub?

Training from scratch is very expensive. Does anybody know how to continue training ALBERT from the exported model?

Thanks

[ALBERT] Failed to load TF-Hub model on Google Colab

I want to import ALBERT in Google Colab, so I first ran bash albert/run.sh to install albert.
Then I imported the model following the examples in README.md, but got an error:

It seems albert was not imported correctly. How can I solve it?

Multilingual Albert

Is there a plan to release a multilingual pre-trained model, as you did for BERT?

Training from scratch on TPU

Is it possible to train ALBERT from scratch in another language using a TPU v3 (128 GB)?

Could you give an estimated training time? Days, weeks, months?

What is a reasonable corpus size? 1B words? Should the seq_length be reduced from the default 512?

How to generate vocab.txt file?

Can anybody tell me how to generate the vocab file for the ALBERT model?

Will you release Chinese ALBERT models?

Thanks.

[ALBERT] Pre-training on TPU Pod

Hi all,
Could I do pre-training on a TPU Pod v2-256 with the large/xlarge V2 config (batch 4096, 3M steps, ...)?
Is there a config that works for this?

[ALBERT] --random_next_sentence option in create_pretraining_data.py

Thanks for releasing the ALBERT code.

The default for the --random_next_sentence option in create_pretraining_data.py is "True", but to enable the sentence order prediction (SOP) task this option has to be set to "False", right? If so, it would be better to make "False" the default.

The explanation for the --random_next_sentence option in the source code also seems strange: it describes the behavior for "False".

Thanks.
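
The distinction behind this flag can be sketched in a few lines. This is a minimal illustration, not the code in create_pretraining_data.py: make_pair is a hypothetical helper, and it assumes (as the question above does) that random_next_sentence=True draws the second segment from another document, while False builds the SOP negative by swapping two consecutive segments.

```python
import random


def make_pair(doc, other_docs, index, random_next_sentence, rng):
    """Return (segment_a, segment_b, label) for segments index and index + 1.

    label is 0 for the original order and 1 for the negative case.
    Hypothetical helper for illustration only.
    """
    seg_a, seg_b = doc[index], doc[index + 1]
    if rng.random() < 0.5:
        return seg_a, seg_b, 0  # positive: consecutive segments, original order
    if random_next_sentence:
        # NSP-style negative: second segment comes from a different document.
        other = rng.choice(other_docs)
        return seg_a, rng.choice(other), 1
    # SOP-style negative: the same two segments with their order swapped.
    return seg_b, seg_a, 1


rng = random.Random(0)
doc = ["s1", "s2", "s3"]
others = [["x1", "x2"], ["y1", "y2"]]
sop_pairs = [make_pair(doc, others, 0, False, rng) for _ in range(20)]
nsp_pairs = [make_pair(doc, others, 0, True, rng) for _ in range(20)]
```

In the SOP case both segments always come from the same document, so the model cannot solve the task by topic alone; in the NSP-style case the negative segment comes from elsewhere.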

[ALBERT] question : unnecessary(or code error) in tokenization.py?

Problem description

In the encode_pieces function of the tokenization script (specifically, lines 122 ~ 126),
I can't find a single case that executes line 122, marked with '>' in the code below.
(I ran it over my sample corpus (1 GB) but could not find any.)

  if not sample:
    pieces = sp_model.EncodeAsPieces(text)
  else:
    pieces = sp_model.SampleEncodeAsPieces(text, 64, 0.1)
  new_pieces = []
  for piece in pieces:
    piece = printable_text(piece)
    if len(piece) > 1 and piece[-1] == "," and piece[-2].isdigit():
      cur_pieces = sp_model.EncodeAsPieces(
          six.ensure_binary(piece[:-1]).replace(SPIECE_UNDERLINE, b""))
>     if piece[0] != SPIECE_UNDERLINE and cur_pieces[0][0] == SPIECE_UNDERLINE:
        if len(cur_pieces[0]) == 1:
          cur_pieces = cur_pieces[1:]
        else:
          cur_pieces[0] = cur_pieces[0][1:]
      cur_pieces.append(piece[-1])
      new_pieces.extend(cur_pieces)
    else:
      new_pieces.append(piece)

Steps/code/corpus to reproduce

Running over the sample corpus (1 GB), I was able to find cases that execute line 119, e.g. piece = '▁2011' or '▁11,',
but none of them executes line 122.

Infos I found

1) 'piece[0] != SPIECE_UNDERLINE' in line 122 below always has to be True (?)

> if piece[0] != SPIECE_UNDERLINE and cur_pieces[0][0] == SPIECE_UNDERLINE:

Since piece is a string, piece[0] is a single character (including the case when piece = '▁'),
and SPIECE_UNDERLINE is a UTF-8 encoded bytes object representing '▁'.

piece[0] != SPIECE_UNDERLINE
# always True

2) cur_pieces[0][0] refers to b'\xe2' in b'\xe2\x96\x81'.
Since cur_pieces is the result of sp_model.EncodeAsPieces, cur_pieces[0] equals b'\xe2\x96\x81' followed by some characters, because cur_pieces[0] is the first token (the SentencePiece model always prepends '▁' to the first token).
So cur_pieces[0][0] refers to the byte 0xe2, which is the integer 226.

cur_pieces[0][0]
#226
ord(b'\xe2')
#226

So cur_pieces[0][0] == SPIECE_UNDERLINE is always false in my cases.
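
Both observations can be reproduced without SentencePiece. The sketch below assumes Python 3 and that SPIECE_UNDERLINE is the UTF-8 encoding of '▁', as described above; the piece values are made-up examples, and piece is assumed to be a str.

```python
SPIECE_UNDERLINE = "▁".encode("utf-8")  # b'\xe2\x96\x81'

piece = "▁11,"                     # a str piece (assumed result of printable_text)
cur_piece = "▁11".encode("utf-8")  # a bytes piece

# str vs bytes: never equal in Python 3, so "!=" is always True.
str_vs_bytes = piece[0] != SPIECE_UNDERLINE

# Indexing bytes yields an int (226 == 0xe2), which never equals a bytes object.
first_byte = cur_piece[0]
int_vs_bytes = first_byte == SPIECE_UNDERLINE

# Comparing bytes with bytes gives the check the condition appears to intend.
same_type = cur_piece.startswith(SPIECE_UNDERLINE)

print(str_vs_bytes, first_byte, int_vs_bytes, same_type)  # True 226 False True
```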

Question

Is there any explanation or reason for implementing lines 122 ~ 126?
Could you share sample sentences or words that execute that code block?

Versions

Darwin-18.5.0-x86_64-i386-64bit
Python 3.6.8 |Anaconda, Inc.| (default, Dec 29 2018, 19:04:46) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
tensorflow 1.13.1

[ALBERT] How to know memory consumption on GPU/TPU of a model?

Hi all,
Given parameters such as batch size, max sequence length, ...
how can I determine how many megabytes/gigabytes of TPU/GPU memory are needed to train the model?
Could you please give me an example?
Thank you.

[ALBERT] Pretraining

Hi,

thanks for releasing implementation and pre-trained models for ALBERT ❤️

I would really like to train a model from scratch. Would this be possible with one v3-8 TPU 🤔? If so, could you also give a detailed overview of the parameters you used for training the base or larger models?

I think the "training for x steps for sequence length 128 and then fine-tuning the model for a sequence length of 512" approach is not used in the paper, so could you specify the parameters for the create_pretraining_data.py script (it uses a sequence length of 128 by default)?

Many thanks in advance!

Finetune ALBERT using pretrained model

Hi,

Thanks very much for releasing the pre-trained code for ALBERT. I downloaded the TF Hub model from https://tfhub.dev/google/albert_base/1. However, the accuracy on CoLA and MNLI seems to indicate that this model does not contain the pre-trained weights for ALBERT.
I have tried:

python3 -m run_classifier_sp --data_dir=../../DataSet/CoLA/ --task_name=cola --output_dir=testing_sp_modifying_model_setting_in_main_file --vocab_file=vocab.txt --albert_config_file=2/assets/albert_config.json --do_train=False --do_eval=True --max_seq_length=128 --train_batch_size=32 --learning_rate=2e-05 --num_train_epochs=3.0

With do_train set to False, I get an accuracy of around 40%;
with do_train set to True, I am able to get an accuracy of 70%.

Besides, after downloading the weights from TF Hub, I can only find the .json file in the folder, but no .ckpt file.

How can we find the pre-trained weights, i.e. the .ckpt file for ALBERT?

[ALBERT]: In run_squad_sp, convert_examples_to_features gives error in case sentence piece model is not provided.

I am trying to run the ALBERT model on the SQuAD dataset. If the SentencePiece (SP) model is not provided, convert_examples_to_features does not go through. Please let me know where I can find the SP model.

Failed precondition: Error while reading resource variable

I used albert_base_v2 and tfhub on python3.6 and tensorflow 1.15.0.

I tried to wrap the ALBERT output layer with a Lambda layer, and it gives the following errors:

---------------------------------------------------------------------------
FailedPreconditionError                   Traceback (most recent call last)
<ipython-input-6-d725e62752e5> in <module>()
----> 1 dense = keras.layers.Lambda(lambda x: x[:, 0])(albert_outputs)

7 frames
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py in __call__(self, *args, **kwargs)
   1470         ret = tf_session.TF_SessionRunCallable(self._session._session,
   1471                                                self._handle, args,
-> 1472                                                run_metadata_ptr)
   1473         if run_metadata:
   1474           proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

FailedPreconditionError: 2 root error(s) found.
  (0) Failed precondition: Error while reading resource variable module/bert/encoder/transformer/group_0/inner_group_0/ffn_1/intermediate/output/dense/bias from Container: localhost. This could mean that the variable was uninitialized. Not found: Container localhost does not exist. (Could not find resource: localhost/module/bert/encoder/transformer/group_0/inner_group_0/ffn_1/intermediate/output/dense/bias)
	 [[{{node module_apply_tokens/bert/encoder/transformer/group_0_11/layer_11/inner_group_0/ffn_1/intermediate/output/dense/add/ReadVariableOp}}]]
  (1) Failed precondition: Error while reading resource variable module/bert/encoder/transformer/group_0/inner_group_0/ffn_1/intermediate/output/dense/bias from Container: localhost. This could mean that the variable was uninitialized. Not found: Container localhost does not exist. (Could not find resource: localhost/module/bert/encoder/transformer/group_0/inner_group_0/ffn_1/intermediate/output/dense/bias)
	 [[{{node module_apply_tokens/bert/encoder/transformer/group_0_11/layer_11/inner_group_0/ffn_1/intermediate/output/dense/add/ReadVariableOp}}]]
	 [[module_apply_tokens/bert/encoder/transformer/group_0_11/layer_11/inner_group_0/ffn_1/intermediate/output/dense/add/ReadVariableOp/_1]]
0 successful operations.
0 derived errors ignored.

Here is my code:

    import tensorflow as tf  # needed for tf.int32 below
    from tensorflow import keras
    import tensorflow_hub as hub
    q_in = keras.layers.Input(shape=(None,), dtype=tf.int32, name="q_input_word_ids")
    q2_in = keras.layers.Input(shape=(None,), dtype=tf.int32, name="q_input_masks")
    q3_in = keras.layers.Input(shape=(None,), dtype=tf.int32, name="q_segment_ids")

    albert_module = hub.Module('https://tfhub.dev/google/albert_base/2', trainable=True)
    albert_inputs = dict(input_ids=q_in, input_mask=q2_in, segment_ids=q3_in)
    albert_outputs = albert_module(albert_inputs, signature="tokens", as_dict=True)["sequence_output"]

Up till now everything runs fine. But when I run the following, it gives the error above.

    dense = keras.layers.Lambda(lambda x: x[:, 0])(albert_outputs)

Any help is appreciated!

[ALBERT] Has anyone reproduced ALBERT scores on the GLUE dataset?

I converted the TF weights to PyTorch weights, and on the QQP dataset I only get 87% accuracy.

model: albert-base
epochs: 3
learning_rate: 2e-5
batch size: 24
max sequence length: 128
warmup_proportion: 0.1

[ALBERT] Error "expected str instance, bytes found", then error "list index out of range"

When I run run_squad_sp.py, this error occurs:

Traceback (most recent call last):
  File "run_squad_sp.py", line 1350, in <module>
    tf.app.run()
  File "/usr/local/app/.conda/envs/convlab/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "run_squad_sp.py", line 1257, in main
    output_fn=train_writer.process_feature)
  File "run_squad_sp.py", line 396, in convert_examples_to_features
    tok_cat_text = "".join(para_tokens).replace(
TypeError: sequence item 89: expected str instance, bytes found

I found that para_tokens looks like this:

['▁greece', '▁attract', 's', '▁more', '▁than', '▁16', '▁million', '▁tourists', '▁each', '▁year', ',', '▁thus', '▁contributing', '▁18', '.', '2%', '▁to', '▁the', '▁nation', "'", 's', '▁gdp', '▁in', '▁2008', '▁according', '▁to', '▁an', '▁', 'oe', 'cd', '▁report', '.', '▁the', '▁same', '▁survey', '▁showed', '▁that', '▁the', '▁average', '▁tourist', '▁expenditure', '▁while', '▁in', '▁greece', '▁was', '▁$1', ',', '07', '3', ',', '▁ranking', '▁greece', '▁10', 'th', '▁in', '▁the', '▁world', '.', '▁the', '▁number', '▁of', '▁jobs', '▁directly', '▁or', '▁indirectly', '▁related', '▁to', '▁the', '▁tourism', '▁sector', '▁were', '▁8', '40,000', '▁in', '▁2008', '▁and', '▁represented', '▁', '19%', '▁of', '▁the', '▁country', "'", 's', '▁total', '▁labor', '▁force', '.', '▁in', "b'\\xe2\\x96\\x812009'", ',', '▁greece', '▁welcomed', '▁over', '▁19', '.', '3', '▁million', '▁tourists', ',', '▁a', '▁major', '▁increase', '▁from', '▁the', '▁17', '.', '7', '▁million', '▁tourists', '▁the', '▁country', '▁welcomed', '▁in', '▁2008', '.']

So I changed my code from tok_cat_text = "".join(para_tokens).replace(tokenization.SPIECE_UNDERLINE.decode("utf-8"), " ") to tok_cat_text = "".join([w.decode() if type(w)==bytes else w for w in para_tokens]).replace( tokenization.SPIECE_UNDERLINE.decode("utf-8"), " "), and another error occurs:

Traceback (most recent call last):
  File "run_squad_sp.py", line 1350, in <module>
    tf.app.run()
  File "/usr/local/app/.conda/envs/convlab/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "run_squad_sp.py", line 1257, in main
    output_fn=train_writer.process_feature)
  File "run_squad_sp.py", line 471, in convert_examples_to_features
    n, is_start=False)
  File "run_squad_sp.py", line 316, in _convert_index
    if index[pos] is not None:
IndexError: list index out of range

What's wrong with it? Thanks.
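
The workaround for the first error (mixed str and bytes tokens) can be written as a small helper. This is a minimal sketch: to_text is a hypothetical name, the token list is a shortened made-up example, and it only addresses the TypeError, not the later IndexError in _convert_index.

```python
SPIECE_UNDERLINE = "▁"


def to_text(token):
    """Decode bytes tokens as UTF-8; pass str tokens through unchanged."""
    return token.decode("utf-8") if isinstance(token, bytes) else token


# One token is bytes, as in the failing case; the rest are str.
para_tokens = ["▁in", "▁2009".encode("utf-8"), ",", "▁greece"]
tok_cat_text = "".join(to_text(t) for t in para_tokens).replace(SPIECE_UNDERLINE, " ")
print(repr(tok_cat_text))  # ' in 2009, greece'
```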

Issues with codes in Google Colab-Tensorflow-2.0.0

Hi All,

I exported the ALBERT code to Google Drive and mounted it in Google Colab. The code does not run because of many issues, such as tensorflow.contrib not being found. Could anybody suggest how to run the ALBERT code in Google Colab?

SQuAD vocab file for ALBERT

Thanks for sharing great work,

The current implementation of albert/run_squad_sp.py requires a vocab file to work.
Is it the same as for BERT pre-trained archives (https://github.com/google-research/bert)?

[ALBERT] What value of 'dupe_factor' is good?

I see the default value is 40, which seems large. Does this parameter significantly affect performance?
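
For context, dupe_factor is the number of times the input data is duplicated, each time with different random masks. The toy sketch below illustrates that idea only; mask_tokens, create_instances, and the 15% masking rate are assumptions for illustration, not the repository's implementation.

```python
import random


def mask_tokens(tokens, rng, mask_prob=0.15):
    """Replace a random subset of tokens (at least one) with [MASK]."""
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return ["[MASK]" if i in positions else t for i, t in enumerate(tokens)]


def create_instances(documents, dupe_factor, seed=12345):
    """Make dupe_factor masked copies of every document."""
    rng = random.Random(seed)
    instances = []
    for _ in range(dupe_factor):  # one pass over the corpus per duplicate
        for tokens in documents:
            instances.append(mask_tokens(tokens, rng))
    return instances


docs = [["the", "cat", "sat", "on", "the", "mat", "today"]]
instances = create_instances(docs, dupe_factor=4)
```

Under this reading, the number of training instances and the size of the generated data grow linearly with dupe_factor.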

Model for SQuAD Question Answering

Hi,
I would like to use the ALBERT model on the SQuAD data set. Is "run_squad_sp.py" the right script for this purpose? Is it enough to run this file alone, or should I run any other .py files in the albert folder beforehand?

I was expecting some documentation on using ALBERT, but unfortunately none is available.

Where to download non-tfhub pretrained models?

I want to modify something based on run_classification.py, but it seems only tfhub pretrained models have been released. Could you provide TF checkpoints in raw format?

"no dropout" on v2 models

You say that you are using "no dropout" on the TFHub v2 models. However, looking at the albert_config.json files, there seems to be dropout on most models (https://tfhub.dev/google/albert_base/2). Only xxlarge has no dropout (https://tfhub.dev/google/albert_xxlarge/2). Which is correct?

[ALBERT] albert-xlarge V2 seems to have a different behavior than the other models

Hi, this issue is related to ALBERT and especially the V2 models, specifically the xlarge version 2.

TLDR: The ALBERT-xlarge V2 model seems to be different to other V2/V1 models.

The models are accessible through the HUB; in order to inspect them I save the checkpoints which I then load in the modeling.AlbertModel available in the modeling.py file. I use this script to save the checkpoint to a file.

In a different script, I load the checkpoint in a model from modeling.py (a line has to be added to the scope so that the modeling scope begins with module, same as the HUB module). I load the checkpoint in this script. In that same script I load a HUB module, and I compare the outputs of both models given the same input values.

For every model, I check that the difference is near to zero by checking the maximum difference between tensor values (included at the bottom of the second script). Here are the results:

ALBERT-BASE-V1 max difference: pooled 8.009374e-06, full transformer 2.3543835e-06
ALBERT-LARGE-V1 max difference: pooled 2.5719404e-05 full transformer 1.8417835e-05
ALBERT-XLARGE-V1 max difference: pooled 0.0006218478 full transformer 0.0
ALBERT-XXLARGE-V1 max difference: pooled 0.0 full transformer 1.0311604e-05

ALBERT-BASE-V2 max difference: pooled 2.3335218e-05 full transformer 4.9591064e-05
ALBERT-LARGE-V2 max difference: pooled 0.00015488267 full transformer 0.00010347366
ALBERT-XLARGE-V2 max difference: pooled 1.9535216 full transformer 5.152705
ALBERT-XXLARGE-V2 max difference: pooled 1.7762184e-05 full transformer 2.592802e-06

Is there an issue with this model in particular, does it have a particular architecture change that is different from the others?
I have had no problems replicating the SQuAD results on all of the V1 models, but I could not do so on the V2 models apart for the base one. Is this related? Thank you for your time.
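
The comparison described above (maximum absolute difference between the outputs of two models on the same input) can be sketched in plain Python. The numbers below are placeholders, not values from the scripts mentioned in this issue, and max_abs_difference is a hypothetical helper.

```python
def max_abs_difference(a, b):
    """Maximum absolute element-wise difference between two equally shaped nested lists."""
    if isinstance(a, (list, tuple)):
        if len(a) != len(b):
            raise ValueError("shape mismatch")
        return max((max_abs_difference(x, y) for x, y in zip(a, b)), default=0.0)
    return abs(a - b)


# Placeholder outputs standing in for the HUB module and the loaded checkpoint.
hub_output = [[0.10, -0.20], [0.30, 0.40]]
checkpoint_output = [[0.10, -0.20], [0.30, 0.45]]
diff = max_abs_difference(hub_output, checkpoint_output)
```

A value close to floating-point noise suggests the two models agree; values on the order of 1, as reported for xlarge V2 above, suggest they do not.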

tensorflow.python.framework.errors_impl.NotFoundError: Graph ops missing from the python registry ({'Einsum'}) are also absent from the c++ registry

I am trying to create a tensorflow-hub Module using a locally saved ALBERT pre-trained model, and I am getting this error:
tensorflow.python.framework.errors_impl.NotFoundError: Graph ops missing from the python registry ({'Einsum'}) are also absent from the c++ registry.
I have:
tensorflow: 1.14.0
tensorflow-hub: 0.7.0
opt-einsum: 3.1.0
I am working on macOS High Sierra 10.13.6.

[ALBERT] What are the parameter settings for training data generation?

Hi,

What parameter settings were used to generate the training data for your released models?

For example, what is the dupe_factor, and was whole word masking used? In the paper I only found max_seq_len, mask_probability, n_gram_mask, and shorter_seq_prob.

Thanks

[ALBERT] Fine-tune SQUAD, got error in `convert_examples_to_features()`

In particular, the error is IndexError: list index out of range in _convert_index()

In run_squad_sp.py, what is the file format for train_feature_file?

I can't figure out what I should pass for this parameter.

[ALBERT] Same amount of TPU memory consumption compared to BERT

Although the layers are shared in ALBERT, I could not run ALBERT with a larger batch size than the one I ran successfully with BERT.

The only thing I can suspect is that the TPU/GPU consumes the same amount of memory regardless of layer sharing.

Is this expected?
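
One way to reason about this: cross-layer sharing reduces the number of stored weights, but each layer application still produces its own activations, which are kept for the backward pass. The rough count below is an illustration under simplifying assumptions (12 layers, hidden size 768, weights approximated as 12·H² per layer, one H-sized hidden state per token per layer); it is not a measurement of either model.

```python
def encoder_weight_count(hidden, layers, shared):
    """Approximate encoder weights: 4*H^2 (attention) + 8*H^2 (feed-forward) per layer."""
    per_layer = 12 * hidden * hidden
    return per_layer if shared else per_layer * layers


def activation_count(batch, seq_len, hidden, layers):
    """One hidden-state tensor per layer application, whether weights are shared or not."""
    return batch * seq_len * hidden * layers


hidden, layers = 768, 12
unshared = encoder_weight_count(hidden, layers, shared=False)
shared = encoder_weight_count(hidden, layers, shared=True)
acts = activation_count(batch=32, seq_len=512, hidden=hidden, layers=layers)
print(unshared, shared, acts)
```

In this simplified count, sharing divides the weight count by the number of layers while the activation count is unchanged, and the activation count scales with batch size.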

squad error

I tried to run run_squad_sp.py and hit two errors. :(((( My command is as follows:

python -m run_squad_sp \
    --albert_config_file=data/assets/albert_config.json \
    --vocab_file=data/assets/30k-clean.vocab \
    --spm_model_file=data/assets/30k-clean.model \
    --output_dir=data/output \
    --train_file=data/train-v2.0.json \
    --predict_file=data/dev-v2.0.json \
    --train_feature_file=data/train.tfrecord \
    --predict_feature_file=data/dev.tfrecord \
    --init_checkpoint=data/variables/variables \
    --do_train \
    --do_predict \
    --nouse_tpu \
    --train_batch_size=32 \
    --predict_batch_size=8 \
    --num_train_steps=3 \
    --version_2_with_negative=True

At first, it caused this error, which I have solved.

Traceback (most recent call last):
  File "/home/bai/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/bai/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/bai/squad/run_squad_sp.py", line 1331, in <module>
    tf.app.run()
  File "/home/bai/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/home/bai/squad/run_squad_sp.py", line 1239, in main
    output_fn=train_writer.process_feature)
  File "/home/bai/squad/run_squad_sp.py", line 381, in convert_examples_to_features
    print("".join(para_tokens))
TypeError: sequence item 89: expected str instance, bytes found

Then I hit another error.

Traceback (most recent call last):
  File "/home/bai/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/bai/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/bai/squad/run_squad_sp.py", line 1334, in <module>
    tf.app.run()
  File "/home/bai/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/home/bai/squad/run_squad_sp.py", line 1242, in main
    output_fn=train_writer.process_feature)
  File "/home/bai/squad/run_squad_sp.py", line 461, in convert_examples_to_features
    n, is_start=False)
  File "/home/bai/squad/run_squad_sp.py", line 302, in _convert_index
    if index[pos] is not None:
IndexError: list index out of range

May I get some help here? Thanks.

Significantly lower than expected eval accuracy on MNLI

Before ALBERT was moved to this repository, I downloaded the pre-trained ALBERT-base-2 from TFHub and used run_classifier_sp.py to evaluate the model on MNLI by modifying the provided run.sh script to execute the following instead of run_pretraining_test:

 python -m albert.run_classifier_sp \
    --output_dir="/path/to/output" \
    --export_dir="/path/to/export" \
    --do_eval \
    --nouse_tpu \
    --eval_batch_size=1 \
    --max_seq_length=4 \
    --max_eval_steps=3 \
    --vocab_file="/path/to/albert-base-2/assets/30k-clean.vocab" \
    --data_dir="/path/to/glue/MNLI" \
    --task_name=MNLI

This gave an eval accuracy of approximately 0.34, which is significantly lower than the expected 0.84 discussed in the paper.

Has anyone else seen such low out-of-the-box evaluation results? Is this simply an issue with how I'm running the evaluation? If so, are there any recommendations for running evaluation to achieve better results?

[ALBERT] albert-xxlarge v1 `albert_config.json/30k-clean.model` file

I downloaded albert-xxlarge-v1 from https://storage.googleapis.com/tfhub-modules/google/albert_xxlarge/1.tar.gz
There is only the 30k-clean.model file in v1's assets dir,
but in albert-xxlarge-v2's assets dir there are three files: 30k-clean.vocab, albert_config.json, and 30k-clean.model.
How can I get the 30k-clean.vocab and albert_config.json files for albert-xxlarge-v1?

Incorrect English alphabet in line 402 in create_pretraining_data.py

  # Note(mingdachen):
  # For foreign characters, we always treat them as a whole piece.
  english_chars = set(list("abcdefghijklmnopqrstuvwhyz"))

The character h is listed twice, and x is missing.
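
A quick check of the quoted string against the standard lowercase alphabet confirms this:

```python
import string

english_chars = set("abcdefghijklmnopqrstuvwhyz")  # string quoted in this issue
missing = sorted(set(string.ascii_lowercase) - english_chars)
print(missing)  # ['x']
```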

layer norm after embedding layer

https://github.com/google-research/ALBERT/blob/66bf3e950830048f65794f7b5644ff0ae7b5fab6/modeling.py#L617

Is there a benchmark that compares results with and without this layer norm after the embedding layer?

Is export_dir in run_pretraining.py necessary?

export_dir is only created at L545 in run_pretraining.py; nothing else happens with it.

tf.gfile.MakeDirs(FLAGS.export_dir)

In the current script, unless the --export_dir option is specified, the script dies in the evaluation step.
Is export_dir necessary?

[ALBERT] FullTokenizer inconsistency - do_lower_case ignored when spm_model_file is specified

It looks like the FullTokenizer in tokenization.py does not respect the do_lower_case argument if spm_model_file is provided.
This is somewhat inconsistent, as the do_lower_case argument is not ignored when spm_model_file is not specified.
(During pre-training, the lower-case conversion and the Unicode normalization are done in tokenization.preprocess_text(); in the BERT-like tokenization, i.e. when no spm_model_file is specified, both are done in FullTokenizer.)
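
To make the comparison concrete, here is a minimal sketch of the kind of preprocessing described above (whitespace cleanup, Unicode normalization, optional lower-casing). The function name mirrors preprocess_text, but the body and the choice of NFKD normalization are assumptions for illustration rather than a copy of the repository code.

```python
import unicodedata


def preprocess_text(text, remove_space=True, lower=False):
    """Normalize text before SentencePiece encoding (illustrative sketch)."""
    if remove_space:
        text = " ".join(text.strip().split())
    # Decompose characters, then drop combining marks such as accents.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    if lower:
        text = text.lower()
    return text


cased = preprocess_text("  Héllo   World ")
uncased = preprocess_text("  Héllo   World ", lower=True)
```

If a tokenizer skips the lower flag at fine-tuning time while the pre-training data was lower-cased, the two pipelines would produce different pieces for the same input.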

[ALBERT] : Gradient for bert/embeddings/LayerNorm/gamma:0 is NaN : Tensor had NaN values [[node CheckNumerics_4 (defined at usr/local/lib/python3.5/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]

I tried changing the ALBERT optimizer to RAdam, but got this error after 1000 steps. I tried a lower batch size, but it still does not work.

Traceback (most recent call last):
File "/usr/lib/python3.5/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/mengqingyang0102/albert/run_squad_sp.py", line 1384, in <module>
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/home/mengqingyang0102/albert/run_squad_sp.py", line 1307, in main
estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3035, in train
rendezvous.raise_errors()
File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 136, in raise_errors
six.reraise(typ, value, traceback)
File "/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise
raise value
File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3030, in train
saving_listeners=saving_listeners)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
saving_listeners)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
raise six.reraise(*original_exc_info)
File "/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise
raise value
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1418, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1176, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: From /job:worker/replica:0/task:0:
Gradient for bert/embeddings/position_embeddings:0 is NaN : Tensor had NaN values
[[node CheckNumerics_2 (defined at usr/local/lib/python3.5/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Index Out of Range Error in tokenization using TF Hub for Pretrained Albert Models

I am getting an index out of range error in tokenization.py when fine-tuning an ALBERT large model with TF Hub. I printed the vocab file path and the token just before the error. You can see the error and printouts below.

Vocab File: b'/tmp/tfhub_modules/c88f9d4ac7469966b2fab3b577a8031ae23e125a/assets/30k-clean.model'
Token:  

Traceback (most recent call last):
  File "/home/[user]/anaconda3/envs/tf-gpu/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/[user]/anaconda3/envs/tf-gpu/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/[user]/Documents/ml-tests/falling-albert/albert/run_classifier_with_tfhub.py", line 318, in <module>
    tf.compat.v1.app.run()
  File "/home/[user]/anaconda3/envs/tf-gpu/lib/python3.7/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/[user]/anaconda3/envs/tf-gpu/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/[user]/anaconda3/envs/tf-gpu/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/home/[user]/Documents/ml-tests/falling-albert/albert/run_classifier_with_tfhub.py", line 185, in main
    tokenizer = create_tokenizer_from_hub_module(FLAGS.albert_hub_module_handle)
  File "/home/[user]/Documents/ml-tests/falling-albert/albert/run_classifier_with_tfhub.py", line 161, in create_tokenizer_from_hub_module
    spm_model_file=FLAGS.spm_model_file)
  File "/home/[user]/Documents/ml-tests/falling-albert/albert/tokenization.py", line 249, in __init__
    self.vocab = load_vocab(vocab_file)
  File "/home/[user]/Documents/ml-tests/falling-albert/albert/tokenization.py", line 203, in load_vocab
    token = token.strip().split()[0]
IndexError: list index out of range

Albert Finetune Shell Script

#!/bin/bash
pip install -r albert/requirements.txt
python -m albert.run_classifier_with_tfhub \
--albert_hub_module_handle=https://tfhub.dev/google/albert_xlarge/1 \
--task_name=cola \
--do_train=true \
--do_eval=true  \
--data_dir=./data-to-albert \
--max_seq_length=128  \
--train_batch_size=32  \
--learning_rate=2e-05 \
--num_train_epochs=3.0  \
--output_dir=./checkpoints/test
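
The failing expression in the traceback above, token.strip().split()[0], raises IndexError whenever a line is empty or contains only whitespace, which matches the empty Token printed before the error. The sketch below reproduces that and shows one defensive variant; first_field and load_vocab_lines are hypothetical helpers, not the repository's load_vocab. Whether a blank line is the root cause is not established: the path printed above ends in 30k-clean.model, so the file being parsed line by line may be the SentencePiece model rather than a text vocab.

```python
def first_field(line):
    """Return the first whitespace-separated field, or None for a blank line."""
    fields = line.strip().split()
    return fields[0] if fields else None


def load_vocab_lines(lines):
    """Map token -> index in order of appearance, skipping lines with no token."""
    vocab = {}
    for line in lines:
        token = first_field(line)
        if token is None:
            continue
        vocab.setdefault(token, len(vocab))
    return vocab


# Reproduce the error on a whitespace-only line.
try:
    "   \n".strip().split()[0]
    raised = False
except IndexError:
    raised = True

vocab = load_vocab_lines(["<pad>\t0\n", "\n", "<unk>\t0\n", "▁the\t-3.2\n"])
```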

[ALBERT]: LookupError: gradient registry has no entry for: AddV2

When I run run_classifier_with_tfhub.py, the training crashes. The error is:

LookupError: No gradient defined for operation 'module_apply_tokens/bert/encoder/transformer/group_0_11/layer_11/inner_group_0/LayerNorm_1/batchnorm/add_1' (op type: AddV2)

My tensorflow-gpu version is 1.14.0

Does anyone know the reason? Please help.
Thanks

Any plans for a multilingual ALBERT model release?

[ALBERT] Checkpoints for pretrained models

Hello!
Thank you for releasing the code for Albert!
Could you upload the pre-trained checkpoints for the 4 Albert models? I would like to run run_squad_sp.py directly for finetuning on SQuAD.

Could you also release instructions on how to run SQuAD using TensorFlow Hub directly (similar to run_classifier_with_tfhub.py)?

Thanks in advance!

How to train and test the ALBERT model.

I still haven't found proper documentation on how to train and test the model.

I somehow fixed the training issues using the input from https://github.com/google-research/google-research/issues/84

Now I have the trained checkpoints, but I am not able to test them because vocab.txt is missing. I tried to use vocab.txt from the BERT model, but it still failed.

Did anyone figure out a way to test the model?

My arguments for training:

!python -m run_classifier_with_tfhub \
  --albert_hub_module_handle=https://tfhub.dev/google/albert_base/1 --task_name=cola --do_train=true --do_eval=true --data_dir=./dataset --output_dir=./albert_output/ --max_seq_length=64  --train_batch_size=2 --learning_rate=2e-5 --num_train_epochs=3.0

My arguments for testing:
!python run_classifier_sp.py --task_name=cola --do_predict=true --data_dir=./dataset --albert_config_file=./model/2/assets/albert_config.json --init_checkpoint=./albert_output/model.ckpt-906 --vocab_file=./model/vocab.txt --max_seq_length=64 --output_dir=./albert_output/

Fine-tuning Albert large

I fine-tuned ALBERT base on my task but did not get the desired accuracy. Now that I am trying to fine-tune ALBERT large, I get this error:
"Resource exhausted: OOM when allocating tensor with shape[8,512,16,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc"

I used a single GPU, with 12 GB of memory and also with 16 GB (two different attempts).
Interestingly, I can fine-tune BERT base on the single GPU with 12 GB of memory.
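As a rough sanity check on the numbers in that error message, the sketch below computes how much memory a single dense float32 tensor of a given shape needs. The shape [8,512,16,64] is copied from the error; the attention-score shape `[batch, heads, seq_len, seq_len]` and the shorter sequence length of 128 are assumptions used only for comparison, not values taken from the report. The tensor named in an OOM message is simply the request that could not be satisfied on top of what was already allocated, so its own size can be small.

```python
from functools import reduce
from operator import mul

BYTES_PER_FLOAT32 = 4


def tensor_bytes(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Return the bytes needed for one dense tensor of the given shape."""
    return reduce(mul, shape, 1) * bytes_per_element


# Shape copied from the error message.
failed = tensor_bytes([8, 512, 16, 64])

# Assumption: one attention-score tensor is [batch, heads, seq_len, seq_len].
scores_512 = tensor_bytes([8, 16, 512, 512])
scores_128 = tensor_bytes([8, 16, 128, 128])  # hypothetical shorter sequences

print(f"failed allocation:        {failed / 2**20:.0f} MiB")
print(f"attention scores, len 512: {scores_512 / 2**20:.0f} MiB")
print(f"attention scores, len 128: {scores_128 / 2**20:.0f} MiB")
```

Under these assumptions the quadratic dependence on sequence length dominates, which is why reducing `--max_seq_length` or `--train_batch_size` is a common first thing to try when memory runs out.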

Multilingual ALBERT

Do you plan to release multilingual pre-trained models like BERT? Would appreciate it.

[UNK] token in v2 models

I downloaded albert_xxl v2. In the file assets/30k-clean.vocab, the entry for [UNK] looks like:

<unk> 0

while in tokenization.py it is:

class WordpieceTokenizer(object):
  """Runs WordPiece tokenziation."""

  def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200):

So I'm getting the error below. Is it OK to modify tokenization.py, or am I doing something wrong?

input_ids = tokenizer.convert_tokens_to_ids(ntokens)

File "J:\albert\tokenization.py", line 269, in convert_tokens_to_ids
return convert_by_vocab(self.vocab, tokens)
File "J:\albert\tokenization.py", line 211, in convert_by_vocab
output.append(vocab[item])
KeyError: '[UNK]'
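One way to confirm which spelling of the unknown token a vocab file actually contains is to parse it and check both candidates. The following is a minimal sketch, not the repository's loader: `load_sentencepiece_vocab` is a hypothetical helper, and the sample lines are made up in the style of the `<unk> 0` entry quoted above.

```python
def load_sentencepiece_vocab(lines):
    """Map each token to its index, given 'token<whitespace>score' lines."""
    vocab = {}
    for line in lines:
        parts = line.strip().split()
        if not parts:  # ignore blank lines
            continue
        vocab[parts[0]] = len(vocab)
    return vocab


# Hypothetical excerpt; only the "<unk> 0" entry comes from the report.
sample = ["<unk>\t0", "\u2581the\t-3.0"]
vocab = load_sentencepiece_vocab(sample)

for candidate in ("[UNK]", "<unk>"):
    status = "present" if candidate in vocab else "missing"
    print(f"{candidate}: {status}")

# A plain vocab["[UNK]"] lookup raises KeyError here, matching the traceback.
unk_id = vocab.get("[UNK]", vocab.get("<unk>"))
print("unknown-token id:", unk_id)
```

If the check shows only `<unk>`, the mismatch is between the token name the tokenizer is asked to look up and the one stored in the file.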

KeyError: '[CLS]' in tokenization.py

    !python -m albert.run_classifier_with_tfhub \
        --albert_hub_module_handle=https://tfhub.dev/google/albert_large/2 \
        --data_dir=input \
        --task_name=cola \
        --do_train=True \
        --do_eval=True \
        --train_batch_size=16 \
        --eval_batch_size=16 \
        --max_seq_length=128 \
        --learning_rate=1e-4 \
        --num_train_epochs=2 \
        --output_dir=output

Run on Colab with tensorflow==1.15.0:

INFO:tensorflow:Writing example 0 of 123163
I1104 20:48:08.980587 140516005656448 run_classifier_sp.py:804] Writing example 0 of 123163
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/content/albert/run_classifier_with_tfhub.py", line 319, in <module>
tf.app.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/content/albert/run_classifier_with_tfhub.py", line 233, in main
train_examples, label_list, FLAGS.max_seq_length, tokenizer)
File "/content/albert/run_classifier_sp.py", line 807, in convert_examples_to_features
max_seq_length, tokenizer)
File "/content/albert/run_classifier_sp.py", line 464, in convert_single_example
input_ids = tokenizer.convert_tokens_to_ids(tokens)
File "/content/albert/tokenization.py", line 270, in convert_tokens_to_ids
return convert_by_vocab(self.vocab, tokens)
File "/content/albert/tokenization.py", line 212, in convert_by_vocab
output.append(vocab[item])
KeyError: '[CLS]'

[ALBERT] TFHub assets/albert_config.json has 0 for dropouts

I noticed that the TFHub released ALBERT v2 models specify zeros for attention_probs_dropout_prob and hidden_dropout_prob in the assets/albert_config.json:

{
  "attention_probs_dropout_prob": 0,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0,
  ...
}

However, the [README](https://tfhub.dev/google/albert_base/2) on TF Hub specifies:
```json
{
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  ...
}
```

Maybe one of them (README or assets/albert_config.json) could be updated?

I'm also wondering if it is a good idea to provide a do_lower_case flag somewhere under assets/ - just as a minimal specification for the required text pre-processing?
Probably such a do_lower_case belongs to the sentencepiece model, what do you think?
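For anyone who wants to experiment with the README values while the discrepancy is open, the sketch below loads a config file and overrides the two dropout fields in memory. It is only an illustration: `load_config_with_dropout` is a hypothetical helper, the temporary file stands in for assets/albert_config.json, and whether zero dropout is intended for the v2 models is not settled in this thread.

```python
import json
import os
import tempfile


def load_config_with_dropout(path, hidden_dropout=0.1, attention_dropout=0.1):
    """Load a JSON config and override its two dropout fields in memory."""
    with open(path, encoding="utf-8") as f:
        config = json.load(f)
    config["hidden_dropout_prob"] = hidden_dropout
    config["attention_probs_dropout_prob"] = attention_dropout
    return config


# Stand-in file containing only the three fields quoted above.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "albert_config.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(
            {
                "attention_probs_dropout_prob": 0,
                "hidden_act": "gelu",
                "hidden_dropout_prob": 0,
            },
            f,
        )
    config = load_config_with_dropout(path)

print(config)
```

The file on disk is left unchanged, so the same asset can still be used with its original values.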

LookupError: No gradient defined for operation 'module_apply_tokens/bert/encoder/transformer/group_0_11/layer_11/inner_group_0/ffn_1/intermediate/output/dense/einsum/Einsum' (op type: Einsum)

I am using run_classifier_with_tfhub with --albert_hub_module_handle=https://tfhub.dev/google/albert_base/2.

I am getting the following error: "LookupError: No gradient defined for operation 'module_apply_tokens/bert/encoder/transformer/group_0_11/layer_11/inner_group_0/ffn_1/intermediate/output/dense/einsum/Einsum' (op type: Einsum)"

The command is:
python3 -m run_classifier_with_tfhub --data_dir=../../DataSet/CoLA/ --task_name=cola --output_dir=testing_ttt --vocab_file=vocab.txt --albert_hub_module_handle=https://tfhub.dev/google/albert_base/2 --do_train=True --do_eval=True --max_seq_length=128 --train_batch_size=32 --learning_rate=2e-05 --num_train_epochs=3.0

I am using tensorflow==1.15.0

Training time

I am considering training ALBERT from scratch in another language on a single TPU v3 (128 GB). I have a corpus of around 2B words.

Would this be a sufficient corpus size? Could you give a rough estimate of how long this would take for the various models?

Is it possible to reload a pretrained tfhub albert in tf2.0?

For example:
import tensorflow_hub as hub

ALBERT_PATH = "xxx"  # a pretrained TF Hub ALBERT model
albert_layer = hub.KerasLayer(ALBERT_PATH, trainable=True)

[ALBERT] Tokenization crashes while trying to finetune classifier with TF Hub model

I'm trying to get ALBERT running locally with the following command line:
python -m albert.run_classifier_with_tfhub --task_name=MNLI --data_dir=./multinli_1.0 --albert_hub_module_handle=https://tfhub.dev/google/albert_large/1 --output_dir=./output --do_train=True

When the tokenizer is initialized from the TF Hub model, it crashes:

Traceback (most recent call last):
  File "/Users/vladimirbugay/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Users/vladimirbugay/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/vladimirbugay/Knoema/GitHub/google-research/albert/run_classifier_with_tfhub.py", line 320, in <module>
    tf.app.run()
  File "/Users/vladimirbugay/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/Users/vladimirbugay/anaconda3/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/Users/vladimirbugay/anaconda3/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/Users/vladimirbugay/Knoema/GitHub/google-research/albert/run_classifier_with_tfhub.py", line 187, in main
    tokenizer = create_tokenizer_from_hub_module(FLAGS.albert_hub_module_handle)
  File "/Users/vladimirbugay/Knoema/GitHub/google-research/albert/run_classifier_with_tfhub.py", line 161, in create_tokenizer_from_hub_module
    spm_model_file=FLAGS.spm_model_file)
  File "/Users/vladimirbugay/Knoema/GitHub/google-research/albert/tokenization.py", line 247, in __init__
    self.vocab = load_vocab(vocab_file)
  File "/Users/vladimirbugay/Knoema/GitHub/google-research/albert/tokenization.py", line 201, in load_vocab
    token = token.strip().split()[0]
IndexError: list index out of range

The cause is a line that consists only of a newline character '\n'. However, even if I modify the code to ignore such lines, it still crashes later with:

Traceback (most recent call last):
  File "/Users/vladimirbugay/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Users/vladimirbugay/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/vladimirbugay/Knoema/GitHub/google-research/albert/run_classifier_with_tfhub.py", line 320, in <module>
    tf.app.run()
  File "/Users/vladimirbugay/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/Users/vladimirbugay/anaconda3/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/Users/vladimirbugay/anaconda3/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/Users/vladimirbugay/Knoema/GitHub/google-research/albert/run_classifier_with_tfhub.py", line 187, in main
    tokenizer = create_tokenizer_from_hub_module(FLAGS.albert_hub_module_handle)
  File "/Users/vladimirbugay/Knoema/GitHub/google-research/albert/run_classifier_with_tfhub.py", line 161, in create_tokenizer_from_hub_module
    spm_model_file=FLAGS.spm_model_file)
  File "/Users/vladimirbugay/Knoema/GitHub/google-research/albert/tokenization.py", line 249, in __init__
    self.vocab = load_vocab(vocab_file)
  File "/Users/vladimirbugay/Knoema/GitHub/google-research/albert/tokenization.py", line 198, in load_vocab
    token = convert_to_unicode(reader.readline())
  File "/Users/vladimirbugay/anaconda3/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 179, in readline
    return self._prepare_value(self._read_buf.ReadLineAsString())
  File "/Users/vladimirbugay/anaconda3/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 98, in _prepare_value
    return compat.as_str_any(val)
  File "/Users/vladimirbugay/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/compat.py", line 117, in as_str_any
    return as_str(value)
  File "/Users/vladimirbugay/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/compat.py", line 87, in as_text
    return bytes_or_text.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 8: invalid start byte
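The two failures above (an empty line, then a byte that is not valid UTF-8) can be reproduced and reported more clearly with a small standalone loader. This is a sketch under assumptions, not the repository's `load_vocab`: `load_vocab_lenient` is a hypothetical helper, both sample files are made up, and the idea that the file being read is binary rather than a text vocab is only an inference from the `UnicodeDecodeError`.

```python
import os
import tempfile


def load_vocab_lenient(path):
    """Read a text vocab, skipping blank lines and failing clearly on non-UTF-8 data."""
    vocab = {}
    with open(path, "rb") as reader:
        for line_no, raw in enumerate(reader, start=1):
            try:
                line = raw.decode("utf-8")
            except UnicodeDecodeError as err:
                raise ValueError(
                    f"{path}: line {line_no} is not UTF-8 text; "
                    "is this a binary file rather than a vocab?"
                ) from err
            parts = line.strip().split()
            if not parts:  # a bare newline would otherwise raise IndexError
                continue
            vocab[parts[0]] = len(vocab)
    return vocab


with tempfile.TemporaryDirectory() as tmp:
    # Made-up text vocab containing one blank line.
    text_path = os.path.join(tmp, "vocab.txt")
    with open(text_path, "wb") as f:
        f.write(b"<unk>\t0\n\n\xe2\x96\x81the\t-3.0\n")
    vocab = load_vocab_lenient(text_path)

    # Made-up file containing the byte 0xc0, which is invalid in UTF-8.
    binary_path = os.path.join(tmp, "not_a_vocab.bin")
    with open(binary_path, "wb") as f:
        f.write(b"<unk>\xc0\x80\n")
    try:
        load_vocab_lenient(binary_path)
        binary_error = None
    except ValueError as err:
        binary_error = str(err)

print(vocab)
print(binary_error)
```

If the second case is what happens with the TF Hub asset, skipping blank lines alone will not help, and the more useful question is which file the tokenizer is being pointed at.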

I'm running the code on OS X Catalina, Anaconda, Python 3.6

sentencepiece             0.1.83                   pypi_0    pypi
tensorflow                1.14.0          mkl_py36h933f829_0  
tensorflow-base           1.14.0          mkl_py36h655c25b_0  
tensorflow-estimator      1.14.0                     py_0  
tensorflow-hub            0.6.0              pyhe1b5a44_0    conda-forge