google-research / albert
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
License: Apache License 2.0
Training from scratch is very expensive. Does anybody know how to continue training ALBERT from the exported model?
Thanks
Is there a plan to release multilingual pre-trained model like you did in BERT?
Is it possible to train Albert from scratch in another language using a TPU v3 (128Gb)?
Could you give an estimated training time? Days, weeks, months?
What is a reasonable corpus size? 1B words? Should the seq_length be reduced from the default 512?
Can anybody tell me how to generate the vocab file for an ALBERT model?
Thanks.
Hi all,
Could I do pre-training on a TPU Pod v2-256 with the large/xlarge V2 config (batch 4096, 3M steps, ...)?
Is there a config that works for it?
Thanks for releasing the ALBERT code.
The default for the --random_next_sentence option in create_pretraining_data.py is "True", but to enable the sentence order prediction (SOP) task this option has to be set to "False", right? If so, it would be better to make "False" the default.
The explanation of the --random_next_sentence option in the source code also seems strange: it describes the behaviour for "False".
Thanks.
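A minimal sketch of the difference this flag is described as controlling, assuming that "True" draws the second segment from another document (as in BERT's next sentence prediction) and "False" keeps two consecutive segments and swaps their order for negative examples. The helper make_pair and its arguments are hypothetical and are not taken from create_pretraining_data.py.

```python
import random


def make_pair(doc, other_doc, index, random_next_sentence, rng):
    """Return (segment_a, segment_b, label) for one training pair.

    label is 1 for a negative example (random or swapped), 0 otherwise.
    Hypothetical helper, for illustration only.
    """
    seg_a, seg_b = doc[index], doc[index + 1]
    is_negative = rng.random() < 0.5
    if not is_negative:
        return seg_a, seg_b, 0
    if random_next_sentence:
        # NSP-style negative: the second segment comes from another document.
        return seg_a, rng.choice(other_doc), 1
    # SOP-style negative: the same two consecutive segments, order swapped.
    return seg_b, seg_a, 1


rng = random.Random(0)
doc = ["the cat sat .", "it purred ."]
other = ["stocks fell .", "rain is expected ."]
sop_pairs = [make_pair(doc, other, 0, False, rng) for _ in range(20)]
nsp_pairs = [make_pair(doc, other, 0, True, rng) for _ in range(20)]
```

With the flag off, every pair contains the same two segments and only their order carries the label, which is what makes the task about sentence order rather than topic.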
In the encode_pieces function of the tokenization script (specifically, lines 122 ~ 126),
I can't identify a single case that executes line 122, marked with '>' in the code below.
(I ran it over my sample corpus (1 GB) but could not find any.)
if not sample:
  pieces = sp_model.EncodeAsPieces(text)
else:
  pieces = sp_model.SampleEncodeAsPieces(text, 64, 0.1)
new_pieces = []
for piece in pieces:
  piece = printable_text(piece)
  if len(piece) > 1 and piece[-1] == "," and piece[-2].isdigit():
    cur_pieces = sp_model.EncodeAsPieces(
        six.ensure_binary(piece[:-1]).replace(SPIECE_UNDERLINE, b""))
>   if piece[0] != SPIECE_UNDERLINE and cur_pieces[0][0] == SPIECE_UNDERLINE:
      if len(cur_pieces[0]) == 1:
        cur_pieces = cur_pieces[1:]
      else:
        cur_pieces[0] = cur_pieces[0][1:]
    cur_pieces.append(piece[-1])
    new_pieces.extend(cur_pieces)
  else:
    new_pieces.append(piece)
I was able to find cases that execute line 119 when running the sample corpus (1 GB): piece = '▁2011' or '▁11,' etc.,
but none of them executes line 122.
What I found:
1) 'piece[0] != SPIECE_UNDERLINE' in line 122 below always has to be True (?)
> if piece[0] != SPIECE_UNDERLINE and cur_pieces[0][0] == SPIECE_UNDERLINE:
Since piece is a string, piece[0] is a single character (including the case when piece = '▁'),
and SPIECE_UNDERLINE is a UTF-8 encoded bytes object representing '▁'.
piece[0] != SPIECE_UNDERLINE
# always True
2) cur_pieces[0][0] refers to b'\xe2' in b'\xe2\x96\x81'.
Since cur_pieces is the result of sp_model.EncodeAsPieces, cur_pieces[0] is equal to b'\xe2\x96\x81' + some characters, because cur_pieces[0] is the first token (the sentencepiece model always prepends '▁' to the first token).
Then cur_pieces[0][0] refers to b'\xe2', which is equal to 226.
cur_pieces[0][0]
#226
ord(b'\xe2')
#226
So cur_pieces[0][0] == SPIECE_UNDERLINE is always False in my cases.
Is there any explanation or reason for implementing lines 122 ~ 126?
Could you share sample sentences or words that execute that code block?
Darwin-18.5.0-x86_64-i386-64bit
Python 3.6.8 |Anaconda, Inc.| (default, Dec 29 2018, 19:04:46)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
tensorflow 1.13.1
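The two observations above can be reproduced in plain Python 3 without sentencepiece. This is only a sketch: SPIECE_UNDERLINE is defined as UTF-8 bytes as described in the post, and first_token is a stand-in for a bytes token such as cur_pieces[0], not real tokenizer output.

```python
SPIECE_UNDERLINE = u"▁".encode("utf-8")  # bytes, as described in the post

piece = u"▁11,"                         # a str piece like the ones reported
first_token = u"▁11".encode("utf-8")    # stand-in for a bytes cur_pieces[0]

# A str never compares equal to bytes in Python 3, so "!=" is always True.
str_vs_bytes_differs = piece[0] != SPIECE_UNDERLINE

# Indexing bytes yields an int, which never equals a bytes object either.
first_byte = first_token[0]
int_vs_bytes_equal = first_byte == SPIECE_UNDERLINE

# startswith (or slicing) keeps the bytes type and compares as intended.
prefix_matches = first_token.startswith(SPIECE_UNDERLINE)

print(str_vs_bytes_differs, first_byte, int_vs_bytes_equal, prefix_matches)
# True 226 False True
```

Under Python 2, where str and bytes are the same type, both comparisons could succeed, which may be why the branch exists.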
Hi all,
Given parameters such as batch size, max sequence length, ...,
how can I work out how many megabytes/gigabytes of TPU/GPU memory are needed to train the model?
Could you please give me an example?
Thank you.
Hi,
thanks for releasing implementation and pre-trained models for ALBERT ❤️
I would really like to train a model from scratch. Would this be possible with one v3-8 TPU 🤔? If so, could you also give a detailed overview of the parameters you used for training the base or larger models?
I think the "train for x steps with sequence length 128, then fine-tune the model with sequence length 512" approach is not used in the paper, so could you specify the parameters for the create_pretraining_data.py script (it uses a sequence length of 128 by default)?
Many thanks in advance!
Hi,
Thanks very much for releasing the pre-trained code for ALBERT. I have downloaded the TF Hub model from https://tfhub.dev/google/albert_base/1. However, the accuracy on CoLA and MNLI seems to indicate that this model does not contain the pre-trained weights for ALBERT.
I have tried:
python3 -m run_classifier_sp --data_dir=../../DataSet/CoLA/ --task_name=cola --output_dir=testing_sp_modifying_model_setting_in_main_file --vocab_file=vocab.txt --albert_config_file=2/assets/albert_config.json --do_train=False --do_eval=True --max_seq_length=128 --train_batch_size=32 --learning_rate=2e-05 --num_train_epochs=3.0
With do_train set to False, I get an accuracy of around 40%;
however, with do_train set to True, I am able to get an accuracy of 70%.
Besides, after downloading the weights from TF Hub, I can only find the .json file, but no .ckpt file in the folder.
How can we find the pre-trained weights, i.e. the .ckpt file, for ALBERT?
I am trying to run Albert model on SQUAD dataset. In case SP model is not used, convert_examples_to_features will not go through. Please let me know, where I can find SP model.
I used albert_base_v2 and tfhub on python3.6 and tensorflow 1.15.0.
I tried to wrap the ALBERT output layer with a Lambda layer, and it gives the following errors:
---------------------------------------------------------------------------
FailedPreconditionError Traceback (most recent call last)
<ipython-input-6-d725e62752e5> in <module>()
----> 1 dense = keras.layers.Lambda(lambda x: x[:, 0])(albert_outputs)
7 frames
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py in __call__(self, *args, **kwargs)
1470 ret = tf_session.TF_SessionRunCallable(self._session._session,
1471 self._handle, args,
-> 1472 run_metadata_ptr)
1473 if run_metadata:
1474 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)
FailedPreconditionError: 2 root error(s) found.
(0) Failed precondition: Error while reading resource variable module/bert/encoder/transformer/group_0/inner_group_0/ffn_1/intermediate/output/dense/bias from Container: localhost. This could mean that the variable was uninitialized. Not found: Container localhost does not exist. (Could not find resource: localhost/module/bert/encoder/transformer/group_0/inner_group_0/ffn_1/intermediate/output/dense/bias)
[[{{node module_apply_tokens/bert/encoder/transformer/group_0_11/layer_11/inner_group_0/ffn_1/intermediate/output/dense/add/ReadVariableOp}}]]
(1) Failed precondition: Error while reading resource variable module/bert/encoder/transformer/group_0/inner_group_0/ffn_1/intermediate/output/dense/bias from Container: localhost. This could mean that the variable was uninitialized. Not found: Container localhost does not exist. (Could not find resource: localhost/module/bert/encoder/transformer/group_0/inner_group_0/ffn_1/intermediate/output/dense/bias)
[[{{node module_apply_tokens/bert/encoder/transformer/group_0_11/layer_11/inner_group_0/ffn_1/intermediate/output/dense/add/ReadVariableOp}}]]
[[module_apply_tokens/bert/encoder/transformer/group_0_11/layer_11/inner_group_0/ffn_1/intermediate/output/dense/add/ReadVariableOp/_1]]
0 successful operations.
0 derived errors ignored.
Here is my code:
import tensorflow as tf
from tensorflow import keras
import tensorflow_hub as hub
q_in = keras.layers.Input(shape=(None,), dtype=tf.int32, name="q_input_word_ids")
q2_in = keras.layers.Input(shape=(None,), dtype=tf.int32, name="q_input_masks")
q3_in = keras.layers.Input(shape=(None,), dtype=tf.int32, name="q_segment_ids")
albert_module = hub.Module('https://tfhub.dev/google/albert_base/2', trainable=True)
albert_inputs = dict(input_ids=q_in, input_mask=q2_in, segment_ids=q3_in)
albert_outputs = albert_module(albert_inputs, signature="tokens", as_dict=True)["sequence_output"]
Up till now everything runs fine. But when I run the following, it gives the error above.
dense = keras.layers.Lambda(lambda x: x[:, 0])(albert_outputs)
Any help is appreciated!
I converted the TF weights to PyTorch weights, and on the QQP dataset I only get 87% accuracy.
model: albert-base
epochs: 3
learning_rate: 2e-5
batch size: 24
max sequence length: 128
warmup_proportion: 0.1
When I run run_squad_sp.py, an error occurs:
Traceback (most recent call last):
File "run_squad_sp.py", line 1350, in <module>
tf.app.run()
File "/usr/local/app/.conda/envs/convlab/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "run_squad_sp.py", line 1257, in main
output_fn=train_writer.process_feature)
File "run_squad_sp.py", line 396, in convert_examples_to_features
tok_cat_text = "".join(para_tokens).replace(
TypeError: sequence item 89: expected str instance, bytes found
I found that para_tokens looks like this:
['▁greece', '▁attract', 's', '▁more', '▁than', '▁16', '▁million', '▁tourists', '▁each', '▁year', ',', '▁thus', '▁contributing', '▁18', '.', '2%', '▁to', '▁the', '▁nation', "'", 's', '▁gdp', '▁in', '▁2008', '▁according', '▁to', '▁an', '▁', 'oe', 'cd', '▁report', '.', '▁the', '▁same', '▁survey', '▁showed', '▁that', '▁the', '▁average', '▁tourist', '▁expenditure', '▁while', '▁in', '▁greece', '▁was', '▁$1', ',', '07', '3', ',', '▁ranking', '▁greece', '▁10', 'th', '▁in', '▁the', '▁world', '.', '▁the', '▁number', '▁of', '▁jobs', '▁directly', '▁or', '▁indirectly', '▁related', '▁to', '▁the', '▁tourism', '▁sector', '▁were', '▁8', '40,000', '▁in', '▁2008', '▁and', '▁represented', '▁', '19%', '▁of', '▁the', '▁country', "'", 's', '▁total', '▁labor', '▁force', '.', '▁in', "b'\\xe2\\x96\\x812009'", ',', '▁greece', '▁welcomed', '▁over', '▁19', '.', '3', '▁million', '▁tourists', ',', '▁a', '▁major', '▁increase', '▁from', '▁the', '▁17', '.', '7', '▁million', '▁tourists', '▁the', '▁country', '▁welcomed', '▁in', '▁2008', '.']
So I modified my code from
tok_cat_text = "".join(para_tokens).replace(tokenization.SPIECE_UNDERLINE.decode("utf-8"), " ")
to
tok_cat_text = "".join([w.decode() if type(w)==bytes else w for w in para_tokens]).replace( tokenization.SPIECE_UNDERLINE.decode("utf-8"), " ")
and another error occurs, as below:
Traceback (most recent call last):
File "run_squad_sp.py", line 1350, in <module>
tf.app.run()
File "/usr/local/app/.conda/envs/convlab/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "run_squad_sp.py", line 1257, in main
output_fn=train_writer.process_feature)
File "run_squad_sp.py", line 471, in convert_examples_to_features
n, is_start=False)
File "run_squad_sp.py", line 316, in _convert_index
if index[pos] is not None:
IndexError: list index out of range
what's wrong with it? thx
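The workaround for the first error can be written as a small helper. This is a sketch under the assumption that para_tokens mixes str and UTF-8 bytes pieces, as the TypeError suggests; to_text and join_tokens are hypothetical names, not functions from run_squad_sp.py, and the sketch does not address the later IndexError in _convert_index.

```python
SPIECE_UNDERLINE = u"▁"


def to_text(token, encoding="utf-8"):
    """Return token as str, decoding it first if it is bytes."""
    if isinstance(token, bytes):
        return token.decode(encoding)
    return token


def join_tokens(tokens):
    """Concatenate mixed str/bytes pieces and turn '▁' markers into spaces."""
    return "".join(to_text(t) for t in tokens).replace(SPIECE_UNDERLINE, " ")


para_tokens = [u"▁in", u"▁2009".encode("utf-8"), u",", u"▁greece"]
print(repr(join_tokens(para_tokens)))  # ' in 2009, greece'
```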
Hi All,
I exported the ALBERT code to Google Drive and mounted it in Google Colab. The code does not run because of many issues, such as tensorflow.contrib not being found. Could anybody suggest how to run the ALBERT code in Google Colab?
Thanks for sharing great work,
The current implementation of albert/run_squad_sp.py requires a vocab file to work.
Is it the same as for BERT pre-trained archives (https://github.com/google-research/bert)?
I see the default value is 40, which is large. Does this parameter significantly affect performance?
Hi,
I would like to use the ALBERT model on the SQuAD data set. Is "run_squad_sp.py" the right script for this purpose? Is it enough to run the code in this file alone, or should I run any other .py files in the albert folder beforehand?
I was expecting some documentation on using ALBERT, but unfortunately none is available.
I want to modify something based on run_classification.py; however, it seems only TF Hub pretrained models have been released. Could you provide raw-format TF checkpoints?
You say that you are using "no dropout" on the TFHub v2 models. However, looking at the albert_config.json files, there seems to be dropout on most models (https://tfhub.dev/google/albert_base/2). Only on the xxlarge is there no dropout (https://tfhub.dev/google/albert_xxlarge/2). Which is correct?
Hi, this issue is related to ALBERT and especially the V2 models, specifically the xlarge version 2.
TLDR: The ALBERT-xlarge V2 model seems to be different to other V2/V1 models.
The models are accessible through the HUB; in order to inspect them I save the checkpoints, which I then load in the modeling.AlbertModel available in the modeling.py file. I use this script to save the checkpoint to a file.
In a different script, I load the checkpoint in a model from modeling.py (a line has to be added to the scope so that the modeling scope begins with module, same as the HUB module). I load the checkpoint in this script. In that same script I load a HUB module, and I compare the outputs of both models given the same input values.
For every model, I check that the difference is near to zero by checking the maximum difference between tensor values (included at the bottom of the second script). Here are the results:
ALBERT-BASE-V1 max difference: pooled 8.009374e-06, full transformer 2.3543835e-06
ALBERT-LARGE-V1 max difference: pooled 2.5719404e-05, full transformer 1.8417835e-05
ALBERT-XLARGE-V1 max difference: pooled 0.0006218478, full transformer 0.0
ALBERT-XXLARGE-V1 max difference: pooled 0.0, full transformer 1.0311604e-05
ALBERT-BASE-V2 max difference: pooled 2.3335218e-05, full transformer 4.9591064e-05
ALBERT-LARGE-V2 max difference: pooled 0.00015488267, full transformer 0.00010347366
ALBERT-XLARGE-V2 max difference: pooled 1.9535216, full transformer 5.152705
ALBERT-XXLARGE-V2 max difference: pooled 1.7762184e-05, full transformer 2.592802e-06
Is there an issue with this model in particular, does it have a particular architecture change that is different from the others?
I have had no problems replicating the SQuAD results on all of the V1 models, but I could not do so on the V2 models apart for the base one. Is this related? Thank you for your time.
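The comparison described above (maximum difference between tensor values) can be sketched in plain Python on nested lists. The scripts referred to in the post are not shown here, so max_abs_diff, outputs_match and the tolerance of 1e-3 are assumptions chosen for illustration, and the sample values are made up.

```python
def max_abs_diff(a, b):
    """Largest absolute element-wise difference between two nested lists."""
    if isinstance(a, (list, tuple)):
        if len(a) != len(b):
            raise ValueError("shape mismatch")
        return max((max_abs_diff(x, y) for x, y in zip(a, b)), default=0.0)
    return abs(a - b)


def outputs_match(a, b, tol=1e-3):
    """True if no element differs by more than tol (an arbitrary threshold)."""
    return max_abs_diff(a, b) <= tol


hub_out = [[0.10, -0.20], [0.30, 0.40]]    # made-up values
ckpt_out = [[0.10, -0.20], [0.30, 0.45]]   # made-up values
print(outputs_match(hub_out, hub_out), outputs_match(hub_out, ckpt_out))
# True False
```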
I am trying to create a tensorflow-hub Module using a locally saved ALBERT pre-trained model, and I am getting this error:
tensorflow.python.framework.errors_impl.NotFoundError: Graph ops missing from the python registry ({'Einsum'}) are also absent from the c++ registry.
I have:
tensorflow: 1.14.0
tensorflow-hub: 0.7.0
opt-einsum: 3.1.0
I am working on macOS High Sierra 10.13.6
Hi,
What are the parameter settings for training data generation for your released models?
For example, what is the dupe_factor, and is whole word masking used? I only found max_seq_len, mask_probability, n_gram_mask and shorter_seq_prob in the paper.
Thanks
In particular, the error is IndexError: list index out of range in _convert_index().
I can't figure out what I should pass to the example in this parameter.
Although the layers are shared in ALBERT, I could not run ALBERT with a larger batch size than the one I ran successfully with BERT.
The only thing I can suspect is that the TPU/GPU consumes the same amount of memory regardless of layer sharing.
Is this expected?
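One way to reason about this suspicion is to separate weight storage from activation storage. The sketch below uses simplifying assumptions that are not from the post: about 12*H^2 weights per encoder layer (ignoring biases, LayerNorm and embeddings) and a single [batch, seq_len, hidden] float32 tensor kept per layer for backpropagation. Real training keeps several tensors per layer, so the absolute numbers are only indicative.

```python
def encoder_param_count(hidden, num_layers, shared):
    """Approximate encoder weights, ignoring biases and LayerNorm.

    Each layer has about 4*H^2 attention weights and 8*H^2 feed-forward
    weights (intermediate size 4*H). Shared layers are stored once.
    """
    per_layer = 12 * hidden * hidden
    return per_layer if shared else per_layer * num_layers


def hidden_state_bytes(batch, seq_len, hidden, num_layers, bytes_per_value=4):
    """Bytes needed to keep one [batch, seq_len, hidden] tensor per layer.

    Every layer application produces its own activations, so this term
    does not shrink when the layers share weights.
    """
    return batch * seq_len * hidden * bytes_per_value * num_layers


print(encoder_param_count(768, 12, shared=False))  # 84934656
print(encoder_param_count(768, 12, shared=True))   # 7077888
print(hidden_state_bytes(32, 512, 768, 12))        # 603979776
```

Under these assumptions, sharing cuts the weight count by the number of layers, while the activation term depends only on batch size, sequence length, hidden size and depth.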
I tried to run run_squad_sp.py and hit two errors. :(((( My command is as follows:
python -m run_squad_sp \
--albert_config_file=data/assets/albert_config.json \
--vocab_file=data/assets/30k-clean.vocab \
--spm_model_file=data/assets/30k-clean.model \
--output_dir=data/output \
--train_file=data/train-v2.0.json \
--predict_file=data/dev-v2.0.json \
--train_feature_file=data/train.tfrecord \
--predict_feature_file=data/dev.tfrecord \
--init_checkpoint=data/variables/variables \
--do_train \
--do_predict \
--nouse_tpu \
--train_batch_size=32 \
--predict_batch_size=8 \
--num_train_steps=3 \
--version_2_with_negative=True
At first it raised this error, which I have since solved.
Traceback (most recent call last):
File "/home/bai/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/bai/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/bai/squad/run_squad_sp.py", line 1331, in <module>
tf.app.run()
File "/home/bai/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/home/bai/squad/run_squad_sp.py", line 1239, in main
output_fn=train_writer.process_feature)
File "/home/bai/squad/run_squad_sp.py", line 381, in convert_examples_to_features
print("".join(para_tokens))
TypeError: sequence item 89: expected str instance, bytes found
Then I hit another error.
Traceback (most recent call last):
File "/home/bai/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/bai/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/bai/squad/run_squad_sp.py", line 1334, in <module>
tf.app.run()
File "/home/bai/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/home/bai/squad/run_squad_sp.py", line 1242, in main
output_fn=train_writer.process_feature)
File "/home/bai/squad/run_squad_sp.py", line 461, in convert_examples_to_features
n, is_start=False)
File "/home/bai/squad/run_squad_sp.py", line 302, in _convert_index
if index[pos] is not None:
IndexError: list index out of range
May I get some help here? Thanks~
Before ALBERT was moved to this repository, I downloaded the pre-trained ALBERT-base-2 from TFHub and used run_classifier_sp.py to evaluate the model on MNLI by modifying the provided run.sh script to execute the following instead of run_pretraining_test:
python -m albert.run_classifier_sp \
--output_dir="/path/to/output" \
--export_dir="/path/to/export" \
--do_eval \
--nouse_tpu \
--eval_batch_size=1 \
--max_seq_length=4 \
--max_eval_steps=3 \
--vocab_file="/path/to/albert-base-2/assets/30k-clean.vocab" \
--data_dir="/path/to/glue/MNLI" \
--task_name=MNLI
This gave an eval accuracy of approximately 0.34, which is significantly lower than the expected 0.84 discussed in the paper.
Has anyone else seen such low out-of-the-box evaluation results? Is this simply an issue with how I'm running the evaluation? If so, are there any recommendations for running evaluation to achieve better results?
I downloaded albert-xxlarge-v1 from https://storage.googleapis.com/tfhub-modules/google/albert_xxlarge/1.tar.gz
There is only the 30k-clean.model file in v1's assets dir,
but in albert-xxlarge-v2's assets dir there are three files: 30k-clean.vocab, albert_config.json and 30k-clean.model.
How can I get the 30k-clean.vocab and albert_config.json files for albert-xxlarge-v1?
# Note(mingdachen):
# For foreign characters, we always treat them as a whole piece.
english_chars = set(list("abcdefghijklmnopqrstuvwhyz"))
The character h is listed twice.
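A quick check of the quoted literal against the lowercase ASCII alphabet shows which letter the duplicate displaces. The snippet only inspects the string quoted above and uses nothing else from tokenization.py.

```python
import string

# The literal exactly as quoted above.
english_chars = set(list("abcdefghijklmnopqrstuvwhyz"))

missing = sorted(set(string.ascii_lowercase) - english_chars)
print(missing)             # ['x']
print(len(english_chars))  # 25
```

So the second h appears to stand where x was intended, which would leave x out of the set.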
Is there some benchmark that shows comparison with vs without this layer norm after embedding layer?
export_dir is just created at L545 in run_pretraining.py, but nothing else happens.
tf.gfile.MakeDirs(FLAGS.export_dir)
In the current script, unless the --export_dir option is specified, the script dies in the evaluation step.
Is export_dir necessary?
It looks like the FullTokenizer in tokenizer.py does not respect the do_lower_case argument if spm_model_file is provided!
This is somewhat inconsistent, as the do_lower_case argument would not be ignored if the spm_model_file is not specified.
(During pre-training, the lower case conversion and the unicode normalization are done in tokenizer.preprocess_text(); in the BERT-like tokenization, i.e. when no spm_model_file is specified, both would be done in FullTokenizer.)
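A sketch of the kind of preprocessing step the post refers to, showing how a do_lower_case argument could be honoured before text reaches a sentencepiece model. The function normalize_text is hypothetical, and the choice of NFKD normalization with combining marks removed is an assumption, not a statement of what preprocess_text() does.

```python
import unicodedata


def normalize_text(text, do_lower_case=True):
    """Collapse whitespace, apply NFKD, drop combining marks, optionally lowercase."""
    text = " ".join(text.strip().split())
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    if do_lower_case:
        text = text.lower()
    return text


print(normalize_text("  Café   ALBERT "))                       # cafe albert
print(normalize_text("  Café   ALBERT ", do_lower_case=False))  # Cafe ALBERT
```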
I tried to change the ALBERT optimizer to RAdam, but got this error after 1000 steps. I tried a lower batch size, but it still does not work.
Traceback (most recent call last):
File "/usr/lib/python3.5/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/mengqingyang0102/albert/run_squad_sp.py", line 1384, in <module>
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/usr/local/lib/python3.5/dist-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/home/mengqingyang0102/albert/run_squad_sp.py", line 1307, in main
estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3035, in train
rendezvous.raise_errors()
File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 136, in raise_errors
six.reraise(typ, value, traceback)
File "/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise
raise value
File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3030, in train
saving_listeners=saving_listeners)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
saving_listeners)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
raise six.reraise(*original_exc_info)
File "/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise
raise value
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1418, in run
run_metadata=run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1176, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: From /job:worker/replica:0/task:0:
Gradient for bert/embeddings/position_embeddings:0 is NaN : Tensor had NaN values
[[node CheckNumerics_2 (defined at usr/local/lib/python3.5/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
I am getting an "index out of range" error in tokenization.py when fine-tuning an ALBERT large model with TF Hub. I printed out the vocab file path and the token just before the error. You can see the error and the print-outs below.
Vocab File: b'/tmp/tfhub_modules/c88f9d4ac7469966b2fab3b577a8031ae23e125a/assets/30k-clean.model'
Token:
Traceback (most recent call last):
File "/home/[user]/anaconda3/envs/tf-gpu/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/[user]/anaconda3/envs/tf-gpu/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/[user]/Documents/ml-tests/falling-albert/albert/run_classifier_with_tfhub.py", line 318, in <module>
tf.compat.v1.app.run()
File "/home/[user]/anaconda3/envs/tf-gpu/lib/python3.7/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/[user]/anaconda3/envs/tf-gpu/lib/python3.7/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/[user]/anaconda3/envs/tf-gpu/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/home/[user]/Documents/ml-tests/falling-albert/albert/run_classifier_with_tfhub.py", line 185, in main
tokenizer = create_tokenizer_from_hub_module(FLAGS.albert_hub_module_handle)
File "/home/[user]/Documents/ml-tests/falling-albert/albert/run_classifier_with_tfhub.py", line 161, in create_tokenizer_from_hub_module
spm_model_file=FLAGS.spm_model_file)
File "/home/[user]/Documents/ml-tests/falling-albert/albert/tokenization.py", line 249, in __init__
self.vocab = load_vocab(vocab_file)
File "/home/[user]/Documents/ml-tests/falling-albert/albert/tokenization.py", line 203, in load_vocab
token = token.strip().split()[0]
IndexError: list index out of range
Albert Finetune Shell Script
#!/bin/bash
pip install -r albert/requirements.txt
python -m albert.run_classifier_with_tfhub \
--albert_hub_module_handle=https://tfhub.dev/google/albert_xlarge/1 \
--task_name=cola \
--do_train=true \
--do_eval=true \
--data_dir=./data-to-albert \
--max_seq_length=128 \
--train_batch_size=32 \
--learning_rate=2e-05 \
--num_train_epochs=3.0 \
--output_dir=./checkpoints/test
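The failing expression in the traceback, token.strip().split()[0], raises IndexError whenever a line is empty or contains only whitespace. The sketch below shows that failure mode and a guarded variant; load_vocab_skipping_blanks is a hypothetical helper, not the function in tokenization.py. Since the printed vocab path ends in 30k-clean.model, it may be that a binary sentencepiece model is being read as a text vocab, in which case skipping blank lines would only hide the real problem.

```python
import collections
import io


def load_vocab_skipping_blanks(lines):
    """Map each token (first whitespace-separated field) to a running index."""
    vocab = collections.OrderedDict()
    for raw in lines:
        fields = raw.strip().split()
        if not fields:
            # "".split() is [], so fields[0] would raise IndexError here.
            continue
        token = fields[0]
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


sample = io.StringIO(u"<pad>\t0\n<unk>\t0\n\n[CLS]\t0\n")  # made-up content
vocab = load_vocab_skipping_blanks(sample)
print(list(vocab.items()))  # [('<pad>', 0), ('<unk>', 1), ('[CLS]', 2)]
```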
When I run run_classifier_with_tfhub.py, the training crashes. The error is:
LookupError: No gradient defined for operation 'module_apply_tokens/bert/encoder/transformer/group_0_11/layer_11/inner_group_0/LayerNorm_1/batchnorm/add_1' (op type: AddV2)
My tensorflow-gpu version is 1.14.0.
Does anyone know the reason? Please help.
Thanks
Hello!
Thank you for releasing the code for Albert!
Could you upload the pre-trained checkpoints for the 4 Albert models? I would like to run run_squad_sp.py directly for finetuning on SQuAD.
Could you also release instructions on how to run SQuAD using tensorflow hub directly (similar to run_classifier_with_tfhub.py)?
Thanks in advance!
I still haven't found proper documentation on how to train and test the model.
I somehow fixed the training issues using the input from https://github.com/google-research/google-research/issues/84
Now I have the trained checkpoints, but I am not able to test them because vocab.txt is missing. I tried to use the vocab.txt from the BERT model, but it still failed.
Did anyone figure out a way to test the model?
My arguments for training:
!python -m run_classifier_with_tfhub \
--albert_hub_module_handle=https://tfhub.dev/google/albert_base/1 --task_name=cola --do_train=true --do_eval=true --data_dir=./dataset --output_dir=./albert_output/ --max_seq_length=64 --train_batch_size=2 --learning_rate=2e-5 --num_train_epochs=3.0
My arguments for testing:
!python run_classifier_sp.py --task_name=cola --do_predict=true --data_dir=./dataset --albert_config_file=./model/2/assets/albert_config.json --init_checkpoint=./albert_output/model.ckpt-906 --vocab_file=./model/vocab.txt --max_seq_length=64 --output_dir=./albert_output/
I fine-tuned Albert base on my task but didn't get desired accuracy. Now that I am trying to fine-tune Albert large I get this error:
"Resource exhausted: OOM when allocating tensor with shape[8,512,16,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc"
I used a single GPU, with 12 GB of memory and also with 16 GB (two different attempts).
It is interesting that I can fine-tune Bert base on the single gpu with 12GB memory.
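The size of the tensor named in the OOM message can be worked out from its shape. This sketch assumes that "type float" means float32 (4 bytes per value); tensor_bytes is a hypothetical helper.

```python
from functools import reduce
from operator import mul


def tensor_bytes(shape, bytes_per_value=4):
    """Size in bytes of a dense tensor; float32 uses 4 bytes per value."""
    return reduce(mul, shape, 1) * bytes_per_value


size = tensor_bytes([8, 512, 16, 64])
print(size, size / 2**20)  # 16777216 16.0
```

The allocation that failed is therefore about 16 MiB on its own, which suggests the memory was already almost full from earlier allocations when this request was made.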
Do you plan to release multilingual pre-trained models like BERT? Would appreciate it.
I downloaded albert_xxl v2; in the file assets/30k-clean.vocab the entry for [UNK] looks like:
<unk> 0
while in tokenization.py it is:
class WordpieceTokenizer(object):
  """Runs WordPiece tokenziation."""
  def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200):
So I'm getting an error like the one below. Is it OK to modify tokenization.py, or am I doing something wrong?
input_ids = tokenizer.convert_tokens_to_ids(ntokens)
File "J:\albert\tokenization.py", line 269, in convert_tokens_to_ids
return convert_by_vocab(self.vocab, tokens)
File "J:\albert\tokenization.py", line 211, in convert_by_vocab
output.append(vocab[item])
KeyError: '[UNK]'
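The KeyError comes from a plain dictionary lookup, so it can be anticipated by checking the vocab for the special tokens before converting anything. This is a sketch only: missing_tokens is a hypothetical helper, the list of required tokens is an assumption, and the vocab contents and ids are made up apart from the `<unk>` entry mentioned above.

```python
def missing_tokens(vocab, required=("[UNK]", "[CLS]", "[SEP]")):
    """Return the required special tokens that the vocab does not contain."""
    return [tok for tok in required if tok not in vocab]


# Made-up vocab in which the unknown token is spelled "<unk>".
vocab = {"<pad>": 0, "<unk>": 1, "[CLS]": 2, "[SEP]": 3}

print(missing_tokens(vocab))  # ['[UNK]']
```

A check like this fails early with a readable list instead of a KeyError deep inside convert_by_vocab.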
!python -m albert.run_classifier_with_tfhub \
--albert_hub_module_handle=https://tfhub.dev/google/albert_large/2 \
--data_dir=input \
--task_name=cola \
--do_train=True \
--do_eval=True \
--train_batch_size=16 \
--eval_batch_size=16 \
--max_seq_length=128 \
--learning_rate=1e-4 \
--num_train_epochs=2 \
--output_dir=output
Run on colab with tensorflow==1.15.0
INFO:tensorflow:Writing example 0 of 123163
I1104 20:48:08.980587 140516005656448 run_classifier_sp.py:804] Writing example 0 of 123163
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/content/albert/run_classifier_with_tfhub.py", line 319, in <module>
tf.app.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/content/albert/run_classifier_with_tfhub.py", line 233, in main
train_examples, label_list, FLAGS.max_seq_length, tokenizer)
File "/content/albert/run_classifier_sp.py", line 807, in convert_examples_to_features
max_seq_length, tokenizer)
File "/content/albert/run_classifier_sp.py", line 464, in convert_single_example
input_ids = tokenizer.convert_tokens_to_ids(tokens)
File "/content/albert/tokenization.py", line 270, in convert_tokens_to_ids
return convert_by_vocab(self.vocab, tokens)
File "/content/albert/tokenization.py", line 212, in convert_by_vocab
output.append(vocab[item])
KeyError: '[CLS]'
I noticed that the ALBERT v2 models released on TFHub specify zeros for attention_probs_dropout_prob and hidden_dropout_prob in assets/albert_config.json:
{
"attention_probs_dropout_prob": 0,
"hidden_act": "gelu",
"hidden_dropout_prob": 0,
...
}
However, the [README](https://tfhub.dev/google/albert_base/2) in TFHub specifies:
```json
{
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
...
}
```
Maybe one of them (README or assets/albert_config.json) could be updated?
I'm also wondering if it is a good idea to provide a do_lower_case flag somewhere under assets/, just as a minimal specification for the required text pre-processing.
Probably such a do_lower_case flag belongs to the sentencepiece model; what do you think?
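The dropout values that actually ship with a module can be read straight from its config file. The sketch below parses an inline string holding only the three fields quoted in the first snippet above (the elided fields are left out); for a downloaded module, the same code would be applied to the contents of assets/albert_config.json.

```python
import json

# Only the fields quoted above; the rest of the file is omitted here.
config_text = """
{
  "attention_probs_dropout_prob": 0,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0
}
"""

config = json.loads(config_text)
dropout_keys = [key for key in sorted(config) if key.endswith("dropout_prob")]
for key in dropout_keys:
    print(key, config[key])
# attention_probs_dropout_prob 0
# hidden_dropout_prob 0
```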
I am using run_classifier_with_tfhub with --albert_hub_module_handle=https://tfhub.dev/google/albert_base/2.
I am getting an error like "LookupError: No gradient defined for operation 'module_apply_tokens/bert/encoder/transformer/group_0_11/layer_11/inner_group_0/ffn_1/intermediate/output/dense/einsum/Einsum' (op type: Einsum)"
The command is:
python3 -m run_classifier_with_tfhub --data_dir=../../DataSet/CoLA/ --task_name=cola --output_dir=testing_ttt --vocab_file=vocab.txt --albert_hub_module_handle=https://tfhub.dev/google/albert_base/2 --do_train=True --do_eval=True --max_seq_length=128 --train_batch_size=32 --learning_rate=2e-05 --num_train_epochs=3.0
I am using tensorflow==1.15.0
I am considering training ALBERT from scratch in another language on a single TPU v3 (128 GB). I have a corpus of around 2B words.
Would this be a sufficient corpus size? Could you give a rough estimate of how long this would take for the various models?
For example:
import tensorflow_hub as hub
ALBERT_PATH = "xxx"  # a pretrained TFHub ALBERT model
albert_layer = hub.KerasLayer(ALBERT_PATH, trainable=True)
I'm trying to get ALBERT running locally with the following command line:
python -m albert.run_classifier_with_tfhub --task_name=MNLI --data_dir=./multinli_1.0 --albert_hub_module_handle=https://tfhub.dev/google/albert_large/1 --output_dir=./output --do_train=True
When the tokenizer is initialized from the TF Hub module, it crashes:
Traceback (most recent call last):
File "/Users/vladimirbugay/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/Users/vladimirbugay/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/Users/vladimirbugay/Knoema/GitHub/google-research/albert/run_classifier_with_tfhub.py", line 320, in <module>
tf.app.run()
File "/Users/vladimirbugay/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/Users/vladimirbugay/anaconda3/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/Users/vladimirbugay/anaconda3/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/Users/vladimirbugay/Knoema/GitHub/google-research/albert/run_classifier_with_tfhub.py", line 187, in main
tokenizer = create_tokenizer_from_hub_module(FLAGS.albert_hub_module_handle)
File "/Users/vladimirbugay/Knoema/GitHub/google-research/albert/run_classifier_with_tfhub.py", line 161, in create_tokenizer_from_hub_module
spm_model_file=FLAGS.spm_model_file)
File "/Users/vladimirbugay/Knoema/GitHub/google-research/albert/tokenization.py", line 247, in __init__
self.vocab = load_vocab(vocab_file)
File "/Users/vladimirbugay/Knoema/GitHub/google-research/albert/tokenization.py", line 201, in load_vocab
token = token.strip().split()[0]
IndexError: list index out of range
The problem is a line that consists of just a newline character '\n'. However, even if I modify the code to skip such lines, it still crashes later with:
Traceback (most recent call last):
File "/Users/vladimirbugay/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/Users/vladimirbugay/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/Users/vladimirbugay/Knoema/GitHub/google-research/albert/run_classifier_with_tfhub.py", line 320, in <module>
tf.app.run()
File "/Users/vladimirbugay/anaconda3/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/Users/vladimirbugay/anaconda3/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/Users/vladimirbugay/anaconda3/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/Users/vladimirbugay/Knoema/GitHub/google-research/albert/run_classifier_with_tfhub.py", line 187, in main
tokenizer = create_tokenizer_from_hub_module(FLAGS.albert_hub_module_handle)
File "/Users/vladimirbugay/Knoema/GitHub/google-research/albert/run_classifier_with_tfhub.py", line 161, in create_tokenizer_from_hub_module
spm_model_file=FLAGS.spm_model_file)
File "/Users/vladimirbugay/Knoema/GitHub/google-research/albert/tokenization.py", line 249, in __init__
self.vocab = load_vocab(vocab_file)
File "/Users/vladimirbugay/Knoema/GitHub/google-research/albert/tokenization.py", line 198, in load_vocab
token = convert_to_unicode(reader.readline())
File "/Users/vladimirbugay/anaconda3/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 179, in readline
return self._prepare_value(self._read_buf.ReadLineAsString())
File "/Users/vladimirbugay/anaconda3/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 98, in _prepare_value
return compat.as_str_any(val)
File "/Users/vladimirbugay/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/compat.py", line 117, in as_str_any
return as_str(value)
File "/Users/vladimirbugay/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/compat.py", line 87, in as_text
return bytes_or_text.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 8: invalid start byte
I'm running the code on macOS Catalina with Anaconda and Python 3.6:
sentencepiece 0.1.83 pypi_0 pypi
tensorflow 1.14.0 mkl_py36h933f829_0
tensorflow-base 1.14.0 mkl_py36h655c25b_0
tensorflow-estimator 1.14.0 py_0
tensorflow-hub 0.6.0 pyhe1b5a44_0 conda-forge
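The two tracebacks above show two different failure modes in load_vocab: a blank line makes `split()[0]` raise IndexError, and a byte sequence that is not valid UTF-8 raises UnicodeDecodeError. A defensive reader could be sketched as below. This uses only the Python standard library and is not the repository's implementation; the function name `load_vocab_tolerant` is hypothetical, and the hint in the error message that the file may be a binary model rather than a text vocabulary is an assumption that nothing in this report confirms.

```python
import collections
import os
import tempfile


def load_vocab_tolerant(vocab_path):
    """Read a one-token-per-line vocab file, skipping blank lines (hypothetical helper)."""
    vocab = collections.OrderedDict()
    with open(vocab_path, "rb") as reader:
        for line_no, raw in enumerate(reader, start=1):
            try:
                line = raw.decode("utf-8")
            except UnicodeDecodeError as err:
                # Assumption: a non-UTF-8 line suggests the file is not a text vocab.
                raise ValueError(
                    f"{vocab_path}:{line_no} is not valid UTF-8; "
                    "is this a binary model file rather than a text vocab?") from err
            fields = line.strip().split()
            if not fields:
                continue  # a bare "\n" would otherwise raise IndexError
            vocab[fields[0]] = len(vocab)
    return vocab


# Usage with a throwaway file that contains a blank line.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("[CLS]\n\n[SEP]\n▁the\n".encode("utf-8"))
vocab = load_vocab_tolerant(tmp.name)
os.unlink(tmp.name)
print(list(vocab.items()))
```

Reading in binary mode and decoding line by line keeps the line number available, so the error can point at the first offending line rather than surfacing deep inside the file I/O layer.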