
LookupError: No gradient defined for operation 'module_apply_tokens/bert/encoder/transformer/group_0_11/layer_11/inner_group_0/ffn_1/intermediate/output/dense/einsum/Einsum' (op type: Einsum) about albert (CLOSED, 22 comments)

MichaelCaohn commented on May 19, 2024
LookupError: No gradient defined for operation 'module_apply_tokens/bert/encoder/transformer/group_0_11/layer_11/inner_group_0/ffn_1/intermediate/output/dense/einsum/Einsum' (op type: Einsum)

Comments (22)

Rachnas commented on May 19, 2024

I faced the same issue with Hub2; a workaround is to use Hub1.

jdongca2003 commented on May 19, 2024

You need to specify a vocab or spm_model_file (the SentencePiece tokenization model) on the command line.
How do you get it?
You can download https://tfhub.dev/google/albert_base/1 and untar it; you will find the model at "assets/30k-clean.model".

Add the command-line argument
"--spm_model_file=YOUR_PATH/assets/30k-clean.model"

Note: this only works for Hub1.
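
For example, a minimal Python sketch of fetching and extracting the module; the "?tf-hub-format=compressed" download URL is an assumption about tfhub.dev's export format, not something confirmed in this thread:

    # Download the v1 hub module and locate the SentencePiece model.
    import tarfile
    import urllib.request

    url = "https://tfhub.dev/google/albert_base/1?tf-hub-format=compressed"
    urllib.request.urlretrieve(url, "albert_base_1.tar.gz")
    with tarfile.open("albert_base_1.tar.gz") as tar:
        tar.extractall("albert_base_1")
    # The tokenizer model should now be at albert_base_1/assets/30k-clean.model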

agupta74 commented on May 19, 2024

I am still seeing the same issue with TF 1.15 using the "run_classifier" command mentioned above. The v1 module works fine.

LookupError: No gradient defined for operation 'module_apply_tokens/bert/encoder/transformer/group_0_11/layer_11/inner_group_0/ffn_1/intermediate/output/dense/einsum/Einsum' (op type: Einsum)
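
For reference, a minimal sketch that should reproduce the lookup, assuming TF 1.x graph mode and tensorflow_hub (the "tokens" signature and input names follow the ALBERT hub module documentation):

    import tensorflow as tf
    import tensorflow_hub as hub

    albert = hub.Module("https://tfhub.dev/google/albert_base/2", trainable=True)
    ids = tf.zeros([1, 128], dtype=tf.int32)
    outputs = albert(dict(input_ids=ids, input_mask=ids, segment_ids=ids),
                     signature="tokens", as_dict=True)
    # Building this gradient graph is where the Einsum LookupError surfaces.
    grads = tf.gradients(outputs["pooled_output"], tf.trainable_variables())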

mnsrmov commented on May 19, 2024

> I faced the same issue with Hub2; a workaround is to use Hub1.

FYI, Rachnas means using version 1 of the base model rather than version 2. If someone finds a way to use version 2, please tell us the secret!

MichaelCaohn commented on May 19, 2024

> I faced the same issue with Hub2; a workaround is to use Hub1.

Thank you for the advice, Rachnas. It worked with Hub1. However, I am still wondering how to make it work with Hub2 :)

mnsrmov commented on May 19, 2024

astrongstorm, Rachnas, have you guys been able to get reasonable results from any training? Even when I repeat the example they have provided, I get pretty bad results.

Rachnas commented on May 19, 2024

> astrongstorm, Rachnas, have you guys been able to get reasonable results from any training? Even when I repeat the example they have provided, I get pretty bad results.

I am yet to get results.

jdongca2003 commented on May 19, 2024

python3 run_classifier_with_tfhub.py \
  --albert_hub_module_handle=https://tfhub.dev/google/albert_base/1 \
  --data_dir=glue_data/MNLI \
  --task_name=mnli \
  --spm_model_file=30k-clean.model \
  --output_dir=output \
  --do_train=true \
  --do_eval=true \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=1e-4 \
  --num_train_epochs=5 \
  --eval_batch_size=32 \
  --predict_batch_size=32 \
  --use_tpu=False

I got poor results too:

INFO:tensorflow:***** Eval results *****
I1109 21:40:18.838561 139722163410752 run_classifier_with_tfhub.py:273] ***** Eval results *****
INFO:tensorflow: eval_accuracy = 0.8169129
I1109 21:40:18.838666 139722163410752 run_classifier_with_tfhub.py:275] eval_accuracy = 0.8169129
INFO:tensorflow: eval_loss = 0.57061106
I1109 21:40:18.838964 139722163410752 run_classifier_with_tfhub.py:275] eval_loss = 0.57061106
INFO:tensorflow: global_step = 61359

liuqiangict commented on May 19, 2024

For this problem, I believe we are talking about v2; there are some problems with tensor lookup in Hub2, right?

zheyuye commented on May 19, 2024

Facing the same issue using version 2, but it works fine with version 1 after specifying spm_model_file on the command line.

mnsrmov commented on May 19, 2024

I'm getting bad results on both version 1 and version 2, though better on 1 than on 2. In my prior experience with other models I found that LAMB was very sensitive to its hyperparameters. I'm thinking of trying Adam to see if that is the problem. Has anyone tried using Adam instead of LAMB and gotten better results?
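
If anyone wants to try, here is a minimal TF 1.x sketch of building the train op with plain Adam in place of LAMB; build_train_op and its arguments are illustrative, not the repo's actual optimization API:

    import tensorflow as tf

    def build_train_op(loss, learning_rate):
        # Plain Adam as a baseline; the repo normally uses LAMB here.
        optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
        return optimizer.minimize(
            loss, global_step=tf.train.get_or_create_global_step())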

agupta74 commented on May 19, 2024

I am also having the same issue.

PradyumnaGupta commented on May 19, 2024

The training problem is still not solved even after using Hub1 (version 1 of ALBERT). It gives the following error:
ValueError: Variable <tf.Variable 'albert_layer_module/cls/predictions/output_bias:0' shape=(30000,) dtype=float32> has None for gradient. Please make sure that all of your ops have a gradient defined (i.e. are differentiable). Common ops without gradient: K.argmax, K.round, K.eval.
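
A standard TF 1.x diagnostic for this class of error is to list the trainable variables that are not connected to the loss; this is a generic sketch, with loss standing for the model's scalar training loss:

    import tensorflow as tf

    def report_unconnected_variables(loss):
        # Variables whose gradient is None are not reachable from the loss.
        grads = tf.gradients(loss, tf.trainable_variables())
        for var, grad in zip(tf.trainable_variables(), grads):
            if grad is None:
                print("No gradient for:", var.name)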

iShaka commented on May 19, 2024

> I'm getting bad results on both version 1 and version 2, though better on 1 than on 2. In my prior experience with other models I found that LAMB was very sensitive to its hyperparameters. I'm thinking of trying Adam to see if that is the problem. Has anyone tried using Adam instead of LAMB and gotten better results?

Have you solved the problem on v2? Could you share how you made it work?

agupta74 commented on May 19, 2024

The issue with hub v2 modules is not fixed yet (v1 is good)

0x0539 commented on May 19, 2024

The "no gradient defined for operation Einsum" was found to be caused by using an old version of TF. The full investigation is here. I've modified requirements.txt to explicitly request TF 1.15. Please run pip install -r requirements.txt and verify that you are running TF 1.15. If you still see the problem, let me know by posting to this thread.

BTW, I merged the TF-hub functionality into run_classifier.py in this commit. The reason is that run_classifier_with_tfhub.py got out of sync. Please use run_classifier.py with --albert_hub_module_handle=XXX when fine-tuning from TF-Hub. Sorry for any inconvenience.

I tested this with TF1.15 using the v2 hub modules and it seems to be working at HEAD.

python3 -m run_classifier \
  --data_dir="$HOME/ALBERT/glue" \
  --task_name=cola \
  --output_dir=/tmp/testing_ttt \
  --vocab_file=vocab.txt \
  --albert_hub_module_handle=https://tfhub.dev/google/albert_base/2 \
  --do_train=True \
  --do_eval=True \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-05 \
  --train_step=50 \
  --spm_model_file="$HOME/ALBERT/spm_vocab/30k-clean.model"

jonanem commented on May 19, 2024

The "no gradient defined for operation Einsum" was found to be caused by using an old version of TF. The full investigation is here. I've modified requirements.txt to explicitly request TF 1.15. Please run pip install -r requirements.txt and verify that you are running TF 1.15. If you still see the problem, let me know by posting to this thread.

BTW, I merged the TF-hub functionality into run_classifier.py in this commit. The reason is that run_classifier_with_tfhub.py got out of sync. Please use run_classifier.py with --albert_hub_module_handle=XXX when fine-tuning from TF-Hub. Sorry for any inconvenience.

I tested this with TF1.15 using the v2 hub modules and it seems to be working at HEAD.

python3 -m run_classifier --data_dir="$HOME/ALBERT/glue" --task_name=cola --output_dir=/tmp/testing_ttt --vocab_file=vocab.txt --albert_hub_module_handle=https://tfhub.dev/google/albert_base/2 --do_train=True --do_eval=True --max_seq_length=128 --train_batch_size=32 --learning_rate=2e-05 --train_step=50 --spm_model_file="$HOME/ALBERT/spm_vocab/30k-clean.model"

with tensorflow version 1.15 we are still facing the same error

0x0539 commented on May 19, 2024

Ah, now I'm able to reproduce it. There appears to be an issue with the way the V2 modules were generated. I'm looking into it with the TF team and hopefully will get back with an answer soon.

0x0539 commented on May 19, 2024

It looks like V2 modules were generated with a different version of TF, which contains native ops not present in TF 1.X releases. We will have to regenerate and re-release them with TF 1.15. Apologies for the inconvenience. I'll update this thread when the new modules are uploaded.

0x0539 commented on May 19, 2024

We have regenerated the hub modules using TF1.15.
Please use hub modules with the "/3" suffix. Hub modules with the "/2" suffix will remain broken. TF-Hub links in the readme have been updated accordingly.
See Jan 7 update in the readme for more info.
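
For example, switching the handle is the only change needed; this sketch assumes the TF1 hub.Module API used elsewhere in this thread:

    import tensorflow_hub as hub

    # "/3" is the regenerated module; "/2" remains broken.
    albert = hub.Module("https://tfhub.dev/google/albert_base/3", trainable=True)

Equivalently, pass --albert_hub_module_handle=https://tfhub.dev/google/albert_base/3 to run_classifier.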

TheGlobalist commented on May 19, 2024

I am facing the same issue with the traditional BERT on Colab.
Here are all the specs:

TF --> '1.15.0'
Colab --> '0.7.0'

Code for loading BERT:


    import tensorflow as tf
    import tensorflow_hub as hub

    # Build Keras inputs and wrap the TF-Hub BERT module as a layer.
    input_word_ids = tf.keras.layers.Input(shape=(20,), dtype=tf.int32, name="input_word_ids")
    input_mask = tf.keras.layers.Input(shape=(20,), dtype=tf.int32, name="input_mask")
    segment_ids = tf.keras.layers.Input(shape=(20,), dtype=tf.int32, name="segment_ids")
    # BERt = BERtLayer()([input_word_ids, input_mask, segment_ids])
    bert = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/1", trainable=True)
    pooled_output, sequence_output = bert([input_word_ids, input_mask, segment_ids])

Exception thrown

Call initializer instance with the dtype argument instead of passing it to the constructor
Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_word_ids (InputLayer)     [(None, 20)]         0                                            
__________________________________________________________________________________________________
input_mask (InputLayer)         [(None, 20)]         0                                            
__________________________________________________________________________________________________
segment_ids (InputLayer)        [(None, 20)]         0                                            
__________________________________________________________________________________________________
keras_layer (KerasLayer)        [(None, 768), (None, 177853441   input_word_ids[0][0]             
                                                                 input_mask[0][0]                 
                                                                 segment_ids[0][0]                
__________________________________________________________________________________________________
bidirectional (Bidirectional)   [(None, None, 512),  2099200     keras_layer[0][1]                
__________________________________________________________________________________________________
concatenate (Concatenate)       (None, 512)          0           bidirectional[0][1]              
                                                                 bidirectional[0][3]              
__________________________________________________________________________________________________
repeat_vector (RepeatVector)    (None, None, 512)    0           concatenate[0][0]                
__________________________________________________________________________________________________
dense (Dense)                   (None, None, 1)      513         repeat_vector[0][0]              
__________________________________________________________________________________________________
activation (Activation)         (None, None, 1)      0           dense[0][0]                      
__________________________________________________________________________________________________
lambda (Lambda)                 (None, 512)          0           bidirectional[0][0]              
                                                                 activation[0][0]                 
__________________________________________________________________________________________________
multiply (Multiply)             (None, None, 512)    0           bidirectional[0][0]              
                                                                 lambda[0][0]                     
__________________________________________________________________________________________________
babelnet (Dense)                (None, None, 26221)  13451373    multiply[0][0]                   
__________________________________________________________________________________________________
domain (Dense)                  (None, None, 9916)   5086908     multiply[0][0]                   
__________________________________________________________________________________________________
lexicon (Dense)                 (None, None, 9916)   5086908     multiply[0][0]                   
==================================================================================================
Total params: 203,578,343
Trainable params: 203,578,342
Non-trainable params: 1
__________________________________________________________________________________________________
enter in train...
WARNING:tensorflow:From /content/Progetto/code/tokenizer.py:125: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

WARNING:tensorflow:From /content/Progetto/code/tokenizer.py:125: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

Done train preparation...
Done label preparatiomn
ciao
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/math_grad.py:1424: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/math_grad.py:1424: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Train on 29740 samples, validate on 7436 samples
2020-02-22 08:36:39.829236: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-22 08:36:39.829902: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:04.0
2020-02-22 08:36:39.830005: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-22 08:36:39.830039: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-22 08:36:39.830074: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-22 08:36:39.830103: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-22 08:36:39.830127: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-22 08:36:39.830154: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-22 08:36:39.830182: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-22 08:36:39.830309: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-22 08:36:39.830960: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-22 08:36:39.831507: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-02-22 08:36:39.831561: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-02-22 08:36:39.831575: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
2020-02-22 08:36:39.831603: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
2020-02-22 08:36:39.831760: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-22 08:36:39.832342: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-22 08:36:39.832866: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14221 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5)
2020-02-22 08:36:41.766438: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 367248384 exceeds 10% of system memory.
2020-02-22 08:36:42.267189: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 367248384 exceeds 10% of system memory.
2020-02-22 08:36:43.576576: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 367248384 exceeds 10% of system memory.
2020-02-22 08:36:43.654042: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 367248384 exceeds 10% of system memory.
2020-02-22 08:36:44.099220: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 367248384 exceeds 10% of system memory.
Epoch 1/4
2020-02-22 08:37:05.283924: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-22 08:37:08.127005: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at function_ops.cc:250 : Not found: No gradient defined for op: Einsum
Traceback (most recent call last):
  File "model.py", line 128, in <module>
    modello.train(train,label,vocab_label_bn,vocab_label_wndmn,vocab_label_lex, train_dev, label_dev)
  File "model.py", line 92, in train
    callbacks = [checkpoint, early_stopper],
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 727, in fit
    use_multiprocessing=use_multiprocessing)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 675, in fit
    steps_name='steps_per_epoch')
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 394, in model_iteration
    batch_outs = f(ins_batch)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/backend.py", line 3476, in __call__
    run_metadata=self.run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1472, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.NotFoundError: [_Derived_]No gradient defined for op: Einsum
	 [[{{node Func/_36}}]]
	 [[training/Adam/gradients/gradients/keras_layer/cond/StatefulPartitionedCall_grad/PartitionedCall/gradients/StatefulPartitionedCall_grad/PartitionedCall/gradients/StatefulPartitionedCall_grad/SymbolicGradient]]

RobRomijnders commented on May 19, 2024

Does the hub module have multiple tags? If so, did you try any other?

I faced a similar error with a different hub module. It turned out I was using the incorrect tag.
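
For example, a sketch of selecting a graph variant via tags; hub.Module accepts a tags set in TF1, and using the "train" tag for the training graph is the usual convention:

    import tensorflow_hub as hub

    is_training = True
    tags = {"train"} if is_training else set()
    albert = hub.Module("https://tfhub.dev/google/albert_base/3",
                        tags=tags, trainable=is_training)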
