NaN gradients when applying ReZero to BERT or GPT (rezero issue, open)

majumderb commented on June 19, 2024

When I apply ReZero to BERT or GPT, I get NaN gradients.

from rezero.

Comments (5)

calclavia commented on June 19, 2024

@yyht A few questions:

Did you initialize \alpha to zero?
How did you initialize the embedding matrix? We found that GPT2's embedding initialization doesn't work very well.
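
To make the first question concrete, here is a minimal sketch of a ReZero residual connection, x + alpha * F(x), in plain Python. The names `rezero_layer` and `toy_sublayer` are hypothetical and only for illustration; `toy_sublayer` stands in for an attention or feed-forward block. With alpha set to zero the layer should return its input unchanged:

```python
def rezero_layer(x, sublayer, alpha):
    """ReZero residual: x + alpha * sublayer(x), applied elementwise to a list of floats."""
    fx = sublayer(x)
    return [xi + alpha * fi for xi, fi in zip(x, fx)]


def toy_sublayer(x):
    # Hypothetical stand-in for an attention or feed-forward sublayer.
    return [3.0 * xi + 1.0 for xi in x]


x = [0.5, -1.0, 2.0]

# alpha = 0 at initialization: the sublayer contributes nothing.
out_init = rezero_layer(x, toy_sublayer, alpha=0.0)

# A non-zero alpha lets the sublayer output through.
out_trained = rezero_layer(x, toy_sublayer, alpha=0.1)

print(out_init)
print(out_trained)
```

If `out_init` differs from `x` in a real model, alpha is probably not starting at zero (or is not being applied to the residual branch).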

yyht commented on June 19, 2024

I initialized \alpha to zero.
The initialization follows the official BERT initialization.
The embedding matrix and kernel matrices are initialized via:

import tensorflow as tf  # TF 1.x API

def create_initializer(initializer_range=0.02):
    """Creates a truncated_normal_initializer with the given range."""
    return tf.truncated_normal_initializer(stddev=initializer_range)

calclavia commented on June 19, 2024

Try initializing the embedding matrix from a uniform distribution over [-1/d, +1/d].
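
A minimal sketch of that initialization in plain Python, assuming d is the embedding dimension (the thread does not define it explicitly). The function name `init_embedding` and the sizes used are hypothetical; in a real model the same bounds would be passed to the framework's uniform initializer:

```python
import random


def init_embedding(vocab_size, d_model, seed=0):
    """Return a vocab_size x d_model table drawn uniformly from [-1/d_model, +1/d_model]."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    bound = 1.0 / d_model      # assumption: d is the embedding dimension
    return [
        [rng.uniform(-bound, bound) for _ in range(d_model)]
        for _ in range(vocab_size)
    ]


# Hypothetical toy sizes, chosen only to keep the example small.
embedding = init_embedding(vocab_size=10, d_model=16)

largest = max(abs(v) for row in embedding for v in row)
print(len(embedding), len(embedding[0]), largest)
```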

sooheon commented on June 19, 2024

@calclavia can you give a little more insight into the reasoning behind this embedding init recommendation? Curious whether it's motivated by empirical performance or by some theoretical justification.

calclavia commented on June 19, 2024

@sooheon It depends on the particular implementation of your Transformer. Some implementations (Huggingface) scale the embedding by 1/d before passing it into higher layers, while initializing the embedding from a uniform distribution (-1 to +1). This effectively does the same thing as initializing it within ±1/d.

The reasoning for this initialization has less to do with our paper; we simply follow what previous work has recommended. I believe the "Attention Is All You Need" paper recommended 1/d scaling for the attentional softmax (when d is large). With 1/d scaling, the gradients of the softmax layer are better behaved.

The same principle applies to the output softmax when predicting over the output vocabulary. When ReZero initializes the Transformer layers to zero, the model essentially starts off as a pass-through from the input embedding directly to the output embedding. Having 1/d initialization ensures the gradients are well behaved.
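
The effect of scaling on softmax gradients can be illustrated with a small sketch. The logit values and d = 64 below are made up for illustration, and the 1/d factor simply mirrors the comment above (the scaled dot-product attention in "Attention Is All You Need" divides by sqrt(d_k)). The quantity p_i * (1 - p_i) is the diagonal of the softmax Jacobian, used here as a rough proxy for how much gradient flows through each output:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


d = 64  # hypothetical model dimension
raw_logits = [0.0, 8.0, 16.0, 24.0]          # made-up, widely spread logits
scaled_logits = [v / d for v in raw_logits]  # same logits scaled by 1/d

p_raw = softmax(raw_logits)
p_scaled = softmax(scaled_logits)

# Diagonal of the softmax Jacobian: d p_i / d z_i = p_i * (1 - p_i).
grad_raw = [p * (1.0 - p) for p in p_raw]
grad_scaled = [p * (1.0 - p) for p in p_scaled]

print(max(p_raw), max(grad_raw))
print(max(p_scaled), min(grad_scaled))
```

With the unscaled logits the softmax is nearly one-hot and every diagonal gradient term is close to zero; after scaling, the distribution is close to uniform and the terms are much larger.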
