lsdefine / attention-is-all-you-need-keras

A Keras+TensorFlow Implementation of the Transformer: Attention Is All You Need


attention-is-all-you-need-keras's Introduction

The Transformer model from "Attention Is All You Need": a Keras implementation.

A Keras+TensorFlow implementation of the Transformer: "Attention Is All You Need" (Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, arXiv, 2017).

Usage

Please refer to en2de_main.py and pinyin_main.py

en2de_main.py

Results

  • The code achieves results close to those of the reference repository: about 70% validation accuracy. With smaller model parameters, such as layers=2 and d_model=256, validation accuracy is higher, since the task is quite small.

For your own data

  • Just preprocess your source and target sequences into the format used in en2de.s2s.txt and pinyin.corpus.examples.txt.

Some notes

  • For a larger number of layers, the special learning rate scheduler reported in the paper is necessary (a sketch of that schedule follows this list).
  • In pinyin_main.py, I tried another way to train the deep network: train the first layer and the embedding layer first, then train a 2-layer model, then a 3-layer model, and so on. It works for this task.
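The scheduler referred to above is the warmup schedule from the paper: lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5). The repository ships its own lr_scheduler (used in en2de_main.py and pinyin_main.py); the callback below is only a minimal sketch of that schedule, and its name and defaults are my own assumptions.

# Minimal sketch of the warmup learning-rate schedule from the paper.
# NoamSchedule and its defaults are illustrative, not this repository's own scheduler.
from tensorflow.keras import backend as K
from tensorflow.keras.callbacks import Callback

class NoamSchedule(Callback):
    def __init__(self, d_model=512, warmup_steps=4000):
        super(NoamSchedule, self).__init__()
        self.d_model = d_model
        self.warmup_steps = warmup_steps
        self.step = 0

    def on_train_batch_begin(self, batch, logs=None):
        self.step += 1
        lr = (self.d_model ** -0.5) * min(self.step ** -0.5,
                                          self.step * self.warmup_steps ** -1.5)
        K.set_value(self.model.optimizer.lr, lr)   # override the optimizer's lr every step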

Upgrades

  • Reconstructed some classes.
  • The components are easier to reuse in other models; just import transformer.py.
  • A fast step-by-step decoder has been added, including an upgraded beam search, but these should still be modified to be reusable.
  • Updated for TensorFlow 2.6.0.

Acknowledgement

attention-is-all-you-need-keras's People

Contributors

julesgm, lsdefine, zhanjunlang


attention-is-all-you-need-keras's Issues

Reshape: Dimension mismatch

def reshape1(x):
    s = tf.shape(x)   # [batch_size, len_q, n_head * d_k]
    x = tf.reshape(x, [s[0], s[1], n_head, d_k])
    x = tf.transpose(x, [2, 0, 1, 3])
    x = tf.reshape(x, [-1, s[1], d_k])  # [n_head * batch_size, len_q, d_k]
    return x

This function is also used for the value vectors, which have dimension [batch_size, len_q, n_head * d_v].
It raises an error if d_k and d_v are not the same.
The code above is from MultiHeadAttention in transformer.py.
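For illustration, a shape-agnostic variant is sketched below: passing the per-head depth explicitly lets the same helper serve both the key/query path (d_k) and the value path (d_v). This is only a sketch of the idea, not the repository's actual fix.

import tensorflow as tf

def split_heads(x, n_head, d_per_head):
    # x: [batch_size, len_q, n_head * d_per_head]
    s = tf.shape(x)
    x = tf.reshape(x, [s[0], s[1], n_head, d_per_head])
    x = tf.transpose(x, [2, 0, 1, 3])             # heads become the outermost axis
    return tf.reshape(x, [-1, s[1], d_per_head])  # [n_head * batch_size, len_q, d_per_head]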

Transformer encoder layer instead of Bidirectional LSTM

So I want to change the Keras bidirectional LSTM layer below into a Transformer encoder:

lstmLayer = keras.layers.Bidirectional( keras.layers.CuDNNLSTM(args.rnnSize, return_sequences = True, recurrent_initializer = 'glorot_uniform' ) )(inputLayer)

So can this be accomplished using your library? The rest of the code stays the same; I just want to replace the bidirectional LSTM layers with the Transformer.

I would really appreciate your help. Thanks.

pure language model

Hello, inspired by openai/finetune-transformer-lm, I am trying to build a language model based on your code. I have a question about the implementation.

self.model = Model([src_seq_input, tgt_seq_input], loss)
self.model.add_loss([loss])
self.model.compile(optimizer, None)

Why don't you add the loss function through the compile API? I am not quite sure about the effect of the add_loss API.
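For reference, a minimal toy illustration of the two patterns; this is my own sketch, not code from this repository. add_loss registers a loss tensor that already lives in the graph, so compile() is called without a loss function; that is convenient here because the masked loss depends on the target sequence, which is itself a model input. The conventional alternative passes a loss function to compile().

import tensorflow as tf
from tensorflow.keras.layers import Dense, Input, Lambda
from tensorflow.keras.models import Model

x_in = Input(shape=(4,))
y_in = Input(shape=(1,))
pred = Dense(1)(x_in)

# Pattern used in this repo: build the loss as a tensor inside the graph.
loss = Lambda(lambda t: tf.reduce_mean(tf.square(t[0] - t[1])))([pred, y_in])
m1 = Model([x_in, y_in], loss)
m1.add_loss([loss])
m1.compile('adam', None)      # no loss function; the added tensor is what gets minimized

# Conventional pattern: pass a loss function to compile and the targets to fit.
m2 = Model(x_in, pred)
m2.compile('adam', loss='mse')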

By the way, I made a language-model encoder based on your Encoder, but I added GetSubMask as you did in the Decoder. Then I would like to add a CRF layer after the encoder (for sequence labelling, whereas OpenAI's model is for text classification), and finally train the model on the language-model loss plus the CRF loss. Do you have any implementation suggestions? In particular, any ideas for verifying the correctness of the code...

I saw your example data about pinyin and Chinese; are you Chinese?

Using the approach for video encoding.

I am trying to implement and test the approach for video encoding. I would like the system to take sets of image frames from videos as input and encode them using only the encoder part. Therefore, I am trying to comment out the decoder part, and I am trying to figure out what modifications I should make for this to work. I am a bit puzzled by lines 30 and 34 in pinyin_main.py:

gen = dd.S2SDataGenerator('data/pinyin.corpus.txt', itokens, otokens, batch_size=32, max_len=120)
s2s.model.fit_generator(gen, steps_per_epoch=2000, epochs=5, callbacks=[lr_scheduler, model_saver])

Could I easily replace the gen object with a tensor? What exactly does gen stand for?

'nan' loss function when using layer normalization

Hi,

I am using only the LayerNormalization from your code in mine. I didn't change anything in the code, apart from overriding the compute_mask function, since my input is an Embedding with mask_zero=True.

Code

from keras import backend as K
from keras.initializers import Ones, Zeros
from keras.layers import Layer

class LayerNormalization(Layer):

    def __init__(self, eps=1e-6, **kwargs):
        self.eps = eps
        super(LayerNormalization, self).__init__(**kwargs)

    def build(self, input_shape):
        self.gamma = self.add_weight(name='gamma', shape=input_shape[-1:],
                                     initializer=Ones(), trainable=True)
        self.beta = self.add_weight(name='beta', shape=input_shape[-1:],
                                    initializer=Zeros(), trainable=True)
        super(LayerNormalization, self).build(input_shape)

    def call(self, x):
        mean = K.mean(x, axis=-1, keepdims=True)
        std = K.std(x, axis=-1, keepdims=True)
        return self.gamma * (x - mean) / (std + self.eps) + self.beta

    def compute_output_shape(self, input_shape):
        return input_shape

    def compute_mask(self, inputs, input_mask=None):
        return input_mask

But strangely, I get nan for all the metrics I track while training and tuning (the loss function and others). I tried other implementations of the LayerNormalization layer (e.g. https://github.com/CyberZHG/keras-layer-normalization), and everything works without problems. I was wondering whether you have any clue about this.
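One common source of nan with this formulation (only a guess about this particular case, not a confirmed diagnosis) is that K.std has an unstable gradient when the variance is exactly zero, e.g. on constant or fully padded rows. A variant that moves epsilon inside the square root is sketched below; the class name is mine.

from keras import backend as K
from keras.initializers import Ones, Zeros
from keras.layers import Layer

class LayerNormalizationStable(Layer):
    # Same layer as above, but epsilon sits inside the square root.
    def __init__(self, eps=1e-6, **kwargs):
        self.eps = eps
        super(LayerNormalizationStable, self).__init__(**kwargs)

    def build(self, input_shape):
        self.gamma = self.add_weight(name='gamma', shape=input_shape[-1:],
                                     initializer=Ones(), trainable=True)
        self.beta = self.add_weight(name='beta', shape=input_shape[-1:],
                                    initializer=Zeros(), trainable=True)
        super(LayerNormalizationStable, self).build(input_shape)

    def call(self, x):
        mean = K.mean(x, axis=-1, keepdims=True)
        var = K.var(x, axis=-1, keepdims=True)
        return self.gamma * (x - mean) / K.sqrt(var + self.eps) + self.beta

    def compute_output_shape(self, input_shape):
        return input_shape

    def compute_mask(self, inputs, input_mask=None):
        return input_mask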

Save model to json

Have you ever tried to save/pickle your trained model? It does not seem to work on my side; I get an error when I call model.to_json().

seq2seq confused with shape

I want to play around with the transformer, but I'm confused by the shapes.

print(train[0])
[ 2 4 1 283 51 283 986 6 284 8 226 227 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
train.shape is (1000, 57)

Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            (None, 57)           0                                            
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, 57, 300)      865200      input_1[0][0]                    
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 57, 300)      90000       embedding_2[0][0]                
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 57, 300)      90000       embedding_2[0][0]                
__________________________________________________________________________________________________
lambda_3 (Lambda)               (None, 57)           0           input_1[0][0]                    
__________________________________________________________________________________________________
lambda_4 (Lambda)               (None, None, None)   0           dense_1[0][0]                    
__________________________________________________________________________________________________
lambda_5 (Lambda)               (None, None, None)   0           dense_2[0][0]                    
__________________________________________________________________________________________________
lambda_7 (Lambda)               (None, 57)           0           lambda_3[0][0]                   
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, 57)           0                                            
__________________________________________________________________________________________________
lambda_8 (Lambda)               (None, None, None)   0           lambda_4[0][0]                   
                                                                 lambda_5[0][0]                   
__________________________________________________________________________________________________
lambda_9 (Lambda)               (None, 57)           0           lambda_7[0][0]                   
__________________________________________________________________________________________________
lambda_1 (Lambda)               (None, 56)           0           input_2[0][0]                    
__________________________________________________________________________________________________
add_1 (Add)                     (None, None, None)   0           lambda_8[0][0]                   
                                                                 lambda_9[0][0]                   
__________________________________________________________________________________________________
embedding_3 (Embedding)         (None, 56, 300)      865200      lambda_1[0][0]                   
__________________________________________________________________________________________________
lambda_12 (Lambda)              (None, 56, 56)       0           lambda_1[0][0]                   
__________________________________________________________________________________________________
lambda_13 (Lambda)              (None, None, None)   0           lambda_1[0][0]                   
__________________________________________________________________________________________________
activation_1 (Activation)       (None, None, None)   0           add_1[0][0]                      
__________________________________________________________________________________________________
dense_3 (Dense)                 (None, 57, 300)      90000       embedding_2[0][0]                
__________________________________________________________________________________________________
dense_5 (Dense)                 (None, 56, 300)      90000       embedding_3[0][0]                
__________________________________________________________________________________________________
dense_6 (Dense)                 (None, 56, 300)      90000       embedding_3[0][0]                
__________________________________________________________________________________________________
lambda_14 (Lambda)              (None, 56, 56)       0           lambda_12[0][0]                  
                                                                 lambda_13[0][0]                  
__________________________________________________________________________________________________
dropout_1 (Dropout)             (None, None, None)   0           activation_1[0][0]               
__________________________________________________________________________________________________
lambda_6 (Lambda)               (None, None, None)   0           dense_3[0][0]                    
__________________________________________________________________________________________________
lambda_16 (Lambda)              (None, None, None)   0           dense_5[0][0]                    
__________________________________________________________________________________________________
lambda_17 (Lambda)              (None, None, None)   0           dense_6[0][0]                    
__________________________________________________________________________________________________
lambda_19 (Lambda)              (None, 56, 56)       0           lambda_14[0][0]                  
__________________________________________________________________________________________________
lambda_10 (Lambda)              (None, None, None)   0           dropout_1[0][0]                  
                                                                 lambda_6[0][0]                   
__________________________________________________________________________________________________
lambda_20 (Lambda)              (None, None, None)   0           lambda_16[0][0]                  
                                                                 lambda_17[0][0]                  
__________________________________________________________________________________________________
lambda_21 (Lambda)              (None, 56, 56)       0           lambda_19[0][0]                  
__________________________________________________________________________________________________
lambda_11 (Lambda)              (None, None, 300)    0           lambda_10[0][0]                  
__________________________________________________________________________________________________
add_4 (Add)                     (None, None, None)   0           lambda_20[0][0]                  
                                                                 lambda_21[0][0]                  
__________________________________________________________________________________________________
time_distributed_1 (TimeDistrib (None, None, 300)    90300       lambda_11[0][0]                  
__________________________________________________________________________________________________
activation_2 (Activation)       (None, None, None)   0           add_4[0][0]                      
__________________________________________________________________________________________________
dense_7 (Dense)                 (None, 56, 300)      90000       embedding_3[0][0]                
__________________________________________________________________________________________________
dropout_6 (Dropout)             (None, None, 300)    0           time_distributed_1[0][0]         
__________________________________________________________________________________________________
dropout_3 (Dropout)             (None, None, None)   0           activation_2[0][0]               
__________________________________________________________________________________________________
lambda_18 (Lambda)              (None, None, None)   0           dense_7[0][0]                    
__________________________________________________________________________________________________
add_2 (Add)                     (None, None, 300)    0           embedding_2[0][0]                
                                                                 dropout_6[0][0]                  
__________________________________________________________________________________________________
lambda_22 (Lambda)              (None, None, None)   0           dropout_3[0][0]                  
                                                                 lambda_18[0][0]                  
__________________________________________________________________________________________________
layer_normalization_2 (LayerNor (None, None, 300)    600         add_2[0][0]                      
__________________________________________________________________________________________________
lambda_23 (Lambda)              (None, None, 300)    0           lambda_22[0][0]                  
__________________________________________________________________________________________________
conv1d_1 (Conv1D)               (None, None, 512)    154112      layer_normalization_2[0][0]      
__________________________________________________________________________________________________
time_distributed_2 (TimeDistrib (None, None, 300)    90300       lambda_23[0][0]                  
__________________________________________________________________________________________________
conv1d_2 (Conv1D)               (None, None, 300)    153900      conv1d_1[0][0]                   
__________________________________________________________________________________________________
dropout_7 (Dropout)             (None, None, 300)    0           time_distributed_2[0][0]         
__________________________________________________________________________________________________
dropout_2 (Dropout)             (None, None, 300)    0           conv1d_2[0][0]                   
__________________________________________________________________________________________________
add_5 (Add)                     (None, None, 300)    0           embedding_3[0][0]                
                                                                 dropout_7[0][0]                  
__________________________________________________________________________________________________
add_3 (Add)                     (None, None, 300)    0           dropout_2[0][0]                  
                                                                 layer_normalization_2[0][0]      
__________________________________________________________________________________________________
layer_normalization_4 (LayerNor (None, None, 300)    600         add_5[0][0]                      
__________________________________________________________________________________________________
layer_normalization_1 (LayerNor (None, None, 300)    600         add_3[0][0]                      
__________________________________________________________________________________________________
dense_9 (Dense)                 (None, None, 300)    90000       layer_normalization_4[0][0]      
__________________________________________________________________________________________________
dense_10 (Dense)                (None, None, 300)    90000       layer_normalization_1[0][0]      
__________________________________________________________________________________________________
lambda_15 (Lambda)              (None, 56, 57)       0           lambda_1[0][0]                   
                                                                 input_1[0][0]                    
__________________________________________________________________________________________________
lambda_24 (Lambda)              (None, None, None)   0           dense_9[0][0]                    
__________________________________________________________________________________________________
lambda_25 (Lambda)              (None, None, None)   0           dense_10[0][0]                   
__________________________________________________________________________________________________
lambda_27 (Lambda)              (None, 56, 57)       0           lambda_15[0][0]                  
__________________________________________________________________________________________________
lambda_28 (Lambda)              (None, None, None)   0           lambda_24[0][0]                  
                                                                 lambda_25[0][0]                  
__________________________________________________________________________________________________
lambda_29 (Lambda)              (None, 56, 57)       0           lambda_27[0][0]                  
__________________________________________________________________________________________________
add_6 (Add)                     (None, None, None)   0           lambda_28[0][0]                  
                                                                 lambda_29[0][0]                  
__________________________________________________________________________________________________
activation_3 (Activation)       (None, None, None)   0           add_6[0][0]                      
__________________________________________________________________________________________________
dense_11 (Dense)                (None, None, 300)    90000       layer_normalization_1[0][0]      
__________________________________________________________________________________________________
dropout_4 (Dropout)             (None, None, None)   0           activation_3[0][0]               
__________________________________________________________________________________________________
lambda_26 (Lambda)              (None, None, None)   0           dense_11[0][0]                   
__________________________________________________________________________________________________
lambda_30 (Lambda)              (None, None, None)   0           dropout_4[0][0]                  
                                                                 lambda_26[0][0]                  
__________________________________________________________________________________________________
lambda_31 (Lambda)              (None, None, 300)    0           lambda_30[0][0]                  
__________________________________________________________________________________________________
time_distributed_3 (TimeDistrib (None, None, 300)    90300       lambda_31[0][0]                  
__________________________________________________________________________________________________
dropout_8 (Dropout)             (None, None, 300)    0           time_distributed_3[0][0]         
__________________________________________________________________________________________________
add_7 (Add)                     (None, None, 300)    0           layer_normalization_4[0][0]      
                                                                 dropout_8[0][0]                  
__________________________________________________________________________________________________
layer_normalization_5 (LayerNor (None, None, 300)    600         add_7[0][0]                      
__________________________________________________________________________________________________
conv1d_3 (Conv1D)               (None, None, 512)    154112      layer_normalization_5[0][0]      
__________________________________________________________________________________________________
conv1d_4 (Conv1D)               (None, None, 300)    153900      conv1d_3[0][0]                   
__________________________________________________________________________________________________
dropout_5 (Dropout)             (None, None, 300)    0           conv1d_4[0][0]                   
__________________________________________________________________________________________________
add_8 (Add)                     (None, None, 300)    0           dropout_5[0][0]                  
                                                                 layer_normalization_5[0][0]      
__________________________________________________________________________________________________
layer_normalization_3 (LayerNor (None, None, 300)    600         add_8[0][0]                      
__________________________________________________________________________________________________
time_distributed_4 (TimeDistrib (None, None, 57)     17100       layer_normalization_3[0][0]      
==================================================================================================
Total params: 3,447,424
Trainable params: 3,447,424
Non-trainable params: 0
__________________________________________________________________________________________________

I want to feed in the training data and get the exact same sentence back as output.
How do I do that?

Why do I get the same output for different inputs?

@lsdefine Thanks for sharing. I use the transformer for a seq2seq task: input an article and predict the abstract. When I finish training, I get almost the same output for different inputs. The code is the same as your example, and the data should be fine, because with the same data and an LSTM block as the seq2seq model I get proper output.
Hoping for your answer, thanks.

Using the transformer instead of a simple LSTM layer

@lsdefine please can you tell me how I can use the transformer instead of an LSTM layer in a simple encoder, as in this small example?

model = Sequential()
model.add(Embedding(top_words, 100, input_length=max_words, trainable=True))
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))
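A rough sketch of one way to do this with the functional API follows. The EncoderLayer name and call signature are taken from snippets quoted elsewhere on this page, so treat it as an untested assumption rather than a recipe; you would normally also add the positional encoding that the repository builds for its own models.

from keras.layers import Dense, Embedding, GlobalAveragePooling1D, Input
from keras.models import Model
from transformer import EncoderLayer

inp = Input(shape=(max_words,))
emb = Embedding(top_words, 100, trainable=True)(inp)
enc, _ = EncoderLayer(d_model=100, d_inner_hid=256, n_head=4)(emb)  # self-attention block in place of LSTM(32)
pooled = GlobalAveragePooling1D()(enc)
out = Dense(1, activation='sigmoid')(pooled)
model = Model(inp, out)
model.compile('adam', 'binary_crossentropy', metrics=['accuracy'])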

LayerNormalization

For Python 2.7, it is necessary to change

super().__init__(**kwargs)
to
super(LayerNormalization, self).__init__(**kwargs)

and

super().build(input_shape)
to
super(LayerNormalization, self).build(input_shape)

in class LayerNormalization(Layer).

dimension in GetSubMask

def GetSubMask(s):
    len_s = tf.shape(s)[1]
    bs = tf.shape(s)[:1]
    mask = K.cumsum(tf.eye(len_s, batch_shape=bs), 1)
    return mask

If the input has shape (5, 4, 3), wouldn't tf.eye here create a lower-triangle tensor of shape (5, 4, 4) instead of (5, 4, 3), because of the [:1]?
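A quick eager-mode check of the shape in question, for illustration only:

import tensorflow as tf

s = tf.zeros([5, 4, 3])
len_s = tf.shape(s)[1]                      # 4
bs = tf.shape(s)[:1]                        # [5]
print(tf.eye(len_s, batch_shape=bs).shape)  # (5, 4, 4): a square len x len mask per batch element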

the mask of attention

In transformer.py, line 87,
mask = Lambda(lambda x:K.repeat_elements(x, n_head, 0))(mask)
this line makes the mask shape (in readout_model) something like (batch_size * n_head, x, x), but the result of reshape1 has shape (n_head * batch_size, x, x). They look like the same shape, but the elements are ordered differently.

Maybe repeat_elements could be changed to tile?

Maybe I found a point that should be changed

self.target_layer = TimeDistributed(Dense(o_tokens.num(), use_bias=False))
change to:
self.target_layer = TimeDistributed(Dense(o_tokens.num(), activation='softmax', use_bias=False))

MultiHeadAttention

Hi!

It is strange to have n_head == 1, but the MultiHeadAttention class (mode=1) does not work in that case.
To fix it, it is enough to change

head = Concatenate()(heads)
attn = Concatenate()(attns)

to

if n_head == 1:
    head = heads[0]
    attn = attns[0]
else:
    head = Concatenate()(heads)
    attn = Concatenate()(attns)

because

A `Concatenate` layer should be called on a list of at least 2 inputs

Keras and Tensorflow Versions

When I run the code, I get this error on line 100 of transformer.py:

ValueError: Axis 0 of input tensor should have a defined dimension, but is None. Full tensor shape: (None, None, None). Typically you need to pass a fully-defined input_shape argument to your first layer.

Could you specify the versions of Keras and TensorFlow that you used for your tests?

Licence

What is the license for your shared code?

Issues with Keras Lambda Layers

I'm running into a lot of errors when attempting to run the transformer.py file for testing purposes.

The issue begins with:

(100000, 7) (100000, 9)
X:  [[ 2 11 12 ...  7  4  3]
 [ 2 10 11 ... 12 12  3]
 [ 2  5  5 ... 13  6  3]
 ...
 [ 2 13 11 ... 12  5  3]
 [ 2  7 12 ...  6 11  3]
 [ 2  6  4 ...  7 13  3]]
Y:  [[ 2  4 20 ... 19 14  3]
 [ 2  4 20 ... 11  8  3]
 [ 2  4 20 ...  8  3  0]
 ...
 [ 2  4 20 ... 19  9  3]
 [ 2  4 20 ... 19  3  0]
 [ 2  4 20 ... 15  3  0]]
2018-07-10 18:46:13.676502: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Traceback (most recent call last):
  File "transformer.py", line 586, in <module>
    s2s.compile('adam')
  File "transformer.py", line 396, in compile
    enc_output = self.encoder(src_seq, src_pos, active_layers=active_layers)
  File "transformer.py", line 306, in __call__
    mask = Lambda(lambda x:GetPadMask(emb, emb))(src_seq)
  File "/Users/user/anaconda2/envs/tfdeeplearning/lib/python3.6/site-packages/keras/engine/base_layer.py", line 460, in __call__
    output = self.call(inputs, **kwargs)
  File "/Users/user/anaconda2/envs/tfdeeplearning/lib/python3.6/site-packages/keras/layers/core.py", line 693, in call
    return self.function(inputs, **arguments)
  File "transformer.py", line 306, in <lambda>
    mask = Lambda(lambda x:GetPadMask(emb, emb))(src_seq)
  File "transformer.py", line 255, in GetPadMask
    ones = K.expand_dims(K.ones_like(Q, 'float32'), -1)
AttributeError: 'Tensor' object has no attribute 'expand_dims'

What versions of Keras and TensorFlow are you using for development?
Could you add that info to a requirements.txt file, or possibly to the README?
I am wondering if this is an issue with conflicting versions.
I am using:
tensorflow 1.8.0, Keras 2.2.0

I've tried wrapping the operations in Lambda layers, which works for the first two lines of the GetPadMask function, but I am running into issues again with the K.batch_dot operation.
Any ideas? I am relatively new to the Keras framework.

K.mean() in computing loss doesn't make any sense.

In

def get_loss(args):
    y_pred, y_true = args
    y_true = tf.cast(y_true, 'int32')
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y_true, logits=y_pred)
    mask = tf.cast(tf.not_equal(y_true, 0), 'float32')
    loss = tf.reduce_sum(loss * mask, -1) / tf.reduce_sum(mask, -1)
    loss = K.mean(loss)
    return loss

loss = tf.reduce_sum(loss * mask, -1) / tf.reduce_sum(mask, -1)
already produces a single element, so taking its mean makes no difference.

mask for decoder

Hello, I suspect that the mask you use for the decoder is not correct.
In the decoder, the mask you use is a matrix whose elements in the upper-right triangle are one.

mask = K.cumsum(tf.eye(len_s, batch_shape=bs), 1)
In [4]: np.cumsum(np.eye(5), 1)
Out[4]:
array([[1., 1., 1., 1., 1.],
       [0., 1., 1., 1., 1.],
       [0., 0., 1., 1., 1.],
       [0., 0., 0., 1., 1.],
       [0., 0., 0., 0., 1.]])

That means that, when you compute self-attention, the first word takes the entire output sequence into account through the masked attention weights applied to V. That is not correct during training, and this problem could also affect prediction.
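For comparison, a standard lower-triangular causal mask, in which query position q can attend only to key positions k <= q, can be built as sketched below; this is an illustration, not the repository's code.

import tensorflow as tf

def causal_mask(len_s, batch_size):
    mask = tf.linalg.band_part(tf.ones((len_s, len_s)), -1, 0)  # lower triangle, incl. diagonal
    return tf.tile(mask[None, :, :], [batch_size, 1, 1])        # [batch, len_s, len_s]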

ScaledDotProductAttention

Hi, I'm a beginner, and I found that

attn = Lambda(lambda x:K.batch_dot(x[0],x[1],axes=[2,2])/self.temper)([q, k])
is equal to
attn = Lambda(lambda x:tf.matmul(x[0],x[1],transpose_b=True)/self.temper)([q, k])

Sorry to disturb you.
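A quick numerical check of that equivalence, in eager mode; sketch only:

import numpy as np
import tensorflow as tf
from tensorflow.keras import backend as K

q = tf.random.normal([2, 5, 8])
k = tf.random.normal([2, 7, 8])
a = K.batch_dot(q, k, axes=[2, 2])                   # [2, 5, 7]
b = tf.matmul(q, k, transpose_b=True)                # [2, 5, 7]
print(np.allclose(a.numpy(), b.numpy(), atol=1e-5))  # True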

the test demo

Thank you for your excellent work. When will the test demo be released?

layer norm end of the encoder?

Should there be a layer norm at the end of the encoder layer, like below? Looking at the original paper, there is a norm layer after the pos-ffn.
class EncoderLayer():
    def __init__(self, d_model, d_inner_hid, n_head, dropout=0.1):
        self.self_att_layer = MultiHeadAttention(n_head, d_model, dropout=dropout)
        self.pos_ffn_layer = PositionwiseFeedForward(d_model, d_inner_hid, dropout=dropout)
        self.norm_layer = LayerNormalization()
    def __call__(self, enc_input, mask=None):
        output, slf_attn = self.self_att_layer(enc_input, enc_input, enc_input, mask=mask)
        output1 = self.norm_layer(Add()([enc_input, output]))
        output = self.pos_ffn_layer(output1)
        output = self.norm_layer(Add()([output1, output]))
        return output, slf_attn


Issue with attention mask

Hello, I checked the source code and found that the mask is implemented as follows:

class ScaledDotProductAttention():
	def __init__(self, d_model, attn_dropout=0.1):
		self.temper = np.sqrt(d_model)
		self.dropout = Dropout(attn_dropout)
	def __call__(self, q, k, v, mask):
		attn = Lambda(lambda x:K.batch_dot(x[0],x[1],axes=[2,2])/self.temper)([q, k])
		if mask is not None:
			mmask = Lambda(lambda x:(-1e+10)*(1-x))(mask)
			attn = Add()([attn, mmask])
		attn = Activation('softmax')(attn)
		attn = self.dropout(attn)
		output = Lambda(lambda x:K.batch_dot(x[0], x[1]))([attn, v])
		return output, attn

As far as I can tell, the Add()([attn, mmask]) operation will broadcast mmask to the shape of attn, which masks some rows of attn. But this may make the following softmax operation meaningless, as the softmax layer acts on each row. To be clearer:

'''
## we neglect the batch dimension
attn = [
	a_11, a_12, a_13
	a_21, a_22, a_23
	a_31, a_32, a_33
](q=3, k=3)

mmask = [
	0.0,
	0.0,
	-inf, 
](q=3)

## after broadcasting:
attn += mmask 
== [
	a_11, a_12, a_13
	a_21, a_22, a_23
	-inf, -inf, -inf
](q=3, k=3)

attn = softmax(attn)
== [
	softmax(a_11, a_12, a_13)
	softmax(a_21, a_22, a_23)
	1/3, 1/3, 1/3  <-------- is not what we want
](q=3, k=3)
'''

Am I missing something, or should the mask operation take effect after the softmax layer?

Decoding a sentence give same translation

I tried translating different English sentences, but got the same translation each time.

I tried decode_sequence and decode_sequence_fast too.
I trained for 2 epochs; is that the problem?

reshape may not match

Hi, thanks a lot for your code. It seems that I found a bug.

In the MultiHeadAttention layer, the reshape1 function

x = tf.reshape(x, [s[0], s[1], n_head, s[2]//n_head])
x = tf.transpose(x, [2, 0, 1, 3]) 
x = tf.reshape(x, [-1, s[1], s[2]//n_head])

The transpose puts the head axis before the batch axis. After reshaping, the first axis is ordered like this (suppose N samples and only 2 heads):

sample_0_head_0
sample_1_head_0
sample_2_head_0
...
sample_N-1_head_0
sample_0_head_1
sample_1_head_1
sample_2_head_1
...
sample_N-1_head_1

But repeating the mask:

mask = Lambda(lambda x:K.repeat_elements(x, n_head, 0))(mask)

returns a mask ordered like this:

mask_0,
mask_0,
mask_1,
mask_1,
...
mask_N-1,
mask_N-1,

(see the usage of repeat_elements here)

However, what we actually want is a mask ordered like this:

mask_0,
mask_1,
...
mask_N-1,
mask_0,
mask_1,
...
mask_N-1

So I think the reshape function reshape1 should change x = tf.transpose(x, [2, 0, 1, 3]) to x = tf.transpose(x, [0, 2, 1, 3]), and likewise for reshape2.
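A tiny eager-mode check of the ordering mismatch described above, for illustration:

import tensorflow as tf

N, H = 3, 2                                    # samples, heads
x = tf.reshape(tf.range(N * H), [N, H, 1, 1])  # fake [batch, head, len, d]; sample i holds values 2i, 2i+1
head_major = tf.reshape(tf.transpose(x, [1, 0, 2, 3]), [-1])
print(head_major.numpy())                      # [0 2 4 1 3 5]: samples 0..N-1 within head 0, then head 1

mask = tf.constant([10, 20, 30])               # one mask per sample
print(tf.repeat(mask, H, axis=0).numpy())      # [10 10 20 20 30 30]: each mask repeated H times consecutively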

How to perform translation?

Hi,
I managed to train the network on your provided dataset, but I don't know how to use the trained model to perform translation. Please advise, thanks.

startup error

Hello. I tried to evaluate your script and got the following error message:

(base) C:\Users\cp\Python\attention-is-all-you-need-keras>python en2de_main.py
Using TensorFlow backend.
loading data/en2de_word.txt
loading data/en2de.h5
loading data/en2de.valid.h5
seq 1 words: 3369
seq 2 words: 3665
train shapes: (29000, 43) (29000, 47)
valid shapes: (1014, 34) (1014, 39)
2020-03-11 13:08:06.384900: I tensorflow/core/common_runtime/process_util.cc:147] Creating new thread pool with default
inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
Traceback (most recent call last):
  File "en2de_main.py", line 33, in <module>
    s2s.compile(Adam(0.001, 0.9, 0.98, epsilon=1e-9))
  File "C:\Users\cp\Python\attention-is-all-you-need-keras\transformer.py", line 452, in compile
    loss = get_loss(final_output, tgt_true)
  File "C:\Users\cp\Python\attention-is-all-you-need-keras\transformer.py", line 440, in get_loss
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y_true, logits=y_pred)
  File "C:\Users\cp\Anaconda3\lib\site-packages\tensorflow_core\python\ops\nn_ops.py", line 3537, in sparse_softmax_cros
s_entropy_with_logits_v2
    labels=labels, logits=logits, name=name)
  File "C:\Users\cp\Anaconda3\lib\site-packages\tensorflow_core\python\ops\nn_ops.py", line 3470, in sparse_softmax_cros
s_entropy_with_logits
    array_ops.shape(logits)[:-1]))
  File "C:\Users\cp\Anaconda3\lib\site-packages\tensorflow_core\python\ops\check_ops.py", line 658, in assert_equal
    data, summarize, message, name)
  File "C:\Users\cp\Anaconda3\lib\site-packages\tensorflow_core\python\ops\check_ops.py", line 333, in _binary_assert
    if condition:
  File "C:\Users\cp\Anaconda3\lib\site-packages\tensorflow_core\python\framework\ops.py", line 757, in __bool__
    self._disallow_bool_casting()
  File "C:\Users\cp\Anaconda3\lib\site-packages\tensorflow_core\python\framework\ops.py", line 526, in _disallow_bool_ca
sting
    self._disallow_in_graph_mode("using a `tf.Tensor` as a Python `bool`")
  File "C:\Users\cp\Anaconda3\lib\site-packages\tensorflow_core\python\framework\ops.py", line 515, in _disallow_in_grap
h_mode
    " this function with @tf.function.".format(task))
tensorflow.python.framework.errors_impl.OperatorNotAllowedInGraphError: using a `tf.Tensor` as a Python `bool` is not al
lowed in Graph execution. Use Eager execution or decorate this function with @tf.function.

I use Keras 2+ and TF 2+ as well, but without a GPU.

Skip-connection in Transformer

Hello,

Thanks for a great project, which helps me build models on top of it.

I was wondering one thing: it seems that you do not implement skip connections (residual connections) in the Transformer?

Is it because you implemented it and you didn't observe improvement?

Or is it just because you didn't implement it?

I ask because when I use more layers, I actually get worse performance. I am not sure whether that is simply how it is (i.e., more layers do not help), or whether it is because I don't have skip connections, which usually help when building deeper models.

Best,

after embedding layer

Hello,
After the embedding layer, the token representation is the learned token embedding plus the static positional embedding, so padded positions also carry a positional embedding value. Before the embeddings enter the encoder or decoder, should the embedding sequences be multiplied by the padding mask to remove the influence of these embeddings? (A sketch of that operation follows below.)

Thanks,
looking forward to your reply.
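A sketch of the operation being asked about; this is only an illustration, not a statement about what the repository requires.

import tensorflow as tf

def apply_pad_mask(emb, token_ids):
    # emb:       [batch, len, d_model] token + positional embeddings
    # token_ids: [batch, len], with 0 used as the padding id
    pad_mask = tf.cast(tf.not_equal(token_ids, 0), emb.dtype)  # [batch, len]
    return emb * pad_mask[:, :, None]                          # zero out padded positions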
