lsdefine / attention-is-all-you-need-keras

A Keras+TensorFlow Implementation of the Transformer: Attention Is All You Need


attention-is-all-you-need-keras's Introduction

The Transformer model from "Attention Is All You Need": a Keras implementation.

A Keras+TensorFlow implementation of the Transformer: "Attention Is All You Need" (Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, arXiv, 2017).

Usage

Please refer to en2de_main.py and pinyin_main.py

en2de_main.py

Results

  • The code achieves results close to those of the reference repository: about 70% validation accuracy. With smaller model parameters, such as layers=2 and d_model=256, validation accuracy is higher, since the task is quite small.

For your own data

  • Just preprocess your source and target sequences into the format used in en2de.s2s.txt and pinyin.corpus.examples.txt.

Some notes

  • For a larger number of layers, the special learning rate scheduler reported in the paper is necessary (a sketch of that schedule follows this list).
  • In pinyin_main.py, I tried another way to train the deep network: train the first layer and the embedding layer first, then train a 2-layer model, then a 3-layer model, and so on. It works for this task.
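The scheduler referred to above is the warmup schedule from the paper: lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5). The repository ships its own lr_scheduler (used in en2de_main.py and pinyin_main.py); the callback below is only a minimal sketch of that schedule, and its name and defaults are my own assumptions.

# Minimal sketch of the warmup learning-rate schedule from the paper.
# NoamSchedule and its defaults are illustrative, not this repository's own scheduler.
from tensorflow.keras import backend as K
from tensorflow.keras.callbacks import Callback

class NoamSchedule(Callback):
    def __init__(self, d_model=512, warmup_steps=4000):
        super(NoamSchedule, self).__init__()
        self.d_model = d_model
        self.warmup_steps = warmup_steps
        self.step = 0

    def on_train_batch_begin(self, batch, logs=None):
        self.step += 1
        lr = (self.d_model ** -0.5) * min(self.step ** -0.5,
                                          self.step * self.warmup_steps ** -1.5)
        K.set_value(self.model.optimizer.lr, lr)   # override the optimizer's lr every step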

Upgrades

  • Reconstructed some classes.
  • The components are easier to reuse in other models; just import transformer.py.
  • A fast step-by-step decoder has been added, including an upgraded beam search, but these should still be modified to be reusable.
  • Updated for TensorFlow 2.6.0.

Acknowledgement

attention-is-all-you-need-keras's People

Contributors

julesgm, lsdefine, zhanjunlang


attention-is-all-you-need-keras's Issues

Reshape: Dimension mismatch

def reshape1(x):
    s = tf.shape(x)   # [batch_size, len_q, n_head * d_k]
    x = tf.reshape(x, [s[0], s[1], n_head, d_k])
    x = tf.transpose(x, [2, 0, 1, 3])
    x = tf.reshape(x, [-1, s[1], d_k])  # [n_head * batch_size, len_q, d_k]
    return x

This function is also used for the value vectors, which have dimension [batch_size, len_q, n_head * d_v].
It raises an error if d_k and d_v are not the same.
The code above is from MultiHeadAttention in transformer.py.
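For illustration, a shape-agnostic variant is sketched below: passing the per-head depth explicitly lets the same helper serve both the key/query path (d_k) and the value path (d_v). This is only a sketch of the idea, not the repository's actual fix.

import tensorflow as tf

def split_heads(x, n_head, d_per_head):
    # x: [batch_size, len_q, n_head * d_per_head]
    s = tf.shape(x)
    x = tf.reshape(x, [s[0], s[1], n_head, d_per_head])
    x = tf.transpose(x, [2, 0, 1, 3])             # heads become the outermost axis
    return tf.reshape(x, [-1, s[1], d_per_head])  # [n_head * batch_size, len_q, d_per_head]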

Transformer encoder layer instead of Bidirectional LSTM

So I want to change the Keras bidirectional LSTM layer below into a Transformer encoder:

lstmLayer = keras.layers.Bidirectional( keras.layers.CuDNNLSTM(args.rnnSize, return_sequences = True, recurrent_initializer = 'glorot_uniform' ) )(inputLayer)

So can this be accomplished using your library? The rest of the code stays the same; I just want to replace the bidirectional LSTM layers with the Transformer.

I would really appreciate your help. Thanks.

pure language model

Hello, inspired by openai/finetune-transformer-lm, I am trying to build a language model based on your code. I have a question about the implementation.

self.model = Model([src_seq_input, tgt_seq_input], loss)
self.model.add_loss([loss])
self.model.compile(optimizer, None)

Why don't you add the loss function through the compile API? I am not quite sure about the effect of the add_loss API.
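For reference, a minimal toy illustration of the two patterns; this is my own sketch, not code from this repository. add_loss registers a loss tensor that already lives in the graph, so compile() is called without a loss function; that is convenient here because the masked loss depends on the target sequence, which is itself a model input. The conventional alternative passes a loss function to compile().

import tensorflow as tf
from tensorflow.keras.layers import Dense, Input, Lambda
from tensorflow.keras.models import Model

x_in = Input(shape=(4,))
y_in = Input(shape=(1,))
pred = Dense(1)(x_in)

# Pattern used in this repo: build the loss as a tensor inside the graph.
loss = Lambda(lambda t: tf.reduce_mean(tf.square(t[0] - t[1])))([pred, y_in])
m1 = Model([x_in, y_in], loss)
m1.add_loss([loss])
m1.compile('adam', None)      # no loss function; the added tensor is what gets minimized

# Conventional pattern: pass a loss function to compile and the targets to fit.
m2 = Model(x_in, pred)
m2.compile('adam', loss='mse')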

By the way, I made a language-model encoder based on your Encoder, but I added GetSubMask as you did in the Decoder. Then I would like to add a CRF layer after the encoder (for sequence labelling, whereas OpenAI's model is for text classification), and finally train the model on the language-model loss plus the CRF loss. Do you have any implementation suggestions? In particular, any ideas for verifying the correctness of the code...

I saw your example data about pinyin and Chinese; are you Chinese?

Using the approach for video encoding.

I am trying to implement and test the approach for video encoding. I would like the system to take sets of image frames from videos as input and encode them using only the encoder part. Therefore, I am trying to comment out the decoder part, and I am trying to figure out what modifications I should make for this to work. I am a bit puzzled by lines 30 and 34 in pinyin_main.py:

gen = dd.S2SDataGenerator('data/pinyin.corpus.txt', itokens, otokens, batch_size=32, max_len=120)
s2s.model.fit_generator(gen, steps_per_epoch=2000, epochs=5, callbacks=[lr_scheduler, model_saver])

Could I easily replace the gen object with a tensor? What exactly does gen stand for?

'nan' loss function when using layer normalization

Hi,

I am using only the LayerNormalization from your code in mine. I didn't change anything in the code, apart from overriding the compute_mask function, since my input is an Embedding with mask_zero=True.

Code

from keras import backend as K
from keras.initializers import Ones, Zeros
from keras.layers import Layer

class LayerNormalization(Layer):

    def __init__(self, eps=1e-6, **kwargs):
        self.eps = eps
        super(LayerNormalization, self).__init__(**kwargs)

    def build(self, input_shape):
        self.gamma = self.add_weight(name='gamma', shape=input_shape[-1:],
                                     initializer=Ones(), trainable=True)
        self.beta = self.add_weight(name='beta', shape=input_shape[-1:],
                                    initializer=Zeros(), trainable=True)
        super(LayerNormalization, self).build(input_shape)

    def call(self, x):
        mean = K.mean(x, axis=-1, keepdims=True)
        std = K.std(x, axis=-1, keepdims=True)
        return self.gamma * (x - mean) / (std + self.eps) + self.beta

    def compute_output_shape(self, input_shape):
        return input_shape

    def compute_mask(self, inputs, input_mask=None):
        return input_mask

But strangely, I get nan for all the metrics I track while training and tuning (the loss function and others). I tried other implementations of the LayerNormalization layer (e.g. https://github.com/CyberZHG/keras-layer-normalization), and everything works without problems. I was wondering whether you have any clue about this.
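One common source of nan with this formulation (only a guess about this particular case, not a confirmed diagnosis) is that K.std has an unstable gradient when the variance is exactly zero, e.g. on constant or fully padded rows. A variant that moves epsilon inside the square root is sketched below; the class name is mine.

from keras import backend as K
from keras.initializers import Ones, Zeros
from keras.layers import Layer

class LayerNormalizationStable(Layer):
    # Same layer as above, but epsilon sits inside the square root.
    def __init__(self, eps=1e-6, **kwargs):
        self.eps = eps
        super(LayerNormalizationStable, self).__init__(**kwargs)

    def build(self, input_shape):
        self.gamma = self.add_weight(name='gamma', shape=input_shape[-1:],
                                     initializer=Ones(), trainable=True)
        self.beta = self.add_weight(name='beta', shape=input_shape[-1:],
                                    initializer=Zeros(), trainable=True)
        super(LayerNormalizationStable, self).build(input_shape)

    def call(self, x):
        mean = K.mean(x, axis=-1, keepdims=True)
        var = K.var(x, axis=-1, keepdims=True)
        return self.gamma * (x - mean) / K.sqrt(var + self.eps) + self.beta

    def compute_output_shape(self, input_shape):
        return input_shape

    def compute_mask(self, inputs, input_mask=None):
        return input_mask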

Save model to json

Have you ever tried to save/pickle your trained model? It does not seem to work on my side; I get an error when I call model.to_json().

seq2seq confused with shape

I want to play around with the transformer, but I'm confused by the shapes.

print(train[0])
[ 2 4 1 283 51 283 986 6 284 8 226 227 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
train.shape is (1000, 57)

Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            (None, 57)           0                                            
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, 57, 300)      865200      input_1[0][0]                    
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 57, 300)      90000       embedding_2[0][0]                
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 57, 300)      90000       embedding_2[0][0]                
__________________________________________________________________________________________________
lambda_3 (Lambda)               (None, 57)           0           input_1[0][0]                    
__________________________________________________________________________________________________
lambda_4 (Lambda)               (None, None, None)   0           dense_1[0][0]                    
__________________________________________________________________________________________________
lambda_5 (Lambda)               (None, None, None)   0           dense_2[0][0]                    
__________________________________________________________________________________________________
lambda_7 (Lambda)               (None, 57)           0           lambda_3[0][0]                   
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, 57)           0                                            
__________________________________________________________________________________________________
lambda_8 (Lambda)               (None, None, None)   0           lambda_4[0][0]                   
                                                                 lambda_5[0][0]                   
__________________________________________________________________________________________________
lambda_9 (Lambda)               (None, 57)           0           lambda_7[0][0]                   
__________________________________________________________________________________________________
lambda_1 (Lambda)               (None, 56)           0           input_2[0][0]                    
__________________________________________________________________________________________________
add_1 (Add)                     (None, None, None)   0           lambda_8[0][0]                   
                                                                 lambda_9[0][0]                   
__________________________________________________________________________________________________
embedding_3 (Embedding)         (None, 56, 300)      865200      lambda_1[0][0]                   
__________________________________________________________________________________________________
lambda_12 (Lambda)              (None, 56, 56)       0           lambda_1[0][0]                   
__________________________________________________________________________________________________
lambda_13 (Lambda)              (None, None, None)   0           lambda_1[0][0]                   
__________________________________________________________________________________________________
activation_1 (Activation)       (None, None, None)   0           add_1[0][0]                      
__________________________________________________________________________________________________
dense_3 (Dense)                 (None, 57, 300)      90000       embedding_2[0][0]                
__________________________________________________________________________________________________
dense_5 (Dense)                 (None, 56, 300)      90000       embedding_3[0][0]                
__________________________________________________________________________________________________
dense_6 (Dense)                 (None, 56, 300)      90000       embedding_3[0][0]                
__________________________________________________________________________________________________
lambda_14 (Lambda)              (None, 56, 56)       0           lambda_12[0][0]                  
                                                                 lambda_13[0][0]                  
__________________________________________________________________________________________________
dropout_1 (Dropout)             (None, None, None)   0           activation_1[0][0]               
__________________________________________________________________________________________________
lambda_6 (Lambda)               (None, None, None)   0           dense_3[0][0]                    
__________________________________________________________________________________________________
lambda_16 (Lambda)              (None, None, None)   0           dense_5[0][0]                    
__________________________________________________________________________________________________
lambda_17 (Lambda)              (None, None, None)   0           dense_6[0][0]                    
__________________________________________________________________________________________________
lambda_19 (Lambda)              (None, 56, 56)       0           lambda_14[0][0]                  
__________________________________________________________________________________________________
lambda_10 (Lambda)              (None, None, None)   0           dropout_1[0][0]                  
                                                                 lambda_6[0][0]                   
__________________________________________________________________________________________________
lambda_20 (Lambda)              (None, None, None)   0           lambda_16[0][0]                  
                                                                 lambda_17[0][0]                  
__________________________________________________________________________________________________
lambda_21 (Lambda)              (None, 56, 56)       0           lambda_19[0][0]                  
__________________________________________________________________________________________________
lambda_11 (Lambda)              (None, None, 300)    0           lambda_10[0][0]                  
__________________________________________________________________________________________________
add_4 (Add)                     (None, None, None)   0           lambda_20[0][0]                  
                                                                 lambda_21[0][0]                  
__________________________________________________________________________________________________
time_distributed_1 (TimeDistrib (None, None, 300)    90300       lambda_11[0][0]                  
__________________________________________________________________________________________________
activation_2 (Activation)       (None, None, None)   0           add_4[0][0]                      
__________________________________________________________________________________________________
dense_7 (Dense)                 (None, 56, 300)      90000       embedding_3[0][0]                
__________________________________________________________________________________________________
dropout_6 (Dropout)             (None, None, 300)    0           time_distributed_1[0][0]         
__________________________________________________________________________________________________
dropout_3 (Dropout)             (None, None, None)   0           activation_2[0][0]               
__________________________________________________________________________________________________
lambda_18 (Lambda)              (None, None, None)   0           dense_7[0][0]                    
__________________________________________________________________________________________________
add_2 (Add)                     (None, None, 300)    0           embedding_2[0][0]                
                                                                 dropout_6[0][0]                  
__________________________________________________________________________________________________
lambda_22 (Lambda)              (None, None, None)   0           dropout_3[0][0]                  
                                                                 lambda_18[0][0]                  
__________________________________________________________________________________________________
layer_normalization_2 (LayerNor (None, None, 300)    600         add_2[0][0]                      
__________________________________________________________________________________________________
lambda_23 (Lambda)              (None, None, 300)    0           lambda_22[0][0]                  
__________________________________________________________________________________________________
conv1d_1 (Conv1D)               (None, None, 512)    154112      layer_normalization_2[0][0]      
__________________________________________________________________________________________________
time_distributed_2 (TimeDistrib (None, None, 300)    90300       lambda_23[0][0]                  
__________________________________________________________________________________________________
conv1d_2 (Conv1D)               (None, None, 300)    153900      conv1d_1[0][0]                   
__________________________________________________________________________________________________
dropout_7 (Dropout)             (None, None, 300)    0           time_distributed_2[0][0]         
__________________________________________________________________________________________________
dropout_2 (Dropout)             (None, None, 300)    0           conv1d_2[0][0]                   
__________________________________________________________________________________________________
add_5 (Add)                     (None, None, 300)    0           embedding_3[0][0]                
                                                                 dropout_7[0][0]                  
__________________________________________________________________________________________________
add_3 (Add)                     (None, None, 300)    0           dropout_2[0][0]                  
                                                                 layer_normalization_2[0][0]      
__________________________________________________________________________________________________
layer_normalization_4 (LayerNor (None, None, 300)    600         add_5[0][0]                      
__________________________________________________________________________________________________
layer_normalization_1 (LayerNor (None, None, 300)    600         add_3[0][0]                      
__________________________________________________________________________________________________
dense_9 (Dense)                 (None, None, 300)    90000       layer_normalization_4[0][0]      
__________________________________________________________________________________________________
dense_10 (Dense)                (None, None, 300)    90000       layer_normalization_1[0][0]      
__________________________________________________________________________________________________
lambda_15 (Lambda)              (None, 56, 57)       0           lambda_1[0][0]                   
                                                                 input_1[0][0]                    
__________________________________________________________________________________________________
lambda_24 (Lambda)              (None, None, None)   0           dense_9[0][0]                    
__________________________________________________________________________________________________
lambda_25 (Lambda)              (None, None, None)   0           dense_10[0][0]                   
__________________________________________________________________________________________________
lambda_27 (Lambda)              (None, 56, 57)       0           lambda_15[0][0]                  
__________________________________________________________________________________________________
lambda_28 (Lambda)              (None, None, None)   0           lambda_24[0][0]                  
                                                                 lambda_25[0][0]                  
__________________________________________________________________________________________________
lambda_29 (Lambda)              (None, 56, 57)       0           lambda_27[0][0]                  
__________________________________________________________________________________________________
add_6 (Add)                     (None, None, None)   0           lambda_28[0][0]                  
                                                                 lambda_29[0][0]                  
__________________________________________________________________________________________________
activation_3 (Activation)       (None, None, None)   0           add_6[0][0]                      
__________________________________________________________________________________________________
dense_11 (Dense)                (None, None, 300)    90000       layer_normalization_1[0][0]      
__________________________________________________________________________________________________
dropout_4 (Dropout)             (None, None, None)   0           activation_3[0][0]               
__________________________________________________________________________________________________
lambda_26 (Lambda)              (None, None, None)   0           dense_11[0][0]                   
__________________________________________________________________________________________________
lambda_30 (Lambda)              (None, None, None)   0           dropout_4[0][0]                  
                                                                 lambda_26[0][0]                  
__________________________________________________________________________________________________
lambda_31 (Lambda)              (None, None, 300)    0           lambda_30[0][0]                  
__________________________________________________________________________________________________
time_distributed_3 (TimeDistrib (None, None, 300)    90300       lambda_31[0][0]                  
__________________________________________________________________________________________________
dropout_8 (Dropout)             (None, None, 300)    0           time_distributed_3[0][0]         
__________________________________________________________________________________________________
add_7 (Add)                     (None, None, 300)    0           layer_normalization_4[0][0]      
                                                                 dropout_8[0][0]                  
__________________________________________________________________________________________________
layer_normalization_5 (LayerNor (None, None, 300)    600         add_7[0][0]                      
__________________________________________________________________________________________________
conv1d_3 (Conv1D)               (None, None, 512)    154112      layer_normalization_5[0][0]      
__________________________________________________________________________________________________
conv1d_4 (Conv1D)               (None, None, 300)    153900      conv1d_3[0][0]                   
__________________________________________________________________________________________________
dropout_5 (Dropout)             (None, None, 300)    0           conv1d_4[0][0]                   
__________________________________________________________________________________________________
add_8 (Add)                     (None, None, 300)    0           dropout_5[0][0]                  
                                                                 layer_normalization_5[0][0]      
__________________________________________________________________________________________________
layer_normalization_3 (LayerNor (None, None, 300)    600         add_8[0][0]                      
__________________________________________________________________________________________________
time_distributed_4 (TimeDistrib (None, None, 57)     17100       layer_normalization_3[0][0]      
==================================================================================================
Total params: 3,447,424
Trainable params: 3,447,424
Non-trainable params: 0
__________________________________________________________________________________________________

I want to feed in the training data and get the exact same sentence back as output.
How do I do that?

Why do I get the same output for different inputs?

@lsdefine Thanks for sharing. I use the transformer for a seq2seq task: input an article and predict the abstract. When I finish training, I get almost the same output for different inputs. The code is the same as your example, and the data should be fine, because with the same data and an LSTM block as the seq2seq model I get proper output.
Hoping for your answer, thanks.

Using the transformer instead of a simple LSTM layer

@lsdefine please can you tell me how I can use the transformer instead of an LSTM layer in a simple encoder, as in this small example?

model = Sequential()
model.add(Embedding(top_words, 100, input_length=max_words, trainable=True))
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))
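A rough sketch of one way to do this with the functional API follows. The EncoderLayer name and call signature are taken from snippets quoted elsewhere on this page, so treat it as an untested assumption rather than a recipe; you would normally also add the positional encoding that the repository builds for its own models.

from keras.layers import Dense, Embedding, GlobalAveragePooling1D, Input
from keras.models import Model
from transformer import EncoderLayer

inp = Input(shape=(max_words,))
emb = Embedding(top_words, 100, trainable=True)(inp)
enc, _ = EncoderLayer(d_model=100, d_inner_hid=256, n_head=4)(emb)  # self-attention block in place of LSTM(32)
pooled = GlobalAveragePooling1D()(enc)
out = Dense(1, activation='sigmoid')(pooled)
model = Model(inp, out)
model.compile('adam', 'binary_crossentropy', metrics=['accuracy'])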

LayerNormalization

For Python 2.7, it is necessary to change

super().__init__(**kwargs)
to
super(LayerNormalization, self).__init__(**kwargs)

and

super().build(input_shape)
to
super(LayerNormalization, self).build(input_shape)

in class LayerNormalization(Layer).

dimension in GetSubMask

def GetSubMask(s):
    len_s = tf.shape(s)[1]
    bs = tf.shape(s)[:1]
    mask = K.cumsum(tf.eye(len_s, batch_shape=bs), 1)
    return mask

If the input has shape (5, 4, 3), wouldn't tf.eye here create a lower-triangle tensor of shape (5, 4, 4) instead of (5, 4, 3), because of the [:1]?
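A quick eager-mode check of the shape in question, for illustration only:

import tensorflow as tf

s = tf.zeros([5, 4, 3])
len_s = tf.shape(s)[1]                      # 4
bs = tf.shape(s)[:1]                        # [5]
print(tf.eye(len_s, batch_shape=bs).shape)  # (5, 4, 4): a square len x len mask per batch element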

the mask of attention

In transformer.py, line 87,
mask = Lambda(lambda x:K.repeat_elements(x, n_head, 0))(mask)
this line makes the mask shape (in readout_model) something like (batch_size * n_head, x, x), but the result of reshape1 has shape (n_head * batch_size, x, x). They look like the same shape, but the elements are ordered differently.

Maybe repeat_elements could be changed to tile?

Maybe I found a point that should be changed

self.target_layer = TimeDistributed(Dense(o_tokens.num(), use_bias=False))
change to:
self.target_layer = TimeDistributed(Dense(o_tokens.num(), activation='softmax', use_bias=False))

MultiHeadAttention

Hi!

It is strange to have n_head == 1, but the MultiHeadAttention class (mode=1) does not work in that case.
To fix it, it is enough to change

head = Concatenate()(heads)
attn = Concatenate()(attns)

to

if n_head == 1:
    head = heads[0]
    attn = attns[0]
else:
    head = Concatenate()(heads)
    attn = Concatenate()(attns)

because

A `Concatenate` layer should be called on a list of at least 2 inputs

Keras and Tensorflow Versions

When I run the code, I get this error on line 100 of transformer.py:

ValueError: Axis 0 of input tensor should have a defined dimension, but is None. Full tensor shape: (None, None, None). Typically you need to pass a fully-defined input_shape argument to your first layer.

Could you specify the versions of Keras and TensorFlow that you used for your tests?

Licence

What is the license for your shared code?

Issues with Keras Lambda Layers

I'm running into a lot of errors when attempting to run the transformer.py file for testing purposes.

The issue begins with:

(100000, 7) (100000, 9)
X:  [[ 2 11 12 ...  7  4  3]
 [ 2 10 11 ... 12 12  3]
 [ 2  5  5 ... 13  6  3]
 ...
 [ 2 13 11 ... 12  5  3]
 [ 2  7 12 ...  6 11  3]
 [ 2  6  4 ...  7 13  3]]
Y:  [[ 2  4 20 ... 19 14  3]
 [ 2  4 20 ... 11  8  3]
 [ 2  4 20 ...  8  3  0]
 ...
 [ 2  4 20 ... 19  9  3]
 [ 2  4 20 ... 19  3  0]
 [ 2  4 20 ... 15  3  0]]
2018-07-10 18:46:13.676502: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Traceback (most recent call last):
  File "transformer.py", line 586, in <module>
    s2s.compile('adam')
  File "transformer.py", line 396, in compile
    enc_output = self.encoder(src_seq, src_pos, active_layers=active_layers)
  File "transformer.py", line 306, in __call__
    mask = Lambda(lambda x:GetPadMask(emb, emb))(src_seq)
  File "/Users/user/anaconda2/envs/tfdeeplearning/lib/python3.6/site-packages/keras/engine/base_layer.py", line 460, in __call__
    output = self.call(inputs, **kwargs)
  File "/Users/user/anaconda2/envs/tfdeeplearning/lib/python3.6/site-packages/keras/layers/core.py", line 693, in call
    return self.function(inputs, **arguments)
  File "transformer.py", line 306, in <lambda>
    mask = Lambda(lambda x:GetPadMask(emb, emb))(src_seq)
  File "transformer.py", line 255, in GetPadMask
    ones = K.expand_dims(K.ones_like(Q, 'float32'), -1)
AttributeError: 'Tensor' object has no attribute 'expand_dims'

What versions of Keras and TensorFlow are you using for development?
Could you add that info to a requirements.txt file, or possibly to the README?
I am wondering if this is an issue with conflicting versions.
I am using:
tensorflow 1.8.0, Keras 2.2.0

I've tried wrapping the operations in Lambda layers, which works for the first two lines of the GetPadMask function, but I am running into issues again with the K.batch_dot operation.
Any ideas? I am relatively new to the Keras framework.

K.mean() in computing loss doesn't make any sense.

In

def get_loss(args):
    y_pred, y_true = args
    y_true = tf.cast(y_true, 'int32')
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y_true, logits=y_pred)
    mask = tf.cast(tf.not_equal(y_true, 0), 'float32')
    loss = tf.reduce_sum(loss * mask, -1) / tf.reduce_sum(mask, -1)
    loss = K.mean(loss)
    return loss

loss = tf.reduce_sum(loss * mask, -1) / tf.reduce_sum(mask, -1)
already produces a single element, so taking its mean makes no difference.

mask for decoder

Hello, I suspect that the mask you use for the decoder is not correct.
In the decoder, the mask you use is a matrix whose elements in the upper-right triangle are one.

mask = K.cumsum(tf.eye(len_s, batch_shape=bs), 1)
In [4]: np.cumsum(np.eye(5), 1)
Out[4]:
array([[1., 1., 1., 1., 1.],
       [0., 1., 1., 1., 1.],
       [0., 0., 1., 1., 1.],
       [0., 0., 0., 1., 1.],
       [0., 0., 0., 0., 1.]])

That means that, when you compute self-attention, the first word takes the entire output sequence into account through the masked attention weights applied to V. That is not correct during training, and this problem could also affect prediction.
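For comparison, a standard lower-triangular causal mask, in which query position q can attend only to key positions k <= q, can be built as sketched below; this is an illustration, not the repository's code.

import tensorflow as tf

def causal_mask(len_s, batch_size):
    mask = tf.linalg.band_part(tf.ones((len_s, len_s)), -1, 0)  # lower triangle, incl. diagonal
    return tf.tile(mask[None, :, :], [batch_size, 1, 1])        # [batch, len_s, len_s]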

ScaledDotProductAttention

Hi, I'm a beginner, and I found that

attn = Lambda(lambda x:K.batch_dot(x[0],x[1],axes=[2,2])/self.temper)([q, k])
is equal to
attn = Lambda(lambda x:tf.matmul(x[0],x[1],transpose_b=True)/self.temper)([q, k])

Sorry to disturb you.
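A quick numerical check of that equivalence, in eager mode; sketch only:

import numpy as np
import tensorflow as tf
from tensorflow.keras import backend as K

q = tf.random.normal([2, 5, 8])
k = tf.random.normal([2, 7, 8])
a = K.batch_dot(q, k, axes=[2, 2])                   # [2, 5, 7]
b = tf.matmul(q, k, transpose_b=True)                # [2, 5, 7]
print(np.allclose(a.numpy(), b.numpy(), atol=1e-5))  # True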

the test demo

Thank you for your excellent work. When will the test demo be released?

layer norm end of the encoder?

Should there be a layer norm at the end of the encoder layer, like below? Looking at the original paper, there is a norm layer after the pos-ffn.
class EncoderLayer():
    def __init__(self, d_model, d_inner_hid, n_head, dropout=0.1):
        self.self_att_layer = MultiHeadAttention(n_head, d_model, dropout=dropout)
        self.pos_ffn_layer = PositionwiseFeedForward(d_model, d_inner_hid, dropout=dropout)
        self.norm_layer = LayerNormalization()
    def __call__(self, enc_input, mask=None):
        output, slf_attn = self.self_att_layer(enc_input, enc_input, enc_input, mask=mask)
        output1 = self.norm_layer(Add()([enc_input, output]))
        output = self.pos_ffn_layer(output1)
        output = self.norm_layer(Add()([output1, output]))
        return output, slf_attn


Issue with attention mask

Hello, I checked the source code and found that the mask is implemented as follows:

class ScaledDotProductAttention():
	def __init__(self, d_model, attn_dropout=0.1):
		self.temper = np.sqrt(d_model)
		self.dropout = Dropout(attn_dropout)
	def __call__(self, q, k, v, mask):
		attn = Lambda(lambda x:K.batch_dot(x[0],x[1],axes=[2,2])/self.temper)([q, k])
		if mask is not None:
			mmask = Lambda(lambda x:(-1e+10)*(1-x))(mask)
			attn = Add()([attn, mmask])
		attn = Activation('softmax')(attn)
		attn = self.dropout(attn)
		output = Lambda(lambda x:K.batch_dot(x[0], x[1]))([attn, v])
		return output, attn

As far as I can tell, the Add()([attn, mmask]) operation will broadcast mmask to the shape of attn, which masks some rows of attn. But this may make the following softmax operation meaningless, as the softmax layer acts on each row. To be clearer:

'''
## we neglect the batch dimension
attn = [
	a_11, a_12, a_13
	a_21, a_22, a_23
	a_31, a_32, a_33
](q=3, k=3)

mmask = [
	0.0,
	0.0,
	-inf, 
](q=3)

## after broadcasting:
attn += mmask 
== [
	a_11, a_12, a_13
	a_21, a_22, a_23
	-inf, -inf, -inf
](q=3, k=3)

attn = softmax(attn)
== [
	softmax(a_11, a_12, a_13)
	softmax(a_21, a_22, a_23)
	1/3, 1/3, 1/3  <-------- is not what we want
](q=3, k=3)
'''

Am I missing something, or should the mask operation take effect after the softmax layer?

Decoding a sentence give same translation

I tried translating different English sentences, but got the same translation each time.

I tried decode_sequence and decode_sequence_fast too.
I trained for 2 epochs; is that the problem?

reshape may not match

Hi, thanks a lot for your code. It seems that I found a bug.

In the MultiHeadAttention layer, the reshape1 function

x = tf.reshape(x, [s[0], s[1], n_head, s[2]//n_head])
x = tf.transpose(x, [2, 0, 1, 3]) 
x = tf.reshape(x, [-1, s[1], s[2]//n_head])

The transpose puts the head axis before the batch axis. After reshaping, the first axis is ordered like this (suppose N samples and only 2 heads):

sample_0_head_0
sample_1_head_0
sample_2_head_0
...
sample_N-1_head_0
sample_0_head_1
sample_1_head_1
sample_2_head_1
...
sample_N-1_head_1

But repeating the mask:

mask = Lambda(lambda x:K.repeat_elements(x, n_head, 0))(mask)

returns a mask ordered like this:

mask_0,
mask_0,
mask_1,
mask_1,
...
mask_N-1,
mask_N-1,

(see the usage of repeat_elements here)

However, what we actually want is a mask ordered like this:

mask_0,
mask_1,
...
mask_N-1,
mask_0,
mask_1,
...
mask_N-1

So I think the reshape function reshape1 should change x = tf.transpose(x, [2, 0, 1, 3]) to x = tf.transpose(x, [0, 2, 1, 3]), and likewise for reshape2.
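A tiny eager-mode check of the ordering mismatch described above, for illustration:

import tensorflow as tf

N, H = 3, 2                                    # samples, heads
x = tf.reshape(tf.range(N * H), [N, H, 1, 1])  # fake [batch, head, len, d]; sample i holds values 2i, 2i+1
head_major = tf.reshape(tf.transpose(x, [1, 0, 2, 3]), [-1])
print(head_major.numpy())                      # [0 2 4 1 3 5]: samples 0..N-1 within head 0, then head 1

mask = tf.constant([10, 20, 30])               # one mask per sample
print(tf.repeat(mask, H, axis=0).numpy())      # [10 10 20 20 30 30]: each mask repeated H times consecutively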

How to perform translation?

Hi,
I managed to train the network on your provided dataset, but I don't know how to use the trained model to perform translation. Please advise, thanks.

startup error

Hello. I tried to evaluate your script and got the following error message:

(base) C:\Users\cp\Python\attention-is-all-you-need-keras>python en2de_main.py
Using TensorFlow backend.
loading data/en2de_word.txt
loading data/en2de.h5
loading data/en2de.valid.h5
seq 1 words: 3369
seq 2 words: 3665
train shapes: (29000, 43) (29000, 47)
valid shapes: (1014, 34) (1014, 39)
2020-03-11 13:08:06.384900: I tensorflow/core/common_runtime/process_util.cc:147] Creating new thread pool with default
inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
Traceback (most recent call last):
  File "en2de_main.py", line 33, in <module>
    s2s.compile(Adam(0.001, 0.9, 0.98, epsilon=1e-9))
  File "C:\Users\cp\Python\attention-is-all-you-need-keras\transformer.py", line 452, in compile
    loss = get_loss(final_output, tgt_true)
  File "C:\Users\cp\Python\attention-is-all-you-need-keras\transformer.py", line 440, in get_loss
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y_true, logits=y_pred)
  File "C:\Users\cp\Anaconda3\lib\site-packages\tensorflow_core\python\ops\nn_ops.py", line 3537, in sparse_softmax_cros
s_entropy_with_logits_v2
    labels=labels, logits=logits, name=name)
  File "C:\Users\cp\Anaconda3\lib\site-packages\tensorflow_core\python\ops\nn_ops.py", line 3470, in sparse_softmax_cros
s_entropy_with_logits
    array_ops.shape(logits)[:-1]))
  File "C:\Users\cp\Anaconda3\lib\site-packages\tensorflow_core\python\ops\check_ops.py", line 658, in assert_equal
    data, summarize, message, name)
  File "C:\Users\cp\Anaconda3\lib\site-packages\tensorflow_core\python\ops\check_ops.py", line 333, in _binary_assert
    if condition:
  File "C:\Users\cp\Anaconda3\lib\site-packages\tensorflow_core\python\framework\ops.py", line 757, in __bool__
    self._disallow_bool_casting()
  File "C:\Users\cp\Anaconda3\lib\site-packages\tensorflow_core\python\framework\ops.py", line 526, in _disallow_bool_ca
sting
    self._disallow_in_graph_mode("using a `tf.Tensor` as a Python `bool`")
  File "C:\Users\cp\Anaconda3\lib\site-packages\tensorflow_core\python\framework\ops.py", line 515, in _disallow_in_grap
h_mode
    " this function with @tf.function.".format(task))
tensorflow.python.framework.errors_impl.OperatorNotAllowedInGraphError: using a `tf.Tensor` as a Python `bool` is not al
lowed in Graph execution. Use Eager execution or decorate this function with @tf.function.

I use Keras 2+ and TF 2+ as well, but without a GPU.

Skip-connection in Transformer

Hello,

Thanks for a great project, which helps me build models on top of it.

I was wondering one thing: it seems that you do not implement skip connections (residual connections) in the Transformer?

Is it because you implemented it and you didn't observe improvement?

Or is it just because you didn't implement it?

I ask because when I use more layers, I actually get worse performance. I am not sure whether that is simply how it is (i.e., more layers do not help), or whether it is because I don't have skip connections, which usually help when building deeper models.

Best,

after embedding layer

Hello,
After the embedding layer, the token representation is the learned token embedding plus the static positional embedding, so padded positions also carry a positional embedding value. Before the embeddings enter the encoder or decoder, should the embedding sequences be multiplied by the padding mask to remove the influence of these embeddings? (A sketch of that operation follows below.)

Thanks,
looking forward to your reply.
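A sketch of the operation being asked about; this is only an illustration, not a statement about what the repository requires.

import tensorflow as tf

def apply_pad_mask(emb, token_ids):
    # emb:       [batch, len, d_model] token + positional embeddings
    # token_ids: [batch, len], with 0 used as the padding id
    pad_mask = tf.cast(tf.not_equal(token_ids, 0), emb.dtype)  # [batch, len]
    return emb * pad_mask[:, :, None]                          # zero out padded positions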
