jadore801120 / attention-is-all-you-need-pytorch

A PyTorch implementation of the Transformer model in "Attention is All You Need".

License: MIT License

Python 99.31% Shell 0.69%

attention attention-is-all-you-need deep-learning natural-language-processing nlp pytorch

attention-is-all-you-need-pytorch's Issues

Multi-GPUs?

Hi, thanks for sharing.
It seems that this code does not support multiple GPUs.
Are you planning to add that?

RuntimeError During Training

After training for the first epoch I get the following error trying to calculate training accuracy and loss:

RuntimeError: Expected object of type torch.cuda.FloatTensor but found type torch.cuda.LongTensor for argument #2 'other'

The error is raised at line 102 of train.py:
return total_loss/n_total_words, n_total_correct/n_total_words

nan loss when training

Training and validation loss is nan (using commit e21800a):

$ python3 preprocess.py -train_src data/multi30k/train.en -train_tgt data/multi30k/train.de -valid_src data/multi30k/val.en -valid_tgt data/multi30k/val.de -output data/multi30k/data.pt
$ python3 train.py -data data/multi30k/data.pt -save_model trained -save_model best
[ Epoch 0 ]
  - (Training)   loss:      nan, accuracy: 3.7 %
  - (Validation) loss:      nan, accuracy: 10.0 %
    - [Info] The checkpoint file has been updated.
[ Epoch 1 ]
  - (Training)   loss:      nan, accuracy: 9.09 %
  - (Validation) loss:      nan, accuracy: 9.87 %
[ Epoch 2 ]
  - (Training)   loss:      nan, accuracy: 9.09 %
  - (Validation) loss:      nan, accuracy: 9.83 %
[ Epoch 3 ]
  - (Training)   loss:      nan, accuracy: 9.1 %
  - (Validation) loss:      nan, accuracy: 9.92 %
[ Epoch 4 ]
  - (Training)   loss:      nan, accuracy: 9.09 %
  - (Validation) loss:      nan, accuracy: 9.91 %

ejklektov@gpu3:~/attention-is-all-you-need-pytorch$ CUDA_VISIBLE_DEVICES=5 python3 train.py -data data/multi30k.atok.low.pt -save_model trained -save_mode best -proj_share_weight
Namespace(batch_size=64, cuda=True, d_inner_hid=1024, d_k=64, d_model=512, d_v=64, d_word_vec=512, data='data/multi30k.atok.low.pt', dropout=0.1, embs_share_weight=False, epoch=10, log=None, max_token_seq_len=52, n_head=8, n_layers=6, n_warmup_steps=4000, no_cuda=False, proj_share_weight=True, save_mode='best', save_model='trained', src_vocab_size=2909, tgt_vocab_size=3149)
/home/ejklektov/attention-is-all-you-need-pytorch/transformer/Modules.py:13: UserWarning: nn.init.xavier_normal is now deprecated in favor of nn.init.xavier_normal_.
init.xavier_normal(self.linear.weight)
/home/ejklektov/attention-is-all-you-need-pytorch/transformer/SubLayers.py:33: UserWarning: nn.init.xavier_normal is now deprecated in favor of nn.init.xavier_normal_.
init.xavier_normal(self.w_qs)
/home/ejklektov/attention-is-all-you-need-pytorch/transformer/SubLayers.py:34: UserWarning: nn.init.xavier_normal is now deprecated in favor of nn.init.xavier_normal_.
init.xavier_normal(self.w_ks)
/home/ejklektov/attention-is-all-you-need-pytorch/transformer/SubLayers.py:35: UserWarning: nn.init.xavier_normal is now deprecated in favor of nn.init.xavier_normal_.
init.xavier_normal(self.w_vs)
[ Epoch 0 ]

(Training) : 0%| | 0/454 [00:00<?, ?it/s]
/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py:491: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
result = self.forward(*input, **kwargs)
train.py:71: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
total_loss += loss.data[0]
Traceback (most recent call last):
  File "train.py", line 271, in <module>
    main()
  File "train.py", line 268, in main
    train(transformer, training_data, validation_data, crit, optimizer, opt)
  File "train.py", line 126, in train
    train_loss, train_accu = train_epoch(model, training_data, crit, optimizer)
  File "train.py", line 73, in train_epoch
    return total_loss/n_total_words, n_total_correct/n_total_words
RuntimeError: Expected object of type torch.cuda.FloatTensor but found type torch.cuda.LongTensor for argument #2 'other'

I changed line 71 of train.py,
loss.data[0] ===> loss.item[0]
but it doesn't work.

Dropout when predicting

Shouldn't we set the dropout probability to 0.0 during prediction?
I notice that in SubLayers.py line 27, attn_dropout is not set for ScaledDotProductAttention.
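
On the first point: in PyTorch, `nn.Dropout` only drops units while the module is in training mode, and calling `model.eval()` turns it into a pass-through, so the probability itself normally does not have to be reset to 0.0 for prediction. Below is a minimal pure-Python sketch of that behaviour (inverted dropout). The class `ToyDropout` and its fields are hypothetical names for illustration and are not taken from this repository; whether `attn_dropout` is passed correctly at line 27 is a separate question that the sketch does not address.

```python
import random


class ToyDropout:
    """Illustrative stand-in for a dropout layer (not the repository's code)."""

    def __init__(self, p=0.1, seed=0):
        self.p = p                  # probability of zeroing each element
        self.training = True        # mirrors the train/eval switch of a module
        self._rng = random.Random(seed)

    def eval(self):
        self.training = False
        return self

    def __call__(self, values):
        if not self.training or self.p == 0.0:
            return list(values)
        keep = 1.0 - self.p
        # Inverted dropout: scale kept values so the expected sum is unchanged.
        return [v / keep if self._rng.random() < keep else 0.0 for v in values]


layer = ToyDropout(p=0.5)
x = [1.0, 2.0, 3.0, 4.0]
train_out = layer(x)         # each entry is either 0.0 or twice the input
eval_out = layer.eval()(x)   # unchanged copy of x
```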

Code fails during translation

For translation, I use the following command
CUDA_LAUNCH_BLOCKING=1 CUDA_VISIBLE_DEVICES=1 python3 translate.py -model trained.chkpt -vocab data/nmt.atok.low.pt -src data/nmt/test.en.atok
and get the following error : error

Can someone help?

MultiHeadAttention() implementation question

Hi, Yu-Hsiang.
I am not clear about line 62 in the file SubLayers.py:

# back to original mb_size batch
outputs = outputs.view(mb_size, len_q, -1)            # mb_size x len_q x (n_head*d_v)

Is it right?
Is the above code equivalent to the following (TensorFlow implementation from Kyubyong)?

# Restore shape
outputs = tf.concat(tf.split(outputs, num_heads, axis=0), axis=2 )          # (N, T_q, C)
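
One way to check the equivalence is a tiny pure-Python experiment. It assumes `outputs` has shape (n_head * mb_size, len_q, d_v) with the head index as the outermost block; that layout is an assumption, since the issue does not show it. The helpers `view_as` and `split_then_concat` are hypothetical names that mimic a plain row-major `.view` and the TensorFlow split/concat line respectively.

```python
def view_as(t, d0, d1, d2):
    """Row-major reinterpretation, analogous to a plain .view(d0, d1, d2)."""
    flat = [x for block in t for row in block for x in row]
    assert len(flat) == d0 * d1 * d2
    it = iter(flat)
    return [[[next(it) for _ in range(d2)] for _ in range(d1)] for _ in range(d0)]


def split_then_concat(t, n_head):
    """Split along axis 0 into n_head chunks, then concatenate on the last axis."""
    mb_size = len(t) // n_head
    len_q = len(t[0])
    heads = [t[h * mb_size:(h + 1) * mb_size] for h in range(n_head)]
    return [[[x for h in range(n_head) for x in heads[h][b][q]]
             for q in range(len_q)]
            for b in range(mb_size)]


n_head, mb_size, len_q, d_v = 2, 2, 2, 1
# Each value encodes (head, batch, query) as head*100 + batch*10 + query.
outputs = [[[h * 100 + b * 10 + q] for q in range(len_q)]
           for h in range(n_head) for b in range(mb_size)]

viewed = view_as(outputs, mb_size, len_q, n_head * d_v)
merged = split_then_concat(outputs, n_head)
```

Under this assumed layout the two results differ: `viewed[0][0]` holds two query positions of a single head, while `merged[0][0]` holds both heads for batch 0, query 0, which is what the split/concat form produces.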

Why is the "get_attn_subsequent_mask" function needed?

What is the difference between encoder self-attention and decoder self-attention?
Why is the "get_attn_subsequent_mask" function needed in the decoder self-attention?

Thanks for your reply in advance!
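
As background for this question: encoder self-attention may look at every source position, whereas decoder self-attention must not look at target positions that come after the one being predicted, otherwise training would reveal the answer. A subsequent mask marks those future positions. The sketch below builds such a mask in plain Python; `subsequent_mask` is a hypothetical name, and the convention that 1 means "blocked" is an assumption rather than a statement about what `get_attn_subsequent_mask` does.

```python
def subsequent_mask(seq_len):
    """Return a seq_len x seq_len matrix with 1 where attention is blocked.

    Row i is the query position, column j the key position; j > i is the future.
    """
    return [[1 if j > i else 0 for j in range(seq_len)] for i in range(seq_len)]


mask = subsequent_mask(4)
for row in mask:
    print(row)
```

The first row blocks everything except position 0, and the last row blocks nothing, so each position only attends to itself and earlier positions.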

eval() questions

In your eval_epoch() function you feed src and tgt through the model just like in the training phase. Is this correct? Shouldn't evaluation be similar to testing, where the model does not know the true target? Should there be an autoregressive step for evaluation, where the predicted words are generated one by one and used by subsequent predictions?
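
The distinction raised here can be illustrated with a toy example. Feeding the gold target prefix at every step is usually called teacher forcing; it gives a cheap per-token accuracy but is not the same as generating a translation token by token. The model `toy_next_token` below is entirely made up for illustration, and nothing here reproduces `eval_epoch()`.

```python
def toy_next_token(prefix):
    """Hypothetical model: counts upward mod 5, but errs after seeing token 2."""
    last = prefix[-1]
    return 4 if last == 2 else (last + 1) % 5


def teacher_forced(gold):
    # Step t sees the gold prefix gold[:t], regardless of earlier predictions.
    return [toy_next_token(gold[:t]) for t in range(1, len(gold))]


def autoregressive(start, steps):
    # Each step sees the model's own previous outputs.
    seq = [start]
    for _ in range(steps):
        seq.append(toy_next_token(seq))
    return seq[1:]


def accuracy(pred, target):
    return sum(p == g for p, g in zip(pred, target)) / len(target)


gold = [0, 1, 2, 3, 4]                               # token 0 plays the role of BOS
forced_pred = teacher_forced(gold)                   # [1, 2, 4, 4]
free_pred = autoregressive(gold[0], len(gold) - 1)   # [1, 2, 4, 0]
forced_acc = accuracy(forced_pred, gold[1:])         # 0.75
free_acc = accuracy(free_pred, gold[1:])             # 0.5
```

In this toy case one early mistake costs a single token under teacher forcing but also derails the following step when the model consumes its own output, which is why the two numbers are not interchangeable.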

How to reduce the size of directory?

Training the model takes too many hours. How can I solve this?

About label smoothing.

attention-is-all-you-need-pytorch/train.py

Line 25 in 7fa8c63

gold = gold * (1 - eps) + (1 - gold) * eps / num_class

Hi, thanks for the implementation. It is very neat and elegant. I noticed that you mentioned "label smoothing" is not done yet, but I also found that you have implemented this line. I think it is correct, but I am not sure what num_class is in sequence-to-sequence models. Should it be equal to the size of the vocabulary?
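
On the `num_class` question: in a sequence-to-sequence model each target position is a classification over the target vocabulary, so the number of classes is normally the size of the output (target) vocabulary. The sketch below applies the quoted formula to a single one-hot row in plain Python; `smooth_one_hot` is a hypothetical helper. Note that, as written, the formula gives `eps / num_class` to each of the `num_class - 1` wrong classes, so the row sums to `1 - eps / num_class` rather than exactly 1.

```python
def smooth_one_hot(gold_index, num_class, eps=0.1):
    """Apply gold * (1 - eps) + (1 - gold) * eps / num_class to a one-hot row."""
    one_hot = [1.0 if k == gold_index else 0.0 for k in range(num_class)]
    return [g * (1 - eps) + (1 - g) * eps / num_class for g in one_hot]


row = smooth_one_hot(gold_index=2, num_class=5, eps=0.1)
total = sum(row)  # 0.9 + 4 * 0.02 = 0.98, i.e. 1 - eps / num_class
```

Dividing by `num_class - 1` instead would make the row sum to 1; which variant is intended is not stated in the issue.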

The accuracy during training is always zero

I have a problem: when training, the accuracy on both the training data and the validation data is always zero. Can anyone help me? Thanks a lot.

Something wrong with F.cross_entropy(pred, gold, ignore_index=Constants.PAD, reduction="sum")!

Hi, could you tell me what's wrong with 'loss = F.cross_entropy(pred, gold, ignore_index=Constants.PAD, reduction="sum")'?

Docstring style does not conform to PEP 8

As mentioned here:

PEP 257 describes good docstring conventions. Note that most importantly, the """ that ends a multiline docstring should be on a line by itself, e.g.:

"""Return a foobang

Optional plotz says to frobnicate the bizbaz first.
"""
For one liner docstrings, please keep the closing """ on the same line.

but most docstrings used in the code look like this:

''' document strings '''

Ubuntu Server unable to recognise German characters

Ubuntu Server : Ein Boston Terrier läuft über saftig-grünes Gras vor einem wei?^?en Zaun.

Macbook Pro : Ein Boston Terrier läuft über saftig-grünes Gras vor einem weißen Zaun.

Can you tell me how to set up the language encoding in Ubuntu? Best Wishes
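
One possible cause, assuming the text file itself is valid UTF-8, is that the server's locale is not a UTF-8 one, so either Python's default file encoding or the terminal's rendering garbles `ß`. The sketch below shows how to inspect the default and side-step it from Python by passing the encoding explicitly; the file name is a temporary placeholder. On the shell side, setting a UTF-8 locale (for example `export LC_ALL=C.UTF-8`) or `PYTHONIOENCODING=utf-8` is a common remedy, though whether it applies to this particular server is a guess.

```python
import locale
import os
import tempfile

# What Python would use by default for open() without an explicit encoding.
print("preferred encoding:", locale.getpreferredencoding(False))

sentence = "Ein Boston Terrier läuft über saftig-grünes Gras vor einem weißen Zaun."

# Passing encoding="utf-8" explicitly makes reads and writes locale-independent.
path = os.path.join(tempfile.mkdtemp(), "sample.de")
with open(path, "w", encoding="utf-8") as f:
    f.write(sentence + "\n")
with open(path, encoding="utf-8") as f:
    round_trip = f.readline().rstrip("\n")
```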

What command can resume the program?

Hi, thanks for sharing.
My program crashed.
What command can resume the run on the GPU?

CUDA_VISIBLE_DEVICES=3 python train.py -data data/multi30k.atok.low.pt -save_model trained -save_mode best -proj_share_weight

Tensor data type error

This may have something to do with pytorch version (I use 0.4.0), but I think people should know:

Traceback (most recent call last):
  File "train.py", line 271, in <module>
    main()
  File "train.py", line 268, in main
    train(transformer, training_data, validation_data, crit, optimizer, opt)
  File "train.py", line 126, in train
    train_loss, train_accu = train_epoch(model, training_data, crit, optimizer)
  File "train.py", line 73, in train_epoch
    return total_loss/n_total_words, n_total_correct/n_total_words
RuntimeError: Expected object of type torch.cuda.FloatTensor but found type torch.cuda.LongTensor for argument #2 'other'

Index error during translating

Hi,
I tried to force the GPU selection with CUDA_VISIBLE_DEVICES=1
but it raises an error:
RuntimeError: cublas runtime error : library not initialized at /py/conda-bld/pytorch_1490903321756/work/torch/lib/THC/THCGeneral.c:387

Pretrained models?

Does anyone have a pretrained model?

Temper in ScaledDotProductAttention?

I am referring to this code line that implements temper in ScaledDotProductAttention:

attention-is-all-you-need-pytorch/transformer/Modules.py

Line 76 in 7fa8c63

self.temper = np.power(d_model, 0.5)

What is this value based on? Can someone explain what it does here?
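
For context, the paper defines scaled dot-product attention as softmax(QK^T / sqrt(d_k)) V and motivates the division by noting that dot products grow with the vector dimension and push the softmax into regions with very small gradients. The quoted line takes the square root of `d_model` rather than `d_k`; whether that is intended is not something this issue settles. The pure-Python sketch below only illustrates the effect of the divisor; `softmax` and the example scores are assumptions made for the demonstration.

```python
import math


def softmax(scores):
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


d_k = 64
temper = math.sqrt(d_k)                   # 8.0, the paper's sqrt(d_k)
raw_scores = [16.0, 8.0, 0.0]             # made-up dot products q . k

unscaled = softmax(raw_scores)
scaled = softmax([s / temper for s in raw_scores])   # softmax of [2.0, 1.0, 0.0]
```

Without the divisor almost all of the probability mass lands on the largest score; with it the distribution is visibly softer, which keeps gradients through the softmax from vanishing.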

d_word_vec and d_model must be equal in Encoder

According to the paper, d_word_vec and d_model must be equal. However, the interface for Encoder allows you to set them to different values. If you initialize an Encoder with different values for the two, you get an error at line 54 of MultiHeadAttention during the forward pass.

bugs in the masking code

Hi, I found that the decoder has a subsequent mask which masks out future information. However, in line 123, you feed in dec_input (the target embedding) at the first layer. Now check this line and then the forward function of the MultiHeadAttention module: it has a residual connection, which lets dec_input reach the output directly. So it does not use the subsequent mask, which means that the model knows the future. Am I correct?
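
One way to test this reasoning: a residual connection adds position i's own input to position i's output and does not mix positions, so by itself it cannot carry information from later positions. The toy layer below (uniform attention over the visible positions plus a residual, with one scalar per position standing in for an embedding) is an assumption-laden sketch and not the repository's decoder. It checks that changing a future input leaves the earlier outputs unchanged.

```python
def toy_causal_layer(x):
    """Masked 'attention' (mean over positions <= i) followed by a residual add."""
    out = []
    for i in range(len(x)):
        visible = x[:i + 1]                  # the subsequent mask hides x[i+1:]
        attn = sum(visible) / len(visible)   # uniform attention weights
        out.append(x[i] + attn)              # residual: same position only
    return out


base = toy_causal_layer([1.0, 2.0, 3.0, 4.0])
changed = toy_causal_layer([1.0, 2.0, 3.0, 99.0])   # only the last input differs
```

Here `base[:3]` and `changed[:3]` are identical, so the residual path did not leak the last position backwards. In teacher-forced training the decoder input is also typically the target shifted right, so the input at a position is the previous token rather than the one being predicted.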

Batch Beam Search Problem

In Beam.py, lines 30-31:

self.next_ys = [self.tt.LongTensor(size).fill_(Constants.PAD)]
self.next_ys[0][0] = Constants.BOS

It seems that only the top hypothesis gets "BOS" as its start, while all other hypotheses get "PAD". Why don't all the hypotheses get "BOS" as their start?
And in Beam.py, lines 65-68:

        # End condition is when top-of-beam is EOS.
        if self.next_ys[-1][0] == Constants.EOS:
            self.done = True
            self.all_scores.append(self.scores)

you set the end condition to be when the top of the beam is "EOS". Why top-of-beam instead of all-of-beam?
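
A possible rationale for the first question, sketched under assumptions: at the first step every hypothesis would hold the same prefix, so expanding all of them would fill the beam with duplicates of the same best word. Beam implementations of this style therefore often expand only the first row at step one, which makes the initial content of the other rows irrelevant. The function `toy_beam_step` and its arguments are hypothetical and do not reproduce `Beam.py`; the end-condition question is not covered here.

```python
def toy_beam_step(scores, log_probs, first_step):
    """Pick the best `beam` (score, parent, word) triples.

    scores:    running log-probability per hypothesis, length beam
    log_probs: next-word log-probabilities, beam x vocab
    """
    beam = len(scores)
    rows = [0] if first_step else range(beam)
    base = [0.0] * beam if first_step else scores
    candidates = [(base[k] + log_probs[k][w], k, w)
                  for k in rows for w in range(len(log_probs[k]))]
    # sorted() is stable, so ties keep their original (parent, word) order.
    return sorted(candidates, key=lambda c: c[0], reverse=True)[:beam]


# At step one all rows see the same prefix, so their distributions are identical.
same = [-0.1, -2.0, -3.0]
log_probs = [same, same]

only_first = toy_beam_step([0.0, 0.0], log_probs, first_step=True)
all_rows = toy_beam_step([0.0, 0.0], log_probs, first_step=False)
first_words = [w for _, _, w in only_first]   # [0, 1]
naive_words = [w for _, _, w in all_rows]     # [0, 0]
```

Expanding only row 0 yields two distinct words, whereas expanding every identical row yields the same word twice, which wastes a beam slot.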

All translated sentences start with a same word

Has anyone run the code and encountered the same problem? @ZiJianZhao

No dictionaries saved after preprocessing

After running preprocess.py, only the atok.low.pt file is saved; no .dict files are saved.

Accuracy drops during training

Hi. I just followed the tutorial to train the model with the dataset given here. However, the accuracy is relatively high at epoch 0 and declines sharply after that. Has anybody run into a similar problem?

Here is the record:

[ Epoch 0 ]
(Training) ppl: 69.89619, accuracy: 48.127 %, elapse: 9.800 min
(Validation) ppl: 86.35283, accuracy: 20.530 %, elapse: 3.096 min
- [Info] The checkpoint file has been updated.
[ Epoch 1 ]
(Training) ppl: 135.71178, accuracy: 32.377 %, elapse: 10.807 min
(Validation) ppl: 865.33501, accuracy: 5.777 %, elapse: 3.052 min
[ Epoch 2 ]
(Training) ppl: 193.38618, accuracy: 27.988 %, elapse: 11.013 min
(Validation) ppl: 949.73713, accuracy: 4.359 %, elapse: 3.093 min

Assert error in validation.

Hi,

I've tried to run a training on iwslt data en-fr. The first train epoch finished with loss: nan, but this may be due to my choice of parameters. The problem is, when it started the validation I got the following error:

/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [109,0,0], thread: [88,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [109,0,0], thread: [89,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [109,0,0], thread: [90,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [109,0,0], thread: [91,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [109,0,0], thread: [92,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [109,0,0], thread: [93,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [109,0,0], thread: [94,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [109,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [52,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [53,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [54,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [55,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [56,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [57,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [58,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [59,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [60,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [61,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [62,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [97,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [98,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [99,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [100,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [101,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [102,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [103,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [104,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [105,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [106,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [107,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [108,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [109,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [110,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [111,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [112,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [113,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [114,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [115,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [116,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [117,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [118,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [119,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [120,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [121,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [122,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [123,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [124,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [125,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [126,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [121,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
THCudaCheck FAIL file=/py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/generic/THCTensorMath.cu line=226 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "attention-is-all-you-need-pytorch/train.py", line 244, in <module>
    main()
  File "attention-is-all-you-need-pytorch/train.py", line 241, in main
    train(transformer, training_data, validation_data, crit, optimizer, opt)
  File "attention-is-all-you-need-pytorch/train.py", line 120, in train
    valid_loss, valid_accu = eval_epoch(model, validation_data, crit)
  File "attention-is-all-you-need-pytorch/train.py", line 85, in eval_epoch
    pred = model(src, tgt)
  File "/hltmt0/data/digangi/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/hardmnt/hltmt0/data/digangi/attention-is-all-you-need-pytorch/transformer/Models.py", line 180, in forward
    enc_outputs, enc_slf_attns = self.encoder(src_seq, src_pos)
  File "/hltmt0/data/digangi/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/hardmnt/hltmt0/data/digangi/attention-is-all-you-need-pytorch/transformer/Models.py", line 76, in forward
    enc_output, slf_attn_mask=enc_slf_attn_mask)
  File "/hltmt0/data/digangi/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/hardmnt/hltmt0/data/digangi/attention-is-all-you-need-pytorch/transformer/Layers.py", line 18, in forward
    enc_input, enc_input, enc_input, attn_mask=slf_attn_mask)
  File "/hltmt0/data/digangi/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/hardmnt/hltmt0/data/digangi/attention-is-all-you-need-pytorch/transformer/SubLayers.py", line 43, in forward
    outputs = torch.cat(outputs, 2)
  File "/hltmt0/data/digangi/anaconda3/lib/python3.6/site-packages/torch/autograd/variable.py", line 841, in cat
    return Concat(dim)(*iterable)
  File "/hltmt0/data/digangi/anaconda3/lib/python3.6/site-packages/torch/autograd/_functions/tensor.py", line 310, in forward
    return torch.cat(inputs, self.dim)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /py/conda-bld/pytorch_1493680494901/work/torch/lib/THC/generic/THCTensorMath.cu:226

I have no experience with PyTorch, so I don't know how to fix it at the moment.

Softmax layer for output probabilities

Hello,

In the main Transformer model the encoder and decoder parts are calculated.
Then they are fed to a linear layer for the target word projections. But shouldn't this layer be followed by a softmax function to calculate the output probabilities like in the Transformer schematic?

Or am I overlooking something? I can't seem to locate this final softmax function in the code.
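One possible explanation, stated as an assumption rather than something confirmed in this thread: when the training loss is a cross-entropy that takes raw scores (PyTorch's nn.CrossEntropyLoss combines log-softmax and negative log-likelihood), the softmax is folded into the loss, and the argmax used for accuracy is the same with or without it. A small standard-library sketch of that equivalence, with made-up logits:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(logits, target):
    """Cross-entropy computed directly from raw scores (log-sum-exp form)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

# Illustrative values only, not taken from the model.
logits = [2.0, 0.5, -1.0]
target = 0

probs = softmax(logits)
loss_from_probs = -math.log(probs[target])        # explicit softmax, then NLL
loss_from_logits = cross_entropy_from_logits(logits, target)  # softmax folded into the loss

argmax_logits = max(range(len(logits)), key=lambda i: logits[i])
argmax_probs = max(range(len(probs)), key=lambda i: probs[i])
```

Both losses agree and both argmax values pick the same word, so an explicit softmax is only needed where actual probabilities are required, for example when scoring beams.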

Model training error

Training the model throws me the error below:

python train.py -data data/multi30k.atok.low.pt -save_model trained -save_mode best -proj_share_weight
Namespace(batch_size=64, cuda=True, d_inner_hid=1024, d_k=64, d_model=512, d_v=64, d_word_vec=512, data='data/multi30k.atok.low.pt', dropout=0.1, embs_share_weight=False, epoch=10, log=None, max_token_seq_len=52, n_head=8, n_layers=6, n_warmup_steps=4000, no_cuda=False, proj_share_weight=True, save_mode='best', save_model='trained', src_vocab_size=2909, tgt_vocab_size=3150)
('[ Epoch', 0, ']')

(Training) : 0%| | 0/453 [00:00<?, ?it/s]
/home/user/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py:357: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
  result = self.forward(*input, **kwargs)
Traceback (most recent call last):
  File "train.py", line 271, in <module>
    main()
  File "train.py", line 268, in main
    train(transformer, training_data, validation_data, crit, optimizer, opt)
  File "train.py", line 126, in train
    train_loss, train_accu = train_epoch(model, training_data, crit, optimizer)
  File "train.py", line 57, in train_epoch
    pred = model(src, tgt)
  File "/home/user/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/srv/disk01/user/medical_data/attention-is-all-you-need-pytorch/transformer/Models.py", line 192, in forward
    enc_output, _ = self.encoder(src_seq, src_pos)
ValueError: need more than 1 value to unpack
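A possible cause, assuming the encoder returns a one-element tuple when attention maps are not requested (this is not verified against the repository at this commit): unpacking a 1-tuple into two names raises exactly this kind of ValueError. The starred form `enc_output, *_ = ...` only exists in Python 3, so under Python 2.7 indexing the first element is the portable option. The sketch below uses a hypothetical `encoder_stub` in place of the real encoder:

```python
def encoder_stub(return_attns=False):
    """Hypothetical stand-in for an encoder that returns a 1-tuple by default."""
    enc_output = "enc_output"
    if return_attns:
        return enc_output, ["attn"]
    return (enc_output,)

def unpack_two(result):
    """Two-name unpacking: fails when the result holds a single value."""
    try:
        enc_output, _ = result
        return enc_output
    except ValueError:
        return None

def unpack_first(result):
    """Indexing works for tuples of any length, on Python 2 and 3 alike."""
    return result[0]
```

With the stub, `unpack_two(encoder_stub())` hits the ValueError path, while `unpack_first(encoder_stub())` returns the encoder output.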

Memory Problem?

Hi, I cloned your code and trained it on the WMT English-German task, but it failed with "RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1502009910772/work/torch/lib/THC/generic/THCStorage.cu:66".
I ran it with the default settings on a Tesla K40, which has the same 12GB memory capacity as your Titan X.
So I don't know why this happens. Do you have any idea? Thanks

Preprocessing Error

On running the following command for preprocessing
for l in en de; do for f in data/multi30k/*.$l; do if [[ "$f" != *"test"* ]]; then sed -i "$ d" $f; fi; done; done;
I'm getting the following error
sed: 1: "data/multi30k/train.en": extra characters at the end of d command
sed: 1: "data/multi30k/val.en": extra characters at the end of d command
sed: 1: "data/multi30k/train.de": extra characters at the end of d command
sed: 1: "data/multi30k/val.de": extra characters at the end of d command

Please advise how I should proceed.
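The message format suggests BSD sed (the default on macOS), where -i expects a backup suffix as its next argument, so "$ d" is read as the suffix and the file name as the script; passing an empty suffix (`sed -i '' '$ d' file`) is the usual workaround there. As a platform-independent alternative, here is a minimal Python sketch of the same step (drop the last line of every non-test file). The function names are illustrative, and the "test" check is applied to the file name only, which is slightly stricter than the shell pattern:

```python
import glob
import os

def drop_last_line(path):
    """Rewrite the file without its final line (what sed '$ d' does)."""
    with open(path, "r", encoding="utf-8") as f:
        lines = f.readlines()
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(lines[:-1])

def drop_last_line_of_non_test_files(data_dir, langs=("en", "de")):
    """Apply drop_last_line to every *.en / *.de file whose name lacks 'test'."""
    changed = []
    for lang in langs:
        for path in sorted(glob.glob(os.path.join(data_dir, "*." + lang))):
            if "test" not in os.path.basename(path):
                drop_last_line(path)
                changed.append(path)
    return changed
```

Calling `drop_last_line_of_non_test_files("data/multi30k")` would mirror the loop in the command above; back up the files first, since the rewrite is in place.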

question about softmax

When calculating the attention in MultiHead, the softmax is applied over dim 0, but I think dim=2 is correct.

Question about the value of temper in ScaledDotProductAttention and d_inner_hid default value

In the original paper, the dot products are divided by np.power(d_k, 0.5), but in your code the divisor, temper, is np.power(d_model, 0.5). I think this may be wrong.

Another question: in the original paper d_inner_hid is 2048, but your default is 1024. I don't know why.
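For reference, the paper defines attention as softmax(QK^T / sqrt(d_k)) V, with the softmax taken over the key positions for each query. Below is a minimal standard-library sketch of that formula using plain nested lists; it illustrates the definition and is not the repository's implementation:

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(q, k, v):
    """q: len_q x d_k, k: len_k x d_k, v: len_k x d_v (lists of lists)."""
    d_k = len(k[0])
    scale = math.sqrt(d_k)          # sqrt(d_k), not sqrt(d_model)
    weights = []
    for q_vec in q:
        scores = [sum(a * b for a, b in zip(q_vec, k_vec)) / scale for k_vec in k]
        weights.append(softmax(scores))   # normalise over the key axis
    d_v = len(v[0])
    output = [[sum(w * v_vec[j] for w, v_vec in zip(row, v)) for j in range(d_v)]
              for row in weights]
    return output, weights

# Toy inputs, chosen only for illustration.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
output, weights = scaled_dot_product_attention(q, k, v)
```

Each row of `weights` sums to 1, which is what normalising over the last (key) dimension means; in a batched tensor of shape (batch, len_q, len_k) that axis is dim 2.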

Bug in translating

There's a mistake when repeating data for beam search.

The source seq here

src_seq = Variable(src_seq.data.repeat(beam_size, 1))

gets a matrix in the following order for source sequence

seq1
seq2
seq3
seq1
seq2
seq3
seq1
seq2
seq3

while the beam search input here

input_data = torch.stack([b.get_current_state() for b in beam if not b.done])

takes an input like

seq1
seq1
seq1
seq2
seq2
seq2
seq3
seq3
seq3

Both are then fed into the decoder:

            dec_outputs, dec_slf_attns, dec_enc_attns = self.model.decoder(
                input_data, input_pos, src_seq, enc_outputs)

The order of the two inputs does not match.
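The two orderings can be reproduced with plain lists. Repeating a 2-D tensor with `repeat(beam_size, 1)` tiles the whole batch, whereas stacking the beams one after another keeps each sentence's hypotheses together. A minimal sketch (the helper names are illustrative only):

```python
def tile(seqs, beam_size):
    """Whole batch repeated: [s1, s2, s3, s1, s2, s3, ...]."""
    return [s for _ in range(beam_size) for s in seqs]

def interleave(seqs, beam_size):
    """Each item repeated in place: [s1, s1, s1, s2, s2, s2, ...]."""
    return [s for s in seqs for _ in range(beam_size)]

batch = ["seq1", "seq2", "seq3"]
tiled = tile(batch, 3)              # order produced by repeating the batch
interleaved = interleave(batch, 3)  # order produced by stacking beam states
```

For the decoder to pair each hypothesis with its own source sentence, both inputs have to use the same ordering, whichever of the two is chosen.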

Batch size limitation

Hi, I was wondering why the maximum batch size is ~100 on a GPU with ~11GB of RAM, whereas in tensor2tensor the maximum batch size is 1024?

masking on tensor.data?

Hi,

I noticed that you mask out the padded positions by assigning values to tensor.data.

attn.data.masked_fill_(attn_mask, -float('inf'))

Is this correct?

Based on the discussion here, shouldn't we assign values to the tensor itself instead of tensor.data, so that the gradient history can be tracked?

What is the performance on the WMT'14 EN-DE dataset?

Hi, could you reproduce the WMT'14 results of the "Attention is All You Need" paper? I would like to know the exact BLEU scores of your system on the WMT'14 EN-DE dataset.
Thanks in advance.

Masking bug?

I get 98% accuracy after 10 epochs on the multi30k validation set using this 1-layer model:

python train.py -data data/multi30k.atok.low.pt -save_model trained -save_mode best -proj_share_weight -dropout 0.0 -n_layers 1 -n_warmup_steps 40 -epoch 50 -d_inner_hid 1 -d_model 128 -d_word_vec 128 -n_head 4

This is a very small model (note -d_inner_hid 1), which should not get good results at all (98% accuracy is way too high in any case). Generating translations with translate.py produces nonsense. This makes me suspect that there is a problem with the masking code that allows the model to 'cheat' by looking at the target sequence.

I haven't been able to figure out where the problem is, but something seems wrong.
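For anyone checking the masking, this is what the decoder self-attention mask is expected to look like: position i may attend to positions 0..i and nothing later. The sketch below only states that expectation with plain lists and does not diagnose where the repository's code deviates from it:

```python
def subsequent_mask(length):
    """mask[i][j] is True when query position i must NOT see key position j (j > i)."""
    return [[j > i for j in range(length)] for i in range(length)]

def visible_positions(mask):
    """For each query position, list the key positions it is allowed to attend to."""
    return [[j for j, blocked in enumerate(row) if not blocked] for row in mask]

mask = subsequent_mask(3)
visible = visible_positions(mask)
```

If any row of the effective mask leaves a later position visible, the model can copy the next target token during training, which would give very high teacher-forced accuracy and poor translations at inference time.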

Dimension error in forward pass

I am receiving the following error when I try to run the train script:

File "train.py", line 266, in <module>
    main()
  File "train.py", line 263, in main
    train(transformer, training_data, validation_data, crit, optimizer, opt)
  File "train.py", line 124, in train
    train_loss, train_accu = train_epoch(model, training_data, crit, optimizer)
  File "train.py", line 55, in train_epoch
    pred = model(src, tgt)
  File "/home/ubuntu/miniconda3/envs/cuda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/attention-is-all-you-need-pytorch/transformer/Models.py", line 179, in forward
    enc_outputs, enc_slf_attns = self.encoder(src_seq, src_pos)
  File "/home/ubuntu/miniconda3/envs/cuda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/attention-is-all-you-need-pytorch/transformer/Models.py", line 76, in forward
    enc_output, slf_attn_mask=enc_slf_attn_mask)
  File "/home/ubuntu/miniconda3/envs/cuda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/attention-is-all-you-need-pytorch/transformer/Layers.py", line 18, in forward
    enc_input, enc_input, enc_input, attn_mask=slf_attn_mask)
  File "/home/ubuntu/miniconda3/envs/cuda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/attention-is-all-you-need-pytorch/transformer/SubLayers.py", line 68, in forward
    return self.layer_norm(outputs + residual), attns
  File "/home/ubuntu/miniconda3/envs/cuda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/attention-is-all-you-need-pytorch/transformer/Modules.py", line 52, in forward
    ln_out = (z - mu.expand_as(z)) / (sigma.expand_as(z) + self.eps)
  File "/home/ubuntu/miniconda3/envs/cuda/lib/python3.6/site-packages/torch/autograd/variable.py", line 681, in expand_as
    return Expand.apply(self, tensor.size())
  File "/home/ubuntu/miniconda3/envs/cuda/lib/python3.6/site-packages/torch/autograd/_functions/tensor.py", line 106, in forward
    result = i.expand(*new_size)
RuntimeError: The expanded size of the tensor (24) must match the existing size (64) at non-singleton dimension 1. at /home/ubuntu/cuda-ubuntu-16.04-ec2/pytorch/torch/lib/THC/generic/THCTensor.c:323
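A guess at the cause, not confirmed in this thread: the failing line expands `mu` and `sigma` to the shape of `z`, which only works if the reduction kept the reduced dimension. PyTorch 0.2 changed the default of `keepdim` to False for reductions such as mean and std, so code written for an earlier version can fail in exactly this way. The intended computation, normalising each feature vector by its own mean and standard deviation, is sketched below with plain lists (population standard deviation is an assumption here):

```python
import math

def layer_norm(rows, eps=1e-6):
    """Normalise each row (feature vector) to zero mean and unit scale."""
    out = []
    for row in rows:
        mu = sum(row) / len(row)
        var = sum((x - mu) ** 2 for x in row) / len(row)
        sigma = math.sqrt(var)
        # One mu and one sigma per row, applied to every element of that row.
        out.append([(x - mu) / (sigma + eps) for x in row])
    return out

normed = layer_norm([[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]])
```

In tensor code the equivalent is to reduce over the feature axis with the reduced dimension kept, so that the statistics broadcast back over `z`.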

Feeding the output of the last encoding layer to the decoder

The original paper and the animation on this page seem to feed only the output of the last encoder layer to the decoder, while the implementation here seems to feed the output of each encoder layer to the corresponding decoder layer, which might not work if the encoder and the decoder have different numbers of layers.

KeyError on testing

After fixing a key error about tensor integer types, running translate.py raises KeyErrors on numeric keys, and checking in Python confirms that those keys are indeed missing.
Skipping the non-existent keys inside the write loop avoids the error but gives poor results.
Result of pred.txt after running the code:

Here's the changed code:

Did anybody experience this or have a fix? Thank you.

need to check

if the source target numbers are correct

How to change the size of the dictionaries

How do I change the size of the dictionaries? I am using a new corpus, but its dictionary is so big. I do not know how to change it.

can not download mmt16_task1_test.tgz

How can I solve it?

Decoder input

Hi, I am not sure if you are feeding the right input to the decoder.

(pg. 2) "Given z, the decoder then generates an output sequence (y₁, ..., y_m) of symbols one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next."

I believe your decoder input is a batch of target sequences.
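For context, feeding the whole target sequence at training time is the usual teacher-forcing setup: the paper describes the decoder inputs as the outputs shifted right, and a causal mask keeps each position from seeing later ones, so one pass is equivalent to step-by-step decoding with the gold prefix. A minimal sketch of the shift, with hypothetical start/end symbols:

```python
BOS, EOS = "<s>", "</s>"  # hypothetical start/end symbols for illustration

def teacher_forcing_pairs(target_tokens):
    """Return (decoder_input, gold) where gold[t] is the token to predict after decoder_input[:t+1]."""
    full = [BOS] + list(target_tokens) + [EOS]
    decoder_input = full[:-1]   # shifted right: starts with BOS, drops EOS
    gold = full[1:]             # what the model should emit at each step
    return decoder_input, gold

decoder_input, gold = teacher_forcing_pairs(["ein", "hund"])
```

Step-by-step generation, consuming previously generated symbols, is only needed at inference time, when the gold target is unavailable.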

Error about the mask in ScaledDotProductAttention

Currently, the attention mask in the ScaledDotProductAttention is generated in Line 28 in Models.py by:
pad_attn_mask = seq_k.data.eq(Constants.PAD).unsqueeze(1)
pad_attn_mask = pad_attn_mask.expand(mb_size, len_q, len_k)

Ignoring the batch dimension for an explanation, I assume the generated pad_attn_mask is a matrix of shape (len_q * len_k), then this code will produce the matrix like [A 1], where 1 is an all one submatrix. However, I think the generated attention mask should be like [B 1 // 1 1], where 1 is an all one submatrix and // means line break (sorry I don't know how to type formula in Markdown environments).
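The two mask shapes described above can be written out with plain lists. In this sketch PAD = 0 is an assumption, and both helper names are illustrative; the first helper masks padded keys only (the [A 1] form), the second also masks padded query rows (the [B 1 // 1 1] form):

```python
PAD = 0  # assumed padding index, for illustration only

def key_padding_mask(seq_q, seq_k, pad=PAD):
    """mask[i][j] is True when key j is padding; every query row is identical."""
    key_is_pad = [tok == pad for tok in seq_k]
    return [list(key_is_pad) for _ in seq_q]

def full_padding_mask(seq_q, seq_k, pad=PAD):
    """mask[i][j] is True when either query i or key j is padding."""
    return [[(q == pad) or (k == pad) for k in seq_k] for q in seq_q]

seq = [5, 7, PAD]
keys_only = key_padding_mask(seq, seq)
both = full_padding_mask(seq, seq)
```

One practical caveat with the second form: a row that is masked everywhere becomes all -inf before the softmax, which yields NaN, so implementations often mask keys only and exclude padded query positions from the loss instead.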

Positional Encoding

In position_encoding_init, shouldn't it be

[pos / np.power(10000, (i//2)*2 / d_pos_vec ) for i in range(d_pos_vec)]

instead of

[pos / np.power(10000, 2*i/d_pos_vec) for i in range(d_pos_vec)]

In the original formulation, for positions 2i and 2i+1, the power should be 2i / d_model.
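In the paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), so a sine/cosine pair shares one frequency. A minimal standard-library sketch of the proposed indexing, where `i // 2` maps dimensions 2i and 2i+1 to the same exponent (the function name is illustrative):

```python
import math

def position_encoding(pos, d_model):
    """Sinusoidal encoding for one position; dims 2i and 2i+1 share an angle."""
    enc = []
    for i in range(d_model):
        angle = pos / math.pow(10000, 2 * (i // 2) / d_model)
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc

pe = position_encoding(3, 8)
```

Because each pair uses the same angle, `pe[2*i]**2 + pe[2*i+1]**2` equals 1 for every i, which is a quick sanity check that would fail with the `2*i/d_pos_vec` exponent.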

What does the score in beam search stand for?

I notice there's a score during beam search, but its meaning is ambiguous and hard to understand. Is there an intuitive description of it?
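In most beam search implementations the score of a hypothesis is its cumulative log-probability, that is, the sum of log P(token | prefix) over the tokens chosen so far; higher (closer to zero) means more probable. Whether this repository adds anything on top, such as length normalisation, is not stated here, so the sketch below is a generic illustration with made-up probabilities:

```python
import math

def extend_beam(hypotheses, step_log_probs, beam_size):
    """hypotheses: list of (tokens, score); step_log_probs[h][tok] = log P(tok | hypothesis h)."""
    candidates = []
    for (tokens, score), log_probs in zip(hypotheses, step_log_probs):
        for tok, lp in enumerate(log_probs):
            # The new score is the old score plus the log-probability of the new token.
            candidates.append((tokens + [tok], score + lp))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_size]

start = [([0], 0.0)]  # one hypothesis holding a start token, log-probability 0
step = [[math.log(0.6), math.log(0.3), math.log(0.1)]]
beam = extend_beam(start, step, beam_size=2)
```

Exponentiating a score recovers the probability the model assigns to the whole prefix, so `math.exp(beam[0][1])` here is the probability of the best one-step continuation.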

embedding of positional encoding?

Great work and thanks a lot. I wanted to ask why you do embeddings of the pos encoder?

attention-is-all-you-need-pytorch/transformer/Models.py

Line 55 in 1600401

    
           self.position_enc = nn.Embedding(n_position, d_word_vec, padding_idx=Constants.PAD)

I believe the pos encoder should just be added to the input embeddings, like here:
https://github.com/Kyubyong/transformer/blob/master/train.py

Let me know, thanks a lot

TypeError: cat() takes no keyword arguments

Traceback (most recent call last):
  File "train.py", line 266, in <module>
    main()
  File "train.py", line 263, in main
    train(transformer, training_data, validation_data, crit, optimizer, opt)
  File "train.py", line 124, in train
    train_loss, train_accu = train_epoch(model, training_data, crit, optimizer)
  File "train.py", line 55, in train_epoch
    pred = model(src, tgt)
  File "/home/sushuting/local/anaconda3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/sushuting/workspace/attention-is-all-you-need-pytorch/transformer/Models.py", line 179, in forward
    enc_outputs, enc_slf_attns = self.encoder(src_seq, src_pos)
  File "/home/sushuting/local/anaconda3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/sushuting/workspace/attention-is-all-you-need-pytorch/transformer/Models.py", line 76, in forward
    enc_output, slf_attn_mask=enc_slf_attn_mask)
  File "/home/sushuting/local/anaconda3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/sushuting/workspace/attention-is-all-you-need-pytorch/transformer/Layers.py", line 18, in forward
    enc_input, enc_input, enc_input, attn_mask=slf_attn_mask)
  File "/home/sushuting/local/anaconda3/lib/python3.5/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/sushuting/workspace/attention-is-all-you-need-pytorch/transformer/SubLayers.py", line 62, in forward
    outputs = torch.cat(torch.split(outputs, mb_size, dim=0), dim=-1)
TypeError: cat() takes no keyword arguments