harvardnlp / annotated-transformer

An annotated implementation of the Transformer paper.

Home Page: http://nlp.seas.harvard.edu/annotated-transformer

License: MIT License

Languages: Jupyter Notebook 98.83%, Python 0.87%, TeX 0.28%, Makefile 0.02%
Topics: annotated, notebook, python

annotated-transformer's Introduction

Code for The Annotated Transformer blog post:

http://nlp.seas.harvard.edu/annotated-transformer/

Open In Colab


Package Dependencies

Use requirements.txt to install library dependencies with pip:

pip install -r requirements.txt

Notebook Setup

The Annotated Transformer is created using jupytext.

Regular notebooks pose problems for source control - cell outputs end up in the repo history and diffs between commits are difficult to examine. Using jupytext, there is a python script (.py file) that is automatically kept in sync with the notebook file by the jupytext plugin.

The committed Python script contains all the cell content and can be used to generate the notebook file. The Python script is a regular Python source file: markdown sections are included using a standard comment convention, and outputs are not saved. The notebook itself is treated as a build artifact and is not committed to the git repository.

Prior to using this repo, make sure jupytext is installed by following the installation instructions here.

To produce the .ipynb notebook file using the markdown source, run (under the hood, the notebook build target simply runs jupytext --to ipynb the_annotated_transformer.py):

make notebook

To produce the html version of the notebook, run:

make html

make html is just a shortcut for generating the notebook with jupytext and then converting it to HTML with jupyter nbconvert; the equivalent commands are shown below.
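The two underlying commands, as described above, are:

    jupytext --to ipynb the_annotated_transformer.py
    jupyter nbconvert --to html the_annotated_transformer.ipynb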

Formatting and Linting

To keep the code formatting clean, the annotated-transformer repo has a GitHub Action that checks that the code conforms to PEP 8 coding standards.

To make this easier, there are two Makefile build targets that run automatic code formatting with black and linting with flake8.

Be sure to install black and flake8.

You can then run:

make black

(or alternatively call black manually with black --line-length 79 the_annotated_transformer.py) to format code automatically using black, and:

make flake

(or call flake8 manually with flake8 --show-source the_annotated_transformer.py) to check for PEP 8 violations.

It's recommended to run these two commands and fix any flake8 errors that arise before submitting a PR; otherwise the GitHub Actions CI will report an error.

annotated-transformer's People

Contributors

austinvhuang, jonathansum, khalidalt, srush, subramen


annotated-transformer's Issues

teacher forcing problem

Thanks for this tutorial, it's really helpful to me, but I can't find where the difference caused by teacher forcing between the training process and the inference process shows up. Thanks!

Preprocessed Dataset

Can you please provide the dataset that you had pre-processed for the translation task?

Question about learning rate

def rate(self, step = None):
    "Implement `lrate` above"
    if step is None:
        step = self._step
    return self.factor * \
        (self.model_size ** (-0.5) *
        min(step ** (-0.5), step * self.warmup ** (-1.5)))

What's the role of self.factor?
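For reference, reading the expression off the code above (with factor acting as an overall scale on the lrate from the paper):

    \mathrm{lrate} = \mathrm{factor} \cdot d_{\mathrm{model}}^{-0.5} \cdot \min\bigl(\mathrm{step}^{-0.5},\ \mathrm{step} \cdot \mathrm{warmup}^{-1.5}\bigr)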

LayerNorm(x + Sublayer(x))

In the text (between cells 7 and 8), you say the output of each sub-layer is:

layer_norm(x + dropout(sublayer(x)))

but in the code it looks like it's implemented as

x + dropout(sublayer(layer_norm(x)))

Am I understanding correctly? (I'm guessing this doesn't matter in practice, but just checking)

Thanks
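A minimal sketch of the two orderings being compared, using hypothetical stand-in modules just to make the difference concrete:

    import torch.nn as nn

    # Hypothetical stand-in modules, only to illustrate the two orderings.
    norm = nn.LayerNorm(512)
    dropout = nn.Dropout(0.1)
    sublayer = nn.Linear(512, 512)

    def post_norm(x):
        # Ordering described in the text: LayerNorm(x + dropout(Sublayer(x)))
        return norm(x + dropout(sublayer(x)))

    def pre_norm(x):
        # Ordering in the code: x + dropout(Sublayer(LayerNorm(x)))
        return x + dropout(sublayer(norm(x)))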

Why called NoamOpt?

because of Noam Shazeer, one of the authors of the paper?
what is the connection?

Use of `deepcopy` in `clones` function

The code duplicates encoder/decoder layers using the following clones function:

def clones(module, N):
    "Produce N identical layers."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

My understanding is that this would cause all layers to share the same random initial values for the learnable parameters, which is likely non-optimal. Is this just to make the code simpler? I feel like there are easy alternatives, e.g.:

def clones(module_fn, N):
    "Produce N identical layers."
    return nn.ModuleList([module_fn() for _ in range(N)])

On a different note, kudos for writing such a clear and useful document!
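A small sketch of the deepcopy behaviour in question: the clones start from identical values but are independent parameters.

    import copy
    import torch
    import torch.nn as nn

    layer = nn.Linear(4, 4)
    layers = nn.ModuleList([copy.deepcopy(layer) for _ in range(2)])

    # The clones start from identical initial values...
    assert torch.equal(layers[0].weight, layers[1].weight)
    # ...but they are separate Parameter objects, trained independently.
    assert layers[0].weight is not layers[1].weight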

TypeError: torch.index_select received an invalid combination of arguments

This problem happens in the 'Greedy Decoding' section.

    TypeError                                 Traceback (most recent call last)
    <ipython-input> in <module>()
          9 model.train()
         10 run_epoch(data_gen(V, 30, 20), model,
    ---> 11           SimpleLossCompute(model.generator, criterion, model_opt))
         12 model.eval()
         13 print(run_epoch(data_gen(V, 30, 5), model,

    <ipython-input> in run_epoch(data_iter, model, loss_compute)
          7 for i, batch in enumerate(data_iter):
          8     out = model.forward(batch.src, batch.trg,
    ----> 9                         batch.src_mask, batch.trg_mask)
         10     loss = loss_compute(out, batch.trg_y, batch.ntokens)
         11     total_loss += loss

    <ipython-input> in forward(self, src, tgt, src_mask, tgt_mask)
         14 def forward(self, src, tgt, src_mask, tgt_mask):
         15     "Take in and process masked src and target sequences."
    ---> 16     return self.decode(self.encode(src, src_mask), src_mask,
         17                        tgt, tgt_mask)
         18

    <ipython-input> in encode(self, src, src_mask)
         18
         19 def encode(self, src, src_mask):
    ---> 20     return self.encoder(self.src_embed(src), src_mask)
         21
         22 def decode(self, memory, src_mask, tgt, tgt_mask):

    X:\ProgramData\Anaconda3\Lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
        323 for hook in self._forward_pre_hooks.values():
        324     hook(self, input)
    --> 325 result = self.forward(*input, **kwargs)
        326 for hook in self._forward_hooks.values():
        327     hook_result = hook(self, input, result)

    X:\ProgramData\Anaconda3\Lib\site-packages\torch\nn\modules\container.py in forward(self, input)
         65 def forward(self, input):
         66     for module in self._modules.values():
    ---> 67         input = module(input)
         68     return input
         69

    X:\ProgramData\Anaconda3\Lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
        323 for hook in self._forward_pre_hooks.values():
        324     hook(self, input)
    --> 325 result = self.forward(*input, **kwargs)
        326 for hook in self._forward_hooks.values():
        327     hook_result = hook(self, input, result)

    <ipython-input> in forward(self, x)
          6
          7 def forward(self, x):
    ----> 8     return self.lut(x) * math.sqrt(self.d_model)

    X:\ProgramData\Anaconda3\Lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
        323 for hook in self._forward_pre_hooks.values():
        324     hook(self, input)
    --> 325 result = self.forward(*input, **kwargs)
        326 for hook in self._forward_hooks.values():
        327     hook_result = hook(self, input, result)

    X:\ProgramData\Anaconda3\Lib\site-packages\torch\nn\modules\sparse.py in forward(self, input)
        101     input, self.weight,
        102     padding_idx, self.max_norm, self.norm_type,
    --> 103     self.scale_grad_by_freq, self.sparse
        104 )
        105

    X:\ProgramData\Anaconda3\Lib\site-packages\torch\nn\_functions\thnn\sparse.py in forward(cls, ctx, indices, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
         57     output = torch.index_select(weight, 0, indices)
         58 else:
    ---> 59     output = torch.index_select(weight, 0, indices.view(-1))
         60 output = output.view(indices.size(0), indices.size(1), weight.size(1))

Question about masking on src and tgt

In the Batch class, the masks related to padding in the sentences confuse me a lot. By examining how src_mask and tgt_mask work, I find that those masks only mask columns when doing self-attention after Q x K. However, I think both columns and rows of the resulting square matrix should be masked. Can anyone help me figure this out? Thanks.
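For concreteness, a minimal sketch (not the notebook's exact code) of how a key-side padding mask is commonly applied to the Q x K^T scores before the softmax:

    import torch

    def apply_key_padding_mask(scores, mask):
        # scores: (batch, heads, query_len, key_len) attention logits from Q x K^T
        # mask:   (batch, 1, 1, key_len), 0 where the key position is padding
        return scores.masked_fill(mask == 0, -1e9)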

Any Plan on Adding Beam Search?

Thanks for providing this fine-grained tutorial on reproducing the Transformer in PyTorch!

I'm wondering whether you have any plans to add support for beam search? Thanks!

Best

License?

Hello. What license is the code/post under?

decoding loop

I have a question regarding the decoding loop (cell 45)

for i, batch in enumerate(valid_iter):

I want to run it for several validation instances, not only one. Even after deleting the break at the end of the loop, it still fails after the first instance with the error message

DeprecationWarning: generator 'Iterator.__iter__' raised StopIteration

When I replace valid_iter with train_iter, it runs for the first 42 instances and then fails with the same error message.

I appreciate your help.

where should layer norm be applied?

class SublayerConnection(nn.Module):
    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        "Apply residual connection to any sublayer with the same size."
        # return x + self.dropout(sublayer(self.norm(x)))
        return self.norm(x + self.dropout(sublayer(x)))

In the paper, it seems to be "return self.norm(x + self.dropout(sublayer(x)))". But it gives a bad result for the example "Train the simple copy task."

Not Working on pytorch 1.0.1.post2

I am consistently getting two errors which I have failed to resolve even after spending hours on them!
I need some serious help here!

  1. RuntimeError: Expected object of backend CUDA but got backend CPU for argument #3 'index'
.
.
.
<ipython-input-139-2ae4ba63671c> in encode(self, src, src_mask)
     17 
     18     def encode(self, src, src_mask):
---> 19         return self.encoder(self.src_embed(src), src_mask)
     20 
     21     def decode(self, memory, src_mask, tgt, tgt_mask):
.
.
.
RuntimeError: Expected object of backend CUDA but got backend CPU for argument #3 'index'
  2. NotImplementedError
.
.
.
74     def encode(self, src, src_mask):
---> 75         return self.encoder(self.src_embed(src), src_mask)

Both have self.encoder() at fault. I can't really figure out what is happening. It would be super great if someone could provide insights on this. Link to that Jupyter notebook:
https://github.com/AmoghM/DeepLearning/blob/master/TransformerNetwork/HarvardTransformer.ipynb

EOFError: Compressed file ended before the end-of-stream marker was reached

When splitting the dataset with the following line, the error happens:

train, val, test = datasets.IWSLT.splits(
    exts=('.de', '.en'), fields=(SRC, TGT),
    filter_pred=lambda x: len(vars(x)['src']) <= MAX_LEN and
                          len(vars(x)['trg']) <= MAX_LEN)

I think that's mostly due to incompatible versions.
My versions are below (actually there is no requirements file in the repo; I'd be grateful if someone added one):

torch==0.3.0.post4
torchtext==0.2.3
spacy==2.0.10
torchvision==0.2.1

Thank you!

label smoothing running error

When I run the example for the LabelSmoothing class, I met a problem shown as follows:

RuntimeError: invalid argument 3: Index is supposed to be a vector at d:\build\pytorch\pytorch-0.4.1\aten\src\th\generic\thtensormath.cpp:569.

I find that the line "true_dist.index_fill_(0, mask.squeeze(), 0)" causes the error because mask is empty.

Besides, the second example still has a problem. The second example is:

    crit = LabelSmoothing(5, 0, 0.4)
    predict = torch.FloatTensor([[0, 0.2, 0.7, 0.1, 0],
                                 [0, 0.2, 0.7, 0.1, 0],
                                 [0, 0.2, 0.7, 0.1, 0]])
    v = crit(Variable(predict.log()),
             Variable(torch.LongTensor([2, 1, 0])))
    plt.imshow(crit.true_dist)
    None

The result of v is "inf".

Does anyone have the same problem?
I am wondering whether the label smoothing works correctly. How about removing the label smoothing step from the project?

Hoping for your reply. Thanks a lot!

MultipleGPULossCompute no longer works?

Sorry for raising yet another issue. I am trying to adapt this code to my own dataset. Everything works fine except for loss computation with multiple GPUs, where I get an assertion error in parallel_apply. Any idea how to fix this? I am using PyTorch 0.4.1. Thank you so much for your help.

    train_loss = run_epoch((rebatch(pad_idx, b) for b in train_iter), model_par,
                           MultiGPULossCompute(model.generator, criterion, devices=devices, opt=model_opt))
      File "main.py", line 75, in run_epoch
        loss = loss_compute(out, batch.trg_y, batch.ntokens)
      File "/home/richie/ErrorCorrection/optim.py", line 59, in __call__
        gen = nn.parallel.parallel_apply(generator, out_column)
      File "/home/richie/anaconda3/envs/python36/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 30, in parallel_apply
        assert len(modules) == len(inputs)
    AssertionError

maybe a error in LabelSmoothing

The forward() in the LabelSmoothing class throws an error when running block [29]. Can you check whether the condition "if mask.dim() > 0:" should be "if mask.nelement() > 0:"? If the mask is tensor([]), then dim() equals 1 but the number of elements equals 0.

The following is my modified code; with it there is no error like before.

    def forward(self, x, target):
        assert x.size(1) == self.size
        true_dist = x.data.clone()
        true_dist.fill_(self.smoothing / (self.size - 2))
        true_dist.scatter_(1, target.data.unsqueeze(1), self.confidence)
        true_dist[:, self.padding_idx] = 0
        mask = torch.nonzero(target.data == self.padding_idx)
        if mask.nelement() > 0:
            true_dist.index_fill_(0, mask.squeeze(), 0.0)
        self.true_dist = true_dist
        return self.criterion(x, Variable(true_dist, requires_grad=False))

loss calculation in MultiGPULossCompute

Hi, thanks so much for sharing. I have a quick question about the loss calculation for multiple GPUs. I am not very familiar with PyTorch and am having a hard time understanding this line: o1.backward(gradient=o2). Why is backprop done on out (not loss), and why do we need to pass o2? Thanks for your answer.

MathJax is broken when using HTTPS

The MathJax CDN link uses HTTP, not HTTPS, so Chrome and other browsers will refuse to load it when the main page is served over HTTPS.

on the section of 'Greedy Decoding', got a problem

    Traceback (most recent call last):
      File "C:/Users/01/Desktop/机器学习作业/AllenNLP/test.py", line 632, in <module>
        train_epoch(data_gen(V, 30, 20), model, criterion, model_opt)
      File "C:/Users/01/Desktop/机器学习作业/AllenNLP/test.py", line 589, in train_epoch
        out = model.forward(src, trg[:, :-1], src_mask, trg_mask[:, :-1, :-1])
      File "C:/Users/01/Desktop/机器学习作业/AllenNLP/test.py", line 42, in forward
        return self.decode(self.encode(src, src_mask), src_mask, tgt, tgt_mask)
      File "C:/Users/01/Desktop/机器学习作业/AllenNLP/test.py", line 36, in encode
        return self.encoder(self.tgt_embed(src), src_mask)
      File "C:\ProgramData\Anaconda3\envs\python36\lib\site-packages\torch\nn\modules\module.py", line 491, in __call__
        result = self.forward(*input, **kwargs)
      File "C:\ProgramData\Anaconda3\envs\python36\lib\site-packages\torch\nn\modules\container.py", line 91, in forward
        input = module(input)
      File "C:\ProgramData\Anaconda3\envs\python36\lib\site-packages\torch\nn\modules\module.py", line 491, in __call__
        result = self.forward(*input, **kwargs)
      File "C:/Users/01/Desktop/机器学习作业/AllenNLP/test.py", line 319, in forward
        return self.lut(x) * math.sqrt(self.d_model)
      File "C:\ProgramData\Anaconda3\envs\python36\lib\site-packages\torch\nn\modules\module.py", line 491, in __call__
        result = self.forward(*input, **kwargs)
      File "C:\ProgramData\Anaconda3\envs\python36\lib\site-packages\torch\nn\modules\sparse.py", line 108, in forward
        self.norm_type, self.scale_grad_by_freq, self.sparse)
      File "C:\ProgramData\Anaconda3\envs\python36\lib\site-packages\torch\nn\functional.py", line 1076, in embedding
        return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
    RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long; but got CPUIntTensor instead (while checking arguments for embedding)

When I use Colab, I get the correct result, but in my own PyCharm setup I get this problem.
Thanks; this model is new to me.

Write() Unicode error during Data loading

I was following your annotated IPython code, and a weird error came up in the data loading section.

.data/iwslt/de-en/IWSLT16.TED.tst2012.de-en.en.xml
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-74-c069794b494b> in <module>()
     23     train, val, test = datasets.IWSLT.splits(
     24         exts=('.de', '.en'), fields=(SRC, TGT),
---> 25         filter_pred=lambda x: len(vars(x)['src']) <= MAX_LEN and
     26             len(vars(x)['trg']) <= MAX_LEN)
     27     MIN_FREQ = 2

/usr/local/lib/python2.7/dist-packages/torchtext-0.2.1-py2.7.egg/torchtext/datasets/translation.pyc in splits(cls, exts, fields, root, train, validation, test, **kwargs)
    136 
    137         if not os.path.exists(os.path.join(path, train) + exts[0]):
--> 138             cls.clean(path)
    139 
    140         train_data = None if train is None else cls(

/usr/local/lib/python2.7/dist-packages/torchtext-0.2.1-py2.7.egg/torchtext/datasets/translation.pyc in clean(path)
    156                 for doc in root.findall('doc'):
    157                     for e in doc.findall('seg'):
--> 158                         fd_txt.write(e.text.strip() + '\n')
    159 
    160         xml_tags = ['<url', '<keywords', '<talkid', '<description',

TypeError: write() argument 1 must be unicode, not str

It seems like the code cannot locate the path of the datasets, although the dataset is actually in the path.

Were there any updates to the dependencies after this code was released?

TypeError: __init__() takes from 1 to 4 positional arguments but 5 were given

This occurs in the block immediately after the cell with the comment:

Create the model an load it onto our GPU.

Here is the cell that throws the error:

    criterion = LabelSmoothing(size=len(TGT.vocab), padding_idx=pad_idx, smoothing=0.1)
    criterion.cuda()
    for epoch in range(15):
        train_epoch((rebatch(pad_idx, b) for b in train_iter), model, criterion, model_opt)
        valid_epoch((rebatch(pad_idx, b) for b in valid_iter), model, criterion)

=============================
Here is the error message:

    TypeError                                 Traceback (most recent call last)
    <ipython-input> in <module>()
          2 criterion.cuda()
          3 for epoch in range(15):
    ----> 4     train_epoch((rebatch(pad_idx, b) for b in train_iter), model, criterion, model_opt)
          5     valid_epoch((rebatch(pad_idx, b) for b in valid_iter), model, criterion)

    2 frames

    /usr/local/lib/python3.6/dist-packages/torchtext/data/iterator.py in __iter__(self)
        149     minibatch.sort(key=self.sort_key, reverse=True)
        150     yield Batch(minibatch, self.dataset, self.device,
    --> 151                 self.train)
        152     if not self.repeat:
        153         return

    TypeError: __init__() takes from 1 to 4 positional arguments but 5 were given

error in Batch class

  1. In the function "make_tgt_mask()" of the Batch class, the code "tgt_mask = (tgt == pad).unsqueeze(-2)" may be wrong; it should be "tgt_mask = (tgt != pad).unsqueeze(-2)" instead.
  2. The code in future_mask, "return torch.from_numpy(future_mask.astype('uint8'))", should be "return torch.from_numpy(future_mask.astype('uint8')) == 0".

from SimpleLossCompute.__call__ to LabelSmoothing.forward, why does the dim of x change?

Thank you for a great piece.

I have a question about the forward method in LabelSmoothing. I debugged it in PyCharm, and execution goes from

    loss = self.criterion(x.contiguous().view(-1, x.size(-1)),  # [30, 9, 512] --> [270, 512]
                          y.contiguous().view(-1)) / norm       # [30, 9] --> [270]

(SimpleLossCompute's __call__ method) to the forward method in LabelSmoothing:

    def forward(self, x, target):  # x is the output of the model, target is the label
        assert x.size(1) == self.size
        true_dist = x.data.clone()
        ....

Why does the dim of x change from [270, 512] to [270, 11]?

Platform dependent iwslt.pt

Hello again!

When I tried to load the pre-trained model

model = torch.load("iwslt.pt")

it fails with the error message

RuntimeError: cuda runtime error (35) : CUDA driver version is insufficient for CUDA runtime version at torch/csrc/cuda/Module.cpp:107

Is it possible (in addition to the entire model) to save just the model weights so that I can load the model on any platform (it can be done with something like torch.save(the_model.state_dict(), PATH))?
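For what it's worth, a self-contained sketch of the state_dict round trip, using a stand-in module and a placeholder filename:

    import torch
    import torch.nn as nn

    # Hypothetical stand-in module, just to illustrate the state_dict round trip.
    model = nn.Linear(8, 8)

    # Save only the weights on the machine that trained the model...
    torch.save(model.state_dict(), "weights.pt")

    # ...and load them on any platform, mapping CUDA tensors to CPU if needed.
    model.load_state_dict(torch.load("weights.pt", map_location="cpu"))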

The kernel appears to have died. It will restart automatically.

I was trying to run this code in a Jupyter notebook, but when I ran this cell, an error came up: 'The kernel appears to have died. It will restart automatically.' I can't figure out why this error occurs; could anybody offer me some help? Thank you so much.

# Train the simple copy task.
V = 11
criterion = LabelSmoothing(size=V, padding_idx=0, smoothing=0.0)
model = make_model(V, V, N=2)
model_opt = NoamOpt(model.src_embed[0].d_model, 1, 400,
        torch.optim.Adam(model.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9))

for epoch in range(3):
    model.train()
    run_epoch(data_gen(V, 30, 20), model, 
              SimpleLossCompute(model.generator, criterion, model_opt))
    model.eval()
    print(run_epoch(data_gen(V, 30, 5), model, 
                    SimpleLossCompute(model.generator, criterion, None)))

A mistake in attention-visualization?

I'm looking at the section "Attention Visualization" in the jupyter notebook: "The Annotated Transformer.ipynb".
Is there any chance that there is a mistake in the visualization code?

When presenting the "Decoder Src Layer", the code that extracts the attention scores from the "self_attn" layers is:

    fig, axs = plt.subplots(1, 4, figsize=(20, 10))
    for h in range(4):
        draw(model.decoder.layers[layer].self_attn.attn[0, h].data[:len(tgt_sent), :len(sent)],
             sent, tgt_sent if h == 0 else [], ax=axs[h])

Shouldn't it be replaced by the "src_attn" layer instead of the "self_attn" layer?

Thanks,
Ori


Training IWSLT on CPU

Hello!

Thank you very much for your contribution.

I wonder how to adapt the code in order to train a model on IWSLT data on my PC without GPUs.

It seems like MultiGPULossCompute should be replaced in run_epoch, but SimpleLossCompute doesn't seem like an appropriate candidate.

I would appreciate any hint.

Do we need MultiGPULossCompute?

Since we have
model_par = torch.nn.DataParallel(model, devices=devices)
do we need MultiGPULossCompute?
I thought PyTorch's DataParallel already comes with target scattering and the other things MultiGPULossCompute does built in.

I ran into an uninformative error while using MultiGPULossCompute, so I went with SimpleLossCompute and everything was fine.

errors in class MultiHeadedAttention?

I think this line in the MultiHeadedAttention: l(x).view(nbatches, -1, self.h, self.d_k) should be l(x).view(nbatches, self.h, -1, self.d_k), such that the unsqueezed mask could be applied to all heads.

This does not work AT ALL even in Pytorch 0.3

I installed PyTorch 0.3 as requested and also tried PyTorch 1.1, and it does not work.

I mean even the toy example does not work.

This significantly diminishes the value of this implementation, so I suggest fixing the code so it works with newer versions of PyTorch.

Or fix it so it works in any version of PyTorch :)

loss.backward() inside SimpleLossCompute

In the __call__ function of the SimpleLossCompute class, loss.backward() is called even if opt is None, i.e. during validation. Wouldn't this cause gradients to be computed on the validation set, and thus you are training on the validation data?
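One possible tweak, sketched from the SimpleLossCompute snippets quoted in other issues here (not the notebook's exact code, and it assumes opt is a NoamOpt-style wrapper with an .optimizer attribute):

    class SimpleLossComputeSketch:
        "Sketch: only backprop and step when an optimizer is provided."

        def __init__(self, generator, criterion, opt=None):
            self.generator = generator
            self.criterion = criterion
            self.opt = opt

        def __call__(self, x, y, norm):
            x = self.generator(x)
            loss = self.criterion(x.contiguous().view(-1, x.size(-1)),
                                  y.contiguous().view(-1)) / norm
            if self.opt is not None:      # training: backprop and update
                loss.backward()
                self.opt.step()
                self.opt.optimizer.zero_grad()
            return loss.item() * norm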

BLEU score computation

How to compute the BLEU score of the models implemented in this repo for the provided examples?

"exp" not implemented for 'torch.LongTensor' pytorch 1.0

To run on PyTorch 1.0, both position and div_term need to be initialized as float instead of int in the PositionalEncoding class, i.e.

position = torch.arange(0.0, max_len).unsqueeze(1)
div_term = torch.exp(torch.arange(0.0, d_model, 2) *
                  -(math.log(10000.0) / d_model))

in lieu of
torch.arange(0, ...

Thank you for a great breakdown of Vaswani's paper.

Problem about the vocabulary of iwslt.pt

There's a problem with the vocabulary of https://s3.amazonaws.com/opennmt-models/iwslt.pt : after loading the model with model = torch.load("iwslt.pt"), it can be seen that the size of the English vocabulary is 36321

(iwslt.pt : size 36321; built from datasets.IWSLT: 36327)

(0): Embeddings(
      (lut): Embedding(36321, 512)
 )

However, after building the TGT.vocab by

    MAX_LEN = 100
    train, val, test = datasets.IWSLT.splits(
        exts=('.de', '.en'), fields=(SRC, TGT), 
        filter_pred=lambda x: len(vars(x)['src']) <= MAX_LEN and 
            len(vars(x)['trg']) <= MAX_LEN)
    MIN_FREQ = 2
    SRC.build_vocab(train.src, min_freq=MIN_FREQ)
    TGT.build_vocab(train.trg, min_freq=MIN_FREQ)

, it's found that the size of the English vocabulary is

print("vocab_size = ", len(TGT.vocab) )  # 36327
print("vocab_size = ", len(TGT.vocab.itos) ) # 36327
print("vocab_size = ", len(TGT.vocab.stoi) ) # 36327

The code for building the vocab is almost identical. Perhaps datasets.IWSLT has changed slightly, so the vocab differs slightly.

Though the model's translation results on valid_iter seem quite correct, the model loaded from 'iwslt.pt' still cannot fully work, since the vocabulary currently built from datasets.IWSLT does not match the vocabulary size.

I am building a model based on iwslt.pt, so I need the English vocabulary. How or where can I obtain the correct English vocabulary for 'https://s3.amazonaws.com/opennmt-models/iwslt.pt' (size 36321)?

question in subsequent_mask

Why use
subsequent_mask = np.triu(np.ones(attn_shape), k=1).astype('uint8')
return torch.from_numpy(subsequent_mask) == 0

I think
return torch.from_numpy(np.tril(np.ones(attn_shape), k=0).astype('uint8'))
might be a clearer way to write it for a new learner?
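A quick check (a small sketch) comparing the two forms on a 3x3 mask:

    import numpy as np
    import torch

    attn_shape = (1, 3, 3)

    # Form quoted above from the notebook (boolean mask, True where attention is allowed):
    a = torch.from_numpy(np.triu(np.ones(attn_shape), k=1).astype('uint8')) == 0

    # Suggested alternative (0/1 mask with the same lower-triangular pattern):
    b = torch.from_numpy(np.tril(np.ones(attn_shape), k=0).astype('uint8'))

    print(a)
    print(b)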

Is original input x copied self.h times ? (number of heads)

In

class EncoderLayer(nn.Module):
    def forward(self, x, mask):
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))

x is repeated 3 times for getting k,q,v

In

class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        self.d_k = d_model // h
     ...   
    def forward(self, query, key, value, mask=None):
        # 1) Do all the linear projections in batch from d_model => h x d_k 
        query, key, value = \
            [l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
             for l, x in zip(self.linears, (query, key, value))]

there is l(x).view(nbatches, -1, self.h, self.d_k),
which reshapes the data into self.h heads.
This seems to indicate that the input should already have a dimension for the number of heads.

Question/Confusion: I did not find where the original input x is copied self.h times.
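A shape sketch of the view/transpose in question (assumed sizes: d_model = 512, h = 8, so d_k = 64): the d_model dimension is split into h heads of size d_k rather than the input being copied per head.

    import torch
    import torch.nn as nn

    nbatches, seq_len, d_model, h = 2, 10, 512, 8
    d_k = d_model // h                                 # 64

    x = torch.randn(nbatches, seq_len, d_model)
    l = nn.Linear(d_model, d_model)

    projected = l(x)                                   # (2, 10, 512)
    heads = projected.view(nbatches, -1, h, d_k)       # (2, 10, 8, 64)
    heads = heads.transpose(1, 2)                      # (2, 8, 10, 64)
    print(heads.shape)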

Batch size and iterator

Hello,

Thanks for the awesome work. I am trying to adapt this to my own dataset but am having trouble understanding the batch size used and the iterator, as I am new to torchtext. The batch size is set to 12000 in the example, but when I print the tensor sizes within a batch, the shapes don't really reflect that. Am I misunderstanding something?

    BATCH_SIZE = 12000
    train_iter = MyIterator(train, batch_size=BATCH_SIZE, device=0,
                            repeat=False, sort_key=lambda x: (len(x.src), len(x.trg)),
                            batch_size_fn=batch_size_fn, train=True)
    i = 0
    for b in train_iter:
        i += 1
    print("steps: " + str(i))
    for b in train_iter:
        print(b.src.size())
        print(b.trg.size())
        return

"""
output
steps: 402
torch.Size([31, 203])
torch.Size([59, 203])
"""

Sorry for my ignorance, but how should I interpret this? Does the number of steps equal the total data size divided by the batch size?

Thank you very much for your help
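If it helps, here is a hypothetical token-counting batch_size_fn (a sketch, not necessarily identical to the notebook's) under which BATCH_SIZE becomes a budget of padded tokens per batch, which is consistent with shapes like (59, 203) since 59 * 203 is roughly 12000:

    # Hypothetical sketch of a token-budget batch_size_fn for torchtext's
    # dynamic batching; `new` is the next example, `count` the number of
    # examples in the batch so far.
    max_src_in_batch, max_tgt_in_batch = 0, 0

    def batch_size_fn(new, count, sofar):
        global max_src_in_batch, max_tgt_in_batch
        if count == 1:
            max_src_in_batch, max_tgt_in_batch = 0, 0
        max_src_in_batch = max(max_src_in_batch, len(new.src))
        max_tgt_in_batch = max(max_tgt_in_batch, len(new.trg) + 2)
        # Effective "batch size" = padded token count on the larger side.
        return max(count * max_src_in_batch, count * max_tgt_in_batch)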


run problem

When I run this code, some problems occurred:
(1) in the LabelSmoothing class,
this line => true_dist.index_fill_(0, mask.squeeze(), 0.0)
RuntimeError: invalid argument 3: Index is supposed to be a vector at /Users/soumith/code/builder/wheel/pytorch-src/aten/src/TH/generic/THTensorMath.cpp:569
(2) in the SimpleLossCompute class,
this line => loss = self.criterion(x.contiguous().view(-1, x.size(-1)), y.contiguous().view(-1)) / norm
RuntimeError: Expected object of type torch.FloatTensor but found type torch.LongTensor for argument #2 'other'
My torch version is 0.4.1.
Does anyone have these problems? Thanks!

Positional Encoding Clarification

@srush Thank you so much for this post. However, it would be great if you could help me with the following clarifications regarding positional encoding.

The whole intent of using positional encoding is to bring in a sense of position (absolute or relative) and time. Using a sine wave (for even positions) and a cosine wave (for odd positions), how do we embed this?
Also,

  1. For each position we get "dmodel" (say 512) sinusoidal values. These values have different frequencies, so for each position we have 512 sinusoidal representations of different frequencies. What does each representation signify; in other words, what does each dimension, with its different frequency, tell us about the particular position?
  2. You mentioned: since for any fixed offset k, PEpos+k can be represented as a linear function of PEpos.
     Are we saying that since we can transform one encoding into another (a linear transformation), we are keeping track of the relative position of any position with respect to any other position?
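For reference, the sinusoidal encoding from the paper assigns sine to even dimension indices and cosine to odd ones (pos is the token position, i indexes the embedding dimension):

    PE_{(pos,\,2i)} = \sin\bigl(pos / 10000^{2i/d_{\mathrm{model}}}\bigr), \qquad PE_{(pos,\,2i+1)} = \cos\bigl(pos / 10000^{2i/d_{\mathrm{model}}}\bigr)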

Issue with training in CPU

Training on CPU with:
torch==1.2.0+cpu torchvision==0.4.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
python 3.5

I have changed the current code, which was throwing warnings and errors with torch 1.x, as follows:

torch.arange(0, max_len) to torch.arange(0.0, max_len) 
torch.exp(torch.arange(0, d_model, 2) to torch.exp(torch.arange(0.0, d_model, 2)

nn.KLDivLoss(size_average=False) to nn.KLDivLoss(reduction='sum')

nn.init.xavier_uniform(p) to nn.init.xavier_uniform_(p)

Variable(torch.LongTensor([1]))).data[0] to Variable(torch.LongTensor([1])))

SimpleLossCompute
loss.data[0] to loss

Also I deleted the In[37] and In[43] as I am training with CPU only.

Now, when I train it, it's showing the following error:

Epoch Step: 1 Loss: 3.141923 Tokens per Sec: 540.000000
Floating point exception (core dumped)
