transformer's Introduction

WARNING

This code was written in 2019, when I was not very familiar with the Transformer model, so don't trust it too much. I am not actively maintaining this code at the moment; if you find bugs and want to fix them, please open a pull request.

Transformer

My own implementation of the Transformer model (Attention Is All You Need - Google Brain, 2017)

(model architecture diagram)

1. Implementations

1.1 Positional Encoding


import torch
from torch import nn


class PositionalEncoding(nn.Module):
    """
    compute sinusoid encoding.
    """
    def __init__(self, d_model, max_len, device):
        """
        constructor of sinusoid encoding class

        :param d_model: dimension of model
        :param max_len: max sequence length
        :param device: hardware device setting
        """
        super(PositionalEncoding, self).__init__()

        # same size as the input matrix (so it can be added to the input)
        self.encoding = torch.zeros(max_len, d_model, device=device)
        self.encoding.requires_grad = False  # we don't need to compute gradient

        pos = torch.arange(0, max_len, device=device)
        pos = pos.float().unsqueeze(dim=1)
        # 1D => 2D unsqueeze to represent word's position

        _2i = torch.arange(0, d_model, step=2, device=device).float()
        # 'i' indexes the embedding dimension (e.g. d_model = 50 -> _2i = [0, 2, ..., 48])
        # "step=2" generates only the even indices, i.e. the 2i in the sinusoid formula

        self.encoding[:, 0::2] = torch.sin(pos / (10000 ** (_2i / d_model)))
        self.encoding[:, 1::2] = torch.cos(pos / (10000 ** (_2i / d_model)))
        # compute positional encoding to consider positional information of words

    def forward(self, x):
        # self.encoding
        # [max_len = 512, d_model = 512]

        batch_size, seq_len = x.size()
        # [batch_size = 128, seq_len = 30]

        return self.encoding[:seq_len, :]
        # [seq_len = 30, d_model = 512]
        # it will add with tok_emb : [128, 30, 512]         
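For reference, here is a minimal usage sketch (not taken from the repository) showing how the returned encoding broadcasts over the batch dimension when added to the token embeddings; the concrete shapes just follow the comments above.

# Hypothetical usage sketch; shapes are illustrative, not from the repository.
import torch

pe = PositionalEncoding(d_model=512, max_len=512, device="cpu")

tok_ids = torch.randint(0, 1000, (128, 30))   # token indices [batch_size, seq_len]
tok_emb = torch.randn(128, 30, 512)           # token embeddings [batch_size, seq_len, d_model]

pos_enc = pe(tok_ids)                         # [30, 512]
x = tok_emb + pos_enc                         # broadcasts over batch -> [128, 30, 512]
print(x.shape)                                # torch.Size([128, 30, 512])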



1.2 Multi-Head Attention


class MultiHeadAttention(nn.Module):

    def __init__(self, d_model, n_head):
        super(MultiHeadAttention, self).__init__()
        self.n_head = n_head
        self.attention = ScaleDotProductAttention()
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_concat = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        # 1. dot product with weight matrices
        q, k, v = self.w_q(q), self.w_k(k), self.w_v(v)

        # 2. split tensor by number of heads
        q, k, v = self.split(q), self.split(k), self.split(v)

        # 3. do scale dot product to compute similarity
        out, attention = self.attention(q, k, v, mask=mask)
        
        # 4. concat and pass to linear layer
        out = self.concat(out)
        out = self.w_concat(out)

        # 5. visualize attention map
        # TODO : we should implement visualization

        return out

    def split(self, tensor):
        """
        split tensor by number of heads

        :param tensor: [batch_size, length, d_model]
        :return: [batch_size, head, length, d_tensor]
        """
        batch_size, length, d_model = tensor.size()

        d_tensor = d_model // self.n_head
        tensor = tensor.view(batch_size, length, self.n_head, d_tensor).transpose(1, 2)
        # this is similar to grouped convolution (splitting by the number of heads)

        return tensor

    def concat(self, tensor):
        """
        inverse function of self.split(tensor : torch.Tensor)

        :param tensor: [batch_size, head, length, d_tensor]
        :return: [batch_size, length, d_model]
        """
        batch_size, head, length, d_tensor = tensor.size()
        d_model = head * d_tensor

        tensor = tensor.transpose(1, 2).contiguous().view(batch_size, length, d_model)
        return tensor
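A quick shape check, as an illustrative sketch (assuming ScaleDotProductAttention from the next section is defined):

# Illustrative shape check, not part of the repository.
import torch

mha = MultiHeadAttention(d_model=512, n_head=8)
x = torch.randn(128, 30, 512)        # [batch_size, length, d_model]

out = mha(q=x, k=x, v=x, mask=None)  # self-attention over the same sequence
print(out.shape)                     # torch.Size([128, 30, 512])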



1.3 Scale Dot Product Attention


import math


class ScaleDotProductAttention(nn.Module):
    """
    compute scale dot product attention

    Query : given sentence that we are focusing on (decoder)
    Key : every sentence to check relationship with Query (encoder)
    Value : every sentence, same as Key (encoder)
    """

    def __init__(self):
        super(ScaleDotProductAttention, self).__init__()
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, q, k, v, mask=None, e=1e-12):
        # input is a 4-dimensional tensor
        # [batch_size, head, length, d_tensor]
        batch_size, head, length, d_tensor = k.size()

        # 1. dot product Query with Key^T to compute similarity
        k_t = k.transpose(2, 3)  # transpose
        score = (q @ k_t) / math.sqrt(d_tensor)  # scaled dot product

        # 2. apply masking (opt)
        if mask is not None:
            score = score.masked_fill(mask == 0, -10000)

        # 3. pass them through softmax to map scores into the [0, 1] range
        score = self.softmax(score)

        # 4. multiply with Value
        v = score @ v

        return v, score
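As an illustration (not from the repository), the module can be exercised with a causal look-ahead mask; the mask layout assumed here is [1, 1, length, length] with 1 = attend and 0 = mask.

# Illustrative causal-mask example; the mask layout is an assumption.
import torch

attn = ScaleDotProductAttention()
q = k = v = torch.randn(2, 8, 5, 64)                      # [batch_size, head, length, d_tensor]

causal = torch.tril(torch.ones(5, 5)).view(1, 1, 5, 5)    # lower-triangular look-ahead mask
out, score = attn(q, k, v, mask=causal)

print(out.shape)                                          # torch.Size([2, 8, 5, 64])
print(score[0, 0].sum(dim=-1))                            # each attention row sums to ~1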



1.4 Layer Norm


class LayerNorm(nn.Module):
    def __init__(self, d_model, eps=1e-12):
        super(LayerNorm, self).__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))
        self.beta = nn.Parameter(torch.zeros(d_model))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        var = x.var(-1, unbiased=False, keepdim=True)
        # '-1' means last dimension. 

        out = (x - mean) / torch.sqrt(var + self.eps)
        out = self.gamma * out + self.beta
        return out
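A quick numerical sanity check (illustrative, not from the repository): this hand-written LayerNorm should agree with torch.nn.LayerNorm, which also uses the biased variance and adds eps inside the square root.

# Illustrative comparison against the built-in torch.nn.LayerNorm.
import torch

x = torch.randn(128, 30, 512)
custom = LayerNorm(d_model=512, eps=1e-12)
builtin = torch.nn.LayerNorm(512, eps=1e-12)

print(torch.allclose(custom(x), builtin(x), atol=1e-5))  # expected: True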



1.5 Positionwise Feed Forward


class PositionwiseFeedForward(nn.Module):

    def __init__(self, d_model, hidden, drop_prob=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.linear1 = nn.Linear(d_model, hidden)
        self.linear2 = nn.Linear(hidden, d_model)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p=drop_prob)

    def forward(self, x):
        x = self.linear1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.linear2(x)
        return x



1.6 Encoder & Decoder Structure


class EncoderLayer(nn.Module):

    def __init__(self, d_model, ffn_hidden, n_head, drop_prob):
        super(EncoderLayer, self).__init__()
        self.attention = MultiHeadAttention(d_model=d_model, n_head=n_head)
        self.norm1 = LayerNorm(d_model=d_model)
        self.dropout1 = nn.Dropout(p=drop_prob)

        self.ffn = PositionwiseFeedForward(d_model=d_model, hidden=ffn_hidden, drop_prob=drop_prob)
        self.norm2 = LayerNorm(d_model=d_model)
        self.dropout2 = nn.Dropout(p=drop_prob)

    def forward(self, x, src_mask):
        # 1. compute self attention
        _x = x
        x = self.attention(q=x, k=x, v=x, mask=src_mask)
        
        # 2. add and norm
        x = self.dropout1(x)
        x = self.norm1(x + _x)
        
        # 3. positionwise feed forward network
        _x = x
        x = self.ffn(x)
      
        # 4. add and norm
        x = self.dropout2(x)
        x = self.norm2(x + _x)
        return x

class Encoder(nn.Module):

    def __init__(self, enc_voc_size, max_len, d_model, ffn_hidden, n_head, n_layers, drop_prob, device):
        super().__init__()
        self.emb = TransformerEmbedding(d_model=d_model,
                                        max_len=max_len,
                                        vocab_size=enc_voc_size,
                                        drop_prob=drop_prob,
                                        device=device)

        self.layers = nn.ModuleList([EncoderLayer(d_model=d_model,
                                                  ffn_hidden=ffn_hidden,
                                                  n_head=n_head,
                                                  drop_prob=drop_prob)
                                     for _ in range(n_layers)])

    def forward(self, x, src_mask):
        x = self.emb(x)

        for layer in self.layers:
            x = layer(x, src_mask)

        return x

class DecoderLayer(nn.Module):

    def __init__(self, d_model, ffn_hidden, n_head, drop_prob):
        super(DecoderLayer, self).__init__()
        self.self_attention = MultiHeadAttention(d_model=d_model, n_head=n_head)
        self.norm1 = LayerNorm(d_model=d_model)
        self.dropout1 = nn.Dropout(p=drop_prob)

        self.enc_dec_attention = MultiHeadAttention(d_model=d_model, n_head=n_head)
        self.norm2 = LayerNorm(d_model=d_model)
        self.dropout2 = nn.Dropout(p=drop_prob)

        self.ffn = PositionwiseFeedForward(d_model=d_model, hidden=ffn_hidden, drop_prob=drop_prob)
        self.norm3 = LayerNorm(d_model=d_model)
        self.dropout3 = nn.Dropout(p=drop_prob)

    def forward(self, dec, enc, trg_mask, src_mask):    
        # 1. compute self attention
        _x = dec
        x = self.self_attention(q=dec, k=dec, v=dec, mask=trg_mask)
        
        # 2. add and norm
        x = self.dropout1(x)
        x = self.norm1(x + _x)

        if enc is not None:
            # 3. compute encoder - decoder attention
            _x = x
            x = self.enc_dec_attention(q=x, k=enc, v=enc, mask=src_mask)
            
            # 4. add and norm
            x = self.dropout2(x)
            x = self.norm2(x + _x)

        # 5. positionwise feed forward network
        _x = x
        x = self.ffn(x)
        
        # 6. add and norm
        x = self.dropout3(x)
        x = self.norm3(x + _x)
        return x

class Decoder(nn.Module):
    def __init__(self, dec_voc_size, max_len, d_model, ffn_hidden, n_head, n_layers, drop_prob, device):
        super().__init__()
        self.emb = TransformerEmbedding(d_model=d_model,
                                        drop_prob=drop_prob,
                                        max_len=max_len,
                                        vocab_size=dec_voc_size,
                                        device=device)

        self.layers = nn.ModuleList([DecoderLayer(d_model=d_model,
                                                  ffn_hidden=ffn_hidden,
                                                  n_head=n_head,
                                                  drop_prob=drop_prob)
                                     for _ in range(n_layers)])

        self.linear = nn.Linear(d_model, dec_voc_size)

    def forward(self, trg, src, trg_mask, src_mask):
        trg = self.emb(trg)

        for layer in self.layers:
            trg = layer(trg, src, trg_mask, src_mask)

        # pass to LM head
        output = self.linear(trg)
        return output
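The Encoder and Decoder above expect src_mask and trg_mask as inputs; these are built in the top-level Transformer class, which is not shown here. As a hedged sketch (the repository's actual implementation may differ in detail), pad and look-ahead masks for this layout could be constructed like this:

# Hedged sketch of mask construction (1 = attend, 0 = mask); not copied from the repository.
import torch

def make_src_mask(src, src_pad_idx):
    # [batch_size, 1, 1, src_len]: hide padding positions in the source
    return (src != src_pad_idx).unsqueeze(1).unsqueeze(2)

def make_trg_mask(trg, trg_pad_idx):
    # pad mask [batch_size, 1, trg_len, 1] combined with a lower-triangular
    # look-ahead mask [trg_len, trg_len] to block attention to future tokens
    trg_len = trg.size(1)
    pad_mask = (trg != trg_pad_idx).unsqueeze(1).unsqueeze(3)
    look_ahead = torch.tril(torch.ones(trg_len, trg_len, dtype=torch.bool, device=trg.device))
    return pad_mask & look_ahead

Note that the two functions take separate pad indices; see also the pad-mask issue further below.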



2. Experiments

I used the Multi30k dataset to train and evaluate the model.
You can check the details of the dataset here.
I followed the original paper's parameter settings (below).


2.1 Model Specification

  • total parameters = 55,207,087
  • model size = 215.7MB
  • lr scheduling : ReduceLROnPlateau

2.1.1 configuration

  • batch_size = 128
  • max_len = 256
  • d_model = 512
  • n_layers = 6
  • n_heads = 8
  • ffn_hidden = 2048
  • drop_prob = 0.1
  • init_lr = 0.1
  • factor = 0.9
  • patience = 10
  • warmup = 100
  • adam_eps = 5e-9
  • epoch = 1000
  • clip = 1
  • weight_decay = 5e-4
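As a hedged sketch (not the repository's train.py), the optimizer and lr-scheduler settings listed above roughly correspond to:

# Illustrative optimizer / scheduler setup implied by the settings above; details
# (e.g. the warmup handling) may differ from the repository's train.py.
import torch
from torch import nn, optim

model = nn.Linear(512, 512)  # stand-in for the Transformer model

optimizer = optim.Adam(model.parameters(), lr=0.1, eps=5e-9, weight_decay=5e-4)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.9, patience=10)

val_loss = 3.2  # placeholder; in training this comes from validation
scheduler.step(val_loss)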

2.2 Training Result

(training and validation loss curves)

  • Minimum Training Loss = 2.852672759656864
  • Minimum Validation Loss = 3.2048025131225586

| Model             | Dataset        | BLEU Score |
|-------------------|----------------|------------|
| Original Paper's  | WMT14 EN-DE    | 25.8       |
| My Implementation | Multi30K EN-DE | 26.4       |



3. Reference

  • Attention Is All You Need (Vaswani et al., 2017): https://arxiv.org/abs/1706.03762


4. Licence

Copyright 2019 Hyunwoong Ko.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

transformer's People

Contributors

ayaka14732, egliette, gj98, hyunwoongko


transformer's Issues

how to get dataset

I'm new to Transformers and don't know how to get the dataset used in this project.
Could you please provide a Linux script for downloading it?

Masked attention

Hi,

In ScaleDotProductAttention.forward, shouldn't the masked inputs be set to -inf rather than 0? So for instance

score = score.masked_fill(mask == 0, -np.inf)

rather than

score = score.masked_fill(mask == 0, -e)

This is mentioned in Section 3.2.3 of Vaswani et al.
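For what it's worth, with ordinary float32 score magnitudes the two choices behave almost identically; a small illustrative comparison (not from the repository):

# Illustrative comparison of -inf vs. a large negative constant for masking.
import torch

score = torch.tensor([[2.0, 1.0, 3.0]])
mask = torch.tensor([[1, 1, 0]])  # 1 = keep, 0 = mask out the last position

with_inf = torch.softmax(score.masked_fill(mask == 0, float("-inf")), dim=-1)
with_big = torch.softmax(score.masked_fill(mask == 0, -10000.0), dim=-1)

print(with_inf)  # tensor([[0.7311, 0.2689, 0.0000]])
print(with_big)  # practically identical; the masked weight underflows to 0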

LayerNorm implementation

In the original code, layer norm is computed as:
out = (x - mean) / (std + self.eps)
but according to the formula given in the paper, it should be:
out = (x - mean) / math.sqrt(std + self.eps)

Reporting a bug during test time

@hyunwoongko thanks for your nice implementation. By the way, I want to point out an issue.
If you notice, while testing you are using the following code:

def test_model(num_examples):
    iterator = test_iter
    # model.load_state_dict(torch.load("./saved/model-saved.pt"))

    with torch.no_grad():
        batch_bleu = []
        for i, batch in enumerate(iterator):
            src = batch.src
            trg = batch.trg
            output = model(src, trg[:, :-1])
            ...

I think this part has a significant problem that cannot be ignored. While testing, we don't have any target; the target sentence should be generated word by word from the decoder's output, so you need a decoding loop for that (a sketch of such a loop follows below). You can take a look here.
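A hedged sketch of such a loop (greedy decoding); the names model, trg_sos_idx, trg_eos_idx and max_len are assumptions, not identifiers taken from the repository:

# Hypothetical greedy decoding loop for test time; names are assumptions.
import torch

@torch.no_grad()
def greedy_decode(model, src, trg_sos_idx, trg_eos_idx, max_len=50):
    batch_size = src.size(0)
    trg = torch.full((batch_size, 1), trg_sos_idx, dtype=torch.long, device=src.device)
    for _ in range(max_len - 1):
        output = model(src, trg)                   # [batch_size, cur_len, vocab_size]
        next_token = output[:, -1].argmax(dim=-1)  # most likely next token per sequence
        trg = torch.cat([trg, next_token.unsqueeze(1)], dim=1)
        if (next_token == trg_eos_idx).all():
            break
    return trg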

About MultiHeadAttention's split method

Hey, I find the use of ".contiguous()" in this class confusing.

I'm a beginner in this field, and I just learned that the purpose of this method is to store tensors contiguously in memory and improve efficiency.

But in this code:

class MultiHeadAttention(nn.Module):

def __init__(self, d_model, n_head):
    super(MultiHeadAttention, self).__init__()
    self.n_head = n_head
    self.attention = ScaleDotProductAttention()
    self.w_q = nn.Linear(d_model, d_model)
    self.w_k = nn.Linear(d_model, d_model)
    self.w_v = nn.Linear(d_model, d_model)
    self.w_concat = nn.Linear(d_model, d_model)

def forward(self, q, k, v, mask=None):
    # 1. dot product with weight matrices
    q, k, v = self.w_q(q), self.w_k(k), self.w_v(v)

    # 2. split tensor by number of heads
    q, k, v = self.split(q), self.split(k), self.split(v)

    # 3. do scale dot product to compute similarity
    out, attention = self.attention(q, k, v, mask=mask)
    
    # 4. concat and pass to linear layer
    out = self.concat(out)
    out = self.w_concat(out)

    # 5. visualize attention map
    # TODO : we should implement visualization

    return out

def split(self, tensor):
    """
    split tensor by number of head

    :param tensor: [batch_size, length, d_model]
    :return: [batch_size, head, length, d_tensor]
    """
    batch_size, length, d_model = tensor.size()

    d_tensor = d_model // self.n_head
    tensor = tensor.view(batch_size, length, self.n_head, d_tensor).transpose(1, 2)
    # it is similar with group convolution (split by number of heads)

    return tensor

def concat(self, tensor):
    """
    inverse function of self.split(tensor : torch.Tensor)

    :param tensor: [batch_size, head, length, d_tensor]
    :return: [batch_size, length, d_model]
    """
    batch_size, head, length, d_tensor = tensor.size()
    d_model = head * d_tensor

    tensor = tensor.transpose(1, 2).contiguous().view(batch_size, length, d_model)
    return tensor

I notice that we don't use .contiguous() in the split method, while we do use it in the concat method.

Is there any special reason? I'd appreciate it a lot if anyone could answer me.
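For anyone else wondering, a small illustrative experiment (not from the repository) shows the difference: .view() needs a compatible memory layout and transpose() breaks it. split() calls view() on the still-contiguous linear-layer output before transposing, while concat() calls view() after a transpose and therefore needs .contiguous():

# Illustrative demonstration of why concat() needs .contiguous() but split() does not.
import torch

x = torch.randn(2, 2, 4, 4)   # e.g. attention output: [batch, head, length, d_tensor]
y = x.transpose(1, 2)         # [batch, length, head, d_tensor], now non-contiguous
print(y.is_contiguous())      # False

try:
    y.view(2, 4, 8)           # view() on this non-contiguous layout raises
except RuntimeError as err:
    print("view failed:", err)

print(y.contiguous().view(2, 4, 8).shape)   # torch.Size([2, 4, 8])

# split()-style order: view() on a contiguous tensor first, then transpose -> no error
z = torch.randn(2, 4, 8).view(2, 4, 2, 4).transpose(1, 2)
print(z.shape)                              # torch.Size([2, 2, 4, 4])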

Question about implementation in the multi-head attention part

Hi, I want first to thank you for sharing the repo, and it is very helpful to me to understand the transformer via your code.

I just have one question about your multi-head attention part.

In the forward function, you have
out = x.view(batch_size, seq_len, self.num_heads, d).transpose(1, 2)

I understand the desired output shape should be [batch_size, num_heads, seq_len, d]. But we can do
out = x.view(batch_size, self.num_heads, seq_len, d)
without using the transpose function.

Is there any particular reason we need to reshape and then transpose it?

Thanks
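A small illustrative check (not from the repository) shows why the two are not equivalent: reshaping directly to [batch_size, num_heads, seq_len, d] regroups elements along the sequence axis instead of splitting the feature axis per head:

# Illustrative check: view-then-transpose vs. a direct reshape are NOT the same.
import torch

batch, seq, heads, d = 1, 3, 2, 2
x = torch.arange(batch * seq * heads * d).view(batch, seq, heads * d)  # [1, 3, 4]

a = x.view(batch, seq, heads, d).transpose(1, 2)  # correct per-head feature split
b = x.view(batch, heads, seq, d)                  # naive direct reshape

print(a[0, 0])            # head 0 gets features [0:2] of every position
print(b[0, 0])            # naive reshape mixes positions into "heads"
print(torch.equal(a, b))  # False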

Potential bug in the pad mask

In the Transformer class, trg_pad_idx is not used, but rather src_pad_idx is used to construct both the source and target pad masks. If trg_pad_idx != src_pad_idx this would cause unintended behaviour.

The experimental results have a large gap with the one in README

I cannot reproduce the result in the README. Has anyone else managed to get the corresponding result (i.e., BLEU ≈ 26)?
I wonder if the author's package versions are different from mine, which could explain the large gap between my experimental results and the ones in the README.

About initial learning rate

I find that the initial learning rate in the README is 0.1, but in conf.py it is 1e-5. Which one is correct?

Weight matrix sharing confusion

In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation.

Hello! I read the paper recently and found that this is mentioned in Section 3.4 (Embeddings and Softmax), but your code seems to treat the output embedding layer and the pre-softmax linear layer as separate modules, so I want to ask: what was your consideration for this part?
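If one wanted to add the sharing described in Section 3.4, a hedged sketch could look like the following; the attribute path decoder.emb.tok_emb is an assumption about TransformerEmbedding's internals, not something verified against the repository:

# Hypothetical weight tying between the decoder token embedding and the LM head.
def tie_decoder_weights(decoder):
    # nn.Embedding.weight and nn.Linear.weight are both [vocab_size, d_model],
    # so they can point at the same Parameter.
    decoder.linear.weight = decoder.emb.tok_emb.weight  # attribute names are assumed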

batch.trg[j] out of index.

In train.py, the size of batch.trg is [118, 35] (the last batch is smaller than batch_size), so the for loop below will definitely go out of bounds.

total_bleu = []
for j in range(batch_size):
    try:
        trg_words = idx_to_word(batch.trg[j], loader.target.vocab)
        output_words = output[j].max(dim=1)[1]
        output_words = idx_to_word(output_words, loader.target.vocab)
        bleu = get_bleu(hypothesis=output_words.split(), reference=trg_words.split())
        total_bleu.append(bleu)
    except:
        pass

so, is it better to use for j in range(batch.trg.shape[0]) here?

how to resolve the issue "No module named 'torch._C'"

》pip show torch
Name: torch
Version: 1.13.0
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: [email protected]
License: BSD-3
Location: c:\python310\lib\site-packages
Requires: typing-extensions
Required-by: torchaudio, torchdata, torchtext, torchvision

》pip show torchtext
Name: torchtext
Version: 0.14.0
Summary: Text utilities and datasets for PyTorch
Home-page: https://github.com/pytorch/text
Author: PyTorch core devs and James Bradbury
Author-email: [email protected]
License: BSD
Location: c:\python310\lib\site-packages
Requires: numpy, requests, torch, tqdm
Required-by:

》python train.py
Traceback (most recent call last):
  File "D:\IDEA_workshop\Transformer_study\transformer-PyTorch Implementation of [Attention Is All You Need]\train.py", line 9, in <module>
    from torch import nn, optim
  File "C:\Python310\lib\site-packages\torch\nn\__init__.py", line 1, in <module>
    from .modules import *  # noqa: F403
  File "C:\Python310\lib\site-packages\torch\nn\modules\__init__.py", line 1, in <module>
    from .module import Module
  File "C:\Python310\lib\site-packages\torch\nn\modules\module.py", line 8, in <module>
    from ..parameter import Parameter
  File "C:\Python310\lib\site-packages\torch\nn\parameter.py", line 2, in <module>
    from torch._C import _disabled_torch_function_impl
ModuleNotFoundError: No module named 'torch._C'

About multi-head attention in attention is all you need, thanks.

Hello, author. I sincerely hope you can answer this when you see it.
I really want to understand why Q, K and V are taken as inputs to multi-head attention and then fed into the three linear layers of each head. Do the three linear layers represent the w_q, w_k and w_v of each head? If so, the embedding matrix would first have to be converted to Q, K and V, and then converted to Q_i, K_i and V_i by the w_q, w_k and w_v of a given head, i.e. the embedding matrix would go through two transformations.
However, in the several implementations I have seen, including yours, the embedding matrix is fed directly into the three linear layers of each head.
How is this achieved? Thanks for your help.
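For the record, a single Linear(d_model, d_model) followed by the split() shown earlier is equivalent to giving each head its own Linear(d_model, d_model // n_head) whose weights are a slice of the big matrix, so the embedding only goes through one projection per head. A small illustrative check (not from the repository):

# Illustrative check: one big projection + split == per-head projections with sliced weights.
import torch
from torch import nn

d_model, n_head = 8, 2
d_head = d_model // n_head

big = nn.Linear(d_model, d_model, bias=False)
x = torch.randn(3, 5, d_model)                         # [batch, length, d_model]

q = big(x).view(3, 5, n_head, d_head).transpose(1, 2)  # [batch, head, length, d_head]
q_head0 = x @ big.weight[:d_head].T                    # head 0's implicit W_q slice

print(torch.allclose(q[:, 0], q_head0, atol=1e-5))     # True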
