graykode / nlp-tutorial
Natural Language Processing Tutorial for Deep Learning Researchers
License: MIT License
How to use Seq2Seq(Attention) with multiple batches?
Since the recent OpenAI GPT-2 model may be better than BERT and could be an important breakthrough in NLP, why not add an intuitive, minimal implementation of it to this repo?
There is a similar task called text classification.
But I want to find a kind of model whose input is a keyword set, where the keywords do not come from a single sentence.
For example:
input ["apple", "pear", "water melon"] --> target class "fruit"
input ["tomato", "potato"] --> target class "vegetable"
Another example:
input ["apple", "Peking", "in summer"] --> target class "Chinese fruit"
input ["tomato", "New York", "in winter"] --> target class "American vegetable"
input ["apple", "Peking", "in winter"] --> target class "Chinese fruit"
input ["tomato", "Peking", "in winter"] --> target class "Chinese vegetable"
Thank you.
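Not the author, but here is a minimal sketch of one possible approach (my own illustration; KeywordSetClassifier and all other names are hypothetical): treat the keyword set as a bag of embeddings, pool them order-invariantly, and classify the pooled vector.

import torch
import torch.nn as nn

class KeywordSetClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, n_classes):
        super(KeywordSetClassifier, self).__init__()
        self.embed = nn.EmbeddingBag(vocab_size, embed_dim, mode='mean')  # order-invariant pooling over the set
        self.fc = nn.Linear(embed_dim, n_classes)

    def forward(self, keyword_ids, offsets):
        # keyword_ids : flat LongTensor of all keyword indices; offsets : start index of each set
        return self.fc(self.embed(keyword_ids, offsets))

# e.g. ["apple", "pear"] -> ids [0, 1] and ["tomato", "potato"] -> ids [2, 3]
model = KeywordSetClassifier(vocab_size=10, embed_dim=16, n_classes=2)
logits = model(torch.LongTensor([0, 1, 2, 3]), torch.LongTensor([0, 2]))  # logits : [2, n_classes]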
In line 16 you use
input = input + [0] * (max_len - len(input))
For the padding you use 0, but 0 is also the index of the first word, 'Lorem', so it is not the right choice.
I think you can change it like this:
# word_dict = {w: i for i, w in enumerate(list(set(sentence.split())))}
# number_dict = {i: w for i, w in enumerate(list(set(sentence.split())))}
word_dict = {w: i for i, w in enumerate(['PAD']+list(set(sentence.split())))}
number_dict = {i: w for i, w in enumerate(['PAD']+list(set(sentence.split())))}
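With this change, index 0 denotes 'PAD', so padding with 0 on line 16 keeps its intended meaning; equivalently, it can be written explicitly:

input = input + [word_dict['PAD']] * (max_len - len(input))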
nlp-tutorial/4-1.Seq2Seq/Seq2Seq-Torch.py
Line 92 in 6e171b9
Hi, I have a question about the way attention_weights is calculated.
https://github.com/graykode/nlp-tutorial/blob/master/4-2.Seq2Seq(Attention)/Seq2Seq(Attention)-Torch.py
In line 60, attn_weights is calculated from dec_output and enc_outputs in your code; why not from dec_hidden and enc_hidden?
self.pos_emb = nn.Embedding.from_pretrained(get_sinusoid_encoding_table(src_len+1, d_model),freeze=True)
The position encoding table should be (max_len, d_model), why add 1?
Excuse me, https://github.com/graykode/nlp-tutorial/blob/master/1-1.NNLM/NNLM-Torch.py#L50 The comment here may be wrong. It should be X = X.view(-1, n_step * m) # [batch_size, n_step * m]
Sorry for disturbing you.
# Padding Should be Zero
src_vocab = {'P' : 0, 'ich' : 1, 'mochte' : 2, 'ein' : 3, 'bier' : 4}
src_vocab_size = len(src_vocab)
tgt_vocab = {'P' : 0, 'i' : 1, 'want' : 2, 'a' : 3, 'beer' : 4, 'S' : 5, 'E' : 6}
number_dict = {i: w for i, w in enumerate(tgt_vocab)}
tgt_vocab_size = len(tgt_vocab)
I have changed my code to be clearer.
There were some mistakes in the Transformer's position encoding: because of torch.LongTensor([[1,2,3,4,5]]), the indexing into the embedding was mixed up. So I fixed the shape of get_sinusoid_encoding_table.
In the Encoder, self.pos_emb(torch.LongTensor([[5,1,2,3,4]])) is right for ich mochte ein bier P,
and in the Decoder, self.pos_emb(torch.LongTensor([[5,1,2,3,4]])) is right for S i want a beer.
In the original paper, maxlen is 512 and n_layers (the number of layers) is 12, but that is too heavy to run for this tutorial, so I fixed the values as below.
# BERT Parameters
maxlen = 30
batch_size = 6
max_pred = 5 # max tokens of prediction
n_layers = 6
n_heads = 12
d_model = 768
d_ff = 768*4 # 4*d_model, FeedForward dimension
d_k = d_v = 64 # dimension of K(=Q), V
n_segments = 2
Also, as in other BERT implementation repositories, when preprocessing for masking, [CLS], [SEP], and [PAD] should not be replaced with [MASK]:
cand_maked_pos = [i for i, token in enumerate(input_ids)]  # this is wrong
The code at https://github.com/dhlee347/pytorchic-bert/blob/master/pretrain.py#L132 is right, so I fixed it accordingly.
Then I added a SEGMENT MASK for masking positions where the token is zero padding.
This is a very important problem.
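A sketch of the fix, following the pytorchic-bert code linked above (assuming word_dict maps the special tokens as in this tutorial; pad_mask is a name I am introducing):

# only real word positions are candidates for masking
cand_maked_pos = [i for i, token in enumerate(input_ids)
                  if token != word_dict['[CLS]'] and token != word_dict['[SEP]']]
# mask out zero-padding positions: 1 for real tokens, 0 for padding
pad_mask = [0 if token == 0 else 1 for token in input_ids]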
Line 202:
optimizer = optim.Adam(model.parameters(), lr=0.001)
In practice, I think Adam performs quite badly here: at epoch 10 the cost is 1.6, and at epoch 100 or even 1000 the cost is still 1.6.
So I think we can change Adam to SGD, that is, optimizer = optim.SGD(model.parameters(), lr=0.001).
Here are the effects of using SGD:
Epoch: 0100 cost = 0.047965
Epoch: 0200 cost = 0.020129
Epoch: 0300 cost = 0.012563
Epoch: 0400 cost = 0.009101
Epoch: 0500 cost = 0.007131
Epoch: 0600 cost = 0.005862
Epoch: 0700 cost = 0.004978
Epoch: 0800 cost = 0.004325
Epoch: 0900 cost = 0.003823
Epoch: 1000 cost = 0.003426
line 70: index = randint(0, vocab_size - 1) # random index in vocabulary.
I think the replacement index must not include '[CLS]', '[SEP]', or '[MASK]'!
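One hedged way to do it (a sketch; number_dict, word_dict, and vocab_size are assumed from the tutorial, and the guard also skips indices missing from number_dict):

while True:
    index = randint(0, vocab_size - 1)  # random index in vocabulary
    token = number_dict.get(index)
    if token is not None and token not in ('[CLS]', '[SEP]', '[MASK]', '[PAD]'):
        break
input_ids[pos] = word_dict[token]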
The position encoding table should be (src_len, d_model). Why is it (src_vocab_size, d_model) here?
Hello, I want to run the Transformer(Greedy_decoder)-Torch.py code on the GPU, using model = model.to(device) and moving input_data to the device as well, but the error still appears: "Expected object of backend CUDA but backend CPU for argument #2 'mat2'".
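A likely cause (an assumption, not confirmed here): tensors created inside forward(), such as the positional-index tensor, stay on the CPU even after model.to(device), so they need to be moved as well:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
pos = torch.LongTensor([[1, 2, 3, 4, 5]]).to(device)  # build positional indices on the same device as the model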
In the function translate (line 90), there is no predefined object 'args'.
Also, make_batch does not expect those arguments, but [[word, 'P' * len(word)]], args is passed to it,
so I think the code should be modified
from
def translate(word, args):
    input_batch, output_batch, _ = make_batch([[word, 'P' * len(word)]], args)
    # make hidden shape [num_layers * num_directions, batch_size, n_hidden]
    hidden = torch.zeros(1, 1, args.n_hidden)
    output = model(input_batch, hidden, output_batch)
    # output : [max_len+1(=6), batch_size(=1), n_class]
    predict = output.data.max(2, keepdim=True)[1]  # select n_class dimension
    decoded = [char_arr[i] for i in predict]
    end = decoded.index('E')
    translated = ''.join(decoded[:end])
    return translated.replace('P', '')
to
# Test
def translate(word):
    input_batch, output_batch = make_testbatch(word)
    # make hidden shape [num_layers * num_directions, batch_size, n_hidden]
    hidden = torch.zeros(1, 1, n_hidden)
    output = model(input_batch, hidden, output_batch)
    # output : [max_len+1(=6), batch_size(=1), n_class]
    predict = output.data.max(2, keepdim=True)[1]  # select n_class dimension
    decoded = [char_arr[i] for i in predict]
    end = decoded.index('E')
    translated = ''.join(decoded[:end])
    return translated.replace('P', '')
and make_testbatch should be declared beforehand:
# make test batch
def make_testbatch(input_word):
    input_batch, output_batch = [], []
    input_w = input_word + 'P' * (n_step - len(input_word))
    input = [num_dic[n] for n in input_w]
    # make a sequence with just the start token (S) and pad tokens (P)
    output = [num_dic[n] for n in 'S' + 'P' * n_step]
    input_batch = np.eye(n_class)[input]
    output_batch = np.eye(n_class)[output]
    return torch.FloatTensor(input_batch).unsqueeze(0), torch.FloatTensor(output_batch).unsqueeze(0)
Thank you
Is this repository not supported anymore?
input = [word_dict[n] for n in word[:-1]] # create (1~n-1) as input
target = [word_dict[word[-1]]]
This ties the length of the input to n_step. I think the following example is even better:
for i in range(len(words) - window_size + 1):
    x_train.append(words[i: i + window_size - 1])
    y_train.append(words[i + window_size - 1])
I think it may be: enc_outputs = self.src_emb(enc_inputs) + self.pos_emb(torch.LongTensor([[0,1,2,3,4]]))
I think it may be: dec_outputs = self.tgt_emb(dec_inputs) + self.pos_emb(torch.LongTensor([[0,1,2,3,4]]))
def forward(self, X):
    embedded_chars = self.W[X]  # [batch_size, sequence_length, sequence_length]

I think the shape is [batch_size, sequence_length, embedding_size].
In 'Seq2Seq-Torch.py', I saw you use np.eye, a one-hot representation, for the embedding, so I changed it to the usual way, using nn.Embedding(dict_length, embedding_dim). It works, but the loss I got is very high.
I want to ask what the difference between these two approaches is. Here are my code and the result.
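For context, a small illustration of the difference (my own sketch, not from the repo): np.eye produces fixed one-hot vectors whose dimension is tied to the vocabulary size, while nn.Embedding learns dense vectors of a freely chosen dimension.

import torch
import torch.nn as nn

ids = torch.LongTensor([2, 0, 1])
one_hot = torch.eye(4)[ids]  # fixed one-hot rows: [3, 4], dimension tied to n_class
emb = nn.Embedding(4, 8)     # learned dense vectors with a free dimension
dense = emb(ids)             # [3, 8], weights are updated during training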
Calculate attention_score:

outputs = tf.concat([output[0], output[1]], 2)  # output[0] : lstm_fw, output[1] : lstm_bw
outputs = tf.transpose(outputs, [1, 0, 2])  # [n_step, batch_size, n_hidden]
final_hidden_state = outputs[-1]
output_all = tf.concat([output[0], output[1]], 2)
final_hidden_state = tf.expand_dims(final_hidden_state, 2)
attn_weights = tf.squeeze(tf.matmul(output_all, final_hidden_state), 2)
In Autocomplete we already have
X = tf.placeholder(tf.float32, [None, n_step, n_class])  # [batch_size, n_step, n_class]
Y = tf.placeholder(tf.float32, [None, n_class])
to guess the next missing character.
I don't quite understand why 'batch_inputs' and 'batch_labels' should be updated in each loop iteration in Word2Vec-Skipgram-Tensor(Softmax).py.
Also, what does 'trained_embeddings = W.eval()' mean?
Could you explain it for me? I am a bit confused.
# code
for epoch in range(5000):
    batch_inputs, batch_labels = random_batch(skip_grams, batch_size)
    _, loss = sess.run([optimizer, cost], feed_dict={inputs: batch_inputs, labels: batch_labels})
    if (epoch + 1) % 1000 == 0:
        print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.6f}'.format(loss))
trained_embeddings = W.eval()
context = tf.matmul(attn_weights, enc_outputs)
dec_output = tf.squeeze(dec_output, 0) # [1, n_step]
context = tf.squeeze(context, 1) # [1, n_hidden]
I think the dec_output shape is [1, n_hidden].
I tried with new data that includes 8 classes and got:
ValueError: expected sequence of length 2681 at dim 1 (got 2249)
Thanks for sharing! I just found out that Attention.get_att_weight calculates attention in a for-loop. This looks rather slow, doesn't it?
4-2.Seq2Seq(Attention)/Seq2Seq(Attention).ipynb
def get_att_weight(self, dec_output, enc_outputs):  # get attention weight one 'dec_output' with 'enc_outputs'
    n_step = len(enc_outputs)
    attn_scores = torch.zeros(n_step)  # attn_scores : [n_step]
    for i in range(n_step):
        attn_scores[i] = self.get_att_score(dec_output, enc_outputs[i])
    # Normalize scores to weights in range 0 to 1
    return F.softmax(attn_scores).view(1, 1, -1)

def get_att_score(self, dec_output, enc_output):  # enc_outputs [batch_size, num_directions(=1) * n_hidden]
    score = self.attn(enc_output)  # score : [batch_size, n_hidden]
    return torch.dot(dec_output.view(-1), score.view(-1))  # inner product makes a scalar value
Suggested parallel version:
def get_att_weight(self, dec_output, enc_outputs):  # get attention weight one 'dec_output' with 'enc_outputs'
    n_step = len(enc_outputs)
    attn_scores = torch.zeros(n_step, device=self.device)  # attn_scores : [n_step]
    enc_t = self.attn(enc_outputs)
    score = dec_output.transpose(1, 0).bmm(enc_t.transpose(1, 0).transpose(2, 1))
    out1 = score.softmax(-1)
    return out1
def attention_net(self, lstm_output, final_state):
    batch_size = len(lstm_output)
    hidden_forward = final_state[0]
    hidden_backward = final_state[1]
    hidden_f_b = torch.cat((hidden_forward, hidden_backward), 1)
    hidden = hidden_f_b.view(batch_size, -1, 1)
    # hidden = final_state.view(batch_size, -1, 1)
    # The line above from the source code is wrong: a bi-LSTM's hidden state is
    # [2, batch, embed_size], so we need to concatenate the forward and backward hidden
    # states. With final_state.view(batch_size, -1, 1), the hidden state is not the
    # concatenation of final_state[0][0] and final_state[1][0].
I fixed it in the following:
https://github.com/zhangbo2008/nlp-tutorial/blob/master/2cpu_Input_myData.py
https://github.com/zhangbo2008/nlp-tutorial/blob/master/2gpu_Input_myData.py
If you find any mistakes, please leave a comment.
class MultiHeadAttention(nn.Module):
    def __init__(self):
        super(MultiHeadAttention, self).__init__()
        self.W_Q = nn.Linear(d_model, d_k * n_heads)
        self.W_K = nn.Linear(d_model, d_k * n_heads)
        self.W_V = nn.Linear(d_model, d_v * n_heads)

    def forward(self, Q, K, V, attn_mask):
        # q: [batch_size x len_q x d_model], k: [batch_size x len_k x d_model], v: [batch_size x len_k x d_model]
        residual, batch_size = Q, Q.size(0)
        # (B, S, D) -proj-> (B, S, D) -split-> (B, S, H, W) -trans-> (B, H, S, W)
        q_s = self.W_Q(Q).view(batch_size, -1, n_heads, d_k).transpose(1, 2)  # q_s: [batch_size x n_heads x len_q x d_k]
        k_s = self.W_K(K).view(batch_size, -1, n_heads, d_k).transpose(1, 2)  # k_s: [batch_size x n_heads x len_k x d_k]
        v_s = self.W_V(V).view(batch_size, -1, n_heads, d_v).transpose(1, 2)  # v_s: [batch_size x n_heads x len_k x d_v]
        attn_mask = attn_mask.unsqueeze(1).repeat(1, n_heads, 1, 1)  # attn_mask : [batch_size x n_heads x len_q x len_k]
        # context: [batch_size x n_heads x len_q x d_v], attn: [batch_size x n_heads x len_q(=len_k) x len_k(=len_q)]
        context, attn = ScaledDotProductAttention()(q_s, k_s, v_s, attn_mask)
        context = context.transpose(1, 2).contiguous().view(batch_size, -1, n_heads * d_v)  # context: [batch_size x len_q x n_heads * d_v]
        output = nn.Linear(n_heads * d_v, d_model)(context)
        return nn.LayerNorm(d_model)(output + residual), attn  # output: [batch_size x len_q x d_model]
The second-to-last line instantiates a new layer every time forward is called; is that right? Shouldn't the layers be instantiated in the __init__ function?
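A minimal sketch of that fix (out_proj and layer_norm are names I am introducing; ScaledDotProductAttention and the hyperparameters are assumed from the tutorial): create the output projection and LayerNorm once in __init__ so their weights are registered and actually trained.

class MultiHeadAttention(nn.Module):
    def __init__(self):
        super(MultiHeadAttention, self).__init__()
        self.W_Q = nn.Linear(d_model, d_k * n_heads)
        self.W_K = nn.Linear(d_model, d_k * n_heads)
        self.W_V = nn.Linear(d_model, d_v * n_heads)
        self.out_proj = nn.Linear(n_heads * d_v, d_model)  # moved out of forward()
        self.layer_norm = nn.LayerNorm(d_model)  # moved out of forward()

    def forward(self, Q, K, V, attn_mask):
        residual, batch_size = Q, Q.size(0)
        q_s = self.W_Q(Q).view(batch_size, -1, n_heads, d_k).transpose(1, 2)
        k_s = self.W_K(K).view(batch_size, -1, n_heads, d_k).transpose(1, 2)
        v_s = self.W_V(V).view(batch_size, -1, n_heads, d_v).transpose(1, 2)
        attn_mask = attn_mask.unsqueeze(1).repeat(1, n_heads, 1, 1)
        context, attn = ScaledDotProductAttention()(q_s, k_s, v_s, attn_mask)
        context = context.transpose(1, 2).contiguous().view(batch_size, -1, n_heads * d_v)
        output = self.out_proj(context)  # now a registered, trainable layer
        return self.layer_norm(output + residual), attn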
Lines 69-70:
index = randint(0, vocab_size - 1) # random index in vocabulary
input_ids[pos] = word_dict[number_dict[index]]
The length of number_dict is 25, but vocab_size is 29, so number_dict[index] might be out of range.
Maybe we should change line 69 to index = randint(0, len(word_list) - 1)?
Hi, this repo is awesome, but there might be something wrong in the code above. According to the comment, this snippet intends to change a tensor from shape [num_layers(=1) * num_directions(=2), batch_size, n_hidden] to shape [batch_size, n_hidden * num_directions(=2), 1(=n_layer)], i.e. to concatenate the two hidden vectors from the different directions for every data example in a batch (by "data example", I mean a batch has batch_size examples). But I think the code above will mess up the data examples in a batch and lead to unexpected results.
For example, we can use IPython to check the effect of the snippet above.
# create a tensor with shape [num_layers(=1) * num_directions(=2), batch_size, n_hidden]
In [10]: a=torch.arange(2*3*5).reshape(2,3,5)
In [11]: a
Out[11]:
tensor([[[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]],
[[15, 16, 17, 18, 19],
[20, 21, 22, 23, 24],
[25, 26, 27, 28, 29]]])
In [12]: a.view(-1,10,1)
Out[12]:
tensor([[[ 0],
[ 1],
[ 2],
[ 3],
[ 4],
[ 5],
[ 6],
[ 7],
[ 8],
[ 9]],
[[10],
[11],
[12],
[13],
[14],
[15],
[16],
[17],
[18],
[19]],
[[20],
[21],
[22],
[23],
[24],
[25],
[26],
[27],
[28],
[29]]])
As you can see, we created a tensor with batch_size=3 and n_hidden=5. For example, [ 0, 1, 2, 3, 4] and [15, 16, 17, 18, 19] belong to the same data example in the batch but come from different directions, so what we want is to concatenate them in the resulting tensor. But what the code really does is concatenate [ 0, 1, 2, 3, 4] and [ 5, 6, 7, 8, 9], which are from different data examples in the batch.
I think it can be fixed by changing that line of code to hidden = torch.cat([final_state[0], final_state[1]], 1).view(-1, 10, 1).
The effect of the new code can be shown as follows:
In [13]: torch.cat([a[0],a[1]],1).view(-1,10,1)
Out[13]:
tensor([[[ 0],
[ 1],
[ 2],
[ 3],
[ 4],
[15],
[16],
[17],
[18],
[19]],
[[ 5],
[ 6],
[ 7],
[ 8],
[ 9],
[20],
[21],
[22],
[23],
[24]],
[[10],
[11],
[12],
[13],
[14],
[25],
[26],
[27],
[28],
[29]]])
I got advice on a Reddit page from mslavescu about creating Google Colab notebooks and linking the pages directly in the GitHub README.
There is a problem with the padding on lines 73-75. What if the sentence length is larger than maxlen? Then we end up with sequences of varying lengths, and line 214 throws an error.
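One hypothetical fix (a sketch; the variable names follow the tutorial's preprocessing) is to truncate before padding, so every sequence comes out exactly maxlen long:

input_ids = input_ids[:maxlen]  # truncate over-long sequences first
segment_ids = segment_ids[:maxlen]
input_ids.extend([0] * (maxlen - len(input_ids)))  # then zero-pad the remainder
segment_ids.extend([0] * (maxlen - len(segment_ids)))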
# output : [max_len+1, batch_size, num_directions(=1) * n_hidden]
output = output.transpose(0, 1) # [batch_size, max_len+1(=6), num_directions(=1) * n_hidden]
to
# output : [max_len+1, batch_size, n_class]
output = output.transpose(0, 1) # [batch_size, max_len+1(=6), n_class]
Hello, first of all, thank you for your code. I want to know how I should modify the code if batch_size is more than 1. Thank you.
def get_att_weight(self, output, enc_output):  # get attention weight one 'output' with 'enc_output'
    '''
    output: [1, batch_size, num_directions(=1) * n_hidden]
    enc_output: [n_step+1, batch_size, num_directions(=1) * n_hidden]
    '''
    length = len(enc_output)
    attn_scores = torch.zeros(length)  # attn_scores : [batch_size, n_step+1]
    for i in range(length):
        attn_scores[i] = self.get_att_score(output, enc_output[i])
    # Normalize scores to weights in range 0 to 1
    # return [batch_size, 1, n_step+1]
    return F.softmax(attn_scores).view(batch_size, 1, -1)

def get_att_score(self, output, enc_output):
    '''
    output: [batch_size, num_directions(=1) * n_hidden]
    enc_output: [batch_size, num_directions(=1) * n_hidden]
    '''
    score = self.attn(enc_output)  # score : [1, n_hidden]
    return torch.dot(output.view(-1), score.view(-1))  # inner product makes a scalar value, a real number
I cannot find the embedding in NNLM!
Hello. It's been about two years since this repository started; thank you for your interest.
Most of the code is now written in legacy style that is no longer used in PyTorch or TensorFlow, so we want to update it to a new version.
There is no plan to support TensorFlow v2, because the Python-like PyTorch is more readable for beginners.
In addition, the philosophies of PyTorch and TensorFlow are very different, and good code cannot be produced by trying to implement them similarly.
Therefore, the existing TensorFlow v1 code will be archived in a new folder.
Seq2Seq(Attention)\Seq2Seq(Attention)-Tensor.py
The shape of the input should be [max_time, batch_size, ...], and input = tf.transpose(dec_inputs, [1, 0, 2]) has already transposed it. In tf.expand_dims(inputs[i], 1), the expansion is indeed on dimension 1, but it seems the expansion should be on dimension 0 here. Although the final shape is correct, is this intentional, or just a little trick?
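A quick shape check of the two options (my own illustration; batch_size=2 and n_hidden=3 are assumed):

import tensorflow as tf

x = tf.zeros([2, 3])  # inputs[i] after the transpose: [batch_size, n_hidden]
print(tf.expand_dims(x, 1).shape)  # (2, 1, 3) -- dimension 1 expanded, as in the code
print(tf.expand_dims(x, 0).shape)  # (1, 2, 3) -- dimension 0 expanded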
Hello,
I think there is a problem with this file; it is the same file as TextCNN-Torch.py.
I guess it should be the TensorFlow version?
Thanks anyway for this repo
Hi, I'm an NLP rookie, and I want to ask you a question. Your code extracts the input (context) within a fixed window around line 43, and "word sequence" is a list of sentences, so some words may take their neighbour words from different sentences. Does this harm the result?
Also, my training result does not seem very good, and I didn't change the code.
If you see this issue, please answer me in your free time.
Although my English is poor, I still want to express my gratitude to you.
How is it possible to use the Attention Layer in (4-3) for sequence-to-sequence classification, something like Named Entity Recognition or Semantic Role Labeling?
The Colab links for NNLM and Word2Vec are wrong (404).
Hi
thanks for sharing your code.
I've read your Seq2Seq implementation, and I was wondering about the RNN Encoder-Decoder model.
In the paper 'Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation', they mention a new hidden-state activation function, and I couldn't find it in your code.
Do you have any plan to add the proposed activation process, or is it okay to just skip that part?
Thank you so much in advance.
class TextCNN(nn.Module):
    def __init__(self):
        super(TextCNN, self).__init__()
        self.num_filters_total = num_filters * len(filter_sizes)
        self.W = nn.Parameter(torch.empty(vocab_size, embedding_size).uniform_(-1, 1)).type(dtype)
        self.Weight = nn.Parameter(torch.empty(self.num_filters_total, num_classes).uniform_(-1, 1)).type(dtype)
        self.Bias = nn.Parameter(0.1 * torch.ones([num_classes])).type(dtype)

    def forward(self, X):
        embedded_chars = self.W[X]  # [batch_size, sequence_length, sequence_length]
        embedded_chars = embedded_chars.unsqueeze(1)  # add channel(=1) [batch, channel(=1), sequence_length, embedding_size]
        pooled_outputs = []
        for filter_size in filter_sizes:
            # conv : [input_channel(=1), output_channel(=3), (filter_height, filter_width), bias_option]
            conv = nn.Conv2d(1, num_filters, (filter_size, embedding_size), bias=True)(embedded_chars)
            h = F.relu(conv)
            # mp : ((filter_height, filter_width))
            mp = nn.MaxPool2d((sequence_length - filter_size + 1, 1))
            # pooled : [batch_size(=6), output_height(=1), output_width(=1), output_channel(=3)]
            pooled = mp(h).permute(0, 3, 2, 1)
            pooled_outputs.append(pooled)
        h_pool = torch.cat(pooled_outputs, len(filter_sizes))  # [batch_size(=6), output_height(=1), output_width(=1), output_channel(=3) * 3]
        h_pool_flat = torch.reshape(h_pool, [-1, self.num_filters_total])  # [batch_size(=6), output_height * output_width * (output_channel * 3)]
        model = torch.mm(h_pool_flat, self.Weight) + self.Bias  # [batch_size, num_classes]
        return model
I wonder: isn't it wrong to create the conv layer inside the loop?
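A sketch of how the convolutions could instead be registered once in __init__ via nn.ModuleList (convs is a name I am introducing; the hyperparameters are assumed from the tutorial), so their weights are actual trainable parameters of the model:

class TextCNN(nn.Module):
    def __init__(self):
        super(TextCNN, self).__init__()
        self.num_filters_total = num_filters * len(filter_sizes)
        self.W = nn.Parameter(torch.empty(vocab_size, embedding_size).uniform_(-1, 1))
        self.Weight = nn.Parameter(torch.empty(self.num_filters_total, num_classes).uniform_(-1, 1))
        self.Bias = nn.Parameter(0.1 * torch.ones([num_classes]))
        # one convolution per filter size, created once and registered with the module
        self.convs = nn.ModuleList([
            nn.Conv2d(1, num_filters, (fs, embedding_size), bias=True) for fs in filter_sizes
        ])

    def forward(self, X):
        embedded_chars = self.W[X].unsqueeze(1)  # [batch, 1, sequence_length, embedding_size]
        pooled_outputs = []
        for fs, conv in zip(filter_sizes, self.convs):
            h = F.relu(conv(embedded_chars))
            mp = nn.MaxPool2d((sequence_length - fs + 1, 1))
            pooled_outputs.append(mp(h).permute(0, 3, 2, 1))
        h_pool = torch.cat(pooled_outputs, len(filter_sizes))
        h_pool_flat = torch.reshape(h_pool, [-1, self.num_filters_total])
        return torch.mm(h_pool_flat, self.Weight) + self.Bias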
I think there is a small problem in line 74 of 'Seq2Seq-Torch.py': the dimensions of input_batch and output_batch are not [batch_size, max_len, n_hidden] but [batch_size, max_len, n_class]. Or maybe I don't fully understand your code :) Please help me, thanks!
class BiLSTM(nn.Module):
    def __init__(self):
        super(BiLSTM, self).__init__()
        self.lstm = nn.LSTM(input_size=n_class, hidden_size=n_hidden, bidirectional=True)
        self.W = nn.Parameter(torch.randn([n_hidden * 2, n_class]).type(dtype))
        self.b = nn.Parameter(torch.randn([n_class]).type(dtype))

    def forward(self, X):
        input = X.transpose(0, 1)  # input : [n_step, batch_size, n_class]
        hidden_state = Variable(torch.zeros(1*2, len(X), n_hidden))  # [num_layers(=1) * num_directions(=2), batch_size, n_hidden]
        cell_state = Variable(torch.zeros(1*2, len(X), n_hidden))  # [num_layers(=1) * num_directions(=2), batch_size, n_hidden]
        outputs, (_, _) = self.lstm(input, (hidden_state, cell_state))
        outputs = outputs[-1]  # [batch_size, n_hidden]  <-- this line
        model = torch.mm(outputs, self.W) + self.b  # model : [batch_size, n_class]
        return model
error: "outputs = outputs[-1] # [batch_size, n_hidden]"
the shape should be [batch_size,2*n_hidden]