
MXSeq2Seq's Issues

Cannot reproduce the results in PyTorch.

Hi, Sheng

I am having trouble reproducing the seq2seq model from http://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html in Gluon.

The loss values are higher in Gluon than in PyTorch, and I cannot figure out why. I think the network and the hyperparameters are the same.

Could you please help review this? Thanks a lot. @szha

git clone https://github.com/ZiyueHuang/MXSeq2Seq.git
cd MXSeq2Seq/gluon
python seq2seq.py --cuda

Here is the output:

Reading lines...
Read 135842 sentence pairs
Trimmed to 10853 sentence pairs
Counting words...
Counted words:
(u'fra', 4489)
(u'eng', 2925)
[u'elles n ont pas toujours raison .', u'they re not always right .']
3.28828048161
2.78100977883
2.58317873447
2.40703460461
2.28146159903
2.15195642098
2.0522787211
1.96067533115
1.82878418601
1.74697916563
1.67514307116
1.57050654945
1.50527894126
1.41056300702
1.36106930289

Here are the loss values in PyTorch:

3m 23s (- 47m 22s) (5000 6%) 2.8848
6m 44s (- 43m 48s) (10000 13%) 2.3516
10m 12s (- 40m 51s) (15000 20%) 2.0009
13m 38s (- 37m 31s) (20000 26%) 1.7755
16m 49s (- 33m 38s) (25000 33%) 1.5787
20m 5s (- 30m 7s) (30000 40%) 1.4096
23m 17s (- 26m 37s) (35000 46%) 1.3090
26m 33s (- 23m 14s) (40000 53%) 1.0980
29m 45s (- 19m 50s) (45000 60%) 1.0109
32m 57s (- 16m 28s) (50000 66%) 0.9418
36m 12s (- 13m 10s) (55000 73%) 0.8696
39m 27s (- 9m 51s) (60000 80%) 0.8121
42m 41s (- 6m 34s) (65000 86%) 0.7046
45m 56s (- 3m 16s) (70000 93%) 0.6555
49m 7s (- 0m 0s) (75000 100%) 0.6015

Attention weights

In seq2seq.py, the attention weights are computed like this:

attn_weights = F.softmax(
    self.attn(F.concat(embedded, hidden[0].flatten(), dim=1)))

Here embedded is the decoder's input, and hidden is the encoder's hidden state, since in the training loop you initialize decoder_hidden = encoder_hidden. The problem is that, in the sources I found online, the attention weights are computed from the decoder's hidden state and the encoder's outputs.
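
For reference, below is a minimal sketch of what those sources describe: content-based (dot-product) attention, scoring the encoder's outputs against the decoder's current hidden state. The function name and shapes are illustrative assumptions, not code from the repo.

    import mxnet as mx

    # Illustrative sketch (not from the repo): dot-product attention over the
    # encoder outputs, scored against the decoder's current hidden state.
    # Assumed shapes: decoder_hidden  (batch, hidden_size)
    #                 encoder_outputs (batch, src_len, hidden_size)
    def dot_attention(decoder_hidden, encoder_outputs):
        # Alignment score for each source position: (batch, src_len, 1)
        scores = mx.nd.batch_dot(encoder_outputs,
                                 decoder_hidden.expand_dims(axis=2))
        # Normalize the scores over the source positions.
        attn_weights = mx.nd.softmax(scores, axis=1)
        # Context vector: weighted sum of encoder outputs -> (batch, hidden_size)
        context = mx.nd.batch_dot(encoder_outputs, attn_weights,
                                  transpose_a=True).flatten()
        return attn_weights, context

With this formulation the weights depend on how well each encoder output matches the current decoder state, rather than only on the decoder's input embedding and the initial hidden state.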

