LM prior checkpoint naming (seq3, 5 comments, closed)

cloudygoose commented on August 12, 2024
LM prior checkpoint naming

from seq3.

Comments (5)

cbaziotis commented on August 12, 2024

Yes, just rename it.


cbaziotis commented on August 12, 2024

The results are indeed a bit worse now, but that could be caused by many things. They are comparable, though they should be better.

Also, double-check the hyperparameters in the config files and make sure that you are using the ones in the paper. After the submission I did some housekeeping in the codebase before uploading it, and I may have copied the configs with the wrong hyperparameters. I do still have the checkpoints from the reported results, for reproducibility.

Good luck with your experiments!


cloudygoose commented on August 12, 2024

@cbaziotis
Hi, I ran it with seq3.full.yaml and got this result:
+----+---------+---------+---------+
|    | rouge-2 | rouge-1 | rouge-l |
|----+---------+---------+---------|
| f  |  0.0943 |  0.2970 |  0.3264 |
| p  |  0.0780 |  0.2493 |  0.2798 |
| r  |  0.1302 |  0.3975 |  0.4136 |
+----+---------+---------+---------+
The second line (p) looks similar to the result reported in the paper:
[screenshot of the paper's ROUGE scores]
Does that sound right?
Thanks!


cbaziotis commented on August 12, 2024

No, you want the first line (F1). Your results are far better than the ones we report in the paper. Did you do anything differently, for example in terms of the data that you used?
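(Side note for other readers: F1 is the harmonic mean of precision and recall computed per example, so an averaged F column generally does not equal f1 of the averaged P and R. A minimal sketch using the rouge-1 numbers from the table above illustrates the gap:)

```python
def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

# Averaged P and R from the rouge-1 column of the table above
p_avg, r_avg = 0.2493, 0.3975
print(round(f1(p_avg, r_avg), 4))  # 0.3064, not the 0.2970 shown as the averaged F
```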

Also, judging from the layout of the console output, you didn't use the same evaluation script as I did. I recall that there were differences between the scripts, so there is a danger that your numbers won't be comparable to other work. I recommend that you use the scripts in the evaluation directory.

This is the complete output of our model evaluation:

---------------------------------------------
1 ROUGE-1 Average_R: 0.34890 (95%-conf.int. 0.33765 - 0.36067)
1 ROUGE-1 Average_P: 0.21015 (95%-conf.int. 0.20305 - 0.21688)
1 ROUGE-1 Average_F: 0.25392 (95%-conf.int. 0.24581 - 0.26225)
---------------------------------------------
1 ROUGE-2 Average_R: 0.11558 (95%-conf.int. 0.10758 - 0.12343)
1 ROUGE-2 Average_P: 0.06746 (95%-conf.int. 0.06294 - 0.07216)
1 ROUGE-2 Average_F: 0.08214 (95%-conf.int. 0.07655 - 0.08752)
---------------------------------------------
1 ROUGE-L Average_R: 0.31184 (95%-conf.int. 0.30061 - 0.32222)
1 ROUGE-L Average_P: 0.18772 (95%-conf.int. 0.18105 - 0.19420)
1 ROUGE-L Average_F: 0.22679 (95%-conf.int. 0.21935 - 0.23423)
---------------------------------------------
1 ROUGE-W-1.2 Average_R: 0.19307 (95%-conf.int. 0.18577 - 0.19989)
1 ROUGE-W-1.2 Average_P: 0.17531 (95%-conf.int. 0.16906 - 0.18119)
1 ROUGE-W-1.2 Average_F: 0.17580 (95%-conf.int. 0.16990 - 0.18158)

To be sure that your evaluation is correct, try to evaluate the lead-8 baseline and compare with it.
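(For context, a lead-8 baseline simply copies the first 8 tokens of each source sentence as the hypothesis summary. A minimal sketch; the function name and example sentence are illustrative, not from the repo:)

```python
def lead_n_summary(source: str, n: int = 8) -> str:
    """Take the first n whitespace-separated tokens of the source as the summary."""
    return " ".join(source.split()[:n])

# Illustrative Gigaword-style source sentence (numbers masked with '#')
src = "japan 's nikkei stock average rose #.## percent on monday as exporters gained"
print(lead_n_summary(src, 8))  # japan 's nikkei stock average rose #.## percent
```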


cloudygoose commented on August 12, 2024

@cbaziotis
Oh, I was just printing the eval result on the dev file.
I have now switched to evaluation/gigaword and got:

Preparing documents... 0 line(s) ignored
Running ROUGE...
---------------------------------------------
1 ROUGE-1 Average_R: 0.31312 (95%-conf.int. 0.30219 - 0.32441)
1 ROUGE-1 Average_P: 0.20629 (95%-conf.int. 0.19913 - 0.21319)
1 ROUGE-1 Average_F: 0.24028 (95%-conf.int. 0.23236 - 0.24793)
---------------------------------------------
1 ROUGE-2 Average_R: 0.09535 (95%-conf.int. 0.08827 - 0.10227)
1 ROUGE-2 Average_P: 0.05992 (95%-conf.int. 0.05547 - 0.06421)
1 ROUGE-2 Average_F: 0.07076 (95%-conf.int. 0.06544 - 0.07568)
---------------------------------------------
1 ROUGE-L Average_R: 0.28157 (95%-conf.int. 0.27106 - 0.29250)
1 ROUGE-L Average_P: 0.18560 (95%-conf.int. 0.17886 - 0.19201)
1 ROUGE-L Average_F: 0.21606 (95%-conf.int. 0.20837 - 0.22356)
---------------------------------------------
1 ROUGE-W-1.2 Average_R: 0.17347 (95%-conf.int. 0.16665 - 0.18048)
1 ROUGE-W-1.2 Average_P: 0.17250 (95%-conf.int. 0.16610 - 0.17837)
1 ROUGE-W-1.2 Average_F: 0.16533 (95%-conf.int. 0.15939 - 0.17098)

So it's a little worse than yours (which still looks good to me; my main purpose is to use the code rather than to compare against it).

The difference could be that I'm using PyTorch 1.3.0, and I changed the code in gumbel_softmax to:
"""
#gumbels = -torch.empty_like(logits, memory_format=torch.legacy_contiguous_format).exponential_().log() # ~Gumbel(0,1)
gumbels = -torch.empty_like(logits).exponential_().log()
gumbels = (logits + gumbels) / tau # ~Gumbel(logits,tau)
y_soft = gumbels.softmax(dim = -1)
"""
I did this because torch.legacy_contiguous_format and _gumbel_softmax_sample could not be found in that version.
I hope these changes make sense.
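(For readers on a similarly old PyTorch, the modified sampling step can be packaged as a standalone function. This is a sketch of the Gumbel-Softmax sampling trick used above, not the repo's exact code:)

```python
import torch

def gumbel_softmax_sample(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Draw a relaxed (soft) one-hot sample from a categorical distribution.

    Adds Gumbel(0, 1) noise to the logits and applies a temperature-scaled
    softmax (Jang et al. 2017 / Maddison et al. 2017).
    """
    # -log(X) with X ~ Exponential(1) is distributed as Gumbel(0, 1)
    gumbels = -torch.empty_like(logits).exponential_().log()
    return ((logits + gumbels) / tau).softmax(dim=-1)

sample = gumbel_softmax_sample(torch.randn(2, 5), tau=0.5)
# Each row is a valid probability distribution (non-negative, sums to 1)
```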

Thanks for the reply!

