Comments (6)
Hi @Tony-Y, I'm curious why you prefer to use Adam with a warmup instead of RAdam.
I think the very basic fact both papers agree on is that it's necessary to include warmup to handle the variance of the adaptive learning rate. Their linear warmup schedule of 2/(1−β₂) steps is just a further approximation of our derived first-order approximation.
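For concreteness, here is a minimal sketch (not the authors' code) of that untuned linear warmup in PyTorch, using `LambdaLR`; `model`, `data_loader`, and `compute_loss` are hypothetical placeholders:

```python
# Sketch: Adam with the untuned linear warmup of 2 / (1 - beta2) steps.
# `model`, `data_loader`, and `compute_loss` are hypothetical placeholders.
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
beta2 = optimizer.param_groups[0]["betas"][1]
warmup_steps = 2.0 / (1.0 - beta2)  # 2000 steps for beta2 = 0.999

# LambdaLR scales the base lr by the returned factor at every step.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
)

for batch in data_loader:
    optimizer.zero_grad()
    loss = compute_loss(model, batch)
    loss.backward()
    optimizer.step()
    scheduler.step()
```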
Thanks for the question @brando90, and thanks for pointing me to the PR. Seeing that PR is a real encouragement to me.
From my perspective, I don't think being included as an official module in PyTorch matters that much. The motivation of our study is to show that the adaptive learning rate may cause some problems (the strongest evidence is the controlled experiments, i.e., Adam-2k vs. Adam without warmup). RAdam serves to further verify our intuition on this matter. Although I'm very happy to see our optimizer has helped & inspired many researchers, it is still experimental. It takes a lot of effort to take the optimizer to the next level.
We have been working on something new for the past two years; stay tuned :-)
This paper shows that "the Rectified Adam (RAdam) algorithm can be characterized as four steps of momentum SGD, followed by Adam with a fixed warmup schedule." So, when we need RAdam, we may use Adam with a warmup schedule instead.
My implementation: https://github.com/Tony-Y/pytorch_warmup
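A hedged usage sketch of that package, following its README at the time of writing; class and method names such as `UntunedLinearWarmup` and `dampening` are assumptions about the current API and may differ across versions:

```python
# Sketch assuming pytorch_warmup's README API; names may differ by version.
import torch
import pytorch_warmup as warmup

num_steps = 100_000  # hypothetical total number of training steps
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
lr_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_steps)
# Untuned linear warmup: ramps the lr over 2 / (1 - beta2) steps.
warmup_scheduler = warmup.UntunedLinearWarmup(optimizer)

for batch in data_loader:  # `model`, `data_loader`, `compute_loss`: placeholders
    optimizer.zero_grad()
    loss = compute_loss(model, batch)
    loss.backward()
    optimizer.step()
    with warmup_scheduler.dampening():  # applies the warmup factor to the lr
        lr_scheduler.step()
```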
Hi Liyuan,
Great to hear from you!
I am curious: what do you mean by "it takes a lot of effort to take the optimizer to the next level"? There aren't many hyperparameters to tune, so I am curious what that means.
Looking forward to your next optimizer!
@Tony-Y I am also curious to know why you prefer warmup over RAdam, especially since RAdam seems quite robust and removes hyperparameters (which are the ML researcher's nightmare!)
I think the only new element RAdam introduces is a nonlinear warmup. Such nonlinear warmups may sometimes outperform the untuned linear warmup; see the sketch below.
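To make the linear-vs-nonlinear distinction concrete, here is a small numeric sketch based on the rectification formula in the RAdam paper: it prints RAdam's implied warmup factor next to the untuned linear one, and also shows why the first four steps reduce to momentum SGD:

```python
# Compare RAdam's implied (nonlinear) warmup factor with the untuned
# linear warmup min(1, t * (1 - beta2) / 2), for beta2 = 0.999.
import math

beta2 = 0.999
rho_inf = 2.0 / (1.0 - beta2) - 1.0  # = 1999, the asymptotic SMA length

def radam_factor(t):
    """RAdam's rectification term at step t; returns None while rho_t <= 4,
    where RAdam falls back to momentum SGD (the first four steps)."""
    rho_t = rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)
    if rho_t <= 4.0:
        return None
    return math.sqrt((rho_t - 4.0) * (rho_t - 2.0) * rho_inf
                     / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t))

def linear_factor(t):
    """Untuned linear warmup factor."""
    return min(1.0, t * (1.0 - beta2) / 2.0)

for t in (1, 4, 5, 100, 1000, 2000, 10000):
    print(t, radam_factor(t), round(linear_factor(t), 4))
```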
@Tony-Y the original paper you cited, "On the Adequacy of Untuned Warmup for Adaptive Optimization", claims that RAdam is just equivalent to Adam + warmup. From that perspective, it makes no difference which of the two I use. Isn't it simpler to just fork the RAdam repo, git clone it, and then use RAdam? RAdam is just a standard PyTorch optimizer, so using it is trivial.
(My guess is) the other alternative is to use the Hugging Face warmup (which I've never used) https://huggingface.co/transformers/main_classes/optimizer_schedules.html?highlight=cosine#transformers.get_cosine_schedule_with_warmup and then use the linear schedule the paper you linked suggests.
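For what it's worth, a hedged sketch of that Hugging Face route; the scheduler function exists in `transformers`, while `model`, `data_loader`, and `compute_loss` are hypothetical placeholders:

```python
# Cosine schedule with linear warmup via Hugging Face transformers.
# 2000 warmup steps assumes beta2 = 0.999, i.e. 2 / (1 - beta2).
import torch
from transformers import get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=2000,       # 2 / (1 - beta2) for beta2 = 0.999
    num_training_steps=100_000,  # placeholder total step count
)

for batch in data_loader:
    optimizer.zero_grad()
    loss = compute_loss(model, batch)
    loss.backward()
    optimizer.step()
    scheduler.step()
```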
In the end, given the claim that they are "equivalent", either algorithm is fine. I will go with RAdam for now, since it's already downloaded in my code and it's just as simple to use as the other, unless of course you have code that makes yours trivial to plug in, or a convincing case beyond their being equivalent.
If you think warmup is better, perhaps a tutorial on how to use your warmup version would be great, to make it just as simple to plug in as RAdam. :)
I am looking forward to seeing how this debate on optimizers for transformers progresses.