
RAdam for pytorch official (OPEN)

liyuanlucasliu commented on August 15, 2024
RAdam for pytorch official

from radam.

Comments (6)

LiyuanLucasLiu commented on August 15, 2024

Hi @Tony-Y, I'm curious why you prefer to use Adam with a warmup instead of RAdam.

I think the basic fact both papers agree on is that warmup is necessary to handle the variance of the adaptive learning rate. Their linear warmup schedule of 2/(1−β2) steps is just a further approximation of our derived first-order approximation.
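As a rough illustration of the linear schedule mentioned above (a sketch, not code from either paper): the untuned linear warmup scales the learning rate by min(1, t·(1−β2)/2), so it reaches full strength after 2/(1−β2) steps. The function names below are hypothetical.

```python
def warmup_period(beta2):
    """Number of steps in the untuned linear warmup, 2 / (1 - beta2)."""
    return 2.0 / (1.0 - beta2)


def linear_warmup_factor(step, beta2):
    """Learning-rate multiplier at a given step (step counted from 1)."""
    return min(1.0, step / warmup_period(beta2))


# With the common default beta2 = 0.999 the warmup lasts about 2000 steps.
print(warmup_period(0.999))
print(linear_warmup_factor(1000, 0.999))
```

With β2 = 0.999 this gives a warmup of roughly 2000 steps, which lines up with the "Adam-2k" setting discussed in this thread.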

Thanks for the question, @brando90, and thanks for pointing me to the PR. Seeing that PR is a real encouragement to me.

From my perspective, I don't think being included as an official module in PyTorch matters that much. The motivation of our study is to show that the adaptive learning rate may cause some problems (the strongest evidence is the controlled experiments, i.e., Adam-2k vs. Adam without warmup). RAdam serves to further verify our intuition on this matter. Although I'm very happy to see that our optimizer has helped and inspired many researchers, it is still experimental. It takes a lot of efforts to take the optimizer really to the next level.

We have been working on something new for the past two years. Stay tuned :-)


Tony-Y commented on August 15, 2024

This paper shows that "the Rectified Adam (RAdam) algorithm can be characterized as four steps of momentum SGD, followed by Adam with a fixed warmup schedule." So we may use Adam with a warmup schedule instead when we need RAdam.

My implementation: https://github.com/Tony-Y/pytorch_warmup


brando90 commented on August 15, 2024

Hi Liyuan,

Great to hear from you!

I am curious, what do you mean by "It takes a lot of efforts to take the optimizer really to the next level."? There aren't many hyperparameters to tune so I am curious what that means.

Looking forward to your next optimizer!


brando90 commented on August 15, 2024

@Tony-Y I am also curious to know why you prefer warmup over RAdam, especially since RAdam seems quite robust and removes hyperparameters (which are the ML researcher's nightmare!).


Tony-Y commented on August 15, 2024

I think the only new approach introduced by RAdam is a nonlinear warmup. Such nonlinear warmups may sometimes outperform the untuned linear warmup.
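To make the "nonlinear warmup" reading concrete, here is a minimal sketch of the rectification term r_t from the RAdam paper, computed as a function of the step alone. It assumes the paper's cutoff of 4 on the approximated SMA length; actual implementations may use a slightly different threshold. The function name is hypothetical.

```python
import math


def radam_rectification(step, beta2):
    """Rectification term r_t of RAdam for step >= 1.

    Returns None while the variance is treated as intractable, i.e. when
    RAdam falls back to an update without the adaptive learning rate.
    """
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** step
    rho_t = rho_inf - 2.0 * step * beta2_t / (1.0 - beta2_t)
    if rho_t <= 4.0:  # cutoff from the paper; an assumption for this sketch
        return None
    numerator = (rho_t - 4.0) * (rho_t - 2.0) * rho_inf
    denominator = (rho_inf - 4.0) * (rho_inf - 2.0) * rho_t
    return math.sqrt(numerator / denominator)


# Print the multiplier at a few steps for beta2 = 0.999.
for step in (1, 4, 5, 100, 1000, 10000):
    print(step, radam_rectification(step, 0.999))
```

For β2 = 0.999 the term is undefined for the first four steps and then rises nonlinearly toward 1, which matches the "four steps of momentum SGD, followed by Adam with a fixed warmup schedule" description quoted earlier in the thread.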


brando90 commented on August 15, 2024

@Tony-Y the paper you cited, "On the Adequacy of Untuned Warmup for Adaptive Optimization", claims that RAdam is equivalent to Adam plus warmup. From that perspective, it makes no difference which of the two I use. Isn't it simpler to just fork the RAdam repo, git clone it, and then use RAdam? RAdam is just a standard PyTorch optimizer, so using it is trivial.

(My guess is that) the other alternative is to use the Hugging Face warmup (which I've never used), https://huggingface.co/transformers/main_classes/optimizer_schedules.html?highlight=cosine#transformers.get_cosine_schedule_with_warmup, and then use the linear schedule suggested by the paper you linked.

In the end, given the claim that they are "equivalent", either algorithm is fine. I will go with RAdam for now since it's already downloaded in my code and it's just as simple to use as the other, unless of course you have code that makes it trivial to plug in or a convincing case beyond their being equivalent.

If you think warmup is better, perhaps a tutorial on how to use your warmup version would be great, to make it just as simple to plug in as RAdam. :)

I am looking forward to seeing how this debate on optimizers for transformers progresses.
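For readers who want to try the warmup route, the following is a minimal sketch of Adam with an untuned linear warmup wired up through PyTorch's built-in `LambdaLR` scheduler. It is not the pytorch_warmup package linked above and not the RAdam repo's code; the toy model, data, and learning rate are assumptions chosen only to make the example runnable.

```python
import torch

beta2 = 0.999
base_lr = 1e-3

# Toy model and data, used only so the loop below runs end to end.
model = torch.nn.Linear(4, 1)
inputs = torch.randn(8, 4)
targets = torch.randn(8, 1)

optimizer = torch.optim.Adam(model.parameters(), lr=base_lr, betas=(0.9, beta2))

# LambdaLR multiplies the base learning rate by the returned factor.
# The scheduler counts from 0, so step + 1 is the 1-based update index.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: min(1.0, (step + 1) * (1.0 - beta2) / 2.0),
)

for _ in range(10):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    optimizer.step()
    scheduler.step()  # advance the warmup after each optimizer update

print(optimizer.param_groups[0]["lr"])
```

In recent PyTorch releases, trying RAdam instead should only require replacing the optimizer line with `torch.optim.RAdam(model.parameters(), lr=base_lr, betas=(0.9, beta2))` and dropping the scheduler, so either option is a small change to a training loop.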
