Comments (6)
Hi @Tony-Y, I'm curious why you prefer to use Adam with a warmup instead of RAdam.
I think the very basic fact both papers agree on is that it's necessary to include warmup to handle the variance of the adaptive learning rate. Their linear warmup schedule of 2/(1−β₂) steps is just a further approximation of our derived first-order approximation.
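For concreteness, here is a minimal sketch (not the authors' code) of that untuned linear warmup in PyTorch, using `LambdaLR`; `model`, `data_loader`, and `compute_loss` are hypothetical placeholders:

```python
# Sketch: Adam with the untuned linear warmup of 2 / (1 - beta2) steps.
# `model`, `data_loader`, and `compute_loss` are hypothetical placeholders.
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
beta2 = optimizer.param_groups[0]["betas"][1]
warmup_steps = 2.0 / (1.0 - beta2)  # 2000 steps for beta2 = 0.999

# LambdaLR scales the base lr by the returned factor at every step.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
)

for batch in data_loader:
    optimizer.zero_grad()
    loss = compute_loss(model, batch)
    loss.backward()
    optimizer.step()
    scheduler.step()
```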
Thanks for the question @brando90, and thanks for pointing me to the PR. Seeing that PR is a real encouragement to me.
From my perspective, I don't think being included as an official module in PyTorch matters that much. The motivation of our study is to show that the adaptive learning rate may cause some problems (the strongest evidence is the controlled experiments, i.e., Adam-2k vs. Adam without warmup). RAdam serves to further verify our intuition on this matter. Although I'm very happy to see our optimizer has helped & inspired many researchers, it is still experimental. It takes a lot of effort to take the optimizer to the next level.
We have been working on something new for the past two years; stay tuned :-)
This paper shows that "the Rectified Adam (RAdam) algorithm can be characterized as four steps of momentum SGD, followed by Adam with a fixed warmup schedule." So, when we need RAdam, we may use Adam with a warmup schedule instead.
My implementation: https://github.com/Tony-Y/pytorch_warmup
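A hedged usage sketch of that package, following its README at the time of writing; class and method names such as `UntunedLinearWarmup` and `dampening` are assumptions about the current API and may differ across versions:

```python
# Sketch assuming pytorch_warmup's README API; names may differ by version.
import torch
import pytorch_warmup as warmup

num_steps = 100_000  # hypothetical total number of training steps
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
lr_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_steps)
# Untuned linear warmup: ramps the lr over 2 / (1 - beta2) steps.
warmup_scheduler = warmup.UntunedLinearWarmup(optimizer)

for batch in data_loader:  # `model`, `data_loader`, `compute_loss`: placeholders
    optimizer.zero_grad()
    loss = compute_loss(model, batch)
    loss.backward()
    optimizer.step()
    with warmup_scheduler.dampening():  # applies the warmup factor to the lr
        lr_scheduler.step()
```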
Hi Liyuan,
Great to hear from you!
I am curious: what do you mean by "it takes a lot of effort to take the optimizer to the next level"? There aren't many hyperparameters to tune, so I am curious what that means.
Looking forward to your next optimizer!
@Tony-Y I am also curious to know why you prefer warmup over RAdam, especially since RAdam seems quite robust and removes hyperparameters (which are the ML researcher's nightmare!)
I think the only new element RAdam introduces is a nonlinear warmup. Such nonlinear warmups may sometimes outperform the untuned linear warmup; see the sketch below.
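To make the linear-vs-nonlinear distinction concrete, here is a small numeric sketch based on the rectification formula in the RAdam paper: it prints RAdam's implied warmup factor next to the untuned linear one, and also shows why the first four steps reduce to momentum SGD:

```python
# Compare RAdam's implied (nonlinear) warmup factor with the untuned
# linear warmup min(1, t * (1 - beta2) / 2), for beta2 = 0.999.
import math

beta2 = 0.999
rho_inf = 2.0 / (1.0 - beta2) - 1.0  # = 1999, the asymptotic SMA length

def radam_factor(t):
    """RAdam's rectification term at step t; returns None while rho_t <= 4,
    where RAdam falls back to momentum SGD (the first four steps)."""
    rho_t = rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)
    if rho_t <= 4.0:
        return None
    return math.sqrt((rho_t - 4.0) * (rho_t - 2.0) * rho_inf
                     / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t))

def linear_factor(t):
    """Untuned linear warmup factor."""
    return min(1.0, t * (1.0 - beta2) / 2.0)

for t in (1, 4, 5, 100, 1000, 2000, 10000):
    print(t, radam_factor(t), round(linear_factor(t), 4))
```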
@Tony-Y the original paper you cited, "On the Adequacy of Untuned Warmup for Adaptive Optimization", claims that RAdam is just equivalent to Adam + warmup. From that perspective, it makes no difference which of the two I use. Isn't it simpler to just fork the RAdam repo, git clone it, and then use RAdam? RAdam is just a standard PyTorch optimizer, so using it is trivial.
(My guess is) the other alternative is to use the Hugging Face warmup (which I've never used) https://huggingface.co/transformers/main_classes/optimizer_schedules.html?highlight=cosine#transformers.get_cosine_schedule_with_warmup and then use the linear schedule the paper you linked suggests.
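For what it's worth, a hedged sketch of that Hugging Face route; the scheduler function exists in `transformers`, while `model`, `data_loader`, and `compute_loss` are hypothetical placeholders:

```python
# Cosine schedule with linear warmup via Hugging Face transformers.
# 2000 warmup steps assumes beta2 = 0.999, i.e. 2 / (1 - beta2).
import torch
from transformers import get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=2000,       # 2 / (1 - beta2) for beta2 = 0.999
    num_training_steps=100_000,  # placeholder total step count
)

for batch in data_loader:
    optimizer.zero_grad()
    loss = compute_loss(model, batch)
    loss.backward()
    optimizer.step()
    scheduler.step()
```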
In the end, given the claim that they are "equivalent", either algorithm is fine. I will go with RAdam for now, since it's already downloaded in my code and it's just as simple to use as the other, unless of course you have code that makes yours trivial to plug in, or a convincing case beyond their being equivalent.
If you think warmup is better, perhaps a tutorial on how to use your warmup version would be great, to make it just as simple to plug in as RAdam. :)
I am looking forward to seeing how this debate on optimizers for transformers progresses.