liyuanlucasliu / radam

On the Variance of the Adaptive Learning Rate and Beyond

Home Page: https://arxiv.org/abs/1908.03265

License: Apache License 2.0

Languages: Python 99.06%, Shell 0.94%
Topics: optimizer, adam, adam-optimizer, warmup

radam's Introduction


RAdam

On the Variance of the Adaptive Learning Rate and Beyond

We are in an early-release beta. Expect some adventures and rough edges.

Table of Contents

Introduction

If warmup is the answer, what is the question?

The learning rate warmup for Adam is a must-have trick for stable training in certain situations (the alternative being careful eps tuning), but the underlying mechanism is largely unknown. In our study, we suggest that one fundamental cause is the large variance of the adaptive learning rates, and we provide both theoretical and empirical evidence to support this.

In addition to explaining why we should use warmup, we also propose RAdam, a theoretically sound variant of Adam.

Motivation

As shown in Figure 1, we assume that gradients follow a normal distribution (mean: $\mu$, variance: 1). The variance of the adaptive learning rate is simulated and plotted in Figure 1 (blue curve). We observe that the adaptive learning rate has a large variance in the early stage of training.
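The sketch below reflects our reading of that simulation (it is not the exact script behind Figure 1, and the helper name and constants are illustrative): sample gradients g_i ~ N(mu, 1), track Adam's exponential moving second moment, and estimate the variance of the resulting adaptive learning rate across many runs.

import torch

def adaptive_lr_variance(mu=0.0, beta2=0.999, steps=200, runs=5000):
    """Estimate Var[1/sqrt(v_hat_t)] at each step t over many simulated runs."""
    g = torch.randn(runs, steps) + mu            # gradients g_i ~ N(mu, 1), one row per run
    v = torch.zeros(runs)
    variances = []
    for t in range(1, steps + 1):
        v = beta2 * v + (1 - beta2) * g[:, t - 1] ** 2
        v_hat = v / (1 - beta2 ** t)             # bias-corrected second moment
        psi = 1.0 / (v_hat.sqrt() + 1e-8)        # the adaptive learning rate
        variances.append(psi.var().item())
    return variances

curve = adaptive_lr_variance()
print(curve[0], curve[-1])                       # the variance is much larger at early steps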

When using a Transformer for NMT, a warmup stage is usually required to avoid convergence problems (e.g., Adam-vanilla converges around 500 PPL in Figure 2, while Adam-warmup successfully converges under 10 PPL). In further explorations, we notice that if we use an additional 2,000 samples to estimate the adaptive learning rate, the convergence problems are avoided (Adam-2k); or, if we increase the value of eps, the convergence problems are also relieved (Adam-eps).

Therefore, we conjecture that the large variance in the early stage causes the convergence problem, and we further propose Rectified Adam (RAdam), which analytically reduces this large variance. More details can be found in our paper.
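For reference, the rectification term derived in the paper (up to notational differences) is

$\rho_{\infty} = \frac{2}{1-\beta_2} - 1, \qquad \rho_t = \rho_{\infty} - \frac{2 t \beta_2^{t}}{1-\beta_2^{t}}, \qquad r_t = \sqrt{\frac{(\rho_t - 4)(\rho_t - 2)\,\rho_{\infty}}{(\rho_{\infty} - 4)(\rho_{\infty} - 2)\,\rho_t}},$

and the Adam step is multiplied by $r_t$ only when $\rho_t > 4$; in the earliest steps, where the variance is intractable, the update falls back to an un-adapted (momentum-only) step.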

Questions and Discussions

Do I need to tune the learning rate?

Yes. RAdam's robustness is not unlimited: in our experiments it works for a broader range of learning rates than Adam, but not for all learning rates.

Notes on Transformer (more discussions can be found in our Transformer Clinic project)

Choice of the Original Transformer. We chose the original Transformer as our main study object because, without warmup, it suffers from the most serious convergence problems in our experiments. With such serious problems, our controlled experiments can better verify our hypothesis (i.e., we demonstrate that Adam-2k / Adam-eps can avoid spurious local optima with minimal changes).

Sensitivity. We observe that the Transformer is sensitive to the architecture configuration, despite its efficiency and effectiveness. For example, by changing the position of the layer norm, the model may or may not require warmup to achieve good performance. Intuitively, since the gradients of the attention layers can be sparser, and the adaptive learning rates for smaller gradients have a larger variance, these layers are more sensitive. Nevertheless, we believe this problem deserves a more in-depth analysis and is beyond the scope of our study.

Why does warmup have a bigger impact on some models than others?

Although the adaptive learning rate has a larger variance in the early stage, the exact magnitude is subject to the model design. Thus, the convergence problem can be more serious for some models/tasks than for others. In our experiments, we observe that RAdam achieves consistent improvements over vanilla Adam, which verifies that the variance issue exists widely (since we can get better performance by fixing it).

What if the gradient is not zero-meaned?

As in Figure 1 (above), even if the gradient is not zero-meaned, the original adaptive learning rate still has a larger variance at the beginning; thus, applying the rectification can help to stabilize training.

Another related concern is that, when the mean of the gradient is significantly larger than its variance, the magnitude of the "problematic" variance may not be very large (i.e., in Figure 1, when $\mu$ equals 10, the adaptive learning rate variance is relatively small and may not cause problems). We think this provides a possible explanation for why warmup has a bigger impact on some models than others. Still, we suggest that, in real-world applications, neural networks usually have some parameters whose gradients meet our assumption well (i.e., their gradient variance is larger than their gradient mean), and these need the rectification to stabilize training.
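Reusing the hypothetical adaptive_lr_variance sketch from the Motivation section, this effect can be checked directly:

print(adaptive_lr_variance(mu=0.0)[:3])    # large variance in the first steps
print(adaptive_lr_variance(mu=10.0)[:3])   # much smaller, as suggested by Figure 1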

Why does SGD need warmup?

To the best of our knowledge, the warmup heuristic was originally designed for large-minibatch SGD [0], based on the intuition that the network changes rapidly in the early stage. However, we find that this does not explain why Adam requires warmup: note that Adam-2k uses the same large learning rate but, with a better estimate of the adaptive learning rate, also avoids the convergence problems.

The reason why warmup sometimes also helps SGD still lacks theoretical support. FYI, when optimizing a simple 2-layer CNN with gradient descent, the theory of [1] can be used to show the benefits of warmup. Specifically, the learning rate must be $O(\cos \phi)$, where $\phi$ is the angle between the current weight and the ground-truth weight; $\cos \phi$ can be very small due to the high-dimensional space and random initialization, and thus the learning rate must be very small at the beginning to guarantee convergence. $\cos \phi$ improves in the later stage, however, and the learning rate is then allowed to be larger. Their theory can somewhat justify why warmup is needed by gradient descent on neural networks, but it is still far-fetched for the real scenario.

[0] Goyal et al, Accurate, Large Minibatch SGD: Training Imagenet in 1 Hour, 2017

[1] Du et al, Gradient Descent Learns One-hidden-layer CNN: Don’t be Afraid of Spurious Local Minima, 2017

Quick Start Guide

  1. Directly replace the vanilla Adam with RAdam without changing any settings (a minimal usage sketch follows this list).
  2. Further tune hyper-parameters (including the learning rate) for better performance.
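A minimal sketch of step 1 (assuming this repository's radam.py is importable; the tiny model and data below are placeholders):

import torch
import torch.nn.functional as F
from radam import RAdam

model = torch.nn.Linear(10, 2)
optimizer = RAdam(model.parameters(), lr=1e-3)   # drop-in replacement for torch.optim.Adam

x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
loss = F.cross_entropy(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()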

Note that in our paper, our major contribution is to identify why we need warmup for Adam. Although some researchers have successfully improved their model performance (user comments), considering the difficulty of training NNs, directly plugging in RAdam may not result in an immediate performance boost. In our experience, replacing vanilla Adam with RAdam usually results in better performance; however, if warmup has already been employed and tuned in the baseline method, it is necessary to also tune the hyper-parameters for RAdam.

Related Posts and Repos

Unofficial Re-Implementations

RAdam is very easy to implement; we provide PyTorch implementations here (a rough reference sketch follows the list below), while third-party implementations can be found at:

Keras Implementation

Keras Implementation

Julia implementation in Flux.jl
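As an illustration of how little is involved, here is a self-contained, single-tensor sketch of one RAdam-style update (for exposition only; it is not the repository's optimized code):

import math
import torch

def radam_step(p, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one rectified-Adam-style update to tensor p, in place."""
    state['step'] += 1
    t = state['step']
    state['m'].mul_(beta1).add_((1 - beta1) * grad)           # first moment
    state['v'].mul_(beta2).add_((1 - beta2) * grad * grad)    # second moment

    m_hat = state['m'] / (1 - beta1 ** t)
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    rho_t = rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)

    if rho_t > 4:  # variance of the adaptive lr is tractable: rectify and adapt
        r_t = math.sqrt(((rho_t - 4) * (rho_t - 2) * rho_inf)
                        / ((rho_inf - 4) * (rho_inf - 2) * rho_t))
        v_hat = (state['v'] / (1 - beta2 ** t)).sqrt().add_(eps)
        p.add_(-lr * r_t * m_hat / v_hat)
    else:          # earliest steps: fall back to an un-adapted momentum update
        p.add_(-lr * m_hat)

p = torch.zeros(4)
state = {'step': 0, 'm': torch.zeros(4), 'v': torch.zeros(4)}
for _ in range(5):
    radam_step(p, torch.randn(4), state)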

Unofficial Introduction & Mentions

We provide a simple introduction in Motivation, and more details can be found in our paper. Some unofficial introductions are available (with better writing); they are listed here for reference only (contents/claims in our paper are more accurate):

Medium Post

Related Twitter Post

CSDN Post (in Chinese)

User Comments

We are happy to see that our algorithms are found to be useful by some users :-)

"...I tested it on ImageNette and quickly got new high accuracy scores for the 5 and 20 epoch 128px leaderboard scores, so I know it works... https://forums.fast.ai/t/meet-radam-imo-the-new-state-of-the-art-ai-optimizer/52656

— Less Wright August 15, 2019

Thought "sounds interesting, I'll give it a try" - top 5 are vanilla Adam, bottom 4 (I only have access to 4 GPUs) are RAdam... so far looking pretty promising! pic.twitter.com/irvJSeoVfx

— Hamish Dickson (@_mishy) August 16, 2019

RAdam works great for me! It’s good to several % accuracy for free, but the biggest thing I like is the training stability. RAdam is way more stable! https://medium.com/@mgrankin/radam-works-great-for-me-344d37183943

— Grankin Mikhail August 17, 2019

"... Also, I achieved higher accuracy results using the newly proposed RAdam optimization function.... https://towardsdatascience.com/optimism-is-on-the-menu-a-recession-is-not-d87cce265b10

— Sameer Ahuja August 24, 2019

"... Out-of-box RAdam implementation performs better than Adam and finetuned SGD... https://twitter.com/ukrdailo/status/1166265186920980480

— Alex Dailo August 27, 2019

Citation

Please cite the following paper if you find our model useful. Thanks!

Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han (2020). On the Variance of the Adaptive Learning Rate and Beyond. In Proceedings of the Eighth International Conference on Learning Representations (ICLR 2020).

@inproceedings{liu2019radam,
 author = {Liu, Liyuan and Jiang, Haoming and He, Pengcheng and Chen, Weizhu and Liu, Xiaodong and Gao, Jianfeng and Han, Jiawei},
 booktitle = {Proceedings of the Eighth International Conference on Learning Representations (ICLR 2020)},
 month = {April},
 title = {On the Variance of the Adaptive Learning Rate and Beyond},
 year = {2020}
}

radam's People

Contributors

akhileshgotmare, cclauss, gcampax, hmjianggatech, jefffessler, liyuanlucasliu, lsrock1, namisan, nyavramov, tbazin, tony-y, waldeland, zzaebok


radam's Issues

[AdamW] amsgrad issue

Hi there,

Thanks a lot for the implementation! Out of curiosity, I tested the other optimizers apart from radam. And I think there is a missing argument in the constructor of AdamW on line 154:

def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0, amsgrad=False, warmup=0):

Otherwise calling the constructor throws the following error:

  File "main.py", line 125, in main
    optimizer = AdamW(model_params, weight_decay=args.weight_decay)
  File "/home/fg/fv-training/optimizer.py", line 156, in __init__
    weight_decay=weight_decay, amsgrad=amsgrad, warmup = warmup)
NameError: name 'amsgrad' is not defined

Cheers

Sensitivity wrt LR restarts

I'm observing sensitivity wrt LR restarts in a typical SGDR schedule with cosine annealing as in Loshchilov & Hutter. RAdam still seems to be doing better than AdamW so far, but the jumps imply possible numerical instability at LR discontinuities.

Here's the training loss compared to AdamW (PyTorch 1.2.0 version):
[figure: training loss, RAdam vs. AdamW]

Here's the validation loss:
[figure: validation loss, RAdam vs. AdamW]

What's the recommendation here? Should I use warmup in every cycle rather than just in the beginning? I thought RAdam was supposed to obviate the need for warmup. Is this a bug?

Typo in paper

Hi, I read your arxiv paper and found some typos in it.

  1. In Equation 1, g_t should be g_i.

  2. In the numerator of Equation 3, I calculated it myself and I think it should be Gamma((rho-1)/2) instead of Gamma(rho/2 - 1).

  3. In the SMA equation, the numerator and denominator are upside down.

Let me know if I was wrong :)

Thank you for interesting paper.

Become very unstable in BERT+MultiTask mode

I use BERT + MultiTask to train a model.
At each train step, the task is chosen randomly.

After switching to RAdam, training became very unstable: the loss of some tasks got smaller, but the loss of other tasks became very large.
I think this may be because the losses produced by different tasks have different magnitudes, so the variance keeps getting large.
I'm not sure whether my guess is correct.

I think that, in a situation like mine, I may only be able to use Adam + WarmUp.

RAdam Instability vs AdamW / Adam

Late to the party, but once again good work to you all @LiyuanLucasLiu !

So I was testing RAdam vs AdamW on simple linear models [i.e., Logistic Regression / Linear Regression]. Obviously, for these small problems, using new methods is a bit of overkill, but trying them on small problems [sklearn datasets like Boston, MNIST, Wine] is also important :)

After finding the best LR using the LR Range Finder (which turns out to be the same LR for both [0.046]), plus gradient centralization, batch size = 16, and careful bias initialization (mean(y)), RAdam does seem more "stable" than AdamW.
[figure: training curves, standardized data]

However, I noticed that if you do NOT standardize your data, RAdam's gradient diverges dramatically. The LR Range Test on the non-standardized data gave LR = 6.51e-05, which is super small. But RAdam still diverges.
[figure: RAdam loss diverging on non-standardized data]

AdamW [lr = 1e-3] also has higher error when the data is not standardized; however, the loss doesn't diverge much.
[figure: AdamW loss on non-standardized data]

I also tried, during the (p < 5) phase, manually clipping gradients by dividing by their norm. It's now much closer to AdamW.
[figure: loss with manual gradient clipping]

So my question is: is it expected for RAdam to diverge if the dataset is not standardized? Should AdamW be used instead? Is it because of the SGD + Momentum phase when (p < 5) that this divergence is seen?

Having issues importing to a Kaggle notebook

Hello,

I tried importing RAdam in a Kaggle notebook. The usual way to import GitHub repos in Kaggle notebooks is to write

!pip install git+https://github.com/LiyuanLucasLiu/RAdam
from radam import RAdam

However, after running I got this error:

Collecting git+https://github.com/LiyuanLucasLiu/RAdam
Cloning https://github.com/LiyuanLucasLiu/RAdam to /tmp/pip-req-build-flm052t0
Running command git clone -q https://github.com/LiyuanLucasLiu/RAdam /tmp/pip-req-build-flm052t0
ERROR: Command errored out with exit status 1:
command: /opt/conda/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-flm052t0/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-flm052t0/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base pip-egg-info
cwd: /tmp/pip-req-build-flm052t0/
Complete output (5 lines):
Traceback (most recent call last):
File "", line 1, in
File "/opt/conda/lib/python3.6/tokenize.py", line 452, in open
buffer = _builtin_open(filename, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pip-req-build-flm052t0/setup.py'
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

Any ideas?
Thank you.

simplify add_

Hi,

I have a small optimization to suggest:

Is there any particular reason to not simplify

[line 84] p_data_fp32.add_(-group['weight_decay'] * group['lr'], p_data_fp32)

into

p_data_fp32.mul_(1 - group['weight_decay'] * group['lr'])

?
Other lines could be simplified in the same manner.
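For what it's worth, a quick standalone check of the equivalence (a hypothetical snippet, not code from this repo):

import torch

p = torch.randn(5)
wd, lr = 1e-2, 1e-3
a = p + (-wd * lr) * p         # what p.add_(-wd * lr, p) computes
b = p * (1 - wd * lr)          # the in-place multiply proposed above
print(torch.allclose(a, b))    # True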

Does RAdam break training with different learning rates for different param_groups?

If I understand the source code for RAdam in radam.py correctly, the global buffer caches step_size parameters depending only on state['step']. This would fail in training regimes where each param_group has its own learning rate, as the buffer would contain a step_size based on the learning rate of the first processed group.

Different implementation of radam.py

The implementation of RAdam is different between these two files.

  1. RAdam/radam.py
  2. RAdam/language-model/model_word_ada/radam.py

Which should I use?

And there is a parameter (self.buffer) in the first file.
What is its role?

NaNs

I observed that the RAdam method can start producing NaN loss in the first epochs while Adam does not. This is not just one or two experiments but a general observation. I wonder if we could merge the AdaBound clamp into RAdam to avoid this type of issue at the very beginning of training?

About the estimation of DoF

I appreciate your wonderful work!
I have two questions about the estimation of the degrees of freedom.
According to question 3 of this issue, the correct SMA should be as follows, but why can you say that the DoF of the scaled inverse chi-square is f(t, β2)?
[screenshot of the equation]

And, in the same section of your paper, why did you use (t+1-i) instead of g^2 in the following equation?

[screenshot of the equation]

Thank you in advance!

KeyError: 'buffer'

I directly replaced Adam with RAdam.
But I found the group dict didn't have a key called 'buffer'; can you give any advice?
optimizer = optim.Adam(model.parameters())  ->  optimizer = RAdam(model.parameters())

Will RAdam be affected by weight decay?

Hi,

It is said that naive Adam can hurt performance if weight decay is added, which is why AdamW was invented to make Adam compatible with weight decay. Now I have a question: does RAdam work well if I use it together with weight decay?

Deprecation warning in `RAdam` with torch==1.7.1

Hi @LiyuanLucasLiu, thanks for your incredible lib. With RAdam I got better performance without changing any hyperparameters. However, there is a deprecation warning in RAdam with torch==1.7.1:

UserWarning: This overload of addcmul_ is deprecated:
	addcmul_(Number value, Tensor tensor1, Tensor tensor2)
Consider using one of the following signatures instead:
	addcmul_(Tensor tensor1, Tensor tensor2, *, Number value) (Triggered internally at  /pytorch/torch/csrc/utils/python_arg_parser.cpp:882.)
  exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)

according to the doc of addcmul_ in torch==1.7.1

Docstring:
addcmul_(tensor1, tensor2, *, value=1) -> Tensor

In-place version of :meth:`~Tensor.addcmul`
Type:      builtin_function_or_method

so to adapt to 1.7.1 and silence this warning, I only need to change exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad) to exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2), right?
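For reference, a small standalone check of the non-deprecated overload (hypothetical tensors, not the repo's code):

import torch

exp_avg_sq = torch.ones(3)
grad = torch.randn(3)
beta2 = 0.999

# deprecated overload: exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
# non-deprecated equivalent (value is keyword-only):
exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
print(exp_avg_sq)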

math.sqrt gets a negative argument

Hi! I have been trying to train the TransformerXL language model (https://github.com/kimiyoung/transformer-xl/blob/master/pytorch/run_wt103_base.sh) with RAdam and I get *** ValueError: math domain error

Traceback (most recent call last):
  File "train.py", line 543, in <module>
    train()
  File "train.py", line 463, in train
    optimizer.step()
  File "/transformer-xl/pytorch/radam.py", line 69, in step
    N_sma * N_sma_max / (N_sma_max - 2))) / beta1_t
ValueError: math domain error

This is because the argument to math.sqrt is negative here:
https://github.com/LiyuanLucasLiu/RAdam/blob/master/language-model/model_word_ada/radam.py#L67

What would be the right fix for this? I tried math.sqrt(abs(...)) but that performs worse than Adam.
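The top-level radam.py takes the square root only when the approximated SMA length is at least 5 and otherwise falls back to a momentum-only step size; a simplified, hedged sketch of that guard (not the exact repository code):

import math

def safe_step_size(step, beta1=0.9, beta2=0.999):
    """Step-size factor that never passes a negative argument to math.sqrt."""
    beta2_t = beta2 ** step
    N_sma_max = 2.0 / (1.0 - beta2) - 1.0
    N_sma = N_sma_max - 2.0 * step * beta2_t / (1.0 - beta2_t)
    bias_correction1 = 1.0 - beta1 ** step
    if N_sma >= 5:   # the rectification term is real-valued only in this regime
        return math.sqrt(
            (1 - beta2_t) * (N_sma - 4) / (N_sma_max - 4)
            * (N_sma - 2) / N_sma * N_sma_max / (N_sma_max - 2)
        ) / bias_correction1
    return 1.0 / bias_correction1   # early steps: momentum-only (SGD-like) step

print(safe_step_size(1))     # early step: falls back, no math domain error
print(safe_step_size(1000))  # later step: rectified step size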

Overload of addcmul_ is deprecated:

I'm getting these warnings as I train.

UserWarning: This overload of addcmul_ is deprecated:
        addcmul_(Number value, Tensor tensor1, Tensor tensor2)
Consider using one of the following signatures instead:
        addcmul_(Tensor tensor1, Tensor tensor2, *, Number value)

Seems like a pretty simple argument reordering on lines 63, 158, 241, but I'm not exactly sure of the consequences.

Please add the license

Your code can't legally be used by others without a license. If possible, please use the same license as PyTorch.

Thanks!

distributed training generating "exp_avg error"

Hi, thanks for sharing the code. I have tested single-node training and it works.

However, when I use distributed training in PyTorch, it says:

line 39, in step
state['exp_avg'] = state['exp_avg'].type_as(p_data_fp32)
KeyError: 'exp_avg'

Any suggestions towards this?

Much appreciated for any comments!

Notebook tutorial

I'm trying to find the easiest way to test RAdam.

Could you commit a basic Jupyter notebook to test MNIST, for instance?

ResNet56

Sorry to bother you again. Can you tell me the hyperparameter settings for ResNet-56? I got a very poor test accuracy of 91.1, which is worse than that of ResNet-20. I set lr=0.01 and weight-decay=1e-4. Is there something wrong?

Why there are 10 slots in the buffer?

Hi.

I'm trying to understand RAdam implementation:

defaults = dict(..., buffer=[[None, None, None] for _ in range(10)])
...
for group in self.param_groups:
    for p in group['params']:
        ...
        state = self.state[p]
        ...
        state['step'] += 1
        buffered = group['buffer'][int(state['step'] % 10)]
        if state['step'] == buffered[0]:
            ...  # Reuse rectification constants computed during handling of a previous param.
        else:
            ...  # Compute rectification constants for a new step.

To me, it looks like state['step'] is steadily incremented by one, and only the values for a previous step can possibly be reused, so a buffer with a capacity of one would be sufficient. But the actual buffer has 10 entries. What for?

Cannot reproduce the PPL on One Billion Words

For the language model (LM) experiments on One Billion Words, the final test PPLs with Adam and RAdam are around 41 and 40, respectively, worse than the numbers reported in the paper (36.9 for Adam and 35.7 for RAdam).
GitHub version: 5716b3e

Question of RAdam's dependence on the number of examples

I'd like to confirm what:

Specifically, we identify that, due to the limited amount of samples in the early stage of model training, the adaptive learning rate has an undesirably large variance and can cause the model to converge to suspicious/bad local optima.

means. I read the paper and didn't see RAdam depend on the number of training examples, so I don't really understand what you mean by "limited amount of samples in the early stage of model training". If you are training via epochs, the model "sees" the entire training set at every "epoch step". So I am unsure how to interpret that.

Thanks for sharing your work with us!

Theory question on warmup

Due to the lack of samples in the early stage, the adaptive learning rate has an
undesirably large variance, which leads to suspicious/bad local optima -- pg. 3

Does this apply when feeding the same dataset in a different configuration? Namely, I'm training a timeseries (16-channel EEG) CNN-LSTM classifier, and vary the input timesteps across epochs for the same model. While the information source probability distribution remains identical, what the neural net effectively "sees" differs substantially between, say, 13500 and 216000 timesteps.

This considered, is warmup for the first epoch of every new timesteps setting advisable? Thanks

Does RAdam have a Keras version?

Hi, good job!
You implemented RAdam in PyTorch version, is it possible to offer a Keras version later?
Appreciate :)
Hello, you implemented RAdam in PyTorch; are there any plans to provide a Keras version of RAdam? Thanks a lot!

Question regarding 2nd Moment Update

Hi Lucas,

First of all I would like to say thank you as the RAdam is really an amazing optimizer I have been using since last year.

Recently I have become interested in the algorithm behind RAdam and I have a dumb question, if you don't mind: the update of the exponential moving 2nd moment in your paper is v_t ← (1/β2)·v_{t-1} + (1 - β2)·g_t². I remember that in the Adam optimizer the update is v_t ← β2·v_{t-1} + (1 - β2)·g_t², without the "1/". I also noticed that in version 1 of your paper the formula does not include the "1/", so why is the "1/" added?

I also checked the code, if I understand correctly, the following is updating the 2nd moment:

exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)

Could you kindly help me out when you have free time?

Many thanks,
Bowen

adam-2k

Could you explain how Adam-2k is designed to get additional samples? Thank you.

Speed performance

Good day! Thanks for your work.

Is RAdam more computationally efficient than Adam? In my task setting, RAdam takes much faster steps on the same batches and I'm trying to figure out why...
[figure: step-time comparison]
