liyuanlucasliu / radam

On the Variance of the Adaptive Learning Rate and Beyond

Home Page: https://arxiv.org/abs/1908.03265

License: Apache License 2.0

Languages: Python 99.06%, Shell 0.94%
Topics: optimizer, adam, adam-optimizer, warmup

radam's Introduction


RAdam

On the Variance of the Adaptive Learning Rate and Beyond

We are in an early-release beta. Expect some adventures and rough edges.

Table of Contents

Introduction

If warmup is the answer, what is the question?

The learning rate warmup for Adam is a must-have trick for stable training in certain situations (the alternative being careful eps tuning), but the underlying mechanism is largely unknown. In our study, we suggest that one fundamental cause is the large variance of the adaptive learning rates, and we provide both theoretical and empirical evidence to support this.

In addition to explaining why we should use warmup, we also propose RAdam, a theoretically sound variant of Adam.

Motivation

As shown in Figure 1, we assume that gradients follow a normal distribution (mean: $\mu$, variance: 1). The variance of the adaptive learning rate is simulated and plotted in Figure 1 (blue curve). We observe that the adaptive learning rate has a large variance in the early stage of training.
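The sketch below reflects our reading of that simulation (it is not the exact script behind Figure 1, and the helper name and constants are illustrative): sample gradients g_i ~ N(mu, 1), track Adam's exponential moving second moment, and estimate the variance of the resulting adaptive learning rate across many runs.

import torch

def adaptive_lr_variance(mu=0.0, beta2=0.999, steps=200, runs=5000):
    """Estimate Var[1/sqrt(v_hat_t)] at each step t over many simulated runs."""
    g = torch.randn(runs, steps) + mu            # gradients g_i ~ N(mu, 1), one row per run
    v = torch.zeros(runs)
    variances = []
    for t in range(1, steps + 1):
        v = beta2 * v + (1 - beta2) * g[:, t - 1] ** 2
        v_hat = v / (1 - beta2 ** t)             # bias-corrected second moment
        psi = 1.0 / (v_hat.sqrt() + 1e-8)        # the adaptive learning rate
        variances.append(psi.var().item())
    return variances

curve = adaptive_lr_variance()
print(curve[0], curve[-1])                       # the variance is much larger at early steps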

When using a Transformer for NMT, a warmup stage is usually required to avoid convergence problems (e.g., Adam-vanilla converges around 500 PPL in Figure 2, while Adam-warmup successfully converges under 10 PPL). In further explorations, we notice that if we use an additional 2,000 samples to estimate the adaptive learning rate, the convergence problems are avoided (Adam-2k); or, if we increase the value of eps, the convergence problems are also relieved (Adam-eps).

Therefore, we conjecture that the large variance in the early stage causes the convergence problem, and we further propose Rectified Adam (RAdam), which analytically reduces this large variance. More details can be found in our paper.
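For reference, the rectification term derived in the paper (up to notational differences) is

$\rho_{\infty} = \frac{2}{1-\beta_2} - 1, \qquad \rho_t = \rho_{\infty} - \frac{2 t \beta_2^{t}}{1-\beta_2^{t}}, \qquad r_t = \sqrt{\frac{(\rho_t - 4)(\rho_t - 2)\,\rho_{\infty}}{(\rho_{\infty} - 4)(\rho_{\infty} - 2)\,\rho_t}},$

and the Adam step is multiplied by $r_t$ only when $\rho_t > 4$; in the earliest steps, where the variance is intractable, the update falls back to an un-adapted (momentum-only) step.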

Questions and Discussions

Do I need to tune the learning rate?

Yes. RAdam's robustness is not unlimited: in our experiments it works for a broader range of learning rates than Adam, but not for all learning rates.

Notes on Transformer (more discussions can be found in our Transformer Clinic project)

Choice of the Original Transformer. We chose the original Transformer as our main study object because, without warmup, it suffers from the most serious convergence problems in our experiments. With such serious problems, our controlled experiments can better verify our hypothesis (i.e., we demonstrate that Adam-2k / Adam-eps can avoid spurious local optima with minimal changes).

Sensitivity. We observe that the Transformer is sensitive to the architecture configuration, despite its efficiency and effectiveness. For example, by changing the position of the layer norm, the model may or may not require warmup to achieve good performance. Intuitively, since the gradients of the attention layers can be sparser, and the adaptive learning rates for smaller gradients have a larger variance, these layers are more sensitive. Nevertheless, we believe this problem deserves a more in-depth analysis and is beyond the scope of our study.

Why does warmup have a bigger impact on some models than others?

Although the adaptive learning rate has a larger variance in the early stage, the exact magnitude is subject to the model design. Thus, the convergence problem can be more serious for some models/tasks than for others. In our experiments, we observe that RAdam achieves consistent improvements over vanilla Adam, which verifies that the variance issue exists widely (since we can get better performance by fixing it).

What if the gradient is not zero-meaned?

As in Figure 1 (above), even if the gradient is not zero-meaned, the original adaptive learning rate still has a larger variance at the beginning; thus, applying the rectification can help to stabilize training.

Another related concern is that, when the mean of the gradient is significantly larger than its variance, the magnitude of the "problematic" variance may not be very large (i.e., in Figure 1, when $\mu$ equals 10, the adaptive learning rate variance is relatively small and may not cause problems). We think this provides a possible explanation for why warmup has a bigger impact on some models than others. Still, we suggest that, in real-world applications, neural networks usually have some parameters whose gradients meet our assumption well (i.e., their gradient variance is larger than their gradient mean), and these need the rectification to stabilize training.
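Reusing the hypothetical adaptive_lr_variance sketch from the Motivation section, this effect can be checked directly:

print(adaptive_lr_variance(mu=0.0)[:3])    # large variance in the first steps
print(adaptive_lr_variance(mu=10.0)[:3])   # much smaller, as suggested by Figure 1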

Why does SGD need warmup?

To the best of our knowledge, the warmup heuristic was originally designed for large-minibatch SGD [0], based on the intuition that the network changes rapidly in the early stage. However, we find that this does not explain why Adam requires warmup: note that Adam-2k uses the same large learning rate but, with a better estimate of the adaptive learning rate, also avoids the convergence problems.

The reason why warmup sometimes also helps SGD still lacks theoretical support. FYI, when optimizing a simple 2-layer CNN with gradient descent, the theory of [1] can be used to show the benefits of warmup. Specifically, the learning rate must be $O(\cos \phi)$, where $\phi$ is the angle between the current weight and the ground-truth weight; $\cos \phi$ can be very small due to the high-dimensional space and random initialization, and thus the learning rate must be very small at the beginning to guarantee convergence. $\cos \phi$ improves in the later stage, however, and the learning rate is then allowed to be larger. Their theory can somewhat justify why warmup is needed by gradient descent on neural networks, but it is still far-fetched for the real scenario.

[0] Goyal et al, Accurate, Large Minibatch SGD: Training Imagenet in 1 Hour, 2017

[1] Du et al, Gradient Descent Learns One-hidden-layer CNN: Don’t be Afraid of Spurious Local Minima, 2017

Quick Start Guide

  1. Directly replace the vanilla Adam with RAdam without changing any settings (a minimal usage sketch follows this list).
  2. Further tune hyper-parameters (including the learning rate) for better performance.
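A minimal sketch of step 1 (assuming this repository's radam.py is importable; the tiny model and data below are placeholders):

import torch
import torch.nn.functional as F
from radam import RAdam

model = torch.nn.Linear(10, 2)
optimizer = RAdam(model.parameters(), lr=1e-3)   # drop-in replacement for torch.optim.Adam

x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
loss = F.cross_entropy(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()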

Note that in our paper, our major contribution is to identify why we need warmup for Adam. Although some researchers have successfully improved their model performance (user comments), considering the difficulty of training NNs, directly plugging in RAdam may not result in an immediate performance boost. In our experience, replacing vanilla Adam with RAdam usually results in better performance; however, if warmup has already been employed and tuned in the baseline method, it is necessary to also tune the hyper-parameters for RAdam.

Related Posts and Repos

Unofficial Re-Implementations

RAdam is very easy to implement; we provide PyTorch implementations here (a rough reference sketch follows the list below), while third-party implementations can be found at:

Keras Implementation

Keras Implementation

Julia implementation in Flux.jl
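As an illustration of how little is involved, here is a self-contained, single-tensor sketch of one RAdam-style update (for exposition only; it is not the repository's optimized code):

import math
import torch

def radam_step(p, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one rectified-Adam-style update to tensor p, in place."""
    state['step'] += 1
    t = state['step']
    state['m'].mul_(beta1).add_((1 - beta1) * grad)           # first moment
    state['v'].mul_(beta2).add_((1 - beta2) * grad * grad)    # second moment

    m_hat = state['m'] / (1 - beta1 ** t)
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    rho_t = rho_inf - 2.0 * t * beta2 ** t / (1.0 - beta2 ** t)

    if rho_t > 4:  # variance of the adaptive lr is tractable: rectify and adapt
        r_t = math.sqrt(((rho_t - 4) * (rho_t - 2) * rho_inf)
                        / ((rho_inf - 4) * (rho_inf - 2) * rho_t))
        v_hat = (state['v'] / (1 - beta2 ** t)).sqrt().add_(eps)
        p.add_(-lr * r_t * m_hat / v_hat)
    else:          # earliest steps: fall back to an un-adapted momentum update
        p.add_(-lr * m_hat)

p = torch.zeros(4)
state = {'step': 0, 'm': torch.zeros(4), 'v': torch.zeros(4)}
for _ in range(5):
    radam_step(p, torch.randn(4), state)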

Unofficial Introduction & Mentions

We provide a simple introduction in Motivation, and more details can be found in our paper. Some unofficial introductions are available (with better writing); they are listed here for reference only (contents/claims in our paper are more accurate):

Medium Post

Related Twitter Post

CSDN Post (in Chinese)

User Comments

We are happy to see that our algorithms are found to be useful by some users :-)

"...I tested it on ImageNette and quickly got new high accuracy scores for the 5 and 20 epoch 128px leaderboard scores, so I know it works... https://forums.fast.ai/t/meet-radam-imo-the-new-state-of-the-art-ai-optimizer/52656

— Less Wright August 15, 2019

Thought "sounds interesting, I'll give it a try" - top 5 are vanilla Adam, bottom 4 (I only have access to 4 GPUs) are RAdam... so far looking pretty promising! pic.twitter.com/irvJSeoVfx

— Hamish Dickson (@_mishy) August 16, 2019

RAdam works great for me! It’s good to several % accuracy for free, but the biggest thing I like is the training stability. RAdam is way more stable! https://medium.com/@mgrankin/radam-works-great-for-me-344d37183943

— Grankin Mikhail August 17, 2019

"... Also, I achieved higher accuracy results using the newly proposed RAdam optimization function.... https://towardsdatascience.com/optimism-is-on-the-menu-a-recession-is-not-d87cce265b10

— Sameer Ahuja August 24, 2019

"... Out-of-box RAdam implementation performs better than Adam and finetuned SGD... https://twitter.com/ukrdailo/status/1166265186920980480

— Alex Dailo August 27, 2019

Citation

Please cite the following paper if you find our model useful. Thanks!

Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han (2020). On the Variance of the Adaptive Learning Rate and Beyond. In Proceedings of the Eighth International Conference on Learning Representations (ICLR 2020).

@inproceedings{liu2019radam,
 author = {Liu, Liyuan and Jiang, Haoming and He, Pengcheng and Chen, Weizhu and Liu, Xiaodong and Gao, Jianfeng and Han, Jiawei},
 booktitle = {Proceedings of the Eighth International Conference on Learning Representations (ICLR 2020)},
 month = {April},
 title = {On the Variance of the Adaptive Learning Rate and Beyond},
 year = {2020}
}

radam's People

Contributors

akhileshgotmare, cclauss, gcampax, hmjianggatech, jefffessler, liyuanlucasliu, lsrock1, namisan, nyavramov, tbazin, tony-y, waldeland, zzaebok


radam's Issues

[AdamW] amsgrad issue

Hi there,

Thanks a lot for the implementation! Out of curiosity, I tested the other optimizers apart from radam. And I think there is a missing argument in the constructor of AdamW on line 154:

def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0, amsgrad=False, warmup=0):

Otherwise calling the constructor throws the following error:

  File "main.py", line 125, in main
    optimizer = AdamW(model_params, weight_decay=args.weight_decay)
  File "/home/fg/fv-training/optimizer.py", line 156, in __init__
    weight_decay=weight_decay, amsgrad=amsgrad, warmup = warmup)
NameError: name 'amsgrad' is not defined

Cheers

Sensitivity wrt LR restarts

I'm observing sensitivity wrt LR restarts in a typical SGDR schedule with cosine annealing as in Loshchilov & Hutter. RAdam still seems to be doing better than AdamW so far, but the jumps imply possible numerical instability at LR discontinuities.

Here's the training loss compared to AdamW (PyTorch 1.2.0 version):
[figure: training loss, RAdam vs. AdamW]

Here's the validation loss:
[figure: validation loss, RAdam vs. AdamW]

What's the recommendation here? Should I use warmup in every cycle rather than just in the beginning? I thought RAdam was supposed to obviate the need for warmup. Is this a bug?

Typo in paper

Hi, I read your arxiv paper and found some typos in it.

  1. In Equation 1, g_t should be g_i.

  2. In the numerator of Equation 3, I calculated it myself and I think it should be Gamma((rho-1)/2) instead of Gamma(rho/2 - 1).

  3. In the SMA equation, the numerator and denominator are upside down.

Let me know if I was wrong :)

Thank you for interesting paper.

Become very unstable in BERT+MultiTask mode

I use BERT + MultiTask to train a model.
At each train step, the task is chosen randomly.

After switching to RAdam, training became very unstable: the loss of some tasks got smaller, but the loss of other tasks became very large.
I think this may be because the losses produced by different tasks have different magnitudes, so the variance keeps getting large.
I'm not sure whether my guess is correct.

I think that, in a situation like mine, I may only be able to use Adam + WarmUp.

RAdam Instability vs AdamW / Adam

Late to the party, but once again good work to you all @LiyuanLucasLiu !

So I was testing RAdam vs AdamW on simple linear models [i.e., Logistic Regression / Linear Regression]. Obviously, for these small problems, using new methods is a bit of overkill, but trying them on small problems [sklearn datasets like Boston, MNIST, Wine] is also important :)

After finding the best LR using the LR Range Finder (which turns out to be the same LR for both [0.046]), plus gradient centralization, batch size = 16, and careful bias initialization (mean(y)), RAdam does seem more "stable" than AdamW.
[figure: training curves, standardized data]

However, I noticed that if you do NOT standardize your data, RAdam's gradient diverges dramatically. The LR Range Test on the non-standardized data gave LR = 6.51e-05, which is super small. But RAdam still diverges.
[figure: RAdam loss diverging on non-standardized data]

AdamW [lr = 1e-3] also has higher error when the data is not standardized; however, the loss doesn't diverge much.
[figure: AdamW loss on non-standardized data]

I also tried, during the (p < 5) phase, manually clipping gradients by dividing by their norm. It's now much closer to AdamW.
[figure: loss with manual gradient clipping]

So my question is: is it expected for RAdam to diverge if the dataset is not standardized? Should AdamW be used instead? Is it because of the SGD + Momentum phase when (p < 5) that this divergence is seen?

Having issues importing to a Kaggle notebook

Hello,

I tried importing RAdam in a Kaggle notebook. The usual way to import GitHub repos in Kaggle notebooks is to write

!pip install git+https://github.com/LiyuanLucasLiu/RAdam
from radam import RAdam

However, after running I got this error:

Collecting git+https://github.com/LiyuanLucasLiu/RAdam
Cloning https://github.com/LiyuanLucasLiu/RAdam to /tmp/pip-req-build-flm052t0
Running command git clone -q https://github.com/LiyuanLucasLiu/RAdam /tmp/pip-req-build-flm052t0
ERROR: Command errored out with exit status 1:
command: /opt/conda/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-flm052t0/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-flm052t0/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base pip-egg-info
cwd: /tmp/pip-req-build-flm052t0/
Complete output (5 lines):
Traceback (most recent call last):
File "", line 1, in
File "/opt/conda/lib/python3.6/tokenize.py", line 452, in open
buffer = _builtin_open(filename, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pip-req-build-flm052t0/setup.py'
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

Any ideas?
Thank you.

simplify add_

Hi,

I have a small optimization to suggest:

Is there any particular reason to not simplify

[line 84] p_data_fp32.add_(-group['weight_decay'] * group['lr'], p_data_fp32)

into

p_data_fp32.mul_(1 - group['weight_decay'] * group['lr'])

?
Other lines could be simplified in the same manner.
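For what it's worth, a quick standalone check of the equivalence (a hypothetical snippet, not code from this repo):

import torch

p = torch.randn(5)
wd, lr = 1e-2, 1e-3
a = p + (-wd * lr) * p         # what p.add_(-wd * lr, p) computes
b = p * (1 - wd * lr)          # the in-place multiply proposed above
print(torch.allclose(a, b))    # True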

Does RAdam break training with different learning rates for different param_groups?

If I understand the source code for RAdam in radam.py correctly, the global buffer caches step_size parameters depending only on state['step']. This would fail in training regimes where each param_group has its own learning rate, as the buffer would contain a step_size based on the learning rate of the first processed group.

Different implementation of radam.py

The implementation of RAdam is different between these two files.

  1. RAdam/radam.py
  2. RAdam/language-model/model_word_ada/radam.py

Which should I use?

And there is a parameter (self.buffer) in the first file.
What is its role?

NaNs

I observed that the RAdam method can start producing NaN loss in the first epochs while Adam does not. This is not just one or two experiments but a general observation. I wonder if we could merge the AdaBound clamp into RAdam to avoid this type of issue at the very beginning of training?

About the estimation of DoF

I appreciate your wonderful work!
I have two questions about the estimation of the degrees of freedom.
According to question 3 of this issue, the correct SMA should be as follows, but why can you say that the DoF of the scaled inverse chi-square is f(t, β2)?
[screenshot of the equation]

And, in the same section of your paper, why did you use (t+1-i) instead of g^2 in the following equation?

[screenshot of the equation]

Thank you in advance!

KeyError: 'buffer'

I directly replaced Adam with RAdam.
But I found the group dict didn't have a key called 'buffer'; can you give any advice?
optimizer = optim.Adam(model.parameters())  ->  optimizer = RAdam(model.parameters())

Will RAdam be affected by weight decay?

Hi,

It is said that naive Adam can hurt performance if weight decay is added, which is why AdamW was invented to make Adam compatible with weight decay. Now I have a question: does RAdam work well if I use it together with weight decay?

Deprecation warning in `RAdam` with torch==1.7.1

Hi @LiyuanLucasLiu, thanks for your incredible lib. With RAdam I got better performance without changing any hyperparameters. However, there is a deprecation warning in RAdam with torch==1.7.1:

UserWarning: This overload of addcmul_ is deprecated:
	addcmul_(Number value, Tensor tensor1, Tensor tensor2)
Consider using one of the following signatures instead:
	addcmul_(Tensor tensor1, Tensor tensor2, *, Number value) (Triggered internally at  /pytorch/torch/csrc/utils/python_arg_parser.cpp:882.)
  exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)

according to the doc of addcmul_ in torch==1.7.1

Docstring:
addcmul_(tensor1, tensor2, *, value=1) -> Tensor

In-place version of :meth:`~Tensor.addcmul`
Type:      builtin_function_or_method

so to adapt to 1.7.1 and silence this warning, I only need to change exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad) to exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2), right?
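For reference, a small standalone check of the non-deprecated overload (hypothetical tensors, not the repo's code):

import torch

exp_avg_sq = torch.ones(3)
grad = torch.randn(3)
beta2 = 0.999

# deprecated overload: exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
# non-deprecated equivalent (value is keyword-only):
exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
print(exp_avg_sq)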

math.sqrt gets a negative argument

Hi! I have been trying to train the TransformerXL language model (https://github.com/kimiyoung/transformer-xl/blob/master/pytorch/run_wt103_base.sh) with RAdam and I get *** ValueError: math domain error

Traceback (most recent call last):
  File "train.py", line 543, in <module>
    train()
  File "train.py", line 463, in train
    optimizer.step()
  File "/transformer-xl/pytorch/radam.py", line 69, in step
    N_sma * N_sma_max / (N_sma_max - 2))) / beta1_t
ValueError: math domain error

This is because the argument to math.sqrt is negative here:
https://github.com/LiyuanLucasLiu/RAdam/blob/master/language-model/model_word_ada/radam.py#L67

What would be the right fix for this? I tried math.sqrt(abs(...)) but that performs worse than Adam.
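The top-level radam.py takes the square root only when the approximated SMA length is at least 5 and otherwise falls back to a momentum-only step size; a simplified, hedged sketch of that guard (not the exact repository code):

import math

def safe_step_size(step, beta1=0.9, beta2=0.999):
    """Step-size factor that never passes a negative argument to math.sqrt."""
    beta2_t = beta2 ** step
    N_sma_max = 2.0 / (1.0 - beta2) - 1.0
    N_sma = N_sma_max - 2.0 * step * beta2_t / (1.0 - beta2_t)
    bias_correction1 = 1.0 - beta1 ** step
    if N_sma >= 5:   # the rectification term is real-valued only in this regime
        return math.sqrt(
            (1 - beta2_t) * (N_sma - 4) / (N_sma_max - 4)
            * (N_sma - 2) / N_sma * N_sma_max / (N_sma_max - 2)
        ) / bias_correction1
    return 1.0 / bias_correction1   # early steps: momentum-only (SGD-like) step

print(safe_step_size(1))     # early step: falls back, no math domain error
print(safe_step_size(1000))  # later step: rectified step size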

Overload of addcmul_ is deprecated:

I'm getting these warnings as I train.

UserWarning: This overload of addcmul_ is deprecated:
        addcmul_(Number value, Tensor tensor1, Tensor tensor2)
Consider using one of the following signatures instead:
        addcmul_(Tensor tensor1, Tensor tensor2, *, Number value)

Seems like a pretty simple argument reordering on lines 63, 158, 241, but I'm not exactly sure of the consequences.

Please add the license

Your code can't legally be used by others without a license. If possible, please use the same license as PyTorch.

Thanks!

distributed training generating "exp_avg error"

Hi, thanks for sharing the code. I have tested single-node training and it works.

However, when I use distributed training in PyTorch, it says:

line 39, in step
state['exp_avg'] = state['exp_avg'].type_as(p_data_fp32)
KeyError: 'exp_avg'

Any suggestions towards this?

Much appreciated for any comments!

Notebook tutorial

I'm trying to find the easiest way to test RAdam.

Could you commit a basic Jupyter notebook to test MNIST, for instance?

ResNet56

Sorry to bother you again. Can you tell me the hyperparameter settings for ResNet-56? I got a very poor test accuracy of 91.1, which is worse than that of ResNet-20. I set lr=0.01 and weight-decay=1e-4. Is there something wrong?

Why there are 10 slots in the buffer?

Hi.

I'm trying to understand RAdam implementation:

defaults = dict(..., buffer=[[None, None, None] for _ in range(10)])
...
for group in self.param_groups:
    for p in group['params']:
        ...
        state = self.state[p]
        ...
        state['step'] += 1
        buffered = group['buffer'][int(state['step'] % 10)]
        if state['step'] == buffered[0]:
            ...  # Reuse rectification constants computed during handling of a previous param.
        else:
            ...  # Compute rectification constants for a new step.

To me, it looks like state['step'] is steadily incremented by one, and only the values for a previous step can possibly be reused, so a buffer with a capacity of one would be sufficient. But the actual buffer has 10 entries. What for?

Cannot reproduce the PPL on One Billion Words

For the language model (LM) experiments on One Billion Words, the final test PPLs with Adam and RAdam are around 41 and 40, respectively, worse than the numbers reported in the paper (36.9 for Adam and 35.7 for RAdam).
GitHub version: 5716b3e

Question of RAdam's dependence on the number of examples

I'd like to confirm what:

Specifically, we identify that, due to the limited amount of samples in the early stage of model training, the adaptive learning rate has an undesirably large variance and can cause the model to converge to suspicious/bad local optima.

means. I read the paper and didn't see RAdam depend on the number of training examples, so I don't really understand what you mean by "limited amount of samples in the early stage of model training". If you are training via epochs, the model "sees" the entire training set at every "epoch step". So I am unsure how to interpret that.

Thanks for sharing your work with us!

Theory question on warmup

Due to the lack of samples in the early stage, the adaptive learning rate has an
undesirably large variance, which leads to suspicious/bad local optima -- pg. 3

Does this apply when feeding the same dataset in a different configuration? Namely, I'm training a timeseries (16-channel EEG) CNN-LSTM classifier, and vary the input timesteps across epochs for the same model. While the information source probability distribution remains identical, what the neural net effectively "sees" differs substantially between, say, 13500 and 216000 timesteps.

This considered, is warmup for the first epoch of every new timesteps setting advisable? Thanks

Does RAdam have a Keras version?

Hi, good job!
You implemented RAdam in PyTorch version, is it possible to offer a Keras version later?
Appreciate :)
Hello, you implemented RAdam in PyTorch; are there any plans to provide a Keras version of RAdam? Thanks a lot!

Question regarding 2nd Moment Update

Hi Lucas,

First of all I would like to say thank you as the RAdam is really an amazing optimizer I have been using since last year.

Recently I have become interested in the algorithm behind RAdam and I have a dumb question, if you don't mind: the update of the exponential moving 2nd moment in your paper is v_t ← (1/β2)·v_{t-1} + (1 - β2)·g_t². I remember that in the Adam optimizer the update is v_t ← β2·v_{t-1} + (1 - β2)·g_t², without the "1/". I also noticed that in version 1 of your paper the formula does not include the "1/", so why is the "1/" added?

I also checked the code, if I understand correctly, the following is updating the 2nd moment:

exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)

Could you kindly help me out when you have free time?

Many thanks,
Bowen

adam-2k

Could you explain how Adam-2k is designed to get additional samples? Thank you.

Speed performance

Good day! Thanks for your work.

Is RAdam more computationally efficient than Adam? In my task setting, RAdam takes much faster steps on the same batches and I'm trying to figure out why...
[figure: step-time comparison]
