jiangoforit / yellowfin_pytorch

auto-tuning momentum SGD optimizer

License: Apache License 2.0

Python 99.46% Shell 0.54%

yellowfin_pytorch's Introduction

YellowFin

YellowFin is an auto-tuning optimizer based on momentum SGD which requires no manual specification of learning rate and momentum. It measures the objective landscape on-the-fly and tunes momentum as well as learning rate using local quadratic approximation.

The implementation here can be a drop-in replacement for any optimizer in PyTorch. After from yellowfin import YFOptimizer, it supports the step and zero_grad functions like any PyTorch optimizer. We also provide an interface to manually set the learning rate schedule at every iteration for finer control (see the Detailed guidelines section).

For more technical details, please refer to our paper YellowFin and the Art of Momentum Tuning.

For more usage details, please refer to the inline documentation of tuner_utils/yellowfin.py. Example usage can be found here for ResNext on CIFAR10 and Tied LSTM on PTB.

YellowFin is under active development. Many members of the community have kindly submitted issues and pull requests. We are incorporating fixes and smoothing things out. As a result the repository code is in flux. Please make sure you use the latest version and submit any issues you might have!

Updates

[2017.07.03] Fixed a gradient clipping bug. Please pull our latest master branch to make gradient clipping great again in YellowFin.

[2017.07.28] Switched to logarithmic smoothing to accelerate adaptation to curvature range trends.

[2017.08.01] Added an optional feature to enforce a non-increasing value of lr * gradient norm for stability in some rare cases.

[2017.08.05] Added feature to correct estimation bias from sparse gradient.

[2017.08.16] Replaced the numpy root solver with a closed-form solution using Vieta's substitution for the cubic equation. This solves the stability issue of the numpy root solver.

[2017.10.29] Major fix for stability. We added eps to protect fractions in our code, as well as an adaptive clipping feature to properly deal with exploding gradients (manual clipping is still supported, as described in the detailed guidelines below).
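
As an illustration of the closed-form approach mentioned in the 2017.08.16 update, here is a minimal sketch of Vieta's substitution for a depressed cubic y^3 + p*y + q = 0. It assumes p > 0, so the cubic has exactly one real root. The function name and the generic (p, q) form are assumptions made for this example; they are not taken from yellowfin.py, and the exact coefficients YellowFin solves for are not reproduced here.

```python
import math


def depressed_cubic_root(p, q):
    """Return the real root of y**3 + p*y + q = 0, assuming p > 0."""
    if p <= 0:
        raise ValueError("this sketch only handles p > 0 (exactly one real root)")
    # Vieta's substitution y = w - p / (3 * w) turns the cubic into a
    # quadratic in w**3:  (w**3)**2 + q * w**3 - p**3 / 27 = 0.
    w_cubed = (-q - math.sqrt(q * q + 4.0 * p ** 3 / 27.0)) / 2.0
    # With p > 0 the square root exceeds |q|, so w_cubed is strictly negative;
    # take a signed real cube root instead of a complex power.
    w = math.copysign(abs(w_cubed) ** (1.0 / 3.0), w_cubed)
    return w - p / (3.0 * w)


# Example: y**3 + 3*y - 4 = 0 has the real root y = 1.
root = depressed_cubic_root(3.0, -4.0)
residual = root ** 3 + 3.0 * root - 4.0
```

Unlike a general polynomial root finder, this returns a single real number directly, so there is no need to filter complex roots afterwards.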

Setup instructions for experiments

Please clone the master branch and follow the instructions to run YellowFin on ResNext for CIFAR10 and tied LSTM on Penn Treebank for language modeling. The models are adapted from ResNext repo and PyTorch example tied LSTM repo respectively. Thanks to the researchers for developing the models. For more experiments on more convolutional and recurrent neural networks, please refer to our Tensorflow implementation of YellowFin.

Note: YellowFin is tested with PyTorch v0.2.0 for compatibility, under Python 2.7.

Run CIFAR10 ResNext experiments

The experiments on 110 layer ResNet with CIFAR10 and 164 layer ResNet with CIFAR100 can be launched using

cd pytorch-cifar
python main.py --logdir=path_to_logs --opt_method=YF

Run Penn Treebank tied LSTM experiments

The experiments on multiple-layer LSTM on Penn Treebank can be launched using

cd word_language_model
python main.py --emsize 650 --nhid 650 --dropout 0.5 --epochs 40 --tied --opt_method=YF --logdir=path_to_logs --cuda

For more experiments, please refer to our YellowFin Tensorflow Repo.

Detailed guidelines

Basic use: optimizer = YFOptimizer(parameter_list) uses the uniform setting (i.e. without tuning) for all the PyTorch and Tensorflow experiments in our paper.
Interface for manual finer control: If you want finer control over the learning rate (say, a manually set constant learning rate), or you want to use the typical lr-dropping technique after a certain number of epochs, please use set_lr_factor() in the YFOptimizer class. E.g. if you want to use a manually set constant learning rate, you can run set_lr_factor(desired_lr / self._lr) before self.step() at each iteration. Or, if you want to always multiply the learning rate originally tuned by YellowFin by a factor of 2.0, you may call optimizer.set_lr_factor(2.0) right after optimizer = YFOptimizer(parameter_list) and before training with YellowFin. More details can be found here. (The lr and mu arguments to YFOptimizer initialization are dummies, kept only for backward compatibility.)
Gradient clipping: The default setting uses adaptive gradient clipping to prevent gradient explosion, thresholding the gradient norm at the square root of our estimated maximal curvature. There are three cases regarding gradient clipping: the default above and the two below. We recommend first turning off gradient clipping, and only turning it on when necessary.
- If you want to manually set the threshold for clipping the gradient, please first use adapt_clip=False to turn off the auto-clipping feature. Then you can either pass the clip_thresh=thresh_on_the_gradient_norm argument when initializing the YFOptimizer, to clip according to your threshold inside YFOptimizer, or clip the gradient outside of YFOptimizer before step() is called.
- If you want to totally turn off gradient clipping in YFOptimizer, please use clip_thresh=None, adapt_clip=False when initializing the YFOptimizer.
Normalization: When using log-probability style losses, please make sure the loss is properly normalized. In some RNN/LSTM cases, the cross_entropy needs to be averaged over the number of samples in a minibatch. Sometimes, in some PyTorch loss functions, it also needs to be averaged over the number of classes and the sequence length of each sample. E.g. in nn.MultiLabelSoftMarginLoss, size_average=True needs to be set.

Non-increasing move: In some rare cases, we have observed that an increasing value of lr * || grad ||, i.e. the move, may result in instability. We implemented an engineering trick to enforce a non-increasing value of lr * || grad ||. The feature is off by default; if it is really necessary, you can turn it on with force_non_inc_step_after_iter=the starting iter from which you want to enforce the non-increasing value. We recommend setting force_non_inc_step_after_iter to at least a few hundred, because some models may need to gradually raise the magnitude of the gradient in the beginning (e.g. a model that is not properly initialized may have near-zero gradients and need iterations to reach a reasonable gradient level).
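
To make the idea of a non-increasing move concrete, here is a minimal sketch of one way to cap the learning rate so that lr * || grad || never exceeds its previous value. This is an illustration only and is not taken from the YFOptimizer source: the function name, the prev_move bookkeeping and the eps guard are all assumptions made for this example.

```python
def cap_lr_for_non_increasing_move(lr, grad_norm, prev_move, eps=1e-15):
    """Return (capped_lr, move) with move = capped_lr * grad_norm <= prev_move.

    Hypothetical helper: `prev_move` is the move from the previous
    iteration, or None if nothing has been recorded yet.
    """
    if prev_move is None:
        # First enforced iteration: there is nothing to compare against.
        return lr, lr * grad_norm
    # Shrink lr just enough that the new move cannot exceed the old one;
    # eps protects the division when the gradient norm is zero.
    capped_lr = min(lr, prev_move / (grad_norm + eps))
    return capped_lr, capped_lr * grad_norm


# Example: the tuned lr is reduced whenever the gradient norm grows.
prev_move = None
history = []
for tuned_lr, grad_norm in [(0.1, 1.0), (0.1, 2.0), (0.2, 0.1), (0.5, 3.0)]:
    used_lr, prev_move = cap_lr_for_non_increasing_move(tuned_lr, grad_norm, prev_move)
    history.append((used_lr, prev_move))
```

In a training loop, a guard like this would only start applying once the iteration count passes the value given to force_non_inc_step_after_iter, which is why a late starting point leaves room for gradients to grow early on.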

Citation

If you use YellowFin in your paper, please cite the paper:

@article{zhang2017yellowfin,
  title={YellowFin and the Art of Momentum Tuning},
  author={Zhang, Jian and Mitliagkas, Ioannis and R{\'e}, Christopher},
  journal={arXiv preprint arXiv:1706.03471},
  year={2017}
}

Acknowledgement

We thank Olexa Bilaniuk, Andrew Drozdov, Paroma Varma and Bryan He, as well as GitHub users @elPistolero and @esvhd, for their help in contributing to and testing the codebase.

Implementation for other platforms

For Tensorflow users, we provide an implementation in the YellowFin Tensorflow Repo.

We thank the contributors for YellowFin in different deep learning frameworks.

yellowfin_pytorch's Issues

Different variance in publication and implementation

In the publication the variance in "Algorithm 3 Gradient variance" is defined as:

However in the PyTorch implementation variance is defined as:

Did you try YellowFin with variance from the publication? Were the results worse? Which definition of the variance did you use to produce results from the publication?

Feature Request: Implement state_dict() / load_state_dict()

Thanks for this code release, I wish more papers did this! It really makes it effortless to try out this new optimizer :)

It'd be great if YellowFin supported the state_dict() and load_state_dict() functions, to maintain a consistent serialization API with the other PyTorch optimizers.

Bad performance on large vision models

Hello there,

I am doing my best to learn how to use this optimizer, as I would very much like to have an auto-tuned optimizer where I do not have to spend endless days fiddling with hyperparameters. I have tried to use YellowFin to learn large vision models such as MobileNet, but my results are always very disappointing as compared to a traditional optimizer such as SGD. I am not so concerned about convergence time as I am about loss/accuracy; I have found that YellowFin tends to converge to a much worse loss/accuracy than my SGD runs do.

I am posting here an example of training MobileNet on the ImageNet dataset with a batch size of 64, comparing the training and testing loss (as well as testing accuracy) of a few epochs of training on MobileNet. In both cases, I have a learning rate schedule applied to set the learning rate factor to 0.3 ^ (epoch // 10), which causes the learning rate to fall to 3/10 of its value every 10 epochs. You can see the effect of this learning rate schedule in the sgd plot fairly easily, the yf plot shows it less clearly. In these figures, the training loss (per minibatch) is shown in blue, while the testing loss (per epoch) is shown in red, with the relevant axis shown on the left. The top-1 and top-5 accuracies on the training dataset are shown in green (per epoch), with their relevant axis given on the right. Other than the optimizer choice, all other training settings are the same, including minibatch size (64), dataset (ImageNet) and model architecture (MobileNet).

Here is a plot for an SGD optimizer run (note that I have this model only partially trained, this is because it has trained enough that we can already see it will converge to a significantly better loss than the YF model did, below):

Here is a plot for a YellowFin optimizer run:

If there are any questions about my methodology I would be happy to explain in greater detail. There is nothing particularly special going on in my model, I am simply trying to determine why YellowFin seems to converge with such poor results.

Assertion Error: assert root.size == 1

I get the following error with the following stack:
optimizer.step()
self.after_apply()
self.get_mu()
assert root.size == 1

It works fine for most of my runs but fails the assertion sometimes. I will try reproducing it in a simpler setting.

illegal memory access

Feel free to close since it seems like a PyTorch bug, but as a heads up, in case others hit the same issue, on some runs I got this:

Traceback (most recent call last):
  File "train.py", line 219, in <module>
    optimizer.step()
  File "/home/grant/repos/aud1/yellowfin.py", line 371, in step
    self.after_apply()
  File "/home/grant/repos/aud1/yellowfin.py", line 276, in after_apply
    self.grad_sparsity()
  File "/home/grant/repos/aud1/yellowfin.py", line 219, in grad_sparsity
    grad_non_zero = grad.nonzero()
RuntimeError: an illegal memory access was encountered

NaN and AssertionError

Thanks for open sourcing the code!

I've tried it on a simple MLP and could not find a set of parameters (lr and mu) that would not yield one of these two errors:

assert root.size == 1 AssertionError

and

numpy.linalg.linalg.LinAlgError: Array must not contain infs or NaNs

Any tips on best practices?

AttributeError: 'YFOptimizer' object has no attribute '_state_checkpoint'

Hi,

The following error pops up when I use yellowfin.py in my setup.

File "/home/bchatter/Documents/workspace/python/async-opt/resnet_cifar_yellowfin/yellowfin.py", line 568, in step
    self.load_state_dict_perturb(copy.deepcopy(self._state_checkpoint) )
AttributeError: 'YFOptimizer' object has no attribute '_state_checkpoint'

I attempted to solve this issue by copying line 543 of yellowfin.py, self._state_checkpoint = copy.deepcopy(self.state_dict() ), to just above line 568. However, with that change the optimizer stops converging (at times it diverges, but it certainly does not converge).

Could you please look at the issue?

Learning Rate Decay

Hi,

For the YellowFin optimizer, do I need to use the learning rate decay trick?
For example, I evaluate the model on dev set and if the performance drops I will halve the learning rate.
This trick works very well for optimizers such as Adam and Adadelta. So will it also work if I switch to the YellowFin optimizer?

Thanks.
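
For reference, the README's set_lr_factor() interface is the documented hook for a schedule like this. Below is a minimal sketch of the bookkeeping such a schedule needs; the HalveOnPlateau class and its method names are hypothetical, and whether this kind of decay actually helps with YellowFin is not something this sketch answers.

```python
class HalveOnPlateau:
    """Track a dev-set loss and shrink an lr factor whenever it gets worse.

    Hypothetical helper, not part of the yellowfin package.
    """

    def __init__(self, factor=1.0, decay=0.5):
        self.factor = factor
        self.decay = decay
        self.best = None  # best (lowest) dev loss seen so far

    def update(self, dev_loss):
        """Record a new dev loss (lower is better) and return the factor."""
        if self.best is not None and dev_loss > self.best:
            # Performance dropped relative to the best result: halve the factor.
            self.factor *= self.decay
        else:
            self.best = dev_loss
        return self.factor


# Intended use after each evaluation on the dev set (not executed here):
#     optimizer.set_lr_factor(schedule.update(dev_loss))
schedule = HalveOnPlateau()
factors = [schedule.update(loss) for loss in [1.0, 0.8, 0.9, 0.7, 0.75]]
```

The factor multiplies whatever learning rate YellowFin tunes on its own, so the schedule scales the tuned value rather than replacing it.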

Does not work with pytorch 0.4

There seem to be two reasons for this:

0.4 introduced 0-dimensional tensors (scalars), and to get their value as a Python float we need to call .item() on them. If we don't (and train on GPU), yellowfin will hold on to tensors on both CPU and GPU and try to do operations on them. This causes an exception since they are on different devices, and that exception is swallowed by the checkpoint restoration mechanism.
Tensors and Variables have been merged in 0.4, so unless the code is changed, yellowfin will hold on to tensors with gradient history, causing a memory leak.

The first issue seems to be quite easy to patch, I can send a pull-request for that part if you want to.
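
A minimal sketch of the kind of patch the first point suggests: convert scalar values to plain Python floats as soon as they are computed, so that no tensor (and no device placement) is kept in the optimizer's bookkeeping. The helper name is hypothetical, and it uses duck typing so the same code path handles plain numbers as well as objects exposing .item().

```python
def as_python_float(value):
    """Convert a 0-dim tensor-like object or a plain number to a Python float.

    Hypothetical helper: objects that expose a callable .item() (as 0-dim
    tensors do in PyTorch 0.4) are unwrapped; anything else goes through float().
    """
    item = getattr(value, "item", None)
    if callable(item):
        return float(item())
    return float(value)


# Intended use when recording statistics (not executed here):
#     grad_norm_squared = as_python_float(torch.sum(grad * grad))
plain = as_python_float(3)
```

Storing only Python floats should also avoid keeping gradient history alive through stored scalars, which relates to the second point, although tensors held elsewhere would still need to be detached separately.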

Why alpha and mu are global, not parameter-wise?

Hi! Thank you for your great work!

Could you tell us why you decided to use one global alpha and one global mu for the whole model instead of creating a separate alpha and mu for each weight matrix in the model? The other approach seems more natural to me, because each weight matrix might have a different distribution of values and of gradient values. Did you consider it? Do you see a reason not to do so?

The nonzero count in grad_sparsity fails if grad is zero

If grad consists only of zeros then torch.nonzero(grad) or grad.nonzero() will return an empty tensor with dim() == 0.

In this case accessing the size tuple fails since it is empty.
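
A minimal sketch of a guard for this case, based only on the behaviour described above (an empty result with dim() == 0). The helper name is hypothetical; it takes the result of grad.nonzero() and relies only on its dim() and size() methods.

```python
def num_nonzero(nonzero_indices):
    """Number of non-zero entries, given the result of grad.nonzero().

    Hypothetical helper: returns 0 when the result is an empty tensor
    with dim() == 0, instead of indexing into an empty size tuple.
    """
    if nonzero_indices.dim() == 0:
        return 0
    # Otherwise the result has one row per non-zero element.
    return nonzero_indices.size()[0]


# Intended use inside grad_sparsity (not executed here):
#     count = num_nonzero(grad.nonzero())
```

If the empty result instead has shape (0, n), as it may in other PyTorch versions, the second branch already returns 0, so the guard should be harmless there.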

LR keeps growing instead of shrinking

Hi,

I'm running into a situation where YellowFin keeps adjusting the LR upward instead of decaying it downward - which, of course, prevents the network from converging. Any idea why it would be happening? Thanks!

I initialize YF like so: optimizer = YFOptimizer(net.parameters(), lr=train_args['lr'])

And the output is...

[epoch 4], [iter 20 / 123], [train main loss 0.15482], [lr 0.057262]
[epoch 4], [iter 40 / 123], [train main loss 0.14807], [lr 0.058468]
[epoch 4], [iter 60 / 123], [train main loss 0.14976], [lr 0.059635]
[epoch 4], [iter 80 / 123], [train main loss 0.14867], [lr 0.060768]
[epoch 4], [iter 100 / 123], [train main loss 0.14935], [lr 0.061881]
[epoch 4], [iter 120 / 123], [train main loss 0.14653], [lr 0.062980]

----------------------------------------------------------------------------------------------
[epoch 5], [iter 20 / 123], [train main loss 0.14709], [lr 0.064228]
[epoch 5], [iter 40 / 123], [train main loss 0.14188], [lr 0.065275]
[epoch 5], [iter 60 / 123], [train main loss 0.16187], [lr 0.066297]
[epoch 5], [iter 80 / 123], [train main loss 0.15231], [lr 0.067289]
[epoch 5], [iter 100 / 123], [train main loss 0.15639], [lr 0.068227]
[epoch 5], [iter 120 / 123], [train main loss 0.15515], [lr 0.069117]

----------------------------------------------------------------------------------------------
[epoch 6], [iter 20 / 123], [train main loss 0.13752], [lr 0.070135]
[epoch 6], [iter 40 / 123], [train main loss 0.13210], [lr 0.071002]
[epoch 6], [iter 60 / 123], [train main loss 0.13821], [lr 0.071850]
[epoch 6], [iter 80 / 123], [train main loss 0.13456], [lr 0.072690]
[epoch 6], [iter 100 / 123], [train main loss 0.13225], [lr 0.073533]
[epoch 6], [iter 120 / 123], [train main loss 0.13367], [lr 0.074379]

YF doesn't work for the cv task

Hi, I tried your optimizer instead of SGD for this challenge https://www.kaggle.com/c/planet-understanding-the-amazon-from-space and get this kind of train/valid curves https://s.mail.ru/BFbQ/1K6w1bqD7 , which is obv awful (tried different setups but no success). SGD & plateau scheduler reach 0.08258 val loss.

What data do you need to investigate such bad performance? For example, the learning rate changes between 0.1 and 3 (!).

too many things are kept as state

I was trying to experiment with the effects of changing the clipping threshold during training, but I noticed that my changes were getting overridden because this is kept as state. Also some other user set options are kept in the state, but probably should not be lest it be difficult to change them during training.

'YFOptimizer' object has no attribute '_h_min' when calling optimizer.state_dict()

When I try to save the optimizer.state_dict() before the first training step, this error occurred. Seems that we should add self._h_min = 0.0 and self._h_max = 0.0 in __init__() ?
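
The suggested fix can be illustrated with a toy class that initializes every field read by state_dict() in __init__(), so that saving works before the first step. TinyTuner is a hypothetical stand-in written for this example and is not YFOptimizer; only the _h_min and _h_max names come from the error above.

```python
class TinyTuner:
    """Toy stand-in (not YFOptimizer) showing eager initialization of state."""

    def __init__(self):
        # Initialize every field that state_dict() reads, so that saving
        # works even before the first call to step().
        self._h_min = 0.0
        self._h_max = 0.0
        self._iter = 0

    def step(self, curvature):
        # Track the running minimum and maximum of a curvature estimate.
        self._h_min = curvature if self._iter == 0 else min(self._h_min, curvature)
        self._h_max = max(self._h_max, curvature)
        self._iter += 1

    def state_dict(self):
        return {"h_min": self._h_min, "h_max": self._h_max, "iter": self._iter}

    def load_state_dict(self, state):
        self._h_min = state["h_min"]
        self._h_max = state["h_max"]
        self._iter = state["iter"]


fresh_state = TinyTuner().state_dict()  # works before any step()
```

A fresh instance and an instance restored from a saved dict then go through the same code path, which also keeps the state_dict() / load_state_dict() round trip simple.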

Python3 changes for word_language_model

Hi there,

Thanks for sharing this amazing piece of work.

I made some small changes to the word_language_model's main.py script to make it work with python3. The changes were essentially related to print function calls and a couple of minor indentations.

Tested and the script now runs successfully with Python 3.6. Would you be interested in merging it?

Thanks.

gradient clipping doesn't work with dict params

When using per-layer LR I get an exception:

  File "/home/tyantov/workspace/kaggle-planet/planet/train.py", line 375, in main
    tr.run(config)
  File "/home/tyantov/workspace/kaggle-planet/planet/train.py", line 183, in run
    train_score = boilerplate.train(train_loader, self._model, criterion, optimizer, epoch)
  File "/home/tyantov/workspace/kaggle-planet/planet/boilerplate.py", line 217, in train
    optimizer.step()
  File "/home/tyantov/workspace/kaggle-planet/planet/generic_models/yellowfin.py", line 202, in step
    torch.nn.utils.clip_grad_norm(self._var_list, self._clip_thresh)
  File "/home/tyantov/anaconda2/lib/python2.7/site-packages/torch/nn/utils/clip_grad.py", line 17, in clip_grad_norm
    parameters = list(filter(lambda p: p.grad is not None, parameters))
  File "/home/tyantov/anaconda2/lib/python2.7/site-packages/torch/nn/utils/clip_grad.py", line 17, in <lambda>
    parameters = list(filter(lambda p: p.grad is not None, parameters))
AttributeError: 'dict' object has no attribute 'grad'

Code:

    if exact_layers:
        logger.info('Learning exact layers, number=%d', len(exact_layers))
        parameters = []
        for i, layer in enumerate(exact_layers):
            if isinstance(layer, tuple) and len(layer) == 2:
                layer, multiplier = layer
                init_multiplier = 1
            elif isinstance(layer, tuple) and len(layer) == 3:
                layer, init_multiplier, multiplier = layer
            else:
                multiplier = 1
                init_multiplier = 1
            lr = config.lr * multiplier
            init_lr = config.lr * multiplier * init_multiplier
            logger.info('Layer=%d, lr=%.5f', i, init_lr)
            parameters.append({'params': layer.parameters(), 'lr': init_lr, 'after_warmup_lr': lr})
    else:
        logger.info('Optimizing all parameters, lr=%.5f', config.lr)
        parameters = model.parameters()

Exact line: parameters.append({'params': layer.parameters(), 'lr': init_lr,
Standard optimizers work with dict params; YF does not.
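
One possible workaround is to flatten the parameter groups into a plain list of parameters before clipping. The sketch below is an assumption about how this could be handled, not the repository's fix; flatten_params is a hypothetical helper, and it only relies on each dict having a 'params' entry, as in the code above.

```python
def flatten_params(params):
    """Yield individual parameters from a plain iterable of parameters
    or from a list of param-group dicts with a 'params' entry."""
    for entry in params:
        if isinstance(entry, dict):
            # A param group such as {'params': [...], 'lr': ...}.
            for p in entry["params"]:
                yield p
        else:
            yield entry


# Intended use inside step(), following the call in the traceback
# (not executed here):
#     flat = list(flatten_params(self._var_list))
#     torch.nn.utils.clip_grad_norm(flat, self._clip_thresh)
# Note that layer.parameters() is a generator and can be consumed only
# once, so materialize it with list(...) when building the groups.
```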