amirgholami / adahessian

ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning

License: MIT License

Python 96.24% Shell 0.40% C++ 0.85% Cuda 1.67% Lua 0.27% Cython 0.57%
second-order-optimization hessian hessian-free optimizer adahessian

Introduction

AdaHessian is a second-order optimizer for neural network training, built on PyTorch. The library supports training convolutional neural networks (image_classification) and transformer-based models (transformer). Our TensorFlow implementation is adahessian_tf.

Please see this paper for more details on the AdaHessian algorithm.

Performance on Rastrigin and Rosenbrock Functions:

Below is the convergence of AdaHessian on the Rastrigin and Rosenbrock functions, compared with SGD and Adam. Please see the pytorch-optimizer repo for comparisons with other optimizers.

[Animated comparison for each loss function: AdaHessian vs. SGD vs. Adam]
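As a toy illustration of this comparison, here is a minimal, hypothetical sketch of running AdaHessian on the 2-D Rosenbrock function, assuming the pip package described under "Installation -- Pip" below; the starting point and step count are arbitrary choices, not values from the paper.

import torch
import torch_optimizer as optim

# Rosenbrock function: f(x, y) = (1 - x)^2 + 100 (y - x^2)^2, minimum at (1, 1).
def rosenbrock(p):
    x, y = p
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

xy = torch.tensor([-2.0, 2.0], requires_grad=True)  # arbitrary starting point
optimizer = optim.Adahessian([xy], lr=1.0)

for step in range(500):
    optimizer.zero_grad()
    loss = rosenbrock(xy)
    loss.backward(create_graph=True)  # needed for the Hessian-diagonal estimate
    optimizer.step()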

Installation -- Git (Recommended)

Please first clone the AdaHessian library to your local system:

git clone https://github.com/amirgholami/adahessian.git

You can import the optimizer as follows:

from optim_adahessian import Adahessian
...
model = YourModel()
optimizer = Adahessian(model.parameters())
...
for input, output in data:
  optimizer.zero_grad()
  loss = loss_function(output, model(input))
  loss.backward(create_graph=True)  # You need this line for Hessian backprop
  optimizer.step()
...

Please note that optim_adahessian is in the image_classification folder. We have also adapted the AdaHessian implementation to be compatible with the fairseq repo, which can be used for NLP tasks; that version can be found in the transformer folder.

Installation -- Pip

If you are interested in installing the library through pip, we recommend doing so via the pytorch-optimizer package as follows:

$ pip install torch_optimizer

import torch_optimizer as optim

# model = ...
optimizer = optim.Adahessian(
    model.parameters(),
    lr=1.0,
    betas=(0.9, 0.999),
    eps=1e-4,
    weight_decay=0.0,
    hessian_power=1.0,
)
loss_fn(model(input), target).backward(create_graph=True)  # create_graph=True is necessary for Hessian backprop
optimizer.step()

For different kernel sizes (e.g., matrix, Conv1D, Conv2D, etc.)

We found it would be helpful to add instructions on how to adapt AdaHessian to your own models and problems. Hence, we have added a prototype version of AdaHessian, along with some useful comments, in the instruction folder.
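To give a flavor of the idea, here is a simplified sketch (not the library's exact API) of how the Hutchinson sample hv * vi can be kept per element for low-dimensional parameters and averaged over the spatial dimensions of a Conv2D kernel, mirroring the get_trace pattern in optim_adahessian.py:

import torch

def spatially_averaged_diag(hv, vi):
    # hv: Hessian-vector product for one parameter, vi: the Rademacher probe vector.
    if hv.dim() <= 2:  # bias, LayerNorm, or linear weight: keep a per-element estimate
        return torch.abs(hv * vi)
    if hv.dim() == 4:  # Conv2D kernel (out, in, kH, kW): average over the kH x kW block
        return torch.abs(hv * vi).sum(dim=[2, 3], keepdim=True) / vi[0, 1].numel()
    raise NotImplementedError("add a branch matching your parameter's shape")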

External implementations and discussions

We are thankful to all the researchers who have extended AdaHessian for different purposes or analyzed it. We include the following links in case you are interested in learning more about AdaHessian.

| Description | Link | New Features |
| --- | --- | --- |
| External PyTorch library implementation | Link | -- |
| Reddit discussion | Link | -- |
| Fast.ai discussion | Link | -- |
| Best-Deep-Learning-Optimizers code | Link | -- |
| ada-hessian code | Link | Supports delayed Hessian update |
| JAX code | Link | -- |
| AdaHessian analysis | Link | Analyzes AdaHessian on a 2D example |

Citation

AdaHessian has been developed as part of the following paper. We would appreciate it if you cite the paper when you find the library useful for your work:

@article{yao2020adahessian,
  title={ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning},
  author={Yao, Zhewei and Gholami, Amir and Shen, Sheng and Keutzer, Kurt and Mahoney, Michael W},
  journal={AAAI (Accepted)},
  year={2021}
}

Copyright

THIS SOFTWARE AND/OR DATA WAS DEPOSITED IN THE BAIR OPEN RESEARCH COMMONS REPOSITORY ON 02/27/23.

adahessian's People

Contributors

amirgholami, dependabot-preview[bot], joaompereira, nestordemeure, sincerass, yaozhewei


adahessian's Issues

Reasonable learning rate range for adahessian?

Hi
For training a chatbot, I want to switch to adahessian from adam as the final step in fine-tuning of my model. I have a question about what is a reasonable learning rate to use for adahessian. For adam I used fairly small learning rates - starting at 2e-5 and reducing from there - which worked pretty well. However, as I understand it, adahessian preconditions the parameter update like an inverse Hessian does in a Newton step. But in a Newton step for a quadratic model, the ideal learning rate is 1.0. So I assume that I should be using a much larger learning rate for adahessian than I have been using for adam. Do you have any suggestions based on your experience?
Thanks!
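For reference, the Newton-step intuition in the question can be written out (a standard fact about quadratics, not specific to AdaHessian). For a quadratic loss f(θ) = ½ θᵀHθ − bᵀθ with H positive definite, the preconditioned update is

\theta^{+} = \theta - \eta\, H^{-1} \nabla f(\theta) = \theta - \eta\, H^{-1} (H\theta - b),

which lands exactly on the minimizer H⁻¹b when η = 1. This is why learning rates on the order of 1.0 (as in the pip example above), rather than Adam-style values such as 2e-5, are the natural starting point for a Hessian-preconditioned method.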

Help using adahessian in TensorFlow

Hi, I'm trying to use adahessian in TensorFlow for a simple regression experiment but having trouble.

I have a simple example in this google colab notebook: https://colab.research.google.com/drive/1EbKZ0YHhyu6g8chFlJD74dzWrbo82mbV?usp=sharing

I am getting the following error

ValueError: Variable <tf.Variable 'dense_12/kernel:0' shape=(1, 100) dtype=float32> has `None` for gradient. Please make sure that all of your ops have a gradient defined (i.e. are differentiable). Common ops without gradient: K.argmax, K.round, K.eval.

In the notebook I first write a little training loop that works with standard optimisers such as Adam. See "example training with Adam"

Then in the next section "example training with Adahessian" I basically copy the previous code and make a few modifications to try and get Adahessian to work.

Specifically, I only changed

from

optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)

to

optimizer = AdaHessian(learning_rate=0.01)

and from

grads = tape.gradient(current_loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))

to

grads, Hessian = optimizer.get_gradients_hessian(current_loss, model.trainable_weights)
optimizer.apply_gradients_hessian(zip(grads, Hessian, model.trainable_weights))

Can anyone see what I'm doing wrong? Thanks!

AdaHessian tensorflow implementation

Hi,
First of all, really nice work!
I do wanna try your 2nd order optimizer now.
But I only know TensorFlow and all my existing models are implemented in TensorFlow.
Could you provide a tensorflow version?

It seems you only need to implement the method below:
def get_trace(grad, var)
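For what it's worth, here is a minimal sketch of how the Hutchinson estimate behind get_trace could be written with nested GradientTapes in TensorFlow. This is an illustration of the idea only, not the repo's adahessian_tf code; the helper name and arguments are made up.

import tensorflow as tf

def hutchinson_diag_estimate(model, loss_fn, x, y):
    # One Hutchinson sample of |diag(H)| via double backprop (nested tapes).
    params = model.trainable_variables
    with tf.GradientTape() as outer:
        with tf.GradientTape() as inner:
            loss = loss_fn(y, model(x))
        grads = inner.gradient(loss, params)
        # Rademacher probe vectors (+1 or -1 with equal probability)
        vs = [tf.cast(tf.random.uniform(tf.shape(g), 0, 2, dtype=tf.int32) * 2 - 1, g.dtype)
              for g in grads]
        g_dot_v = tf.add_n([tf.reduce_sum(g * v) for g, v in zip(grads, vs)])
    # Differentiating g^T v once more gives the Hessian-vector products H v.
    hvs = outer.gradient(g_dot_v, params,
                         unconnected_gradients=tf.UnconnectedGradients.ZERO)
    return [tf.abs(h * v) for h, v in zip(hvs, vs)]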

About how to group my params

It seems the average Hessian is used for all of the parameters. I want to know how to group my parameters, e.g., take one output channel as one block and use the block average as the diagonal estimate for its values.
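One way to do this kind of grouping (a hypothetical sketch, not code from this repo): treat each output channel of a Conv2D weight as one block and broadcast the block average back over the block.

import torch

def channel_averaged_diag(hv, vi):
    # hv: Hessian-vector product for a Conv2D weight (out, in, kH, kW), vi: Rademacher probe.
    abs_sample = torch.abs(hv * vi)                            # elementwise Hutchinson sample
    block_mean = abs_sample.mean(dim=[1, 2, 3], keepdim=True)  # one value per output channel
    return block_mean.expand_as(hv)                            # same average for the whole block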

Images

Hey,

How do you get all these plots shown in the GitHub repo?

Can you please point out the resource?

What is the correct code for AllenNLP/NER task?

Hello @amirgholami ,

I'm super excited about this optimizer. Thank you!

I want to use it in an NER task using AllenNLP. But I'm confused because the code differs between the image_classification and transformer examples.

At https://github.com/amirgholami/adahessian/blob/5c176cdcbeacff1d9edfc77062d0bc7594f326a9/image_classification/optim_adahessian.py in function get_trace, we have:

hutchinson_trace = []
for hv, vi in zip(hvs, v):
    param_size = hv.size()
    if len(param_size) <= 2:  # for 0/1/2D tensor
        tmp_output = torch.abs(hv * vi)
        hutchinson_trace.append(tmp_output)  # Hessian diagonal block size is 1 here.
    elif len(param_size) == 4:  # Conv kernel
        tmp_output = torch.abs(torch.sum(torch.abs(
            hv * vi), dim=[2, 3], keepdim=True)) / vi[0, 1].numel()  # Hessian diagonal block size is 9 here: torch.sum() reduces dims 2/3.
        hutchinson_trace.append(tmp_output)


While in https://github.com/amirgholami/adahessian/blob/bd9f5a6760bf1ba4474e2e8a5fad237a1577d989/transformer/fairseq/optim/adahessian.py we have:

hutchinson_trace = []
for hv, vi in zip(hvs, v):
    param_size = hv.size()
    if len(param_size) <= 1:  # for bias and LN
        tmp_output = torch.abs(hv * vi) + 0.
        hutchinson_trace.append(tmp_output)
    elif len(param_size) == 2:  # matrix
        tmp_output1 = torch.abs((hv * vi + 0.)).view(-1, self.block_length)  # flatten to N times self.block_length
        tmp_output2 = torch.abs(torch.sum(tmp_output1, dim=[1])).view(-1) / float(self.block_length)
        tmp_output3 = tmp_output2.repeat_interleave(self.block_length).view(param_size)
        hutchinson_trace.append(tmp_output3)

Which one should I choose?

In my NLP task I have parameters whose number of dimensions varies between 1 and 4. For 3-D parameters, neither branch would match in the loop. Is this correct?
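For illustration only, a hypothetical way to handle 3-D parameters (e.g., Conv1D kernels of shape (out_channels, in_channels, kernel_size)) would be to average the Hutchinson sample over the kernel dimension, by analogy with the 4-D Conv2D branch above. This is not code from the repo.

import torch

def conv1d_diag_estimate(hv, vi):
    # Average over the kernel dimension (dim 2) and keep one value per
    # (out_channel, in_channel) pair, broadcastable back to the weight shape.
    kernel_size = vi.shape[-1]
    return torch.abs(hv * vi).sum(dim=2, keepdim=True) / kernel_size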

Benchmark on Object detectors

Hi,

First of all, thanks to Zhewei, Amir, and others for the great contribution. I was introduced to AdaHessian (and PyHessian) through your recent talk. I see that you have benchmarked image classification among CV tasks. Have you tried this out on object detectors, or other CV tasks as well?

Thanks,
Sam

Optimizer is not respecting "trainable" attribute of variables.

The current version does not respect untrainable variables. It can be fixed by placing a simple if-statement. However, I'm not sure if this is the best place for it, so I'm not suggesting it as a PR but reporting the issue here.

        eagerly_outside_functions = ops.executing_eagerly_outside_functions()
        update_ops = []
        with ops.name_scope(name or self._name, skip_on_eager=True):
            for grad, hess, var in grads_hessian_and_vars:

                # FIX UNTRAINABLE
                if var.trainable:
                    def _assume_mirrored(grad, hess):
                        if isinstance(grad, ds_values.PerReplica):
                            return ds_values.Mirrored(grad.values), ds_values.Mirrored(hess.values)
                        return grad, hess

                    grad, hess = nest.map_structure(_assume_mirrored, grad, hess)
                    # Colocate the update with variables to avoid unnecessary communication
                    # delays. See b/136304694.
                    with distribution.extended.colocate_vars_with(var):
                        with ops.name_scope("update" if eagerly_outside_functions else
                                    "update_" + var.op.name, skip_on_eager=True):
                            update_ops.extend(distribution.extended.update(
                                    var, apply_grad_to_update_var, args=(grad, hess), group=False))

Performance issue about tf.function

Hello! Our static bug checker has found a performance issue in adahessian_tf/run_experiments.py and adahessian_tf/cifar_training_tools.py: cifar_training is repeatedly called in a for loop, but a tf.function-decorated function step is defined and called inside cifar_training.

In that case, when cifar_training is called in a loop, the function step will create a new graph every time, which can trigger a tf.function retracing warning.

Here is the TensorFlow documentation supporting this.

Briefly, for better efficiency, it's better to use:

@tf.function
def inner():
    pass

def outer():
    inner()  

than:

def outer():
    @tf.function
    def inner():
        pass
    inner()

Looking forward to your reply.

too many abs ?

While looking at your averaging code (lines 89 to 144 of this file) I noticed that you compute abs(sum(abs(hv * vi))).
As far as I understand it, the outer absolute value is not needed, as you are already summing positive terms.

Also, note that if you are using a Rademacher distribution, you can drop the vi term from torch.sum(torch.abs(hv * vi)) since abs(vi) == 1 (but keeping it in place might make the algorithm easier to read, as it keeps the code close to the math).

Use of AdaHessian with batched training data?

Hi
I recently started using the version of AdaHessian from https://github.com/jettify/pytorch-optimizer in the facebookresearch parlai system to see how it works for training chatbots. I am not very experienced in the discipline so please excuse my clumsy use of the terminology here. It seems the approaches for training they use divide the training data into minibatches. In a given training epoch, they cycle through the minibatches where for each minibatch they compute and backpropagate the loss for that minibatch to get the gradient of the loss with respect to model parameters and then do a gradient descent step to update model parameters. I haven’t seen any discussion of using batches with AdaHessian. Does that mean that AdaHessian doesn’t work with this batching approach, and all the training samples should be used in the computation of loss and gradient of the loss?

Also, can you please confirm that the version of AdaHessian in pytorch-optimizer is the most current version of the code?

Thanks!

Replace numpy power by TF pow

denom = np.power(math_ops.sqrt(v / bias_correct2), self.hessian_power) + coefficients['epsilon']

Do not use numpy functions within a tf.function-decorated function. Use the TensorFlow implementation if possible.


        denom = tf.math.pow(math_ops.sqrt(v / bias_correct2), self.hessian_power) + coefficients['epsilon']

Settings on ImageNet

Hello,

I'm a little confused about your experimental settings on ImageNet. Could you please clarify the following questions?

1/ The initial learning rate is set to 0.15. That is to say, weight decay args.wd / args.weight_decay = 1e-4 / 0.15 on ImageNet. Is that right?

2/ Two lr schedules have been studied in this paper, i.e., the step decay schedule and the plateau-based schedule, but only the one that leads to the better result is reported. Regarding Fig. A.9, the plateau-based schedule seems to be better than the standard step decay schedule for AdaHessian on ImageNet. May I know the best Top-1 accuracy obtained with your method using the step decay schedule? Also, could you share the hyper-parameter settings of the plateau-based schedule in PyTorch? Do you use all default hyper-parameters?

Many thanks!

Use of FP16 in backward with create_graph = True?

Hi
I have a quick question. For your transformer or any other application, have you used FP16 when getting gradients from a backward call? In the model I am working with, for any scale factor on the loss that I’ve tried, backward seems to give reasonable gradients when I don’t set create_graph to True. But when I do set it to true, while some of the gradients are the same as with it set to False, many others show up as nan’s. All seems OK when I use FP32 operations, but I’d like to get FP16’s advantages in GPU memory/speed.
Any suggestions you can provide would be appreciated!

Can this deal with complex numbers?

Hi authors,

I intended to use this method on complex numbers and it failed with an error message like:

File "optimizer.py", line 433, in <listcomp>
    * torch.randint_like(
RuntimeError: check_random_bounds handles only integral, floating-point and boolean types

I'm wondering if it's possible to improve this for complex numbers? Thanks.

Ni

Inconsistency between the paper and training scripts on NMT tasks

On page 16 of the newest version, it mentions that,

We set dropout as 0.0 for Transformer base/small model.

However --dropout 0.3 is used in

--dropout 0.3 --attention-dropout 0.1 --relu-dropout 0.1 \

More importantly, for the AdamW learning rate, the paper adopts the lr from this work, which uses

lr = 7e-4 / 5e-4 for Transformer-Base/Big, respectively

While lr=0.0015 is used in

--lr 0.0015 --min-lr 1e-9 \

I would be grateful if the original training parameters could be provided for reproducing the results.

Scalability Question

Hi there,

thank you for making your code available. You use a Jacobian-vector product with torch.autograd.grad() to implement the Hessian-free product Hz. I'm not sure how the operation using autograd isn't O(n²): it seems like it would compute the full Hessian and then multiply it by the z vector. The usual Hessian-free product is Hv ≈ (∇f(x+εv) − ∇f(x))/ε, which requires a second forward and backward pass, and that isn't what you have. Is there something I'm missing? Is there any computational-time comparison between your algorithm and first-order optimizers?

Moreover, you mentioned backpropagating g^T z, but I don't see that in the code. I think autograd doesn't support backpropagating anything beyond gradients at the moment, and there is no Hessian propagation in PyTorch.

Thank you.
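For reference, the usual autograd trick avoids forming the Hessian: differentiating the scalar g^T z a second time yields Hz with one extra backward pass, so each product costs O(n) rather than O(n²). A minimal sketch of this standard double-backprop pattern (not the exact code in this repo):

import torch

def hvp(loss, params, zs):
    # Hessian-vector product H z via double backprop; the full Hessian is never formed.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    g_dot_z = sum((g * z).sum() for g, z in zip(grads, zs))
    return torch.autograd.grad(g_dot_z, params)  # d(g^T z)/d(theta) = H z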

Alternative to Rademacher distribution

Hello,

First, congratulations on developing AdaHessian, it is a great idea!

Second, have you experimented with alternatives to the Rademacher distribution?
A uniform or Gaussian distribution should also work and, depending on the characteristics of the Hessian, might be a better default.

Have a good day,
Nestor
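For context, Hutchinson's estimator only requires probe vectors with zero mean and E[v vᵀ] = I, so Rademacher and Gaussian probes are both valid choices. A tiny illustrative sketch of the two samplers (not code from the repo):

import torch

def rademacher_like(p):
    return torch.randint_like(p, low=0, high=2) * 2.0 - 1.0  # entries are +1 or -1

def gaussian_like(p):
    return torch.randn_like(p)  # zero mean, unit variance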

Language Modelling code

Apologies if I have this wrong, but is there code for the language modelling experiments? I think that /transformer only contains the NMT experiments. Thanks.

Object Detection

Can the optimizer be used for object detection? I tested it myself and it seems that there will be an error.

Weird behavior of AdaHessian on ResNeXt-50

Hi,

Thanks for this great work. Recently, we tried to train ResNeXt-50 on ImageNet classification using AdaHessian. The implementation we used is from https://github.com/davda54/ada-hessian.

However, I got some weird observations. Please see the training log:

All epochs used the same settings: adahessian, lr=0.15, betas=(0.9, 0.999), eps=1e-2, lr decay=milestone [40, 80], decay rate=0.1, warmup=100, init_lr=1e-3, wd=1e-3 (decoupled).

| Epoch | Train loss | Train top1 | Train top5 | Time (s) | Test loss | Test top1 | Test top5 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 6.1249 | 2.74% | 8.40% | 9660.5 | 4.7754 | 10.54% | 27.53% |
| 2 | 4.2148 | 18.27% | 38.85% | 9638.9 | 3.4256 | 27.41% | 53.10% |
| 3 | 3.3622 | 30.28% | 55.08% | 9635.2 | 2.7773 | 38.40% | 65.36% |
| 4 | 2.9959 | 36.21% | 61.72% | 9636.2 | 2.6380 | 40.47% | 67.98% |
| 5 | 2.8171 | 39.26% | 64.87% | 9630.8 | 2.5880 | 41.73% | 68.91% |
| 6 | 2.7149 | 41.07% | 66.66% | 9640.7 | 2.3805 | 45.68% | 72.20% |
| 7 | 2.6456 | 42.30% | 67.90% | 9639.8 | 5.2944 | 13.36% | 30.77% |
| 8 | 2.5855 | 43.46% | 68.86% | 9637.7 | 14.9700 | 0.14% | 0.49% |
| 9 | 2.5401 | 44.36% | 69.65% | 9642.6 | 8.2867 | 0.10% | 0.50% |
| 10 | 2.5080 | 45.03% | 70.24% | 9633.9 | 11.4105 | 0.10% | 0.50% |

The best test result was reached at epoch 6 (loss 2.3805, top1 45.68%, top5 72.20%) and did not improve afterwards.

We see that for the first 6 epochs AdaHessian worked well. From the 7th epoch on, the training loss still decreased normally, but the test loss increased and the test accuracy declined rapidly. We have tried several hyper-parameters and different random seeds, but this always happens.

We provided the details of our setting below for your reference.
The implementation of ResNext-50 is the standard one in PyTorch. The training is performed across 8 V100 GPUs, with total batch size 256 (32 per GPU).
We have tried to search the hyper-parameters: lr in {0.1, 0.15}, eps in {1e-2, 1e-4}, weight decay in {1e-4, 2e-4, 4e-4, 8e-4, 1e-3}. For other hyper-parameters, we used the default values.
We also applied a linear warmup of the learning rate over the first 100 steps; otherwise AdaHessian crashed at the beginning of model training.

Compatibility with other PyTorch optimizers

Hi Amir,
AdaHessian sounds really promising! Is this talk still happening?

Anyways, I noticed the signature of the step method in AdaHessian is different from other optimizers, because it requires the list of parameters and gradients as an argument. I wonder if you could instead do it directly using the .grad property of the parameters. I think for the loss you just need loss.backward(retain_graph=True, create_graph=True) instead of only loss.backward(). Then, to make sure the user actually did this when backpropagating the loss, you could check whether each .grad property has a .grad_fn, and if not, raise an error asking the user to use loss.backward(retain_graph=True, create_graph=True).
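A minimal sketch of the check suggested above (a hypothetical helper, not part of the library):

def check_grads_have_graph(params):
    # If backward() was called without create_graph=True, the .grad tensors carry no
    # graph (grad_fn is None) and second derivatives cannot be taken through them.
    for p in params:
        if p.grad is not None and p.grad.grad_fn is None:
            raise RuntimeError(
                "Gradient has no grad_fn; call loss.backward(retain_graph=True, "
                "create_graph=True) so the optimizer can backpropagate through the gradients."
            )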

Possible to use with PyTorch Lightning?

Is it possible to use this library with PyTorch Lightning? If so, could you please provide an example?

Using PyTorch Lightning in 'manual mode' with
self.manual_backward(loss, create_graph=True)

was the closest I got, but it still wouldn't work. It ran for a while but crashed after a few batches saying

RuntimeError: Gradient tensor 2 does not have grad_fn. When calling loss.backward(), make sure the option create_graph is set to True.

(even though I did set this)

Error using adahessian in PyTorch

Hi,

I've tried using adahessian as a drop-in replacement for adadelta in the PyTorch mnist example (with loss.backward(create_graph=True)), but this produces the error:

NameError: name 'gradsH' is not defined

This variable looks to be undefined in instruction/adahessian.py — is there something I'm missing?

Thanks!
