
Comments (9)

githubharald avatar githubharald commented on May 26, 2024

Hi,

we tried AdaHessian for training an image segmentation model.
However, even after trying a lot of possible hyperparameter combinations, we were not able to outperform Adam with AdaHessian.
Compared with Adam, AdaHessian (1) took longer to converge, and (2) its validation metric (intersection over union) was 2-3% worse.
Interestingly, the same behavior was observed when using another Newton-like method (Apollo).

See the plot for some more details (x: epochs, y: intersection over union).

[Plot: cmp_adahessian]

It would be interesting to hear about more experiences with AdaHessian on computer vision tasks.
Has anyone figured out how to outperform Adam on a segmentation task?

Best
Harald

from adahessian.

amirgholami avatar amirgholami commented on May 26, 2024

Hi Harald,

Thanks for sharing the results. One thing that is a bit surprising is that ADAM gets to 0.84 in 1 epoch. Is there any learning rate schedule used? Do you happen to have a repo for the above experiments so we can have a closer look?

Best,
-Amir


githubharald avatar githubharald commented on May 26, 2024

Hi Amir,

Since it is a commercial project, I can't share any code, model details, or data, sorry.

Some more words about the plot: Adam starts at ~12% in the first epoch, then goes to 78%, and after the 3rd epoch it reaches >80%. The "global" learning rate is constant for the first epochs (red dots indicate global learning rate reductions). The feature extractor is pre-trained, which might be one reason why ~80% is reached so quickly; another is that segmenting the objects at a coarse level is simple (e.g. just classify pixels near the image center as foreground). The difficult part starts once the model reaches ~80%.

Have you ever encountered an effect like the one shown in the plot when training other models with AdaHessian?

Sometimes there is a "trick" you must be aware of to get things going - e.g. for the Apollo optimizer I was told how important a warmup phase is.
Without warmup, I got 82% final IOU and reaching 80% took 200 epochs; with warmup, I get 89% final IOU and reach 80% in only 5 epochs.
So, maybe there is also a trick to get AdaHessian going and I'm simply not aware of it.


amirgholami avatar amirgholami commented on May 26, 2024

Hi Harald,

We have never observed something like this. Without seeing an example myself it is hard to pinpoint the problem, but here is what I would check:

1. Based on your comment about Apollo, it seems that for this problem the first few epochs are critical. What happens if you use the same learning rate warmup schedule for AdaHessian? Warmup may be helpful because in the initial iterations the estimate of the Hessian is still settling.
2. I would check the adaptive learning rate for each layer to make sure that none of them has abnormal variance compared to the other layers.
3. If any layer shows abnormal behavior, you may want to try using Adam for those layers and AdaHessian for the others.
4. What happens if you run AdaHessian on the already-trained model? Does the accuracy still decrease?
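Point 2 above can be sketched in plain Python. This is a minimal illustration, not part of AdaHessian itself: the function names, the ratio threshold, and the example second-moment values are assumptions; in practice, each layer's `v` would be read from the optimizer's per-parameter state (for AdaHessian, the smoothed squared Hessian diagonal).

```python
import statistics

def adaptive_steps(second_moments, lr=0.15, eps=1e-4):
    # Effective per-layer step size of an Adam/AdaHessian-style update:
    # lr / (sqrt(v) + eps), where v is the layer's mean second-moment estimate.
    return {name: lr / (v ** 0.5 + eps) for name, v in second_moments.items()}

def flag_outlier_layers(steps, ratio=10.0):
    # Flag layers whose effective step deviates from the across-layer
    # median by more than `ratio` in either direction.
    med = statistics.median(steps.values())
    return sorted(name for name, s in steps.items()
                  if s > ratio * med or s < med / ratio)
```

A layer flagged this way would be a candidate for point 3, i.e. optimizing it with Adam while keeping AdaHessian for the rest.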

Please let us know how it goes.

Best,
-Amir


githubharald avatar githubharald commented on May 26, 2024

Hi Amir,

two observations:

  1. Setting eps to a large value such as 0.1 helps avoid divergence (which sometimes happened when using AdaHessian). It seems to prevent overly large steps and pushes the update a bit closer to the steepest-descent direction.
  2. Using a warmup phase, starting from a low learning rate of 1e-9 and increasing to a rather high value of 0.1, gives much faster convergence, reaching 80% after 6 epochs. So the same "trick" as for Apollo worked. It seems this warmup time is needed to get a reasonably good approximation of the Hessian. I can't say anything about the final accuracy yet.
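Both observations can be illustrated with a small, self-contained sketch (plain Python; the function names and the geometric shape of the ramp are assumptions - the thread only fixes the two endpoints, 1e-9 and 0.1):

```python
import math

def adaptive_step(grad, v, lr=0.15, eps=0.1):
    # Per-coordinate magnitude of an Adam/AdaHessian-style update,
    # lr * |g| / (sqrt(v) + eps). A large eps caps each step at lr/eps,
    # so a near-zero curvature estimate can no longer blow the update up,
    # and the direction moves closer to plain steepest descent.
    return [lr * abs(g) / (h ** 0.5 + eps) for g, h in zip(grad, v)]

def warmup_lr(step, warmup_steps, start_lr=1e-9, base_lr=0.1):
    # Geometric (log-linear) ramp from start_lr to base_lr over
    # warmup_steps, then constant. Early steps stay tiny while the
    # Hessian-diagonal estimate is still settling.
    if step >= warmup_steps:
        return base_lr
    frac = step / warmup_steps
    return math.exp((1 - frac) * math.log(start_lr) + frac * math.log(base_lr))
```

In PyTorch, a ramp like this can be wired in via `torch.optim.lr_scheduler.LambdaLR` by returning `warmup_lr(step, n) / base_lr` as the multiplier.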

I should add that we're using the implementation from davda54; however, when compared on some rather simple functions, it behaves the same as your (original) implementation.

Best,
Harald


githubharald avatar githubharald commented on May 26, 2024

The accuracy is now also almost the same as with Adam (in fact, AdaHessian is even slightly better, but this might be due to the stochastic nature of the training process; multiple training runs would be needed to say more).
So, to conclude: AdaHessian is able to match Adam's convergence rate and accuracy. For our task it was important to set the parameters as described above.


amirgholami avatar amirgholami commented on May 26, 2024

Many thanks, Harald, for the update. That is great to know.


aman2304 avatar aman2304 commented on May 26, 2024

@githubharald would it be possible to post the Apollo IOU plot with warmup?


githubharald avatar githubharald commented on May 26, 2024

I no longer have the data, but with warmup it looked somewhat similar to the Adam plot.

