Comments (9)
Hi,
We tried AdaHessian for training an image segmentation model.
However, even after trying many hyperparameter combinations, we were not able to outperform Adam with AdaHessian.
Compared with Adam, AdaHessian 1.) took longer to converge, and 2.) its validation metric (intersection over union) was 2-3% worse.
Interestingly, the same behavior was observed when using another Newton-like method (Apollo).
See the plot for some more details (x: epochs, y: intersection over union).
It would be interesting to hear about more experiences with computer vision tasks.
And maybe someone has also figured out how to outperform Adam on a segmentation task?
Best
Harald
from adahessian.
Hi Harald,
Thanks for sharing the results. One thing that is a bit surprising is that ADAM gets to 0.84 in 1 epoch. Is there any learning rate schedule used? Do you happen to have a repo for the above experiments so we can have a closer look?
Best,
-Amir
Hi Amir,
As it is a commercial project, I can't provide any code, model details, or data, sorry.
Some more words about the plot: Adam starts at ~12% in the first epoch, then goes to 78%, and after the 3rd epoch it exceeds 80%. The "global" learning rate is constant for the first epochs (red dots indicate a global learning-rate reduction). The feature extractor is pre-trained, which might explain why ~80% is reached so quickly; another reason could be that segmenting the objects at a coarse level is simple (e.g. just classify pixels near the image center as foreground). The difficult part starts once the model reaches ~80%.
Did you ever encounter an effect like the one shown in the plot when training other models with AdaHessian?
Sometimes there is a "trick" you must be aware of to get things going - e.g. for the Apollo optimizer I was told how important a warmup phase is.
Without this warmup, I got 82% final IOU and it took 200 epochs to reach 80%; with warmup, I get 89% final IOU and reach 80% in only 5 epochs.
So, maybe there is also a trick to get AdaHessian going and I'm simply not aware of it.
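The warmup trick mentioned above can be sketched as a simple linear learning-rate ramp. This is a hypothetical, dependency-free sketch of the general idea; the function name, the start value, and the schedule shape are my own choices, not details from Apollo or AdaHessian:

```python
# Minimal sketch of a linear learning-rate warmup (hypothetical helper,
# not part of any optimizer's API).
def warmup_lr(step, warmup_steps, base_lr, start_lr=1e-9):
    """Linearly ramp the learning rate from start_lr up to base_lr
    over the first warmup_steps steps, then hold it at base_lr."""
    if step >= warmup_steps:
        return base_lr
    frac = step / warmup_steps
    return start_lr + frac * (base_lr - start_lr)
```

In a training loop one would call this each step and write the result into the optimizer's parameter groups (e.g. `group["lr"] = warmup_lr(step, ...)` in PyTorch), before any later decay schedule takes over.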
Hi Harald,
We have never observed something like this. Without seeing an example myself it is hard to pinpoint the problem, but here is what I would check:
1- Based on your comment about Apollo, it seems that for this problem the first few epochs are critical. What happens if you use the same learning-rate warmup schedule for AdaHessian? Having warmup may be helpful since in the beginning iterations the estimate of the Hessian is still being updated.
2- I would check the adaptive learning rate for each of the layers to make sure that none of them has abnormal variance compared to other layers.
3- If any of the above layers shows abnormal behavior, you may want to try using ADAM for those layers and AdaHessian for the others.
4- What happens if you use AdaHessian on the already-trained model? Does the accuracy still decrease?
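Point 2 above can be turned into a quick diagnostic: for each layer, compute the effective per-parameter step size lr / (sqrt(v) + eps) from that layer's Hessian-diagonal moving average v, then flag layers whose mean step deviates strongly from the median across layers. This is a hypothetical sketch; the function names, the example v values, and the 10x outlier threshold are my own assumptions, not part of AdaHessian's API:

```python
import math

def effective_step_sizes(hessian_diag_ema, lr=0.1, eps=1e-4):
    """Per-parameter effective step size lr / (sqrt(v) + eps) for one layer.
    hessian_diag_ema: moving average of squared Hessian-diagonal estimates."""
    return [lr / (math.sqrt(v) + eps) for v in hessian_diag_ema]

def flag_abnormal_layers(per_layer_v, lr=0.1, eps=1e-4, ratio=10.0):
    """Flag layers whose mean effective step size differs from the
    median layer by more than the given ratio (hypothetical threshold)."""
    means = {name: sum(effective_step_sizes(v, lr, eps)) / len(v)
             for name, v in per_layer_v.items()}
    median = sorted(means.values())[len(means) // 2]
    return [name for name, m in means.items()
            if m > ratio * median or m < median / ratio]
```

For a real PyTorch model, the per-layer v values would be read out of the optimizer's state dict per parameter group; the flagged layers are candidates for the mixed ADAM/AdaHessian treatment in point 3.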
Please let us know how it goes.
Best,
-Amir
Hi Amir,
two observations:
- Setting eps to a value of 0.1 helps avoid divergence (which sometimes happened when using AdaHessian). It seems to prevent overly large steps and pushes the update direction a bit closer to the steepest-descent direction.
- Using a warmup phase, starting with a low learning rate of 1e-9 and increasing to a rather high value of 0.1, gives much faster convergence, reaching 80% after 6 epochs. So the same "trick" as for Apollo worked. It seems this warmup time is needed to get a somewhat reasonable approximation of the Hessian. I can't say anything about the final accuracy yet.
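The eps effect described in the first observation can be seen in a tiny numeric sketch of a generic diagonal-preconditioned update lr * g / (sqrt(v) + eps) (a stand-in for AdaHessian's update rule, not its actual implementation; the example g and v values are made up): as eps grows, the preconditioner's influence shrinks, so the largest steps get smaller and the update components become less unbalanced, i.e. closer to the plain gradient direction.

```python
import math

def preconditioned_step(g, v, lr=0.1, eps=1e-4):
    """Generic diagonal-preconditioned update: lr * g_i / (sqrt(v_i) + eps).
    g: gradient entries; v: moving average of squared Hessian-diagonal
    estimates (hypothetical values, for illustration only)."""
    return [lr * gi / (math.sqrt(vi) + eps) for gi, vi in zip(g, v)]

g = [1.0, 1.0]    # equal gradient components
v = [1e-6, 4.0]   # one near-zero curvature estimate -> huge step if eps is tiny

small_eps = preconditioned_step(g, v, eps=1e-4)  # very unbalanced, large steps
large_eps = preconditioned_step(g, v, eps=0.1)   # damped, closer to gradient
```

With eps = 1e-4 the first component's step is hundreds of times larger than the second's; with eps = 0.1 both the maximum step size and the imbalance between components drop sharply, which matches the observed reduction in divergence.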
I should add that we're using the implementation from davda54; however, compared on some rather simple functions, it behaves the same as your (original) implementation.
Best,
Harald
Also, the accuracy is now almost the same as with Adam (in fact, AdaHessian is even slightly better, but this might be due to the stochastic nature of the training process; multiple training runs would be needed to say more).
So, to conclude: AdaHessian can match Adam's convergence rate and accuracy. For our task it was important to set the parameters as described above.
Many thanks Harald for the update. That is great to know.
@githubharald would it be possible to post the Apollo IOU plot with warmup?
I no longer have the data, but it looked somewhat similar to the Adam plot when using warmup.