Comments (9)
Hi,
We tried AdaHessian for training an image segmentation model.
However, even after trying many hyperparameter combinations, we were not able to outperform Adam with AdaHessian.
Compared with Adam, AdaHessian 1.) took longer to converge, and 2.) its validation metric (intersection over union) was 2-3% worse.
Interestingly, the same behavior was observed when using another Newton-like method (Apollo).
See the plot for some more details (x: epochs, y: intersection over union).
It would be interesting to hear about more experiences with computer vision tasks.
And maybe someone has also figured out how to outperform Adam on a segmentation task?
Best
Harald
from adahessian.
Hi Harald,
Thanks for sharing the results. One thing that is a bit surprising is that ADAM gets to 0.84 in 1 epoch. Is there any learning rate schedule used? Do you happen to have a repo for the above experiments so we can have a closer look?
Best,
-Amir
Hi Amir,
As it is a commercial project, I can't provide any code, model details, or data, sorry.
Some more words about the plot: Adam starts at ~12% in the first epoch, then goes to 78%, and after the 3rd epoch it exceeds 80%. The "global" learning rate is constant for the first epochs (red dots indicate a global learning-rate reduction). The feature extractor is pre-trained, which might explain why ~80% is reached so quickly; another reason could be that segmenting the objects at a coarse level is simple (e.g. just classify pixels near the image center as foreground). The difficult part starts once the model reaches ~80%.
Did you ever encounter an effect like the one shown in the plot when training other models with AdaHessian?
Sometimes there is a "trick" you must be aware of to get things going - e.g. for the Apollo optimizer I was told how important a warmup phase is.
Without this warmup, I got 82% final IOU and it took 200 epochs to reach 80%; with warmup, I get 89% final IOU and reach 80% in only 5 epochs.
So, maybe there is also a trick to get AdaHessian going and I'm simply not aware of it.
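The warmup trick mentioned above can be sketched as a simple linear learning-rate ramp. This is a hypothetical, dependency-free sketch of the general idea; the function name, the start value, and the schedule shape are my own choices, not details from Apollo or AdaHessian:

```python
# Minimal sketch of a linear learning-rate warmup (hypothetical helper,
# not part of any optimizer's API).
def warmup_lr(step, warmup_steps, base_lr, start_lr=1e-9):
    """Linearly ramp the learning rate from start_lr up to base_lr
    over the first warmup_steps steps, then hold it at base_lr."""
    if step >= warmup_steps:
        return base_lr
    frac = step / warmup_steps
    return start_lr + frac * (base_lr - start_lr)
```

In a training loop one would call this each step and write the result into the optimizer's parameter groups (e.g. `group["lr"] = warmup_lr(step, ...)` in PyTorch), before any later decay schedule takes over.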
Hi Harald,
We have never observed something like this. Without seeing an example myself it is hard to pinpoint the problem, but here is what I would check:
1- Based on your comment about Apollo, it seems that for this problem the first few epochs are critical. What happens if you use the same learning-rate warmup schedule for AdaHessian? Having warmup may be helpful since in the beginning iterations the estimate of the Hessian is still being updated.
2- I would check the adaptive learning rate for each of the layers to make sure that none of them has abnormal variance compared to other layers.
3- If any of the above layers shows abnormal behavior, you may want to try using ADAM for those layers and AdaHessian for the others.
4- What happens if you use AdaHessian on the already-trained model? Does the accuracy still decrease?
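Point 2 above can be turned into a quick diagnostic: for each layer, compute the effective per-parameter step size lr / (sqrt(v) + eps) from that layer's Hessian-diagonal moving average v, then flag layers whose mean step deviates strongly from the median across layers. This is a hypothetical sketch; the function names, the example v values, and the 10x outlier threshold are my own assumptions, not part of AdaHessian's API:

```python
import math

def effective_step_sizes(hessian_diag_ema, lr=0.1, eps=1e-4):
    """Per-parameter effective step size lr / (sqrt(v) + eps) for one layer.
    hessian_diag_ema: moving average of squared Hessian-diagonal estimates."""
    return [lr / (math.sqrt(v) + eps) for v in hessian_diag_ema]

def flag_abnormal_layers(per_layer_v, lr=0.1, eps=1e-4, ratio=10.0):
    """Flag layers whose mean effective step size differs from the
    median layer by more than the given ratio (hypothetical threshold)."""
    means = {name: sum(effective_step_sizes(v, lr, eps)) / len(v)
             for name, v in per_layer_v.items()}
    median = sorted(means.values())[len(means) // 2]
    return [name for name, m in means.items()
            if m > ratio * median or m < median / ratio]
```

For a real PyTorch model, the per-layer v values would be read out of the optimizer's state dict per parameter group; the flagged layers are candidates for the mixed ADAM/AdaHessian treatment in point 3.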
Please let us know how it goes.
Best,
-Amir
Hi Amir,
two observations:
- Setting eps to a value of 0.1 helps avoid divergence (which sometimes happened when using AdaHessian). It seems to prevent overly large steps and pushes the update direction a bit closer to the steepest-descent direction.
- Using a warmup phase, starting with a low learning rate of 1e-9 and increasing to a rather high value of 0.1, gives much faster convergence, reaching 80% after 6 epochs. So the same "trick" as for Apollo worked. It seems this warmup time is needed to get a somewhat reasonable approximation of the Hessian. I can't say anything about the final accuracy yet.
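The eps effect described in the first observation can be seen in a tiny numeric sketch of a generic diagonal-preconditioned update lr * g / (sqrt(v) + eps) (a stand-in for AdaHessian's update rule, not its actual implementation; the example g and v values are made up): as eps grows, the preconditioner's influence shrinks, so the largest steps get smaller and the update components become less unbalanced, i.e. closer to the plain gradient direction.

```python
import math

def preconditioned_step(g, v, lr=0.1, eps=1e-4):
    """Generic diagonal-preconditioned update: lr * g_i / (sqrt(v_i) + eps).
    g: gradient entries; v: moving average of squared Hessian-diagonal
    estimates (hypothetical values, for illustration only)."""
    return [lr * gi / (math.sqrt(vi) + eps) for gi, vi in zip(g, v)]

g = [1.0, 1.0]    # equal gradient components
v = [1e-6, 4.0]   # one near-zero curvature estimate -> huge step if eps is tiny

small_eps = preconditioned_step(g, v, eps=1e-4)  # very unbalanced, large steps
large_eps = preconditioned_step(g, v, eps=0.1)   # damped, closer to gradient
```

With eps = 1e-4 the first component's step is hundreds of times larger than the second's; with eps = 0.1 both the maximum step size and the imbalance between components drop sharply, which matches the observed reduction in divergence.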
I should add that we're using the implementation from davda54; however, compared on some rather simple functions, it behaves the same as your (original) implementation.
Best,
Harald
Also, the accuracy is now almost the same as with Adam (in fact, AdaHessian is even slightly better, but this might be due to the stochastic nature of the training process; multiple training runs would be needed to say more).
So, to conclude: AdaHessian can match Adam's convergence rate and accuracy. For our task it was important to set the parameters as described above.
Many thanks Harald for the update. That is great to know.
@githubharald would it be possible to post the Apollo IOU plot with warmup?
I no longer have the data, but it looked somewhat similar to the Adam plot when using warmup.