
Comments (11)

XuezheMax commented on June 9, 2024

I ran experiments with Adam and RAdam on ResNet-18. I decoupled the weight decay for both of them, so they are actually AdamW and RAdamW. The lr schedule is the same as AdaBelief's: decay by 0.1 at epochs 70 and 80, with 90 epochs of training in total. The implementation is from this repo.
Here are the updated results (top-1 accuracy, 3 runs for each experiment):

Method    wd=1e-2    wd=1e-4
AdamW     69.73      67.57
RAdamW    69.80      67.68

I think these results suggest that the ImageNet baselines need to be updated to use the same weight decay of 1e-2.
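
For readers who want to reproduce this setup, a minimal sketch is below, assuming PyTorch's built-in AdamW and MultiStepLR. The base learning rate and the training loop are placeholders; the thread only specifies the weight decay, the decay milestones (epochs 70 and 80, factor 0.1), and the 90-epoch budget.

```python
import torch
from torchvision.models import resnet18

model = resnet18()

# Decoupled weight decay (AdamW); wd = 1e-2 as in the table above.
# The base lr of 1e-3 is a placeholder -- the thread does not state it.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# Same schedule as AdaBelief: multiply the lr by 0.1 at epochs 70 and 80,
# for 90 epochs in total.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[70, 80], gamma=0.1)

for epoch in range(90):
    # ... one full training epoch over ImageNet goes here ...
    scheduler.step()
```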


juntang-zhuang commented on June 9, 2024

It seems the effect of weight decay dominates the effect of the optimizer in this case. What learning rate schedule did you use? Does that influence the results?


XuezheMax commented on June 9, 2024

I used the same lr schedule: decay by 0.1 at epochs 70 and 80.


juntang-zhuang commented on June 9, 2024

Thanks for your feedback. Just curious, what hardware did you use? I'm quite surprised that you can finish 3 runs within 12 hours (since your earliest post on weight decay here). Typically one round of ImageNet training takes me 3 to 4 days with 4 GPUs.


XuezheMax commented on June 9, 2024

I ran with 8 V100s (from AWS), and it took around 10 hours to complete the 90-epoch training.
One comment that might be useful for you: CPU memory is sometimes the bottleneck when running ImageNet experiments, since the dataset is very large.


juntang-zhuang commented on June 9, 2024

Thanks for the suggestions and experiments. That might be the reason; I feel quite stuck experimenting with my 1080 GPU.


juntang-zhuang commented on June 9, 2024

It surprises me that RAdam does not outperform Adam, since RAdam uses decoupled weight decay. Do you have any results for AdamW with a larger weight decay? Based on your results, I somewhat doubt whether decoupled weight decay is actually helpful. BTW, is the result reported in the Apollo paper achieved by Apollo or ApolloW?


XuezheMax commented on June 9, 2024

Oh, sorry for the confusion. In my results above, Adam is actually AdamW. Without decoupling the weight decay, Adam works significantly worse than AdamW.

For the results in the Apollo paper, I did not decouple the weight decay for Apollo. I tried ApolloW, but its performance is similar to Apollo's.
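
To make the coupled-vs-decoupled distinction explicit, here is a schematic single-parameter update in the spirit of Loshchilov & Hutter's AdamW. The function is only an illustration (moment tracking and bias correction are omitted), not any library's implementation; m_hat and v_hat stand for the usual bias-corrected Adam moment estimates.

```python
import torch

def adam_update(p, m_hat, v_hat, lr=1e-3, wd=1e-2, eps=1e-8, decoupled=True):
    """One schematic Adam/AdamW parameter update (moment tracking omitted)."""
    if decoupled:
        # AdamW: the decay shrinks the weights directly, outside the adaptive
        # rescaling -- just like weight decay in plain SGD.
        return p - lr * m_hat / (v_hat.sqrt() + eps) - lr * wd * p
    # Adam + L2 penalty: the decay term is folded into the gradient (strictly it
    # enters before the moment estimates; shown after them here for brevity),
    # so it is divided by sqrt(v_hat) + eps like the rest of the gradient.
    return p - lr * (m_hat + wd * p) / (v_hat.sqrt() + eps)
```

With m_hat = 0, the decoupled update shrinks every weight by the same factor lr * wd, whereas the coupled update's shrinkage is lr * wd / (sqrt(v_hat) + eps) and therefore depends on the second-moment estimate.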


juntang-zhuang commented on June 9, 2024

Thanks a lot. I think your results suggest that the weight decay was not properly set for the AdamW family, and the baseline needs to be improved.

Looking at the literature, I found something weird: [1] also uses AdamW and sets the weight decay to 5e-2, which is also a large value, yet they achieve only 67.93. Though the authors claim they performed a grid search, I'm not sure whether their grid includes the 1e-2 you used here. I'll take a more careful look later to see if some training details differ from yours.

BTW, regarding Apollo: is it because Apollo is scale-variant, so its coupled weight decay behaves like decoupled weight decay, as in SGD? Any idea why Apollo is not influenced much by decoupling the weight decay?

[1] Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks


XuezheMax commented on June 9, 2024

I also tried wd=5e-2 for AdamW, and the results are even slightly better than with wd=1e-2. So I guess the models in [1] were not properly trained.

For Apollo and SGD, I think one possible reason that decoupled weight decay is not as influential is that they do not use second-order momentum. There is a new ICLR 2021 submission about stable weight decay in Adam; maybe we can get some ideas from it :-)
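
A tiny numerical illustration of that point (the values are arbitrary, and the penalty's effect on the moment estimates themselves is ignored): with Adam's adaptive denominator, a coupled L2 penalty shrinks each coordinate by a different factor, whereas SGD-style or decoupled decay shrinks every coordinate uniformly, which is presumably why decoupling matters less for optimizers without a second moment.

```python
import torch

lr, wd, eps = 0.1, 1e-2, 1e-8
v_hat = torch.tensor([1e-6, 1e-2, 1.0])   # per-coordinate second-moment estimates

# Coupled L2 penalty under Adam: the wd * p term is divided by sqrt(v_hat) + eps,
# so the effective shrinkage factor varies wildly across coordinates.
coupled_shrink = lr * wd / (v_hat.sqrt() + eps)
print(coupled_shrink)     # roughly [1.0, 0.01, 0.001]

# Decoupled decay (AdamW) or SGD + L2: the same shrinkage factor everywhere.
decoupled_shrink = torch.full_like(v_hat, lr * wd)
print(decoupled_shrink)   # [0.001, 0.001, 0.001]
```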


soloice commented on June 9, 2024

> Thanks for your feedback. Just curious, what hardware did you use? I'm quite surprised that you can finish 3 runs within 12 hours (since your earliest post on weight decay here). Typically one round of ImageNet training takes me 3 to 4 days with 4 GPUs.

This could be reasonable. According to this benchmark, V100s are 5x faster than 1080Tis.

