
Comments (4)

nzmora commented on August 26, 2024

I'm happy you're now getting good results.
I want to explain my choice of LR. Like most other hyper-parameters, the LR is born of a trial-and-error process. I started out with the LR published in the paper, but since that didn't produce the expected Top1 results, I searched for nearby values. Of course, other parameters play into this as well (e.g. the LR decay policy), and the fact that I'm constraining myself to a specific random seed (seed=0, because of the '--deterministic' switch) doesn't help much.
So that's how I arrived at LR=0.4 (and now 0.3 -- I will commit the change today). It's amazing (and frustrating) how much of the training process is just alchemy.
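
For reference, here is a minimal sketch of what a fixed-seed, deterministic run typically involves in PyTorch (illustrative only - not necessarily exactly what Distiller's --deterministic switch does internally):

import random
import numpy as np
import torch

def set_deterministic(seed=0):
    # Fix every RNG that affects weight init, data shuffling and augmentation,
    # so repeated runs with the same seed see the same numbers.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Ask cuDNN for deterministic kernels (slower, but reproducible on the
    # same hardware; results can still differ across cuDNN versions).
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False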

Eric-mingjie commented on August 26, 2024

With batch size 256, the accuracy is 91.87%. With batch size 128, the accuracy is 92.4%. Still not on par with the claimed 93% :-)

nzmora commented on August 26, 2024

Hi,
 
In the Distiller model-zoo you can find a link to the baseline model that we trained.
The Top1 that this model reports when evaluated with PyTorch 0.4 is 92.87% (not the 92.97% we reported, but I assume this is because we tested on an earlier version of PyTorch, or maybe I made a typo).

$ time python3 compress_classifier.py --arch resnet56_cifar  ../../../data.cifar10 --resume=checkpoint.resnet56_cifar_baseline.pth.tar.1 -e
10000 samples (256 per mini-batch)
Test: [   10/   39]    Loss 0.355382    Top1 92.773438    Top5 99.531250
Test: [   20/   39]    Loss 0.363244    Top1 92.578125    Top5 99.648438
Test: [   30/   39]    Loss 0.351659    Top1 92.929688    Top5 99.713542
Test: [   40/   39]    Loss 0.360662    Top1 92.870000    Top5 99.740000
==> Top1: 92.870    Top5: 99.740    Loss: 0.361
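
For reference, the evaluation run above amounts to roughly the following in plain PyTorch (a sketch only; the checkpoint written by compress_classifier.py may store additional fields, and the 'state_dict' key name is an assumption):

import torch

def evaluate_checkpoint(model, ckpt_path, test_loader, device="cuda"):
    # Load the saved weights into the model ('state_dict' key is assumed).
    ckpt = torch.load(ckpt_path, map_location=device)
    model.load_state_dict(ckpt["state_dict"])
    model.to(device).eval()

    correct1 = correct5 = total = 0
    with torch.no_grad():
        for images, targets in test_loader:
            images, targets = images.to(device), targets.to(device)
            outputs = model(images)
            # Top-1 / Top-5 accuracy over the 10,000 test samples.
            _, top5 = outputs.topk(5, dim=1)
            correct1 += (top5[:, 0] == targets).sum().item()
            correct5 += (top5 == targets.unsqueeze(1)).any(dim=1).sum().item()
            total += targets.size(0)
    return 100.0 * correct1 / total, 100.0 * correct5 / total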

Here's the checkpoint file, in case you can't find it.

But since you're having trouble reproducing this result, I went and re-ran it. I used the following command line:

time python3 compress_classifier.py --arch resnet56_cifar  ../../../data.cifar10 -p=50 --lr=0.4 --epochs=180 --compress=../pruning_filters_for_efficient_convnets/resnet56_cifar_baseline_training.yaml -j=1 --deterministic

And I did get lower performance than we claimed (92.54 instead of 92.97):

2018-07-02 15:09:38,946 - --- validate (epoch=179)-----------
2018-07-02 15:09:38,946 - 5000 samples (256 per mini-batch)
2018-07-02 15:09:40,623 - ==> Top1: 92.420    Top5: 99.780    Loss: 0.353

2018-07-02 15:09:40,624 - Saving checkpoint to: logs/2018.07.02-140142/checkpoint.pth.tar
2018-07-02 15:09:40,669 - --- test ---------------------
2018-07-02 15:09:40,669 - 10000 samples (256 per mini-batch)
2018-07-02 15:09:43,797 - ==> Top1: 92.540    Top5: 99.770    Loss: 0.382

So, I changed the starting LR to 0.3 and ran the training once again, and now I got the results I expected: 92.85.
Here's the command line:
time python3 compress_classifier.py --arch resnet56_cifar ../../../data.cifar10 -p=50 --lr=0.3 --epochs=180 --compress=../pruning_filters_for_efficient_convnets/resnet56_cifar_baseline_training.yaml -j=1 --deterministic

2018-07-02 16:36:31,555 - --- validate (epoch=179)-----------
2018-07-02 16:36:31,555 - 5000 samples (256 per mini-batch)
2018-07-02 16:36:33,121 - ==> Top1: 91.520    Top5: 99.680    Loss: 0.387

2018-07-02 16:36:33,123 - Saving checkpoint to: logs/2018.07.02-152746/checkpoint.pth.tar
2018-07-02 16:36:33,159 - --- test ---------------------
2018-07-02 16:36:33,159 - 10000 samples (256 per mini-batch)
2018-07-02 16:36:36,194 - ==> Top1: 92.850    Top5: 99.780    Loss: 0.364

Here's a graph showing the training behavior of validation/Top1:
[graph: validation Top1 over the training epochs]

So why can't you reproduce these results? I don't know for sure. I suppose a difference in cuDNN versions, or something similar, could cause this. It is unfortunate, and interesting. If you want to dig deeper into this issue, I'd be happy to upload all of the artifacts of my two runs (logs, tfevents file, checkpoint) to AWS so you can download them and study the differences from your system.
I'm pasting my system's configuration from the logs - maybe you can see a meaningful difference:

2018-07-02 15:27:46,877 - Number of CPUs: 88
2018-07-02 15:27:50,540 - Number of GPUs: 4
2018-07-02 15:27:50,540 - CUDA version: 8.0.61
2018-07-02 15:27:50,541 - CUDNN version: 7102
2018-07-02 15:27:50,541 - Kernel: 4.13.0-36-generic
2018-07-02 15:27:50,541 - Python: 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609]
2018-07-02 15:27:50,542 - PyTorch: 0.4.0
2018-07-02 15:27:50,542 - Numpy: 1.14.3
2018-07-02 15:27:50,577 - Active Git branch: master
2018-07-02 15:27:50,588 - Git commit: 19d33c50122fd7d082ca18dca544fcd09e57733d
2018-07-02 15:27:50,588 - App args: ['compress_classifier.py', '--arch', 'resnet56_cifar', '../../../data.cifar10', '-p=50', '--lr=0.3', '--epochs=180', '--compress=../pruning_filters_for_efficient_convnets/resnet56_cifar_baseline_training.yaml', '-j=1', '--deterministic']
2018-07-02 15:27:50,590 - ==> using cifar10 dataset
2018-07-02 15:27:50,590 - => creating resnet56_cifar model for CIFAR10
2018-07-02 15:27:54,745 - Optimizer Type: <class 'torch.optim.sgd.SGD'>
2018-07-02 15:27:54,745 - Optimizer Args: {'lr': 0.3, 'weight_decay': 0.0001, 'momentum': 0.9, 'nesterov': False, 'dampening': 0}
2018-07-02 15:27:56,423 - Dataset sizes:
        training=45000
        validation=5000
        test=10000
2018-07-02 15:27:56,423 - Reading compression schedule from: ../pruning_filters_for_efficient_convnets/resnet56_cifar_baseline_training.yaml
2018-07-02 15:27:56,441 - Schedule contents:
{
  "lr_schedulers": {
    "training_lr": {
      "class": "StepLR",
      "step_size": 45,
      "gamma": 0.1
    }
  },
  "policies": [
    {
      "lr_scheduler": {
        "instance_name": "training_lr"
      },
      "starting_epoch": 35,
      "ending_epoch": 200,
      "frequency": 1
    }
  ]
}
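
In plain PyTorch terms, the logged optimizer settings and schedule correspond roughly to the sketch below (illustrative only; Distiller drives the scheduler through the policy above, starting at epoch 35, so the exact epochs at which the LR drops can differ from a vanilla StepLR):

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 10)  # stand-in for the resnet56_cifar model

# SGD settings taken from the 'Optimizer Args' log line above.
optimizer = optim.SGD(model.parameters(), lr=0.3, momentum=0.9,
                      weight_decay=1e-4, nesterov=False, dampening=0)

# StepLR with step_size=45 and gamma=0.1, as in the schedule dump above.
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=45, gamma=0.1)

for epoch in range(180):
    # ... one training epoch over CIFAR-10 would run here ...
    scheduler.step()  # divides the LR by 10 every 45 epochs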

Eric-mingjie commented on August 26, 2024

Thanks for the reply.
Yeah, the result with an initial learning rate of 0.3 looks good. But I still don't understand why you chose this learning rate schedule - I think the original paper uses 0.1 as the starting learning rate.
