
Comments (4)

nzmora commented on August 26, 2024

I'm happy you're now getting good results.
I want to explain my choice of LR. Like most other hyper-parameters, the LR is born of a trial-and-error process. I started out with the LR published in the paper, but since that didn't produce the expected Top1 results, I searched for nearby values. Of course, other parameters play into this as well (e.g. the LR decay policy), and the fact that I'm constraining myself to a specific random seed (seed=0, because of the '--deterministic' switch) doesn't help much.
So that's how I arrived at LR=0.4 (and now 0.3 -- I will commit the change today). It's amazing (and frustrating) how much of the training process is just alchemy.
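
For reference, here is a minimal sketch of what a fixed-seed, deterministic run typically involves in PyTorch (illustrative only - not necessarily exactly what Distiller's --deterministic switch does internally):

import random
import numpy as np
import torch

def set_deterministic(seed=0):
    # Fix every RNG that affects weight init, data shuffling and augmentation,
    # so repeated runs with the same seed see the same numbers.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Ask cuDNN for deterministic kernels (slower, but reproducible on the
    # same hardware; results can still differ across cuDNN versions).
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False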

Eric-mingjie commented on August 26, 2024

With batch size 256, the accuracy is 91.87%. With batch size 128, the accuracy is 92.4%. Still not on par with the claimed 93% :-)

nzmora commented on August 26, 2024

Hi,
 
In the Distiller model-zoo you can find a link to the baseline model that we trained.
The Top1 that this model reports when evaluated with PyTorch 0.4 is 92.87% (not the 92.97% we reported, but I assume this is because we tested on an earlier version of PyTorch, or maybe I made a typo).

$ time python3 compress_classifier.py --arch resnet56_cifar  ../../../data.cifar10 --resume=checkpoint.resnet56_cifar_baseline.pth.tar.1 -e
10000 samples (256 per mini-batch)
Test: [   10/   39]    Loss 0.355382    Top1 92.773438    Top5 99.531250
Test: [   20/   39]    Loss 0.363244    Top1 92.578125    Top5 99.648438
Test: [   30/   39]    Loss 0.351659    Top1 92.929688    Top5 99.713542
Test: [   40/   39]    Loss 0.360662    Top1 92.870000    Top5 99.740000
==> Top1: 92.870    Top5: 99.740    Loss: 0.361
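
For reference, the evaluation run above amounts to roughly the following in plain PyTorch (a sketch only; the checkpoint written by compress_classifier.py may store additional fields, and the 'state_dict' key name is an assumption):

import torch

def evaluate_checkpoint(model, ckpt_path, test_loader, device="cuda"):
    # Load the saved weights into the model ('state_dict' key is assumed).
    ckpt = torch.load(ckpt_path, map_location=device)
    model.load_state_dict(ckpt["state_dict"])
    model.to(device).eval()

    correct1 = correct5 = total = 0
    with torch.no_grad():
        for images, targets in test_loader:
            images, targets = images.to(device), targets.to(device)
            outputs = model(images)
            # Top-1 / Top-5 accuracy over the 10,000 test samples.
            _, top5 = outputs.topk(5, dim=1)
            correct1 += (top5[:, 0] == targets).sum().item()
            correct5 += (top5 == targets.unsqueeze(1)).any(dim=1).sum().item()
            total += targets.size(0)
    return 100.0 * correct1 / total, 100.0 * correct5 / total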

Here's the checkpoint file, in case you can't find it.

But since you're having trouble reproducing this result, I went and re-ran it. I used the following command line:

time python3 compress_classifier.py --arch resnet56_cifar  ../../../data.cifar10 -p=50 --lr=0.4 --epochs=180 --compress=../pruning_filters_for_efficient_convnets/resnet56_cifar_baseline_training.yaml -j=1 --deterministic

And I did get lower performance than we claimed (92.54 instead of 92.97):

2018-07-02 15:09:38,946 - --- validate (epoch=179)-----------
2018-07-02 15:09:38,946 - 5000 samples (256 per mini-batch)
2018-07-02 15:09:40,623 - ==> Top1: 92.420    Top5: 99.780    Loss: 0.353

2018-07-02 15:09:40,624 - Saving checkpoint to: logs/2018.07.02-140142/checkpoint.pth.tar
2018-07-02 15:09:40,669 - --- test ---------------------
2018-07-02 15:09:40,669 - 10000 samples (256 per mini-batch)
2018-07-02 15:09:43,797 - ==> Top1: 92.540    Top5: 99.770    Loss: 0.382

So, I changed the starting LR to 0.3 and ran the training once again, and now I got the results I expected: 92.85.
Here's the command line:
time python3 compress_classifier.py --arch resnet56_cifar ../../../data.cifar10 -p=50 --lr=0.3 --epochs=180 --compress=../pruning_filters_for_efficient_convnets/resnet56_cifar_baseline_training.yaml -j=1 --deterministic

2018-07-02 16:36:31,555 - --- validate (epoch=179)-----------
2018-07-02 16:36:31,555 - 5000 samples (256 per mini-batch)
2018-07-02 16:36:33,121 - ==> Top1: 91.520    Top5: 99.680    Loss: 0.387

2018-07-02 16:36:33,123 - Saving checkpoint to: logs/2018.07.02-152746/checkpoint.pth.tar
2018-07-02 16:36:33,159 - --- test ---------------------
2018-07-02 16:36:33,159 - 10000 samples (256 per mini-batch)
2018-07-02 16:36:36,194 - ==> Top1: 92.850    Top5: 99.780    Loss: 0.364

Here's a graph showing the training behavior of validation/Top1:
[graph: validation Top1 over the training epochs]

So why can't you reproduce these results? I don't know for sure. I suppose a difference in cuDNN versions, or something similar, could cause this. It is unfortunate, and interesting. If you want to dig deeper into this issue, I'd be happy to upload all of the artifacts of my two runs (logs, tfevents file, checkpoint) to AWS so you can download them and study the differences from your system.
I'm pasting my system's configuration from the logs - maybe you can see a meaningful difference:

2018-07-02 15:27:46,877 - Number of CPUs: 88
2018-07-02 15:27:50,540 - Number of GPUs: 4
2018-07-02 15:27:50,540 - CUDA version: 8.0.61
2018-07-02 15:27:50,541 - CUDNN version: 7102
2018-07-02 15:27:50,541 - Kernel: 4.13.0-36-generic
2018-07-02 15:27:50,541 - Python: 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609]
2018-07-02 15:27:50,542 - PyTorch: 0.4.0
2018-07-02 15:27:50,542 - Numpy: 1.14.3
2018-07-02 15:27:50,577 - Active Git branch: master
2018-07-02 15:27:50,588 - Git commit: 19d33c50122fd7d082ca18dca544fcd09e57733d
2018-07-02 15:27:50,588 - App args: ['compress_classifier.py', '--arch', 'resnet56_cifar', '../../../data.cifar10', '-p=50', '--lr=0.3', '--epochs=180', '--compress=../pruning_filters_for_efficient_convnets/resnet56_cifar_baseline_training.yaml', '-j=1', '--deterministic']
2018-07-02 15:27:50,590 - ==> using cifar10 dataset
2018-07-02 15:27:50,590 - => creating resnet56_cifar model for CIFAR10
2018-07-02 15:27:54,745 - Optimizer Type: <class 'torch.optim.sgd.SGD'>
2018-07-02 15:27:54,745 - Optimizer Args: {'lr': 0.3, 'weight_decay': 0.0001, 'momentum': 0.9, 'nesterov': False, 'dampening': 0}
2018-07-02 15:27:56,423 - Dataset sizes:
        training=45000
        validation=5000
        test=10000
2018-07-02 15:27:56,423 - Reading compression schedule from: ../pruning_filters_for_efficient_convnets/resnet56_cifar_baseline_training.yaml
2018-07-02 15:27:56,441 - Schedule contents:
{
  "lr_schedulers": {
    "training_lr": {
      "class": "StepLR",
      "step_size": 45,
      "gamma": 0.1
    }
  },
  "policies": [
    {
      "lr_scheduler": {
        "instance_name": "training_lr"
      },
      "starting_epoch": 35,
      "ending_epoch": 200,
      "frequency": 1
    }
  ]
}
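
In plain PyTorch terms, the logged optimizer settings and schedule correspond roughly to the sketch below (illustrative only; Distiller drives the scheduler through the policy above, starting at epoch 35, so the exact epochs at which the LR drops can differ from a vanilla StepLR):

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 10)  # stand-in for the resnet56_cifar model

# SGD settings taken from the 'Optimizer Args' log line above.
optimizer = optim.SGD(model.parameters(), lr=0.3, momentum=0.9,
                      weight_decay=1e-4, nesterov=False, dampening=0)

# StepLR with step_size=45 and gamma=0.1, as in the schedule dump above.
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=45, gamma=0.1)

for epoch in range(180):
    # ... one training epoch over CIFAR-10 would run here ...
    scheduler.step()  # divides the LR by 10 every 45 epochs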

Eric-mingjie commented on August 26, 2024

Thanks for the reply.
Yeah, the result with an initial learning rate of 0.3 looks good. But I still don't understand why you chose this learning rate schedule - I think the original paper uses 0.1 as the starting learning rate.
