
eric-mingjie / rethinking-network-pruning


Rethinking the Value of Network Pruning (Pytorch) (ICLR 2019)

License: MIT License

Python 100.00%
convolutional-neural-networks network-pruning deep-learning pytorch

rethinking-network-pruning's Introduction

Rethinking the Value of Network Pruning

This repository contains the code for reproducing the results, as well as the trained ImageNet models, of the following paper:

Rethinking the Value of Network Pruning. [arXiv] [OpenReview]

Zhuang Liu*, Mingjie Sun*, Tinghui Zhou, Gao Huang, Trevor Darrell (* equal contribution).

ICLR 2019. Also Best Paper Award at NIPS 2018 Workshop on Compact Deep Neural Networks.

The implementations of several pruning methods contained in this repo can also readily be used for other research purposes.

Paper Summary

Fig 1: A typical three-stage network pruning pipeline.

Our paper shows that for structured pruning, training the pruned model from scratch can almost always achieve accuracy comparable to or higher than that of the model obtained from the typical "training, pruning and fine-tuning" (Fig. 1) procedure. We conclude that for those pruning methods:

  1. Training a large, over-parameterized model is often not necessary to obtain an efficient final model.
  2. Learned “important” weights of the large model are typically not useful for the small pruned model.
  3. The pruned architecture itself, rather than a set of inherited "important" weights, is what matters most for the final model's efficiency, which suggests that in some cases pruning can be useful as an architecture search paradigm.

Our results suggest the need for more careful baseline evaluations in future research on structured pruning methods.

Fig 2: Difference between predefined and automatically discovered target architectures, in channel pruning. The pruning ratio x is user-specified, while a, b, c, d are determined by the pruning algorithm. Unstructured sparse pruning can also be viewed as automatic. Our finding has different implications for predefined and automatic methods: for a predefined method, it is possible to skip the traditional "training, pruning and fine-tuning" pipeline and directly train the pruned model; for automatic methods, the pruning can be seen as a form of architecture learning.


We also compare with the "Lottery Ticket Hypothesis" (Frankle & Carbin 2019), and find that with the optimal learning rate, the "winning ticket" initialization used in Frankle & Carbin (2019) does not bring improvement over random initialization. For more details, please refer to our paper.

Implementation

We evaluated the following seven pruning methods.

  1. L1-norm based channel pruning
  2. ThiNet
  3. Regression based feature reconstruction
  4. Network Slimming
  5. Sparse Structure Selection
  6. Soft filter pruning
  7. Unstructured weight-level pruning

The first six are structured methods, while the last one is unstructured (or sparse). For CIFAR, our code is based on pytorch-classification and network-slimming. For ImageNet, we use the official PyTorch ImageNet training code. The instructions and models are in each subfolder.
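To make the distinction concrete, here is a minimal sketch (not code from this repo; function names are made up) of unstructured magnitude pruning versus the per-filter L1-norm scoring used by structured methods:

import torch
import torch.nn as nn

def unstructured_magnitude_prune(model, ratio=0.5):
    # Unstructured (sparse) pruning: zero out the smallest-magnitude conv weights globally.
    all_weights = torch.cat([m.weight.data.abs().view(-1)
                             for m in model.modules() if isinstance(m, nn.Conv2d)])
    threshold = torch.sort(all_weights)[0][int(all_weights.numel() * ratio)]
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            m.weight.data.mul_((m.weight.data.abs() > threshold).float())

def l1_filter_scores(conv):
    # Structured (L1-norm) pruning: score each filter (output channel) by the L1 norm
    # of its weights; the lowest-scoring filters are candidates for removal.
    return conv.weight.data.abs().sum(dim=(1, 2, 3))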

For experiments on The Lottery Ticket Hypothesis, please refer to the folder cifar/lottery-ticket.

Our experiment environment is Python 3.6 & PyTorch 0.3.1.

Contact

Feel free to discuss papers/code with us through issues/emails!

sunmj15 at gmail.com
liuzhuangthu at gmail.com

Citation

If you use our code in your research, please cite:

@inproceedings{liu2018rethinking,
  title={Rethinking the Value of Network Pruning},
  author={Liu, Zhuang and Sun, Mingjie and Zhou, Tinghui and Huang, Gao and Darrell, Trevor},
  booktitle={ICLR},
  year={2019}
}

rethinking-network-pruning's People

Contributors

eric-mingjie, jjxxmiin, liuzhuang13, quelleg


rethinking-network-pruning's Issues

ImageNet ResNet FLOPs

I tried to compute the FLOPs of the torchvision ResNet models (imagenet/network-slimming/compute_flops.py). However, the result is 24.64G FLOPs, which conflicts with https://github.com/albanie/convnet-burden#image-classification-architectures (~4G FLOPs).

Code to reproduce:

import torchvision

from compute_flops import count_model_param_flops
from models.resnet import resnet50
from vgg import slimmingvgg as vgg11


def main():
    model_torchvision = torchvision.models.resnet50()

    flops_torchvision = count_model_param_flops(model_torchvision, 224)

    print(flops_torchvision)
    pass


if __name__ == '__main__':
    main()

Output:

+ Number of FLOPs: 24.64G
tensor(2.4636e+10)
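For comparison, a typical hook-based counter tallies multiply-accumulates (MACs) per convolution as in the rough sketch below (illustrative only, not the repo's compute_flops.py); note that tools differ in whether they report MACs or 2x MACs and in which layer types they include, which is a common source of mismatched totals.

import torch
import torch.nn as nn
import torchvision

def count_conv_macs(model, input_res=224):
    macs = []

    def hook(module, inputs, output):
        # MACs for a conv layer: output elements * kernel area * input channels / groups
        kh, kw = module.kernel_size
        macs.append(output.numel() * kh * kw * module.in_channels // module.groups)

    handles = [m.register_forward_hook(hook)
               for m in model.modules() if isinstance(m, nn.Conv2d)]
    model.eval()
    with torch.no_grad():
        model(torch.randn(1, 3, input_res, input_res))
    for h in handles:
        h.remove()
    return sum(macs)

print(count_conv_macs(torchvision.models.resnet50()) / 1e9, "G MACs (conv layers only)")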

Pruning schemes for regression-based resnet_2x

Hi, thanks for sharing the code.
May I know how you obtained the following pruning scheme for resnet_2x in regression-pruning:

cfg_2x = [35, 64, 55, 101, 51, 39, 97, 50, 37, 144, 128, 106, 205, 105, 72, 198, 105, 72, 288, 128, 110, 278, 256, 225, 418, 209, 147, 407, 204, 158, 423, 212, 155, 412, 211, 148, 595, 256, 213, 606, 512, 433, 1222, 512, 437, 1147, 512, 440]

Is it from the original code, or did you implement it yourself?

BTW, it seems that, according to compute_flops.py, the FLOPs reduction is not 2x, even though the file is named resnet_2x.py.

Question about the validation set

First of all, I would like to thank the authors for sharing their excellent research work. However, part of the code is confusing me. My main concern is the training/validation/test splits: is it a common approach to use the test set to select the best model (e.g. in the CIFAR-10 experiments)?
In other words, is it necessary to explicitly use a validation set to choose the best model?
Also, is this conventional and widely adopted in other studies, as in the reimplemented code?

Size mismatch while trying to implement MLPprune.py on cifar10 in Network Slimming

While trying to implement MLPprune.py similar to vggprune.py, I get the following error:

Traceback (most recent call last):
File "MLPprune.py", line 174, in
test(model)
File "MLPprune.py", line 123, in test
output = model(data)

...
...
> RuntimeError: size mismatch, m1: [256 x 3072], m2: [3 x 128]

My code for the MLP is the same as vggprune.py, except for changing model.arch.
Where should I make the change in the vggprune.py model?

Can't reproduce network slimming on ImageNet

We trained the network slimming model with the command given in https://github.com/Eric-mingjie/rethinking-network-pruning/blob/master/imagenet/network-slimming/README.md#train-with-sparsity and pruned with 50%. However, we could not reproduce the same pruning result as the models you provided.

More specifically, in our result, classifier.1.weight was pruned to 0 channels, while classifier.4.weight keeps almost all of its original channels.

Pruning result:

layer index: 4   total channel: 64       remaining channel: 26
layer index: 8   total channel: 128      remaining channel: 86
layer index: 12          total channel: 256      remaining channel: 111
layer index: 15          total channel: 256      remaining channel: 182
layer index: 19          total channel: 512      remaining channel: 171
layer index: 22          total channel: 512      remaining channel: 176
layer index: 26          total channel: 512      remaining channel: 295
layer index: 29          total channel: 512      remaining channel: 328
layer index: 34          total channel: 4096     remaining channel: 0
layer index: 37          total channel: 4096     remaining channel: 4096

Applying l1 norm in Network slimming/vggprune.py

In the original paper, the authors applied an L1 penalty to the scaling factors of the batch norm layers. However, in your code, you obtain a threshold and prune out the channels whose BatchNorm scaling factors are less than that threshold (thre).
It seems like you have not applied the L1 norm in your code.

Please let me know if I am missing anything.
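For context, in the network slimming method the L1 penalty on the scaling factors is applied during sparsity training (the updateBN step), while the prune script only thresholds the learned factors. A rough sketch of that thresholding step (illustrative, with made-up names, not the repo's exact code):

import torch
import torch.nn as nn

def global_bn_threshold(model, percent=0.5):
    # Gather the absolute BN scaling factors (gamma) from all BN layers,
    # sort them globally, and take the value at the requested percentile.
    gammas = torch.cat([m.weight.data.abs().view(-1)
                        for m in model.modules() if isinstance(m, nn.BatchNorm2d)])
    sorted_gammas, _ = torch.sort(gammas)
    return sorted_gammas[int(gammas.numel() * percent)]

def bn_channel_masks(model, threshold):
    # Channels whose gamma falls below the global threshold are pruned.
    return {name: (m.weight.data.abs() > threshold)
            for name, m in model.named_modules() if isinstance(m, nn.BatchNorm2d)}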

About cfg in l1-norm-pruning/vggprune.py

Hi,

I notice that this cfg used to prune vgg16 model has a slightly different configuration than the original vgg16 cfg.

vgg16_cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512]

Why did you choose to perform pruning only on the first and the last six conv layers? Is there any reason for this?

mask

Is there code that implements the mask for regression-based pruning?

VGG-16 on CIFAR10 dataset architecture

Hello,
I would like to know the architecture of the VGG-16 used on the CIFAR-10 dataset. Does it contain 13 convolutional layers followed by an average pooling layer and one fully connected layer?
Regards

A question about the pruning algorithm

Dear author,

I just found an issue: with the current algorithm, the threshold for pruning is selected based on the batch norm scaling factors of all layers. Hence, it is possible that all the scaling factors of a certain layer fall below the threshold, so that all channels in that layer are masked. In such cases, the mask implementation blocks the data flow inside the neural network.

I encountered this problem when setting the pruning percentage to 0.5 as shown in the readme file, and I got almost 0% accuracy after the first round of pruning.

Could you please advise whether this is the correct method? Should I use fine-tuning to recover the accuracy, or should I decrease the pruning ratio first and prune progressively?

As the ratio of 0.5 is suggested in the code, may I check whether you also encountered a similar situation of 0% accuracy in the first round of pruning with a 0.5 ratio?

Thank you so much for your reply and advice.

Question about skip

I want to ask about the skip list in pruning.py. Why were these particular layers chosen to be pruned or skipped? I want to prune my own resnet101 and wonder whether there are any rules for choosing which layers to prune.
Thanks~

Reproduce Fig. 4 from paper

Hi, thank you for this great work.
How would one reproduce Figure 4 from your paper, i.e. "The average sparsity pattern"?
Thanks

updateBN

Hello. I got a question while reproducing your interesting experiment.

In section 2 of https://arxiv.org/pdf/1708.06519.pdf, "Scaling Factors and Sparsity-induced Penalty" shows below equation.
[equation image from the paper: the classification loss plus a λ Σ_γ g(γ) sparsity term on the BN scaling factors]

Question:
g(γ) means the L1 norm, but https://github.com/Eric-mingjie/rethinking-network-pruning/blob/master/imagenet/network-slimming/main_finetune.py#L187 applies torch.sign, as in "m.weight.grad.data.add_(sparsity * torch.sign(m.weight.data))", not the L1 norm itself.

So wouldn't it be right to use "m.weight.grad.data.add_(sparsity * m.weight.data.abs())" for updateBN?
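For reference, the subgradient of the L1 penalty g(γ) = |γ| is sign(γ), so adding sparsity * sign(γ) to the gradient is the usual way to implement the penalty; adding |γ| itself would not be the gradient of the L1 term. A minimal sketch of such an update step (the hyper-parameter name sparsity is assumed):

import torch
import torch.nn as nn

def update_bn(model, sparsity=1e-5):
    # Subgradient descent on the L1 sparsity penalty: d|gamma|/dgamma = sign(gamma),
    # so the penalty contributes sparsity * sign(gamma) to each BN weight gradient.
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.weight.grad.data.add_(sparsity * torch.sign(m.weight.data))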

Question about predefined structured pruning

Hi, thanks for the great work. I have a question about the experiments in predefined structured pruning methods. I am not sure I am understanding the paper correctly.

For predefined structured pruning methods, given the pruning ratio (e.g. 50%), the only difference between methods is how they find the "least important" channels to prune. But after pruning, they all result in the same pruned structure. According to the paper, all these pruned models should then have the same performance, even when trained from scratch. My question is: if this is true, does it mean it is pointless to run those predefined structured pruning methods, since they all lead to the same pruned model with the same performance? One could just construct a ResNet_0.5x and train it from scratch, and it would perform the same as the predefined structured pruning methods. I am looking forward to your reply.

when should I train it?

Hi, I want to prune my model for a tracking task. I'm using resnet22 and my own tracking datasets, and I train it with my own program. My trouble is: when should I train it? If I want to achieve a 60% pruning rate, should I first prune the randomly initialized model to 60% and then train it, or should I gradually prune the model through an iterative fine-tuning and pruning process?
Looking forward to your help.

about train() and eval()

Does anyone know what model.train() at line 128 and model.eval() at line 154 of network-slimming/main.py are for? I did not find the definitions of these two functions. Can I just delete them? Thanks.

Question for Network Slimming on cifar 100

The accuracy of the pruned VGG-19 with a sparse rate of 0.5 (before fine-tuning) becomes 10.23, and rises to 72.30 after fine-tuning. This is natural.
However, ResNet-164 with a sparse rate of 0.5 has exactly the same accuracy as the original model (75.55), which I think is weird, and the accuracy drops after fine-tuning (75.41). Is this right? I checked whether the model size actually decreased and found no problem.
Is this result natural?

Pruning steps

If I want to prune a VGG model using the L1-norm pruning method and the CIFAR dataset, I have to run:
  1. main.py
  2. vggprune.py
  3. main_finetune.py
because when I start with vggprune.py I get a test accuracy of about 10% for both the original model and the new (pruned) model.

Also, I don't understand this line:
out_channels = m.weight.data.shape[0]
And why the choice of start_mask = torch.ones(3)? Is it because the in_channels are 3?
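For orientation (a sketch of the general pattern, not the repo's exact code): a Conv2d weight tensor has shape [out_channels, in_channels, kH, kW], so m.weight.data.shape[0] is the number of output channels, and start_mask = torch.ones(3) corresponds to the three RGB input channels seen by the first convolution. Copying weights into a pruned layer then selects the surviving input and output channels:

import torch

def copy_pruned_conv(old_conv, new_conv, in_mask, out_mask):
    # in_mask / out_mask are 0/1 tensors over input / output channels;
    # the very first layer uses in_mask = torch.ones(3) for the RGB input.
    in_idx = torch.nonzero(in_mask).squeeze(1)
    out_idx = torch.nonzero(out_mask).squeeze(1)
    w = old_conv.weight.data.index_select(1, in_idx)         # keep surviving input channels
    new_conv.weight.data.copy_(w.index_select(0, out_idx))   # keep surviving output channels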

size mismatch, m1: [2 x 288], m2: [8 x 120]

I am trying to prune a simple LeNet-5 model using L1-norm pruning and the CIFAR-10 dataset. The model has 6 kernels in the first convolutional layer and 16 in the second. The output of the last convolutional layer of the original model is 16x6x6, and the number of nodes in the first dense layer is 120, which makes a weight matrix of [576, 120]. After pruning (5 kernels are pruned from the first layer and 8 from the second), the output of the last convolutional layer is 6x6x8, which makes a matrix of [288, 120]. But during training it gives a dimension mismatch error. The problem is in copying weights from the original model to the pruned model in the dense layer. Here is the code where the weights are being copied.

[screenshot of the weight-copying code]

size mismatch, m1: [2 x 288], m2: [8 x 120] at /pytorch/aten/src/THC/generic/THCTensorMathBlas.cu:290
Any suggestions?
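One common cause of this kind of mismatch is the flatten order: features are flattened channel-first, so each surviving conv channel owns a contiguous block of H*W columns of the first Linear layer, and those columns have to be selected when copying the weights. A hedged sketch (hypothetical helper, assuming the 6x6 spatial output described above):

import torch

def copy_pruned_fc(old_fc, new_fc, surviving_channels, spatial=6 * 6):
    # Flatten order is (channel, height, width), so channel c owns columns
    # [c * spatial, (c + 1) * spatial) of the original Linear weight [120, 576].
    cols = torch.cat([torch.arange(c * spatial, (c + 1) * spatial)
                      for c in surviving_channels])
    new_fc.weight.data.copy_(old_fc.weight.data[:, cols])  # -> shape [120, 288] here
    new_fc.bias.data.copy_(old_fc.bias.data)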

RuntimeError: CUDNN_STATUS_EXECUTION_FAILED

I've installed the correct requirements. But after running this:
python main.py --dataset cifar10 --arch vgg --depth 16

I'm getting the following error:

Traceback (most recent call last):
  File "main.py", line 166, in <module>
    train(epoch)
  File "main.py", line 125, in train
    output = model(data)
  File "/home/jeferson/repo/rethinking-network-pruning/repense/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jeferson/repo/rethinking-network-pruning/cifar/l1-norm-pruning/models/vgg.py", line 56, in forward
    x = self.feature(x)
  File "/home/jeferson/repo/rethinking-network-pruning/repense/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jeferson/repo/rethinking-network-pruning/repense/lib/python3.6/site-packages/torch/nn/modules/container.py", line 67, in forward
    input = module(input)
  File "/home/jeferson/repo/rethinking-network-pruning/repense/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jeferson/repo/rethinking-network-pruning/repense/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 282, in forward
    self.padding, self.dilation, self.groups)
  File "/home/jeferson/repo/rethinking-network-pruning/repense/lib/python3.6/site-packages/torch/nn/functional.py", line 90, in conv2d
    return f(input, weight, bias)
RuntimeError: CUDNN_STATUS_EXECUTION_FAILED

Am I doing something wrong?

A problem with preventing already-zeroed channels from receiving weight updates

I find your work very interesting and am doing related research myself. During my experiments I noticed an interesting phenomenon that I would like to discuss with you.

for m in model.modules():
    if isinstance(m, nn.BatchNorm2d) or isinstance(m, nn.BatchNorm1d):
        mask = (m.weight.data != 0)
        mask = mask.float().cuda()
        m.weight.grad.data.mul_(mask)
        m.bias.grad.data.mul_(mask)

This is the code you use to prevent weights that have already been zeroed out from receiving further gradient updates, which is a nice idea, and I wanted to add it to my own work. However, I found that in PyTorch this code does not absolutely prevent the weights from being updated: although intuitively it should stop the parameter updates, in practice the zeroed channels still receive small updates. A direct consequence is that channels that should be disabled are still quietly playing a role. I wonder whether you have noticed this situation; I look forward to your reply.

train from scratch

Hi, thanks for your hard work!

I am curious about the train-from-scratch experiments in your paper. Specifically, if you prune by magnitude at the fine-grained (weight) level, do you directly prune the weights using information from the initialized weights (i.e. sort the initialized weights, choose the smallest top-k individual weights and prune them)? Or do you use the pre-trained model's weights to prune the model, and then re-initialize the remaining weights and re-train them?

L2 weight decay on batch normalization layer

I notice that PyTorch applies weight decay to all trainable parameters, including BatchNorm. In training Network Slimming on ImageNet, the weight decay is 1e-4, which is 10x larger than the sparsity 1e-5. Does this affect the effectiveness of the sparsity loss? Could I set the weight decay to 0 for the BN layers? Are there any experimental results with 0 weight decay on the BN layers?
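For anyone who wants to try zero weight decay on the BN parameters, the standard PyTorch pattern is to put them in a separate optimizer parameter group (a minimal sketch, assuming plain SGD with momentum; not an experiment from this repo):

import torch
import torch.nn as nn

def build_optimizer(model, lr=0.1, momentum=0.9, weight_decay=1e-4):
    bn_params, other_params = [], []
    for module in model.modules():
        target = bn_params if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d)) else other_params
        target.extend(module.parameters(recurse=False))
    return torch.optim.SGD(
        [{"params": other_params, "weight_decay": weight_decay},
         {"params": bn_params, "weight_decay": 0.0}],   # no L2 decay on gamma / beta
        lr=lr, momentum=momentum)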

Some questions about ThiNet

Why does the ThiNet model in your repo not apply the algorithm from the original paper? Or maybe I just didn't find it?

Network slimming loss function

Hi, Thank you for sharing a good experiment.
I have a question about the loss function of network slimming.

The paper shows the training objective as shown below.
[image of the training objective: the classification loss plus the λ Σ_γ g(γ) sparsity term]

But the code only uses the cross-entropy loss when training after pruning. Is this the right implementation?
Please explain if I misunderstood.


Cifar10 vgg19 zero remaining channel (network slimming)

Hi, I have one more question.
When VGG-19 is pruned at 70% following the guide, the remaining channels are as shown below.

layer index: 3 total channel: 64 remaining channel: 45
layer index: 6 total channel: 64 remaining channel: 64
layer index: 10 total channel: 128 remaining channel: 128
layer index: 13 total channel: 128 remaining channel: 128
layer index: 17 total channel: 256 remaining channel: 256
layer index: 20 total channel: 256 remaining channel: 256
layer index: 23 total channel: 256 remaining channel: 249
layer index: 26 total channel: 256 remaining channel: 184
layer index: 30 total channel: 512 remaining channel: 36
layer index: 33 total channel: 512 remaining channel: 6
layer index: 36 total channel: 512 remaining channel: 2
layer index: 39 total channel: 512 remaining channel: 0
layer index: 43 total channel: 512 remaining channel: 0
layer index: 46 total channel: 512 remaining channel: 0
layer index: 49 total channel: 512 remaining channel: 5
layer index: 52 total channel: 512 remaining channel: 292

Since the remaining channels of indices 39, 43 and 46 are zero, the error "IndexError: index 0 is out of bounds for dimension 0 with size 0" occurs at https://github.com/Eric-mingjie/rethinking-network-pruning/blob/master/cifar/network-slimming/vggprune.py#L125
A layer with zero remaining channels means the network cannot be trained. This is a big problem.

Is there anything else I need to do to reach 70% pruning as in the paper's result?

Or do I have to apply the mask implementation you mentioned in #44 (comment) to do 70% pruning?
I also tried this method (the mask implementation), but there were still zero-remaining channels, because the pruning method, which eliminates channels below a threshold, is the same. So this was not an appropriate solution.

Is there a special way to prevent the zero-remaining channels?

Pruning strategy

Dear author:
If I want to prune a VGG-16 model on ImageNet using the L1-norm based channel pruning method:

Should I prune the shallower or the deeper convolutional layers first?
I have read some related papers, but they don't seem to mention this point.
Or is it simply based on the sensitivity of each layer and on experience?

Best regards.

mask implementation

I'm curious about the implementation of the pruning algorithm for weight-level pruning. Specifically, to my understanding, in cifar/weight-level/ you first train a model, prune it, and then fine-tune it.

My question is about the code in cifar/weight-level/cifar_finetune.py, at lines 246 to 251. Correct me if I'm wrong, but it seems that at each training iteration you check the weights of the Conv2d layers and mask out the gradients of the weights that are zero. My question is: in addition to the weights that were zeroed out in the pruning phase, is it possible that the number of zero weights increases as training proceeds? If so, your code would seem to freeze these unpruned weights at zero. Thanks for any further feedback.

prune mobilenetv2

Hello @liuzhuang13 @Eric-mingjie, have you ever tried pruning MobileNetV2?
I tried to prune MobileNetV2 with several methods, but it seems hard to train the pruned model to convergence on ImageNet.

Accuracy of the pruned model

Hi,

I have followed the code here and run the sparse training code as below:

python main.py --arch vgg11_bn --s 0.00001 --save [PATH TO SAVE RESULTS] [IMAGENET]

After the training, the accuracy is 71.4%, which is fine. However, the accuracy after pruning is almost 0 with a 0.5 pruning ratio. When I decrease the pruning ratio to 0.2, the top-1 accuracy increases to 15%, which is also far below expectations. Could you please advise whether this is normal, or whether something could be wrong?

I would like to prune in one shot and do not want to prune iteratively.

Thanks for your reply.

Best regards,

Custom Dataset and architecture

@liuzhuang13 @Eric-mingjie @quelleG Thanks for sharing this wonderful work. I just have a few queries:

  1. Is the source code applicable only to the ImageNet dataset, or can I use it with another custom dataset?
  2. The architecture I have is a modified version of ResNet; can I use the source code?
  3. How much performance gain did you obtain in your experiments?

Different result in network slimming

I used vgg11_bn on CIFAR-10, but the result was totally different from the paper. In the paper, the network architecture obtained by pruning 60% of the channels of VGG-16 (13 conv layers in total) using Network Slimming is quite workable, and the 9th-13th conv layers are pruned heavily. But in my experiment I find that the later layers have the higher average gamma values, so after pruning almost all of the earlier layers have been pruned away, and the accuracy is much lower. Have you encountered this situation?

Regarding accuracy of the Scratch B models in the paper

Hello,

Thanks for an interesting paper. I was looking at the accuracy of the Scratch-B models compared to the big unpruned networks, and it seems Scratch-B performs better than the unpruned networks most of the time. This seems counter-intuitive, as the bigger network, if it could be trained effectively, should outperform the smaller networks. Do you think the difference is statistically significant?

Thanks

A question about the training epochs

Hi, thanks for the great work! I think it is a really valuable observation that looking for the optimal structure is the real value of channel pruning.
But I have a question: when you compare the performance of the fine-tuned, Scratch-E and Scratch-B models, it seems fine-tuning only takes a small number of epochs. For example, on CIFAR-10 the fine-tuning only takes 40 epochs while Scratch-E takes 160 epochs. Could that be the reason why the scratch models outperform the fine-tuned ones?
From my experiments, I find 40 epochs of fine-tuning are really not enough, especially for the higher pruning ratios.
I think it may not be a fair comparison between two models trained for such different numbers of epochs.

'async' is a reserved word in Python >= 3.7

The async keyword argument in conversion calls is deprecated in PyTorch >= 0.4.0 and has been replaced by non_blocking. This is necessary because async is a reserved keyword in Python >= 3.7.
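The fix is mechanical; a minimal sketch of the replacement (the same pattern applies to each of the files flagged below):

import torch

inputs = torch.randn(4, 3, 32, 32)
targets = torch.randint(0, 10, (4,))
if torch.cuda.is_available():
    # Old form (PyTorch <= 0.3): inputs.cuda(async=True) -- a SyntaxError on Python >= 3.7
    inputs = inputs.cuda(non_blocking=True)
    targets = targets.cuda(non_blocking=True)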

flake8 testing of https://github.com/Eric-mingjie/rethinking-network-pruning on Python 3.7.0

$ flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics

./cifar/weight-level/cifar_finetune.py:229:63: E999 SyntaxError: invalid syntax
            inputs, targets = inputs.cuda(), targets.cuda(async=True)
                                                              ^
./cifar/weight-level/cifar_B.py:273:63: E999 SyntaxError: invalid syntax
            inputs, targets = inputs.cuda(), targets.cuda(async=True)
                                                              ^
./cifar/weight-level/cifar.py:232:63: E999 SyntaxError: invalid syntax
            inputs, targets = inputs.cuda(), targets.cuda(async=True)
                                                              ^
./cifar/weight-level/cifar_E.py:265:63: E999 SyntaxError: invalid syntax
            inputs, targets = inputs.cuda(), targets.cuda(async=True)
                                                              ^
./imagenet/regression-pruning/compute_flops.py:91:32: F821 undefined name 'torch'
            if isinstance(net, torch.nn.Conv2d):
                               ^
./imagenet/regression-pruning/compute_flops.py:93:32: F821 undefined name 'torch'
            if isinstance(net, torch.nn.Linear):
                               ^
./imagenet/regression-pruning/compute_flops.py:95:32: F821 undefined name 'torch'
            if isinstance(net, torch.nn.BatchNorm2d):
                               ^
./imagenet/regression-pruning/compute_flops.py:97:32: F821 undefined name 'torch'
            if isinstance(net, torch.nn.ReLU):
                               ^
./imagenet/regression-pruning/compute_flops.py:99:32: F821 undefined name 'torch'
            if isinstance(net, torch.nn.MaxPool2d) or isinstance(net, torch.nn.AvgPool2d):
                               ^
./imagenet/regression-pruning/compute_flops.py:99:71: F821 undefined name 'torch'
            if isinstance(net, torch.nn.MaxPool2d) or isinstance(net, torch.nn.AvgPool2d):
                                                                      ^
./imagenet/regression-pruning/compute_flops.py:101:32: F821 undefined name 'torch'
            if isinstance(net, torch.nn.Upsample):
                               ^
./imagenet/regression-pruning/compute_flops.py:110:22: F821 undefined name 'torch'
    input = Variable(torch.rand(3,input_res,input_res).unsqueeze(0), requires_grad = True)
                     ^
./imagenet/regression-pruning/main_E.py:197:34: E999 SyntaxError: invalid syntax
        target = target.cuda(async=True)
                                 ^
./imagenet/regression-pruning/main_B.py:211:34: E999 SyntaxError: invalid syntax
        target = target.cuda(async=True)
                                 ^
./imagenet/regression-pruning/models/vgg_5x.py:8:1: F822 undefined name 'vgg16_official' in __all__
__all__ = [
^
./imagenet/thinet/compute_flops.py:91:32: F821 undefined name 'torch'
            if isinstance(net, torch.nn.Conv2d):
                               ^
./imagenet/thinet/compute_flops.py:93:32: F821 undefined name 'torch'
            if isinstance(net, torch.nn.Linear):
                               ^
./imagenet/thinet/compute_flops.py:95:32: F821 undefined name 'torch'
            if isinstance(net, torch.nn.BatchNorm2d):
                               ^
./imagenet/thinet/compute_flops.py:97:32: F821 undefined name 'torch'
            if isinstance(net, torch.nn.ReLU):
                               ^
./imagenet/thinet/compute_flops.py:99:32: F821 undefined name 'torch'
            if isinstance(net, torch.nn.MaxPool2d) or isinstance(net, torch.nn.AvgPool2d):
                               ^
./imagenet/thinet/compute_flops.py:99:71: F821 undefined name 'torch'
            if isinstance(net, torch.nn.MaxPool2d) or isinstance(net, torch.nn.AvgPool2d):
                                                                      ^
./imagenet/thinet/compute_flops.py:101:32: F821 undefined name 'torch'
            if isinstance(net, torch.nn.Upsample):
                               ^
./imagenet/thinet/compute_flops.py:110:22: F821 undefined name 'torch'
    input = Variable(torch.rand(3,input_res,input_res).unsqueeze(0), requires_grad = True)
                     ^
./imagenet/thinet/main_E.py:206:34: E999 SyntaxError: invalid syntax
        target = target.cuda(async=True)
                                 ^
./imagenet/thinet/main_B.py:226:34: E999 SyntaxError: invalid syntax
        target = target.cuda(async=True)
                                 ^
./imagenet/l1-norm-pruning/main_finetune.py:206:34: E999 SyntaxError: invalid syntax
        target = target.cuda(async=True)
                                 ^
./imagenet/l1-norm-pruning/compute_flops.py:91:32: F821 undefined name 'torch'
            if isinstance(net, torch.nn.Conv2d):
                               ^
./imagenet/l1-norm-pruning/compute_flops.py:93:32: F821 undefined name 'torch'
            if isinstance(net, torch.nn.Linear):
                               ^
./imagenet/l1-norm-pruning/compute_flops.py:95:32: F821 undefined name 'torch'
            if isinstance(net, torch.nn.BatchNorm2d):
                               ^
./imagenet/l1-norm-pruning/compute_flops.py:97:32: F821 undefined name 'torch'
            if isinstance(net, torch.nn.ReLU):
                               ^
./imagenet/l1-norm-pruning/compute_flops.py:99:32: F821 undefined name 'torch'
            if isinstance(net, torch.nn.MaxPool2d) or isinstance(net, torch.nn.AvgPool2d):
                               ^
./imagenet/l1-norm-pruning/compute_flops.py:99:71: F821 undefined name 'torch'
            if isinstance(net, torch.nn.MaxPool2d) or isinstance(net, torch.nn.AvgPool2d):
                                                                      ^
./imagenet/l1-norm-pruning/compute_flops.py:101:32: F821 undefined name 'torch'
            if isinstance(net, torch.nn.Upsample):
                               ^
./imagenet/l1-norm-pruning/compute_flops.py:110:22: F821 undefined name 'torch'
    input = Variable(torch.rand(3,input_res,input_res).unsqueeze(0), requires_grad = True)
                     ^
./imagenet/l1-norm-pruning/main_E.py:203:34: E999 SyntaxError: invalid syntax
        target = target.cuda(async=True)
                                 ^
./imagenet/l1-norm-pruning/prune.py:83:34: E999 SyntaxError: invalid syntax
        target = target.cuda(async=True)
                                 ^
./imagenet/l1-norm-pruning/main_B.py:216:34: E999 SyntaxError: invalid syntax
        target = target.cuda(async=True)
                                 ^
./imagenet/network-slimming/main.py:206:34: E999 SyntaxError: invalid syntax
        target = target.cuda(async=True)
                                 ^
./imagenet/network-slimming/main_finetune.py:212:34: E999 SyntaxError: invalid syntax
        target = target.cuda(async=True)
                                 ^
./imagenet/network-slimming/main_E.py:214:34: E999 SyntaxError: invalid syntax
        target = target.cuda(async=True)
                                 ^
./imagenet/network-slimming/prune.py:142:34: E999 SyntaxError: invalid syntax
        target = target.cuda(async=True)
                                 ^
./imagenet/network-slimming/main_B.py:220:34: E999 SyntaxError: invalid syntax
        target = target.cuda(async=True)
                                 ^
17    E999 SyntaxError: invalid syntax
24    F821 undefined name 'torch'
1     F822 undefined name 'vgg16_official' in __all__
42

preresnet layer inconsistent when apply custom config

The number of output channels of conv1 in preresnet is fixed at 16. However, when using a custom config for preresnet, the in_channels of the first block in layer1 may not be 16 (it depends on cfg), which causes an error.

self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False)
self.layer1 = self._make_layer(block, 16, n, cfg=cfg[0:3*n])
self.layer2 = self._make_layer(block, 32, n, cfg=cfg[3*n:6*n], stride=2)
self.layer3 = self._make_layer(block, 64, n, cfg=cfg[6*n:9*n], stride=2)
self.bn = nn.BatchNorm2d(64 * block.expansion)
self.select = channel_selection(64 * block.expansion)
self.relu = nn.ReLU(inplace=True)
self.avgpool = nn.AvgPool2d(8)

IndexError: index 0 is out of bounds for dimension 0 with size 0

Dear author,

I am trying to prune a resnet-56 on cifar10 using network slimming.
python resprune.py --dataset cifar10 --depth 56 --percent 0.8 --model ~/results_def/resnet56/baseline/model_best.pth.tar --save ~/results_def/resnet56/pruned80/

Does this mean there is no path between input and output? Shouldn't it still work given every layer would be an identity mapping?

What should I do in case I want to reproduce the results for aggressive pruning?

Difference on epochs in network-slimming

Hi, I found that the default number of epochs for scratch-training VGG-11 on ImageNet in network-slimming is 90 in the code, which differs from the original paper's 60.

count_flops

count_flops: there is a problem with your FLOPs computation code.

Lottery ticket gradient masking

A quick question. In LTH experiments in https://github.com/Eric-mingjie/rethinking-network-pruning/blob/master/cifar/lottery-ticket/weight-level/lottery_ticket.py#L293 gradients are zeroed for the weights that are masked out. But gradients of these zeroed weights take part in the backward pass. In other words the backward pass taken seems not equivalent to a backward pass of the corresponding thin network initialized using the lottery ticket. Is this intentional, or maybe I misunderstood something? Thanks!
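For reference, the masking pattern under discussion zeroes the gradients of pruned weights after backward() and before optimizer.step(), which keeps those weights at zero but leaves the rest of the backward pass unchanged. A minimal sketch of that pattern (not the repo's exact code):

import torch
import torch.nn as nn

def mask_pruned_gradients(model, masks):
    # masks: dict mapping parameter name -> 0/1 tensor of the same shape.
    # Call after loss.backward() and before optimizer.step().
    for name, param in model.named_parameters():
        if name in masks and param.grad is not None:
            param.grad.data.mul_(masks[name])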

Network slimming for ANN on MNIST

How to do the pruning step for batchNorm1D layers in an ANN, where you would be using the weights directly rather than using a mask?
If possible, sample code for the nn.Linear layer and the BatchNorm1d layer would be really helpful!

When I use the same code for batchNorm1D, I get :

Traceback (most recent call last):
File "MLPprune.py", line 156, in
end_mask = cfg_mask[layer_id_in_cfg]
IndexError: list index out of range
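A rough sketch of how the same γ-threshold idea could be carried over to a Linear + BatchNorm1d pair (illustrative only, not code from this repo; the following Linear layer's input columns would also need to be sliced with the same indices):

import torch
import torch.nn as nn

def prune_linear_bn(old_fc, old_bn, threshold):
    # Keep the units whose BatchNorm1d scaling factor exceeds the threshold,
    # then build a smaller Linear / BatchNorm1d pair from the surviving weights.
    keep = torch.nonzero(old_bn.weight.data.abs() > threshold).squeeze(1)
    new_fc = nn.Linear(old_fc.in_features, len(keep))
    new_bn = nn.BatchNorm1d(len(keep))
    new_fc.weight.data.copy_(old_fc.weight.data[keep])    # rows = surviving output units
    new_fc.bias.data.copy_(old_fc.bias.data[keep])
    new_bn.weight.data.copy_(old_bn.weight.data[keep])
    new_bn.bias.data.copy_(old_bn.bias.data[keep])
    new_bn.running_mean.copy_(old_bn.running_mean[keep])
    new_bn.running_var.copy_(old_bn.running_var[keep])
    return new_fc, new_bn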

link not working

Hey,

The link for the weight-level pruned ResNet-50 at 60% is the same for the fine-tuned model and scratch-E. I think the link for the fine-tuned version is wrong.

ResNet-50 | 60% | finetune | 76.09 | 92.91 | pytorch model (195 MB) <------ Wrong link
ResNet-50 | 60% | scratch-E | 73.69 | 91.61 | pytorch model (195 MB)

Could you update the link if you have it? I would like to run some experiments with the model.

Best,
Marton

Is it reasonable to get a threshold for all bn layers?

When calculating the threshold, the scaling weights of all BN layers are sorted together. Is this reasonable?

Could there be the following phenomena:
① The values at the front of the network are closer to the raw image pixel values, while the last layer is closer to the class probabilities, so the BN weights are not necessarily distributed the same way.
② There are shortcuts in the middle of the network; after the outputs of two convolutions are added, the weight parameters become larger, which may affect the BN weights.

Looking forward to your reply. Thank you very much.
