mit-han-lab / once-for-all

[ICLR 2020] Once for All: Train One Network and Specialize it for Efficient Deployment

Home Page: https://ofa.mit.edu/

License: MIT License

Python 78.01% Shell 0.02% Jupyter Notebook 21.97%

tinyml edge-ai efficient-model acceleration nas automl

once-for-all's Issues

About training of once-for-all network

Hi, thanks for your great work!
I am interested in training the once-for-all network, but I ran into some problems while reading your training code.
Line 198 in train_ofa_net.py loads the weights of a teacher model. Is this an already-trained teacher model, so that the training code only performs progressive shrinking?
Besides, the arguments args.task and args.phase never seem to change during training. Am I right? If so, do I need to run training multiple times with different arguments?
Thanks.

Some questions about accuracy predictor

Hi, I'm very interested in your work.

I want to use the accuracy predictor with some other configurations (such as a ResNet-based OFA network).

I have looked at the tutorial code for acc_predictor that you uploaded, so I understand roughly what it looks like.
I have also read the appendix of your paper, which describes the accuracy predictor in detail.

I have a question about how much training data is needed to train the accuracy predictor.

When you trained acc_predictor, the ground truths were measured using the whole ImageNet validation set.

How many ground-truth samples are needed?

I would also like to know the hyperparameters used to train the accuracy predictor.

I hope you can answer my questions. Thank you.

How to generate latency_table?

I want to generate a latency_table on my own hardware. Can you give me some advice?
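
One common way to build such a table (an assumption here, not necessarily the procedure used for the released tables) is to measure each candidate block configuration in isolation on the target device, store the result keyed by configuration, and estimate a network's latency as the sum of its blocks' entries. The names `build_latency_table`, `estimate_latency`, and `fake_measure_ms` below are hypothetical; `fake_measure_ms` is a toy stand-in for a real on-device measurement.

```python
def build_latency_table(block_configs, measure_ms):
    """Map each hashable block config to its measured latency in milliseconds."""
    return {cfg: measure_ms(cfg) for cfg in block_configs}

def estimate_latency(table, network_blocks):
    """Estimate a network's latency as the sum of its blocks' table entries."""
    return sum(table[cfg] for cfg in network_blocks)

# Toy stand-in for an on-device measurement: cost grows with kernel and width.
def fake_measure_ms(cfg):
    kernel_size, expand_ratio = cfg
    return 0.1 * kernel_size * expand_ratio

configs = [(3, 3), (5, 4), (7, 6)]          # (kernel size, expand ratio)
table = build_latency_table(configs, fake_measure_ms)
total_ms = estimate_latency(table, [(3, 3), (7, 6), (7, 6)])
```

In practice the key would also need to include whatever else changes the cost of a block on the device, such as input resolution and channel counts.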

Looking for pre-trained weights for ofa_proxyless_d234_e346_k357_w1.0

Hi,
I am following your wonderful CVPR20 tutorial.
I am interested in getting lower latency than what I could obtain using ofa_proxyless_d234_e346_k357_w1.3.
So I was trying out ofa_proxyless_d234_e346_k357_w1.0 (width 1.0 instead of 1.3). However, I cannot find the pre-trained network at https://hanlab.mit.edu/files/OnceForAll/ofa_nets/. Could you share it with me?

Thanks

How to measure the latency correctly?

Hi, Thanks for your great work!
When I was testing the latency on V100, the results confused me.
I used the following code to measure the latency table.
torch.cuda.empty_cache()
img_L = img_L.cuda()
start.record()
out = ofa_network(img_L)
end.record()
torch.cuda.synchronize()
run_time.update(start.elapsed_time(end))
The img_L is one image.
Is this correct?
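
The snippet in the question synchronizes before reading the elapsed time, which asynchronous GPU execution requires. Two things it does not obviously do are discarding warm-up iterations and aggregating over repeated runs. The following is a minimal, framework-agnostic sketch of that pattern; `measure_latency_ms` and its `sync` hook are hypothetical names, and for PyTorch on CUDA one would typically pass `torch.cuda.synchronize` as `sync`.

```python
import time
from statistics import median

def measure_latency_ms(run_once, sync=lambda: None, warmup=10, repeats=50):
    """Median latency of run_once() in ms, excluding warm-up iterations.

    sync is called before each clock read so that asynchronous work
    (for example queued GPU kernels) has finished.
    """
    for _ in range(warmup):          # discard cold-start runs
        run_once()
    sync()
    samples = []
    for _ in range(repeats):
        sync()
        start = time.perf_counter()
        run_once()
        sync()
        samples.append((time.perf_counter() - start) * 1e3)
    return median(samples)

# CPU-only stand-in workload, so no sync hook is needed here.
latency = measure_latency_ms(lambda: sum(i * i for i in range(10_000)))
```

The median is used rather than the mean so that occasional outliers (scheduler hiccups, clock changes) have less influence on the reported number.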

Cannot download supernet checkpoint

I tried to download your supernet weights, but there seems to be no connection.

ImportError: Extension horovod.torch

@zhijian-liu @Lyken17 @tonylins @mzahran001 @songhan ImportError: Extension horovod.torch has not been built. If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error

The effect of the `base_acc` in accuracy_predictor ?

Hi, I noticed that the code here (link1) adds base_acc when training the accuracy predictor and sets the bias of the last linear layer to False.
As a result, all samples in the same batch share the same bias.
If I do not add this bias, will the predictor's performance decrease? Or am I missing some analysis in the paper?

link1 : https://github.com/mit-han-lab/once-for-all/blob/master/ofa/nas/accuracy_predictor/acc_predictor.py#L32

Some doubts about the predictor

I sampled 800 net ids from the OFA net, with initial parameters from 'https://hanlab.mit.edu/files/OnceForAll/ofa_nets/'.
I want to use these subnets to test the predictor in ofa/tutorial/accuracy_predictor.py. The predictor loads the pretrained model
'https://hanlab.mit.edu/files/OnceForAll/tutorial/acc_predictor.pth'.
The picture shows a strange result:

The red dots are from my own predictor.
Why aren't the blue points in a straight line?

Question about GPU Memory for OFA Progressive Shrinking.

Hi Han Cai, thank you so much for responding to my previous question so quickly.

I'm trying to train the OFA model (ofa_mbv3) using 4 Nvidia Titan V and 2 Titan RTX GPUs.

But there is a problem when validating the subnet models.

I checked the code below, which validates subnets during progressive shrinking.

for setting, name in subnet_settings:
    run_manager.write_log('-' * 30 + ' Validate %s ' % name + '-' * 30, 'train', should_print=False)
    run_manager.run_config.data_provider.assign_active_img_size(setting.pop('image_size'))
    dynamic_net.set_active_subnet(**setting)
    run_manager.write_log(dynamic_net.module_str, 'train', should_print=False)

    run_manager.reset_running_statistics(dynamic_net)
    loss, top1, top5 = run_manager.validate(epoch=epoch, is_test=is_test, run_str=name, net=dynamic_net)
    losses_of_subnets.append(loss)
    top1_of_subnets.append(top1)
    top5_of_subnets.append(top5)
    valid_log += '%s (%.3f), ' % (name, top1)

Validating the first subnet (the first loop iteration) works without problems.
But when I try to validate the second subnet, a "CUDA out of memory" error occurs.

My GPUs have 12 GB (Titan V) and 24 GB (Titan RTX) of memory each.

How much memory do your GPUs have?
Also, please let me know if you have any guesses or recommendations for solving this error.

Thank you so much.

args.valid_size is wrong?

Looks like args.valid_size in train_ofa_net.py is set to 10000. Is that right? Seems to me that target size is much smaller than that (~200)

once-for-all/ofa/imagenet_classification/data_providers/base_provider.py", line 42, in random_sample_valid_set
assert train_size > valid_size
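
To see why a fixed valid_size of 10000 trips this assertion on a small dataset, here is a minimal sketch of a random train/validation index split. The function name mirrors the one in the traceback, but the body and the `seed` parameter are assumptions for illustration, not the repository's implementation.

```python
import random

def random_sample_valid_set(train_size, valid_size, seed=0):
    """Split indices 0..train_size-1 into disjoint train and valid index lists."""
    assert train_size > valid_size, "held-out set must be smaller than the dataset"
    rng = random.Random(seed)
    indices = list(range(train_size))
    rng.shuffle(indices)
    # The first valid_size shuffled indices are held out; the rest are for training.
    return indices[valid_size:], indices[:valid_size]

train_idx, valid_idx = random_sample_valid_set(train_size=50_000, valid_size=10_000)
```

With only a few hundred training images, any valid_size at or above the dataset size fails the check, so the value has to be scaled down along with the dataset.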

Rationale for not having the first MB layer Dynamic in mbv3 backbone

Hi,

Is there a design rationale for not making the first bottleneck layer Dynamic? Instead, the first bottleneck layer is used as a simple Residual block (Link). I believe a similar setup was carried out in ProxylessNAS as well. I am interested to hear the insights on why.

Thanks,
Vinod

Fine-tune transformation matrix in "Elastic Depth" stage and "Elastic Width" stage in the code, but not in the paper

According to the paper:

The algorithm should fine-tune the transformation matrix in the "Elastic Kernel Size" stage while freezing the transformation matrix in the "Elastic Depth" stage and "Elastic Width" stage. But in the code (line 45 in dynamic_op.py), it will be fine-tuned anyway.

Could you tell me why?

What does MACs mean?

Sorry to ask such a simple question, but I cannot find the answer anywhere. Could anyone help me?
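
For reference, MACs stands for multiply-accumulate operations: one MAC is one multiplication followed by one addition (a * b + c), and it is often counted as two FLOPs. As a worked example, the MAC count of a 2D convolution can be computed as below; `conv2d_macs` is a hypothetical helper written for this illustration, and it ignores bias terms.

```python
def conv2d_macs(in_channels, out_channels, kernel_size, out_h, out_w, groups=1):
    """Multiply-accumulate count of a 2D convolution (bias ignored)."""
    # Each output value needs one MAC per weight in its filter.
    per_output = (in_channels // groups) * kernel_size * kernel_size
    return per_output * out_channels * out_h * out_w

# 3x3 convolution, 3 -> 16 channels, producing a 112x112 feature map
standard = conv2d_macs(3, 16, 3, 112, 112)
# 3x3 depthwise convolution on 16 channels at the same resolution
depthwise = conv2d_macs(16, 16, 3, 112, 112, groups=16)
```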

How to train on custom dataset

As the title says, I want to train on a small dataset.
What should I do?

Latency Estimator (lookup table) & evolutionary search code neural-network-twins

Hey, thanks for your amazing work!

Could you share the latency estimator and the evolutionary search code, based on the neural-network twins, that are used to get a specialized sub-network?

Thanks

Why only 3x3 kernel size in resnet?

Question regarding implementation detail - re_organize_middle_weights

Channel selection for width control is done by the function re_organize_middle_weights in dynamic_layers. In line 144, the following operation is applied: importance[target_width:] = torch.arange(0, target_width - importance.size(0), -1).
I don't really understand this line. If importance is assumed to be sorted, then it does nothing to the order of importance. If it is not, then important channels can effectively be discarded.
What am I missing?
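
To make the arithmetic of the quoted line concrete, here is a small sketch using plain lists instead of tensors. It overwrites the scores from index target_width onward with 0, -1, -2, ... and then sorts in descending order. Because L1-norm importances are non-negative, the tail channels end up after the first target_width channels and keep their original relative order, whatever their original importance was. `reorder_with_frozen_tail` is a hypothetical name, and this illustrates the numeric effect only, not the authors' stated rationale.

```python
def reorder_with_frozen_tail(importance, target_width):
    """Return channel indices sorted by importance (descending), after
    overwriting the tail with 0, -1, -2, ... as the quoted line does."""
    scores = list(importance)
    tail_len = len(scores) - target_width
    # Same values as torch.arange(0, target_width - len(scores), -1)
    scores[target_width:] = [-i for i in range(tail_len)]
    # Python's sort is stable, so ties keep their original order
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Six channels; only the first four are re-sorted among themselves.
order = reorder_with_frozen_tail([0.5, 2.0, 1.0, 3.0, 9.0, 7.0], target_width=4)
```

In this example channels 4 and 5 had the largest raw importance but stay at the end, which is exactly the behaviour the question is asking about.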

Error when run train_ofa_net.py

Hi, this project is excellent work on NAS. I am very interested in it and tried it on my machine. But I get the following problem when running 'horovodrun -np 4 -H localhost:4 python train_ofa_net.py':

[1,1]:Traceback (most recent call last):
[1,1]: File "train_ofa_net.py", line 194, in
[1,1]: distributed_run_manager.broadcast()
[1,1]: File "/home/xiaobingt/xueshengke/code/once-for-all/ofa/imagenet_codebase/run_manager/distributed_run_manager.py", line 183, in broadcast
[1,1]: hvd.broadcast_parameters(self.net.state_dict(), 0)
[1,1]: File "/home/xiaobingt/horovod/env/lib/python3.7/site-packages/horovod/torch/__init__.py", line 476, in broadcast_parameters
[1,1]: handle = broadcast_async_(p, root_rank, name)
[1,1]: File "/home/xiaobingt/horovod/env/lib/python3.7/site-packages/horovod/torch/mpi_ops.py", line 449, in broadcast_async_
[1,1]: return _broadcast_async(tensor, tensor, root_rank, name)
[1,1]: File "/home/xiaobingt/horovod/env/lib/python3.7/site-packages/horovod/torch/mpi_ops.py", line 359, in _broadcast_async
[1,1]: tensor, output, root_rank, name.encode() if name is not None else _NULL)
[1,1]:RuntimeError: Internal error. Requested ReadyEvent with GPU device but not compiled with CUDA.

It seems this issue comes from my Horovod installation. But I installed Horovod successfully and can run its examples without error. I also googled but have found no solution yet. Can you help me?

Here is my environment:

Cudnn 7.6.5
Cudatoolkit 10.1.243
Openmpi 4.0.5
Python 3.7.8
Pytorch 1.5.1
Tensorflow-gpu 2.1.1

Bug for the implementation of knowledge distillation?

Thanks for sharing your code!
I'm wondering whether this is a bug in the implementation of knowledge distillation.
The function cross_entropy_loss_with_soft_target already uses nn.LogSoftmax:

once-for-all/imagenet_codebase/utils/pytorch_utils.py

Line 37 in 0807662

logsoftmax = nn.LogSoftmax()

Does it need to apply softmax on soft_logits here again? Thanks!

once-for-all/elastic_nn/training/progressive_shrinking.py

Line 115 in 0807662

soft_label = F.softmax(soft_logits, dim=1)
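
As a reference point for this question, the usual soft-target cross-entropy is -sum_i q_i * log p_i, where q is the softmax of the teacher's logits and p is the softmax of the student's logits. Under that reading, the softmax and the log-softmax act on two different sets of logits. The sketch below is a plain-Python illustration of the formula with made-up logits, not the repository's code.

```python
import math

def softmax(logits):
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def cross_entropy_with_soft_target(student_logits, soft_target):
    """-sum_i q_i * log p_i, with p = softmax(student_logits)."""
    return -sum(q * lp for q, lp in zip(soft_target, log_softmax(student_logits)))

teacher_logits = [2.0, 1.0, 0.1]
student_logits = [1.5, 1.2, 0.3]
soft_label = softmax(teacher_logits)     # teacher probabilities q
loss = cross_entropy_with_soft_target(student_logits, soft_label)
```

The loss is smallest, and equal to the entropy of q, when the student's distribution matches the teacher's.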

Question about args.dynamic_batchsize

Hi,

How does args.dynamic_batchsize work?

What should I do after train_ofa_net

I ran train_ofa_net.py and there are three folders under 'exp/': 'kernel2kernel_depth', 'kernel_depth2kernel_depth_width', and 'normal2kernel'. What should I do next? After training, each exp subfolder contains 'checkpoint logs net.config net_info.txt run.config'. Does anybody know how I should use them?

I cannot find any relation between the training results under exp and 'eval_ofa_net.py'. Please help this poor kid. \doge

How to deploy to mobile?

Thanks for the great work! This code uses a PyTorch model, but you mention that the models are deployed on mobile with TF-Lite. Do you convert the PyTorch model via ONNX, or do you implement it separately in TensorFlow?

channel sorting for elastic width

Hi, thanks for your work.
In the paper, a channel sorting algorithm based on the norm of each channel is introduced to support elastic width. However, I can't find this part in the code. Could anyone tell me where it is?

top5 performance

Hi and thanks for the amazing work,

What's the top5 accuracy on ImageNet of the model that achieved top1=80% reported in the paper?
This would help for my literature review where I only have top5 for some models.

Thanks,
Boris

Subnet retraining code

The project seems to contain only the training code for the supernet. Could you provide the retraining code for the subnets?

Questions about training supernet

Hi,

Thank you for your time on this issue.

I have some questions about the OFA supernet training phase.

Will the performance of the supernet always surpass that of the original model?
How should we modify the hyperparameter settings (LR, optimizer type) used for the original model's task?
Is the supernet's performance the ceiling for the performance of its subnets?

Thanks for your help and happy Chinese New Year!

Question about the calculation of importance(L1 Norm)

Thank you for your great job.

I have a question about the calculation of importance.
Here in Once for All, the importance is calculated per input channel (summing over the output dimension).

once-for-all/ofa/imagenet_classification/elastic_nn/modules/dynamic_layers.py

Line 263 in cfa0722

    
           importance = torch.sum(torch.abs(self.point_linear.conv.conv.weight.data), dim=(0, 2, 3))

But in Pruning_filters_for_efficient_convnets, the importance is calculated by the output dimension.

https://github.com/tyui592/Pruning_filters_for_efficient_convnets/blob/00ec7b7ae9e8f9bd3973888590728477e73537d9/prune.py#L69

sum_of_kernel = torch.sum(torch.abs(kernel.view(kernel.size(0), -1)), dim=1)

Is there any intrinsic reason to calculate it over the input dimension?
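
The difference between the two reductions can be seen on a tiny 1x1 convolution whose weight is indexed as weight[out][in]. Summing absolute values over the output index gives one score per input channel, while summing over the input index gives one score per output channel. For a layer such as point_linear, the input channels are the block's middle channels, which may be why the score is taken per input channel there. That reading is an assumption, and the helper names below are hypothetical.

```python
def l1_per_input_channel(weight):
    """weight[o][i] -> one score per input channel i (sum over outputs)."""
    n_in = len(weight[0])
    return [sum(abs(row[i]) for row in weight) for i in range(n_in)]

def l1_per_output_channel(weight):
    """weight[o][i] -> one score per output channel o (sum over inputs)."""
    return [sum(abs(w) for w in row) for row in weight]

# 1x1 conv with 2 output channels and 3 input channels
w = [[1.0, -2.0, 0.5],
     [0.0,  4.0, -0.5]]
per_input = l1_per_input_channel(w)    # length 3
per_output = l1_per_output_channel(w)  # length 2
```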

Thanks!

Question about progressive shrinking

Greetings
There is a function re_organize_middle_weights which re-sorts the convolution weights. However, the channel order of the input x remains the same after this operation.
Thus, the weights are no longer aligned with the input x. A mismatch between weights and input will change the output. Is this a big problem?

In set_running_statistics, CPU is used by default to forward images

forward_model is created by deep-copying the incoming model.
However, it is not moved to any GPU device.
Computing the mean and variance by forwarding batches of images on the CPU is time-consuming.
I think it would be better to assign a default device and move the copied model to it.

Evolution details

Hi, thanks for your excellent work.

How is the network architecture encoded and decoded during the evolution?

After reading the description of the accuracy predictor in the paper, it seems that the kernel size and expansion ratio of each layer are encoded first. Suppose one architecture is [3,4, ....., 0,0 ..... 3,6] and another is [3,4, ....., 7,4 ..... 3,6]. I have two questions about the evolution:

What if [0,0] and [7,4] cross over into [0,4]? This is not a valid gene.
If one stage is [1,1,0,0], the last two layers are skipped. If mutation produces [1,1,0,1], the last layer is not skipped but the third layer is, which violates the rules.
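
One way to sidestep both cases is to keep depth as its own gene rather than encoding skipped layers as zeros: every layer slot always carries a valid kernel size and expand ratio, and a per-stage depth decides how many leading layers are active. The sketch below follows the ks / e / d split that appears in net_config elsewhere on this page and the d234, e346, k357 choice sets from the model names. The functions themselves are hypothetical and are not the repository's evolution code.

```python
import random

KS_CHOICES, E_CHOICES, D_CHOICES = [3, 5, 7], [3, 4, 6], [2, 3, 4]
NUM_STAGES, MAX_DEPTH = 5, 4

def random_arch(rng):
    n = NUM_STAGES * MAX_DEPTH
    return {'ks': [rng.choice(KS_CHOICES) for _ in range(n)],
            'e': [rng.choice(E_CHOICES) for _ in range(n)],
            'd': [rng.choice(D_CHOICES) for _ in range(NUM_STAGES)]}

def crossover(a, b, rng):
    """Pick each gene from one parent; every gene stays a valid choice."""
    return {key: [rng.choice(pair) for pair in zip(a[key], b[key])] for key in a}

def mutate(arch, rng, prob=0.1):
    """Resample each gene from its own choice list with probability prob."""
    choices = {'ks': KS_CHOICES, 'e': E_CHOICES, 'd': D_CHOICES}
    return {key: [rng.choice(choices[key]) if rng.random() < prob else g
                  for g in genes] for key, genes in arch.items()}

def active_layers(arch):
    """Decode: keep only the first d layers of each stage."""
    layers = []
    for stage, depth in enumerate(arch['d']):
        start = stage * MAX_DEPTH
        layers += list(zip(arch['ks'][start:start + depth],
                           arch['e'][start:start + depth]))
    return layers

rng = random.Random(0)
child = mutate(crossover(random_arch(rng), random_arch(rng), rng), rng)
decoded = active_layers(child)
```

With this encoding a mixed gene such as [0,4] cannot arise, and a stage can never skip its third layer while keeping its fourth, because skipping is expressed only through the depth value.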

What is the role of 'reset_running_statistics' ?

On line 67 of progressive_shrinking.py, why do we need the 'reset_running_statistics' function to reset the 'mean' and 'var' values of the batch normalization layers to the 'mean' and 'var' obtained from 2000 random images?
run_manager.reset_running_statistics(dynamic_net)

I met some errors when using ofa_resnet50 for efficient deployment in tutorial/ofa.ipynb

First, I searched for a network:

Searching with note10 constraint (25): 100%|██████████| 500/500 [00:09<00:00, 51.03it/s]
Found best architecture on note10 with latency <= 25.00 ms in 9.84 seconds! It achieves 81.71% predicted accuracy with 24.73 ms latency on note10.
Architecture of the searched sub-net:
DyConv(O32, K3, S2)
(DyConv(O32, K3, S1), Identity)
DyConv(O64, K3, S1)
max_pooling(ks=3, stride=2)
(3x3_BottleneckConv_in->768->256_S1, avgpool_conv)
(3x3_BottleneckConv_in->768->256_S1, Identity)
(3x3_BottleneckConv_in->1536->256_S1, Identity)
(3x3_BottleneckConv_in->768->256_S1, Identity)
(3x3_BottleneckConv_in->2048->512_S2, avgpool_conv)
(3x3_BottleneckConv_in->2048->512_S1, Identity)
(3x3_BottleneckConv_in->3072->512_S1, Identity)
(3x3_BottleneckConv_in->2048->512_S1, Identity)
(3x3_BottleneckConv_in->6144->1024_S2, avgpool_conv)
(3x3_BottleneckConv_in->3072->1024_S1, Identity)
(3x3_BottleneckConv_in->4096->1024_S1, Identity)
(3x3_BottleneckConv_in->6144->1024_S1, Identity)
(3x3_BottleneckConv_in->4096->1024_S1, Identity)
(3x3_BottleneckConv_in->4096->1024_S1, Identity)
(3x3_BottleneckConv_in->8192->2048_S2, avgpool_conv)
(3x3_BottleneckConv_in->6144->2048_S1, Identity)
(3x3_BottleneckConv_in->12288->2048_S1, Identity)
(3x3_BottleneckConv_in->12288->2048_S1, Identity)
MyGlobalAvgPool2d(keep_dim=False)
DyLinear(2048, 1000)

However, the middle dimensions of the searched network look untrustworthy to me.

When I tried to evaluate this sub-network, I got this error:

Evaluating the sub-network with latency = 24.7 ms on note10
RuntimeError Traceback (most recent call last)
in
6 _, net_config, latency = result
7 print('Evaluating the sub-network with latency = %.1f ms on %s' % (latency, target_hardware))
----> 8 top1 = evaluate_ofa_subnet(
9 ofa_network,
10 imagenet_data_path,
~/桌面/once-for-all-master/ofa/tutorial/imagenet_eval_helper.py in evaluate_ofa_subnet(ofa_net, path, net_config, data_loader, batch_size, device)
18 assert len(net_config['ks']) == 20 and len(net_config['e']) == 20 and len(net_config['d']) == 5
19 ofa_net.set_active_subnet(ks=net_config['ks'], d=net_config['d'], e=net_config['e'])
---> 20 subnet = ofa_net.get_active_subnet().to(device)
21 calib_bn(subnet, path, net_config['r'][0], batch_size)
22 top1 = validate(subnet, path, net_config['r'][0], data_loader, batch_size, device)
~/桌面/once-for-all-master/ofa/imagenet_classification/elastic_nn/networks/ofa_resnets.py in get_active_subnet(self, preserve_weight)
226 active_idx = block_idx[:len(block_idx) - depth_param]
227 for idx in active_idx:
--> 228 blocks.append(self.blocks[idx].get_active_subnet(input_channel, preserve_weight))
229 input_channel = self.blocks[idx].active_out_channel
230 classifier = self.classifier.get_active_subnet(input_channel, preserve_weight)
~/桌面/once-for-all-master/ofa/imagenet_classification/elastic_nn/modules/dynamic_layers.py in get_active_subnet(self, in_channel, preserve_weight)
540
541 # copy weight from current layer
--> 542 sub_layer.conv1.conv.weight.data.copy_(
543 self.conv1.conv.get_active_filter(self.active_middle_channels, in_channel).data)
544 copy_bn(sub_layer.conv1.bn, self.conv1.bn.bn)

RuntimeError: The size of tensor a (768) must match the size of tensor b (88) at non-singleton dimension 0

I guess I need to modify the code for the ResNet50 network. Could you please tell me how to modify it? Thanks a lot.

Latency Table when target hardware is not samsung note 10

Hi, this work is awesome!

It looks like a latency table has been released only for the Samsung Note 10.

Could you release latency tables for other hardware, especially the 1080 Ti?

Thanks in advance.

Best,
Hayeon Lee.

Could you release the script that trains the pretrained model `ofa_D4_E6_K7`?

Hi, thanks for your great work!

Could you release the script that trains the pretrained model ofa_D4_E6_K7, i.e., the full network?

How many subnets does knowledge distillation optimize?

I have a question that is not clear from the paper. During knowledge distillation, do you optimize all 10^19 networks? The elastic_nn portion of the code seems to point to that:

	subnet_settings = []
	for d in depth_list:
		for e in expand_ratio_list:
			for k in ks_list:
				for w in width_mult_list:
					for img_size in image_size_list:
						subnet_settings.append([{
							'image_size': img_size,
							'd': d,
							'e': e,
							'ks': k,
							'w': w,
						}, 'R%s-D%s-E%s-K%s-W%s' % (img_size, d, e, k, w)])
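
Reading only the loops quoted above, they build one entry per combination of list values, and each d, e, and k is a single value rather than a per-layer choice. The number of entries is therefore the product of the list lengths. The sketch below reproduces the same enumeration with itertools.product so the count can be checked; the function name and the example lists are illustrative assumptions, not the repository's defaults.

```python
from itertools import product

def enumerate_subnet_settings(depth_list, expand_ratio_list, ks_list,
                              width_mult_list, image_size_list):
    """Same Cartesian product as the nested loops quoted above."""
    settings = []
    for d, e, k, w, img_size in product(depth_list, expand_ratio_list, ks_list,
                                        width_mult_list, image_size_list):
        config = {'image_size': img_size, 'd': d, 'e': e, 'ks': k, 'w': w}
        name = 'R%s-D%s-E%s-K%s-W%s' % (img_size, d, e, k, w)
        settings.append([config, name])
    return settings

# Example lists (illustrative values, not taken from the repo's defaults)
settings = enumerate_subnet_settings([2, 3, 4], [3, 4, 6], [3, 5, 7], [0], [224])
```

With three choices each for depth, expand ratio, and kernel size, and one width multiplier and image size, this yields 27 settings, which is far fewer than the full space of per-layer combinations.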

What is the accuracy of the teacher model (once-for-all model)?

Thank you for your excellent code!
You use a teacher-student distillation method when training the sub-models. What is the accuracy of the teacher model (kernel size 7, expansion ratio 6, and 4 layers in each unit)?

cannot import MyRandomResizedCrop from ofa.imagenet_classification.data_providers.base_provider

Details about finetuning (25 / 75 epochs)

Thanks for sharing your code for this excellent work!

Could you reveal more details about how you fine-tune your specialized sub-networks? I didn't find the code in the repo, but hyperparameters such as batch size, optimizer, learning rate, LR decay, and weight decay would also be very helpful.

Thanks again.

Tutorial for deploying with FPGA

Hi,

Congratulations on this great job. I was amazed by your solution in CVPR2020 competition.

Is there any tutorial to use this work on a FPGA ZynqUltrascale ZU3EG or ZU9EG?

Best regards,

Jorge

validation accuracy during training is higher than validation offline

Hi,

I found a strange problem: when I train the teacher network, the final, largest network reaches 82.2% accuracy, but when I test the saved model, the accuracy drops to 79.7%.

Can anybody help explain this phenomenon?

Best,

Incorrect accuracy while testing the pretrained ofa network

Thanks for sharing your code.

I have some problems when testing the pretrained OFA network.

I built an OFA network using the code provided in the README.

from model_zoo import ofa_net
ofa_network = ofa_net('ofa_mbv3_d234_e346_k357_w1.0', pretrained=True)
    
ofa_network.set_active_subnet(ks=7, e=6, d=4)
subnet = ofa_network.get_active_subnet(preserve_weight=True)

# test the subset on the validation set of the ImageNet

When I set the parameters ks, e, and d to different values, the accuracy of the OFA network drops to about 0 in some cases. The test results are shown below:

Have I made a mistake while testing the pretrained OFA network?

Question about training for model `ofa_D4_E6_K7`

Hello
I want to train the OFA model (ofa_mbv3) on CIFAR-100 or custom datasets,

so I would like some training details about the first supernet.

When I checked the model in the progressive-shrinking phase,
I saw that the weights of the F.Linear (kernel transformation) layers were also trained.

When I train the first supernet (ofa_D4_E6_K7), should I train the kernel transformation matrices there?

Also, if you have any information about training OFA nets on other datasets (such as CIFAR-10 or CIFAR-100), I would like to know it.

Thank you as always.

How to export one of the subnets?

I'd like to export some of the subnets, either as an ONNX file or as a pth file plus the net class. How can I do it?

About model converter code (Pytorch to Tensorflow-Lite) for on-device inference latency test.

I want to use an OFA-specialized model on my device,
so I need to convert the specialized OFA model (PyTorch) to a TF-Lite model.

I tried to convert the model with ONNX, but some errors occurred
(e.g. ValueError: Shape must be rank 2 but is rank 1 for MatMul ...).

I also want to build latency tables using my own device.

How can I get them?
Thank you.

two questions about ofa

Thanks for sharing your excellent work. I have two questions about OFA.

Different hardware platforms optimize different ops, and we often choose efficient ops according to the platform. Can OFA handle the situation where different hardware platforms prefer different ops?
On mobile platforms, different camera sensors produce different data, so each hardware platform has different training data. When we use OFA for a generative network such as SRGAN, which platform's training data should be used?

Applying this technique to detection problem (training on the COCO detection dataset)?

Dear @han-cai ,

Thanks for sharing your excellent work. I am just curious about how we can apply this technique to detection problem (training on the COCO detection dataset)? Appreciate your time.