mit-han-lab / once-for-all
[ICLR 2020] Once for All: Train One Network and Specialize it for Efficient Deployment
Home Page: https://ofa.mit.edu/
License: MIT License
Hi, thanks for your great work!
I am interested in training the once-for-all network, but I ran into some problems while diving into your training code.
Line 198 in train_ofa_net.py loads the teacher model's weights. Is this an already-trained teacher model, so that this training code only performs the progressive shrinking?
Besides, the arguments arg.task and arg.phase never seem to change during training. Am I right? If so, do I need to run the training multiple times with different arguments?
Thanks.
Hi, I'm very interested in your work.
I want to use the accuracy predictor with some other configurations (e.g., the ResNet-based OFA).
I saw the tutorial code about acc_predictor that you uploaded, so I understand what it looks like, and I read the appendix of your paper on the details of the accuracy predictor.
I have a question about how much training data the accuracy predictor needs. When you trained acc_predictor, the ground truths were measured on the whole ImageNet validation set; how many ground-truth samples are needed?
Also, I would like to know the hyperparameters used to train the accuracy predictor.
I hope you can answer my questions. Thank you!
I want to generate a latency_table for my own hardware. Can you give me some advice?
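For context, the general idea behind a latency table (this is only a sketch with made-up numbers and a made-up key format, not the repo's actual table layout): measure each candidate op configuration once on the target device, store the result in a dictionary, and estimate a subnet's latency as the sum of its ops' entries.

```python
# Hypothetical latency table: key = (op_name, resolution, kernel, expand),
# value = latency in ms measured on the target device (numbers are made up).
latency_table = {
    ("first_conv", 224, 3, 0): 1.2,
    ("mbconv",     112, 3, 4): 0.8,
    ("mbconv",     112, 5, 6): 1.5,
    ("classifier",   7, 0, 0): 0.3,
}

def estimate_latency(op_keys):
    """Estimate a subnet's latency as the sum of its per-op table entries."""
    return sum(latency_table[k] for k in op_keys)

subnet = [("first_conv", 224, 3, 0), ("mbconv", 112, 3, 4), ("classifier", 7, 0, 0)]
print("estimated latency: %.1f ms" % estimate_latency(subnet))
```

The lookup-table approach works because mobile/edge inference is mostly sequential, so per-op latencies add up well in practice.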
Hi,
I am following your wonderful CVPR20 tutorial.
I am interested in getting lower latency than what I could obtain using ofa_proxyless_d234_e346_k357_w1.3.
So I was trying out ofa_proxyless_d234_e346_k357_w1.0 (width 1.0 instead of 1.3). However, I am not able to find a pre-trained network at https://hanlab.mit.edu/files/OnceForAll/ofa_nets/. Is it possible to share it with me?
Thanks
Hi, Thanks for your great work!
When I was testing the latency on a V100, the results confused me.
I used the following code to measure the latency:
torch.cuda.empty_cache()
img_L = img_L.cuda()
start.record()
out = ofa_network(img_L)
end.record()
torch.cuda.synchronize()
run_time.update(start.elapsed_time(end))
Here img_L is a single image.
Is this correct?
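One general GPU-benchmarking point worth checking (not specific to this repo): a single timed forward pass usually includes one-off costs such as kernel selection and memory allocation. The usual pattern is to warm up first, then report the median over many iterations; the CUDA-event snippet above would sit where fn() is called. A stand-in sketch using wall-clock time:

```python
import time
import statistics

def benchmark_ms(fn, warmup=20, iters=100):
    """Run fn a few times to warm up (lazy init, caching), then return the
    median latency in ms over many timed iterations. With CUDA, each timed
    call should also synchronize, since kernel launches are asynchronous."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(samples)

lat = benchmark_ms(lambda: sum(range(10000)))  # stand-in for a forward pass
```

The median is preferred over the mean here because occasional scheduler or thermal spikes skew averages upward.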
@zhijian-liu @Lyken17 @tonylins @mzahran001 @songhan ImportError: Extension horovod.torch has not been built. If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error
Hi, I noticed that here (link1) the base accuracy is added when training the accuracy predictor, and the last linear layer's bias is set to False.
For the same batch of data, they share the same bias.
If I do not add this bias, will the predictor's accuracy decrease? Or am I missing some analysis in the paper?
I sampled 800 net IDs from the OFA net; the initial parameters are from 'https://hanlab.mit.edu/files/OnceForAll/ofa_nets/'.
I want to use these subnets to test the predictor in ofa/tutorial/accuracy_predictor.py; the predictor loads the pretrained model 'https://hanlab.mit.edu/files/OnceForAll/tutorial/acc_predictor.pth'.
The picture shows a strange result, like this:
The red dots are from my own predictor.
Why aren't the blue points on a straight line?
Hi Han Cai, thank you so much for responding to my previous question so quickly.
I'm trying to train an OFA model (ofa_mbv3) using 4 Nvidia Titan V and 2 Titan RTX GPUs.
But there's a problem when validating the subnet models.
I checked the code below for the progressive-shrinking validation.
for setting, name in subnet_settings:
    run_manager.write_log('-' * 30 + ' Validate %s ' % name + '-' * 30, 'train', should_print=False)
    run_manager.run_config.data_provider.assign_active_img_size(setting.pop('image_size'))
    dynamic_net.set_active_subnet(**setting)
    run_manager.write_log(dynamic_net.module_str, 'train', should_print=False)
    run_manager.reset_running_statistics(dynamic_net)
    loss, top1, top5 = run_manager.validate(epoch=epoch, is_test=is_test, run_str=name, net=dynamic_net)
    losses_of_subnets.append(loss)
    top1_of_subnets.append(top1)
    top5_of_subnets.append(top5)
    valid_log += '%s (%.3f), ' % (name, top1)
Validating the 1st subnet in the loop is no problem.
But when I try to validate the 2nd subnet, a "CUDA out of memory" error occurs.
My GPUs have 12 GB (Titan V) and 24 GB (Titan RTX) of memory each.
How much GPU memory do you have?
Also, please let me know if you have any guesses or recommendations for solving this error.
Thank you so much.
Looks like args.valid_size in train_ofa_net.py is set to 10000. Is that right? It seems to me that the target size should be much smaller than that (~200).
once-for-all/ofa/imagenet_classification/data_providers/base_provider.py", line 42, in random_sample_valid_set
assert train_size > valid_size
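For what it's worth, that assertion just guards the index split: the held-out validation subset must be strictly smaller than the training pool it is carved from. A minimal sketch of that kind of split (a hypothetical helper, not the repo's exact code):

```python
import random

def split_valid_set(train_size, valid_size, seed=0):
    """Carve a validation subset out of the indices [0, train_size).
    Requires valid_size < train_size, matching the assert above."""
    assert train_size > valid_size
    rng = random.Random(seed)
    indexes = list(range(train_size))
    rng.shuffle(indexes)
    return indexes[valid_size:], indexes[:valid_size]  # (train, valid)

train_idx, valid_idx = split_valid_set(1000, 200)
```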
Hi,
Is there a design rationale for not making the first bottleneck layer dynamic? Instead, the first bottleneck layer is used as a simple residual block (link). I believe a similar setup was used in ProxylessNAS as well. I would be interested to hear your insights on why.
Thanks,
Vinod
According to the paper:
The algorithm should fine-tune the transformation matrix in the "Elastic Kernel Size" stage while freezing it in the "Elastic Depth" and "Elastic Width" stages. But in the code (line 45 in dynamic_op.py), it is fine-tuned regardless.
Could you tell me why?
Sorry to ask such a simple question, but I cannot find the solution anywhere. Could anyone help me?
As the title says, I want to train on a small dataset.
What should I do?
Hey, thanks for your amazing work!
can you share the Latency Estimator and the evolutionary search code based on the neural-network-twins to get a specialized sub-network?
Thanks
In channel selection for width control, the function re_organize_middle_weights in dynamic_layers applies the following operation on line 144: importance[target_width:] = torch.arange(0, target_width - importance.size(0), -1).
I don't really understand this line. If importance is assumed to be sorted, then it does nothing to the order of importance. If it is not, then important channels can effectively be discarded.
What am I missing?
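A toy reproduction of what that line does may help frame the question. After the assignment, every channel past target_width gets a strictly decreasing non-positive score (0, -1, -2, ...), so a descending sort keeps the first target_width channels ranked by their real importance and leaves the masked tail in its original relative order. All values below are made up:

```python
# Per-channel importance scores, e.g. L1 norms (values made up).
target_width = 3
importance = [0.9, 0.1, 0.5, 0.7, 0.2, 0.4]
n = len(importance)

# Equivalent of: importance[target_width:] = torch.arange(0, target_width - n, -1)
importance[target_width:] = [-i for i in range(n - target_width)]  # [0, -1, -2]

# Descending sort: kept channels are ranked by importance; masked channels
# stay in their original order at the tail.
order = sorted(range(n), key=lambda i: importance[i], reverse=True)
```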
Hi, this project is an excellent work on NAS. I am very interested in it and tried it on my machine. But I got the following problem when running 'horovodrun -np 4 -H localhost:4 python train_ofa_net.py':
It seems this issue comes from my Horovod. But I have installed 'horovod' successfully and can run its examples without error. I also googled it, but no solution has been found yet. Can you help me?
Here is my environment:
Thanks for sharing your code!
I'm wondering if this is a bug in the implementation of knowledge distillation.
Since cross_entropy_loss_with_soft_target already uses nn.LogSoftmax, does softmax need to be applied to soft_logits here again? Thanks!
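For reference, the standard soft-target cross-entropy is -Σ p_teacher · log_softmax(student_logits), where p_teacher comes from softmax-ing the teacher's logits exactly once. A small pure-Python sketch (logit values made up) showing that softmax-ing an already-normalized distribution yields a different, flatter target and therefore a different loss:

```python
import math

def log_softmax(xs):
    """Numerically stable log-softmax over a list of logits."""
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]

def softmax(xs):
    return [math.exp(v) for v in log_softmax(xs)]

def soft_target_ce(student_logits, soft_targets):
    """Cross-entropy with soft targets: -sum(p_teacher * log q_student)."""
    return -sum(p * lq for p, lq in zip(soft_targets, log_softmax(student_logits)))

teacher_logits = [2.0, 0.5, -1.0]
student_logits = [1.5, 0.7, -0.5]
# Correct: softmax the teacher's logits exactly once to get the targets.
loss_once = soft_target_ce(student_logits, softmax(teacher_logits))
# Softmax applied twice flattens the targets toward uniform, changing the loss.
loss_twice = soft_target_ce(student_logits, softmax(softmax(teacher_logits)))
```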
Hi,
How does args.dynamic_batchsize work?
I ran train_ofa_net.py, and there are three folders under 'exp/': 'kernel2kernel_depth', 'kernel_depth2kernel_depth_width', and 'normal2kernel'. What should I do next? There are 'checkpoint logs net.config net_info.txt run.config' under each exp subfolder after training. Does anybody know how I should deal with them?
I cannot find any relation between the training exp results and 'eval_ofa_net.py'. Please help this poor kid. \doge
Thanks for the great work! This code uses a PyTorch model, but you mention that the models are deployed on mobile in TF-Lite. Do you convert the PyTorch model with ONNX, or implement it in TensorFlow separately?
Hi, thanks for your work.
In the paper, to support elastic width, a channel sorting algorithm based on the norm of each channel is introduced. However, I can't find this part in the code. Could anyone tell me its location?
Hi and thanks for the amazing work,
What is the top-5 accuracy on ImageNet of the model that achieved the top-1 accuracy of 80% reported in the paper?
This would help for my literature review where I only have top5 for some models.
Thanks,
Boris
The project seems to contain only the supernet training code. Could you provide the retraining code for the subnets?
Hi,
Thanks for your time regarding this issue.
I have some questions about the OFA supernet training phase.
Will the performance of the supernet always surpass the performance of the original model?
How should we modify the hyperparameter settings from the original model's task (learning rate, optimizer type)?
Is the performance of the supernet the ceiling of the performance of its subnets?
Thanks for your help and happy Chinese New Year!
Thank you for your great work.
I have a question about the calculation of importance.
Here in Once-for-All, importance is calculated along the input dimension.
But in Pruning Filters for Efficient ConvNets, importance is calculated along the output dimension:
sum_of_kernel = torch.sum(torch.abs(kernel.view(kernel.size(0), -1)), dim=1)
Is there any intrinsic reason to calculate it along the input dimension?
Thanks!
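To make the difference concrete, here is a toy comparison on a conv weight of shape (out_channels, in_channels, k, k), with made-up values and plain nested lists instead of a tensor. My (hedged) understanding is that OFA sorts the block's middle channels, which are the input channels of the second pointwise conv, while the pruning paper scores each output filter:

```python
# Toy conv weight, shape (out=2, in=3, 1, 1), nested lists instead of a tensor.
W = [[[[1.0]], [[-2.0]], [[3.0]]],
     [[[0.5]], [[0.5]], [[-0.5]]]]

def l1_per_output_filter(w):
    """One score per output filter (the pruning-paper convention)."""
    return [sum(abs(v) for cin in f for row in cin for v in row) for f in w]

def l1_per_input_channel(w):
    """One score per input channel (reduce over the output dimension)."""
    n_in = len(w[0])
    return [sum(abs(v) for f in w for row in f[c] for v in row)
            for c in range(n_in)]
```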
Greetings
There is a function re_organize_middle_weights which re-sorts the convolution weights. However, the order of the input x remains the same after this operation.
Thus, the weights are misordered relative to the input x, and a mismatch between weights and input will change the output. Is this a big problem?
forward_model is created by deep-copying the incoming model.
However, it is not deployed on any GPU device.
It is time-consuming to calculate the mean and variance by forwarding batches of images on the CPU.
I think it would be better to assign a default device and deploy the copied model on it.
Hi, thanks for your excellent work.
How are the network architectures encoded and decoded during the evolution?
After reading the description of the accuracy predictor in the paper, it seems that the kernel size and expansion of each layer are encoded first. If one architecture is [3,4, ....., 0,0 ..... 3,6] and another is [3,4, ....., 7,4 ..... 3,6], there are two questions about the evolution:
What if crossover between [0,0] and [7,4] produces [0,4]? This is not a valid gene.
If one stage is [1,1,0,0], the last two layers are skipped. If a mutation produces [1,1,0,1] during the evolution process, the last layer is not skipped but the third layer is, which does not follow the rules.
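One common way to handle both cases (purely a sketch of a repair strategy, not necessarily what this repo does) is to repair offspring after crossover/mutation: re-sample half-skipped genes like [0,4], and push skip genes to the tail of each stage so the active layers stay contiguous:

```python
import random

SKIP = (0, 0)  # a (kernel, expand) pair marking a skipped layer

def repair_stage(stage, rng):
    """Fix an offspring stage: re-sample invalid half-skip genes like (0, 4),
    and move all skip genes to the tail so active layers are contiguous."""
    active = []
    for k, e in stage:
        if (k, e) == SKIP:
            continue
        if k == 0 or e == 0:                     # invalid crossover product
            k, e = rng.choice([3, 5, 7]), rng.choice([3, 4, 6])
        active.append((k, e))
    return active + [SKIP] * (len(stage) - len(active))

fixed = repair_stage([(3, 4), SKIP, (5, 6), (0, 4)], random.Random(0))
```

An alternative with the same effect is to constrain the operators themselves, e.g. mutate only whole stages or cross over at stage boundaries, so invalid genes never arise.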
On line 67 of progressive_shrinking.py, why do we need the 'reset_running_statistics' function to reset both the 'mean' and 'var' values of the batch norm layers to the mean and var obtained from 2000 random images?
run_manager.reset_running_statistics(dynamic_net)
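As background (my understanding, not an authoritative answer): the supernet's accumulated running statistics mix activations from many different subnets, so they match no single active subnet; re-estimating mean/var from a small calibration sample fixes that. Numerically the re-estimation is just an aggregate over the calibration activations, as in this plain-Python sketch:

```python
def estimate_bn_stats(batches):
    """Re-estimate a BN layer's mean/var from calibration batches (plain-Python
    sketch of the statistics that reset_running_statistics recomputes)."""
    n, s, ss = 0, 0.0, 0.0
    for batch in batches:        # each batch: activation values for one channel
        for x in batch:
            n += 1
            s += x
            ss += x * x
    mean = s / n
    return mean, ss / n - mean * mean   # (mean, biased variance)

mean, var = estimate_bn_stats([[1.0, 2.0], [3.0, 4.0]])
```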
Searching with note10 constraint (25): 100%|██████████| 500/500 [00:09<00:00, 51.03it/s]
Found best architecture on note10 with latency <= 25.00 ms in 9.84 seconds! It achieves 81.71% predicted accuracy with 24.73 ms latency on note10.
Architecture of the searched sub-net:
DyConv(O32, K3, S2)
(DyConv(O32, K3, S1), Identity)
DyConv(O64, K3, S1)
max_pooling(ks=3, stride=2)
(3x3_BottleneckConv_in->768->256_S1, avgpool_conv)
(3x3_BottleneckConv_in->768->256_S1, Identity)
(3x3_BottleneckConv_in->1536->256_S1, Identity)
(3x3_BottleneckConv_in->768->256_S1, Identity)
(3x3_BottleneckConv_in->2048->512_S2, avgpool_conv)
(3x3_BottleneckConv_in->2048->512_S1, Identity)
(3x3_BottleneckConv_in->3072->512_S1, Identity)
(3x3_BottleneckConv_in->2048->512_S1, Identity)
(3x3_BottleneckConv_in->6144->1024_S2, avgpool_conv)
(3x3_BottleneckConv_in->3072->1024_S1, Identity)
(3x3_BottleneckConv_in->4096->1024_S1, Identity)
(3x3_BottleneckConv_in->6144->1024_S1, Identity)
(3x3_BottleneckConv_in->4096->1024_S1, Identity)
(3x3_BottleneckConv_in->4096->1024_S1, Identity)
(3x3_BottleneckConv_in->8192->2048_S2, avgpool_conv)
(3x3_BottleneckConv_in->6144->2048_S1, Identity)
(3x3_BottleneckConv_in->12288->2048_S1, Identity)
(3x3_BottleneckConv_in->12288->2048_S1, Identity)
MyGlobalAvgPool2d(keep_dim=False)
DyLinear(2048, 1000)
But I think the middle dimensions of the searched network look a bit untrustworthy.
Evaluating the sub-network with latency = 24.7 ms on note10
RuntimeError Traceback (most recent call last)
in
6 , net_config, latency = result
7 print('Evaluating the sub-network with latency = %.1f ms on %s' % (latency, target_hardware))
----> 8 top1 = evaluate_ofa_subnet(
9 ofa_network,
10 imagenet_data_path,
~/桌面/once-for-all-master/ofa/tutorial/imagenet_eval_helper.py in evaluate_ofa_subnet(ofa_net, path, net_config, data_loader, batch_size, device)
18 assert len(net_config['ks']) == 20 and len(net_config['e']) == 20 and len(net_config['d']) == 5
19 ofa_net.set_active_subnet(ks=net_config['ks'], d=net_config['d'], e=net_config['e'])
---> 20 subnet = ofa_net.get_active_subnet().to(device)
21 calib_bn(subnet, path, net_config['r'][0], batch_size)
22 top1 = validate(subnet, path, net_config['r'][0], data_loader, batch_size, device)
~/桌面/once-for-all-master/ofa/imagenet_classification/elastic_nn/networks/ofa_resnets.py in get_active_subnet(self, preserve_weight)
226 active_idx = block_idx[:len(block_idx) - depth_param]
227 for idx in active_idx:
--> 228 blocks.append(self.blocks[idx].get_active_subnet(input_channel, preserve_weight))
229 input_channel = self.blocks[idx].active_out_channel
230 classifier = self.classifier.get_active_subnet(input_channel, preserve_weight)
~/桌面/once-for-all-master/ofa/imagenet_classification/elastic_nn/modules/dynamic_layers.py in get_active_subnet(self, in_channel, preserve_weight)
540
541 # copy weight from current layer
--> 542 sub_layer.conv1.conv.weight.data.copy(
543 self.conv1.conv.get_active_filter(self.active_middle_channels, in_channel).data)
544 copy_bn(sub_layer.conv1.bn, self.conv1.bn.bn)
RuntimeError: The size of tensor a (768) must match the size of tensor b (88) at non-singleton dimension 0
I guess I need to modify the code for the ResNet-50 network. Please tell me how to modify it. Thanks a lot.
Hi, this work is awesome!
The latency table looks released only for the Samsung Note 10.
Could you release other latency tables, especially for the 1080 Ti?
Thanks in advance.
Best,
Hayeon Lee.
Hi, thanks for your great work!
Could you release the script that trains the pretrained model ofa_D4_E6_K7, i.e., the full network?
I have a question that is not clarified in the paper. During knowledge distillation, do you optimize all 10^19 networks? The elastic-nn portion of the code seems to point to that:
subnet_settings = []
for d in depth_list:
    for e in expand_ratio_list:
        for k in ks_list:
            for w in width_mult_list:
                for img_size in image_size_list:
                    subnet_settings.append([{
                        'image_size': img_size,
                        'd': d,
                        'e': e,
                        'ks': k,
                        'w': w,
                    }, 'R%s-D%s-E%s-K%s-W%s' % (img_size, d, e, k, w)])
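For what it's worth, that nested product only enumerates a handful of validation settings from the small per-dimension choice lists; training cannot enumerate ~10^19 subnets. My understanding is that each training step instead samples a few random subnets and aggregates their gradients. A sketch of such a sampler (the list values mirror the d234_e346_k357 naming; the helper itself is hypothetical):

```python
import random

DEPTHS, EXPANDS, KERNELS = [2, 3, 4], [3, 4, 6], [3, 5, 7]
N_STAGES = 5
MAX_LAYERS = N_STAGES * max(DEPTHS)   # per-layer genes for the deepest case

def sample_subnet_setting(rng):
    """Draw one random subnet configuration, as done per training step."""
    return {
        "d": [rng.choice(DEPTHS) for _ in range(N_STAGES)],
        "e": [rng.choice(EXPANDS) for _ in range(MAX_LAYERS)],
        "ks": [rng.choice(KERNELS) for _ in range(MAX_LAYERS)],
    }

setting = sample_subnet_setting(random.Random(0))
```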
Thank you for your excellent code!
You use a teacher-student distillation method when training sub-models. What is the accuracy of the teacher model (kernel size 7, expansion 6, and 4 layers in each unit)?
Thanks for sharing your code for this excellent work!
Could you reveal more details about how you fine-tune your specialized sub-networks? I didn't find the code in the repo, but hyperparameters like batch size, optimizer, learning rate, LR decay, and weight decay would also be very helpful.
Thanks again.
Hi,
Congratulations on this great work. I was amazed by your solution in the CVPR 2020 competition.
Is there any tutorial for using this work on an FPGA (Zynq UltraScale+ ZU3EG or ZU9EG)?
Best regards,
Jorge
Hi,
I found a strange problem: when I train the teacher network, the final, biggest network reaches an accuracy of 82.2%, but when I test the saved model, the accuracy drops to 79.7%.
Can anybody help explain this phenomenon?
Best,
Thanks for sharing your code.
I have some problems when testing the OFA pretrained network.
I build an OFA network using the code provided in the README:
from model_zoo import ofa_net
ofa_network = ofa_net('ofa_mbv3_d234_e346_k357_w1.0', pretrained=True)
ofa_network.set_active_subnet(ks=7, e=6, d=4)
subnet = ofa_network.get_active_subnet(preserve_weight=True)
# test the subset on the validation set of the ImageNet
When I set the parameters ks, e, and d to different values, the accuracy of the OFA network becomes about 0 in some cases. I show the test results in the following:
Have I made some mistake while testing the OFA pretrained network?
Hello
I want to train an OFA model (ofa_mbv3) on CIFAR-100 or custom datasets,
so I want to get some training details about the first supernet.
When I checked the model in the progressive-shrinking phase, I saw that the F.linear (kernel transformation) layers' weights were also trained.
When I train the first supernet (ofa_D4_E6_K7), should I train those kernel transformation matrices as well?
Also, if you have any information about training OFA nets on other datasets (like CIFAR-10/100), I would like to know it.
Thank you, as always.
I'd like to export some of the subnets as an ONNX file, or as a .pth file plus a net class. How can I do it?
Hi
I want to use an OFA specialized model on my device,
so I need to convert the specialized OFA model (PyTorch) to a TF-Lite model.
I tried to convert the model with ONNX, but some errors occurred
(e.g., ValueError: Shape must be rank 2 but is rank 1 for MatMul ...).
I also want to build latency tables using my device's resources.
How can I get them?
Thank you.
Thanks for sharing your excellent work. I have two questions about OFA.
Dear @han-cai ,
Thanks for sharing your excellent work. I am just curious how we can apply this technique to detection problems (training on the COCO detection dataset). I appreciate your time.
Hi,
Thanks for the Amazing work.
I want to train the OFA network on our custom dataset. How should I go about it?
Looking forward to your reply.
Thanks,
Darshan
Hi, can you share the script used to generate 'latency_table@XXX/YYY_lookup_table.yaml'? I want to know the details.
This may out-perform RegNet.