Comments (17)
Hey Claydon,
Given that the batch size I use is only 4, multiple GPUs may not be a good way to accelerate training. If I understand correctly, DataParallel is meant for models with a large batch size, and your error is likely that some GPUs end up with only one training image.
To accelerate training, I would suggest working on the meta-test stage, which is the main reason our model trains slowly. In every meta-test step we need to build new computation graphs for the meta-test model, which is extremely slow. Without the meta-test stage, one epoch may take only around 20 minutes; with it, an epoch takes over an hour.
Best wishes,
Xiao
from dgnet.
Hi Jiayi,
Thank you for your interest in our work.
The current code ships with un-tuned parameters. In my experiments I had to tune several hyperparameters to obtain better results for each case, and I only saved the pre-trained models, not the hyperparameters for each case.
I can offer several tips. You can change the resampling rate in the data loader for training and testing by varying the 1.1 in "resize_order = re / 1.1" between 1.0 and 1.3. I also ran experiments with input sizes 224x224 and 288x288; sometimes 288x288 gives better results, but GPU memory becomes an issue. I also tried tuning meta_step_size from 0.001 to 0.01. Finally, the three training parameters k_un = 1, k1 = 20, and k2 = 2 matter as well.
I will try to find the tuned code, at least for BCD->A, so that you can play with it.
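For intuition, here is a minimal sketch of how the divisor changes the effective input size. This is plain Python, not the repo's code: `resampled_size`, its defaults, and the assumption that `re` is a spacing-derived scale factor read in the data loader are all illustrative.

```python
# Hedged sketch: how varying the divisor in "resize_order = re / 1.1"
# changes the resampled image size. 're' is assumed to be a
# spacing-derived scale factor from the data loader; the function name
# and defaults here are illustrative, not the repo's actual API.
def resampled_size(height, width, re, divisor=1.1):
    resize_order = re / divisor
    return round(height * resize_order), round(width * resize_order)

print(resampled_size(224, 224, re=1.25))               # divisor 1.1
print(resampled_size(224, 224, re=1.25, divisor=1.3))  # coarser resampling
```

A larger divisor shrinks the image, so the anatomy of interest occupies relatively fewer pixels; that is the knob being tuned between 1.0 and 1.3.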
Best wishes,
Xiao
Hi Xiao,
Thanks for your reminder!
You may kindly send the tuned code to my email ([email protected]) or just make it public on GitHub if you can find it.
Jiayi
Hi Jiayi,
I just started re-training. I will release the tuned version for BCD->A soon.
Feel free to talk with me at MICCAI 2021.
Best wishes,
Xiao
Hi Xiao,
Thanks for your effort retraining the model. I attempted to tune the parameters and currently have several questions regarding the implementation details:
- For "resize_order = re / 1.1": you said it changes the resampling rate in the data loader, but after looking through the code, isn't it directly scaling the values of the raw data?
- For "k_un = 1, k1 = 20, k2 = 2": I take k_un to be the number of iterations for the meta-train and meta-test steps, but I wonder about the reason for the 'un' in names like 'k_un', 'un_imgs', and 'un_reco_loss'. I also wonder how changing k1 and k2 could affect inference results, since as far as I can see they are only used to record the training process.
I hope you can tell me where I went wrong. Thanks in advance!
Jiayi
Hi Jiayi,
The resampling rate is the order used to rescale the images, which affects how large the anatomy of interest appears.
"un" means unlabeled. k1 and k2 affect the learning rate decay.
Best wishes,
Xiao
Hi Xiao,
Thank you for your reminder.
I noticed that you call 'scheduler.step(val_score)' twice in train_meta.py. Is that meant to accelerate the learning rate decay? If so, why not tune the step_size parameter of lr_scheduler.StepLR instead?
Thanks.
Jiayi
Hi Jiayi,
Oh yes, good point. I think that is a mistake from copying another version into this public one.
There should be only one step() call. The duplicate makes the LR decay too quickly, so the model converges to a local minimum too early.
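To see the effect concretely: whichever scheduler is used, a duplicated step() doubles the decay rate. Here is a small pure-Python sketch of a StepLR-style schedule (not the repo's code; names and values are illustrative):

```python
# Hedged sketch (not the repo's code): StepLR-style decay, where the LR
# is multiplied by gamma every `step_size` scheduler steps.
def steplr_schedule(initial_lr, step_size, gamma, num_steps):
    lr = initial_lr
    history = []
    for step in range(1, num_steps + 1):
        if step % step_size == 0:
            lr *= gamma
        history.append(lr)
    return history

epochs = 20
# one scheduler.step() per epoch:
once = steplr_schedule(2e-5, step_size=10, gamma=0.1, num_steps=epochs)
# the duplicated call: two scheduler steps per epoch, same number of epochs
twice = steplr_schedule(2e-5, step_size=10, gamma=0.1, num_steps=2 * epochs)
print(once[-1])   # decayed twice over 20 epochs
print(twice[-1])  # decayed four times -- the LR collapses much faster
```

With the duplicate call, the final LR after 20 epochs is 100x smaller than intended, which matches the premature convergence described above.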
Best wishes,
Xiao
Hi Xiao,
I am currently tuning the model myself according to your tips. May I ask exactly how you tuned the parameters? Did you use grid search, random search, Bayesian optimization, or some other method? Is there a priority among the parameters, i.e., which should be tuned first?
Thanks in advance.
Jiayi
Hi Jiayi,
Tuning the model is a little tricky. I tune the hyperparameters by checking the losses and visuals during training. You may first change the resampling rate a bit to see how it affects the results, keeping k1 and k2 fixed if you are just experimenting. If you want to use our model as a baseline, I suggest you wait for the tuned version, or I can share the well-trained model weights with you.
I checked the results of the 5% cases with the current version. It seems that for the BCD->A 5% case the current parameters work well, and the results are even better than those I report in the paper. I am busy this week presenting this paper at MICCAI, but I will hopefully post more training details in the coming weeks.
Best wishes,
Xiao
Great! Thanks.
Hi Xiao,
Kindly note that citation 10 in this paper might be wrong. You presumably meant to cite the paper that proposes the Dice loss, but that is not what [10] actually is.
Best,
Jiayi
Hi Jiayi,
Thanks for that. I personally prefer to cite the very first Dice paper, but yes, to be accurate, the following paper should be cited:
Milletari, F., Navab, N., Ahmadi, S.A., 2016. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. 2016 Fourth International Conference on 3D Vision (3DV), 565–571. doi:10.1109/3DV.2016.79.
Best wishes,
Xiao
Hi Jiayi,
I updated the code and training details and fixed some issues, such as the duplicated LR-step call.
The current parameters are for the BCD->A cases; you can train the model to see how it performs. Overall, to tune the model you may want to change the initial learning rate (2e-5 to 5e-5), the number of training epochs (80 to 120 for the 100% cases), the training parameters k1 and k2, and the resampling rate. Our model has many parameters to tune, which is also a drawback of disentanglement. The results reported in the paper may not be the best our model can achieve, as I did not have much time to tune before submission; I actually found better results for the 5% case with the parameters in this public version.
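If you want something more systematic than manual tuning, a minimal random-search loop over the ranges above might look like the sketch below. Everything here is illustrative: `train_and_eval` is a hypothetical stub standing in for the real training loop, and the parameter names are not the repo's actual argument names.

```python
import random

# Hedged sketch: random search over the tuning ranges mentioned above.
# `train_and_eval` is a hypothetical stub; the real version would train
# the model with the given config and return a validation Dice score.
SEARCH_SPACE = {
    "initial_lr": lambda: random.uniform(2e-5, 5e-5),
    "epochs": lambda: random.randint(80, 120),
    "resize_divisor": lambda: random.uniform(1.0, 1.3),
    "meta_step_size": lambda: random.uniform(0.001, 0.01),
}

def train_and_eval(cfg):
    # Placeholder score so the sketch runs; NOT a real objective.
    return -abs(cfg["resize_divisor"] - 1.1)

def random_search(n_trials=20, seed=0):
    random.seed(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {name: sample() for name, sample in SEARCH_SPACE.items()}
        score = train_and_eval(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

best_cfg, best_score = random_search()
print(best_cfg, best_score)
```

Given how long one training run takes with the meta-test stage, a small trial budget with the resampling divisor varied first (and k1/k2 fixed) is probably the practical way to use this.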
Best wishes,
Xiao
Hi Xiao,
Thank you very much for your work. I will try to tune the model to see how it performs. Many thanks!
Best,
Jiayi
Hi Yang,
I was wondering whether the model could be trained with DataParallel, since training on a single GPU is time-consuming.
I changed "model.to(device)" to "model = torch.nn.DataParallel(model).cuda()" and got the error below:
RuntimeError: chunk expects at least a 1-dimensional tensor
How can I solve this? Can you give me some advice? Thanks in advance!
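For reference, this message likely comes from DataParallel scattering each forward() argument along dim 0 with torch.chunk: a 0-dimensional (scalar) tensor among the inputs cannot be split. A minimal, CPU-only reproduction of the chunking behavior (the scalar here is only an example of what to look for in the forward() inputs):

```python
import torch

# DataParallel scatters forward() arguments along dim 0; a 4D image
# batch chunks fine across replicas:
batch = torch.randn(4, 3, 224, 224)
print(len(torch.chunk(batch, 2, dim=0)))  # 2 chunks of batch size 2

# ...but a 0-dimensional tensor (e.g. a scalar weight passed into
# forward()) cannot be chunked, which raises exactly this error:
scalar = torch.tensor(1.0)
try:
    torch.chunk(scalar, 2, dim=0)
except RuntimeError as err:
    print(err)  # chunk expects at least a 1-dimensional tensor
```

So it is worth checking whether any tensor passed to the model's forward() is 0-dimensional; unsqueezing it to shape (1,) or passing it as a plain Python number are common workarounds.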
Hi Liu,
Thank you for your kind reply; I will try it.
Best regards,
Wang