fcdl94 commented on July 26, 2024

Hi @wuyujack!
i., ii.) The batch size is the same for all experiments: 24. Since I used 2 Titan RTX GPUs, the batch size was 12 on each (same setup for train and test).

iii.) Training time is harder to estimate since it depends on the setting. Taking 100-50 on ADE20K as an example, step 0 took nearly 12 hours, while step 1 took nearly 7 hours. (Looking at your command, I saw you used 30 epochs for ADE, but as written in the paper, we used 60 epochs for that dataset.)
For Pascal-VOC and the 15-5 setting, the training time was around 6 hours for step 0 and 40 minutes for step 1. Here we used 30 epochs.

iv.) I have no results to show for a model without the pretrained weights, sorry.

v.) VOC step 0 was:
19-1: 78.7 ± 0.8 mIoU
15-5: 80.4 ± 0.8 mIoU

ADE step 0 was (on Order A):
100-50: 42.6 ± 0.5 mIoU
50-50-50: 48.5 ± 0.5 mIoU

Hope it helps.

wuyujack commented on July 26, 2024

@fcdl94 Thank you so much for your detailed response; it will definitely help me replicate the experiments!

Yep, I noticed that I had made a mistake in the training setting for ADE20k as well, and I am going to re-run the experiments. Thanks for pointing it out!

I may close the issue now.

wuyujack commented on July 26, 2024

Hi @fcdl94!

When reproducing the 100-10 experiment on ADE20k, I still encounter out-of-memory issues starting from stage 3. I am using 4 x 1080 Ti (44 GB in total), which should be roughly the same total GPU memory as yours. This is weird since it works fine at stages 0, 1, and 2, but memory usage suddenly increases at stages 3, 4, and 5.

  • command line:
    CUDA_VISIBLE_DEVICES=4,5,6,7 python -m torch.distributed.launch --nproc_per_node=4 run.py --data_root data --batch_size 6 --dataset ade --name test_MIB_ade_100_10_lr_0.01_with_pretrained_4GPU --task 100-10 --step 3 --lr 0.001 --epochs 60 --method MiB

wuyujack commented on July 26, 2024

The other issue is that when running some experiments (50, 100-50, 100-10) on ADE20k, it seems that some of the class indexes around 80-100 either cannot be learned at all (mIoU and acc are 0) or are learned with an mIoU of almost 0. This may largely affect the average mIoU over all classes. For the VOC experiments I encounter the same issue, but it does not happen as frequently as on ADE20k.

Is it also common on your side that the learning of the later classes is unstable? @fcdl94

fcdl94 commented on July 26, 2024

Hi @wuyujack!

So, regarding the GPU error: it's strange. I used Titan RTX cards with 24 GB of memory each, so, as you said, the same amount of memory. Actually, I don't remember if I used the AMP mixed-precision tool for those experiments, but I'm quite sure I didn't. Maybe it's overhead due to the communication among the GPUs?
Regarding the memory increase in steps 3, 4, and 5: it's reasonable, since the losses are computed on more logits, and maybe with 120 classes you are already filling all the memory you have available, so just adding 10 more classes triggers the out-of-memory error.

Regarding the second comment, I also had the same issue on ADE20K. It is due to the fact that the provided orders are sorted by class frequency (order B is also sorted within each step, with more frequent classes appearing first). So the last 10/20 classes are the least represented, leading to a poor class mIoU.
On Pascal-VOC I don't remember getting an mIoU of zero unless using the 15-1 scenario, where it's very hard to remember past classes after learning the classes "sofa" or "train". But that is due more to the nature of the task than to the methods.

wuyujack commented on July 26, 2024

@fcdl94 Thanks for the reply! I have another question regarding the total number of training samples you used for the ADE20k and VOC experiments.

For the ADE20k 100-50 experiments, I counted train-0.npy and train-1.npy and got 13452 and 6744 samples respectively, i.e. 20196 training samples in total. This is not equal to the size of the full ADE20k training set (data/ADEChallengeData2016/images/training), which has 20210 images. Did you remove 14 samples from the original training set?

For VOC I also counted the training samples used in the 15-5 experiments: train-0.npy and train-1.npy have 8437 and 2145 samples respectively, which matches the total number of images listed in train_aug.txt in `data/PascalVOC2012/splits/`.

fcdl94 commented on July 26, 2024

As far as I remember, I excluded some images since they did not bring relevant information (their labels were all 255).
However, I need to check whether they were exactly 14; otherwise it may simply be a problem with the algorithm that splits the data.

In any case, I think these 14 images will not change the results significantly.

wuyujack commented on July 26, 2024

@fcdl94 Thanks for the quick reply; I also think this will not influence the results of the paper. I was just curious about the reason they were removed.

BTW, you mentioned:

Differently from [24], at each step we assume to have only labels for pixels of novel classes, while the old ones are labeled as background in the ground truth.

in Section 4.3. May I know how you implemented this on the fly?

fcdl94 commented on July 26, 2024

So, the mapping is done through a dictionary.
You can see it at line 254 of dataset/ade.
Basically, it receives a list of classes to learn and maps them so that the classes lie in 0-N_CLASSES. If the label contains a class not in the given list, it is masked with masking_value, which is 255 for testing (actually, it could be 0 to be coherent with the setting; however, I did not mask any class for the results reported in the paper, since they are taken at the last step) and 0 for training.
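In pseudocode, the remapping looks roughly like the sketch below (illustrative only, not the exact code at that line; `order` stands for the list of classes to learn, and `masking_value` is the value used for everything else):

```python
import numpy as np

def build_label_transform(order, masking_value):
    """Remap the original class ids in `order` to contiguous ids 0..N-1;
    any class not in `order` is replaced by `masking_value`."""
    mapping = {cls: new_id for new_id, cls in enumerate(order)}

    def transform(label):
        # `label` is an HxW array of original class ids.
        out = np.full_like(label, masking_value)
        for cls, new_id in mapping.items():
            out[label == cls] = new_id
        return out

    return transform

# Training: classes outside the current step are folded into background (0).
train_tf = build_label_transform(order=[0, 5, 12, 33], masking_value=0)
# Testing: they are ignored instead (255).
test_tf = build_label_transform(order=[0, 5, 12, 33], masking_value=255)
```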

wuyujack commented on July 26, 2024

@fcdl94 Thanks for the detailed explanation! BTW, I am using an 8-GPU server for the experiments, and the VOC experiment only needs 2 GPUs to satisfy the memory requirement. I therefore tried to run three duplicate VOC experiments with different random seeds, but since you are using the distributed framework, an error occurs: all the experiments run with the same local_rank and try to use the same port for communication. Do you have any experience running two or more experiments on the same GPU server under this distributed PyTorch framework?

fcdl94 commented on July 26, 2024

I had the same issue on a 4-GPU workstation. My problem was the master_port value.
If you use the same default port in different scripts, the PyTorch framework will throw an error.
Can you try to add --master_port X (where X is a port number) before run_inc.py in the command line?
You should use different ports for different experiments.

fcdl94 commented on July 26, 2024

Example:
python -m torch.distributed.launch --master_port 1994 --nproc_per_node 2 run.py --data_root <data_folder> --name <exp_name> .. other args ..

Look here for the docs: https://pytorch.org/docs/stable/distributed.html (see the launch utility section).

wuyujack commented on July 26, 2024

@fcdl94 I tried your advice, directly adding --master_port to distinguish different tasks on the same node, and it works smoothly on my server. Thank you very much!

wuyujack commented on July 26, 2024

Hi @fcdl94! I see someone asked about the interpretation of Table 1 in a previous issue (#20). Here I have a few more questions:

  • For Table 2, is it reported in the same manner as Table 1, i.e. all results are recorded after all 150 classes are trained, and then the metric is calculated for each session of the incremental training?

  • For the reported results of Table 2, is it using Order A or Order B? For example, is 100-10 or 100-10b reported in Table 2?

  • As the ADE20k training needs more than 12 hours for step 0, for 100-10 did you reuse the pretrained model obtained after training step 0 and then run the different experiments on steps 1-5?

  • And may I know the step 0 performance of FT, LwF, LwF-MC, ILT, and MiB for 100-10 and 100-10b, and for 50-50 and 50-50b? I want to confirm that my re-implementation is close to yours; I find that some of the methods, like FT and LwF, have about 1% better mIoU at step 0 than LwF-MC (iCaRL).

fcdl94 commented on July 26, 2024

Hi @wuyujack.

  1. Yes, all the results are reported after all classes have been learned
  2. It's the average of Order A and Order B
  3. I used the same pretrained model for 100-50 and 100-10. Moreover, I used the same pretrained model on the first 100 classes for all the methods, with the exception of LwF-MC, since it needs to use the binary CE.
  4. For FT, LwF, ILT, and MiB I used the same pretrained model as said before (I will refer to it as FC next), and for LwF-MC I used the same training protocol but with BCE (it should already be implemented this way in the code).
    Unfortunately, I lost some partial results since I hadn't recorded them in my G-sheet. These are what I found:
    100-50 A: FT 43 mIoU, FT-bce 43.2 mIoU
    50-50 A: FT 39.1 mIoU

I hope these are sufficient for you; otherwise I will check on my old workstation to see if I can find some old logs, sorry.

wuyujack commented on July 26, 2024

Hi @fcdl94, a million thanks for your reply, and sorry for bothering you so much! Since I only ran 100-10b before, I will re-run 100-10 and then compute the average of the two. Currently, for all the methods you reported in Table 2 except LwF, I get the following step 0 performance in the 100-10b setting:

  • MiB: 39.27 (non-default random seed)
  • ILT: 38.55
  • LwF-MC: 38.6
  • EWC: 39.14
  • FT: 39.17

As the 100-10b split has less training data than 100-10, I think the mIoU may increase once I add the 100-10 results and average. I will let you know when I get the final numbers. Based on these results, the methods are close enough, and it makes sense to use the same pretrained model (the FC) for all methods, as the forgetting problem behaves differently across methods and the same pretrained model should not influence the final conclusion.

BTW, for steps 1-5 (or steps 1-2, or step 1) in the different incremental learning settings, did you use a different random seed, or the default one (42) used in step 0?

fcdl94 commented on July 26, 2024

Ok, nice, let me know.

Just for clarity, using the same pretrained model did not influence the results, since the mentioned methods differ only in the distillation loss, which is not applied at step 0.

As far as I remember, I used the same random seed for all the experiments. However, I also tried different seeds and the results were pretty close, which is why I reported results with just one seed.
Let me know if you find significant differences among seeds.

wuyujack commented on July 26, 2024

@fcdl94 Thanks for the clarification and comments; yes, I agree with you. I also get very close performance for MiB when using a different random seed, which also suggests that the method is robust. I forgot to tell you that the performance reported in my last comment was all obtained on 6 RTX 2080 Ti GPUs, since, as I mentioned before, I could not solve the out-of-memory issue :(. Based on my observation, although I am using a different per-GPU batch size (4 instead of the 12 in your paper), the performance is still close to yours. I am not sure whether this benefits from the distributed training mode making training more stable, but at least it is a piece of good news :)

BTW, if I want to split VOC into 5-5-5 myself, a three-step setting, do you have any suggestions for a quick implementation? I have re-run your ADE notebook line by line and now understand how you assign an image to a class when the image has several labels; your approach makes sense to me, since you try to keep the number of samples used per class balanced while also taking enough samples for each minority class. But for VOC it seems this is already handled in your code.

wuyujack commented on July 26, 2024

Hi @fcdl94, I just got the 100-10 results for MiB with the default random seed 42:

  • Step 0: 42.87% mIoU
  • Results calculated after step 5 training is completed:
1-100: 38.67%
101-110: 2.05%
111-120: 7.86%
121-130: 8.34%
131-140: 2.63%
141-150: 11.97%
Average of 150 classes: 27.97%
  • Command Line:

step 0: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python -m torch.distributed.launch --master_port 1995 --nproc_per_node=6 run.py --data_root data --batch_size 4 --dataset ade --name test_MIB_ade_100-10_lr_0.01_with_pretrained_6GPU_60epoch --task 100-10 --step 0 --lr 0.01 --epochs 60 --method MiB

step 1: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python -m torch.distributed.launch --master_port 1995 --nproc_per_node=6 run.py --data_root data --batch_size 4 --dataset ade --name test_MIB_ade_100-10_lr_0.01_with_pretrained_6GPU_60epoch --task 100-10 --step 1 --lr 0.001 --epochs 60 --method MiB

step 2: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python -m torch.distributed.launch --master_port 1995 --nproc_per_node=6 run.py --data_root data --batch_size 4 --dataset ade --name test_MIB_ade_100-10_lr_0.01_with_pretrained_6GPU_60epoch --task 100-10 --step 2 --lr 0.001 --epochs 60 --method MiB

step 3: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python -m torch.distributed.launch --master_port 1995 --nproc_per_node=6 run.py --data_root data --batch_size 4 --dataset ade --name test_MIB_ade_100-10_lr_0.01_with_pretrained_6GPU_60epoch --task 100-10 --step 3 --lr 0.001 --epochs 60 --method MiB

step 4: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python -m torch.distributed.launch --master_port 1995 --nproc_per_node=6 run.py --data_root data --batch_size 4 --dataset ade --name test_MIB_ade_100-10_lr_0.01_with_pretrained_6GPU_60epoch --task 100-10 --step 4 --lr 0.001 --epochs 60 --method MiB

step 5: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python -m torch.distributed.launch --master_port 1995 --nproc_per_node=6 run.py --data_root data --batch_size 4 --dataset ade --name test_MIB_ade_100-10_lr_0.01_with_pretrained_6GPU_60epoch --task 100-10 --step 5 --lr 0.001 --epochs 60 --method MiB

  • Comments:
  1. I need to highlight that the reproduction was run on 6 GPUs with the same total batch size of 24. As the per-GPU batch size is different here (4 instead of 12 on 2 RTX Titan), the performance might be influenced by the batch size itself.
  2. If we only look at the Order-A results of 100-10, without averaging with Order-B, the "all" score exceeds the one in Table 2 by a large margin (27.97% vs. 25.9%, >2%), and this comes from the well-maintained performance on the 1-100 classes after the step 1-5 training. However, if we look at 101-110, 111-120, 121-130, 131-140, and 141-150 individually, none of them match the results shown in Table 2.

Currently, I cannot conclude that the underlying reason is mainly the batch size itself until I can get 2 RTX Titan GPUs to re-run the experiments. However, I still want to let you know, in case you have further comments on it.

fcdl94 commented on July 26, 2024

I think they are in line with my results. There may be some inconsistencies, but they are pretty close.

Order A

| Meth | 0-100 | 100-110 | 110-120 | 120-130 | 130-140 | 140-150 | all |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MiB (ours) | 37.8% | 1.3% | 8.8% | 9.9% | 5.1% | 10.5% | 27.6% |
| MiB (yours) | 38.67% | 2.05% | 7.86% | 8.34% | 2.63% | 11.97% | 27.97% |

Order B

| Meth | 0-100 | 100-110 | 110-120 | 120-130 | 130-140 | 140-150 | all |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MiB (mine) | 25.8% | 19.6% | 20.9% | 15.8% | 22.0% | 26.9% | 24.2% |

Moreover, I don't think that using a different batch size per GPU affects the results if the total batch size is still 24.

fcdl94 commented on July 26, 2024

For the 5-5-5 split on Pascal-VOC, you basically only need to define the task in the tasks.py file.
The splitting of the dataset (only for Pascal) is done automatically the first time you run it. :)
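Something along these lines should work, assuming tasks.py stores each task as a dict mapping step index to the class ids introduced at that step (as the existing 15-5 and 19-1 entries suggest); the dictionary name and the exact class grouping below are only illustrative:

```python
# Hypothetical "5-5-5" entry for Pascal-VOC (three incremental steps),
# in the assumed {step index: list of class ids} format,
# with 0 = background and 1-20 = the VOC object classes.
tasks_voc = {
    "5-5-5": {
        0: list(range(0, 6)),    # step 0: background + classes 1-5
        1: list(range(6, 11)),   # step 1: classes 6-10
        2: list(range(11, 16)),  # step 2: classes 11-15
    },
}
```

Note that, read this way, 5-5-5 only covers 15 of the 20 VOC classes; adjust the grouping to whatever protocol you have in mind.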

wuyujack commented on July 26, 2024

Thanks for the prompt reply and advice! I am still waiting for the Order B results of MiB for averaging and may get back to you again :)

wuyujack commented on July 26, 2024

Hi @fcdl94, I finally got the Order B results with the default random seed; they are:

| Meth | 0-100 | 100-110 | 110-120 | 120-130 | 130-140 | 140-150 | all |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MiB (yours) | 25.80% | 19.60% | 20.90% | 15.80% | 22.00% | 26.90% | 24.20% |
| MiB (mine) | 26.60% | 21.95% | 19.18% | 12.19% | 19.78% | 27.08% | 24.41% |

Now I totally agree with you that using a different batch size per GPU does not affect the results (though mine are still slightly different from yours) as long as the total batch size is the same :) Thank you so much for your help these days, and thank you for bringing your good work to the whole community!

wuyujack commented on July 26, 2024

Hi @wuyujack.

  1. Yes, all the results are reported after all classes have been learned
  2. It's the average of Order A and Order B
  3. I used the same pretrained model for 100-50 and 100-10. Moreover, I used the same pretrained model on the first 100 classes for all the methods, with the exception of LwF-MC, since it needs to use the binary CE.
  4. For FT, LwF, ILT, and MiB I used the same pretrained model as said before (I will refer to it as FC next), and for LwF-MC I used the same training protocol but with BCE (it should already be implemented this way in the code).
    Unfortunately, I lost some partial results since I hadn't recorded them in my G-sheet. These are what I found:
    100-50 A: FT 43 mIoU, FT-bce 43.2 mIoU
    50-50 A: FT 39.1 mIoU

I hope these are sufficient for you; otherwise I will check on my old workstation to see if I can find some old logs, sorry.

Hi @fcdl94, sorry to bring up this comment so late. As you mentioned in this comment that the reported ADE20k results are the average of Order A and Order B: for the 100-10 setting, are you averaging the 1-100 performance of Order A and Order B even though they contain different classes?

fcdl94 commented on July 26, 2024

Yes, the idea is to normalize the results across different orders so as not to overfit a specific one; to monitor the forgetting/intransigence problem, that was the only solution.
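As a quick sanity check of that averaging (assuming the "all" entry in Table 2 is simply the mean of the two per-order "all" scores), the 100-10 numbers above give back the 25.9% quoted earlier in the thread:

```python
# MiB "all" mIoU on ADE20k 100-10, taken from the Order A / Order B tables above.
order_a_all = 27.6
order_b_all = 24.2

table2_all = (order_a_all + order_b_all) / 2
print(table2_all)  # 25.9 -- matches the Table 2 value mentioned earlier
```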

wuyujack commented on July 26, 2024

Yes, the idea is to normalize the results across different orders so as not to overfit a specific one; to monitor the forgetting/intransigence problem, that was the only solution.

Hi @fcdl94, thanks for your previous reply. Could you also provide the performance of 100-50 and 50-50-50 for MiB under Order A and Order B separately?

fcdl94 commented on July 26, 2024

Hi @wuyujack,
These are all the numbers I got. Hope it's helpful :)
[attached screenshot with the per-order results]
