training setup · villa · HOT · 9 comments · CLOSED

zhegan27 commented on May 27, 2024
training setup

from villa.

Comments (9)

youngfly11 commented on May 27, 2024

Hi, yixuan;

I ran into the same problem as you. I cannot reproduce the results with the same batch size.

But I am curious how fast the training is on your machine. Have you tried the pretraining code (not the VQA fine-tuning)? I found that pretraining is very slow.

Thanks
Yongfei


zhegan27 commented on May 27, 2024

Sorry for the late response. I will answer your questions one by one below.

a) Yes, your understanding is correct. 3072 refers to the total number of tokens per batch, not the real batch size. Sorry for the confusion.

b) You should keep num_tokens * num_GPUs * num_Grad_Accu the same. So, in this case, if we run the code with 3072 tokens, 8 GPUs and 4 gradient-accumulation steps, then when changing to 1024 tokens on the same 8 GPUs, your Grad. Accu. should be 12. This will help you reproduce the results.
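The scaling rule above can be sketched in a few lines of Python. The helper name is mine, not from the VILLA codebase; it just keeps the effective batch (num_tokens * num_GPUs * num_Grad_Accu) constant when the per-GPU token budget changes:

```python
# Hypothetical helper (not part of VILLA): rescale gradient-accumulation
# steps so the effective batch size stays constant across hardware setups.
def scaled_grad_accu(ref_tokens, ref_gpus, ref_accu, new_tokens, new_gpus):
    """Return the grad-accumulation steps that keep
    num_tokens * num_GPUs * num_Grad_Accu unchanged."""
    effective = ref_tokens * ref_gpus * ref_accu
    accu, rem = divmod(effective, new_tokens * new_gpus)
    if rem:
        raise ValueError("effective batch not evenly divisible; "
                         "pick a different token budget or GPU count")
    return accu

# Reference run: 3072 tokens, 8 GPUs, 4 accumulation steps.
# Dropping to 1024 tokens on the same 8 GPUs:
print(scaled_grad_accu(3072, 8, 4, 1024, 8))  # -> 12
```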

c) The config json file cannot exactly reproduce the best VQA results in our paper. In our experiments, we observed that setting "conf_th" (which controls how many bounding boxes we keep per image) to a smaller number results in better performance. However, the provided default image features were extracted with "conf_th" set to 0.2, which means we need to host a new set of image features with a smaller "conf_th". We will try to host these after the new year holiday.
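To make the role of "conf_th" concrete, here is a minimal sketch (my own illustration, not the actual feature-extraction code): the threshold keeps only detected regions whose detector confidence exceeds it, so a smaller threshold means more region features per image.

```python
# Hypothetical illustration of how a confidence threshold ("conf_th")
# selects region features: lower threshold -> more boxes survive.
def filter_boxes(confidences, conf_th):
    """Return indices of detected boxes whose confidence exceeds conf_th."""
    return [i for i, c in enumerate(confidences) if c > conf_th]

scores = [0.9, 0.3, 0.15, 0.08]  # made-up detector confidences
print(len(filter_boxes(scores, 0.2)))    # -> 2 boxes at conf_th = 0.2
print(len(filter_boxes(scores, 0.075)))  # -> 4 boxes at conf_th = 0.075
```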

Adv. Lr. is not a very sensitive hyper-parameter. You can try different values, but they generally result in similar performance. I will try to find the exact config file to reproduce our best results.

Hope it helps. Thanks for your interest in our code.

Best,
Zhe


zhegan27 commented on May 27, 2024

@youngfly11, sorry that you found pre-training very slow. Typically, it will be about 2x slower than standard UNITER pre-training, though we did not measure this precisely, as we ran the code on our GPU clusters. Did you try standard UNITER pre-training? Is it also slow? We did not find pre-training slow in our experiments.

As for exactly reproducing the best VQA results in our paper, we will try to provide the config after the holiday.

Best,
Zhe


yixuan-qiao commented on May 27, 2024

> But I am curious how fast the training is on your machine. Have you tried the pretraining code (not the VQA fine-tuning)? I found that pretraining is very slow.

Hi @youngfly11,

I haven't tried the pre-training stage, but maybe some fine-tuning speed info can help. I fine-tuned on 4 V100 (16G) GPUs using train-vqa-large-8gpu-adv.json, which takes about 20h. Besides, a painful and sad truth is that I still haven't found the best config to reproduce the paper's best results.


yixuan-qiao commented on May 27, 2024

Hi, thanks for the advice @zhegan27

Sorry to interrupt your holiday. I don't mean to :-)
I just wonder: in your paper you use batch size 3072, grad. acc. 5, and 5000 training steps for the VQA task. Is this parameter setting for a single Titan RTX GPU or maybe 8 machines? I found grad. acc. is a sensitive hyper-parameter; I tried [12, 16, 24] and got very different performance. Maybe it needs a different scale because of the number of machines?


zhegan27 commented on May 27, 2024

@youngfly11 @yixuan-qiao, fine-tuning UNITER-large with adversarial training on the VQA dataset taking 20 hours is reasonable, as adversarial training itself is heavier than standard training.

@yixuan-qiao, sorry that you have not been able to reproduce our best VQA results. I am very happy to help you with this.

When doing experiments back then, I ran many runs under different settings, such as 4, 8, or 16 GPUs. To partially address your concern, I have dug out the best config file that we used to obtain the best results in our paper (test-dev/std: 74.69/74.87). This corresponds to 72.92 accuracy on our internal dev set. The config and log files are provided in the ./reproducibility-vqa folder. In the config file, "conf_th" is 0.075, which the currently provided image features do not support, as most of our experiments used a "conf_th" of 0.2. I will try to provide these features.

Q: Is this parameter setting for a single Titan RTX GPU or maybe 8 machines?
A: It corresponds to 8 machines, in my impression. I will try to run the code myself again to double-check. But it is definitely not for a single Titan RTX.

Q: I found grad. acc. is a sensitive hyper-parameter. I tried [12, 16, 24] and got very different performance. Maybe it needs a different scale because of the number of machines?
A: This is true. If you have fewer machines, please try larger grad. acc. steps. Generally, keep num_tokens * num_GPUs * num_Grad_Accu the same in order to obtain similar results.

I will come back to this when I have more free time. For other experiments such as VCR, the results should be much easier to reproduce. Thanks and Happy New Year.

Best,
Zhe


yixuan-qiao commented on May 27, 2024

Thanks a lot for your patient reply @zhegan27.
I will try to reproduce the image features with conf_th 0.075 first; it would be great if you have time to share them :-).
In your new hps.json, some parameters are set to different values compared to my experiments, especially the learning rate decay schedule and the training steps. You probably use the vqa_schedule from MCAN; I will also try it.

Thanks a lot. Let's keep in touch. Looking forward to your update. :-)
Happy New Year!!!


zhegan27 commented on May 27, 2024

@yixuan-qiao, the image features and config files that can be used to reproduce our best VQA results were updated in the repo 2 days ago. Please take a look, and let us know if you have any further questions. Thank you.

Best,
Zhe


yixuan-qiao commented on May 27, 2024

@zhegan27, thanks for that. I can now reproduce the single best-performing model. Many thanks!!! :-)

