
zhegan27 / villa


Research Code for NeurIPS 2020 Spotlight paper "Large-Scale Adversarial Training for Vision-and-Language Representation Learning": UNITER adversarial training part

Home Page: https://arxiv.org/pdf/2006.06195.pdf

License: MIT License

Languages: Python 96.86%, Shell 2.81%, Dockerfile 0.33%
Topics: vision-and-language, adversarial-training, pretraining, visual-question-answering, neurips-2020

villa's People

Contributors

linjieli222, zhegan27


villa's Issues

About the reproduction of VCR experiment results

Hi,
Thanks for your great work!
When I use the following command to train a model, it does not reach the results expected from the paper.
horovodrun -np 1 python train_vcr_adv.py --config config/train-vcr-base-4gpu-adv.json \
    --output_dir vcr/output_base
Using only one GPU, I got these results:
100%|##########| 8000/8000 [4:58:12<00:00, 1.98s/it]
09/10/2021 08:48:59 - INFO - __main__ - ============Step 8000=============
09/10/2021 08:48:59 - INFO - __main__ - 1280000 examples trained at 71 ex/s
09/10/2021 08:48:59 - INFO - __main__ - ===========================================
09/10/2021 08:48:59 - INFO - __main__ - start running validation...
09/10/2021 08:54:06 - INFO - __main__ - validation finished in 307 seconds, score_qa: 72.28 score_qar: 75.06 score: 54.35

I am confused because this result is a few percentage points lower than the numbers reported in the paper.
What should I do? Thanks in advance!
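
One plausible factor to check (an assumption, not a confirmed diagnosis): the config is named train-vcr-base-4gpu-adv.json, so running it with horovodrun -np 1 cuts the effective batch size to a quarter of what the released setup presumably used. A minimal sketch of compensating by scaling gradient accumulation, assuming the config exposes a UNITER-style "gradient_accumulation_steps" key (the key name and the output path are assumptions):

import json

# Hypothetical: derive a 1-GPU config from the released 4-GPU config by
# scaling gradient accumulation so the effective batch size stays the same.
CONFIG_IN = "config/train-vcr-base-4gpu-adv.json"
CONFIG_OUT = "config/train-vcr-base-1gpu-adv.json"   # hypothetical output path
GPUS_EXPECTED, GPUS_AVAILABLE = 4, 1

with open(CONFIG_IN) as f:
    cfg = json.load(f)

# "gradient_accumulation_steps" is assumed to exist in the config.
scale = GPUS_EXPECTED // GPUS_AVAILABLE
cfg["gradient_accumulation_steps"] = cfg.get("gradient_accumulation_steps", 1) * scale

with open(CONFIG_OUT, "w") as f:
    json.dump(cfg, f, indent=4)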

VQA pre-processing

I'd like to apply this model to my own VQA-like dataset.
However, the dataset is in json format (like the original VQA dataset), so I need to convert it to lmdb file format.
If you have code that converts the original VQA data to the LMDB format, could you please share it?
Specifically, how did you calculate the "target" values in the text lmdb?
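
For reference, the "target" values in VQA-style text databases are commonly the standard VQA soft scores, where each candidate answer receives min(#annotators who gave it / 3, 1.0); whether this repo follows exactly that convention is an assumption. A minimal sketch of the computation:

from collections import Counter

def soft_targets(annotator_answers):
    """Compute VQA-style soft scores from the (typically 10) human answers
    for one question: score(answer) = min(count / 3, 1.0)."""
    counts = Counter(annotator_answers)
    return {ans: min(n / 3.0, 1.0) for ans, n in counts.items()}

# Example: ten annotations for one question
print(soft_targets(["2", "2", "2", "2", "two", "2", "2", "2", "2", "3"]))
# {'2': 1.0, 'two': 0.333..., '3': 0.333...}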

Features of img_pos_feat

Hello,

I noticed that img_pos_feat has 7 features. I assume that 4 of them are the coordinates of the boxes. What are the other 3? Is there code where I can see how the 7 features were derived?
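
For context, a common construction of a 7-dimensional box position feature in UNITER-style models is the normalized box corners plus the normalized width, height, and area; whether this repo's feature extraction does exactly this is an assumption. A minimal sketch:

import numpy as np

def box_position_feature(box, img_w, img_h):
    """box = (x1, y1, x2, y2) in pixels; returns a 7-d feature
    (x1/W, y1/H, x2/W, y2/H, w/W, h/H, area ratio)."""
    x1, y1, x2, y2 = box
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return np.array([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h,
                     w, h, w * h], dtype=np.float32)

print(box_position_feature((10, 20, 110, 220), img_w=640, img_h=480))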

How to extract features to do image retrieval

Thank you for this amazing piece of work.

I'm interested in using VILLA or UNITER to do image retrieval.

I'd like to pre-extract features from VILLA for a folder of images and then retrieve them at inference time by using a text query.

I note that in your paper you publish image retrieval and text retrieval metrics.

I've run the code as noted in the UNITER repo:

# text annotation preprocessing
bash scripts/create_txtdb.sh $PATH_TO_STORAGE/txt_db $PATH_TO_STORAGE/ann

# image feature extraction (Tested on Titan-Xp; may not run on latest GPUs)
bash scripts/extract_imgfeat.sh $PATH_TO_IMG_FOLDER $PATH_TO_IMG_NPY

# image preprocessing
bash scripts/create_imgdb.sh $PATH_TO_IMG_NPY $PATH_TO_STORAGE/img_db

Most of the scripts and examples I can see in the repo require both images and text to be presented to the model.

Do you have any examples or advice on how to get text-only representations/features that could be used to then retrieve images by their pre-encoded features?

Thanks for any help or guidance you can provide.
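
Independent of this repo's API, once text and image features have been pre-extracted, retrieval reduces to a nearest-neighbour search between the query embedding and the stored image embeddings. A generic sketch of that last step only, with random placeholders standing in for real features (how to obtain single-modality embeddings from a single-stream model like UNITER/VILLA is exactly the open question above):

import numpy as np

def top_k_images(query_emb, image_embs, k=5):
    """query_emb: (d,), image_embs: (n, d); returns indices of the k images
    most similar to the query by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    m = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = m @ q
    return np.argsort(-sims)[:k]

# Usage with placeholder embeddings:
rng = np.random.default_rng(0)
print(top_k_images(rng.normal(size=768), rng.normal(size=(1000, 768))))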

Checkpoints of Villa models to run on validation set

Hello,

Thanks for your work and available code. I have downloaded your checkpoints using
download_pretrained.sh

It downloaded several VILLA models, one of which is villa-base.pt. I would then like to run validation on this checkpoint as follows:

python train_vqa_adv.py --config config/train-vqa-base-1gpu-adv.json --checkpoint saved_data/pretrained/villa-base.pt  --valid_steps 1

However, I noticed that when the model is loaded from this checkpoint, the weights of self.vqa_output are not restored. What would you suggest if I want to take your best model and run it on a validation set?
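
One way to narrow this down (a generic PyTorch sketch, not code from this repo) is to list which tensors the checkpoint actually contains; if villa-base.pt is a pre-training checkpoint, a task head such as vqa_output would not be in it and would stay at its random initialization, which would match the observation above:

import torch

# List what the checkpoint contains and whether a VQA head is included.
ckpt = torch.load("saved_data/pretrained/villa-base.pt", map_location="cpu")
state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt   # some checkpoints nest the weights (assumption)

head_keys = [k for k in state_dict if "vqa_output" in k]
print(f"{len(state_dict)} tensors in checkpoint; vqa_output keys: {head_keys or 'none'}")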

The number of trainable parameters

Hello, great work! Can you tell me the number of trainable parameters when fine-tuning the retrieval task with UNITER-base?
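
For any PyTorch model, including a UNITER-base retrieval model, the trainable-parameter count can be read directly off model.parameters(); the toy module below is only a placeholder for the real model:

import torch.nn as nn

model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2))   # placeholder model

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / total: {total:,}")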

As the epoch increased, so did the GPU memory

Hi ,
Thanks for your great work!
When fine-tuning on VQA, I ran into the following problem: as the epochs increase, so does GPU memory usage, and eventually it exceeds the GPU's memory limit and training stops.

Also, when training with multiple GPUs, GPU 0 uses more memory than any of the others.

This problem has been bothering me for a long time. Do you know what the reason might be?

Thanks for your reply!
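
For context, memory that grows with the number of epochs is often caused by accumulating loss tensors (which keeps their autograd history alive) or by running validation without torch.no_grad(); the GPU 0 imbalance is typical when rank 0 additionally holds evaluation or gathered tensors. A generic PyTorch sketch of the accumulation pitfall (not code from this repo):

import torch
import torch.nn.functional as F

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

running_loss = 0.0
for _ in range(100):
    x, y = torch.randn(8, 10), torch.randn(8, 1)
    loss = F.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # BAD:  running_loss += loss       # keeps autograd history for every step
    running_loss += loss.item()        # GOOD: stores only a Python float

# Validation should run without building graphs at all:
with torch.no_grad():
    val_loss = F.mse_loss(model(torch.randn(8, 10)), torch.randn(8, 1)).item()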

training setup

Hi,
Thanks for your excellent work. I am not sure whether the batch size in your paper is the same as the one in the code. In the code, 3072 refers to the total number of tokens, which corresponds to roughly 32 real examples per iteration.

a) Is 32 (real batch size) × 8 (gradient accumulation) the dominant factor?
b) Our V100 machines (16 GB) cannot handle 3072 tokens, so would 1024 tokens (about 8 real examples), 8 GPUs, and 4 gradient-accumulation steps be another workable plan (see the sketch below)?
c) Also, can the released train-vqa-large-8gpu-adv.json reproduce the paper's result? Some parameters seem to be set differently from the paper (e.g., the adversarial learning rate).

We very much hope to reproduce your best results in our limited-resource setting. Thanks a lot.
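
A quick back-of-the-envelope comparison of the two plans, using only the numbers given in the question (whether the 32 examples are per GPU or in total is not clear from the config, so treat this as a rough check):

def effective_batch(examples_per_gpu, n_gpus, grad_accum_steps):
    # examples that contribute to a single optimizer update
    return examples_per_gpu * n_gpus * grad_accum_steps

plan_a = effective_batch(32, n_gpus=1, grad_accum_steps=8)   # 3072 tokens ~= 32 examples
plan_b = effective_batch(8, n_gpus=8, grad_accum_steps=4)    # 1024 tokens ~= 8 examples
print(plan_a, plan_b)   # 256 and 256: the two plans match in effective batch size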

When will the adversarial pre-training code for the in-domain datasets be released?

Hi Zhe,

Thanks for your excellent work. I would like to reproduce some of the VILLA results and run pre-training on the in-domain datasets. Is it possible to simply adapt the adversarial training code in train_vqa_adv.py to the pre-training stage? Is there any specific configuration needed for adversarial training during pre-training?
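
For context, the adversarial training in train_vqa_adv.py perturbs the input embeddings and takes a few inner ascent steps on the perturbation before each parameter update. The sketch below is a heavily simplified, generic version of that idea (toy model, placeholder hyperparameters, no KL consistency term), not the repo's actual fine-tuning or pre-training code:

import torch
import torch.nn.functional as F

# Toy stand-ins for the real embedder and task head.
embed = torch.nn.Embedding(1000, 64)
classifier = torch.nn.Linear(64, 2)
opt = torch.optim.Adam(list(embed.parameters()) + list(classifier.parameters()), lr=1e-3)

adv_steps, adv_lr, adv_max_norm = 3, 1e-2, 1e-1   # placeholder hyperparameters

for _ in range(10):                                # toy training steps
    tokens = torch.randint(0, 1000, (16, 8))
    labels = torch.randint(0, 2, (16,))

    delta = torch.zeros(*tokens.shape, 64, requires_grad=True)   # perturbation on the embeddings
    opt.zero_grad()
    for _ in range(adv_steps):
        emb = embed(tokens)                        # fresh forward pass each inner step
        logits = classifier((emb + delta).mean(dim=1))
        loss = F.cross_entropy(logits, labels) / adv_steps
        loss.backward()                            # accumulates grads on the model AND on delta
        with torch.no_grad():                      # ascent step on the perturbation, then clip
            delta += adv_lr * delta.grad.sign()
            delta.clamp_(-adv_max_norm, adv_max_norm)
        delta.grad.zero_()
    opt.step()                                     # one model update using the accumulated grads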
