
vqa-mfb.pytorch's Introduction

Multi-modal Factorized Bilinear Pooling (MFB) for VQA

This is an unofficial PyTorch implementation of Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering and Beyond Bilinear: Generalized Multi-modal Factorized High-order Pooling for Visual Question Answering.
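For orientation, the core MFB operation projects the image and question features into a shared high-dimensional space, multiplies them element-wise, sum-pools over groups of k dimensions, and applies power (signed square-root) and L2 normalization. The sketch below is a minimal, self-contained illustration; the layer sizes and dropout rate are illustrative defaults, not the values used in this repository's config.py.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFB(nn.Module):
    """Minimal sketch of Multi-modal Factorized Bilinear pooling.
    Dimensions are illustrative, not the repository's exact config values."""
    def __init__(self, img_dim=2048, q_dim=1024, out_dim=1000, factor=5, p_drop=0.1):
        super(MFB, self).__init__()
        self.out_dim, self.factor = out_dim, factor
        self.proj_img = nn.Linear(img_dim, out_dim * factor)  # U projection
        self.proj_q = nn.Linear(q_dim, out_dim * factor)      # V projection
        self.dropout = nn.Dropout(p_drop)

    def forward(self, img_feat, q_feat):
        # element-wise product in the expanded (out_dim * factor) space
        joint = self.dropout(self.proj_img(img_feat) * self.proj_q(q_feat))
        # sum-pool every `factor` consecutive dimensions back down to out_dim
        joint = joint.view(-1, self.out_dim, self.factor).sum(dim=2)
        # power (signed square-root) normalization, then L2 normalization
        joint = torch.sign(joint) * torch.sqrt(torch.abs(joint) + 1e-8)
        return F.normalize(joint, p=2, dim=1)

# usage: img_feat is a (B, 2048) pool5 vector, q_feat a (B, 1024) question vector
mfb = MFB()
z = mfb(torch.randn(4, 2048), torch.randn(4, 1024))   # -> (4, 1000)
```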

Figure 1: The MFB+CoAtt Network architecture for VQA.

The results of MFB-baseline and MFH-baseline can be replicated. (The MFH-coatt-glove result could not be replicated; perhaps a devil is hidden in the details.)

The author helped me a lot when I tried to replicate the results. Many thanks.

The official implementation, based on pycaffe, is available here.

Requirements

Python 2.7, pytorch 0.2, torchvision 0.1.9, tensorboardX

Results

| Datasets \ Models | MFB    | MFH    | MFH+CoAtt+GloVe (FRCN img features) |
|-------------------|--------|--------|--------------------------------------|
| VQA-1.0           | 58.75% | 59.15% | 68.78%                               |

  • MFB and MFH refer to MFB-baseline and MFH-baseline, respectively.
  • The MFB and MFH results are trained on the train set and tested on the val set, using ResNet152 pool5 features (see the feature-extraction sketch below). The MFH+CoAtt+GloVe result is trained on the train+val sets and tested on the test-dev set.
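
As a rough illustration of what "ResNet152 pool5 features" means here, the snippet below extracts a 2048×14×14 res5c-style feature map (and its spatially pooled 2048-d vector) with a recent torchvision API. This is an assumption-laden sketch: the original features were extracted with Caffe ResNet-152, and the exact resize and normalization may differ from the repository's preprocessing.

```python
import torch
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image

# Sketch only: modern torchvision API (the repo itself pins torchvision 0.1.9,
# whose transform names differ slightly).
resnet = models.resnet152(pretrained=True)
# everything up to, but excluding, the final average pool and fc layer
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-2])
feature_extractor.eval()

preprocess = transforms.Compose([
    transforms.Resize((448, 448)),            # 448x448 input -> 14x14 feature map
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open('example.jpg').convert('RGB')).unsqueeze(0)  # placeholder path
with torch.no_grad():
    feat_map = feature_extractor(img)          # (1, 2048, 14, 14) res5c-style map
    pooled = feat_map.mean(dim=(2, 3))         # (1, 2048) pool5-style vector
```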

Figure 2: MFB-baseline result

Figure 3: MFH-baseline result

Training from Scratch

$ python train_*.py

  • Most of the hyper-parameters and configurations, with comments, are defined in the config.py file.
  • A pretrained GloVe word-embedding model (loaded through the spaCy library) is required to train the mfb/h-coatt-glove models. Installation instructions for spaCy and the GloVe model can be found here; a minimal lookup sketch follows below.
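
As a small sketch of what the GloVe-through-spaCy dependency is used for (embedding the question words), assuming a spaCy model with word vectors is installed; the model name below is a placeholder, and the actual GloVe package follows the linked instructions.

```python
import numpy as np
import spacy

# 'en_core_web_md' is a placeholder; install whichever GloVe vector model
# the linked instructions specify.
nlp = spacy.load('en_core_web_md')

question = "what color is the umbrella"
doc = nlp(question)
q_vectors = np.stack([token.vector for token in doc])   # (num_words, vector_dim)
print(q_vectors.shape)
```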

Citation

If you find this implementation helpful, please consider citing:

@article{yu2017mfb,
  title={Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering},
  author={Yu, Zhou and Yu, Jun and Fan, Jianping and Tao, Dacheng},
  journal={IEEE International Conference on Computer Vision (ICCV)},
  year={2017}
}

@article{yu2017beyond,
  title={Beyond Bilinear: Generalized Multi-modal Factorized High-order Pooling for Visual Question Answering},
  author={Yu, Zhou and Yu, Jun and Xiang, Chenchao and Fan, Jianping and Tao, Dacheng},
  journal={arXiv preprint arXiv:1708.03619},
  year={2017}
}

vqa-mfb.pytorch's People

Contributors

asdf0982


vqa-mfb.pytorch's Issues

Maybe a bug in the code

The shape of the feature map extracted with ResNet is 2048×14×14, but the model's input image channel is 2048. Is this a bug?

Question about MFB Baseline

This is a confirmation question about the MFB baseline. According to the paper there should be two LSTM layers with 1024-D hidden units each, but the implementation uses only one 1024-D LSTM layer. Kindly confirm.
Also, does a two-layer LSTM mean stacking two LSTM layers on top of each other? (A stacked-LSTM sketch follows below.)
Thanks
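
For what it's worth, a "two-layer LSTM" in PyTorch is simply a stacked LSTM created with num_layers=2, where the outputs of the first layer feed the second. A standalone sketch (not the repository's code; dimensions are illustrative):

```python
import torch
import torch.nn as nn

emb_dim, hidden_dim = 300, 1024
one_layer = nn.LSTM(input_size=emb_dim, hidden_size=hidden_dim, num_layers=1)
two_layer = nn.LSTM(input_size=emb_dim, hidden_size=hidden_dim, num_layers=2)  # stacked

q = torch.randn(15, 32, emb_dim)        # (seq_len, batch, emb_dim)
out, (h_n, c_n) = two_layer(q)
print(out.shape)                        # torch.Size([15, 32, 1024]) - top-layer outputs
print(h_n.shape)                        # torch.Size([2, 32, 1024]) - one final state per layer
```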

Why don't you release VQA-2.0 results?

Hi Wang, thanks for your helpful code.

Did you try to replicate your 'MFH+CoAtt+GloVe (FRCN img features)' result on VQA-2.0?
Also, when replicating the 68.78% result on VQA-1.0, did you train on VG + vqa-train + vqa-val, or simply on vqa-train + vqa-val? Maybe that is why you were not able to replicate MFH-coatt-glove.

Question about model performance of MFH_Coatt_Glove

Hi, I have preprocessed the Genome and VQA datasets following the instructions here (https://github.com/akirafukui/vqa-mcb/tree/master/preprocess) and ran 'train_mfh_coatt_glove.py'. However, the accuracy only reached 62.36% at best, still far from 68.78%. Could you please give me some suggestions on how to obtain better accuracy? I would truly appreciate it.

FYI, I only changed the IMG_FEAT_SIZE setting in 'config.py' from 100 to 196, since the preprocessing step extracts 14×14 image features.

L2_Normalization

Hi @asdf0982, thanks for your helpful code. I have a question: specifically, I think it would be more reasonable to replace lines 78-80 of mfb_coatt_glove.py with iatt_iq_l2 = F.normalize(iatt_iq_sqrt, dim=1) to perform the L2 normalization. Do you agree?
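
For reference, F.normalize(x, dim=1) divides each row by its L2 norm, so the suggestion is equivalent to an explicit normalization. A quick standalone check (the tensor here is random; iatt_iq_sqrt is only the name used in mfb_coatt_glove.py):

```python
import torch
import torch.nn.functional as F

x = torch.randn(32, 1000)                                  # stand-in for iatt_iq_sqrt

manual = x / (x.norm(p=2, dim=1, keepdim=True) + 1e-12)    # explicit L2 normalization
builtin = F.normalize(x, p=2, dim=1)                       # suggested one-liner

print(torch.allclose(manual, builtin, atol=1e-6))          # True
```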

The dimension of img_feat looks strange in mfh_baseline; the model can't process an image feature vector of size (batch_size, 2048, 14, 14)

After running the preprocessing module from https://github.com/akirafukui/vqa-mcb, I get the image feature files. But when I pass them to the data_provider, I find that the numpy array holding the image features has the wrong size, so I resize the array from (batch_size, 2048) to (batch_size, 2048, 14, 14). However, something still goes wrong when the feature array is passed to the forward function of the nn.Module in mfh_baseline.py.
How did you change the feature array from 2048×14×14 into 2048?
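
One way to reconcile these shapes, as a sketch only (how the author actually produced the 2048-d vectors is exactly what this issue asks): the attention models keep the 14×14 = 196 spatial locations, while a single 2048-d vector can be obtained by pooling over the spatial grid.

```python
import torch

feat = torch.randn(32, 2048, 14, 14)        # (batch, channels, H, W) from ResNet res5c

# Attention models (mfb/mfh-coatt): flatten the 14*14 = 196 spatial locations,
# giving one 2048-d feature per location.
feat_att = feat.view(32, 2048, 14 * 14)     # (batch, 2048, 196)

# Baselines: one common choice (an assumption, not necessarily what the author
# did) is to average-pool over the spatial grid to get a single 2048-d vector.
feat_baseline = feat.view(32, 2048, -1).mean(dim=2)   # (batch, 2048)

print(feat_att.shape, feat_baseline.shape)
```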
