
vqa-mfb.pytorch's Introduction

Multi-modal Factorized Bilinear Pooling (MFB) for VQA

This is an unofficial PyTorch implementation of Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering and Beyond Bilinear: Generalized Multi-modal Factorized High-order Pooling for Visual Question Answering.
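For orientation, the core MFB operation projects the image and question features into a shared high-dimensional space, multiplies them element-wise, sum-pools over groups of k dimensions, and applies power (signed square-root) and L2 normalization. The sketch below is a minimal, self-contained illustration; the layer sizes and dropout rate are illustrative defaults, not the values used in this repository's config.py.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFB(nn.Module):
    """Minimal sketch of Multi-modal Factorized Bilinear pooling.
    Dimensions are illustrative, not the repository's exact config values."""
    def __init__(self, img_dim=2048, q_dim=1024, out_dim=1000, factor=5, p_drop=0.1):
        super(MFB, self).__init__()
        self.out_dim, self.factor = out_dim, factor
        self.proj_img = nn.Linear(img_dim, out_dim * factor)  # U projection
        self.proj_q = nn.Linear(q_dim, out_dim * factor)      # V projection
        self.dropout = nn.Dropout(p_drop)

    def forward(self, img_feat, q_feat):
        # element-wise product in the expanded (out_dim * factor) space
        joint = self.dropout(self.proj_img(img_feat) * self.proj_q(q_feat))
        # sum-pool every `factor` consecutive dimensions back down to out_dim
        joint = joint.view(-1, self.out_dim, self.factor).sum(dim=2)
        # power (signed square-root) normalization, then L2 normalization
        joint = torch.sign(joint) * torch.sqrt(torch.abs(joint) + 1e-8)
        return F.normalize(joint, p=2, dim=1)

# usage: img_feat is a (B, 2048) pool5 vector, q_feat a (B, 1024) question vector
mfb = MFB()
z = mfb(torch.randn(4, 2048), torch.randn(4, 1024))   # -> (4, 1000)
```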

Figure 1: The MFB+CoAtt Network architecture for VQA.

The results of MFB-baseline and MFH-baseline can be replicated. (The MFH-coatt-glove result could not be replicated; perhaps a devil is hidden in the details.)

The author helped me a lot when I tried to replicate the results. Many thanks.

The official implementation, based on pycaffe, is available here.

Requirements

Python 2.7, pytorch 0.2, torchvision 0.1.9, tensorboardX

Results

| Datasets \ Models | MFB    | MFH    | MFH+CoAtt+GloVe (FRCN img features) |
|-------------------|--------|--------|--------------------------------------|
| VQA-1.0           | 58.75% | 59.15% | 68.78%                               |

  • MFB and MFH refer to MFB-baseline and MFH-baseline, respectively.
  • The MFB and MFH results are trained on the train set and tested on the val set, using ResNet152 pool5 features (see the feature-extraction sketch below). The MFH+CoAtt+GloVe result is trained on the train+val sets and tested on the test-dev set.
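
As a rough illustration of what "ResNet152 pool5 features" means here, the snippet below extracts a 2048×14×14 res5c-style feature map (and its spatially pooled 2048-d vector) with a recent torchvision API. This is an assumption-laden sketch: the original features were extracted with Caffe ResNet-152, and the exact resize and normalization may differ from the repository's preprocessing.

```python
import torch
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image

# Sketch only: modern torchvision API (the repo itself pins torchvision 0.1.9,
# whose transform names differ slightly).
resnet = models.resnet152(pretrained=True)
# everything up to, but excluding, the final average pool and fc layer
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-2])
feature_extractor.eval()

preprocess = transforms.Compose([
    transforms.Resize((448, 448)),            # 448x448 input -> 14x14 feature map
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open('example.jpg').convert('RGB')).unsqueeze(0)  # placeholder path
with torch.no_grad():
    feat_map = feature_extractor(img)          # (1, 2048, 14, 14) res5c-style map
    pooled = feat_map.mean(dim=(2, 3))         # (1, 2048) pool5-style vector
```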

Figure 2: MFB-baseline result

Figure 3: MFH-baseline result

Training from Scratch

$ python train_*.py

  • Most of the hyper-parameters and configurations, with comments, are defined in the config.py file.
  • A pretrained GloVe word-embedding model (loaded through the spaCy library) is required to train the mfb/h-coatt-glove models. Installation instructions for spaCy and the GloVe model can be found here; a minimal lookup sketch follows below.
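
As a small sketch of what the GloVe-through-spaCy dependency is used for (embedding the question words), assuming a spaCy model with word vectors is installed; the model name below is a placeholder, and the actual GloVe package follows the linked instructions.

```python
import numpy as np
import spacy

# 'en_core_web_md' is a placeholder; install whichever GloVe vector model
# the linked instructions specify.
nlp = spacy.load('en_core_web_md')

question = "what color is the umbrella"
doc = nlp(question)
q_vectors = np.stack([token.vector for token in doc])   # (num_words, vector_dim)
print(q_vectors.shape)
```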

Citation

If you find this implementation helpful, please consider citing:

@article{yu2017mfb,
  title={Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering},
  author={Yu, Zhou and Yu, Jun and Fan, Jianping and Tao, Dacheng},
  journal={IEEE International Conference on Computer Vision (ICCV)},
  year={2017}
}

@article{yu2017beyond,
  title={Beyond Bilinear: Generalized Multi-modal Factorized High-order Pooling for Visual Question Answering},
  author={Yu, Zhou and Yu, Jun and Xiang, Chenchao and Fan, Jianping and Tao, Dacheng},
  journal={arXiv preprint arXiv:1708.03619},
  year={2017}
}

vqa-mfb.pytorch's People

Contributors

asdf0982


vqa-mfb.pytorch's Issues

Maybe a bug in the code

The shape of the feature map extracted with ResNet is 2048×14×14, but the model's input image channel is 2048. Is this a bug?

Question about MFB Baseline

This is a confirmation question about the MFB baseline. According to the paper there should be two LSTM layers with 1024-D hidden units each, but the implementation uses only one 1024-D LSTM layer. Kindly confirm.
Also, does a two-layer LSTM mean stacking two LSTM layers on top of each other? (A stacked-LSTM sketch follows below.)
Thanks
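
For what it's worth, a "two-layer LSTM" in PyTorch is simply a stacked LSTM created with num_layers=2, where the outputs of the first layer feed the second. A standalone sketch (not the repository's code; dimensions are illustrative):

```python
import torch
import torch.nn as nn

emb_dim, hidden_dim = 300, 1024
one_layer = nn.LSTM(input_size=emb_dim, hidden_size=hidden_dim, num_layers=1)
two_layer = nn.LSTM(input_size=emb_dim, hidden_size=hidden_dim, num_layers=2)  # stacked

q = torch.randn(15, 32, emb_dim)        # (seq_len, batch, emb_dim)
out, (h_n, c_n) = two_layer(q)
print(out.shape)                        # torch.Size([15, 32, 1024]) - top-layer outputs
print(h_n.shape)                        # torch.Size([2, 32, 1024]) - one final state per layer
```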

Why don't you release VQA-2.0 results?

Hi Wang, thanks for your helpful code.

Did you try to replicate your 'MFH+CoAtt+GloVe (FRCN img features)' result on VQA-2.0?
Also, when replicating the 68.78% result on VQA-1.0, did you train on VG + vqa-train + vqa-val, or simply on vqa-train + vqa-val? Maybe that is why you were not able to replicate MFH-coatt-glove.

Question about model performance of MFH_Coatt_Glove

Hi, I have preprocessed the Genome and VQA datasets following the instructions here (https://github.com/akirafukui/vqa-mcb/tree/master/preprocess) and ran 'train_mfh_coatt_glove.py'. However, the accuracy only reached 62.36% at best, still far from 68.78%. Could you please give me some suggestions on how to obtain better accuracy? I would truly appreciate it.

FYI, I only changed the IMG_FEAT_SIZE setting in 'config.py' from 100 to 196, since the preprocessing step extracts 14×14 image features.

L2_Normalization

Hi @asdf0982, thanks for your helpful code. I have a question: specifically, I think it would be more reasonable to replace lines 78-80 of mfb_coatt_glove.py with iatt_iq_l2 = F.normalize(iatt_iq_sqrt, dim=1) to perform the L2 normalization. Do you agree?
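
For reference, F.normalize(x, dim=1) divides each row by its L2 norm, so the suggestion is equivalent to an explicit normalization. A quick standalone check (the tensor here is random; iatt_iq_sqrt is only the name used in mfb_coatt_glove.py):

```python
import torch
import torch.nn.functional as F

x = torch.randn(32, 1000)                                  # stand-in for iatt_iq_sqrt

manual = x / (x.norm(p=2, dim=1, keepdim=True) + 1e-12)    # explicit L2 normalization
builtin = F.normalize(x, p=2, dim=1)                       # suggested one-liner

print(torch.allclose(manual, builtin, atol=1e-6))          # True
```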

The dimension of img_feat looks strange in mfh_baseline; the model can't process an image feature vector of size (batch_size, 2048, 14, 14)

After running the preprocessing module from https://github.com/akirafukui/vqa-mcb, I get the image feature files. But when I pass them to the data_provider, I find that the numpy array holding the image features has the wrong size, so I resize the array from (batch_size, 2048) to (batch_size, 2048, 14, 14). However, something still goes wrong when the feature array is passed to the forward function of the nn.Module in mfh_baseline.py.
How did you change the feature array from 2048×14×14 into 2048?
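
One way to reconcile these shapes, as a sketch only (how the author actually produced the 2048-d vectors is exactly what this issue asks): the attention models keep the 14×14 = 196 spatial locations, while a single 2048-d vector can be obtained by pooling over the spatial grid.

```python
import torch

feat = torch.randn(32, 2048, 14, 14)        # (batch, channels, H, W) from ResNet res5c

# Attention models (mfb/mfh-coatt): flatten the 14*14 = 196 spatial locations,
# giving one 2048-d feature per location.
feat_att = feat.view(32, 2048, 14 * 14)     # (batch, 2048, 196)

# Baselines: one common choice (an assumption, not necessarily what the author
# did) is to average-pool over the spatial grid to get a single 2048-d vector.
feat_baseline = feat.view(32, 2048, -1).mean(dim=2)   # (batch, 2048)

print(feat_att.shape, feat_baseline.shape)
```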
