This project forked from omar-florez/mcb-model-for-vqa

This is an explanation and implementation of MCB for Visual QA with Tensorflow.

Python 81.43% Shell 0.67% C++ 17.90%

mcb-model-for-vqa

This is an implementation of Multimodal Compact Bilinear Pooling (MCB) for Visual Question Answering and Visual Grounding with TensorFlow.
The theoretical background of the paper is in mcb_for_vqa.pdf.
Thanks to Yunseok Jang, Hyugjin Ko, and Sangeon Park.

Code

Before training (Caffe is used)

  • build_train_datasets.py : builds the MSCOCO training annotations (index of the image feature vector, question, and index of the answer) and the feature vectors. Also builds word2ix and ix2word for questions and answers.
  • build_val_datasets.py : builds the MSCOCO validation annotations (image_id, index of the image feature vector, question, question_id) and the feature vectors.
  • build_genome_datasets.py : builds the Visual Genome annotations for training and validation.
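
The word2ix and ix2word mappings mentioned above can be illustrated with a minimal sketch. The function name `build_vocab`, its parameters `max_size` and `min_count`, and the reserved index 0 for unknown words are assumptions made for this example; the actual scripts in the repository may tokenize and index differently.

```python
from collections import Counter

def build_vocab(sentences, max_size, min_count):
    """Return (word2ix, ix2word) for the most frequent words.

    max_size and min_count are illustrative parameters, not names
    taken from the repository.
    """
    counts = Counter(word for s in sentences for word in s.lower().split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [w for w, c in ranked if c >= min_count][:max_size]
    # Index 0 is reserved for unknown words (an assumption of this sketch).
    ix2word = {0: "<unk>"}
    for i, w in enumerate(kept, start=1):
        ix2word[i] = w
    word2ix = {w: i for i, w in ix2word.items()}
    return word2ix, ix2word

questions = [
    "what color is the cat",
    "what is the man holding",
    "is the cat sleeping",
]
word2ix, ix2word = build_vocab(questions, max_size=5, min_count=2)
```

Words that fall below `min_count` (here "color", "man", "holding", "sleeping") are left out of the mapping and would be looked up as index 0.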

Models, training, and test code (TensorFlow is used)

  • config.py : configuration file (supports models with and without attention, and the MSCOCO and Visual Genome datasets)
  • model
    • vqamodel.py : an abstract model for VQA
    • mcb_vqamodel.py : an MCB model without attention mapping
    • mcbAtt_vqamodel.py : an MCB model with attention mapping
    • concat_vqamodel.py : a non-bilinear model (features are combined by concatenation)
    • CBP : module for compact bilinear pooling, from here.
  • train.py : training code
  • test.py : test code
  • vqaEvaluation, vqaTools : metric evaluation tools for the VQA dataset, from here.
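
To make the idea behind the MCB models concrete, here is a small pure-Python sketch of compact bilinear pooling: each input vector is projected with a count sketch, and the two sketches are combined by circular convolution. This is not the CBP module used in this repository. All function names, the toy vectors, and the output dimension `d = 8` are hypothetical, and the convolution is written as a direct O(d^2) loop for readability, whereas practical implementations usually compute it with FFTs.

```python
import random

def make_sketch_params(n, d, rng):
    """Draw a random bucket h[i] in [0, d) and a sign s[i] in {-1, +1} per input dim."""
    h = [rng.randrange(d) for _ in range(n)]
    s = [rng.choice((-1, 1)) for _ in range(n)]
    return h, s

def count_sketch(x, h, s, d):
    """Project x to d dimensions: out[h[i]] += s[i] * x[i]."""
    out = [0.0] * d
    for xi, hi, si in zip(x, h, s):
        out[hi] += si * xi
    return out

def circular_convolution(a, b):
    """Direct O(d^2) circular convolution; an FFT-based version gives the same result."""
    d = len(a)
    out = [0.0] * d
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[(i + j) % d] += ai * bj
    return out

def mcb(x, y, params_x, params_y, d):
    """Fuse two feature vectors into one d-dimensional vector."""
    hx, sx = params_x
    hy, sy = params_y
    return circular_convolution(
        count_sketch(x, hx, sx, d),
        count_sketch(y, hy, sy, d),
    )

rng = random.Random(0)
d = 8                                # toy output dimension
image_feat = [0.5, -1.0, 2.0, 0.0]   # toy image feature
question_feat = [1.0, 3.0, -2.0]     # toy question feature
params_img = make_sketch_params(len(image_feat), d, rng)
params_q = make_sketch_params(len(question_feat), d, rng)
fused = mcb(image_feat, question_feat, params_img, params_q, d)
```

The result equals the count sketch of the full outer product of the two vectors (with combined hash `(h_x[i] + h_y[j]) mod d` and combined sign `s_x[i] * s_y[j]`), which is why the outer product never has to be materialized.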

MSCOCO Datasets

The MSCOCO datasets are available here. Only real images and open-ended questions are used. There are 3 questions per image and 10 answers per question. I used only 50000 images and 150000 questions, split into training and validation sets at a 9:1 ratio.
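
A 9:1 split like the one described can be sketched as follows. This example assumes the split is made at the image level, so that all questions about one image land in the same set; the README does not say whether the repository splits by image or by question. The function name `split_by_image` and the seed are illustrative.

```python
import random

def split_by_image(image_ids, train_parts=9, total_parts=10, seed=0):
    """Shuffle unique image ids and split them train_parts : (total_parts - train_parts)."""
    ids = sorted(set(image_ids))
    random.Random(seed).shuffle(ids)
    cut = len(ids) * train_parts // total_parts
    return ids[:cut], ids[cut:]

# 50000 stand-in image ids, matching the number of images quoted above.
train_ids, val_ids = split_by_image(range(50000))
```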

  • In questions, 10525 distinct words exist and the 5000 most frequent are selected (threshold 3). In the paper, 13K~20K words were selected.
  • In answers, 50697 distinct words exist and the 5000 most frequent are selected (threshold 9). In the paper, 3000 words were selected.
  • Length of questions
    • 0~9 : 126520, 10~19 : 8465, 20~29 : 16
    • Only the first 20 words are considered when embedding questions
  • Only answers with confidence 'yes' are used : a total of 1049879 (image, question, answer) triples.
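
The two preprocessing rules above (keep only the first 20 words of a question, keep only answers marked with confidence 'yes') can be sketched as below. The helper names, the padding index 0, and the toy vocabulary are assumptions for illustration. The `answers` / `answer_confidence` keys follow the layout of the public VQA annotation files, but whether the repository reads them in exactly this way is not confirmed by the README.

```python
MAX_QUESTION_LEN = 20  # from the README: only the first 20 words are used

def encode_question(question, word2ix, max_len=MAX_QUESTION_LEN, pad_ix=0):
    """Map a question to a fixed-length list of word indices."""
    tokens = question.lower().rstrip("?").split()[:max_len]
    # Unknown words share the padding index in this sketch (an assumption).
    ids = [word2ix.get(t, pad_ix) for t in tokens]
    return ids + [pad_ix] * (max_len - len(ids))

def confident_answers(annotation):
    """Keep only the answers whose annotator confidence is 'yes'."""
    return [
        a["answer"]
        for a in annotation["answers"]
        if a["answer_confidence"] == "yes"
    ]

toy_word2ix = {"what": 1, "is": 2, "this": 3}
encoded = encode_question("What is this animal?", toy_word2ix)

toy_annotation = {
    "answers": [
        {"answer": "dog", "answer_confidence": "yes"},
        {"answer": "cat", "answer_confidence": "maybe"},
        {"answer": "dog", "answer_confidence": "yes"},
    ]
}
kept = confident_answers(toy_annotation)
```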

VISUAL GENOME Datasets

The Visual Genome datasets are available here. Answers vary in length, but only one-word answers are used. There are many questions per image and 1 answer per question. I used 68990 images and 611209 questions for training, and 9417 images and 35032 questions for validation.

  • In questions, 17112 distinct words exist and the 8000 most frequent are selected (threshold 3).
  • In answers, 32686 distinct words exist and the 5000 most frequent are selected (threshold 3).
  • Length of questions
    • 0~9 : 1185284, 10~19 : 42965, 20~29 : 37
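
Restricting Visual Genome to one-word answers can be sketched as a simple filter. The dictionary keys `question` and `answer`, the function name, and the normalization (lower-casing and stripping a trailing period) are assumptions for this example rather than the exact behavior of build_genome_datasets.py.

```python
def one_word_qa_pairs(qas):
    """Keep (question, answer) pairs whose normalized answer is a single word."""
    kept = []
    for qa in qas:
        answer = qa["answer"].strip().rstrip(".").lower()
        if len(answer.split()) == 1:
            kept.append((qa["question"], answer))
    return kept

toy_qas = [
    {"question": "What color is the car?", "answer": "Red."},
    {"question": "Where is the dog?", "answer": "On the grass."},
    {"question": "How many people?", "answer": "Two"},
]
pairs = one_word_qa_pairs(toy_qas)
```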

You might also want to look at

  • Other MCB models
    • A. Fukui et al. Multimodal compact bilinear pooling for visual question answering and visual grounding. 2016.
  • Bilinear pooling
    • T.-Y. Lin et al. Bilinear CNN models for fine-grained visual recognition. 2015.
    • J. Carreira et al. Semantic segmentation with second-order pooling. 2012.
  • Compact bilinear pooling & Count sketch
    • Y. Gao et al. Compact bilinear pooling. 2016.
    • N. Pham and R. Pagh. Fast and scalable polynomial kernels via explicit feature maps. 2013.
    • M. Charikar et al. Finding frequent items in data streams. 2002.
    • K. Q. Weinberger et al. Feature hashing for large scale multitask learning. 2009.
    • R. Pagh. Compressed matrix multiplication. 2012.
  • Models referenced for Visual Grounding
    • L. A. Hendricks et al. Generating Visual Explanations. 2016.
    • A. Rohrbach et al. Grounding of Textual Phrases in Images by Reconstruction. 2016.
  • More for MCB for Visual QA

Contributors

shmsw25, arpit196
