This project forked from sidgan/whats_in_a_question

CVPR'17 Spotlight: What’s in a Question: Using Visual Questions as a Form of Supervision

Home Page: http://sidgan.me/whats_in_a_question/

License: GNU General Public License v3.0

Python 17.04% Shell 2.21% Jupyter Notebook 29.66% Lua 51.08%

What’s in a Question: Using Visual Questions as a Form of Supervision

This is the code for the CVPR'17 spotlight paper, What’s in a Question: Using Visual Questions as a Form of Supervision.

[Arxiv] [bib] [Github] [Project Page]

Abstract

Collecting fully annotated image datasets is challenging and expensive. Many types of weak supervision have been explored: weak manual annotations, web search results, temporal continuity, ambient sound, and others. We focus on one particular unexplored mode: visual questions that are asked about images. Our work is based on the key observation that the question itself provides useful information about the image (even without the answer being available). For instance, the question “what is the breed of the dog?” informs the computer that the animal in the scene is a dog and that there is only one dog present. We make three contributions: (1) we provide an extensive qualitative and quantitative analysis of the information contained in human visual questions, (2) we propose two simple but surprisingly effective modifications to the standard visual question answering models that allow them to make use of weak supervision in the form of unanswered questions associated with images, and (3) we demonstrate that a simple data augmentation strategy inspired by our insights results in a 7.1% improvement on the standard VQA benchmark.

The trained models attain the following scores on the test-dev split of the MS COCO VQA v1.0 dataset.

Model Name    Overall    Other    Number    Yes/No
iBOWIMG-2x    62.80      53.11    37.94     80.72

There are three tasks described in the paper:

1. Image Descriptions

We analyze whether visual questions contain enough information to provide an accurate description of the image, using a Seq2Seq model. See Image Descriptions README for a detailed description of each file.

2. Object Classification

Visual questions can provide information about the object classes that are present in the image. For example, asking “what color is the bus?” indicates the presence of a bus in the image. See Object Classification README for a detailed description of each file.
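As an illustration of this idea only, here is a minimal Python sketch that turns the questions asked about one image into a set of weak object labels by exact word matching. The class list `OBJECT_CLASSES` and the function names are hypothetical and are not taken from this repository; the matching used in the paper's code may differ.

```python
import re

# Hypothetical subset of object class names; not the list used by the authors.
OBJECT_CLASSES = {"bus", "dog", "cat", "person", "car"}


def objects_mentioned(question, classes=OBJECT_CLASSES):
    """Return the object classes named verbatim in a single question."""
    tokens = re.findall(r"[a-z]+", question.lower())
    return {tok for tok in tokens if tok in classes}


def image_labels(questions, classes=OBJECT_CLASSES):
    """Union of the classes mentioned across all questions about one image."""
    labels = set()
    for question in questions:
        labels |= objects_mentioned(question, classes)
    return labels


questions = ["What color is the bus?", "Is the dog sleeping?"]
print(sorted(image_labels(questions)))  # prints ['bus', 'dog']
```

Exact matching ignores plurals and synonyms ("buses", "puppy"), so a practical version would need some form of normalization.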

Training/fine-tuning the image features (Caffe)

Fine-tuning modifies only the last layer of a network to give the application-specific number of outputs. We start from the parameters learnt on ImageNet images and then fine-tune on MS COCO images. All Caffe-related code for fine-tuning the models is in the caffe directory. See caffe README for a detailed description of each file.
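The layer replacement described here can be sketched in plain Python, independent of Caffe. This is a conceptual sketch, not the repository's code: a network is modelled as a list of weight matrices (rows are outputs, columns are inputs), and `replace_last_layer` is a hypothetical helper. The sizes 1000 and 80 are assumptions based on the ImageNet classification classes and the MS COCO object categories; the output count used by the authors is not stated here.

```python
import random


def replace_last_layer(layers, num_outputs, seed=0):
    """Keep every pretrained layer except the last, which is re-initialized
    with `num_outputs` rows and the same input width as the old last layer."""
    rng = random.Random(seed)
    *kept, last = layers
    in_dim = len(last[0])  # input width of the final layer is unchanged
    new_last = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)]
                for _ in range(num_outputs)]
    return kept + [new_last]


# Toy "pretrained" network: 4 -> 8 hidden units -> 1000 outputs (assumed).
pretrained = [
    [[0.1] * 4 for _ in range(8)],
    [[0.2] * 8 for _ in range(1000)],
]

# Swap in a fresh 80-way output layer (assumed target size).
finetune_net = replace_last_layer(pretrained, num_outputs=80)
print(len(finetune_net[-1]), len(finetune_net[-1][0]))  # prints 80 8
```

Only the new final layer starts from random values; the earlier layers are reused as they are, which is what lets training on the smaller dataset begin from the ImageNet parameters.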

3. Visual Question Answering

Visual Question Answering is the task of providing an accurate natural language answer, given an image and a natural language question about that image. Visual questions focus on different areas of an image, including background details and underlying context. We use not just the target question but also the other, unanswered questions asked about the same image. See Visual Question Answering README for a detailed description of each file.
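One way to picture using both kinds of question is as two bag-of-words vectors joined into a single text feature. The Python sketch below is an assumption-laden illustration, not the authors' iBOWIMG-2x implementation: the function names and the toy vocabulary are hypothetical, and the image features that a VQA model would also consume are left out.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bow_vector(texts, vocab):
    """Count how often each vocabulary word occurs across `texts`."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    return [counts[word] for word in vocab]


def question_features(target, other_questions, vocab):
    """Concatenate the bag-of-words of the target question with the
    bag-of-words of the other (unanswered) questions on the same image."""
    return bow_vector([target], vocab) + bow_vector(other_questions, vocab)


vocab = ["what", "color", "bus", "dog"]  # toy vocabulary (assumption)
features = question_features(
    "What color is the bus?",
    ["Is the dog sleeping?", "What is the dog doing?"],
    vocab,
)
print(features)  # prints [1, 1, 1, 0, 1, 0, 0, 2]
```

Keeping the two halves separate lets a model weight words from the target question differently from words that appear only in the surrounding questions.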

Acknowledgements

This code is based on Simple Baseline for Visual Question Answering by Bolei Zhou and Yuandong Tian.

Please cite us if you use our code:

@inproceedings{GanjuCVPR17,
author = {Siddha Ganju and Olga Russakovsky and Abhinav Gupta},
title = {What's in a Question: Using Visual Questions as a Form of Supervision},
booktitle = {CVPR},
year = {2017}
}