gt-vision-lab / vqa_lstm_cnn

Train a deeper LSTM and normalized CNN Visual Question Answering model. The current code achieves 58.16 on OpenEnded and 63.09 on Multiple-Choice on test-standard.

Python 37.55% Lua 62.45%

vqa_lstm_cnn's Issues

How to process the multiple choice answer

Hi,
I am confused about how to use the multiple-choice answers in the multiple-choice task when training and evaluating the model.

Can we process the multiple-choice answers the same way as in the open-ended task?

Thanks.

Some problems while implementing with tensorflow

Hi guys, I recently attempted to reimplement this repository in TensorFlow.
However, the accuracy only reached about 47%, even though I have checked the code again and again.
Since I am not that familiar with Lua, can anyone help me figure out what the problem is?
Here's my code:
https://github.com/JamesChuanggg/vqa-tf/blob/master/model_VQA.py
I simply followed each step of the original Lua code.

Unsupported marker type 0xf0

Hi! When I run prepro_img.lua to process the images, it raises the error "Unsupported marker type 0xf0" after processing 70,000+ images of train2014. How can I solve it?

UNK token

I am wondering why the UNK token is used in the test set. Will it affect the results? Also, if I have a validation set used for early stopping, must the encoded questions use the training vocabulary, or should new words from the validation set be added to the vocabulary?

Abstract scene parameters num_ans and num_output

Hi,

I am trying to use this model for abstract scenes multiple choice answers and wanted to confirm the parameters for preprocessing and training.
In prepro.py, is it okay to leave num_ans at 1000, and in train.lua to leave num_output at 1000?

Thank you!

Script to evaluate new image using model saved from VQA_LSTM_CNN

Hi all,

We were able to save the model and run eval.lua to evaluate questions on validation images.

Now we want to use the model to answer questions about a new image. If such code is already available, we would love to use it. Otherwise we will write the code ourselves and share it back.

Thanks,
Abhinav

libcudnn.so.4 not found even though I ran 'luarocks install CuDNN'

envy@ub1404envy:~/os_prj/github/_QA/VT-vision-lab/VQA_LSTM_CNN$ th eval.lua -input_img_h5 data_img.h5 -input_ques_h5 data_prepro.h5 -input_json data_prepro.json -model_path model/lstm.t7
{
out_path : "result/"
batch_size : 500
model_path : "model/lstm.t7"
gpuid : 7
input_ques_h5 : "data_prepro.h5"
rnn_size : 512
common_embedding_size : 1024
input_img_h5 : "data_img.h5"
input_encoding_size : 200
input_json : "data_prepro.json"
img_norm : 1
backend : "cudnn"
num_output : 1000
rnn_layer : 2
}
nil
/home/envy/torch/install/bin/luajit: /home/envy/torch/install/share/lua/5.1/trepl/init.lua:384: /home/envy/torch/install/share/lua/5.1/trepl/init.lua:384: /home/envy/torch/install/share/lua/5.1/cudnn/ffi.lua:1279: 'libcudnn (R4) not found in library path.
Please install CuDNN from https://developer.nvidia.com/cuDNN
Then make sure files named as libcudnn.so.4 or libcudnn.4.dylib are placed in your library load path (for example /usr/local/lib , or manually add a path to LD_LIBRARY_PATH)

stack traceback:
[C]: in function 'error'
/home/envy/torch/install/share/lua/5.1/trepl/init.lua:384: in function 'require'
eval.lua:49: in main chunk
[C]: in function 'dofile'
...envy/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670

My CUDA install is at:

envy@ub1404envy:~/os_prj/github/_QA/VT-vision-lab/VQA_LSTM_CNN$ ll /usr/local/cuda-7.5/targets/x86_64-linux/lib/
total 791192
drwxr-xr-x 3 root root 4096 Feb 11 00:06 ./
drwxr-xr-x 4 root root 4096 Dec 7 05:47 ../
-rw-r--r-- 1 root root 28585480 Aug 15 2015 libcublas_device.a
lrwxrwxrwx 1 root root 16 Aug 15 2015 libcublas.so -> libcublas.so.7.5*
lrwxrwxrwx 1 root root 19 Aug 15 2015 libcublas.so.7.5 -> libcublas.so.7.5.18*
-rwxr-xr-x 1 root root 23938736 Aug 15 2015 libcublas.so.7.5.18*
-rw-r--r-- 1 root root 28220076 Aug 15 2015 libcublas_static.a
-rw-r--r-- 1 root root 322936 Aug 15 2015 libcudadevrt.a
lrwxrwxrwx 1 root root 16 Aug 15 2015 libcudart.so -> libcudart.so.7.5*
lrwxrwxrwx 1 root root 19 Aug 15 2015 libcudart.so.7.5 -> libcudart.so.7.5.18*
-rwxr-xr-x 1 root root 383336 Aug 15 2015 libcudart.so.7.5.18*
-rw-r--r-- 1 root root 720192 Aug 15 2015 libcudart_static.a
-rwxr-xr-x 1 root root 11172416 Feb 11 00:06 libcudnn.so*
-rwxr-xr-x 1 root root 11172416 Feb 11 00:06 libcudnn.so.6.5*
-rwxr-xr-x 1 root root 11172416 Feb 11 00:06 libcudnn.so.6.5.48*
-rw-r--r-- 1 root root 11623922 Feb 11 00:06 libcudnn_static.a
lrwxrwxrwx 1 root root 15 Aug 15 2015 libcufft.so -> libcufft.so.7.5*
lrwxrwxrwx 1 root root 18 Aug 15 2015 libcufft.so.7.5 -> libcufft.so.7.5.18*
-rwxr-xr-x 1 root root 111231960 Aug 15 2015 libcufft.so.7.5.18*
-rw-r--r-- 1 root root 115104400 Aug 15 2015 libcufft_static.a
lrwxrwxrwx 1 root root 16 Aug 15 2015 libcufftw.so -> libcufftw.so.7.5*

Would you tell me more about the parameters and dataset?

To get the image features, run

$ th prepro_img.lua -input_json data_prepro.json -image_root path_to_image_root -cnn_proto path_to_cnn_prototxt -cnn_

no clue

im=im*255;
im2=im:clone()
im2[{{3},{},{}}]=im[{{1},{},{}}]-123.68
im2[{{2},{},{}}]=im[{{2},{},{}}]-116.779
im2[{{1},{},{}}]=im[{{3},{},{}}]-103.939

Hello,
could someone please explain the code above to me?
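One reading of the Lua snippet above, offered as a sketch rather than an authoritative answer: it scales pixel values from [0, 1] to [0, 255], reverses the channel order from RGB to BGR, and subtracts a fixed mean from each channel. The three constants match the per-channel mean pixel commonly used with the Caffe VGG models; that connection is general background and is not stated in this thread. The Python below mirrors the same arithmetic on plain nested lists. The function name `vgg_preprocess` and the list-based image layout are assumptions made only for illustration.

```python
# Per-channel means in (R, G, B) order, copied from the Lua snippet above.
MEAN_RGB = (123.68, 116.779, 103.939)


def vgg_preprocess(im_rgb):
    """Mirror the Lua snippet on plain Python lists.

    im_rgb: three channel planes [R, G, B]; each plane is a list of rows
            holding values in [0, 1].
    Returns three planes in [B, G, R] order, scaled to [0, 255] with the
    matching channel mean subtracted.
    """
    def shift(plane, mean):
        # Scale to [0, 255], then subtract this channel's mean.
        return [[value * 255.0 - mean for value in row] for row in plane]

    r, g, b = (shift(plane, mean) for plane, mean in zip(im_rgb, MEAN_RGB))
    # Output channel 1 is blue and channel 3 is red, as in the Lua code.
    return [b, g, r]


# A 1x1 "image" whose single pixel is pure red.
pixel = [[[1.0]], [[0.0]], [[0.0]]]
out = vgg_preprocess(pixel)
print(out)
```

In the output, the red value ends up in the last plane, which is the channel swap performed by the index assignments in the original snippet.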

Fail to repeat the accuracy of the pretrained VGG model

I downloaded the pretrained model here: https://filebox.ece.vt.edu/~jiasenlu/codeRelease/vqaRelease/train_val/pretrained_lstm_train-val_test

and the corresponding features here:
https://filebox.ece.vt.edu/~jiasenlu/codeRelease/vqaRelease/train_val/data_train-val_test.zip

There is no error when I run eval.lua.
After I passed the result files to the evaluation tools from https://github.com/VT-vision-lab/VQA,
I came across the following error:

"
loading VQA annotations and questions into memory...
0:00:07.128280
creating index...
index created!
Loading and preparing results...
Traceback (most recent call last):
File "vqaEvalDemo.py", line 31, in <module>
vqaRes = vqa.loadRes(resFile, quesFile)
File "../../VQA/PythonHelperTools/vqaTools/vqa.py", line 165, in loadRes
'Results do not correspond to current VQA set. Either the results do not have predictions for all question ids in annotation file or there is atleast one question id that does not belong to the question ids in the annotation file.'
AssertionError: Results do not correspond to current VQA set. Either the results do not have predictions for all question ids in annotation file or there is atleast one question id that does not belong to the question ids in the annotation file.
"

Issue while trying to run the evaluation script

I ran eval.lua to generate the result JSON files and then used the VQA tools to calculate the accuracy. I tried both the evaluate.py script and running vqaEvalDemo.py from the VQA folder. Both give the following error.

loading VQA annotations and questions into memory...
0:00:19.813268
creating index...
index created!
Loading and preparing results...     
Traceback (most recent call last):
  File "evaluate.py", line 5, in <module>
    from vqaEvalDemo import evaluate
  File "/workspace/VQA_LSTM_CNN/VQA/PythonEvaluationTools/vqaEvalDemo.py", line 31, in <module>
    vqaRes = vqa.loadRes(resFile, quesFile)
  File "/workspace/VQA_LSTM_CNN/VQA/PythonHelperTools/vqaTools/vqa.py", line 174, in loadRes
    'Results do not correspond to current VQA set. Either the results do not have predictions for all question ids in annotation file or there is atleast one question id that does not belong to the question ids in the annotation file.'
AssertionError: Results do not correspond to current VQA set. Either the results do not have predictions for all question ids in annotation file or there is atleast one question id that does not belong to the question ids in the annotation file.

I made sure the question and annotation files are exactly the same as the ones used in training. They are also the same ones available on visualqa.org.
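The assertion message says the set of question ids in the results does not match the set in the loaded VQA data. A quick way to narrow this down is to compare the two sets directly. The sketch below assumes the result file is a list of objects with a `question_id` field and the question file has a top-level `questions` list whose items also carry `question_id`; both layouts are assumptions and should be checked against the actual files. The helper name `diff_question_ids` is hypothetical.

```python
def diff_question_ids(results, questions):
    """Return (missing, extra) question ids.

    results:   list of dicts with a "question_id" key (assumed result format).
    questions: dict with a "questions" list of dicts that each carry
               "question_id" (assumed question-file format).
    missing: ids present in the question file but absent from the results.
    extra:   ids present in the results but absent from the question file.
    """
    res_ids = {r["question_id"] for r in results}
    ques_ids = {q["question_id"] for q in questions["questions"]}
    return sorted(ques_ids - res_ids), sorted(res_ids - ques_ids)


# Tiny in-memory example; real data would come from json.load on both files.
results = [{"question_id": 1, "answer": "yes"}, {"question_id": 3, "answer": "2"}]
questions = {"questions": [{"question_id": 1}, {"question_id": 2}]}

missing, extra = diff_question_ids(results, questions)
print("missing from results:", missing)
print("not in question file:", extra)
```

If either list is non-empty on the real files, the results were probably produced for a different split or subset than the one the evaluation script loads.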

setting for abstract?

The score reported for abstract images using this code is 65 on the CodaLab leaderboard.

Yet, running the code as is (which is preset for the COCO dataset), I got 55 on the validation set of the abstract dataset using the evaluation tool provided.

Since the difference is pretty large, I assume the settings (e.g. batch size, iterations, learning rate, layers) should be quite different from those for the real (COCO) dataset.

Could the settings needed to reach the reported score on the abstract dataset (i.e. how the code should be modified) be shared?

require 'cunn' and 'cutorch' in CPU mode

The eval.lua script accepts a gpuid option (-1 for CPU), but running in CPU mode still requires loading the 'cutorch' and 'cunn' packages. This leads to an error on startup, and simply commenting those lines out and running the code in CPU mode fails when loading the pretrained_model_lstm.t7 file. Is there a workaround for this?

PS : I have cunn and cutorch packages installed (no GPU though)

Number of pretrained image features not matching with number of images in COCO

Hi,

I was trying to use the pre-trained image features you provide, but looking at the shape of the hdf5 it seems to be (82459, 4096), instead of (82783, 4096) - there are 82783 images in the COCO dataset.

Which images are the ones that have been removed?

Thanks

th train.lua -backend nn failed!

envy@ub1404envy:/os_prj/github/_QA/VQA_LSTM_CNN$ ll
total 5387680
drwxrwxr-x 7 envy envy 4096 Feb 18 12:33 ./
drwxrwxr-x 5 envy envy 4096 Feb 18 00:37 ../
drwxrwxr-x 4 envy envy 4096 Feb 15 17:29 data/
-rw-rw-r-- 1 envy envy 2014627936 Feb 18 12:32 data_img.h5
-rw-rw-r-- 1 envy envy 2014627936 Dec 14 00:03 data_img.h5-ori
-rw-rw-r-- 1 envy envy 84335736 Feb 18 12:03 data_prepro.h5
-rw-rw-r-- 1 envy envy 9169211 Feb 18 12:03 data_prepro.json
-rw-rw-r-- 1 envy envy 716074236 Dec 16 14:45 data_train_val.zip
-rwxrwxr-x 1 envy envy 9395 Dec 29 19:26 eval.lua*
-rwxrwxr-x 1 envy envy 741 Dec 29 19:26 evaluate.py*
drwxrwxr-x 8 envy envy 4096 Dec 29 19:26 .git/
drwxrwxr-x 2 envy envy 4096 Dec 29 19:26 misc/
drwxrwxr-x 2 envy envy 4096 Feb 17 00:18 model/
-rw-rw-r-- 1 envy envy 3005 Feb 18 12:31 path_to_cnn_prototxt.lua
-rwxrwxr-x 1 envy envy 3403 Dec 29 19:26 prepro_img.lua*
-rwxrwxr-x 1 envy envy 9279 Dec 29 19:26 prepro.py*
-rw-rw-r-- 1 envy envy 53612941 Dec 14 19:57 pretrained_lstm_train.t7
-rw-rw-r-- 1 envy envy 49743190 Dec 16 14:04 pretrained_lstm_train_val.t7.zip
-rwxrwxr-x 1 envy envy 3625 Dec 29 19:26 readme.md*
drwxrwxr-x 2 envy envy 4096 Feb 17 00:18 result/
-rwxrwxr-x 1 envy envy 10759 Dec 29 19:26 train.lua*
-rw-rw-r-- 1 envy envy 574671192 Sep 24 2014 VGG_ILSVRC_19_layers.caffemodel
-rw-rw-r-- 1 envy envy 2715 Feb 18 12:05 yknote---log--1
envy@ub1404envy:/os_prj/github/_QA/VQA_LSTM_CNN$ th train.lua -backend nn
{
learning_rate_decay_every : 50000
batch_size : 500
gpuid : 0
common_embedding_size : 1024
input_img_h5 : "data_img.h5"
input_encoding_size : 200
learning_rate_decay_start : -1
input_json : "data_prepro.json"
num_output : 1000
input_ques_h5 : "data_prepro.h5"
rnn_size : 512
max_iters : 150000
checkpoint_path : "model/"
save_checkpoint_every : 25000
learning_rate : 0.0003
img_norm : 1
backend : "nn"
rnn_layer : 2
seed : 123
}
DataLoader loading h5 file: data_prepro.h5
DataLoader loading h5 file: data_img.h5
Building the model...
shipped data function to cuda...
/home/envy/torch/install/bin/luajit: train.lua:200: index out of range at /home/envy/torch/pkg/torch/lib/TH/generic/THTensorMath.c:156
stack traceback:
[C]: in function 'index'
train.lua:200: in function 'next_batch'
train.lua:247: in function 'opfunc'
/home/envy/torch/install/share/lua/5.1/optim/rmsprop.lua:32: in function 'rmsprop'
train.lua:303: in main chunk
[C]: in function 'dofile'
...envy/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670
envy@ub1404envy:~/os_prj/github/_QA/VQA_LSTM_CNN$

VQA_LSTM_CNN going out of memory on Titan X with 12 GB RAM

Hi,

We tried running the code on a Titan X with 12 GB of RAM, but we get the following error message. What could cause it to run out of memory?

cuda runtime error (2) : out of memory at /home/ankit/torch/extra/cutorch/lib/THC/generic/THCStorage.cu:40
stack traceback:
[C]: at 0x7fb62f736820
[C]: in function '__add'
train.lua:276: in function 'opfunc'
/home/ankit/torch/install/share/lua/5.1/optim/rmsprop.lua:32: in function 'rmsprop'
train.lua:303: in main chunk
[C]: in function 'dofile'
...nkit/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670

run prepro_img.lua failed

envy@ub1404envy:/os_prj/github/_QA/VQA_LSTM_CNN$ th prepro_img.lua -backend nn -input_json data_prepro.json -image_root data_prepro.h5 -cnn_proto model/ -cnn_model VGG_ILSVRC_19_layers.caffemodel
{
backend : "nn"
image_root : "data_prepro.h5"
cnn_proto : "model/"
batch_size : 10
input_json : "data_prepro.json"
gpuid : 1
out_name : "data_img.h5"
cnn_model : "VGG_ILSVRC_19_layers.caffemodel"
}
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:505] Reading dangerously large protocol message. If the message turns out to be larger than 1073741824 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 574671192
Successfully loaded VGG_ILSVRC_19_layers.caffemodel
conv1_1: 64 3 3 3
conv1_2: 64 64 3 3
conv2_1: 128 64 3 3
conv2_2: 128 128 3 3
conv3_1: 256 128 3 3
conv3_2: 256 256 3 3
conv3_3: 256 256 3 3
conv3_4: 256 256 3 3
conv4_1: 512 256 3 3
conv4_2: 512 512 3 3
conv4_3: 512 512 3 3
conv4_4: 512 512 3 3
conv5_1: 512 512 3 3
conv5_2: 512 512 3 3
conv5_3: 512 512 3 3
conv5_4: 512 512 3 3
fc6: 1 1 25088 4096
fc7: 1 1 4096 4096
fc8: 1 1 4096 1000
processing 82459 images...
/home/envy/torch/install/bin/luajit: /home/envy/torch/install/share/lua/5.1/image/init.lua:650: attempt to call method 'nDimension' (a nil value)
stack traceback:
/home/envy/torch/install/share/lua/5.1/image/init.lua:650: in function 'scale'
prepro_img.lua:51: in function 'loadim'
prepro_img.lua:95: in main chunk
[C]: in function 'dofile'
...envy/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00406670
envy@ub1404envy:/os_prj/github/_QA/VQA_LSTM_CNN$

envy@ub1404envy:~/os_prj/github/_QA/VQA_LSTM_CNN$ tree
.
├── data
│   ├── annotations
│   │   ├── mscoco_train2014_annotations.json
│   │   ├── mscoco_val2014_annotations.json
│   │   ├── MultipleChoice_mscoco_test2015_questions.json
│   │   ├── MultipleChoice_mscoco_test-dev2015_questions.json
│   │   ├── MultipleChoice_mscoco_train2014_questions.json
│   │   ├── MultipleChoice_mscoco_val2014_questions.json
│   │   ├── OpenEnded_mscoco_test2015_questions.json
│   │   ├── OpenEnded_mscoco_test-dev2015_questions.json
│   │   ├── OpenEnded_mscoco_train2014_questions.json
│   │   └── OpenEnded_mscoco_val2014_questions.json
│   ├── vqa_preprocessing.py
│   ├── vqa_raw_test.json
│   ├── vqa_raw_train.json
│   └── zip
│       ├── Annotations_Train_mscoco.zip
│       ├── Annotations_Val_mscoco.zip
│       ├── Questions_Test_mscoco.zip
│       ├── Questions_Train_mscoco.zip
│       └── Questions_Val_mscoco.zip
├── data_prepro.h5
├── data_prepro.json
├── data_train_val.zip
├── eval.lua
├── evaluate.py
├── misc
│   ├── LSTM.lua
│   ├── netdef.lua
│   └── RNNUtils.lua
├── model
├── path_to_cnn_prototxt.lua
├── prepro_img.lua
├── prepro.py
├── pretrained_lstm_train.t7
├── pretrained_lstm_train_val.t7.zip
├── readme.md
├── result
├── train.lua
├── VGG_ILSVRC_19_layers.caffemodel
├── vgg_ilsvrc_19_layers_deploy-prototxt
├── vgg_ilsvrc_19_layers_deploy-prototxt.lua
├── vgg_ilsvrc_19_layers_deploy-prototxt.lua.lua
├── yknote---log--1
└── yknote---log--2

6 directories, 39 files

Providing feedback through correct answer

How to implement this feedback mechanism in NN?

VQA preprocessing on OpenEnded Questions?

Firstly, this is really helpful! Thanks for making this public and reproducible.

I see that the steps and scripts to reproduce results on the multiple-choice type of questions are clearly written. Do you plan to make public the scripts to do the same on the open-ended type of questions too?

Specifically, I was curious to know how one would go about preprocessing open-ended questions with data/vqa_preprocessing.py and prepro.py so that they work with your model.

JPG is actually a PNG

COCO_val2014_000000320612.jpg is apparently a PNG and will make image preprocessing break at (quite literally) the last minute. This is more of a PSA than anything else since the problem is detected too far along in the pipeline to practically fix.
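One way to catch such files before starting the long preprocessing run is to compare each file's leading bytes against the standard PNG and JPEG signatures. This is a minimal sketch under that assumption, not part of the repository; the function names `sniff_image_type` and `find_mislabeled_jpgs` are hypothetical, and the demo writes two tiny fake files into a temporary directory instead of touching real COCO images.

```python
import os
import tempfile

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"   # 8-byte PNG file signature
JPEG_MAGIC = b"\xff\xd8\xff"       # start-of-image marker plus next marker prefix


def sniff_image_type(path):
    """Guess the image type from the first bytes of the file."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(PNG_MAGIC):
        return "png"
    if head.startswith(JPEG_MAGIC):
        return "jpeg"
    return "unknown"


def find_mislabeled_jpgs(root):
    """Return (path, detected_type) for .jpg/.jpeg files that are not JPEG."""
    bad = []
    for dirpath, _, names in os.walk(root):
        for name in sorted(names):
            if name.lower().endswith((".jpg", ".jpeg")):
                path = os.path.join(dirpath, name)
                kind = sniff_image_type(path)
                if kind != "jpeg":
                    bad.append((path, kind))
    return bad


# Demo on two fake files: one PNG signature saved as .jpg, one JPEG signature.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "fake.jpg"), "wb") as f:
    f.write(PNG_MAGIC + b"rest")
with open(os.path.join(demo_dir, "real.jpg"), "wb") as f:
    f.write(JPEG_MAGIC + b"\xe0rest")

mislabeled = find_mislabeled_jpgs(demo_dir)
print(mislabeled)
```

Running `find_mislabeled_jpgs` on the image root before feature extraction would list the offending files, which could then be converted or skipped.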

Ideas for NLP pre-processing and feature engineering

Hi all,

I'm excited to do some work on the text processing side of the Visual QA task. I develop the spaCy NLP library. I think we should be able to get some extra accuracy, with some extra NLP logic on the question parsing side. We'll see.

The first thing I'd like to try is mapping out of vocabulary words to similar tokens, using a word2vec model. For instance, let's say the word colour is OOV. Seems easy to map this to color.

Input: What colour is his shirt?
Tokens: ["What", "colour", "is", "his", "shirt", "?"]
Transform: ["What", "color", "is", "his", "shirt", "?"]

I think this input normalization trick is novel, but it makes sense to me for this problem. It lets you exploit pre-trained vectors without interfering with the rest of your model choices.
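The replacement step described above can be sketched as a nearest-neighbour lookup by cosine similarity. The vectors below are made-up two-dimensional toy values, not real word2vec output, and the function `normalize_tokens` is a hypothetical name; a real setup would load pre-trained vectors and use the model's training vocabulary.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def normalize_tokens(tokens, vocab, vectors):
    """Replace each out-of-vocabulary token by its most similar in-vocab token.

    Tokens that are already in the vocabulary, or that have no vector to
    compare with, are kept unchanged.
    """
    candidates = [w for w in vocab if w in vectors]
    out = []
    for tok in tokens:
        if tok in vocab or tok not in vectors or not candidates:
            out.append(tok)
            continue
        best = max(candidates, key=lambda w: cosine(vectors[tok], vectors[w]))
        out.append(best)
    return out


# Toy vectors, invented for illustration only.
vectors = {
    "colour": [0.90, 0.10],
    "color": [0.88, 0.12],
    "shirt": [0.10, 0.90],
}
vocab = ["What", "color", "is", "his", "shirt", "?"]
tokens = ["What", "colour", "is", "his", "shirt", "?"]

normalized = normalize_tokens(tokens, vocab, vectors)
print(normalized)
```

Because the substitution happens before the question is encoded, the rest of the model does not need to change, which is the property the post highlights.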

I think the normalization could be taken a bit further, by using the POS tagger and parser to compute context-specific keys, so that the replacement could be more exact (sense2vec). I think just the word replacement is probably okay though.

It's also easy to calculate auxiliary features with spaCy. It's easy to train a question classifier, of course. I'm not sure the model is making many errors of that type, though.

If I had to say one thing was unsatisfying about the model, I'd say it's the multiclass classification output. Have you tried having the model output a vector, and using it to find a nearest neighbour?

Trained model gets low accuracy on VQA server

Hello,

I trained the model as you described for 150K iterations, but for some reason I'm getting only 49% accuracy, while using your pretrained parameters gives 58% as you mentioned. Does anyone have an idea what could cause such a drop in performance?
Thanks!

Maybe remove the second output of the LSTM

It is a bit confusing to have two outputs when the second one is never used.

Bugs in filtering and encoding questions in prepro.py

In line 83:
question = [w if wtoi.get(w,len(wtoi)) != len(wtoi) else 'UNK' for w in txt]

and line 145:
if atoi.get(img['ans'],len(atoi)) != len(atoi):

the default value len(...) collides with the last valid index, because your indices begin at 1. The checks should instead be:
question = [w if wtoi.get(w,len(wtoi)+1) != len(wtoi)+1 else 'UNK' for w in txt]
and
if atoi.get(img['ans'],len(atoi)+1) != len(atoi)+1:
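The off-by-one can be reproduced on a toy vocabulary. The sketch below assumes, as the issue states, that the word-to-index map is 1-based; the three-word vocabulary is invented for illustration. It also shows a plain membership test, which sidesteps the sentinel value altogether and is offered here as an alternative, not as the repository's code.

```python
# 1-based vocabulary, following the issue's statement that indices begin at 1.
vocab = ["what", "color", "shirt"]
wtoi = {w: i + 1 for i, w in enumerate(vocab)}  # {'what': 1, 'color': 2, 'shirt': 3}

txt = ["what", "color", "is", "shirt"]

# Sentinel len(wtoi) == 3 equals the real index of 'shirt', so a known word
# is wrongly mapped to UNK.
buggy = [w if wtoi.get(w, len(wtoi)) != len(wtoi) else "UNK" for w in txt]

# Sentinel len(wtoi) + 1 == 4 is never a valid index, so only 'is' becomes UNK.
fixed = [w if wtoi.get(w, len(wtoi) + 1) != len(wtoi) + 1 else "UNK" for w in txt]

# A membership test needs no sentinel at all.
simple = [w if w in wtoi else "UNK" for w in txt]

print(buggy)   # ['what', 'color', 'UNK', 'UNK']
print(fixed)   # ['what', 'color', 'UNK', 'shirt']
print(simple)  # ['what', 'color', 'UNK', 'shirt']
```

The same reasoning applies to the answer lookup on line 145, where `img['ans'] in atoi` would express the intent directly.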

Number of training pictures

Hi all,
I just want to make sure: is the number of training pictures 82783 (from the VQA website)?
I ask because in data_prepro.h5 and data_prepro.json the number is 82460.

import h5py
import numpy as np

with h5py.File(h5_data_path, 'r') as hf:
    tem = hf.get('img_pos_train')
    train_data['img_list'] = np.array(tem)

when I check the data:

np.unique(train_data['img_list']).shape[0]
>> 82460

Am I missing something?

out of memory

Hi! My GPU has 8 GB of memory. When I run train.lua on a single GPU, it raises an "out of memory" error. Is there any way to solve it? I have 2 GPUs in total.

How to cite the model?

Hi,

We want to use this model as a baseline that we are comparing against, so how should we cite it?

Thanks for open-sourcing the code,
Ilija