peteanderson80 / bottom-up-attention Goto Github PK

Bottom-up attention model for image captioning and VQA, based on Faster R-CNN and Visual Genome

Home Page: http://panderson.me/up-down-attention/

License: MIT License

CMake 0.91% Makefile 0.22% HTML 0.06% CSS 0.08% Jupyter Notebook 63.93% C++ 25.32% Shell 0.47% Python 6.52% Cuda 2.10% MATLAB 0.29% C 0.08% Dockerfile 0.02%

vqa visual-question-answering captioning-images faster-rcnn caffe image-captioning mscoco mscoco-dataset

bottom-up-attention's People

Contributors

Stargazers

Watchers

Forkers

robotzheng ronghanghu bdfyb eustcpl wanesta zgsxwsdxg yufengm lilhope sithlqf shykoe leezqcst janelou wangheda logonod jdc08161063 kekedan liu-hai-yang wenxuanliu zhoulian lemonnight alicepeter xiangliu886 researcher2003pro jwyang zbxzc35 parisots vijayvee little1tow leviswind allenanthony shubhampachori12110095 vanpersie32 kuanih aodiwei raghparihar chanyn gdlsdfz mdamo chenxistephen jryongithub zhangzhizz yyf17 fword team-aupair airyym btjhjeon ardapekis dimplesl hyeonwoonoh okanlv jl0623 kuanghuei yancyycwong yuzcccc sidianshuia wangweilai1 matrixplayer hsuwanting stevenji pzzhang xiangchenchao spartag117 weimianli afcarl happyzhouch ml-lab kimisissi lzd0825 artist100 lewiszhao wywywy01 cyhbrilliant raghavgoyal14 zhiqinzhan daqingliu youngerboy michael-hsu jiangmengqi gechen dreadlord1984 jpchen2012 anirband thecharm ricklentz joon-park92 deftruth maybefeicun yuanezhou sadafgulshad1 ycremar gulugulujiang basselali1 xuliangfrdc song-heng xf15 hyzcn forence lzh990711 l1haodong fwtan

bottom-up-attention's Issues

How to load .tsv datas in Tensorflow?

I am trying to use your .tsv datas as image features for image captioning. However, I have no idea how to load the .tsv data so as to random batch sample the features items and match correlated captions. I find a way to solve it that is to turn .tsv into .json format, and make the .json file to be a dict with "image_ids" as the keys. However, the .json file to too large to load. In fact, I am even failed to generate the json file since the lack of Memory. I am also failed to use tensorflow's textlinereader.

So,
How to load .tsv datas rightly?
Looking forward to your help!
Thank you!

./lib/nms/gpu_nms.so: undefined symbol: _Py_ZeroStruct

Hi,when i running demo.ipynb, occur ./lib/nms/gpu_nms.so: undefined symbol: _Py_ZeroStruct. I have perform make.

I run python demo.py failed!

I was download the model file first. when I run demo.py get message like this:
I1017 11:02:41.250684 40048 net.cpp:131] Top shape: 1 2 126 14 (3528)
I1017 11:02:41.250689 40048 net.cpp:139] Memory required for data: 117482412
I1017 11:02:41.250692 40048 layer_factory.hpp:77] Creating layer rpn_cls_prob_reshape
I1017 11:02:41.250699 40048 net.cpp:86] Creating Layer rpn_cls_prob_reshape
I1017 11:02:41.250704 40048 net.cpp:408] rpn_cls_prob_reshape <- rpn_cls_prob
I1017 11:02:41.250713 40048 net.cpp:382] rpn_cls_prob_reshape -> rpn_cls_prob_reshape
I1017 11:02:41.250741 40048 net.cpp:124] Setting up rpn_cls_prob_reshape
I1017 11:02:41.250747 40048 net.cpp:131] Top shape: 1 18 14 14 (3528)
I1017 11:02:41.250751 40048 net.cpp:139] Memory required for data: 117496524
I1017 11:02:41.250756 40048 layer_factory.hpp:77] Creating layer proposal
F1017 11:02:41.250799 40048 layer_factory.hpp:81] Check failed: registry.count(type) == 1 (0 vs. 1) Unknown layer type: Python (known types: AbsVal, Accuracy, ArgMax, BNLL, BatchNorm, BatchReindex, Bias, BoxAnnotatorOHEM, Concat, ContrastiveLoss, Convolution, Crop, Data, Deconvolution, Dropout, DummyData, ELU, Eltwise, Embed, EuclideanLoss, Exp, Filter, Flatten, HDF5Data, HDF5Output, HingeLoss, Im2col, ImageData, InfogainLoss, InnerProduct, InnerProductBlob, Input, LRN, LSTM, LSTMUnit, Log, MVN, MemoryData, MultinomialLogisticLoss, PReLU, PSROIPooling, Parameter, Pooling, Power, RNN, ROIPooling, ReLU, Reduction, Reshape, SPP, Scale, Sigmoid, SigmoidCrossEntropyLoss, Silence, Slice, SmoothL1Loss, SmoothL1LossOHEM, Softmax, SoftmaxWithLoss, SoftmaxWithLossOHEM, Split, TanH, Threshold, Tile, WindowData)
*** Check failure stack trace: ***
Aborted (core dumped)

I recreate the pretrained feature files with 36 features per image, but the rois num of some images does not have 36?

I want recreate visual genome feature with 36 features per image.
but I find some images roi nums <36；
how can I make rois nums >=36;

where the five nums means:
rois.shape, max_conf.shape, np.argsort(max_conf)[::-1].shape,img.h,img.w

roisnum <36,it cannot get 36 features per image.

Illustrations for object and attribute predictions

Hello @peteanderson80

I was wondering how to generate the figure showing the object and attribute predictions for salient image regions with bounding boxes and labels (like the figure in this repo).

when I training the model , I find the loss_bbox is nan?

I use the vg dataset to train the model.But after 0 iter,the loss_bbox is nan.
I do not modify your code.
this is my training log:

It is yours trianing log:

Can sombody share the Pretrained features?

I used Chrome to download the features, but the speed is lower than 50kb/s and when downloading 50~100MB, it would interupt and when I redownload it will start from the begining.....
I feel helpless...

why using the VG data to train the Faster-RCNN model

Thanks for sharing the models and features. I have tried the feature for VQA with my own model, really surprising results indeed :)
I have two questions as follows:

As the VQA datasets is based on the images of MSCOCO, will it be better to train the faster rcnn model on the COCO dataset directly?
Could a better object detection model, e.g., R-fcn or Deformable R-fcn further improve the VQA performance?

Training Scripts

It is a nice work and the feature improves a lot ! Could you please share the detailed training process for the final model?

cannot train locally with the error "AttributeError: type object 'NCCL' has no attribute 'new_uid'"

File "./tools/train_net_multi_gpu.py", line 109, in
max_iter=args.max_iters, gpus=gpus)
File "/home/jzheng/PycharmProjects/bottom-up-attention/tools/../lib/fast_rcnn/train_multi_gpu.py", line 233, in train_net_multi_gpu
uid = caffe.NCCL.new_uid()
AttributeError: type object 'NCCL' has no attribute 'new_uid'

What should I do with this error?

Compilation Error

i got the following error:

token ""CUDACC_VER is no longer supported. Use CUDACC_VER_MAJOR, CUDACC_VER_MINOR, and CUDACC_VER_BUILD instead."" is not valid in preprocessor expressions

i've updated eigen already and still get the same error...
what should i do?

It seems I cannot download the trainval features from onedrive?

Can anyone give me another links?

example images without bounding boxes

First of all, thanks for making this project public.

In the paper and this repo's README, two example images are used to show your model's prediction qualitatively. (images with bike and oven) Since I can't figure out where those images are in the dataset (VG or COCO), is it possible for you to provide where those images are?

Thanks in advance.

Generation of 1600-400-20 vocab files

I run the setup_vg.py script with the max_objects/attributes/relation set to 1600/400/20 respectively. However, the generated vocabulary files are slightly different from the ones provided in the 1600-400-20 folder. Is there any manual post-procedure?

How to visualize the attention

The downloads links are too slow?

Am I the only one experiencing slow download of the zipped features? Previous links used to be fast.

GloVe word embedding for top down attention model

Thank you for the fascinating paper and code!

I notice that for the image captioning model, you decided to use the standard vocabulary and train the word embedding matrix from scratch. So I've been wondering if I apply the approach mentioned in your previous paper about Constrained Beam Search (pretrained GloVe vectors with expanded vocabulary), will it improve the performance of the model?

Test 2014 adaptive features are not getting prepared

I tried creating numpy files from test2014 variable box features using the following read_csv file in this git. but it says there's an error in padding while decoding from string to numpy array.

item[field] = np.frombuffer(base64.decodestring(item[field]),
dtype=np.float32).reshape((item['num_boxes'],-1))

it says, incorrect padding within decodestring function. I tried adding ''=" at the end but then the dimensions of resultant numpy array mismatches with num_boxes. This error occurs for every tsv file in test2014 adaptive feature set.

Tried debugging the code:
hRPQAAAAAAAAAAAAAAALSF8D0AAAAAAAAAAEK5hkBUkI4/ubE7PwAAAABMs648eTKRO2Xq5z1mSOE+aKcrPwAAAAAAAAAAAAAAABsb3T7nhK49jVvEPirGgjqzJrM+AAAAAJLICj8B0G4+HmhvPvccLz4AAAAAq37BOoivl0AtCwg8AAAAAAAAAAAAAAAAtsQoPgAAAADbHCo7AAAAAAAAAACSgow/AAAAAFYsqT+fwoM9AAAAAIEkFkHFf6U8AAAAAAAAAACspfQ+AAAAAAAAAACbh5E9AAAAAF/CRz8AAAAAAAAAAGXGa0BfWrs/FetIPKe0RD8AAAAAzPLROwAAAAAAAAAAAAAAAHy/Rz7JO49A/5cPP8bSlz4AAAAANOKEQAAAAAAAAAAAAAAAAAAAA437192 length of string=237423

/home/juan_fernandez/scripts/read_tsv.py(74)()
-> pdb.set_trace()
(Pdb) c
Traceback (most recent call last):
File "scripts/read_tsv.py", line 74, in
pdb.set_trace()
File "/home/juan_fernandez/anaconda2/envs/py27/lib/python2.7/base64.py", line 328, in decodestring
return binascii.a2b_base64(s)
binascii.Error: Incorrect padding

Any help would be appreciated.

What features are used to train a VQA model? DO you use only 2048-dimension features?

In your code, the image_id, image_h, image_w, num_boxes, boxes, features were extracted and saved. But in your paper, it seems that only features are used to present the image. Do you use the embedding of the predicted classes or bbox to train a VQA model?

Hardware Requirement, does eval need that 12G GPU memory?

F0906 11:11:48.665238   945 syncedmem.cpp:71] Check failed: error == cudaSuccess (2 vs. 0)  out of memory

i got this error when running demo.ipynb with a new picture size 416x449, but example pictures running success.

how to set larger batch size in training????

I modify the BATCH_SIZE from 64 to 192 in faster_rcnn_end2end_resnet.yml, but I get error :

F1206 14:38:02.929175 195836 loss_layer.cpp:19] Check failed: bottom[0]->num() == bottom[1]->num() (32 vs. 96) The data and label should have the same number.

I think I miss some thing in configure setting. May be another parameter should also be changed corresponding to the BATCH_SIZE.

Image caption

hi,can you release your image caption implement version?

Could you please point out the main code of this paper?

@peteanderson80 Thank you!!
Have trouble to find them

Why choose "SoftmaxWithLoss" for "loss_attr"

Hi @peteanderson80 , Thanks for your sharing.
There are some objects has multiple attributes, why choose "SoftmaxWithLoss" for "loss_attr"?

why the feature is the same when I recreate image features?

I want to recreate 36 features one images.
But I find pool5 in generate_tsv.py, all dim is the same!!!
why pool feature?not the region feature.

all dims in pool5 are the same

Have you tried training the object detection and the VQA model in an "end to end" fashion ?

Hi guys 😅 😅 I was just wondering, Have you tried doing that? If not, Do you expect the accuracy to go down if you do that?

About test results

Hi, I just run the test code using your trained resnet101 model on the test set. I got the following numbers on object detection task:

Mean AP = 0.0146
Weighted Mean AP = 0.1799
Mean Detection Threshold = 0.328

The mean AP (1.46%) is far from the number (10.2%) you reported in the table at the bottom of readme. The weighted mean AP is a bit higher than the number you reported. I am wondering whether there is a typo in your table.

thanks!

How to extract specfied region's feature?

Hi, current code can extract detected region's features, can i specify a region and extract its feature, for example, the union region of two detected regions?

Tensorflow version ?

Hi,

I have a question. I would like to know whether you will be releasing a tensorflow version of your code ?

Struggling for installation

Hello,

I am heavily failing to build the shipped Caffe with anaconda, mainly linking errors w.r.t google protobuf. Been struggling for like 7-8 hours and I am pretty close to give up.

So the question is what are the modifications shipped in the caffe/ folder? Can't we really use upstream Caffe?

not able to define function in lib/fast_rcnn/test.py

I want to define a function in lib/fast_rcnn/test.py . I implemented new function in test.py and imported test in demo.ipnb. When I access new function as test.new_function() it throw error module object has not attribute as new_function. How can we define new function test.py ?

binascii.Error: Incorrect padding

Got binascii.Error: Incorrect padding when reading image 300104 from test2014/test2014_resnet101_faster_rcnn_genome.tsv.1 with tools/read_tsv.py. Anything wrong?

Traceback (most recent call last):
  File "read_tsv.py", line 64, in <module>
    read_and_save(os.path.join(in_dir, in_file), out_dir)
  File "read_tsv.py", line 45, in read_and_save
    item['features'] = np.frombuffer(base64.decodestring(item['features']), dtype=np.float32).reshape((item['num_boxes'], -1))
  File "/usr/lib64/python2.7/base64.py", line 321, in decodestring
    return binascii.a2b_base64(s)
binascii.Error: Incorrect padding

which proto is for generate_tsv.py？

I want to use your model to get features, but I don't know what proto do I need for generate_tsv.py.
Following the "test part", I choose "models/vg/ResNet-101/faster_rcnn_end2end_final/test.prototxt", but it report error :
cudnn.hpp:122] Check failed: status == CUDNN_STATUS_SUCCESS (3 vs. 0) CUDNN_STATUS_BAD_PARAM
I will run this model in XMedia Wikipedia and Pascal. I want more introduction about this part please.
btw, do I need to resize the img shape to 224,224,3 ?
Thank you

Running evaluation script on CPU

Hello,

is it possible to run the evaluation script on CPU? Should I still install Caffe in the way proposed on the readme of this repository?

Thanks,
Claudio

Required gpu memory

Hello
I am trying to use the pretrained model to extract image features, I have GTX 1070(8 gb) and I get out of memory error when I use the network on one image, I suspect this is a caffe issue regarding memory mangament, what do you suggest to solve this issue without decreasing performance?

image captioning task

hi, peteanderson:
I have paid close attention to you for a long time in the Cross-modal field. And, your bottom-up& top-down work really made greatly improved than other works. According to your paper, i accomplished the top-down algorithm for image captioning. But i could not reproduce your 'CIDER loss'. So i just use the 'cross entropy loss' . If possible, i hope you will put your image captioning code on github.
Wish you will rollout another wonderful article in CVPR 2018.

Is this just the bottom-up attention or does it include the full VQA model?

If it's just the attention, do you plan on releasing the VQA model?

How l2-normalization over feature is implemented ?

this paper states that L2 normalization of the image features is crucial for good performance. However, you just use pool5 data, which is average pooled to become a 2048 vector in generate_tsv.py

I'm wondering if you have implemented L2-normalization over feature or not. If you did, please inform me how you do it. Thanks a lot~

ConstructorError: could not determine a constructor for the tag 'tag:yaml.org,2002:python/tuple'

Hi , I run generate_tsv.py to generate pretrained features,but i meet a problem.

here is the problem :
**Process Process-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, self._kwargs)
File "generate_tsv.py", line 157, in generate_tsv
net = caffe.Net(prototxt, caffe.TRAIN, weights=weights)
File "/home/lijinze/bottom-up-attention-master/tools/../lib/rpn/anchor_target_layer.py", line 27, in setup
layer_params = yaml.load(self.param_str)
File "/usr/local/lib/python2.7/dist-packages/yaml/init.py", line 72, in load
return loader.get_single_data()
File "/usr/local/lib/python2.7/dist-packages/yaml/constructor.py", line 39, in get_single_data
return self.construct_document(node)
File "/usr/local/lib/python2.7/dist-packages/yaml/constructor.py", line 48, in construct_document
for dummy in generator:
File "/usr/local/lib/python2.7/dist-packages/yaml/constructor.py", line 398, in construct_yaml_map
value = self.construct_mapping(node)
File "/usr/local/lib/python2.7/dist-packages/yaml/constructor.py", line 208, in construct_mapping
return BaseConstructor.construct_mapping(self, node, deep=deep)
File "/usr/local/lib/python2.7/dist-packages/yaml/constructor.py", line 133, in construct_mapping
value = self.construct_object(value_node, deep=deep)
File "/usr/local/lib/python2.7/dist-packages/yaml/constructor.py", line 88, in construct_object
data = constructor(self, node)
File "/usr/local/lib/python2.7/dist-packages/yaml/constructor.py", line 414, in construct_undefined
node.start_mark)
ConstructorError: could not determine a constructor for the tag 'tag:yaml.org,2002:python/tuple'
in "", line 2, column 11:
'scales': !!python/tuple [4, 8, 16, 32]

Could you do me a favor?Thank you very much!

Class name of detection result of MSCOCO

Would you please provide the class name of detection result of MSCOCO?

Problem in running demo.ipynb

net = caffe.Net(prototxt, caffe.TEST, weights=weights)
Traceback (most recent call last):
File "", line 1, in
Boost.Python.ArgumentError: Python argument types in
Net.init(Net, str, int)
did not match C++ signature:
init(boost::python::api::object, std::string, std::string, int)
init(boost::python::api::object, std::string, int)

When I run make -j8 && make pycaffe error.

CXX src/caffe/internal_thread.cpp
CXX src/caffe/layer.cpp
CXX src/caffe/blob.cpp
CXX src/caffe/syncedmem.cpp
CXX src/caffe/solver.cpp
CXX src/caffe/layer_factory.cpp
CXX src/caffe/data_transformer.cpp
CXX src/caffe/layers/hdf5_data_layer.cpp
In file included from ./include/caffe/util/math_functions.hpp:11:0,
from src/caffe/syncedmem.cpp:3:
./include/caffe/util/mkl_alternate.hpp:14:19: fatal error: cblas.h: No such file or directory
#include <cblas.h>
^
compilation terminated.
make: *** [.build_release/src/caffe/syncedmem.o] Error 1
make: *** Waiting for unfinished jobs....
In file included from ./include/caffe/util/math_functions.hpp:11:0,
from src/caffe/blob.cpp:7:
./include/caffe/util/mkl_alternate.hpp:14:19: fatal error: cblas.h: No such file or directory
#include <cblas.h>
^
compilation terminated.
make: *** [.build_release/src/caffe/blob.o] Error 1
In file included from ./include/caffe/util/math_functions.hpp:11:0,
from ./include/caffe/layer.hpp:12,
from src/caffe/layer_factory.cpp:8:
./include/caffe/util/mkl_alternate.hpp:14:19: fatal error: cblas.h: No such file or directory
#include <cblas.h>
^
compilation terminated.
make: *** [.build_release/src/caffe/layer_factory.o] Error 1
In file included from ./include/caffe/util/math_functions.hpp:11:0,
from ./include/caffe/layer.hpp:12,
from src/caffe/layer.cpp:1:
./include/caffe/util/mkl_alternate.hpp:14:19: fatal error: cblas.h: No such file or directory
#include <cblas.h>
^
compilation terminated.
make: *** [.build_release/src/caffe/layer.o] Error 1
In file included from ./include/caffe/util/math_functions.hpp:11:0,
from ./include/caffe/layer.hpp:12,
from ./include/caffe/layers/hdf5_data_layer.hpp:10,
from src/caffe/layers/hdf5_data_layer.cpp:17:
./include/caffe/util/mkl_alternate.hpp:14:19: fatal error: cblas.h: No such file or directory
#include <cblas.h>
^
compilation terminated.
make: *** [.build_release/src/caffe/layers/hdf5_data_layer.o] Error 1
In file included from ./include/caffe/util/math_functions.hpp:11:0,
from src/caffe/data_transformer.cpp:10:
./include/caffe/util/mkl_alternate.hpp:14:19: fatal error: cblas.h: No such file or directory
#include <cblas.h>
^
compilation terminated.
make: *** [.build_release/src/caffe/data_transformer.o] Error 1
In file included from ./include/caffe/util/math_functions.hpp:11:0,
from ./include/caffe/layer.hpp:12,
from ./include/caffe/net.hpp:12,
from ./include/caffe/solver.hpp:7,
from src/caffe/solver.cpp:6:
./include/caffe/util/mkl_alternate.hpp:14:19: fatal error: cblas.h: No such file or directory
#include <cblas.h>
^
compilation terminated.
make: *** [.build_release/src/caffe/solver.o] Error 1
In file included from ./include/caffe/util/math_functions.hpp:11:0,
from src/caffe/internal_thread.cpp:5:
./include/caffe/util/mkl_alternate.hpp:14:19: fatal error: cblas.h: No such file or directory
#include <cblas.h>
^
compilation terminated.
make: *** [.build_release/src/caffe/internal_thread.o] Error 1

activation of the relation prediction

@peteanderson80
I saw the option "HAS_RELATION" in the cfg file. I turned it on and add a top[6] data for the proposal_target_layer and set the param num_rel_classes to 21(I am not sure if this is correct for the vg_1600-400-20 dataset) and start training, I got the following error:

File "/home/work/bottom-up-attention/tools/../lib/roi_data_layer/minibatch.py", line 55, in get_minibatch
    "Generation of gt_relations doesn't accomodate dropping objects"
AssertionError: Generation of gt_relations doesn't accomodate dropping objects

Is there something wrong with my setting?

read_tsv file error

when I use default setting, which r+b, to open tsv file, error occurs like
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
in line "for item in reader:"
when I use r+
TypeError: expected bytes-like object, not str. occur at np.frombuffer(base64.decodestring(item[field]),
my environment in windows. should it matter?

ImportError: No module named gpu_nms

I can not find gpu_nms.py in nms. Need help!

generate_tsv.py problem ImportError: /usr/lib/libcblas.so.3: undefined symbol: ATL_zger2u

Hi, I am using your code generate_tsv.py to generate features, but met the problem as followed:

./caffe/python/caffe/pycaffe.py", line 13, in
from ._caffe import Net, SGDSolver, NesterovSolver, AdaGradSolver,
ImportError: /usr/lib/libcblas.so.3: undefined symbol: ATL_zger2u

what is the problem?

make -j8 && make pycaffe both successed

How about the image caption model?

Hello, the attribute extraction net using bottom-up attention you proposed is impressing! It indeed boosts the image caption performance in the paper. Besides the attention model, I am also interested in the caption model designed in the paper. In the paper, you mentioned that your caption model achieves performance comparable to start-of-art on most evaluation metrics. So to compare with my own model, can you provide your captioning model implementation code? :)

run tools/demo.ipynb error

[libprotobuf ERROR google/protobuf/text_format.cc:274] Error parsing text-format caffe.NetParameter: 6305:21: Message type "caffe.LayerParameter" has no field named "roi_pooling_param".
WARNING: Logging before InitGoogleLogging() is written to STDERR
F0216 21:21:02.787189 22074 upgrade_proto.cpp:90] Check failed: ReadProtoFromTextFile(param_file, param) Failed to parse NetParameter file: /bottom-up-attention/models/vg/ResNet-101/faster_rcnn_end2end/test.prototxt
*** Check failure stack trace: ***
Aborted

could you please give me a solution to fact this error? Thank you!!

the datasets are too big to be downloaded

Could you please list the platform version?

When I make pycaffe follow your configuration, it always arise some annoying conflicts. Could you please list the platform used in your code?

Besides, my environment is :

Ubuntu16.04
CUDA 8
CUDNN 5.0
Opencv2.4.13
mkl2016
NCCL

Is that worked?
Thanks~