
Discriminability objective for training descriptive captions

This is the implementation of the paper Discriminability Objective for Training Descriptive Captions.

Requirements

Python 2.7 (because there is no coco-caption version for Python 3)

PyTorch 1.0 (along with torchvision)

Java 1.8 (for coco-caption)

Downloads

Clone the repository

git clone --recursive https://github.com/ruotianluo/DiscCaptioning.git

Data split

In this paper we use the data split from Context-aware Captions from Context-agnostic Supervision. It's different from the standard Karpathy split, so we need to download different files.

Download link: Google drive link

To train on your own, you only need to download dataset_coco.json, but downloading cocotalk.json and cocotalk_label.h5 is suggested as well. If you want to run the pretrained model, you have to download all three files.

coco-caption

cd coco-caption
bash ./get_stanford_models.sh
cd annotations
# Download captions_val2014.json from the Google Drive link above into this folder
cd ../../

The reason we need to replace captions_val2014.json is that the original file can only evaluate images from the val2014 set, and we are using Rama's split.

Pre-computed feature

In this paper, the retrieval model uses the output of the last layer of ResNet-101, and the captioning model uses the bottom-up features from https://arxiv.org/abs/1707.07998.

The features can be downloaded from the same link; decompress them into data/cocotalk_fc and data/cocobu_att respectively.
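For reference, the sketch below shows roughly what the fc feature is: the globally pooled 2048-d output of a torchvision ResNet-101, saved as one .npy file per image. This is only an assumption-laden illustration; the repo's own extraction scripts may differ in preprocessing and naming, and example.jpg and the output filename are hypothetical.

import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image

# Drop the classifier head; the remaining stack ends in global average pooling.
resnet = models.resnet101(pretrained=True)
extractor = torch.nn.Sequential(*list(resnet.children())[:-1])
extractor.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open('example.jpg').convert('RGB')).unsqueeze(0)
with torch.no_grad():
    feat = extractor(img).view(-1).numpy()        # shape (2048,)
# One .npy per COCO image id (hypothetical filename shown here).
np.save('data/cocotalk_fc/example_id.npy', feat)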

Pretrained models

Download the pretrained models from the link and decompress them into the root folder.

To evaluate with a pretrained model, run:

bash eval.sh att_d1 test

The pretrained models match the results reported in the paper.

Train on your own

Preprocessing

Preprocess the captions (skip if you already have 'cocotalk.json' and 'cocotalk_label.h5'):

$ python scripts/prepro_labels.py --input_json data/dataset_coco.json --output_json data/cocotalk.json --output_h5 data/cocotalk

Preprocess for self-critical training:

$ python scripts/prepro_ngrams.py --input_json data/dataset_coco.json --dict_json data/cocotalk.json --output_pkl data/coco-train --split train

Start training

First, train a retrieval model:

bash run_fc_con.sh

Second, pretrain the captioning model:

bash run_att.sh

Third, finetune the captioning model with CIDEr + discriminability optimization:

bash run_att_d.sh 1

(Here 1 is the discriminability weight; it can be changed to other values.)
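For intuition, the sketch below shows one way such a combined self-critical reward can be wired up. It is a hypothetical illustration: cider_reward and retrieval_reward are stand-in stubs, not functions from this repo.

import numpy as np

def cider_reward(caps):                       # stub: a real scorer compares to references
    return np.zeros(len(caps))

def retrieval_reward(caps, images):           # stub: a real scorer uses the retrieval model
    return np.zeros(len(caps))

def combined_reward(sampled, greedy, images, disc_weight):
    # Self-critical baseline: sampled-caption reward minus greedy-decoded
    # reward, applied to both the CIDEr and the discriminability terms.
    r_cider = cider_reward(sampled) - cider_reward(greedy)
    r_disc = retrieval_reward(sampled, images) - retrieval_reward(greedy, images)
    return r_cider + disc_weight * r_disc     # disc_weight is the "1" in run_att_d.sh 1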

Evaluate

bash eval.sh att_d1 test

Citation

If you find this code useful, please consider citing:

@InProceedings{Luo_2018_CVPR,
author = {Luo, Ruotian and Price, Brian and Cohen, Scott and Shakhnarovich, Gregory},
title = {Discriminability Objective for Training Descriptive Captions},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2018}
}

Acknowledgements

The code is based on ImageCaptioning.pytorch


disccaptioning's Issues

Issues in FCModel

When I try to train the retrieval model using bash run_fc_con.sh, an error occurs before the model is saved. Specifically, line 147 of FCModel.py, xt = self.new_img_embed(fc_feats[k:k+1], fc_feats_d.chunk(batch_size)[k]).expand(beam_size, self.input_encoding_size), triggers the issue, because the model has no attribute named new_img_embed. Also, fc_feats_d is an unresolved reference according to PyCharm.

But after switching --caption_model fc to --caption_model att2in2 in the run_fc_con.sh file, the issue is solved, so I think the inference code of the FC model could be wrong.

BTW, (1) when training the retrieval model, why do you need to use the caption model to generate captions?
(2) The image features are extracted before training the model, so I think you do not fine-tune the CNN, right?

Training curve of reinforcement learning

When I train the model with RL using run_att_d.sh, the CIDEr score drops significantly.
I saw that you have provided the training curve of the VSE model in #11.
Would you mind providing the training curve of the RL stage as well?
Thank you very much :D

How to train on TopDown model?

When I try to run the TopDown model,
I get the following error:

File "/home/code/DiscCaptioning/models/AttModel.py", line 476, in forward
att_lstm_input = torch.cat([prev_h, fc_feats, xt], 1)
RuntimeError: zero-dimensional tensor (at position 0) cannot be concatenated

Would you please tell me which parts of code I need to modify so I can train on the TopDown model?
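For what it's worth, the error itself is plain PyTorch behavior: torch.cat refuses 0-dim tensors. The snippet below reproduces it and shows one possible guard; this is a sketch of the failure mode, assuming fc_feats arrives as a 0-dim placeholder, not the repo's official fix.

import torch

prev_h = torch.zeros(5, 512)
xt = torch.zeros(5, 512)
fc_feats = torch.tensor(0.)              # a 0-dim placeholder reproduces the error
# torch.cat([prev_h, fc_feats, xt], 1)   # RuntimeError: zero-dimensional tensor ...

# One possible guard: broadcast the placeholder to a proper 2-D shape first.
if fc_feats.dim() == 0:
    fc_feats = fc_feats.expand(prev_h.size(0), 1)
out = torch.cat([prev_h, fc_feats, xt], 1)   # shapes (5,512) + (5,1) + (5,512)
print(out.shape)                             # torch.Size([5, 1025])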

Make sure the vse opt are the same !!!!!

After training the retrieval model with bash run_fc_con.sh, I pretrain the captioning model with bash run_att.sh. However, it is not successful; I get the following output:

DataLoader loading json file: data/cocotalk.json
vocab size is 9487
DataLoader loading h5 file: data/cocotalk_fc data/cocobu_att data/cocotalk_label.h5
max sequence length in data is 16
read 123287 image features
assigned 113287 images to split train
assigned 5000 images to split val
assigned 5000 images to split test
Make sure the vse opt are the same !!!!!
Make sure the vse opt are the same !!!!!
Make sure the vse opt are the same !!!!!
Make sure the vse opt are the same !!!!!
...

key caption_generator.core.a2c.weight in model.state_dict() not in loaded state_dict
key caption_generator.core.h2h.bias in model.state_dict() not in loaded state_dict
key caption_generator.att_embed.0.weight in model.state_dict() not in loaded state_dict
...
key caption_generator.core.h2h.weight in model.state_dict() not in loaded state_dict
Read data: 0.360612869263
/home/jzheng/PycharmProjects/DiscCaptioning/misc/utils.py:123: UserWarning: volatile was removed (Variable.volatile is always False)
if isinstance(x, Variable) and volatile!=x.volatile:
/home/jzheng/PycharmProjects/DiscCaptioning/models/AttModel.py:514: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
weight = F.softmax(dot) # batch * att_size
/home/jzheng/PycharmProjects/DiscCaptioning/models/AttModel.py:125: UserWarning: Implicit dimension choice for log_softmax has been deprecated. Change the call to include dim=X as an argument.
output = F.log_softmax(self.logit(output))
/home/jzheng/PycharmProjects/DiscCaptioning/models/AttModel.py:131: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
self._loss['xe'] = loss.data[0]
/home/jzheng/PycharmProjects/DiscCaptioning/models/JointModel.py:122: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
self._loss['loss_cap'] = loss_cap.data[0]
iter 0 (epoch 0), train_loss = 9.190, time/batch = 0.157
loss_cap = 9.190 loss = 9.190 cap_xe = 9.190 loss_vse = 0.000
Read data: 0.176112890244
iter 1 (epoch 0), train_loss = 8.774, time/batch = 0.129
loss_cap = 8.774 loss = 8.774 cap_xe = 8.774 loss_vse = 0.000
Read data: 0.179361104965
iter 2 (epoch 0), train_loss = 8.381, time/batch = 0.116
loss_cap = 8.381 loss = 8.381 cap_xe = 8.381 loss_vse = 0.000
...

Because of this error, the model is not learning.

FileNotFoundError: [Errno 2] No such file or directory: 'cider/data/cocotalk_fc\\391895.npy'

After spending a whole afternoon fixing 87 bugs in this repo one by one, I was finally able to start training, but not surprisingly, I got stuck on another issue. This time I have no idea how to fix it, as there is barely any related Q&A on Google (e.g., what is 391895.npy? Where can I find it? What could replace it? Or how can I bypass this seemingly simple FileNotFoundError?). Would anyone kindly give a hint on how I might sort out this issue? Many thanks.

The command that triggers this issue (after following all the previous steps): bash run_fc_con.sh
The error message:

DataLoader loading json file: data/cocotalk.json
vocab size is 9487
DataLoader loading h5 file: data/cocotalk_fc data/cocobu_att data/cocotalk_label.h5
max sequence length in data is 16
read 123287 image features
assigned 113287 images to split train
assigned 5000 images to split val
assigned 5000 images to split test

Traceback (most recent call last):
File "train.py", line 242, in <module>
train(opt)
File "train.py", line 109, in train
data = loader.get_batch('train')
File "D:\Project\DiscCaptioning\dataloader.py", line 138, in get_batch
ix, tmp_wrapped = self._prefetch_process[split].get()
File "D:\Project\DiscCaptioning\dataloader.py", line 264, in get
tmp = self.split_loader.next()
File "D:\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 521, in __next__
data = self._next_data()
File "D:\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 561, in _next_data
data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
File "D:\Anaconda3\lib\site-packages\torch\utils\data\_utils\fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "D:\Anaconda3\lib\site-packages\torch\utils\data\_utils\fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "D:\Project\DiscCaptioning\dataloader.py", line 205, in __getitem__
return (np.load(os.path.join("cider/" + self.input_fc_dir, str(self.info['images'][ix]['id']) + '.npy')), np.zeros((1,1)), ix)
File "D:\Anaconda3\lib\site-packages\numpy\lib\npyio.py", line 428, in load
fid = open(os_fspath(file), "rb")
FileNotFoundError: [Errno 2] No such file or directory: 'cider/data/cocotalk_fc\\391895.npy'
Terminating BlobFetcher

(After downloading and unzipping cocotalk_fc.tar from the Google Drive link, I obtain a folder cocotalk_fc with a structure like this:
cocotalk_fc (the root folder obtained from unzipping) -> cocotalk_fc (a binary file)
After I move it under the cider directory, the path to the end of this directory branch looks like this:
DiscCaptioning -> cider -> cocotalk_fc (the root folder after unzipping) -> cocotalk_fc (a binary file)
Also, it doesn't help to simply rename the binary cocotalk_fc file to 391895 or 391895.npy; that still throws the same error. I'm hence stuck.)
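Judging from the traceback, the loader prefixes cider/ to input_fc_dir and expects one <image_id>.npy file per image. A quick sanity check of that assumed layout:

from __future__ import print_function
import os

# Path mirrors the one in the error message above.
path = os.path.join('cider', 'data/cocotalk_fc', '391895.npy')
print(path, '->', 'found' if os.path.isfile(path) else 'MISSING')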

ValueError: sampler should be an instance of torch.utils.data.Sampler

pytorch 1.0.0

bash eval.sh att_d1 test

Traceback (most recent call last):
File "eval.py", line 146, in <module>
vars(opt))
File "/content/DiscCaptioning/eval_utils.py", line 92, in eval_split
data = loader.get_batch(split)
File "/content/DiscCaptioning/dataloader.py", line 137, in get_batch
ix, tmp_wrapped = self._prefetch_process[split].get()
File "/content/DiscCaptioning/dataloader.py", line 256, in get
self.reset()
File "/content/DiscCaptioning/dataloader.py", line 235, in reset
collate_fn=lambda x: x[0]))
File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 805, in __init__
batch_sampler = BatchSampler(sampler, batch_size, drop_last)
File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/sampler.py", line 146, in __init__
.format(sampler))
ValueError: sampler should be an instance of torch.utils.data.Sampler, but got sampler=[......]
Terminating BlobFetcher
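One common workaround for this PyTorch API change (an assumption, not an official fix from this repo) is to wrap the plain index list that dataloader.py passes as sampler in a real torch.utils.data.Sampler subclass:

import torch.utils.data

class ListSampler(torch.utils.data.Sampler):
    """Minimal Sampler that just iterates over a fixed list of indices."""
    def __init__(self, indices):
        self.indices = indices

    def __iter__(self):
        return iter(self.indices)

    def __len__(self):
        return len(self.indices)

# In dataloader.py, pass sampler=ListSampler(index_list) instead of the raw list.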

att_masks

Excuse me, would you mind explaining the function of att_masks?
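For anyone landing here, the following is a hedged illustration of what att_masks typically encodes in bottom-up attention models: images contribute different numbers of region features, so they are zero-padded to a common length and the mask flags the real rows.

import numpy as np

num_regions = [36, 18]                  # example per-image region counts
max_len = max(num_regions)
att_masks = np.zeros((len(num_regions), max_len), dtype='float32')
for i, n in enumerate(num_regions):
    att_masks[i, :n] = 1                # 1 = real region, 0 = padding
# Attention weights over the padded slots are zeroed out (and renormalized)
# using this mask, so padding never receives attention.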

How come the evaluation result is very bad (Bleu_4 is 0.000, METEOR is 0.009)? BTW, how can I generate captions on a customized dataset?

tokenization...
PTBTokenizer tokenized 99344 tokens at 721579.12 tokens per second.
PTBTokenizer tokenized 16786 tokens at 239467.86 tokens per second.
setting up scorers...
computing Bleu score...
{'reflen': 15132, 'guess': [15165, 13543, 11921, 10299], 'testlen': 15165, 'correct': [30, 0, 0, 0]}
ratio: 1.00218080888
Bleu_1: 0.002
Bleu_2: 0.000
Bleu_3: 0.000
Bleu_4: 0.000
computing METEOR score...
METEOR: 0.009
computing Rouge score...
ROUGE_L: 0.002
computing CIDEr score...
CIDEr: 0.001
computing SPICE score...
Parsing reference captions
Parsing test captions
SPICE evaluation took: 2.144 s
SPICE: 0.002
loss: {'loss': tensor(31.6388, device='cuda:0'), 'cap_xe': tensor(31.6419, device='cuda:0'), 'retrieval_loss_greedy': tensor(7.4241, device='cuda:0'), 'retrieval_sc_loss': tensor(1.00000e-03 *
-3.1324, device='cuda:0'), 'loss_vse': tensor(0., device='cuda:0'), 'loss_cap': tensor(31.6419, device='cuda:0'), 'retrieval_loss': tensor(7.6047, device='cuda:0')}
{u'SPICE_Object': '0.006404463463649654', u'SPICE_Cardinality': '0.0', u'SPICE_Attribute': '0.0', 'CIDEr': '0.001079661462843171', u'SPICE_Size': '0.0', 'Bleu_4': 1.04439324421061e-15, 'Bleu_3': 2.3054219540753186e-14, 'Bleu_2': 1.208598304910465e-11, 'Bleu_1': 0.001978239366963272, u'SPICE_Color': '0.0', 'ROUGE_L': '0.001795472073475935', 'METEOR': 0.009059195566343728, u'SPICE_Relation': '0.0', 'SPICE': '0.0024048127567198488'}
Terminating BlobFetcher

Is this evaluating the image caption model? It looks like the retrieval model.
image 474190: woods conditioner china memorial scraper sash bringing woods interstate sunroof distant
image 277907: woods pairs china listed want listed bringing woods crowd
image 43033: woods hanging service woods peep dinosaurs cooking wonder
image 542103: woods conditioner china memorial gooey bringing cooking gain woody adorable
image 356116: woods majestically rice bringing cooking gain woody woods peep
image 538581: woods hanging service woods windsurfer dinosaurs cooking weeds woody woods windsurfer
image 359354: woods hanging effects woods silver dinosaurs woods silver
image 457146: woods captive honk bringing retrieve china woods holds
image 75305: woods majestically honk lots woods goofing woody woods silver
image 249968: woods troll honk bringing cooking fir china woods bubble foreheads
image 480451: woods hanging catchers woods tightly hollow bringing woods tightly hitting
image 379596: woods hangings china pouches want pouches bringing woods goofing
image 322362: woods patch benched honk bringing woods holds woody woods overgrowth gains
image 495233: woods conditioner china memorial honk bringing woods caddy musical woods overgrowth gains
image 366948: woods conditioner china lipstick rice dinosaurs woods mirrors
image 332833: woods burrito levels honk bringing cooking lock woody woods keypad
image 512346: woods hanging service woods draining buddhist dinosaurs woods peek
evaluating validation preformance... 2049/5000 (31.236956)

infos_att_d1.pkl

Thanks for your work! Could you provide infos_att_d1.pkl for us?

How many sentences per image are used?

Hello! When running your code, I noticed that the opt file defaults to one sentence per image, and the final joint-training script also explicitly specifies one sentence per image, but the other two scripts do not specify it. I'm confused by this. In your implementation, how many sentences per image did you use for training the self-retrieval model, for pretraining the caption model, and for the final joint training, respectively?

Unable to download pretrained models

Thanks for the implementation.

I downloaded the pretrained models, but the downloaded file is not the folder that the eval.sh code expects, and it is in a format that is not accessible.

I think the file gets corrupted when downloading from the drive.

IOError: [Errno 20] Not a directory: 'log_att_d1/infos_att_d1.pkl'

evaluate error: KeyError: 'att_masks', not att_masks in data

File "/home/jzheng/PycharmProjects/DiscCaptioning/eval_utils.py", line 114, in eval_split
data['att_masks'][np.arange(loader.batch_size) * loader.seq_per_img]]
KeyError: 'att_masks'
There's no att_masks key in the data dict. Neither are labels or masks. Am I missing something?

I'm testing on the val2014 dataset.

Similar work

I think your work is very similar to "Deep Reinforcement Learning-based Image Captioning with Embedding Reward". I wonder what the difference between them is?

f30k-caption?

Hello, where can I download the f30k-caption data used in your annFile = 'f30k-caption/annotations/dataset_flickr30k.json'? When I use Flickr30k, the eval code goes wrong. Is there eval code for Flickr30k?

the retrieval loss doesn't converge well

Hello Luo,
When I pretrain the VSEFCmodel, the vse_loss doesn't converge well; it stays around 51.2. Are there mistakes in my experiments? What was your vse_loss when you pretrained the VSEFCmodel?
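For comparison, below is a hedged sketch of the max-margin contrastive (ranking) loss commonly used to train visual-semantic embeddings; the margin value and the sum reduction are assumptions, not values read from this repo's configuration. Note that with a sum reduction over a large batch, absolute loss values in the tens are not unusual.

import torch

def contrastive_loss(im, s, margin=0.2):
    # im, s: L2-normalized image and caption embeddings, shape (batch, dim);
    # row i of each is a matched image-caption pair.
    scores = im.mm(s.t())                                 # cosine similarities
    diag = scores.diag().view(-1, 1)
    cost_s = (margin + scores - diag).clamp(min=0)        # rank captions per image
    cost_im = (margin + scores - diag.t()).clamp(min=0)   # rank images per caption
    mask = torch.eye(scores.size(0)) > 0.5                # ignore the positive pairs
    cost_s = cost_s.masked_fill(mask, 0)
    cost_im = cost_im.masked_fill(mask, 0)
    return cost_s.sum() + cost_im.sum()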

AttributeError: 'Namespace' object has no attribute 'evaluation_retrieval'

While I was training with the command bash run_fc_con.sh, the training was terminated by the following error:

Traceback (most recent call last):
File "train.py", line 250, in
train(opt)
File "train.py", line 163, in train
if opt.evaluation_retrieval:
AttributeError: 'Namespace' object has no attribute 'evaluation_retrieval'
Terminating BlobFetcher

However, in the opt.py file, I don't see an "evaluation_retrieval" argument.
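One possible local workaround (an assumption about intent: train.py appears to read this flag as an on/off switch) is to register the missing option in the options file with a harmless default:

import argparse

parser = argparse.ArgumentParser()
# ... existing options elided ...
parser.add_argument('--evaluation_retrieval', type=int, default=0,
                    help='whether to evaluate retrieval performance (assumed on/off switch)')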
