This is a PyTorch implementation of Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning.
The code is written in Python, and you will need a GPU to train the model.
You also need to install the following packages to run the code successfully.
- torch
- torchvision
- h5py
- scipy
- tqdm
- nltk
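Assuming you use pip, the dependencies can be installed with something like:
pip install torch torchvision h5py scipy tqdm nltk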
You are free to choose MSCOCO, Flickr8k, or Flickr30k as your dataset.
You can use the following commands to download the MSCOCO images:
wget http://images.cocodataset.org/zips/train2014.zip
wget http://images.cocodataset.org/zips/val2014.zip
We will use Andrej Karpathy's training, validation, and test splits. To download the zip file, you can use the following command:
wget http://cs.stanford.edu/people/karpathy/deepimagesent/caption_datasets.zip
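One way to unpack the downloaded archives (the layout is up to you; [YOUR-IMAGE-FOLDER] is whatever directory you later pass to the preprocessing script):
unzip train2014.zip -d [YOUR-IMAGE-FOLDER]
unzip val2014.zip -d [YOUR-IMAGE-FOLDER]
unzip caption_datasets.zip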
To preprocess the MSCOCO data, use the following commands:
mkdir coco_folder
python create_input_files.py -d coco -i [YOUR-IMAGE-FOLDER]
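The preprocessing step typically writes HDF5 image files and JSON word-map files (as in the tutorial repo this code is based on). As a quick sanity check you can open them with h5py; the paths and dataset keys below are hypothetical, since the actual names and output folder are determined by create_input_files.py:
import h5py
import json

# Hypothetical paths -- adjust to whatever create_input_files.py actually writes.
with h5py.File('coco_folder/TRAIN_IMAGES_coco.hdf5', 'r') as h:
    print(h['images'].shape)      # assumed dataset key 'images' with layout (N, 3, H, W)

with open('coco_folder/WORDMAP_coco.json') as f:
    word_map = json.load(f)
print(len(word_map))              # vocabulary size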
Use the following command to train the model on the MSCOCO dataset:
python train.py -d coco
For comparison, you may also want to train the model with soft attention (paper):
python train.py -d coco -a
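To make the difference between the two modes concrete, here is a minimal sketch of the visual sentinel / adaptive attention mechanism from the paper, written against torch.nn; the class, argument names, and shapes are illustrative and not the exact modules used in this repository:
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualSentinelAttention(nn.Module):
    # Minimal sketch of adaptive attention with a visual sentinel (Lu et al., 2017).
    # Assumes the spatial features V and the LSTM input x_t are already projected to
    # the decoder hidden size D; names and shapes are illustrative, not this repo's code.
    def __init__(self, hidden_dim, att_dim):
        super().__init__()
        self.w_x = nn.Linear(hidden_dim, hidden_dim)  # gate term from LSTM input x_t
        self.w_h = nn.Linear(hidden_dim, hidden_dim)  # gate term from previous state h_{t-1}
        self.v_proj = nn.Linear(hidden_dim, att_dim)  # projects regions and sentinel
        self.h_proj = nn.Linear(hidden_dim, att_dim)  # projects current state h_t
        self.score = nn.Linear(att_dim, 1)

    def forward(self, V, x_t, h_prev, h_t, m_t):
        # V: (batch, k, D) spatial features, m_t: (batch, D) LSTM memory cell
        g_t = torch.sigmoid(self.w_x(x_t) + self.w_h(h_prev))      # sentinel gate
        s_t = g_t * torch.tanh(m_t)                                 # visual sentinel

        # Attend over the k regions plus the sentinel as a (k+1)-th candidate.
        cand = torch.cat([V, s_t.unsqueeze(1)], dim=1)              # (batch, k+1, D)
        att = torch.tanh(self.v_proj(cand) + self.h_proj(h_t).unsqueeze(1))
        alpha = F.softmax(self.score(att).squeeze(-1), dim=1)       # (batch, k+1)

        beta = alpha[:, -1:]                                        # weight given to the sentinel
        c_hat = (alpha.unsqueeze(-1) * cand).sum(dim=1)             # adaptive context vector
        return c_hat, alpha, beta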
You are free to choose different beam sizes during evaluation. Use the following command to compute all BLEU scores (BLEU-1 to BLEU-4):
python eval.py -d coco -cf [PATH-TO-CHECKPOINT] -b 5
Note that the best checkpoint during training is selected based on the BLEU-4 score.
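If you want to reproduce a BLEU score outside eval.py, NLTK's corpus_bleu can be applied to tokenized hypotheses and reference sets; the captions below are illustrative, not this repo's evaluation code:
from nltk.translate.bleu_score import corpus_bleu

# Each hypothesis is paired with a list of tokenized reference captions.
references = [[['a', 'dog', 'runs', 'on', 'the', 'grass'],
               ['a', 'dog', 'is', 'running', 'outside']]]
hypotheses = [['a', 'dog', 'runs', 'on', 'grass']]

bleu4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25))
print(bleu4)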
To caption your own image, you can use the following command:
python caption.py
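How caption.py reads and preprocesses the image depends on the script itself, but a typical pipeline for a CNN encoder (resize plus ImageNet normalization) looks roughly like this; the file name, size, and normalization constants are illustrative:
from PIL import Image
import torchvision.transforms as T

# Typical input pipeline for a ResNet/VGG encoder; not necessarily what caption.py uses.
transform = T.Compose([
    T.Resize((256, 256)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
img = transform(Image.open('your_image.jpg').convert('RGB')).unsqueeze(0)  # (1, 3, 256, 256)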
If you use this code as part of any published research, please acknowledge the following paper:
@inproceedings{Lu2017Adaptive,
  author    = {Lu, Jiasen and Xiong, Caiming and Parikh, Devi and Socher, Richard},
  title     = {Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning},
  booktitle = {CVPR},
  year      = {2017}
}
The code is based on a-PyTorch-Tutorial-to-Image-Captioning.