image-captioning-3's Introduction

image-captioning

PyTorch implementations of image captioning models with several types of attention mechanisms. The only encoders currently provided are pretrained ResNet152 and VGG16 with batch normalization.

Models supported:
FC from "Show and Tell"
Att2all from "Show and Tell"
Att2in from "Self-critical Sequence Training for Image Captioning"
Spatial attention from "Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning"
Adaptive attention from "Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning"

Captions are evaluated with capeval/, which is derived from tylin/coco-caption with minor changes for better Python 3 support.

Requirements

  • The original MSCOCO dataset. Put all of its files in the same directory, e.g. COCO2014/, and set COCO_ROOT in configs.py to that directory.
  • Karpathy's split is required instead of a random split. Put it in the COCO_PATH.
  • PyTorch v0.3.1 or newer with GPU support.
  • TensorBoardX
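
As a minimal sketch of the configuration step above, the dataset root in configs.py might look like the following. Only the name COCO_ROOT comes from this README; the example value mirrors the COCO2014/ directory mentioned above, and check_coco_root is a hypothetical helper added purely for illustration.

```python
import os

# Directory that holds the original MSCOCO files.
# Example value only; point this at your own copy of the dataset.
COCO_ROOT = "COCO2014/"


def check_coco_root(root=COCO_ROOT):
    """Hypothetical helper: report whether the dataset directory exists."""
    return os.path.isdir(root)
```

Calling check_coco_root() before preprocessing is one way to catch a wrong path early.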

Usage

1. Preprocessing

First, preprocess the images and store them locally. Phases can be specified if parallel processing is required. All preprocessed images are stored in HDF5 databases in COCO_ROOT.

python preprocess.py

2. Extract image features

Extract the image features offline with the encoder and store them locally. Currently only ResNet152 and VGG16 with batch normalization are supported.

python extract.py --pretrained=resnet --batch_size=10 --gpu=0

3. Training the model

Training can be performed only after the image features have been extracted. To train on the full dataset, set train_size to -1. Immediate evaluation with beam search after training is also available; set the evaluation flag to true. The scores are stored in scores/.

python train.py --train_size=100 --val_size=10 --test=10 --epoch=30 --verbose=10 --learning_rate=1e-3 --batch_size=10 --gpu=0 --pretrained=resnet --attention=none --evaluation=true
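
The convention that a size of -1 selects the full dataset can be sketched as below. This is an illustration of the convention only, not the code in train.py; resolve_size is a hypothetical name, and clamping an oversized request to the number of available samples is an assumption.

```python
def resolve_size(requested, available):
    """Return how many samples to use.

    -1 means "use everything"; any other non-negative value is taken
    as an upper bound (assumption: it is clamped to what is available).
    """
    if requested == -1:
        return available
    if requested < 0:
        raise ValueError("size must be -1 or a non-negative integer")
    return min(requested, available)
```

With this reading, --train_size=100 uses 100 training images, while --train_size=-1 uses all of them.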

4. Offline evaluation

After the training is over, an offline evaluation can be performed. All generated captions are stored in results/

python evaluation.py --train_size=100 --test_size=10 --num=3 --batch_size=10 --gpu=0 --pretrained=resnet --attention=none --encoder=<path_to_encoder> --decoder=<path_to_decoder>

Note that train_size must match the number of images that were used for training.

5. Visualize attention weights

This step applies only to models with attention.

python show_attention.py --phase=test --pretrained=resnet --train_size=-1 --val_size=-1 --test_size=-1 --num=10 --encoder=<path_to_encoder> --decoder=<path_to_decoder> --gpu=0
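
For orientation, the weights that such a visualization displays usually come from a soft attention step: each image region gets a relevance score, the scores are normalized with a softmax, and the region features are averaged with those weights. The sketch below shows that generic computation in plain Python. It is not taken from show_attention.py, and the names softmax and attend are hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(features, scores):
    """Soft attention over image regions.

    features: list of region feature vectors (equal length)
    scores:   one raw relevance score per region
    Returns (weights, context), where context is the weighted sum
    of the region features.
    """
    weights = softmax(scores)
    dim = len(features[0])
    context = [
        sum(w * f[d] for w, f in zip(weights, features))
        for d in range(dim)
    ]
    return weights, context
```

The weights list has one entry per image region, so it can be reshaped to the spatial grid of the encoder output and overlaid on the image.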

Results

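Good captions

[Image: examples of good captions]

Okay captions

[Image: examples of okay captions]

Bad captions

[Image: examples of bad captions]

Attention

Good results

[Images: attention visualizations, good results]

Bad results

[Image: attention visualization, bad result]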

Performance

Model                        BLEU-1  BLEU-2  BLEU-3  BLEU-4  CIDEr
Baseline (Nearest neighbor)  0.48    0.281   0.166   0.1     0.383
FC                           0.720   0.536   0.388   0.286   0.805
Att2in                       0.732   0.553   0.402   0.296   0.837
Att2all                      0.732   0.554   0.403   0.296   0.838
Spatial attention            0.725   0.537   0.389   0.287   0.812
Adaptive attention           0.716   0.524   0.379   0.278   0.808
NeuralTalk2                  0.625   0.45    0.321   0.23    0.66
Show and Tell                0.666   0.461   0.329   0.27    -
Show, Attend and Tell        0.707   0.492   0.344   0.243   -
Adaptive Attention           0.742   0.580   0.439   0.266   1.085
Neural Baby Talk             0.755   -       -       0.347   1.072
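
To make the BLEU-n columns easier to read, here is a minimal sketch of the clipped (modified) n-gram precision that BLEU is built on. It is an illustration only: the scores in the table come from capeval/, and full BLEU additionally combines the precisions for n = 1 to 4 with a geometric mean and applies a brevity penalty. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against its references.

    Each candidate n-gram is credited at most as many times as it
    appears in any single reference.
    """
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(
        min(count, max_ref_counts[gram]) for gram, count in cand_counts.items()
    )
    return clipped / sum(cand_counts.values())
```

Clipping is what keeps a degenerate caption such as "the the the the" from scoring well just because "the" occurs in the references.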

Hyperparameters of the best models:

Model               train_size  test_size  learning_rate  weight_decay  batch_size  beam_size  dropout
FC                  -1          -1         2e-4           0             512         7          0
Att2in              -1          -1         5e-4           1e-4          256         7          0
Att2all             -1          -1         5e-4           1e-4          256         7          0
Spatial attention   -1          -1         2e-4           1e-4          256         7          0
Adaptive attention  -1          -1         2e-4           1e-4          256         7          0
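
The beam_size column refers to beam search decoding. A generic version of the algorithm can be sketched as follows, assuming a hypothetical step_fn that maps a partial caption to the log-probabilities of possible next tokens. It does not reproduce the decoder in this repository; the default beam_size of 7 simply mirrors the table above.

```python
def beam_search(step_fn, start, end, beam_size=7, max_len=20):
    """Generic beam search over token sequences.

    step_fn(sequence) -> dict mapping each possible next token to its
    log-probability (hypothetical interface, chosen for illustration).
    Returns the highest-scoring sequence found.
    """
    beams = [([start], 0.0)]  # (sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates = []
        for seq, score in beams:
            for token, logp in step_fn(seq).items():
                candidates.append((seq + [token], score + logp))
        if not candidates:
            break
        # Keep only the beam_size best expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            if seq[-1] == end:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses if max_len was hit
    return max(finished, key=lambda c: c[1])[0]
```

With beam_size=1 this reduces to greedy decoding. Larger beams can recover captions whose first word is not the single most likely one but whose overall probability is higher.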

image-captioning-3's People

Contributors

daveredrum
