Coder Social home page Coder Social logo

image-caption-pytorch's Introduction

Neural image captioning with PyTorch


Implement neural image captioning models with PyTorch based on encoder-decoder architecture.

The dataset is Flikr8k, which is small enough for computing budget and quickly getting the results. Within the dataset, there are 8091 images, with 5 captions for each image. Thus it is prone to overfit if the model is too complex. The official source is broken, another links for the dataset could be here and here

The model architecture is described as below. The encoder network for the image is Resnet-101 (could be loaded from torchvision). The decoder is basically a LSTM-based language model, with the context vector (encoded image feature) as the initial hidden/cell state of the LSTM [1]. Attentive model is also implemented [2].

The model is trained by SGD with momentum. The learning rate starts from 0.01 and is divided by 10 as stuck at a plateau. The momentum of 0.9 and the weight decay of 0.001 are used.

The model [1] can obtain relatively reasonable descriptions, with the BLEU-1 test score 35.7.

Examples

Images Captions
Two dogs play in the grass.
A person is kayaking in the boat.
A boy is splashing in a pool.
Two people sit on a dock by the water.
A soccer player in a red uniform is running with a soccer ball in front of a crowd.
A snowboarder is jumping off a hill.
A brown dog is playing with a ball in the sand.
A boy in a blue shirt is running through a grassy field.
A group of people dressed in colorful costumes.

Dependencies

Pytorch 0.4.1

Reference

[1] Show and Tell: A Neural Image Caption Generator (https://arxiv.org/abs/1411.4555)
[2] Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (https://arxiv.org/abs/1502.03044)

image-caption-pytorch's People

Contributors

yurayli avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

image-caption-pytorch's Issues

Tired many times to figure out the error but could not. Can you please give me some idea on this?

RuntimeError: Error(s) in loading state_dict for CaptionModel_B:
size mismatch for rnn.embed.weight: copying a param with shape torch.Size([9080, 50]) from checkpoint, the shape in current model is torch.Size([8990, 50]).
size mismatch for rnn.linear.weight: copying a param with shape torch.Size([9080, 160]) from checkpoint, the shape in current model is torch.Size([8990, 160]).
size mismatch for rnn.linear.bias: copying a param with shape torch.Size([9080]) from checkpoint, the shape in current model is torch.Size([8990]).

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.