
Image Captioning with RNN

Dataset

An LSTM network is trained on the VizWiz-Captions dataset, which consists of 23,431 training images with 117,155 captions, 7,750 validation images with 38,750 captions, and 8,000 test images with 40,000 captions (the test captions are withheld from developers and reserved for the image captioning competition). Here, a subset of 8,000 training images (the ‘train1’ split) is used to train the RNN, and the validation images are used to evaluate model performance.

Approach

Data pre-processing steps such as converting all letters to lowercase, eliminating special characters, and removing numbers from captions are applied, and a vocabulary consisting of all the unique words across the training captions is created. Special tokens, ‘startseq’ and ‘endseq’, are added to each caption to mark its beginning and end. A pre-trained GloVe model is used to create the word embedding matrix, which initializes the embedding layer.
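The cleaning and vocabulary steps above can be sketched roughly as follows (a minimal illustration, not the notebook's exact code; the function names are hypothetical):

```python
import re

def clean_caption(caption):
    """Lowercase, strip special characters and digits, add start/end tokens."""
    caption = caption.lower()
    caption = re.sub(r"[^a-z ]", "", caption)  # keep only letters and spaces
    return "startseq " + " ".join(caption.split()) + " endseq"

def build_vocabulary(captions):
    """Collect all unique words across the cleaned training captions."""
    vocab = set()
    for cap in captions:
        vocab.update(cap.split())
    return vocab

cleaned = clean_caption("A Dog, 2 cats & a ball!")
# → 'startseq a dog cats a ball endseq'
```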

An InceptionV3 model pre-trained on the ImageNet dataset (transfer learning) is used to extract a 2048-dimensional feature vector for each image (shown as Input 2 in the image above). A data generator trains the model sequentially on partial captions: given ‘startseq’ as input, the model tries to predict the first word of the caption as its immediate output, and so on for each subsequent word.
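How a single caption expands into (partial caption, next word) training pairs inside the generator can be illustrated like this (a simplified sketch; the real generator also attaches the 2048-length image vector and pads sequences):

```python
def partial_caption_pairs(caption, word_to_idx):
    """Expand one caption into (input-sequence, next-word) training pairs."""
    seq = [word_to_idx[w] for w in caption.split() if w in word_to_idx]
    pairs = []
    for i in range(1, len(seq)):
        pairs.append((seq[:i], seq[i]))  # partial caption -> next word
    return pairs

word_to_idx = {"startseq": 1, "a": 2, "dog": 3, "endseq": 4}
pairs = partial_caption_pairs("startseq a dog endseq", word_to_idx)
# pairs[0] == ([1], 2): given 'startseq', predict 'a'
```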

The outputs from the embedding layer are fed to the LSTM layer, whose output is then merged into a single tensor with the output of the dense layer on the image-input branch. Dropout layers are added to prevent the model from overfitting. The final dense layer has size 2514, matching the size of the vocabulary created from the training set, and the word with the maximum probability is selected from the softmax layer. All models are trained for a total of 20 epochs.
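A Keras sketch of this two-branch architecture follows. The 200-dimensional embedding matches the GloVe 200d file used below; the 256-unit layer sizes, dropout rate, and maximum caption length are assumptions, not values stated in this README:

```python
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size = 2514  # size of the training vocabulary (final softmax layer)
max_len = 34       # hypothetical maximum caption length
# Image branch: 2048-length InceptionV3 feature vector -> dense layer
inputs1 = Input(shape=(2048,))
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(256, activation="relu")(fe1)
# Caption branch: word indices -> embedding -> LSTM
inputs2 = Input(shape=(max_len,))
se1 = Embedding(vocab_size, 200, mask_zero=True)(inputs2)
se2 = Dropout(0.5)(se1)
se3 = LSTM(256)(se2)
# Merge both branches into a single tensor, then predict the next word
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation="relu")(decoder1)
outputs = Dense(vocab_size, activation="softmax")(decoder2)
model = Model(inputs=[inputs1, inputs2], outputs=outputs)
```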

GloVe word embedding model - https://www.kaggle.com/datasets/incorpes/glove6b200d
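Loading the GloVe vectors into an embedding matrix keyed by the vocabulary indices might look like this (a generic sketch; the function name is hypothetical, and words missing from GloVe are left as zero rows):

```python
import numpy as np

def load_glove_embeddings(path, word_to_idx, dim=200):
    """Build a (vocab_size + 1, dim) embedding matrix from a GloVe text file."""
    matrix = np.zeros((len(word_to_idx) + 1, dim))  # row 0 reserved for padding
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, vec = parts[0], np.asarray(parts[1:], dtype="float32")
            if word in word_to_idx:
                matrix[word_to_idx[word]] = vec
    return matrix
```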

Directory

Dataset/
├─ train/
│  ├─ data/
│  ├─ descriptions.txt
│  ├─ encoded_train_images.pkl
│  ├─ img_ids.txt
│  ├─ annotations_train1.txt
├─ val/
│  ├─ data/
│  ├─ descriptions.txt
│  ├─ encoded_test_images.pkl
│  ├─ img_ids.txt
│  ├─ annotations_val.txt
glove.6B.200d.txt
model_weights/
image_captioning.ipynb

Find the saved model weights here

Results

Training Performance - Optimizing the RNNs

Validation Performance - Average of BLEU scores over the validation set
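A minimal sketch of how a per-caption BLEU-1 score can be computed and averaged over the validation set (modified unigram precision only, without the brevity penalty; a full evaluation would typically use an off-the-shelf BLEU implementation such as NLTK's):

```python
from collections import Counter

def bleu1(candidate, references):
    """Modified unigram precision of a candidate caption against references."""
    cand = candidate.split()
    counts = Counter(cand)
    # Clip each word count by its maximum count across the reference captions
    max_ref = Counter()
    for ref in references:
        for w, c in Counter(ref.split()).items():
            max_ref[w] = max(max_ref[w], c)
    clipped = sum(min(c, max_ref[w]) for w, c in counts.items())
    return clipped / len(cand)

score = bleu1("a dog runs", ["a dog is running"])
# 'a' and 'dog' match, 'runs' does not -> 2/3
```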

Contributors

nirmal-25
