

Image Captioning Project

A deep learning model that takes an image as input and generates a caption describing its contents.

Concepts used: Convolutional Neural Networks, Recurrent Neural Networks, Transfer Learning, Word Embeddings, Image and Text Processing, Multi-Layer Perceptrons, Backpropagation, Gradient Descent, and more.

Language used: Python

Libraries used: TensorFlow, Keras, NumPy, Pandas, re, etc.

For transfer learning, GloVe word embeddings and the ResNet50 model are used.
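A minimal sketch of how a pretrained ResNet50 can serve as an image feature extractor in Keras: the classification head is dropped and the globally pooled 2048-dimensional output becomes the image encoding. The function name is illustrative, and `weights=None` is used here only to keep the sketch self-contained; in practice `weights="imagenet"` loads the pretrained weights.

```python
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input

# Pretrained CNN without its classification head; global average
# pooling yields a single 2048-d feature vector per image.
# In practice use weights="imagenet"; weights=None here only avoids
# downloading the pretrained weights in this sketch.
encoder = ResNet50(weights=None, include_top=False, pooling="avg")

def encode_image(img_array):
    """img_array: a (224, 224, 3) RGB image as a float array."""
    x = preprocess_input(np.expand_dims(img_array.astype("float32"), axis=0))
    return encoder.predict(x, verbose=0)[0]  # shape: (2048,)

features = encode_image(np.random.rand(224, 224, 3) * 255)
print(features.shape)
```

These fixed-length vectors are computed once per image and cached, so the captioning model trains on features rather than raw pixels.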

REAL WORLD APPLICATIONS OF IMAGE CAPTIONING

  1. In self-driving cars, captioning the scene around the car can provide additional context and give a boost to the driving system.

  2. It can serve as an aid to blind people: the scene is first converted to text and the text to speech, which can help guide them on roads and in crowded places.

  3. In CCTV cameras, generating captions alongside the video feed would allow alarms to be raised when a generated caption indicates malicious activity.

  4. It could help make Google Image Search as capable as Google Search: every image would first be converted to a caption, and search could then be performed over those captions to find similar images.

DATA DESCRIPTION (Dataset used - Flickr8K)

The Flickr8k dataset contains around 8,000 images, divided into training and testing sets. Each image has 5 different captions written by 5 different people, to account for the fact that an image can be described in multiple ways.

METHODOLOGY ADOPTED

STEP 1 :

Words are generated one at a time to build a complete sentence. To predict each word, we provide 2 types of inputs:

  1. The image.
  2. The part of the sentence predicted so far, so that the model can use this context to predict the next word.
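The loop described above can be sketched as greedy decoding; `predict_next_word` stands in for the trained model, and the toy vocabulary and token names are illustrative:

```python
# Greedy caption generation: feed the image features plus the partial
# caption, take the most likely next word, and repeat until the end
# token or a length limit is reached.
START, END, MAX_LEN = "startseq", "endseq", 10

def generate_caption(image_features, predict_next_word):
    words = [START]
    for _ in range(MAX_LEN):
        next_word = predict_next_word(image_features, words)
        if next_word == END:
            break
        words.append(next_word)
    return " ".join(words[1:])  # drop the start token

# Toy stand-in for a trained model: emits a fixed caption.
canned = ["a", "dog", "runs", END]
def toy_model(image_features, words):
    return canned[len(words) - 1]

print(generate_caption(None, toy_model))  # -> "a dog runs"
```

In the real model, `predict_next_word` would run the network on the image features and the tokenized partial caption and take the argmax over the vocabulary.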

STEP 2 : PREPROCESSING TEXT DATA

  1. We add 2 special tokens to each caption that represent the start and the end of the sentence.
  2. We then expand each image-caption pair into multiple training samples, one for each word position in the caption.
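A minimal sketch of the expansion above (function and token names are illustrative): each caption is wrapped in start/end tokens and unrolled into one (image, partial caption, next word) sample per position:

```python
def make_samples(image_id, caption):
    """Expand one (image, caption) pair into per-word training samples."""
    tokens = ["startseq"] + caption.lower().split() + ["endseq"]
    samples = []
    for i in range(1, len(tokens)):
        # Input: the image plus the words seen so far; target: next word.
        samples.append((image_id, tokens[:i], tokens[i]))
    return samples

pairs = make_samples("img_001", "A dog runs")
for image_id, context, target in pairs:
    print(image_id, context, "->", target)
```

A 3-word caption therefore yields 4 samples, ending with one whose target is the end token, which is how the model learns when to stop.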

STEP 3 : EXTRACTING TEXT FEATURES

  1. We use pretrained GloVe word embeddings to represent our words.
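A sketch of how GloVe vectors (each line: a word followed by its vector components) can be loaded into an embedding matrix indexed by the model's vocabulary. The tiny inline "file" and the 4-dimensional vectors are illustrative; real GloVe files have 50 to 300 dimensions:

```python
import numpy as np

# A few lines in GloVe's text format (word, then vector components).
glove_lines = [
    "dog 0.1 0.2 0.3 0.4",
    "runs 0.5 0.6 0.7 0.8",
]

def build_embedding_matrix(lines, vocab, dim):
    """Rows follow the vocabulary order; out-of-vocabulary words stay zero."""
    vectors = {}
    for line in lines:
        parts = line.split()
        vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    matrix = np.zeros((len(vocab), dim), dtype="float32")
    for idx, word in enumerate(vocab):
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix

emb = build_embedding_matrix(glove_lines, ["dog", "runs", "cat"], dim=4)
print(emb.shape)  # -> (3, 4)
```

Such a matrix is typically passed as the initial weights of a Keras `Embedding` layer, which can then be frozen or fine-tuned.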

Results obtained on test images :

[Sample test images with generated captions]
