
Image Captioning With MobileNet-LLaMA 3



Figure: MobileNet V3 + LLaMA 3 architecture.

Image captioning is a computer vision problem that spans two modalities: image and text. Given an image, the model automatically generates a caption describing it. A CNN-based architecture can readily extract a numerical representation from the image, while the text side calls for a method that handles long-range dependencies. Inspired by the recent success of LLaMA 3, this project uses its computational unit, the LLaMA 3 Transformer block. This block comprises RMSNorm, Grouped-Query Attention, a SwiGLU feed-forward network, and Rotary Position Embedding. In the original implementation, the Transformer block is used only as a decoder; in this project, it is used as both the encoder and the decoder. In the encoder, image data first passes through a CNN-based architecture, MobileNet-V3, which plays a role similar to the text embedding. The architecture is therefore dubbed MobileNet-LLaMA 3. The model is evaluated on the Flickr-8k dataset, split into train, validation, and test sets in an 80-10-10 ratio. Performance is measured quantitatively with the ROUGE score, specifically the ROUGE-1 F-measure.
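
To make two of the block's components more concrete, here is a minimal sketch of RMSNorm and a SwiGLU feed-forward layer on a single token vector. It is written in dependency-free Python for illustration and is not the notebook's own code; the function names (`rms_norm`, `swiglu_ffn`), the weight arguments, and the `eps` default are all assumptions made for this example.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Divide x by its root mean square, then scale by a learned per-feature gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * g for v, g in zip(x, gain)]


def silu(v):
    """SiLU (Swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: w_down @ (silu(w_gate @ x) * (w_up @ x))."""
    gate = [silu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


# Hypothetical toy input: a 2-dimensional token vector with unit gain.
normed = rms_norm([3.0, 4.0], gain=[1.0, 1.0])
identity = [[1.0, 0.0], [0.0, 1.0]]
out = swiglu_ffn(normed, w_gate=identity, w_up=identity, w_down=[[1.0, 1.0]])
```

Unlike LayerNorm, RMSNorm does not subtract the mean; it only rescales the vector, so the output of `rms_norm` has a mean square close to 1 before the gain is applied. The SwiGLU layer uses three weight matrices instead of the usual two because one projection acts as a gate on the other.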

Experiment

See this notebook for a line-by-line walkthrough of the code; it should answer most questions about this project.

Result

Quantitative Result

The following table shows the performance of MobileNet-LLaMA 3 on the test set.

Test Metric       | Score
ROUGE-1 F-measure | 36.69%
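
For readers unfamiliar with the metric, the sketch below shows how a ROUGE-1 F-measure can be computed for one caption pair: it counts overlapping unigrams, then combines precision and recall. This is an illustration under simplifying assumptions (lowercasing and whitespace tokenization, no stemming), not the evaluation code used in the notebook; the function name `rouge1_f` and the example captions are made up.

```python
from collections import Counter


def rouge1_f(candidate, reference):
    """ROUGE-1 F-measure from clipped unigram overlap between two strings."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped overlap).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical captions: 5 shared unigrams, 6 candidate tokens, 7 reference tokens.
score = rouge1_f("a dog runs on the grass", "a brown dog runs across the grass")
print(round(score, 4))  # 0.7692, i.e. 10/13
```

A corpus-level figure such as the one in the table is typically an average of per-caption scores, and library implementations may tokenize or normalize text differently, so values from this sketch are not directly comparable to the reported 36.69%.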

Loss Curve

Figure: Loss curves of the MobileNet-LLaMA 3 model on the train and validation sets.

Qualitative Result

The following image shows the qualitative results of MobileNet-LLaMA 3 on the test set.

Figure: Image-caption pairs produced by MobileNet-LLaMA 3.

The MobileNet-LLaMA 3 model is also assessed in the wild.

Figure: Results of MobileNet-LLaMA 3 in the wild.

Citation

Feel free to cite this repository:

@misc{mobilenet-llama3,
   title = {Image Captioning With MobileNet-LLaMA 3},
   url = {https://github.com/reshalfahsi/image-captioning-mobilenet-llama3},
   author = {Resha Dwika Hefni Al-Fahsi},
}

Credit
