
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training

This is the official repository of MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training (CVPR 2024) by Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Raviteja Vemulapalli, and Oncel Tuzel. The repository contains code for inference, training, and evaluation of MobileCLIP models trained on DataCompDR datasets.

Figure: accuracy vs. latency of MobileCLIP models.

  • Update 2024/06/13: Releasing the code and scripts to train using OpenCLIP on DataCompDR datasets. See training/.
  • Update 2024/06/13: MobileCLIP models and DataCompDR datasets are now available on HuggingFace in MobileCLIP/DataCompDR Collection.

Highlights

  • Our smallest variant MobileCLIP-S0 obtains similar zero-shot performance to OpenAI's ViT-B/16 model while being 4.8x faster and 2.8x smaller.
  • MobileCLIP-S2 obtains better average zero-shot performance than SigLIP's ViT-B/16 model while being 2.3x faster and 2.1x smaller, and while being trained on 3x fewer seen samples.
  • MobileCLIP-B (LT) attains a zero-shot ImageNet accuracy of 77.2%, which is significantly better than recent works such as DFN and SigLIP with similar architectures, and even better than OpenAI's ViT-L/14@336.

Getting Started

Setup

conda create -n clipenv python=3.10
conda activate clipenv
pip install -e .

To download the pretrained checkpoints, run the snippet below:

source get_pretrained_models.sh   # Files will be downloaded to `checkpoints` directory.

Usage Example

To use models from the official repo, follow the code snippet below:

import torch
from PIL import Image
import mobileclip

model, _, preprocess = mobileclip.create_model_and_transforms('mobileclip_s0', pretrained='/path/to/mobileclip_s0.pt')
tokenizer = mobileclip.get_tokenizer('mobileclip_s0')

image = preprocess(Image.open("docs/fig_accuracy_latency.png").convert('RGB')).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)

OpenCLIP Support

Our models are now natively supported in OpenCLIP. To use MobileCLIP models in OpenCLIP, set up your environment as shown below:

conda create -n clipenv python=3.10
conda activate clipenv

pip install git+https://github.com/mlfoundations/open_clip
pip install git+https://github.com/huggingface/pytorch-image-models

To run inference, see the example below:

import open_clip
from mobileclip.modules.common.mobileone import reparameterize_model
 
model, _, preprocess = open_clip.create_model_and_transforms('MobileCLIP-S2', pretrained='datacompdr')
tokenizer = open_clip.get_tokenizer('MobileCLIP-S2')

# For inference/model exporting purposes, please reparameterize first
model.eval() 
model = reparameterize_model(model)

# ... follow examples in open_clip repo ...

Variants currently available in OpenCLIP: [('MobileCLIP-S1', 'datacompdr'), ('MobileCLIP-S2', 'datacompdr'), ('MobileCLIP-B', 'datacompdr'), ('MobileCLIP-B', 'datacompdr_lt')]
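
To check which MobileCLIP configurations your installed open_clip version actually exposes, you can filter open_clip.list_pretrained(); this is a small convenience snippet, not part of the repository's scripts.

import open_clip

# Print (model_name, pretrained_tag) pairs for the MobileCLIP entries.
mobileclip_variants = [p for p in open_clip.list_pretrained() if "MobileCLIP" in p[0]]
print(mobileclip_variants)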

Evaluation

To reproduce our results, we provide a script to perform zero-shot evaluation on the ImageNet-1k dataset. To evaluate on all 38 datasets, please follow the instructions in datacomp.

# Run evaluation with single GPU
python eval/zeroshot_imagenet.py --model-arch mobileclip_s0 --model-path /path/to/mobileclip_s0.pt
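
For reference, zero-shot classification builds a text classifier from class-name prompts and scores image embeddings against it. Below is a minimal sketch of that standard recipe using the repo's Python API; the class names and prompt templates are illustrative placeholders, and this is not the eval/zeroshot_imagenet.py script itself.

import torch
import mobileclip

# Sketch of the standard zero-shot recipe (illustrative; see note above).
model, _, preprocess = mobileclip.create_model_and_transforms(
    'mobileclip_s0', pretrained='/path/to/mobileclip_s0.pt')
tokenizer = mobileclip.get_tokenizer('mobileclip_s0')

classnames = ["dog", "cat", "car"]                            # placeholder labels
templates = ["a photo of a {}.", "a close-up photo of a {}."]  # placeholder prompts

with torch.no_grad():
    class_embeddings = []
    for name in classnames:
        texts = tokenizer([t.format(name) for t in templates])
        emb = model.encode_text(texts)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        class_embeddings.append(emb.mean(dim=0))              # average over prompts
    classifier = torch.stack(class_embeddings, dim=1)          # (embed_dim, num_classes)

# Given normalized image_features (as in the usage example above):
# logits = 100.0 * image_features @ classifier
# predictions = logits.argmax(dim=-1)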

Please refer to Open CLIP Results to compare with other models.

| Model | # Seen Samples (B) | # Params (M) (img + txt) | Latency (ms) (img + txt) | IN-1k Zero-Shot Top-1 Acc. (%) | Avg. Perf. (%) on 38 datasets | PyTorch Checkpoint (url) |
|---|---|---|---|---|---|---|
| MobileCLIP-S0 | 13 | 11.4 + 42.4 | 1.5 + 1.6 | 67.8 | 58.1 | mobileclip_s0.pt |
| MobileCLIP-S1 | 13 | 21.5 + 63.4 | 2.5 + 3.3 | 72.6 | 61.3 | mobileclip_s1.pt |
| MobileCLIP-S2 | 13 | 35.7 + 63.4 | 3.6 + 3.3 | 74.4 | 63.7 | mobileclip_s2.pt |
| MobileCLIP-B | 13 | 86.3 + 63.4 | 10.4 + 3.3 | 76.8 | 65.2 | mobileclip_b.pt |
| MobileCLIP-B (LT) | 36 | 86.3 + 63.4 | 10.4 + 3.3 | 77.2 | 65.8 | mobileclip_blt.pt |

Note: MobileCLIP-B (LT) is trained for 300k iterations with a constant learning rate schedule and 300k iterations with a cosine learning rate schedule.

Citation

If you found this code useful, please cite the following paper:

@InProceedings{mobileclip2024,
  author = {Pavan Kumar Anasosalu Vasu and Hadi Pouransari and Fartash Faghri and Raviteja Vemulapalli and Oncel Tuzel},
  title = {MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2024},
}

Acknowledgements

Our codebase is built using multiple open-source contributions; please see ACKNOWLEDGEMENTS for more details.

