pku-yuangroup / languagebind Goto Github PK

View Code? Open in Web Editor NEW

601.0 13.0 48.0 19.05 MB

【ICLR 2024🔥】 Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

Home Page: https://arxiv.org/abs/2310.01852

License: MIT License

Python 99.14% Shell 0.86%

language-central multi-modal pretraining zero-shot

languagebind's Introduction

【ICLR 2024 🔥】LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

If you like our project, please give us a star ⭐ on GitHub for latest update.

💡 I also have other vision-language projects that may interest you ✨.

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, Li Yuan

MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Junwu Zhang, Munan Ning, Li Yuan

Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models
Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, Li Yuan

📰 News

[2024.01.27] 👀👀👀 Our MoE-LLaVA is released! A sparse model with 3B parameters outperformed the dense model with 7B parameters.
[2024.01.16] 🔥🔥🔥 Our LanguageBind has been accepted at ICLR 2024! We earn the score of 6(3)8(6)6(6)6(6) here.
[2023.12.15] 💪💪💪 We expand the 💥💥💥 VIDAL dataset and now have 10M video-text data. We launch LanguageBind_Video 1.5, checking our model zoo.
[2023.12.10] We expand the 💥💥💥 VIDAL dataset and now have 10M depth and 10M thermal data. We are in the process of uploading thermal and depth data on Hugging Face and expect the whole process to last 1-2 months.
[2023.11.27] 🔥🔥🔥 We have updated our paper with emergency zero-shot results., checking our ✨ results.
[2023.11.26] 💥💥💥 We have open-sourced all textual sources and corresponding YouTube IDs here.
[2023.11.26] 📣📣📣 We have open-sourced fully fine-tuned Video & Audio, achieving improved performance once again, checking our model zoo.
[2023.11.22] We are about to release a fully fine-tuned version, and the HUGE version is currently undergoing training.
[2023.11.21] 💥 We are releasing sample data in DATASETS.md so that individuals who are interested can further modify the code to train it on their own data.
[2023.11.20] 🚀🚀🚀 Video-LLaVA builds a large visual-language model to achieve 🎉SOTA performances based on LanguageBind encoders.
[2023.10.23] 🎶 LanguageBind-Audio achieves 🎉🎉🎉state-of-the-art (SOTA) performance on 5 datasets, checking our ✨ results!
[2023.10.14] 😱 Released a stronger LanguageBind-Video, checking our ✨ results! The video checkpoint have updated on Huggingface Model Hub!
[2023.10.10] We provide sample data, which can be found in assets, and emergency zero-shot usage is described.
[2023.10.07] The checkpoints are available on 🤗 Huggingface Model.
[2023.10.04] Code and demo are available now! Welcome to watch 👀 this repository for the latest updates.

😮 Highlights

💡 High performance, but NO intermediate modality required

LanguageBind is a language-centric multimodal pretraining approach, taking the language as the bind across different modalities because the language modality is well-explored and contains rich semantics.

The following first figure shows the architecture of LanguageBind. LanguageBind can be easily extended to segmentation, detection tasks, and potentially to unlimited modalities.

⚡️ A multimodal, fully aligned and voluminous dataset

We propose VIDAL-10M, 10 Million data with Video, Infrared, Depth, Audio and their corresponding Language, which greatly expands the data beyond visual modalities.

The second figure shows our proposed VIDAL-10M dataset, which includes five modalities: video, infrared, depth, audio, and language.

🔥 Multi-view enhanced description for training

We make multi-view enhancements to language. We produce multi-view description that combines meta-data, spatial, and temporal to greatly enhance the semantic information of the language. In addition we further enhance the language with ChatGPT to create a good semantic space for each modality aligned language.

🤗 Demo

Local demo. Highly recommend trying out our web demo, which incorporates all features currently supported by LanguageBind.

python gradio_app.py

Online demo. We provide the online demo in Huggingface Spaces. In this demo, you can calculate the similarity of modalities to language, such as audio-to-language, video-to-language, and depth-to-image.

🚀 Main Results

Video-Language

LanguageBind achieves state-of-the-art (SOTA) performance on four datasets, * donates the results of full tuning.

Multiple Modalities

Video-Language, Infrared-Language, Depth-Language, and Audio-Language zero-shot classification, * donates the results of full tuning.

We report text-to-audio results for retrieval, * donates the results of full tuning.

Emergency results

🛠️ Requirements and Installation

Python >= 3.8
Pytorch >= 1.13.1
CUDA Version >= 11.6
Install required packages:

git clone https://github.com/PKU-YuanGroup/LanguageBind
cd LanguageBind
pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116
pip install -r requirements.txt

🐳 Model Zoo

The names in the table represent different encoder models. For example, LanguageBind/LanguageBind_Video_FT represents the fully fine-tuned version, while LanguageBind/LanguageBind_Video represents the LoRA-tuned version.

You can freely replace them in the recommended API usage. We recommend using the fully fine-tuned version, as it offers stronger performance.

Modality	LoRA tuning	Fine-tuning
Video	LanguageBind_Video	LanguageBind_Video_FT
Audio	LanguageBind_Audio	LanguageBind_Audio_FT
Depth	LanguageBind_Depth	-
Thermal	LanguageBind_Thermal	-

Version	Tuning	Model size	Num_frames	HF Link	MSR-VTT	DiDeMo	ActivityNet	MSVD
LanguageBind_Video	LoRA	Large	8	Link	42.6	37.8	35.1	52.2
LanguageBind_Video_FT	Full-tuning	Large	8	Link	42.7	38.1	36.9	53.5
LanguageBind_Video_V1.5_FT	Full-tuning	Large	8	Link	42.8	39.7	38.4	54.1
LanguageBind_Video_V1.5_FT	Full-tuning	Large	12	Coming soon
LanguageBind_Video_Huge_V1.5_FT	Full-tuning	Huge	8	Link	44.8	39.9	41.0	53.7
LanguageBind_Video_Huge_V1.5_FT	Full-tuning	Huge	12	Coming soon

🤖 API

We open source all modalities preprocessing code. If you want to load the model (e.g. LanguageBind/LanguageBind_Thermal) from the model hub on Huggingface or on local, you can use the following code snippets!

Inference for Multi-modal Binding

We have provided some sample datasets in assets to quickly see how languagebind works.

import torch
from languagebind import LanguageBind, to_device, transform_dict, LanguageBindImageTokenizer

if __name__ == '__main__':
    device = 'cuda:0'
    device = torch.device(device)
    clip_type = {
        'video': 'LanguageBind_Video_FT',  # also LanguageBind_Video
        'audio': 'LanguageBind_Audio_FT',  # also LanguageBind_Audio
        'thermal': 'LanguageBind_Thermal',
        'image': 'LanguageBind_Image',
        'depth': 'LanguageBind_Depth',
    }

    model = LanguageBind(clip_type=clip_type, cache_dir='./cache_dir')
    model = model.to(device)
    model.eval()
    pretrained_ckpt = f'lb203/LanguageBind_Image'
    tokenizer = LanguageBindImageTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir/tokenizer_cache_dir')
    modality_transform = {c: transform_dict[c](model.modality_config[c]) for c in clip_type.keys()}

    image = ['assets/image/0.jpg', 'assets/image/1.jpg']
    audio = ['assets/audio/0.wav', 'assets/audio/1.wav']
    video = ['assets/video/0.mp4', 'assets/video/1.mp4']
    depth = ['assets/depth/0.png', 'assets/depth/1.png']
    thermal = ['assets/thermal/0.jpg', 'assets/thermal/1.jpg']
    language = ["Training a parakeet to climb up a ladder.", 'A lion climbing a tree to catch a monkey.']

    inputs = {
        'image': to_device(modality_transform['image'](image), device),
        'video': to_device(modality_transform['video'](video), device),
        'audio': to_device(modality_transform['audio'](audio), device),
        'depth': to_device(modality_transform['depth'](depth), device),
        'thermal': to_device(modality_transform['thermal'](thermal), device),
    }
    inputs['language'] = to_device(tokenizer(language, max_length=77, padding='max_length',
                                             truncation=True, return_tensors='pt'), device)

    with torch.no_grad():
        embeddings = model(inputs)

    print("Video x Text: \n",
          torch.softmax(embeddings['video'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())
    print("Image x Text: \n",
          torch.softmax(embeddings['image'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())
    print("Depth x Text: \n",
          torch.softmax(embeddings['depth'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())
    print("Audio x Text: \n",
          torch.softmax(embeddings['audio'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())
    print("Thermal x Text: \n",
          torch.softmax(embeddings['thermal'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())

Then returns the following result.

Video x Text: 
 [[9.9989331e-01 1.0667283e-04]
 [1.3255903e-03 9.9867439e-01]]
Image x Text: 
 [[9.9990666e-01 9.3292067e-05]
 [4.6132666e-08 1.0000000e+00]]
Depth x Text: 
 [[0.9954276  0.00457235]
 [0.12042473 0.8795753 ]]
Audio x Text: 
 [[0.97634876 0.02365119]
 [0.02917843 0.97082156]]
Thermal x Text: 
 [[0.9482511  0.0517489 ]
 [0.48746133 0.5125386 ]]

Emergency zero-shot

Since languagebind binds each modality together, we also found the emergency zero-shot. It's very simple to use.

print("Video x Audio: \n", torch.softmax(embeddings['video'] @ embeddings['audio'].T, dim=-1).detach().cpu().numpy())
print("Image x Depth: \n", torch.softmax(embeddings['image'] @ embeddings['depth'].T, dim=-1).detach().cpu().numpy())
print("Image x Thermal: \n", torch.softmax(embeddings['image'] @ embeddings['thermal'].T, dim=-1).detach().cpu().numpy())

Then, you will get:

Video x Audio: 
 [[1.0000000e+00 0.0000000e+00]
 [3.1150486e-32 1.0000000e+00]]
Image x Depth: 
 [[1. 0.]
 [0. 1.]]
Image x Thermal: 
 [[1. 0.]
 [0. 1.]]

Different branches for X-Language task

Additionally, LanguageBind can be disassembled into different branches to handle different tasks. Note that we do not train Image, which just initialize from OpenCLIP.

Thermal

import torch
from languagebind import LanguageBindThermal, LanguageBindThermalTokenizer, LanguageBindThermalProcessor

pretrained_ckpt = 'LanguageBind/LanguageBind_Thermal'
model = LanguageBindThermal.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
tokenizer = LanguageBindThermalTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
thermal_process = LanguageBindThermalProcessor(model.config, tokenizer)

model.eval()
data = thermal_process([r"your/thermal.jpg"], ['your text'], return_tensors='pt')
with torch.no_grad():
    out = model(**data)

print(out.text_embeds @ out.image_embeds.T)

Depth

import torch
from languagebind import LanguageBindDepth, LanguageBindDepthTokenizer, LanguageBindDepthProcessor

pretrained_ckpt = 'LanguageBind/LanguageBind_Depth'
model = LanguageBindDepth.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
tokenizer = LanguageBindDepthTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
depth_process = LanguageBindDepthProcessor(model.config, tokenizer)

model.eval()
data = depth_process([r"your/depth.png"], ['your text.'], return_tensors='pt')
with torch.no_grad():
    out = model(**data)

print(out.text_embeds @ out.image_embeds.T)

Video

import torch
from languagebind import LanguageBindVideo, LanguageBindVideoTokenizer, LanguageBindVideoProcessor

pretrained_ckpt = 'LanguageBind/LanguageBind_Video_FT'  # also 'LanguageBind/LanguageBind_Video'
model = LanguageBindVideo.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
tokenizer = LanguageBindVideoTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
video_process = LanguageBindVideoProcessor(model.config, tokenizer)

model.eval()
data = video_process(["your/video.mp4"], ['your text.'], return_tensors='pt')
with torch.no_grad():
    out = model(**data)

print(out.text_embeds @ out.image_embeds.T)

Audio

import torch
from languagebind import LanguageBindAudio, LanguageBindAudioTokenizer, LanguageBindAudioProcessor

pretrained_ckpt = 'LanguageBind/LanguageBind_Audio_FT'  # also 'LanguageBind/LanguageBind_Audio'
model = LanguageBindAudio.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
tokenizer = LanguageBindAudioTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
audio_process = LanguageBindAudioProcessor(model.config, tokenizer)

model.eval()
data = audio_process([r"your/audio.wav"], ['your audio.'], return_tensors='pt')
with torch.no_grad():
    out = model(**data)

print(out.text_embeds @ out.image_embeds.T)

Image

Note that our image encoder is the same as OpenCLIP. Not as fine-tuned as other modalities.

import torch
from languagebind import LanguageBindImage,  LanguageBindImageTokenizer,  LanguageBindImageProcessor

pretrained_ckpt = 'LanguageBind/LanguageBind_Image'
model = LanguageBindImage.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
tokenizer = LanguageBindImageTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
image_process = LanguageBindImageProcessor(model.config, tokenizer)

model.eval()
data = image_process([r"your/image.jpg"], ['your text.'], return_tensors='pt')
with torch.no_grad():
    out = model(**data)

print(out.text_embeds @ out.image_embeds.T)

💥 VIDAL-10M

The datasets is in DATASETS.md.

🗝️ Training & Validating

The training & validating instruction is in TRAIN_AND_VALIDATE.md.

👍 Acknowledgement

OpenCLIP An open source pretraining framework.
CLIP4Clip An open source Video-Text retrieval framework.
sRGB-TIR An open source framework to generate infrared (thermal) images.
GLPN An open source framework to generate depth images.

🔒 License

The majority of this project is released under the MIT license as found in the LICENSE file.
The dataset of this project is released under the CC-BY-NC 4.0 license as found in the DATASET_LICENSE file.

✏️ Citation

If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝.

@misc{zhu2023languagebind,
      title={LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment}, 
      author={Bin Zhu and Bin Lin and Munan Ning and Yang Yan and Jiaxi Cui and Wang HongFa and Yatian Pang and Wenhao Jiang and Junwu Zhang and Zongwei Li and Cai Wan Zhang and Zhifeng Li and Wei Liu and Li Yuan},
      year={2023},
      eprint={2310.01852},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

✨ Star History

🤝 Contributors

languagebind's People

Contributors

Stargazers

Watchers

Forkers

eltociear binzhu-ece abhimanyu891998 604bdd zongdaoming big-data-ai jiangkl8bigai 2132660698 liunix61 wendingzhulu alexisxty o0mahan0o kac487 xiechengmude nonomal rochemedia leiwangr qhli kim-hojoon sorokinvld li563042811 stratonit llziss4ai danozworld biswarup-choudhury-oxa strategist922 zhaopufeng haroldchen19 techthiyanes llmot nguyenquangtan edvardhua bingliangli songym2020 joefioresi718 paperwave bentri23 dandre0102 lennartmoritz kavindie heet2201 mrsndmn lorenzbaraldi msw6468 pang-yatian okaybody10 bpiyush

languagebind's Issues

Why don't to share the parameters backbone between Image and Video?

In the code, the image and video encoder are initialized from the same model, but trained separately. Does it make performance better?

NameError: name 'get_audio_anno' is not defined

self.id2path_cap, self.ids = get_audio_anno()

gpu资源

Thanks for your wonderful work.
I am very excited about your idea. May I ask the computation budget used to train the largest Imagebind model? How many GPU hour do you use?

how to load LanguageBind/LanguageBind_Video_Huge_V1.5_FT model

Using LanguageBindVideoTower(video_tower, args=video_tower_cfg, cache_dir='', **kwargs) doesn't work. How do I adjust the CLIPVisionTransformer to fit the LanguageBind_Video_Huge_V1.5_FT model

Can you share the NYU-D dataset you used for evaluation, e.g. how to split the dataset?

Seeing excessive GPU memory usage during inference

Hi,
Great work and thanks for open sourcing, I was trying your model on 150 video clips and audio clips, each clip is of length 5 seconds. Below is a screenshot of the code I am using. Here, the array, video_clips and audio_files are of size 150. During the embedding generation, the GPU consumes more than 8 GB of memory and the embedding generation stops. I tried the exact same sample with imageBind, but that seems to work fine during inference and embedding generation. Any idea if I am doing something wrong?

device = 'cuda:0'
device = torch.device(device)
clip_type = ('video', 'audio')
model = LanguageBind(clip_type=clip_type)
model = model.to(device)
model.eval()
pretrained_ckpt = f'lb203/LanguageBind_Video'

tokenizer = LanguageBindVideoTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir/tokenizer_cache_dir')
modality_transform = {c: transform_dict[c](model.modality_config[c]) for c in clip_type}

inputs = {
    'video': to_device(modality_transform['video'](video_clips), device),
    'audio': to_device(modality_transform['audio'](audio_files), device),
}

inputs['language'] = to_device(tokenizer(transcriptions_list, max_length=77, padding='max_length',
                                         truncation=True, return_tensors='pt'), device)

with torch.no_grad():
    embeddings = model(inputs)

finetuning on a classification task

Hey, I have some data of images and videos and i want these to get alligned with text. My usecase is just a binary classification. So, my texts are nothing but two sentences - 'The data is live' , 'The data is non live'. So, basically i wanted to increase my model's performance by utilising a multi-modality model. How do i do this? Any resources?

GPU sources

Thanks for the job!

May I know how many GPU sources you used to train the foundation model?

pretraining details

Great work!
I'd like to learn more about the details of the pretraining process mentioned: "During the pretraining process, all modalities gradually align with the language modality through contrastive learning."
Could you clarify if this pretraining process is equivalent to LoRA fine-tuning? In other words, during the pretraining phase, are parameters updated for the video encoder, infrared encoder, depth encoder, and audio encoder using the four types of data contained in VIDAL-10M, namely, video-language data, infrared-language data, depth-language data, and audio-language data, through contrastive learning?

Can I change embeddings['image'].shape from 768 to 1024?

I want to use pretrained weights to inference, but I need embeddings['image'].shape from 768 to 1024.
How to do that?

How to Initialize the multi-modal encoders & training from scratch

Great work! I have noticed in figure 3 of your paper that the multi-modal encoders weights are frozen when doing the Multi-modal Joint Learning. Do you mean they are frozen during all the training time and you only use LoRA to adjust the multi-modal encoders?

If so, how do you initialize their weights? Are they also initialized from pretrained OpenCLIP vision encoder?

Furthermore, are there any pretrain steps in your work? Can I train LanguageBind from scratch or I can only use LoRA to finetune it?

confusion about VIDAL-10M video-text data

Thanks for your effort pushing MLLM into the next stage. Recently, I want to follow your work, and download VIDAL-10M video-text data id2title_folder_raw_ofa_mplug_gpt_sound10076613.json.

I found it contain around 10M video-text, I have following question wish you could give me some hints.

what's the difference between this 10M video-text and 3M video-text mentioned in your ICLR paper.
Regarding to this 10M video-text, I found many video's raw(including title and hashtag) contains some words like youtube, shorts. Take youtube ID LbxMRY4_W10 for example, its raw is I kicked this ball higher than Ja Morant can jump! #shorts #youtubeshorts #youtube #shortclips. But in your paper, you mention "we removed irrelevant words and hashtags, such as ”youtube”, ”fyp”, ”shorts”, etc".

Thanks in advance.

Inquiry on Unimodal Fine-Tuning with Locked Image in LanguageBind

视频特征的提取支持动态帧数吗，效果相对于8帧会有下降或者变差吗

Add flash attention 2

As explore the code, and in my knowledge (please correct if there are something wrong), the current code do not have flash attention in training but instead that the vanilla attention
I think flash attention is a low hanging fruit when training and eval will be faster but still the same result
Do you have any plan to apply flash attention to your code?

batch inference

Hi,

Is there any code snippets to test languagebind audio with large batch in gpus?

cannot run the code train

When run the code train, I use the sample TextVideo with the data is MSRVTT, to implement, run the config

CACHE_DIR= '/root/.cache'
TRAIN_DATA = '/content/MSRVTT_data.json'
# this script is for 640 total batch_size (n(16) GPUs * batch_size(10) * accum_freq(4))
%cd /content/LanguageBind
TORCH_DISTRIBUTED_DEBUG=DETAIL HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 torchrun --nnodes=1 --node_rank=0 --nproc_per_node 1 \
    -m main  \
    --train-data ${TRAIN_DATA} \
    --train-num-samples 1000 \
    --clip-type "vl" \
    --do_train \
    --lock-text --lock-image --text-type "mplug" \
    --init-temp 0.07 --learn-temp \
    --model "ViT-L-14" --cache-dir ${CACHE_DIR} \
    --convert_to_lora --lora_r 16 \
    --lr 1e-4 --coef-lr 1 \
    --beta1 0.9 --beta2 0.98 --wd 0.2 --eps 1e-6 \
    --num-frames 8 --force-patch-dropout 0.3 \
    --epochs 16 --batch-size 10 --accum-freq 4 --warmup 20 \
    --precision "amp" --workers 10 --video-decode-backend "imgs" \
    --save-frequency 1 --log-every-n-steps 20 --report-to "tensorboard" --resume "latest" \
    --do_eval \
    --val_vl_ret_data "msrvtt"

However, when run, the bug look like

LocalEntryNotFoundError: Cannot find the requested files in the disk cache and outgoing traffic has 
been disabled. To enable hf.co look-ups and downloads online, set 'local_files_only' to False.

How I can fix it?

About download weights

Why do I download weights again every time I run inference.py ?

What's the difference between LanguageBind and LLaVA-1.5

Hello! Your LanguageBind is amazing! But I'm new to multimodality, and I was wondering what's the difference between LanguageBind and LLaVA-1.5? Should I use LLaVA-1.5 or LanguageBind if I want my model to have more reasoning power while handling multimodal input (currently, text, image, and video are the three modes at most)? Considering that LanguageBind may be a better choice if other modes are to be added in the future, can LanguageBind be easily combined with LLaVA-1.5, LLaMA, or etc.? I'd like to hear your views on these issues.

The length of text that the text encoder can handle

import torch
from languagebind import LanguageBindVideo, LanguageBindVideoTokenizer, LanguageBindVideoProcessor

pretrained_ckpt = 'LanguageBind/LanguageBind_Video_FT'  # also 'LanguageBind/LanguageBind_Video'
model = LanguageBindVideo.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
tokenizer = LanguageBindVideoTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
video_process = LanguageBindVideoProcessor(model.config, tokenizer)

model.eval()
data = video_process(["your/video.mp4"], ['your text.'], return_tensors='pt')
with torch.no_grad():
    out = model(**data)

print(out.text_embeds @ out.image_embeds.T)

In this code, what is the maximum length of your text? If it exceeds 77, will it be truncated directly?

Any plans to use Long-CLIP to extend text input token limit?

If i read your paper right, you have frozen the CLIP text encoder and only aligned the other modalities.
Do you think a pretrained Long-CLIP model could be used as a drop in replacement for LanguageBind to extend the token limit?

VIT-H model release

Great job! When is the release date for the Huge model planned?

用于特征提取对齐，选用输出为什么参数

import torch
from languagebind import LanguageBindImage, LanguageBindImageTokenizer, LanguageBindImageProcessor

pretrained_ckpt = 'LanguageBind/LanguageBind_Image'
model = LanguageBindImage.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
tokenizer = LanguageBindImageTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir')
image_process = LanguageBindImageProcessor(model.config, tokenizer)

model.eval()
data = image_process([r"your/image.jpg"], ['your text.'], return_tensors='pt')
with torch.no_grad():
out = model(**data)

print(out.text_embeds @ out.image_embeds.T)

您好, 请问我如果加载LanguageBind_Image模型，用于图像和文本特征的提取对齐，那么我是用 out.text_embed和 out.image_embeds 这两个进行后续的工作吗？比如后续进行融合分类。

Choice of Vit-L over Vit-H

Hi
Thanks for the great work.
Imagebind uses Vit-H, so I'm supervised that you were able to achieve better performance using Vit-L only. Have you tried to explore Vit-H under your setting? I see in the config there are some leftover code of LAION CLIP ViT-H

how to use hugging face model

nice work ！An error occurred while trying to load the model using the huggingface api
`from transformers import AutoProcessor, AutoModel, AutoTokenizer

processor = AutoProcessor.from_pretrained("LanguageBind/LanguageBind_Video")
model = AutoModel.from_pretrained("LanguageBind/LanguageBind_Video")
tokenizer = AutoTokenizer.from_pretrained("LanguageBind/LanguageBind_Video")`

KeyError: 'LanguageBindVideo'

Could you give an example of using huggingface transformers input video to extract features

bug in install requirements.txt

ERROR: Ignored the following versions that require a different python version: 1.6.2 Requires-Python >=3.7,<3.10; 1.6.3 Requires-Python >=3.7,<3.10; 1.7.0 Requires-Python >=3.7,<3.10; 1.7.1 Requires-Python >=3.7,<3.10
ERROR: Could not find a version that satisfies the requirement torch==1.13.0+cu116 (from versions: 1.11.0, 1.12.0, 1.12.1, 1.13.0, 1.13.1, 2.0.0, 2.0.1, 2.1.0)
ERROR: No matching distribution found for torch==1.13.0+cu116

I think the torch is outdate now, we should change this to newer version

When will you release the dataset?

where is LanguageBind_Image

What is the training configurations for full tuning?

Hi, I notice that in your paper, the results for full-tuning are reported. I'd like to know the training configurations for full tuning -- do you use the text prompt and input modality data with contrastive learning during full tuning, or use class labels with traditional classification setting (e.g., cross-entropy loss)? Thank you.

where is the LanguageBind_Audio_FT in huggingface?

Combination of multiple modalities

First of all congrats on the paper and thanks for providing the code!

In the paper at 'Zero-shot language-based multi-modal joint retrieval' you mention that integrating/combining multiple embeddings improves the performance. I am specifically referring to the sentence:

'Similar trends have been observed in other modalities, where each modality has the potential to enhance the performance when combined with other modalities.'

However, the paper does not clarify how the embeddings for different modalities are actually combined. If for instance, the input modalities are text, audio, video and depth the model would produce individual embeddings for all of the modalities. How do you then combine these embeddings in order to obtain the results you report?
Do you simply average the different embeddings?

Thanks in advance,
Anthony Mendil.

How to load pt model trained according to Training LanguageBind step?

How to load pt model trained according to Training LanguageBind step? or How to load these models like the Inference for multi model binding step in the readme.md

Text input length

May I ask what is the max input length of the text encoder?

Pretraining on video dataset without lora.

Greate work！
I am also very interested in your work. Recently, I tried to reproduce the work on video modality alignment. I used the pre-trained ViT-b32 of OpenAI for initialization. The visual encoder part uses temporal attention to model the temporal relationship. During training, the text encoder is fixed, and only the weights of the embedding layer and the temporal attention part of the visual encoder will be updated. During training, the loss of the model dropped from 5.9 to 5.2. If both the visual encoder and the text encoder are all fine-tuned, the loss can be reduced to about 0.3. For this situation where only some parameters of the visual encoder are fine-tuned, the loss converges poorly. I wonder if you have encountered this during training? What should I pay attention to when using this fine-tuning method?

Fine-tuneing LLM + LanguageBind?

How can I combine LanguageBind with LLM to fine-tune my own downstream tasks? Such as Qwen?

about LanguageBind_Video_merge

I'd like to know what settings correspond to the LanguageBind_Video_merge model you put on the hugging face

Hashtags and prompts?

Thank you for your excellent work!

Will you release the hashtags of the videos and the prompt used by mPLUG-owl and ChatGPT?

Audio-Language Alignment data for reproduction

Hi Dear Author,

Great work! I'd like to inquire where I can find the address for Audio-Language Alignment data. I noticed in scripts/audio_language/train.sh that there is a mention of 4,800,000 instances of audio-language data, which seems to be significantly more than the 1 million mentioned in the paper. Could you please provide information on where to download this data for easier replication of the paper's results?

Thank you!

Vision encoder version

Hi authors,

Thanks for releasing the code.
I noticed that you mentioned "Note that our image encoder is the same as OpenCLIP. Not as fine-tuned as other modalities."
I would like to know what is the exact version of CLIP weight are you using?

Thanks!

Difference from imagebind

Thank you for your excellent work. I want to know what is the difference between this work and ImageBind. According to my understanding, the difference is mainly reflected in the different modalities used as band, right? Thanks!

Non-reproducible MSRVTT results - I get R@1 accuracy less than 1%

I am trying to verify/reproduce your paper's validation results without training it myself and expected 42.6% R@1 accuracy for MSR-VTT.

But when I follow the instructions from TRAIN_AND_VALIDATE.md (I only did the eval.sh, no training) I get results that are as bad as randomly guessing with about 0.1% R@1 accuracy. See my out.log here:

Eval Epoch: 0, eval Video-Text Retrieval under MSRVTT test data
2024-04-21,14:07:56 | INFO | MSRVTT sim matrix size: 1000, 1000
2024-04-21,15:02:43 | INFO | Length-T: 1000, Length-V:1000
2024-04-21,15:02:47 | INFO | MSRVTT Text-to-Video:
2024-04-21,15:02:53 | INFO | >>> R@1: 0.0 - R@5: 0.6 - R@10: 0.8 - Median R: 516.0 - Mean R: 518.7
2024-04-21,15:03:00 | INFO | MSRVTT Video-to-Text:
2024-04-21,15:03:03 | INFO | >>> V2T$R@1: 0.1 - V2T$R@5: 0.6 - V2T$R@10: 0.8 - V2T$Median R: 491.0 - V2T$Mean R: 498.2

What I need:

Please tell me how i can select your final model for the eval script, which will lead to the same results you that you published.

What I suspect is wrong:

Well, I guess the issue is that I am trying to evaluate the untrained model here instead of your trained version.
Maybe I misunderstood the instructions, and the pretrained weights I downloaded are not the same as your fully trained model described in the paper.

I have also tried to get your final model by running my eval_msrvtt.sh script with the TRANSFORMERS_OFFLINE=0 environment variable and an empty cache_dir in hopes of downloading the fully trained version. Strangely enough this leads to slightly different results in my out.log:

2024-04-19,13:59:28 | INFO | downloading https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/resolve/main/tokenizer_config.json to /raid/1moritz/models/languagebind/cache_dir/tmpctkzbg3u
2024-04-19,13:59:29 | INFO | downloading https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/resolve/main/vocab.json to /raid/1moritz/models/languagebind/cache_dir/tmp6_ww7ayw
2024-04-19,13:59:29 | INFO | downloading https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/resolve/main/merges.txt to /raid/1moritz/models/languagebind/cache_dir/tmp3g7ehptb
2024-04-19,13:59:30 | INFO | downloading https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/resolve/main/tokenizer.json to /raid/1moritz/models/languagebind/cache_dir/tmp4h042saq
2024-04-19,13:59:31 | INFO | downloading https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K/resolve/main/special_tokens_map.json to /raid/1moritz/models/languagebind/cache_dir/tmp0exqanes
2024-04-19,13:59:31 | INFO | {'vl_ret': [{'msrvtt': <torch.utils.data.dataloader.DataLoader object at 0x7f9015f066b0>}]})
2024-04-19,13:59:31 | INFO |
Eval Epoch: 0, eval Video-Text Retrieval under MSRVTT test data
2024-04-19,14:06:35 | INFO | MSRVTT sim matrix size: 1000, 1000
2024-04-19,14:06:35 | INFO | Length-T: 1000, Length-V:1000
2024-04-19,14:06:35 | INFO | MSRVTT Text-to-Video:
2024-04-19,14:06:35 | INFO | >>> R@1: 0.0 - R@5: 0.4 - R@10: 0.7 - Median R: 511.0 - Mean R: 505.5
2024-04-19,14:06:35 | INFO | MSRVTT Video-to-Text:
2024-04-19,14:06:35 | INFO | >>> V2T$R@1: 0.2 - V2T$R@5: 0.6 - V2T$R@10: 0.9 - V2T$Median R: 500.0 - V2T$Mean R: 504.9

How to reproduce:

I follow TRAIN_AND_VALIDATE.md.

Download cache of pretrained weights from your google drive and specify CACHE_DIR.
Download MSRVTT from the source you mentioned in TRAIN_AND_VALIDATE.md
Change the data_root here.
Make minimal changes to eval.sh and save it as eval_msrvtt.sh. Then execute the script.

This is my eval_msrvtt.sh:

CACHE_DIR="/raid/1moritz/models/languagebind/cache_dir"
RESUME="video_language.pt"
ANNOTATION="path/to/data"
# this script is for 640 total batch_size (n(16) GPUs * batch_size(10) * accum_freq(4))
cd /srv/home/1moritz/Repositories/LanguageBind
# TORCH_DISTRIBUTED_DEBUG=DETAIL HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 torchrun --nnodes=$HOST_NUM --node_rank=$INDEX --nproc_per_node $HOST_GPU_NUM --master_addr $CHIEF_IP \
TORCH_DISTRIBUTED_DEBUG=DETAIL HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 torchrun --nproc_per_node 1 \
    -m main  \
    --train-data ${ANNOTATION} \
    --train-num-samples 3020000 \
    --clip-type "vl" --add-time-attn \
    --lock-text --lock-image --text-type "polish_mplug" \
    --init-temp 0.07 --learn-temp \
    --model "ViT-L-14" --cache-dir ${CACHE_DIR} \
    --convert_to_lora --lora_r 16 \
    --lr 1e-4 --coef-lr 1 \
    --beta1 0.9 --beta2 0.98 --wd 0.2 --eps 1e-6 \
    --num-frames 8 --force-patch-dropout 0.3 \
    --epochs 16 --batch-size 10 --accum-freq 4 --warmup 2000 \
    --precision "amp" --workers 10 --video-decode-backend "imgs" \
    --save-frequency 1 --log-every-n-steps 20 --report-to "tensorboard" --resume "latest" \
    --do_eval \
    --val_vl_ret_data "msrvtt"

Are some of these models interchangeable?

For example, I wonder if I train an LLM model using one LanguageBind/LanguageBind_Video_FT, LanguageBind/LanguageBind_Video, or LanguageBind/LanguageBind_Video_V1.5_FT. Can I later swap the Video Encoder for one of the other ones? Or would I need to retrain said LLM with a different encoder if I wish to swap the encoder? Should these give approximately similar results?

provide a sample data for training

Hi, in the readme train_and_validation, the data is not release so it's hard to reimplement data to the right format as you did
I want to reimplement the code of you for training, can you provide me a sample data?

关于视频文本的训练问题

是否在训练视频文本的时候没有添加时间维度的信息，也就是没有使用3D卷积？？期待回复

Clarification questions about the framework

I'm trying to understand this in context of other works in the ecosystem. For example, I'm interested in video. For the video encoder, there is the LoRa tuned and the Fully-finetuned, can I use the embeddings from these models with an already trained LLM or model? Can I use these embeddings with Video-Llava? Can I use the LanguageBind encoder as a replacement for Video-Llava encoder (video tower)?

Also the demos shown in gradio, only show modality comparisons. I'm also trying to understand how do you do zero shot classifications. Thank you -- someone who is confused but excited and thankful for for the work done.

research about a model video captioning

Hi,
I find your project intriguing and believe it could greatly assist in working with multiple data sources. However, I noticed that you haven't mentioned how the vector data generated by your project can be utilized for downstream tasks, such as video captioning. Do you have any plans to address this aspect? I'd be interested to hear your ideas on how one could leverage your model for such tasks.

Congrats on Acceptance !!!

I have been following and utilizing your codebase for an extended period in my research. I believe your paper deserves far more attention than Imagebind.

Use of undefined functions during fine_tune with custom audio data

To train using my own audio dataset, I left clip_type as al and while training, I noticed that the following code is executed when an audio clip is found in the VAT_dataset Class.

self.id2path_cap, self.ids = get_audio_anno()

However, I didn't see a definition for the get_audio_anno() function anywhere, so that's where I got the undefined function error. Is there any way I can get some information about that function?

VIT-H model on other modality [Audio/Depth/Thermal]

Nice work! I noticed that you have released VIT-H model for video modality. So, Do you have any plan to release VIT-H models for additional modalities?

If so, that would be great.

Inconsistent running results of inference.py

Hello,
Thank you for sharing such a great job！
I have encountered some issues where the inference results of the model are inconsistent when I run Python inference.py multiple times。
For example, the first time:

      Video x Text:
       [[1.0000000e+00 3.0187387e-08]
       [8.4319353e-08 9.9999988e-01]]
      Image x Text:
       [[1.0000000e+00 4.0604040e-09]
       [1.2165047e-08 1.0000000e+00]]
      Depth x Text:
       [[0.971602   0.02839794]
       [0.97326183 0.02673816]]
      Audio x Text:
       [[0.99523276 0.00476721]
       [0.09370264 0.9062974 ]]
      Thermal x Text:
       [[0.6276049 0.3723951]
       [0.6245749 0.3754251]]
      Video x Audio:
       [[1.0000000e+00 0.0000000e+00]
       [3.1131478e-32 1.0000000e+00]]
      Image x Depth:
       [[5.2336713e-07 9.9999952e-01]
       [1.0000000e+00 4.3559140e-08]]
      Image x Thermal:
       [[5.1953281e-40 1.0000000e+00]
       [7.0966505e-27 1.0000000e+00]]

But the second time, we got:

Video x Text:
 [[1.0000000e+00 3.0187387e-08]
 [8.4319353e-08 9.9999988e-01]]
Image x Text:
 [[1.0000000e+00 4.0604040e-09]
 [1.2165047e-08 1.0000000e+00]]
Depth x Text:
 [[0.17767465 0.8223253 ]
 [0.18100499 0.818995  ]]
Audio x Text:
 [[0.99523276 0.00476721]
 [0.09370264 0.9062974 ]]
Thermal x Text:
 [[0.47579706 0.52420294]
 [0.5624282  0.43757182]]
Video x Audio:
 [[1.0000000e+00 0.0000000e+00]
 [3.1131478e-32 1.0000000e+00]]
Image x Depth:
 [[0.9892476  0.01075235]
 [0.9906881  0.00931183]]
Image x Thermal:
 [[9.9999619e-01 3.8228222e-06]
 [1.0000000e+00 1.5902166e-24]]

Why does this randomness occur？