damo-nlp-sg / video-llama

[EMNLP 2023 Demo] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%
large-language-models video-language-pretraining vision-language-pretraining blip2 llama minigpt4 cross-modal-pretraining multi-modal-chatgpt

video-llama's Introduction

Video-LLaMA

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

This is the repo for the Video-LLaMA project, which is working on empowering large language models with video and audio understanding capabilities.

News

  • [11.14] ⭐️ The current README is for Video-LLaMA-2 (LLaMA-2-Chat as the language decoder) only; instructions for using the previous version of Video-LLaMA (Vicuna as the language decoder) can be found here.
  • [08.03] 🚀🚀 Release Video-LLaMA-2 with Llama-2-7B/13B-Chat as language decoder
    • NO delta weights and separate Q-former weights anymore, full weights to run Video-LLaMA are all here 👉 [7B][13B]
    • Allow further customization starting from our pre-trained checkpoints [7B-Pretrained] [13B-Pretrained]
  • [06.14] NOTE: The current online interactive demo is primarily for English chat; it may NOT be a good option for Chinese questions, since Vicuna/LLaMA does not represent Chinese text very well.
  • [06.13] NOTE: Audio support is ONLY available for Vicuna-7B for now, although several VL checkpoints are available for other decoders.
  • [06.10] NOTE: We have NOT updated the HF demo yet because the whole framework (with the audio branch) cannot run normally on an A10-24G. The currently running demo is still the previous version of Video-LLaMA. We will fix this issue soon.
  • [06.08] 🚀🚀 Release the checkpoints of the audio-supported Video-LLaMA. Documentation and example outputs are also updated.
  • [05.22] 🚀🚀 Interactive demo online, try our Video-LLaMA (with Vicuna-7B as language decoder) at Hugging Face and ModelScope!!
  • [05.22] ⭐️ Release Video-LLaMA v2 built with Vicuna-7B
  • [05.18] 🚀🚀 Support video-grounded chat in Chinese
  • [05.18] ⭐️ Create a Hugging Face repo to store the model weights of all the variants of our Video-LLaMA.
  • [05.15] ⭐️ Release Video-LLaMA v2: we use the training data provided by VideoChat to further enhance the instruction-following capability of Video-LLaMA.
  • [05.07] Release the initial version of Video-LLaMA, including its pre-trained and instruction-tuned checkpoints.


Introduction

  • Video-LLaMA is built on top of BLIP-2 and MiniGPT-4. It is composed of two core components: (1) Vision-Language (VL) Branch and (2) Audio-Language (AL) Branch.
    • VL Branch (Visual encoder: ViT-G/14 + BLIP-2 Q-Former)
      • A two-layer video Q-Former and a frame embedding layer (applied to the embeddings of each frame) are introduced to compute video representations.
      • We train the VL Branch on the WebVid-2M video caption dataset with a video-to-text generation task. We also add image-text pairs (~595K image captions from LLaVA) to the pre-training dataset to enhance the understanding of static visual concepts.
      • After pre-training, we further fine-tune the VL Branch using the instruction-tuning data from MiniGPT-4, LLaVA and VideoChat.
    • AL Branch (Audio encoder: ImageBind-Huge)
      • A two-layer audio Q-Former and an audio segment embedding layer (applied to the embedding of each audio segment) are introduced to compute audio representations.
      • Since the audio encoder we use (i.e., ImageBind) is already aligned across multiple modalities, we train the AL Branch on video/image instruction data only, just to connect the output of ImageBind to the language decoder.
  • Only the Video/Audio Q-Formers, positional embedding layers, and linear projection layers are trainable during cross-modal training; a rough sketch of how these pieces fit together follows below.
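
For readers who want to map the description above onto code, here is a minimal, hedged sketch of the VL branch in PyTorch. The module and argument names (VLBranchSketch, frame_pos_embed, and so on) are illustrative assumptions rather than the names used in video_llama/models/video_llama.py, and a plain TransformerEncoder stands in for the BLIP-2-style video Q-Former; only the tensor shapes and the trainable/frozen split follow the description above.

import torch
import torch.nn as nn

class VLBranchSketch(nn.Module):
    """Illustrative sketch only: per-frame features come from a frozen
    ViT-G/14 + BLIP-2 Q-Former (not shown); everything defined here is the
    trainable part described above."""
    def __init__(self, qformer_dim=768, llama_dim=4096,
                 num_video_query_tokens=32, max_frame_pos=32):
        super().__init__()
        self.frame_pos_embed = nn.Embedding(max_frame_pos, qformer_dim)     # trainable frame embedding layer
        self.video_query_tokens = nn.Parameter(
            torch.zeros(1, num_video_query_tokens, qformer_dim))            # learnable video queries
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=qformer_dim, nhead=12, batch_first=True)
        self.video_qformer = nn.TransformerEncoder(encoder_layer, num_layers=2)  # stand-in for the 2-layer video Q-Former
        self.llama_proj = nn.Linear(qformer_dim, llama_dim)                 # trainable linear projection to the LLM

    def forward(self, frame_embeds):
        # frame_embeds: (batch, num_frames, tokens_per_frame, qformer_dim),
        # i.e. the per-frame outputs of the frozen image Q-Former.
        B, T, N, D = frame_embeds.shape
        pos = self.frame_pos_embed(torch.arange(T, device=frame_embeds.device))
        frame_embeds = frame_embeds + pos[None, :, None, :]                 # inject temporal position per frame
        tokens = frame_embeds.reshape(B, T * N, D)
        queries = self.video_query_tokens.expand(B, -1, -1)
        out = self.video_qformer(torch.cat([queries, tokens], dim=1))
        video_tokens = out[:, : queries.size(1)]                            # keep only the query positions
        return self.llama_proj(video_tokens)                                # video tokens fed to the frozen language decoder

# Example: 2 videos, 8 frames each, 32 tokens per frame
video_tokens = VLBranchSketch()(torch.randn(2, 8, 32, 768))
print(video_tokens.shape)  # torch.Size([2, 32, 4096])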

Example Outputs

  • Video with background sound

  • Video without sound effects

  • Static image

Pre-trained & Fine-tuned Checkpoints

Checkpoints of the individual VL/AL branches store the learnable parameters (positional embedding layers, Video/Audio Q-Formers, and linear projection layers) only.

The following checkpoints are the full weights (visual encoder + audio encoder + Q-Formers + language decoder) to launch Video-LLaMA:

| Checkpoint | Link | Note |
|---|---|---|
| Video-LLaMA-2-7B-Pretrained | link | Pre-trained on WebVid (2.5M video-caption pairs) and LLaVA-CC3M (595K image-caption pairs) |
| Video-LLaMA-2-7B-Finetuned | link | Fine-tuned on the instruction-tuning data from MiniGPT-4, LLaVA and VideoChat |
| Video-LLaMA-2-13B-Pretrained | link | Pre-trained on WebVid (2.5M video-caption pairs) and LLaVA-CC3M (595K image-caption pairs) |
| Video-LLaMA-2-13B-Finetuned | link | Fine-tuned on the instruction-tuning data from MiniGPT-4, LLaVA and VideoChat |
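
If you want to check whether a downloaded file contains the full weights or only the learnable branch parameters, a quick (hedged) way is to look at the top-level prefixes of its state dict. The file path below is a placeholder, and the "model" wrapper key is an assumption that may not hold for every checkpoint.

# Sketch: list which parameter groups a checkpoint file contains.
import torch
from collections import Counter

ckpt = torch.load("path/to/video_llama_checkpoint.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # some checkpoints may wrap weights under a "model" key (assumption)

prefix_counts = Counter(key.split(".")[0] for key in state_dict)
for prefix, count in prefix_counts.most_common():
    print(f"{prefix:35s} {count:6d} tensors")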

Usage

Environment Preparation

First, install ffmpeg.

apt update
apt install ffmpeg

Then, create a conda environment:

conda env create -f environment.yml
conda activate videollama

Prerequisites

Before using the repository, you would previously have needed to prepare several checkpoints by hand. For Video-LLaMA-2 you DON'T have to do anything here: the download links above already provide the full weights (no delta weights or separate Q-Former weights to merge).

How to Run Demo Locally

First, set llama_model (the path to the language decoder), imagebind_ckpt_path (the path to the audio encoder), ckpt (the path to the VL branch) and ckpt_2 (the path to the AL branch) in eval_configs/video_llama_eval_withaudio.yaml accordingly. Then run the script:

python demo_audiovideo.py \
    --cfg-path eval_configs/video_llama_eval_withaudio.yaml \
    --model_type llama_v2 \
    --gpu-id 0
# use --model_type vicuna instead if your language decoder is Vicuna
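
If you prefer not to edit the YAML by hand, the same four fields can be filled in programmatically. This is only a convenience sketch: it assumes PyYAML is installed and that the four fields live under the top-level model: section of the config (as in the example configs later on this page), and the paths are placeholders. Note that rewriting the file this way drops any comments in the original YAML.

# Sketch: patch the four checkpoint paths in the eval config before launching the demo.
import yaml  # PyYAML

cfg_path = "eval_configs/video_llama_eval_withaudio.yaml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

cfg["model"]["llama_model"] = "/path/to/llama-2-7b-chat-hf"      # language decoder (placeholder path)
cfg["model"]["imagebind_ckpt_path"] = "/path/to/imagebind/"      # audio encoder (placeholder path)
cfg["model"]["ckpt"] = "/path/to/vl_branch_checkpoint.pth"       # VL branch (placeholder path)
cfg["model"]["ckpt_2"] = "/path/to/al_branch_checkpoint.pth"     # AL branch (placeholder path)

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)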

Training

The training of each cross-modal branch (i.e., the VL branch or the AL branch) in Video-LLaMA consists of two stages:

  1. Pre-training on the Webvid-2.5M video caption dataset and LLaVA-CC3M image caption dataset.

  2. Fine-tuning using the image-based instruction-tuning data from MiniGPT-4/LLaVA and the video-based instruction-tuning data from VideoChat.

1. Pre-training

Data Preparation

Download the metadata and videos following the instructions in the official GitHub repo of WebVid. The folder structure of the dataset is shown below:

|webvid_train_data
|──filter_annotation
|────0.tsv
|──videos
|────000001_000050
|──────1066674784.mp4
|cc3m
|──filter_cap.json
|──image
|────GCC_train_000000000.jpg
|────...
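
Before launching pre-training, it can save time to verify that the data actually follows the tree above. The helper below is a small sketch written against that layout; the function name and the printed messages are illustrative.

# Sketch: sanity-check the WebVid / CC3M layout shown above.
from pathlib import Path

def check_pretrain_data(webvid_root="webvid_train_data", cc3m_root="cc3m"):
    webvid, cc3m = Path(webvid_root), Path(cc3m_root)
    checks = {
        "WebVid annotations (filter_annotation/*.tsv)": any(webvid.glob("filter_annotation/*.tsv")),
        "WebVid videos (videos/*/*.mp4)":               any(webvid.glob("videos/*/*.mp4")),
        "CC3M captions (filter_cap.json)":              (cc3m / "filter_cap.json").is_file(),
        "CC3M images (image/*.jpg)":                    any(cc3m.glob("image/*.jpg")),
    }
    for name, ok in checks.items():
        print(("OK     " if ok else "MISSING") + "  " + name)
    return all(checks.values())

if __name__ == "__main__":
    check_pretrain_data()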

Script

Configure the checkpoint and dataset paths in visionbranch_stage1_pretrain.yaml and audiobranch_stage1_pretrain.yaml respectively. Then, run the script:

conda activate videollama
# for pre-training VL branch
torchrun --nproc_per_node=8 train.py --cfg-path ./train_configs/visionbranch_stage1_pretrain.yaml

# for pre-training AL branch
torchrun --nproc_per_node=8 train.py --cfg-path ./train_configs/audiobranch_stage1_pretrain.yaml

2. Instruction Fine-tuning

Data

For now, the fine-tuning dataset consists of:

  • 150K image-based instructions from LLaVA [link]
  • 3K image-based instructions from MiniGPT-4 [link]
  • 11K video-based instructions from VideoChat [link]

Script

Configure the checkpoint and dataset paths in visionbranch_stage2_finetune.yaml and audiobranch_stage2_finetune.yaml respectively. Then, run the following script:

conda activate videollama
# for fine-tuning VL branch
torchrun --nproc_per_node=8 train.py --cfg-path  ./train_configs/visionbranch_stage2_finetune.yaml

# for fine-tuning AL branch
torchrun --nproc_per_node=8 train.py --cfg-path  ./train_configs/audiobranch_stage2_finetune.yaml

Recommended GPUs

  • Pre-training: 8xA100 (80G)
  • Instruction-tuning: 8xA100 (80G)
  • Inference: 1xA100 (40G/80G) or 1xA6000

Acknowledgement

We are grateful for the following awesome projects that Video-LLaMA builds upon:

  • MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models
  • FastChat: An Open Platform for Training, Serving, and Evaluating Large Language Model based Chatbots
  • BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
  • EVA-CLIP: Improved Training Techniques for CLIP at Scale
  • ImageBind: One Embedding Space To Bind Them All
  • LLaMA: Open and Efficient Foundation Language Models
  • VideoChat: Chat-Centric Video Understanding
  • LLaVA: Large Language and Vision Assistant
  • WebVid: A Large-scale Video-Text dataset
  • mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

The logo of Video-LLaMA is generated by Midjourney.

Terms of Use

Our Video-LLaMA is a research preview intended for non-commercial use only. You must NOT use Video-LLaMA for any illegal, harmful, violent, racist, or sexual purposes. You are strictly prohibited from engaging in any activity that may violate these guidelines.

Citation

If you find our project useful, we hope you can star our repo and cite our paper as follows:

@article{damonlpsg2023videollama,
  author = {Zhang, Hang and Li, Xin and Bing, Lidong},
  title = {Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
  year = 2023,
  journal = {arXiv preprint arXiv:2306.02858},
  url = {https://arxiv.org/abs/2306.02858}
}

video-llama's People

Contributors

eltociear, hangzhang-nlp, icethecoder, lidongbing, lixin4ever, rajathbharadwaj, sushantgautam


video-llama's Issues

Demo setup returns wrong results

The model loads without any problem; the model config is as follows:

model:
  arch: video_llama
  model_type: pretrain_vicuna
  freeze_vit: True
  freeze_qformer: True
  max_txt_len: 512
  end_sym: "###"
  low_resource: False

  frozen_llama_proj: False

  llama_model: "vicuna-7b-delta-v0"
  imagebind_ckpt_path: "imagebind_huge.pth"

  fusion_head_layers: 2
  max_frame_pos: 32
  fusion_header_type: "seqTransf"

  ckpt: "finetune-vicuna7b-v2.pth"
  ckpt_2: "finetune_vicuna7b_audiobranch.pth"

datasets:
  webvid:
    vis_processor:
      train:
        name: "alpro_video_eval"
        n_frms: 8
        image_size: 224
    text_processor:
      train:
        name: "blip_caption"

run:
  task: video_text_pretrain

KeyError: 'API_TOKEN'

I just installed "Video-LLaMA finetune vicuna7B". When running app.py I get KeyError: 'API_TOKEN' at the "Loading Q-Former" stage. I can see that video_llama.py looks for os.environ['API_TOKEN'], but where do I need to go to get an API token? Is this a Hugging Face token, or possibly a LLaMA token? What is the best way to save the token so that os.environ['API_TOKEN'] can find it? Thank you!
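
For what it's worth, os.environ only sees variables that exist in the process environment before the lookup runs; whether Video-LLaMA actually expects a Hugging Face token here is exactly what this issue asks, so the snippet below is only a generic illustration with a placeholder value.

# Generic illustration: make sure API_TOKEN exists in the environment before the lookup.
# Equivalent shell form:  export API_TOKEN=hf_xxx  before running app.py
import os

os.environ.setdefault("API_TOKEN", "hf_xxx_placeholder")  # placeholder, not a real token
print("API_TOKEN set:", "API_TOKEN" in os.environ)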

how much data was used in the fine-tuning stage?

May I know how much data was used in the fine-tuning stage? How much data was used for fine-tuning from MiniGPT-4, LLaVA, and VideoChat respectively?

By the way, I would like to share my WeChat ID: suozhang717. I hope the author and fellow researchers can add me.

Missing import

The line "from video_llama.processors.video_processor import AlproVideoTrainProcessor" should be added to video_llama/datasets/datasets/llava_instruct_dataset.py.

How should imagebind_ckpt_path be configured? Where can it be downloaded?

As in the title: I need to get this running locally, and the docs say several configuration entries need to be modified, but I have questions about a few of them. Thanks for your help!
The config file has four entries that need to be changed:

# Should this be the path to the merged Vicuna-13B weights,
# i.e. the files in the target directory produced by running apply_delta.py?
  llama_model: "ckpt/vicuna-13b/" or "ckpt/vicuna-7b/"
# I couldn't find a corresponding file or directory for this. I assume a model file needs to be
# downloaded, but I couldn't find the link in the docs. How should I handle this?
  imagebind_ckpt_path: "ckpt/imagebind_path/"
# Is this one of the model files listed in the Vision-Language Branch table?
# If so, which one should I use? If I need Chinese support, should it be finetune-ziya13b-zh or pretrain-ziya13b-zh?
  ckpt: path/visual_branch_ckpt/
# If I don't need audio, can I leave this unset, or is it required?
  ckpt_2: path/audio_branch_ckpt/

Please help answer the questions in the comments above. Thanks again!

Paper.pdf is missing.

Thanks for your great work.
I'm really curious how it works.

Although there is a Paper button in the README, a Not Found error occurs.

Could you provide the Video-LLaMA paper, please?

Thanks.

Kind regards.

The image size is defined in the .yaml files but is also manually set here and there

Great work!
I noticed the image size is fixed to 224 × 224 in a few files despite being a parameter in the .yaml files. Also, I wonder whether videos of any size would work (if someone wants to use your code to fine-tune on their own data) if we just change the image size parameter, considering that some of the pre-trained models may work better with that fixed image size. I'd appreciate it if you could explain the best approach for fine-tuning on a different video size (using your code rather than your model weights).

Where is audio extracted in the forward pass during inference?

Thanks for the great work. I have two questions:

  1. The forward pass seems to receive only video frames. For img_embeds, atts_img = self.encode_audioQformer(image, modality_type=ModalityType.VISION), is the input a batch of images? How are the audio features extracted?
  2. Does the code below produce image features, or fused image-and-audio features? Is it concatenated with input_ids (the ground truth)?
    for cur_input_ids, cur_input_embeds in zip(input_ids, temp_input_embedding)
    ....

def forward(self, samples):
    if 'conv_type' in samples.keys() and samples['conv_type']=='multi':

        im_patch_token_id = self.IMAGE_PATCH_TOKEN_ID
        image = samples["images"]
        input_ids = samples['input_ids']
        if len(image.size())==4:
            time = 1
            image = einops.repeat(image, 'b c h w -> b c t h w',t = time)

        if self.train_flag == 0:
            num_patch_tokens = self.num_video_query_token
            img_embeds, atts_img = self.encode_videoQformer_visual(image)
        elif self.train_flag == 1:
            num_patch_tokens = self.num_audio_query_token
            image = einops.rearrange(image, 'b c t h w -> b t c h w')
            img_embeds, atts_img = self.encode_audioQformer(image, modality_type=ModalityType.VISION)
            
        temp_input_ids = copy.deepcopy(input_ids)
        temp_input_ids[temp_input_ids == im_patch_token_id] = 0
        temp_input_embedding = self.llama_model.model.embed_tokens(temp_input_ids)

        new_input_embeds=[]
        cur_image_idx = 0
        for cur_input_ids, cur_input_embeds in zip(input_ids, temp_input_embedding):
            cur_image_features = img_embeds[cur_image_idx]

            if (cur_input_ids == im_patch_token_id).sum() != num_patch_tokens:
                    raise ValueError("The number of image patch tokens should be the same as the number of image patches.")
            masked_indices = torch.where(cur_input_ids == im_patch_token_id)[0]
            mask_index_start = masked_indices[0]
            if (masked_indices != torch.arange(mask_index_start, mask_index_start+num_patch_tokens, device=masked_indices.device, dtype=masked_indices.dtype)).any():
                raise ValueError("The image patch tokens should be consecutive.")
            
            cur_new_input_embeds = torch.cat((cur_input_embeds[:mask_index_start], cur_image_features, cur_input_embeds[mask_index_start+num_patch_tokens:]), dim=0)
            new_input_embeds.append(cur_new_input_embeds)
            
            cur_image_idx+=1
        inputs_embeds = torch.stack(new_input_embeds, dim=0)
        targets = samples['labels']
        attention_mask = samples['attention_mask']
        with self.maybe_autocast():
            outputs = self.llama_model(
                inputs_embeds=inputs_embeds,
                attention_mask=attention_mask,
                return_dict=True,
                labels=targets,
            )
        loss = outputs.loss
        return {"loss": loss}

Great work!

Thanks for the great work! One small question: what is the advantage of your proposed Video-LLaMA over VideoChat?

Cannot install pytorch 1.12.1

Pip subprocess error:
ERROR: Could not find a version that satisfies the requirement pytorch==1.12.1 (from versions: 0.1.2, 1.0.2)
ERROR: No matching distribution found for pytorch==1.12.1

finetune-billa7b-zh model inference error

Using video_llama_eval.yaml and demo_video.py, with the following config:
ckpt: finetune-billa7b-zh.pth

llama_proj_model: pretrained_minigpt4.pth
llama_model: Neutralzz/BiLLa-7B-LLM

The error is shown in the attached screenshot.

Which conference did you submit this paper to?

Hi,
Thanks for your great work.

I'm a new researcher, and I'm interested in the video understanding field.

I read your paper, and I wonder which conference you submitted it to?
The format looks like NeurIPS.

Thanks.
Kind regards.

No place to input in the Demo

Hi, I run the demo by

python demo_audiovideo.py     --cfg-path eval_configs/video_llama_eval_withaudio.yaml  --gpu-id 0

It launches successfully, but the input field is stuck and I cannot type into it, and the 'Video-LLaMA' section of the interface is empty. I wonder what the possible solution is.

The following is how I filled in video_llama_eval_withaudio.yaml:

model:
  arch: video_llama
  model_type: pretrain_vicuna
  freeze_vit: True
  freeze_qformer: True
  max_txt_len: 512
  end_sym: "###"
  low_resource: False

  frozen_llama_proj: False

  # models/vicuna-7b-v1 uses llama-7b-hf + vicuna-7b-delta-v0
  llama_model: "models/vicuna-7b-v1" 
  imagebind_ckpt_path: "models/imageBind/"

  fusion_head_layers: 2
  max_frame_pos: 32
  fusion_header_type: "seqTransf"

  ckpt: "models/Video-LLaMA-Series/finetune-vicuna7b-v2.pth" #path/visual_branch_ckpt/  # pretrain: miniGPT4
  ckpt_2: "models/Video-LLaMA-Series/finetune_vicuna7b_audiobranch.pth" #path/audio_branch_ckpt/ # pretrain: ImageBind

datasets:
  webvid:
    vis_processor:
      train:
        name: "alpro_video_eval"
        n_frms: 8
        image_size: 224
    text_processor:
      train:
        name: "blip_caption"

run:
  task: video_text_pretrain

Frame rate and duration of videos

Based on my understanding, you use only 8 frames sampled uniformly across the video. I have the following questions:

  1. Does increasing the total number of frames by a lot slow down training significantly? My own conclusion was that it shouldn't, because the features are extracted frame by frame and then concatenated, so it doesn't add learnable parameters. Is this correct?
  2. What if I want to fix the frame rate instead of the total number of frames, so that I can account for longer or shorter videos while maintaining a certain frame rate? Is this possible with your model?
  3. If 2 is possible in theory, do you think your architecture would work well on videos that need a higher sampling rate?
  4. Does your model treat videos as 8 separate images and process them one by one, or does it take all 8 frames into account at once?
  5. If I want to use other models for video feature extraction, can I still use your architecture? That is, instead of raw videos, using features already extracted by frozen models. (I think this is in a way what you do too, by using BLIP-2.)
  6. Last but not least, where does learning actually happen in your work? There are a lot of pre-trained and frozen models in use, and I'm a bit confused about which parts get updated during fine-tuning.

I hope you can help me understand your work better so that I can use it on my own data.

[VideoChat] Video instruction data

Hi! It's interesting to build a chatbot for video understanding!
In our project "VideoChat🦜: Chat-Centric Video Understanding", we also build an interesting chatbot🤖 for both image and video. More importantly, we have released 11K video instruction data for spatiotemporal reasoning. You can also utilize them to enhance your project 💪🏻!
If you try it, don't forget to give us feedback; we will keep improving the data.
Project: https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat
Paper: https://arxiv.org/abs/2305.06355


Inquiry about “max_epoch” values for pre-training and fine-tuning

I would like to inquire about the max_epoch values set for pre-training and fine-tuning. Are they set to 5 and 3 as in the config file, respectively? I tried running the fine-tuning code with these values, but the results were not as coherent as the fine-tuned models provided by the official repository. Therefore, I would like to confirm if these are the correct values.

Thank you for your help and for sharing your work with the community.

Differences in the checkpoint size

The checkpoint produced by training with video_llama_stage1_pretrain.yaml is far smaller than your public checkpoint.
Your public checkpoint (finetune-vicuna7b-v2.pth) is 254 MB, but the one I got is only around 37 MB.

Are you using a different training config than video_llama_stage1_pretrain.yaml?
Can you share your training config? I wanted to fine-tune by loading the checkpoint from finetune-vicuna7b-v2.pth, but I get mismatch errors.

video_llama_eval_withaudio.yaml issue

Initializing Chat
Loading VIT
Loading VIT Done
Loading Q-Former
Traceback (most recent call last):
File "/home/akilliceviribilisim/Video-LLaMA/venv/lib/python3.8/site-packages/huggingface_hub/utils/_errors.py", line 259, in hf_raise_for_status
response.raise_for_status()
File "/home/akilliceviribilisim/Video-LLaMA/venv/lib/python3.8/site-packages/requests/models.py", line 1021, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/pretrain_vicuna7b-v2.pth/resolve/main/tokenizer.model

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/akilliceviribilisim/Video-LLaMA/venv/lib/python3.8/site-packages/transformers/utils/hub.py", line 409, in cached_file
resolved_file = hf_hub_download(
File "/home/akilliceviribilisim/Video-LLaMA/venv/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 120, in _inner_fn
return fn(*args, **kwargs)
File "/home/akilliceviribilisim/Video-LLaMA/venv/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 1166, in hf_hub_download
metadata = get_hf_file_metadata(
File "/home/akilliceviribilisim/Video-LLaMA/venv/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 120, in _inner_fn
return fn(*args, **kwargs)

Traceback (most recent call last):
File "demo_audiovideo.py", line 66, in
model = model_cls.from_config(model_config).to('cuda:{}'.format(args.gpu_id))
File "/home/akilliceviribilisim/Video-LLaMA/video_llama/models/video_llama.py", line 567, in from_config
model = cls(
File "/home/akilliceviribilisim/Video-LLaMA/video_llama/models/video_llama.py", line 120, in init
self.llama_tokenizer = LlamaTokenizer.from_pretrained(llama_model, use_fast=False)
File "/home/akilliceviribilisim/Video-LLaMA/venv/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1770, in from_pretrained
resolved_vocab_files[file_id] = cached_file(
File "/home/akilliceviribilisim/Video-LLaMA/venv/lib/python3.8/site-packages/transformers/utils/hub.py", line 424, in cached_file
raise EnvironmentError(
OSError: pretrain_vicuna7b-v2.pth is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo with use_auth_token or log in with huggingface-cli login and pass use_auth_token=True .

Can you help me figure out how to solve this?

transformers version issue

I notice that in your work the required transformers version is 4.28.0, which, if I'm not mistaken, had a bug with the LLaMA tokenizer. Have you tested higher versions of transformers? If I use, for example, version 4.29, will this affect the converted model?

size mismatch error

python demo_audiovideo.py --cfg-path eval_configs/video_llama_eval_withaudio.yaml  --gpu-id 3

Asking for a simple script to get text and video features

First of all - Amazing work on this one.

I'm getting a bit lost in the repo; may I request a simple few-line script that does something like the following:

model = CLIPViP("pretrain_clipvip_base_32.pt")
text_features = model.encode_text("This is a very cute cat")
video_features = model.encode_video("vid_file.mp4")
cosine(text_features, video_features)

[Extra] Preferably I wish to get the video features for a batch of mp4 files with different lengths

Thank you

The quality gap is too large

I set up the demo file and ran it, but I got terrible results, not as good as the Hugging Face demo, and it also repeats the question at the end of each answer. I don't know what is causing this.

The following is my configuration file; is there any problem with it?

model:
  arch: video_llama
  model_type: pretrain_vicuna
  freeze_vit: True
  freeze_qformer: True
  max_txt_len: 512
  end_sym: "###"
  low_resource: False

  frozen_llama_proj: False

  llama_model: "vicuna-7b/"
  imagebind_ckpt_path: "imagebind/"

  fusion_head_layers: 2
  max_frame_pos: 32
  fusion_header_type: "seqTransf"

  ckpt: "finetune-vicuna7b-v2.pth"
  ckpt_2: "finetune_vicuna7b_audiobranch.pth"

datasets:
  webvid:
    vis_processor:
      train:
        name: "alpro_video_eval"
        n_frms: 8
        image_size: 224
    text_processor:
      train:
        name: "blip_caption"

run:
  task: video_text_pretrain

The number of updated parameters during stage 1 and stage 2?

Great project!
I would like to ask 3 questions:
1. Does your public checkpoint include the parameters of the 2-layer Q-Former and the linear projection layer?
2. Seeing that freeze_qformer is set to True in your stage 1 and stage 2 yaml files, does that mean you froze the Q-Former parameters and only fine-tuned llama_proj? But in your model diagram the Q-Former parameters appear to be fine-tuned.
3. Is the number of fine-tuned parameters the same in pre-training (stage 1) and fine-tuning (stage 2)?
Thank you very much~
