mx-mark / videotransformer-pytorch

PyTorch implementation of a collection of scalable Video Transformer benchmarks.

Python 80.64% Jupyter Notebook 19.36%
pytorch-implmention pytorch-lightning deeplearning action-recognition transformer pretrained-model timesformer vivit maskfeat

videotransformer-pytorch's Introduction

PyTorch implementation of Video Transformer Benchmarks

This repository is mainly built upon PyTorch and PyTorch-Lightning. We aim to maintain a collection of scalable video transformer benchmarks and to discuss training recipes for large video transformer models.

Currently, we implement TimeSformer, ViViT and MaskFeat. We have pre-trained TimeSformer-B, ViViT-B and MaskFeat on Kinetics-400/600, but we still cannot guarantee the performance reported in the papers. However, we have found some relevant hyper-parameters that may help reach the target performance.

Update

  1. We have fixed several known issues, and the repository now provides scripts to pre-train MViT-B with MaskFeat or to fine-tune MViT-B/TimeSformer-B/ViViT-B on K400.
  2. We have re-implemented the HOG extraction and HOG prediction used in MaskFeat, which are now more efficient for pre-training.
  3. Note that anyone who wants to train TimeSformer-B or ViViT-B with the current repo needs to carefully adjust the learning rate and weight decay for better performance. For example, a peak learning rate of 0.005 and a weight decay of 0.0001 are reasonable defaults.

Table of Contents

  1. Difference
  2. TODO
  3. Setup
  4. Usage
  5. Result
  6. Acknowledge
  7. Contribution

Difference

In order to share the basic divided space-time attention module across different video transformers, we make the following changes.

1. Position embedding

We split the joint position embedding of shape (n_t·n_h·n_w × d) described in the ViViT paper into a spatial embedding of shape (n_h·n_w × d) and a temporal embedding of shape (n_t × d), to stay consistent with TimeSformer.
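A minimal sketch of this factorized position embedding is shown below; the module and parameter names are illustrative, not necessarily the exact ones used in this repo.

import torch
import torch.nn as nn

class FactorizedPositionEmbedding(nn.Module):
	"""Separate spatial (n_h*n_w x d) and temporal (n_t x d) position embeddings."""
	def __init__(self, num_patches_per_frame, num_frames, embed_dim):
		super().__init__()
		self.pos_embed = nn.Parameter(torch.zeros(1, num_patches_per_frame, embed_dim))
		self.time_embed = nn.Parameter(torch.zeros(1, num_frames, embed_dim))

	def forward(self, x):
		# x: (batch, n_t, n_h*n_w, d) patch tokens
		x = x + self.pos_embed.unsqueeze(1)   # broadcast the spatial embedding over frames
		x = x + self.time_embed.unsqueeze(2)  # broadcast the temporal embedding over patches
		return x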

2. Class token

To make it explicit whether the class token participates in the forward computation, we only compute the interaction between the class token and the queries in the last attention layer (excluding the FFN) of each transformer block.

3. Initialize from the pre-trained model

  • Tokenization: the token embedding filter can be either Conv2D or Conv3D. Conv3D filters can be initialized from pre-trained Conv2D weights either by replicating them along the temporal dimension and averaging, or by setting all temporal positions to zero except the center t/2 (see the sketch after this list).
  • Temporal MSA module weights: one can either copy the weights from the spatial MSA module or initialize all weights with zeros.
  • Initialize from the MAE pre-trained model provided by ZhiLiang, where the class token, which does not appear in the MAE pre-trained model, is initialized from a truncated normal distribution.
  • Initialize from the ViT pre-trained model: see the reference linked here.
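The two Conv3D initialization strategies mentioned above can be sketched as follows; this is an illustrative helper, not the repo's exact weight-initialization code.

import torch

def inflate_conv2d_to_conv3d(weight_2d, temporal_kernel_size, mode='average'):
	# weight_2d: (out_c, in_c, kh, kw) from a pre-trained Conv2D patch embedding
	out_c, in_c, kh, kw = weight_2d.shape
	kt = temporal_kernel_size
	if mode == 'average':
		# replicate along time and divide by kt so a static clip yields the same output
		weight_3d = weight_2d.unsqueeze(2).repeat(1, 1, kt, 1, 1) / kt
	elif mode == 'center':
		# zeros at all temporal positions except the center t/2
		weight_3d = torch.zeros(out_c, in_c, kt, kh, kw, dtype=weight_2d.dtype)
		weight_3d[:, :, kt // 2] = weight_2d
	else:
		raise ValueError(f'unknown mode: {mode}')
	return weight_3d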

TODO

  • [√] add pre-trained weights for more TimeSformer and ViViT variants.
    • A larger version and other operation types.
  • [√] add linear-probe and fine-tune recipes.
    • Enable transferring the pre-trained models to downstream tasks.
  • add more scalable video transformer benchmarks.
    • We will mainly focus on data-efficient models.
  • add more robust objective functions.

Setup

pip install -r requirements.txt

Usage

Training

# path to Kinetics400 train set and val set
TRAIN_DATA_PATH='/path/to/Kinetics400/train_list.txt'
VAL_DATA_PATH='/path/to/Kinetics400/val_list.txt'
# path to root directory
ROOT_DIR='/path/to/work_space'
# path to pretrain weights
PRETRAIN_WEIGHTS='/path/to/weights'

# pretrain mvit using maskfeat
python model_pretrain.py \
	-lr 8e-4 -epoch 300 -batch_size 16 -num_workers 8 -frame_interval 4 -num_frames 16 -num_class 400 \
	-root_dir $ROOT_DIR -train_data_path $TRAIN_DATA_PATH

# finetune mvit with maskfeat pretrain weights
python model_pretrain.py \
	-lr 0.005 -epoch 200 -batch_size 8 -num_workers 4 -num_frames 16 -frame_interval 4 -num_class 400 \
	-arch 'mvit' -optim_type 'adamw' -lr_schedule 'cosine' -objective 'supervised' -mixup True \
	-auto_augment 'rand_aug' -root_dir $ROOT_DIR -train_data_path $TRAIN_DATA_PATH \
	-val_data_path $VAL_DATA_PATH -pretrain_pth $PRETRAIN_WEIGHTS

# finetune timesformer with imagenet pretrain weights
python model_pretrain.py \
	-lr 0.005 -epoch 30 -batch_size 8 -num_workers 4 -num_frames 8 -frame_interval 32 -num_class 400 \
	-arch 'timesformer' -attention_type 'divided_space_time' -optim_type 'sgd' -lr_schedule 'cosine' \
	-objective 'supervised' -root_dir $ROOT_DIR -train_data_path $TRAIN_DATA_PATH \
	-val_data_path $VAL_DATA_PATH -pretrain_pth $PRETRAIN_WEIGHTS -weights_from 'imagenet'

# finetune vivit with imagenet pretrain weights
python model_pretrain.py \
	-lr 0.005 -epoch 30 -batch_size 8 -num_workers 4 -num_frames 16 -frame_interval 16 -num_class 400 \
	-arch 'vivit' -attention_type 'fact_encoder' -optim_type 'sgd' -lr_schedule 'cosine' \
	-objective 'supervised' -root_dir $ROOT_DIR -train_data_path $TRAIN_DATA_PATH \
	-val_data_path $VAL_DATA_PATH -pretrain_pth $PRETRAIN_WEIGHTS -weights_from 'imagenet'

The minimal folder structure looks like the following.

root_dir
├── results
│   ├── experiment_tag
│   │   ├── ckpt
│   │   ├── log

Result

Kinetics-400/600

1. Model Zoo

| name | weights from | dataset | epochs | num frames | spatial crop | top1_acc | top5_acc | weight | log |
|---|---|---|---|---|---|---|---|---|---|
| TimeSformer-B | ImageNet-21K | K600 | 15e | 8 | 224 | 78.4 | 93.6 | Google drive or BaiduYun (code: yr4j) | log |
| ViViT-B | ImageNet-21K | K400 | 30e | 16 | 224 | 75.2 | 91.5 | Google drive | - |
| MaskFeat | from scratch | K400 | 100e | 16 | 224 | - | - | Google drive | - |

1.1 Visualize

For each column, we show the masked input (left), the HOG prediction (middle) and the original video frame (right).

Here, we show the extracted attention map of a random frame sampled from the demo video.


2. Train Recipe(ablation study)

2.1 Acc

| operation | top1_acc | top5_acc | top1_acc (three crop) |
|---|---|---|---|
| base | 68.2 | 87.6 | - |
| + frame_interval 4 -> 16 (span more time) | 72.9 (+4.7) | 91.0 (+3.4) | - |
| + RandomCrop, flip (overcome overfitting) | 75.7 (+2.8) | 92.5 (+1.5) | - |
| + batch size 16 -> 8 (more iterations) | 75.8 (+0.1) | 92.4 (-0.1) | - |
| + frame_interval 16 -> 24 (span more time) | 77.7 (+1.9) | 93.3 (+0.9) | 78.4 |
| + frame_interval 24 -> 32 (span more time) | 78.4 (+0.7) | 94.0 (+0.7) | 79.1 |

Tip: frame_interval and data augmentation matter a lot for validation accuracy.


2.2 Time

| operation | epoch_time |
|---|---|
| base (start with DDP) | 9h+ |
| + speed-up training recipes | 1h+ |
| + switch from get_batch first to sample_Indice first | 0.5h |
| + batch size 16 -> 8 | 33.32m |
| + num_workers 8 -> 4 | 35.52m |
| + frame_interval 16 -> 24 | 44.35m |

Tip: increasing frame_interval noticeably increases epoch time.

1. Speed-up training recipes:

  • Use more GPU devices.
  • Set pin_memory=True in the DataLoader.
  • Avoid GPU->CPU transfers inside the training loop (such as .item(), .numpy(), .cpu() calls on tensors, or logging to disk); see the DataLoader sketch below.
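An illustrative DataLoader configuration for these recipes; the dataset object is a placeholder, not this repo's exact class.

from torch.utils.data import DataLoader

train_loader = DataLoader(
	train_dataset,            # placeholder: your Kinetics dataset instance
	batch_size=8,
	num_workers=4,            # tune per machine; more workers is not always faster
	pin_memory=True,          # speeds up host-to-GPU copies
	persistent_workers=True,  # avoid re-spawning workers every epoch
	drop_last=True,
)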

2. get_batch first means that we first read all frames through the video reader and then take the target slice of frames, which largely slows down data loading; sample_Indice first instead computes the target frame indices first and decodes only those frames.
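A sketch of the sample-indices-first strategy using decord (assumed usage; the repo's dataset code may differ in detail):

import numpy as np
from decord import VideoReader

def load_clip(path, num_frames=16, frame_interval=4):
	vr = VideoReader(path)
	# compute the target frame indices first ...
	start = np.random.randint(0, max(1, len(vr) - num_frames * frame_interval))
	indices = np.clip(start + np.arange(num_frames) * frame_interval, 0, len(vr) - 1)
	# ... then decode only those frames, instead of reading every frame and slicing afterwards
	return vr.get_batch(indices).asnumpy()  # (num_frames, H, W, 3)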


Acknowledge

This repo is built on top of PyTorch-Lightning, pytorchvideo, skimage, decord and kornia. I also learned many code designs from MMAction2. I thank the authors for releasing their code.

Contribution

I look forward to hearing your ideas about this repo. Please feel free to report them in an issue, or even better, submit a pull request.

And your star is my motivation, thank u~

videotransformer-pytorch's People

Contributors

mx-mark

videotransformer-pytorch's Issues

How do we load ImageNet-21k ViT weights?

Hi guys, thanks for open sourcing this repo!

I see that your pretrained K600 models were initialized from the ViT ImageNet-21k weights. Can you share a snippet on how you initialized them? Did you use the models from timm?

Thanks!
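For illustration, one common way to obtain ImageNet-21k ViT-B/16 weights is via timm; this is an assumption, not necessarily how this repo initialized its models, and the model name may vary across timm versions.

import timm
import torch

vit = timm.create_model('vit_base_patch16_224_in21k', pretrained=True)
torch.save(vit.state_dict(), 'vit_base_patch16_224_in21k.pth')  # remap keys into the video model afterwards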

How can ViViT be used to extract video features?

What are the detailed steps and best practices for using the ViViT model to effectively extract video features for various video analysis tasks? I would greatly appreciate any guidance or insights. Thank you in advance.

Errors when loading pretrained weights -pretrain_pth 'vivit_model.pth' -weights_from 'kinetics'

When I try to fine-tune on my own dataset based on the pre-trained Kinetics ViViT model, the errors below occur. I am new to PyTorch; how could I solve them? Thanks.

command

python model_pretrain.py \
	-lr 0.001 -epoch 100 -batch_size 32 -num_workers 4  -frame_interval 16  \
	-arch 'vivit' -attention_type 'fact_encoder' -optim_type 'sgd' -lr_schedule 'cosine' \
	-objective 'supervised' -root_dir ./ \
    -gpus 0 -num_class 2 -img_size 50 -num_frames 13 \
    -warmup_epochs 5 \
    -pretrain_pth 'vivit_model.pth' -weights_from 'kinetics'

Errors:

RuntimeError: Error(s) in loading state_dict for ViViT:
File "/home/VideoTransformer-pytorch/weight_init.py", line 319, in init_from_kinetics_pretrain_
    msg = module.load_state_dict(state_dict, strict=False)
  File "/home/anaconda3/envs/pytorchvideo/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1407, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
        size mismatch for pos_embed: copying a param with shape torch.Size([1, 197, 768]) from checkpoint, the shape in current model is torch.Size([1, 10, 768]).
        size mismatch for time_embed: copying a param with shape torch.Size([1, 9, 768]) from checkpoint, the shape in current model is torch.Size([1, 7, 768]).
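A common workaround for this kind of shape mismatch, sketched below under the assumption that the first token is the class token, is to interpolate the checkpoint's spatial position embedding to the new patch grid before loading; the temporal embedding can be resized analogously with 1D interpolation.

import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, new_num_patches):
	# pos_embed: (1, 1 + N, D) with a leading class token
	cls_tok, patch_tok = pos_embed[:, :1], pos_embed[:, 1:]
	old_size = int(patch_tok.shape[1] ** 0.5)   # e.g. 14 for 196 patches (224/16)
	new_size = int(new_num_patches ** 0.5)      # e.g. 3 for 9 patches (50/16)
	patch_tok = patch_tok.reshape(1, old_size, old_size, -1).permute(0, 3, 1, 2)
	patch_tok = F.interpolate(patch_tok, size=(new_size, new_size), mode='bicubic', align_corners=False)
	patch_tok = patch_tok.permute(0, 2, 3, 1).reshape(1, new_size * new_size, -1)
	return torch.cat([cls_tok, patch_tok], dim=1)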

Question about Loading a pretrained model(ViT)

Hello, thanks for your work. I have a simple question.
I downloaded a pre-trained ViT weight from the Google Research GitHub, and I just want to know how I can verify that my ViViT model was initialized successfully from the pre-trained ViT weights.

model

Thanks for your code. I want to know whether the ViViT-B you published is Model 2 (the factorised encoder) from the ViViT paper or not.

Maskfeat downstream task performance

I tried to fine-tune a classifier with the MaskFeat pre-trained weights you provided, but the final performance was terrible (UCF101 Acc@top1 = 52%). What performance do you get when fine-tuning MaskFeat, and what are your MViT fine-tuning settings?

Pretrained ViViT weights

Hi, thanks for releasing your code. Do you have the pretrained weights for ViViT in torch? I found Timesformer weights only.

AttributeError: 'VideoTransformer' object has no attribute 'weight_decay'

I got this error until I changed the following line in model_trainer.py:

param_group["weight_decay"] = self._get_momentum(base_value=self.weight_decay, final_value=self.configs.weight_decay_end)

to

param_group["weight_decay"] = self._get_momentum(base_value=self.configs.weight_decay, final_value=self.configs.weight_decay_end)

structure of ViViT-b

What is the structure of the ViViT-B model you published? I can't load it with the default parameters.

build_finetune_optimizer raise NotImplementedError

Why does build_finetune_optimizer raise NotImplementedError if hparams.arch is not 'mvit'? I used the training command in the README to fine-tune ViViT.

def build_finetune_optimizer(hparams, model):
	if hparams.arch == 'mvit':
		if hparams.layer_decay == 1:
			get_layer_func = None
			scales = None
		else:
			num_layers = 16
			get_layer_func = partial(get_mvit_layer, num_layers=num_layers + 2)
			scales = list(hparams.layer_decay ** i for i in reversed(range(num_layers + 2)))
	else:
		raise NotImplementedError

error happened when I run dataset.py

Error information:

File "D:\anaconda3\envs\adarnn\lib\site-packages\torchvision\transforms\functional.py", line 494, in resized_crop
    assert _is_pil_image(img), 'img should be PIL Image'
AssertionError: img should be PIL Image

My configuration: Win10, Python 3.7, torch 1.6.0.
Your reply would be appreciated! Thank you very much!

Example training command/performance

Trying to get top1_acc of >78 as shown in the example log.

Do we know the settings and dataset used for training?

I am training on K400 and using the command in the example:
python model_pretrain.py \
	-lr 0.005 -pretrain 'vit' -objective 'supervised' -epoch 30 -batch_size 8 -num_workers 4 \
	-arch 'timesformer' -attention_type 'divided_space_time' -num_frames 8 -frame_interval 32 \
	-num_class 400 -optim_type 'sgd' -lr_schedule 'cosine' \
	-root_dir ROOT_DIR -train_data_path TRAIN_DATA_PATH -val_data_path VAL_DATA_PATH

I am unable to get above 73. Increasing frame_interval does not help.

Curious what I can do to get similar performance.

How to test my trained model?

Hello, thank you very much for sharing this wonderful project!

I have now trained my own model and generated a .pth file using this code. How can I use this .pth file to test other data?

Looking forward to your response, and I would greatly appreciate it!

Log-File for ViViT finetuning with Imagenet pre-train Weights

Hi @mx-mark
Do you have a log file for experiment of ViViT fine-tuning with Imagenet-21k pre-train weights?

I am referring to following experiment:

python model_pretrain.py \
	-lr 0.005 -epoch 30 -batch_size 8 -num_workers 4 -num_frames 16 -frame_interval 16 -num_class 400 \
	-arch 'vivit' -attention_type 'fact_encoder' -optim_type 'sgd' -lr_schedule 'cosine' \
	-objective 'supervised' -root_dir $ROOT_DIR -train_data_path $TRAIN_DATA_PATH \
	-val_data_path $VAL_DATA_PATH -pretrain_pth $PRETRAIN_WEIGHTS -weights_from 'imagenet'

HOG visualization

How do you visualize hog feature? The output is a histogram, right?
Thanks!
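For illustration, scikit-image can render a HOG image alongside the descriptor, which is the usual way to visualize the per-cell orientation histograms; `frame` is a placeholder RGB frame and the parameters are assumptions, not necessarily those used for MaskFeat.

import matplotlib.pyplot as plt
from skimage import color, feature

gray = color.rgb2gray(frame)  # frame: (H, W, 3) RGB video frame
hog_vec, hog_image = feature.hog(
	gray,
	orientations=9,
	pixels_per_cell=(8, 8),
	cells_per_block=(2, 2),
	visualize=True,
)
plt.imshow(hog_image, cmap='gray')
plt.axis('off')
plt.show()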

Missing keys in demo notebook

Hi, thank you for sharing your work.

When I follow the instructions in the notebook file (VideoTransformer_demo.ipynb), I have trouble loading the pre-trained weights of the ViViT model.

After downloading and placing the "./vivit_model.pth" file, I was able to instantiate the ViViT model.
However, the log says that there are many missing keys in the given pth file.

Is this the desired behavior, or should I do some preprocessing to match the parameter names?

This is the output after parameter loading.

load model finished, the missing key of transformer is:['transformer_layers.0.layers.0.attentions.0.attn.qkv.weight', 'transformer_layers.0.layers.0.attentions.0.attn.qkv.bias', 'transformer_layers.0.layers.0.attentions.0.attn.proj.weight', 'transformer_layers.0.layers.0.attentions.0.attn.proj.bias', 'transformer_layers.0.layers.1.attentions.0.attn.qkv.weight', 'transformer_layers.0.layers.1.attentions.0.attn.qkv.bias', 'transformer_layers.0.layers.1.attentions.0.attn.proj.weight', 'transformer_layers.0.layers.1.attentions.0.attn.proj.bias', 'transformer_layers.0.layers.2.attentions.0.attn.qkv.weight', 'transformer_layers.0.layers.2.attentions.0.attn.qkv.bias', 'transformer_layers.0.layers.2.attentions.0.attn.proj.weight', 'transformer_layers.0.layers.2.attentions.0.attn.proj.bias', 'transformer_layers.0.layers.3.attentions.0.attn.qkv.weight', 'transformer_layers.0.layers.3.attentions.0.attn.qkv.bias', 'transformer_layers.0.layers.3.attentions.0.attn.proj.weight', 'transformer_layers.0.layers.3.attentions.0.attn.proj.bias', 'transformer_layers.0.layers.4.attentions.0.attn.qkv.weight', 'transformer_layers.0.layers.4.attentions.0.attn.qkv.bias', 'transformer_layers.0.layers.4.attentions.0.attn.proj.weight', 'transformer_layers.0.layers.4.attentions.0.attn.proj.bias', 'transformer_layers.0.layers.5.attentions.0.attn.qkv.weight', 'transformer_layers.0.layers.5.attentions.0.attn.qkv.bias', 'transformer_layers.0.layers.5.attentions.0.attn.proj.weight', 'transformer_layers.0.layers.5.attentions.0.attn.proj.bias', 'transformer_layers.0.layers.6.attentions.0.attn.qkv.weight', 'transformer_layers.0.layers.6.attentions.0.attn.qkv.bias', 'transformer_layers.0.layers.6.attentions.0.attn.proj.weight', 'transformer_layers.0.layers.6.attentions.0.attn.proj.bias', 'transformer_layers.0.layers.7.attentions.0.attn.qkv.weight', 'transformer_layers.0.layers.7.attentions.0.attn.qkv.bias', 'transformer_layers.0.layers.7.attentions.0.attn.proj.weight', 'transformer_layers.0.layers.7.attentions.0.attn.proj.bias', 'transformer_layers.0.layers.8.attentions.0.attn.qkv.weight', 'transformer_layers.0.layers.8.attentions.0.attn.qkv.bias', 'transformer_layers.0.layers.8.attentions.0.attn.proj.weight', 'transformer_layers.0.layers.8.attentions.0.attn.proj.bias', 'transformer_layers.0.layers.9.attentions.0.attn.qkv.weight', 'transformer_layers.0.layers.9.attentions.0.attn.qkv.bias', 'transformer_layers.0.layers.9.attentions.0.attn.proj.weight', 'transformer_layers.0.layers.9.attentions.0.attn.proj.bias', 'transformer_layers.0.layers.10.attentions.0.attn.qkv.weight', 'transformer_layers.0.layers.10.attentions.0.attn.qkv.bias', 'transformer_layers.0.layers.10.attentions.0.attn.proj.weight', 'transformer_layers.0.layers.10.attentions.0.attn.proj.bias', 'transformer_layers.0.layers.11.attentions.0.attn.qkv.weight', 'transformer_layers.0.layers.11.attentions.0.attn.qkv.bias', 'transformer_layers.0.layers.11.attentions.0.attn.proj.weight', 'transformer_layers.0.layers.11.attentions.0.attn.proj.bias', 'transformer_layers.1.layers.0.attentions.0.attn.qkv.weight', 'transformer_layers.1.layers.0.attentions.0.attn.qkv.bias', 'transformer_layers.1.layers.0.attentions.0.attn.proj.weight', 'transformer_layers.1.layers.0.attentions.0.attn.proj.bias', 'transformer_layers.1.layers.1.attentions.0.attn.qkv.weight', 'transformer_layers.1.layers.1.attentions.0.attn.qkv.bias', 'transformer_layers.1.layers.1.attentions.0.attn.proj.weight', 'transformer_layers.1.layers.1.attentions.0.attn.proj.bias', 
'transformer_layers.1.layers.2.attentions.0.attn.qkv.weight', 'transformer_layers.1.layers.2.attentions.0.attn.qkv.bias', 'transformer_layers.1.layers.2.attentions.0.attn.proj.weight', 'transformer_layers.1.layers.2.attentions.0.attn.proj.bias', 'transformer_layers.1.layers.3.attentions.0.attn.qkv.weight', 'transformer_layers.1.layers.3.attentions.0.attn.qkv.bias', 'transformer_layers.1.layers.3.attentions.0.attn.proj.weight', 'transformer_layers.1.layers.3.attentions.0.attn.proj.bias'], cls is:[]

Thank you in advance!

+edit)
FYI, these are the unexpected keys from the load_state_dict().
transformer unexpected: ['cls_head.weight', 'cls_head.bias', 'transformer_layers.0.layers.0.attentions.0.attn.in_proj_weight', 'transformer_layers.0.layers.0.attentions.0.attn.in_proj_bias', 'transformer_layers.0.layers.0.attentions.0.attn.out_proj.weight', 'transformer_layers.0.layers.0.attentions.0.attn.out_proj.bias', 'transformer_layers.0.layers.1.attentions.0.attn.in_proj_weight', 'transformer_layers.0.layers.1.attentions.0.attn.in_proj_bias', 'transformer_layers.0.layers.1.attentions.0.attn.out_proj.weight', 'transformer_layers.0.layers.1.attentions.0.attn.out_proj.bias', 'transformer_layers.0.layers.2.attentions.0.attn.in_proj_weight', 'transformer_layers.0.layers.2.attentions.0.attn.in_proj_bias', 'transformer_layers.0.layers.2.attentions.0.attn.out_proj.weight', 'transformer_layers.0.layers.2.attentions.0.attn.out_proj.bias', 'transformer_layers.0.layers.3.attentions.0.attn.in_proj_weight', 'transformer_layers.0.layers.3.attentions.0.attn.in_proj_bias', 'transformer_layers.0.layers.3.attentions.0.attn.out_proj.weight', 'transformer_layers.0.layers.3.attentions.0.attn.out_proj.bias', 'transformer_layers.0.layers.4.attentions.0.attn.in_proj_weight', 'transformer_layers.0.layers.4.attentions.0.attn.in_proj_bias', 'transformer_layers.0.layers.4.attentions.0.attn.out_proj.weight', 'transformer_layers.0.layers.4.attentions.0.attn.out_proj.bias', 'transformer_layers.0.layers.5.attentions.0.attn.in_proj_weight', 'transformer_layers.0.layers.5.attentions.0.attn.in_proj_bias', 'transformer_layers.0.layers.5.attentions.0.attn.out_proj.weight', 'transformer_layers.0.layers.5.attentions.0.attn.out_proj.bias', 'transformer_layers.0.layers.6.attentions.0.attn.in_proj_weight', 'transformer_layers.0.layers.6.attentions.0.attn.in_proj_bias', 'transformer_layers.0.layers.6.attentions.0.attn.out_proj.weight', 'transformer_layers.0.layers.6.attentions.0.attn.out_proj.bias', 'transformer_layers.0.layers.7.attentions.0.attn.in_proj_weight', 'transformer_layers.0.layers.7.attentions.0.attn.in_proj_bias', 'transformer_layers.0.layers.7.attentions.0.attn.out_proj.weight', 'transformer_layers.0.layers.7.attentions.0.attn.out_proj.bias', 'transformer_layers.0.layers.8.attentions.0.attn.in_proj_weight', 'transformer_layers.0.layers.8.attentions.0.attn.in_proj_bias', 'transformer_layers.0.layers.8.attentions.0.attn.out_proj.weight', 'transformer_layers.0.layers.8.attentions.0.attn.out_proj.bias', 'transformer_layers.0.layers.9.attentions.0.attn.in_proj_weight', 'transformer_layers.0.layers.9.attentions.0.attn.in_proj_bias', 'transformer_layers.0.layers.9.attentions.0.attn.out_proj.weight', 'transformer_layers.0.layers.9.attentions.0.attn.out_proj.bias', 'transformer_layers.0.layers.10.attentions.0.attn.in_proj_weight', 'transformer_layers.0.layers.10.attentions.0.attn.in_proj_bias', 'transformer_layers.0.layers.10.attentions.0.attn.out_proj.weight', 'transformer_layers.0.layers.10.attentions.0.attn.out_proj.bias', 'transformer_layers.0.layers.11.attentions.0.attn.in_proj_weight', 'transformer_layers.0.layers.11.attentions.0.attn.in_proj_bias', 'transformer_layers.0.layers.11.attentions.0.attn.out_proj.weight', 'transformer_layers.0.layers.11.attentions.0.attn.out_proj.bias', 'transformer_layers.1.layers.0.attentions.0.attn.in_proj_weight', 'transformer_layers.1.layers.0.attentions.0.attn.in_proj_bias', 'transformer_layers.1.layers.0.attentions.0.attn.out_proj.weight', 'transformer_layers.1.layers.0.attentions.0.attn.out_proj.bias', 'transformer_layers.1.layers.1.attentions.0.attn.in_proj_weight', 
'transformer_layers.1.layers.1.attentions.0.attn.in_proj_bias', 'transformer_layers.1.layers.1.attentions.0.attn.out_proj.weight', 'transformer_layers.1.layers.1.attentions.0.attn.out_proj.bias', 'transformer_layers.1.layers.2.attentions.0.attn.in_proj_weight', 'transformer_layers.1.layers.2.attentions.0.attn.in_proj_bias', 'transformer_layers.1.layers.2.attentions.0.attn.out_proj.weight', 'transformer_layers.1.layers.2.attentions.0.attn.out_proj.bias', 'transformer_layers.1.layers.3.attentions.0.attn.in_proj_weight', 'transformer_layers.1.layers.3.attentions.0.attn.in_proj_bias', 'transformer_layers.1.layers.3.attentions.0.attn.out_proj.weight', 'transformer_layers.1.layers.3.attentions.0.attn.out_proj.bias']

classification head unexpected: ['cls_token', 'pos_embed', 'time_embed', 'patch_embed.projection.weight', 'patch_embed.projection.bias', 'transformer_layers.0.layers.0.attentions.0.norm.weight', 'transformer_layers.0.layers.0.attentions.0.norm.bias', 'transformer_layers.0.layers.0.attentions.0.attn.in_proj_weight', 'transformer_layers.0.layers.0.attentions.0.attn.in_proj_bias', 'transformer_layers.0.layers.0.attentions.0.attn.out_proj.weight', 'transformer_layers.0.layers.0.attentions.0.attn.out_proj.bias', 'transformer_layers.0.layers.0.ffns.0.norm.weight', 'transformer_layers.0.layers.0.ffns.0.norm.bias', 'transformer_layers.0.layers.0.ffns.0.layers.0.0.weight', 'transformer_layers.0.layers.0.ffns.0.layers.0.0.bias', 'transformer_layers.0.layers.0.ffns.0.layers.1.weight', 'transformer_layers.0.layers.0.ffns.0.layers.1.bias', 'transformer_layers.0.layers.1.attentions.0.norm.weight', 'transformer_layers.0.layers.1.attentions.0.norm.bias', 'transformer_layers.0.layers.1.attentions.0.attn.in_proj_weight', 'transformer_layers.0.layers.1.attentions.0.attn.in_proj_bias', 'transformer_layers.0.layers.1.attentions.0.attn.out_proj.weight', 'transformer_layers.0.layers.1.attentions.0.attn.out_proj.bias', 'transformer_layers.0.layers.1.ffns.0.norm.weight', 'transformer_layers.0.layers.1.ffns.0.norm.bias', 'transformer_layers.0.layers.1.ffns.0.layers.0.0.weight', 'transformer_layers.0.layers.1.ffns.0.layers.0.0.bias', 'transformer_layers.0.layers.1.ffns.0.layers.1.weight', 'transformer_layers.0.layers.1.ffns.0.layers.1.bias', 'transformer_layers.0.layers.2.attentions.0.norm.weight', 'transformer_layers.0.layers.2.attentions.0.norm.bias', 'transformer_layers.0.layers.2.attentions.0.attn.in_proj_weight', 'transformer_layers.0.layers.2.attentions.0.attn.in_proj_bias', 'transformer_layers.0.layers.2.attentions.0.attn.out_proj.weight', 'transformer_layers.0.layers.2.attentions.0.attn.out_proj.bias', 'transformer_layers.0.layers.2.ffns.0.norm.weight', 'transformer_layers.0.layers.2.ffns.0.norm.bias', 'transformer_layers.0.layers.2.ffns.0.layers.0.0.weight', 'transformer_layers.0.layers.2.ffns.0.layers.0.0.bias', 'transformer_layers.0.layers.2.ffns.0.layers.1.weight', 'transformer_layers.0.layers.2.ffns.0.layers.1.bias', 'transformer_layers.0.layers.3.attentions.0.norm.weight', 'transformer_layers.0.layers.3.attentions.0.norm.bias', 'transformer_layers.0.layers.3.attentions.0.attn.in_proj_weight', 'transformer_layers.0.layers.3.attentions.0.attn.in_proj_bias', 'transformer_layers.0.layers.3.attentions.0.attn.out_proj.weight', 'transformer_layers.0.layers.3.attentions.0.attn.out_proj.bias', 'transformer_layers.0.layers.3.ffns.0.norm.weight', 'transformer_layers.0.layers.3.ffns.0.norm.bias', 'transformer_layers.0.layers.3.ffns.0.layers.0.0.weight', 'transformer_layers.0.layers.3.ffns.0.layers.0.0.bias', 'transformer_layers.0.layers.3.ffns.0.layers.1.weight', 'transformer_layers.0.layers.3.ffns.0.layers.1.bias', 'transformer_layers.0.layers.4.attentions.0.norm.weight', 'transformer_layers.0.layers.4.attentions.0.norm.bias', 'transformer_layers.0.layers.4.attentions.0.attn.in_proj_weight', 'transformer_layers.0.layers.4.attentions.0.attn.in_proj_bias', 'transformer_layers.0.layers.4.attentions.0.attn.out_proj.weight', 'transformer_layers.0.layers.4.attentions.0.attn.out_proj.bias', 'transformer_layers.0.layers.4.ffns.0.norm.weight', 'transformer_layers.0.layers.4.ffns.0.norm.bias', 'transformer_layers.0.layers.4.ffns.0.layers.0.0.weight', 'transformer_layers.0.layers.4.ffns.0.layers.0.0.bias', 
'transformer_layers.0.layers.4.ffns.0.layers.1.weight', 'transformer_layers.0.layers.4.ffns.0.layers.1.bias', 'transformer_layers.0.layers.5.attentions.0.norm.weight', 'transformer_layers.0.layers.5.attentions.0.norm.bias', 'transformer_layers.0.layers.5.attentions.0.attn.in_proj_weight', 'transformer_layers.0.layers.5.attentions.0.attn.in_proj_bias', 'transformer_layers.0.layers.5.attentions.0.attn.out_proj.weight', 'transformer_layers.0.layers.5.attentions.0.attn.out_proj.bias', 'transformer_layers.0.layers.5.ffns.0.norm.weight', 'transformer_layers.0.layers.5.ffns.0.norm.bias', 'transformer_layers.0.layers.5.ffns.0.layers.0.0.weight', 'transformer_layers.0.layers.5.ffns.0.layers.0.0.bias', 'transformer_layers.0.layers.5.ffns.0.layers.1.weight', 'transformer_layers.0.layers.5.ffns.0.layers.1.bias', 'transformer_layers.0.layers.6.attentions.0.norm.weight', 'transformer_layers.0.layers.6.attentions.0.norm.bias', 'transformer_layers.0.layers.6.attentions.0.attn.in_proj_weight', 'transformer_layers.0.layers.6.attentions.0.attn.in_proj_bias', 'transformer_layers.0.layers.6.attentions.0.attn.out_proj.weight', 'transformer_layers.0.layers.6.attentions.0.attn.out_proj.bias', 'transformer_layers.0.layers.6.ffns.0.norm.weight', 'transformer_layers.0.layers.6.ffns.0.norm.bias', 'transformer_layers.0.layers.6.ffns.0.layers.0.0.weight', 'transformer_layers.0.layers.6.ffns.0.layers.0.0.bias', 'transformer_layers.0.layers.6.ffns.0.layers.1.weight', 'transformer_layers.0.layers.6.ffns.0.layers.1.bias', 'transformer_layers.0.layers.7.attentions.0.norm.weight', 'transformer_layers.0.layers.7.attentions.0.norm.bias', 'transformer_layers.0.layers.7.attentions.0.attn.in_proj_weight', 'transformer_layers.0.layers.7.attentions.0.attn.in_proj_bias', 'transformer_layers.0.layers.7.attentions.0.attn.out_proj.weight', 'transformer_layers.0.layers.7.attentions.0.attn.out_proj.bias', 'transformer_layers.0.layers.7.ffns.0.norm.weight', 'transformer_layers.0.layers.7.ffns.0.norm.bias', 'transformer_layers.0.layers.7.ffns.0.layers.0.0.weight', 'transformer_layers.0.layers.7.ffns.0.layers.0.0.bias', 'transformer_layers.0.layers.7.ffns.0.layers.1.weight', 'transformer_layers.0.layers.7.ffns.0.layers.1.bias', 'transformer_layers.0.layers.8.attentions.0.norm.weight', 'transformer_layers.0.layers.8.attentions.0.norm.bias', 'transformer_layers.0.layers.8.attentions.0.attn.in_proj_weight', 'transformer_layers.0.layers.8.attentions.0.attn.in_proj_bias', 'transformer_layers.0.layers.8.attentions.0.attn.out_proj.weight', 'transformer_layers.0.layers.8.attentions.0.attn.out_proj.bias', 'transformer_layers.0.layers.8.ffns.0.norm.weight', 'transformer_layers.0.layers.8.ffns.0.norm.bias', 'transformer_layers.0.layers.8.ffns.0.layers.0.0.weight', 'transformer_layers.0.layers.8.ffns.0.layers.0.0.bias', 'transformer_layers.0.layers.8.ffns.0.layers.1.weight', 'transformer_layers.0.layers.8.ffns.0.layers.1.bias', 'transformer_layers.0.layers.9.attentions.0.norm.weight', 'transformer_layers.0.layers.9.attentions.0.norm.bias', 'transformer_layers.0.layers.9.attentions.0.attn.in_proj_weight', 'transformer_layers.0.layers.9.attentions.0.attn.in_proj_bias', 'transformer_layers.0.layers.9.attentions.0.attn.out_proj.weight', 'transformer_layers.0.layers.9.attentions.0.attn.out_proj.bias', 'transformer_layers.0.layers.9.ffns.0.norm.weight', 'transformer_layers.0.layers.9.ffns.0.norm.bias', 'transformer_layers.0.layers.9.ffns.0.layers.0.0.weight', 'transformer_layers.0.layers.9.ffns.0.layers.0.0.bias', 
'transformer_layers.0.layers.9.ffns.0.layers.1.weight', 'transformer_layers.0.layers.9.ffns.0.layers.1.bias', 'transformer_layers.0.layers.10.attentions.0.norm.weight', 'transformer_layers.0.layers.10.attentions.0.norm.bias', 'transformer_layers.0.layers.10.attentions.0.attn.in_proj_weight', 'transformer_layers.0.layers.10.attentions.0.attn.in_proj_bias', 'transformer_layers.0.layers.10.attentions.0.attn.out_proj.weight', 'transformer_layers.0.layers.10.attentions.0.attn.out_proj.bias', 'transformer_layers.0.layers.10.ffns.0.norm.weight', 'transformer_layers.0.layers.10.ffns.0.norm.bias', 'transformer_layers.0.layers.10.ffns.0.layers.0.0.weight', 'transformer_layers.0.layers.10.ffns.0.layers.0.0.bias', 'transformer_layers.0.layers.10.ffns.0.layers.1.weight', 'transformer_layers.0.layers.10.ffns.0.layers.1.bias', 'transformer_layers.0.layers.11.attentions.0.norm.weight', 'transformer_layers.0.layers.11.attentions.0.norm.bias', 'transformer_layers.0.layers.11.attentions.0.attn.in_proj_weight', 'transformer_layers.0.layers.11.attentions.0.attn.in_proj_bias', 'transformer_layers.0.layers.11.attentions.0.attn.out_proj.weight', 'transformer_layers.0.layers.11.attentions.0.attn.out_proj.bias', 'transformer_layers.0.layers.11.ffns.0.norm.weight', 'transformer_layers.0.layers.11.ffns.0.norm.bias', 'transformer_layers.0.layers.11.ffns.0.layers.0.0.weight', 'transformer_layers.0.layers.11.ffns.0.layers.0.0.bias', 'transformer_layers.0.layers.11.ffns.0.layers.1.weight', 'transformer_layers.0.layers.11.ffns.0.layers.1.bias', 'transformer_layers.1.layers.0.attentions.0.norm.weight', 'transformer_layers.1.layers.0.attentions.0.norm.bias', 'transformer_layers.1.layers.0.attentions.0.attn.in_proj_weight', 'transformer_layers.1.layers.0.attentions.0.attn.in_proj_bias', 'transformer_layers.1.layers.0.attentions.0.attn.out_proj.weight', 'transformer_layers.1.layers.0.attentions.0.attn.out_proj.bias', 'transformer_layers.1.layers.0.ffns.0.norm.weight', 'transformer_layers.1.layers.0.ffns.0.norm.bias', 'transformer_layers.1.layers.0.ffns.0.layers.0.0.weight', 'transformer_layers.1.layers.0.ffns.0.layers.0.0.bias', 'transformer_layers.1.layers.0.ffns.0.layers.1.weight', 'transformer_layers.1.layers.0.ffns.0.layers.1.bias', 'transformer_layers.1.layers.1.attentions.0.norm.weight', 'transformer_layers.1.layers.1.attentions.0.norm.bias', 'transformer_layers.1.layers.1.attentions.0.attn.in_proj_weight', 'transformer_layers.1.layers.1.attentions.0.attn.in_proj_bias', 'transformer_layers.1.layers.1.attentions.0.attn.out_proj.weight', 'transformer_layers.1.layers.1.attentions.0.attn.out_proj.bias', 'transformer_layers.1.layers.1.ffns.0.norm.weight', 'transformer_layers.1.layers.1.ffns.0.norm.bias', 'transformer_layers.1.layers.1.ffns.0.layers.0.0.weight', 'transformer_layers.1.layers.1.ffns.0.layers.0.0.bias', 'transformer_layers.1.layers.1.ffns.0.layers.1.weight', 'transformer_layers.1.layers.1.ffns.0.layers.1.bias', 'transformer_layers.1.layers.2.attentions.0.norm.weight', 'transformer_layers.1.layers.2.attentions.0.norm.bias', 'transformer_layers.1.layers.2.attentions.0.attn.in_proj_weight', 'transformer_layers.1.layers.2.attentions.0.attn.in_proj_bias', 'transformer_layers.1.layers.2.attentions.0.attn.out_proj.weight', 'transformer_layers.1.layers.2.attentions.0.attn.out_proj.bias', 'transformer_layers.1.layers.2.ffns.0.norm.weight', 'transformer_layers.1.layers.2.ffns.0.norm.bias', 'transformer_layers.1.layers.2.ffns.0.layers.0.0.weight', 'transformer_layers.1.layers.2.ffns.0.layers.0.0.bias', 
'transformer_layers.1.layers.2.ffns.0.layers.1.weight', 'transformer_layers.1.layers.2.ffns.0.layers.1.bias', 'transformer_layers.1.layers.3.attentions.0.norm.weight', 'transformer_layers.1.layers.3.attentions.0.norm.bias', 'transformer_layers.1.layers.3.attentions.0.attn.in_proj_weight', 'transformer_layers.1.layers.3.attentions.0.attn.in_proj_bias', 'transformer_layers.1.layers.3.attentions.0.attn.out_proj.weight', 'transformer_layers.1.layers.3.attentions.0.attn.out_proj.bias', 'transformer_layers.1.layers.3.ffns.0.norm.weight', 'transformer_layers.1.layers.3.ffns.0.norm.bias', 'transformer_layers.1.layers.3.ffns.0.layers.0.0.weight', 'transformer_layers.1.layers.3.ffns.0.layers.0.0.bias', 'transformer_layers.1.layers.3.ffns.0.layers.1.weight', 'transformer_layers.1.layers.3.ffns.0.layers.1.bias', 'norm.weight', 'norm.bias']

How to place kinetics400 dataset?

Sorry to bother you again.
How should the prepared Kinetics-400 dataset be placed when pre-training according to the Usage section (ViViT)?

errors are reported as follows:

115 M Trainable params
0 Non-trainable params
115 M Total params
460.218 Total estimated model params size (MB)
Validation sanity check: 0%| | 0/2 [00:00<?, ?it/s][14:18:20] /github/workspace/src/video/video_reader.cc:83: ERROR opening: abseiling/BiKRPPjAzvw.mp4, No such file or directory
Error reading abseiling/BiKRPPjAzvw.mp4...
[14:18:20] /github/workspace/src/video/video_reader.cc:83: ERROR opening: hammer_throw/AYcsAYm3Pic.mp4, No such file or directory
Error reading hammer_throw/AYcsAYm3Pic.mp4...
[14:18:20] /github/workspace/src/video/video_reader.cc:83: ERROR opening: making_sushi/OZepG6XiLPU.mp4, No such file or directory
Error reading making_sushi/OZepG6XiLPU.mp4...
[14:18:20] /github/workspace/src/video/video_reader.cc:83: ERROR opening: holding_snake/iqXrMJdfD6Q.mp4, No such file or directory
Error reading holding_snake/iqXrMJdfD6Q.mp4...
[14:18:20] /github/workspace/src/video/video_reader.cc:83: ERROR opening: blowing_nose/W1rplwHQoxI.mp4, No such file or directory
Error reading blowing_nose/W1rplwHQoxI.mp4...

[14:34:25] /github/workspace/src/video/video_reader.cc:83: ERROR opening (No such file or directory), interleaved across worker processes:
Error reading cutting_pineapple/TnZIN3rfyIc.mp4...
Error reading snorkeling/yAiqzcM2UCo.mp4...
Error reading catching_or_throwing_baseball/F7hs_aIqsbk.mp4...
Error reading planting_trees/D5--ZGEjiWI.mp4...
Error reading cooking_egg/WIWuMqN_SV0.mp4...
Error reading shining_shoes/6VLda6SPjwQ.mp4...
Error reading motorcycling/y3ld8SrteSM.mp4...
Error reading ripping_paper/DOAwyFz2Y0I.mp4...
Error reading disc_golfing/5ueYObM1DOY.mp4...
Error reading shining_shoes/4F3HxPIT91o.mp4...
Error reading brushing_hair/6JO9EwAZ7Y0.mp4...
Error reading shearing_sheep/sSTHZHHp-_c.mp4...

Vivit Training Problem

First of all, thank you for your excellent work!
Let me describe my configuration first. I set the training hyper-parameters according to the Training section you gave, with two main changes: a different dataset and initialization from your Kinetics pre-trained model.
I am using the VGGSound dataset, which also splits each video into a sequence of RGB image frames.
The problem occurs during training: when initializing from the pre-trained model, the accuracy reaches about 0.2 after 1 epoch, but then decreases as training progresses.
2022-07-04 18:43:18 - Evaluating mean top1_acc:0.213, top5_acc:0.427 of current training epoch
2022-07-04 18:48:55 - Evaluating mean top1_acc:0.171, top5_acc:0.360 of current validation epoch
2022-07-04 21:08:07 - Evaluating mean top1_acc:0.197, top5_acc:0.430 of current training epoch
2022-07-04 21:12:59 - Evaluating mean top1_acc:0.071, top5_acc:0.202 of current validation epoch
2022-07-04 23:30:01 - Evaluating mean top1_acc:0.059, top5_acc:0.175 of current training epoch
2022-07-04 23:34:57 - Evaluating mean top1_acc:0.027, top5_acc:0.089 of current validation epoch
2022-07-05 01:46:54 - Evaluating mean top1_acc:0.029, top5_acc:0.102 of current training epoch
2022-07-05 01:51:35 - Evaluating mean top1_acc:0.017, top5_acc:0.060 of current validation epoch
2022-07-05 03:42:59 - Evaluating mean top1_acc:0.026, top5_acc:0.092 of current training epoch
2022-07-05 03:47:38 - Evaluating mean top1_acc:0.016, top5_acc:0.056 of current validation epoch
2022-07-05 05:42:18 - Evaluating mean top1_acc:0.027, top5_acc:0.096 of current training epoch
2022-07-05 05:46:48 - Evaluating mean top1_acc:0.013, top5_acc:0.054 of current validation epoch
2022-07-05 07:35:56 - Evaluating mean top1_acc:0.028, top5_acc:0.096 of current training epoch
2022-07-05 07:40:33 - Evaluating mean top1_acc:0.017, top5_acc:0.063 of current validation epoch
2022-07-05 09:32:25 - Evaluating mean top1_acc:0.028, top5_acc:0.099 of current training epoch
2022-07-05 09:37:00 - Evaluating mean top1_acc:0.017, top5_acc:0.066 of current validation epoch
2022-07-05 11:28:31 - Evaluating mean top1_acc:0.029, top5_acc:0.101 of current training epoch
2022-07-05 11:33:02 - Evaluating mean top1_acc:0.017, top5_acc:0.062 of current validation epoch

How to dataloader?

Hello, thank you very much for your outstanding work. I am new to computer vision, and I couldn't see how the images are loaded into the model. Could you tell me how to extract 16 frames from a video and input them into the ViViT model? Looking forward to your reply.

How to load Tensorflow checkpoints?

Hello, thanks for your great work. I have successfully trained ViViT. However, only a few checkpoints are available. In another issue, you mentioned that the pre-trained models come from the original Google repo. Could you kindly share the conversion code or describe the method?
