
mvit's Introduction

Official PyTorch implementation of MViTv2, from the following paper:

MViTv2: Improved Multiscale Vision Transformers for Classification and Detection. CVPR 2022.
Yanghao Li*, Chao-Yuan Wu*, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer*


MViT is a multiscale transformer which serves as a general vision backbone for different visual recognition tasks:

Image Classification: Included in this repo.

Object Detection and Instance Segmentation: See MViTv2 in Detectron2.

Video Action Recognition and Detection: See MViTv2 in PySlowFast.

Results and Pre-trained Models

ImageNet-1K trained models

| name | resolution | acc@1 | #params | FLOPs | 1k model |
|------|------------|-------|---------|-------|----------|
| MViTv2-T | 224x224 | 82.3 | 24M | 4.7G | model |
| MViTv2-S | 224x224 | 83.6 | 35M | 7.0G | model |
| MViTv2-B | 224x224 | 84.4 | 52M | 10.2G | model |
| MViTv2-L | 224x224 | 85.3 | 218M | 42.1G | model |

ImageNet-21K trained models

| name | resolution | acc@1 | #params | FLOPs | 21k model | 1k model |
|------|------------|-------|---------|-------|-----------|----------|
| MViTv2-B | 224x224 | - | 52M | 10.2G | model | - |
| MViTv2-L | 224x224 | 87.5 | 218M | 42.1G | model | - |
| MViTv2-H | 224x224 | 88.0 | 667M | 120.6G | model | - |
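
The checkpoints above are ordinary PyTorch files. Below is a minimal sketch for inspecting one after download, assuming (as in PySlowFast-style checkpoints) that the weights sit under a "model_state" key; the filename is illustrative:

import torch

# Load a downloaded checkpoint on CPU; the filename here is illustrative.
ckpt = torch.load("MViTv2_T_in1k.pyth", map_location="cpu")
# Fall back to treating the whole file as a raw state dict if the key is absent.
state_dict = ckpt.get("model_state", ckpt)
print(len(state_dict), "tensors; first few:")
for name, tensor in list(state_dict.items())[:5]:
    print(" ", name, tuple(tensor.shape))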

Installation

Please check INSTALL.md for installation instructions.

Training

A standard MViTv2 model can be trained from scratch with:

python tools/main.py \
  --cfg configs/MViTv2_T.yaml \
  DATA.PATH_TO_DATA_DIR path_to_your_dataset \
  NUM_GPUS 8 \
  TRAIN.BATCH_SIZE 256
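
The trailing KEY VALUE pairs override entries in the YAML config, so the same entry point adapts to other setups. For example, a hypothetical run on 2 GPUs with a proportionally smaller batch (the values are illustrative; only keys already shown above are used):

python tools/main.py \
  --cfg configs/MViTv2_T.yaml \
  DATA.PATH_TO_DATA_DIR path_to_your_dataset \
  NUM_GPUS 2 \
  TRAIN.BATCH_SIZE 64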

Evaluation

To evaluate a pretrained MViT model:

python tools/main.py \
  --cfg configs/test/MViTv2_T_test.yaml \
  DATA.PATH_TO_DATA_DIR path_to_your_dataset \
  NUM_GPUS 8 \
  TEST.BATCH_SIZE 256

Acknowledgement

This repository is built on top of the PySlowFast codebase.

License

MViT is released under the Apache 2.0 license.

Citation

If you find this repository helpful, please consider citing:

@inproceedings{li2021improved,
  title={MViTv2: Improved multiscale vision transformers for classification and detection},
  author={Li, Yanghao and Wu, Chao-Yuan and Fan, Haoqi and Mangalam, Karttikeya and Xiong, Bo and Malik, Jitendra and Feichtenhofer, Christoph},
  booktitle={CVPR},
  year={2022}
}

@inproceedings{fan2021multiscale,
  title={Multiscale vision transformers},
  author={Fan, Haoqi and Xiong, Bo and Mangalam, Karttikeya and Li, Yanghao and Yan, Zhicheng and Malik, Jitendra and Feichtenhofer, Christoph},
  booktitle={ICCV},
  year={2021}
}

mvit's People

Contributors

amyreese, bigfootjon, lyttonhao, r-barnes, thatch


mvit's Issues

MViTv2 on UCF101 and HMDB51

Thanks for your outstanding work! Both MViT (with pooling) and Swin (with windows) reduce network complexity, which gives me hope of running them on my machine. Although I prefer MViT for its simplicity, I am struggling with limited GPUs. Could you please assist with:

  1. An efficient speedup strategy. (MViTv2_S is still enormous for me; I mean an efficient video-specific strategy, if one exists.)
  2. Configs and models for UCF101 and HMDB51. (Perhaps transformers and new architectures do not work on these tiny datasets, but they are my last hope. Both fine-tuning and training from scratch are crucial for me; accuracy on UCF is low, but that is still better than the months-long wait for a K400 run.)
  3. UCF and HMDB dataloaders. (I defined ucf.py and hmdb.py by reusing almost the same code as kinetics.py; a more advanced dataloader implementation would also be essential. A sketch follows below.)

There seem to be many issues here, and addressing them may need many resources. However, if you have any ideas about any of them, please contact me at [email protected]. Looking forward to your reply. @lyttonhao @haooooooqi
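
As a hedged illustration of what item 3 might look like in PySlowFast: this sketch assumes its dataset registry and Kinetics loader keep their usual names, and that UCF101 annotations are prepared in the same csv format Kinetics uses (the class name must match what cfg.TRAIN.DATASET resolves to):

# Hypothetical ucf.py: reuse the Kinetics loader for UCF101.
from slowfast.datasets.build import DATASET_REGISTRY
from slowfast.datasets.kinetics import Kinetics

@DATASET_REGISTRY.register()
class Ucf101(Kinetics):
    """UCF101 clips listed in train/val/test csv files, Kinetics-style."""
    pass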

Pretrained model for AVA

Hello, thank you for your work. Is there a pretrained model for the AVA dataset?

Possible bug in MultiScaleBlock?

I was looking at the mvitv2 model for inclusion in timm and discovered an inconsistency...

def forward(self, x, hw_shape):
    x_norm = self.norm1(x)
    x_block, hw_shape_new = self.attn(x_norm, hw_shape)
    if self.dim_mul_in_att and self.dim != self.dim_out:
        # Shortcut projection applied to the *normalized* input.
        x = self.proj(x_norm)
    # When the branch above is skipped, the shortcut pools the raw input x.
    x_res, _ = attention_pool(
        x, self.pool_skip, hw_shape, has_cls_embed=self.has_cls_embed
    )
    x = x_res + self.drop_path(x_block)
    x_norm = self.norm2(x)
    x_mlp = self.mlp(x_norm)

When the shortcut projection is used in the attention path, it takes the normalized input; when it is not, the shortcut takes the non-normalized input. It's likely not a significant issue either way, but was this the intended behaviour, i.e. to use the normalized x in the attention shortcut only when the shortcut projection is applied?

Code for video action recognition not available

I find MViT quite interesting and wish to use it as a backbone for video quality assessment. To that end, I searched for the video action recognition code (link: https://github.com/facebookresearch/SlowFast/tree/main/projects/mvitv2), but it seems the code is not available or has not been released.

It would be a huge help if you released the code for videos, since I am a novice in the field. If the code is not available or cannot be released, could you give me some tips/pointers on where the major changes would take place?

Pretrained ImageNet-21k weights for initializing MViT for video training

Hi, I noticed that the "adaptive kv stride" in the configuration for the pretrained ImageNet weights is "4, 4", but according to the paper, the video version of MViTv2 uses an "adaptive kv stride" of "1, 8, 8". The weights therefore cannot be used directly to initialize video training. Would you mind sharing the weights used to initialize MViT for video training?

Would Mixed Precision Training Impact Performance?

Hi,

Thanks for the great work. I tried to train an MViTv2 model on IN-1K but got a top-1 accuracy about 0.5 points lower than reported. I think the only differences between my training procedure and the paper's are a smaller batch size (1024 instead of 2048) and enabling mixed-precision training. Do you think these could cause the performance difference? Thanks!

Best,
Junwei
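
One quick check on the batch-size half of this question is the linear LR scaling rule: halving the batch conventionally halves the base learning rate. A back-of-envelope sketch (the reference LR below is an illustrative placeholder, not the paper's actual value):

# Linear LR scaling: lr_new = lr_ref * (bs_new / bs_ref).
reference_lr, reference_bs = 2.5e-4, 2048  # reference_lr is a placeholder
my_bs = 1024
print(reference_lr * my_bs / reference_bs)  # -> 0.000125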

Request for requirements.txt

I am a beginner and want to use this project, but I find it difficult to get running. Could you provide a requirements.txt?
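
A hypothetical starting point, inferred only from this being a PyTorch repo built on PySlowFast-style tooling; the entries and versions below are illustrative guesses, not the project's actual dependency list:

# requirements.txt (illustrative; pin versions to match your CUDA build)
torch>=1.8          # core dependency
torchvision         # image transforms and datasets
fvcore              # assumed: PySlowFast-style configs use fvcore's CfgNode
numpy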

ImageNet-21k -> 1k fine-tunes

@lyttonhao do you happen to have links to the fine-tuned 1k weights mentioned in the paper at 224x224, 384x384, and 512x512 for the L/H models?

They're pretty decent results, but beyond that, they are of interest as some of the few fine-tuned weights I'm aware of that were pretrained on the winter21 variant of 21k.

The release date of MeMViT

Thanks for your great work in the field of video recognition. I would like to ask when you plan to release the code for MeMViT.

Demo

How can I run a demo with your code and pretrained weights to understand the code?
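
Since the MultiScaleBlock thread above notes the model was being ported to timm, one low-effort way to poke at MViTv2 is through timm rather than this repo. A sketch, assuming a timm version that ships the mvitv2 variants under these names:

import timm
import torch

# Assumes a timm release that includes MViTv2 (model name "mvitv2_tiny").
model = timm.create_model("mvitv2_tiny", pretrained=True).eval()
x = torch.randn(1, 3, 224, 224)  # dummy image batch
with torch.no_grad():
    logits = model(x)
print(logits.shape)  # expected: torch.Size([1, 1000])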

CLS token in pretrained model

Hi, I noticed that a CLS token still exists in the provided checkpoint. But the appendix says the class token is not used for training, so I am confused. Was the cls token in the provided checkpoint trained, or is it just randomly initialized?
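
One hedged way to probe this yourself: a trained parameter usually drifts from its initialization statistics (ViT-style models typically init tokens with a small truncated normal, std around 0.02). The sketch assumes the checkpoint stores weights under "model_state" and names the parameter with "cls_token"; the filename is illustrative:

import torch

ckpt = torch.load("MViTv2_B_in1k.pyth", map_location="cpu")
state = ckpt.get("model_state", ckpt)
for name, t in state.items():
    if "cls_token" in name:
        # A std far from ~0.02 or a nonzero mean hints the token was trained.
        print(name, tuple(t.shape), f"mean={t.mean():.4f} std={t.std():.4f}")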

HWIN

Hello, is the HWIN mechanism mentioned in the paper implemented in the code?
