
Visformer


Introduction

This is a PyTorch implementation of the Visformer models. The project is based on the training code in DeiT and the tools in timm.

Usage

Clone the repository:

git clone https://github.com/danczs/Visformer.git

Install PyTorch, timm and einops:

pip install -r requirements.txt

Data Preparation

The expected layout of the ImageNet data:

/path/to/imagenet/
  train/
    class1/
      img1.jpeg
    class2/
      img2.jpeg
  val/
    class1/
      img1.jpeg
    class2/
      img2.jpeg
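
For reference, below is a minimal sketch of reading this folder layout with torchvision's ImageFolder. The 224 crop and normalization statistics are common ImageNet defaults and are assumptions here; the actual transforms in this repo follow the DeiT/timm pipeline.

# Minimal sketch: loading the ImageNet-style folder layout above with torchvision.
# The crop size and normalization statistics are standard ImageNet defaults (assumptions);
# the real training/eval transforms in this repo come from the DeiT/timm code.
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

train_set = datasets.ImageFolder('/path/to/imagenet/train', transform=transform)
val_set = datasets.ImageFolder('/path/to/imagenet/val', transform=transform)
val_loader = torch.utils.data.DataLoader(val_set, batch_size=64, num_workers=8)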

Network Training

Visformer_small

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --model visformer_small --batch-size 64 --data-path /path/to/imagenet --output_dir /path/to/save

Visformer_tiny

python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --model visformer_tiny --batch-size 256 --drop-path 0.03 --data-path /path/to/imagenet --output_dir /path/to/save

Visformer V2 models

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --model swin_visformer_small_v2 --batch-size 64 --data-path /path/to/imagenet --output_dir /path/to/save --amp --qk-scale-factor=-0.5
python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --model swin_visformer_tiny_v2 --batch-size 256 --drop-path 0.03 --data-path /path/to/imagenet --output_dir /path/to/save --amp --qk-scale-factor=-0.5

Model performance:

model top-1 (%) FLOPs (G) parameters (M)
Visformer_tiny 78.6 1.3 10.3
Visformer_tiny_V2 79.6 1.3 9.4
Visformer_small 82.2 4.9 40.2
Visformer_small_V2 83.0 4.3 23.6
Visformer_medium_V2 83.6 8.5 44.5

Pre-trained models:

model weights log top-1 (%)
Visformer_small (original) github github 82.21
Visformer_small (+ Swin for downstream tasks) github github 82.34
Visformer_small_v2 (+ Swin for downstream tasks) github github 83.00
Visformer_medium_v2 (+ Swin for downstream tasks) github github 83.62

(In some logs, the model is only evaluated during the last 50 epochs to save training time.)

More information about Visformer V2.

Object Detection on COCO

Standard self-attention is not efficient for high-resolution inputs, so we simply replace it with Swin attention for object detection. Swin Transformer is therefore our direct baseline.

Mask R-CNN

Backbone sched box mAP mask mAP params (M) FLOPs (G) FPS
Swin-T 1x 42.6 39.3 48 267 14.8
Visformer-S 1x 43.0 39.6 60 275 13.1
VisformerV2-S 1x 44.8 40.7 43 262 15.2
Swin-T 3x + MS 46.0 41.6 48 367 14.8
VisformerV2-S 3x + MS 47.8 42.5 43 262 15.2

Cascade Mask R-CNN

Backbone sched box mAP mask mAP params (M) FLOPs (G) FPS
Swin-T 1x + MS 48.1 41.7 86 745 9.5
VisformerV2-S 1x + MS 49.3 42.3 81 740 9.6
Swin-T 3x + MS 50.5 43.7 86 745 9.5
VisformerV2-S 3x + MS 51.6 44.1 81 740 9.6

This repo contains only the key files for object detection ('./ObjectDetction'). The full detection project is available at Swin-Visformer-Object-Detection.

Pre-trained Model

Because of the policy of our institution, we cannot release the pre-trained models directly. Thankfully, @hzhang57 and @developer0hye provide Visformer_small and Visformer_tiny models trained by themselves.

Automatic Mixed Precision (amp)

In the original version of Visformer, amp can cause NaN values. We find that the overflow comes from the attention matrix:

# original attention logits: a single 1/sqrt(head_dim) scaling can overflow in fp16
scale = head_dim ** -0.5
attn = (q @ k.transpose(-2, -1)) * scale

To avoid overflow, we pre-normalize q and k, so that 'attn' is overall normalized by 'head_dim' instead of 'head_dim ** 0.5':

# pre-normalized: each factor carries 1/sqrt(head_dim), so the product is scaled by 1/head_dim
scale = head_dim ** -0.5
attn = (q * scale) @ (k.transpose(-2, -1) * scale)
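
Putting these two lines in context, here is a minimal sketch of the amp-friendly attention; the tensor layout and the standalone function are simplified assumptions for illustration, not the exact code in this repo.

# Minimal sketch of the pre-normalized attention, assuming q, k, v of shape
# (batch, num_heads, N, head_dim). Scaling q and k separately by 1/sqrt(head_dim)
# keeps the fp16 logits small enough to avoid overflow under amp.
import torch

def amp_friendly_attention(q, k, v):
    head_dim = q.shape[-1]
    scale = head_dim ** -0.5
    attn = (q * scale) @ (k.transpose(-2, -1) * scale)  # overall 1/head_dim scaling
    attn = attn.softmax(dim=-1)
    return attn @ v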

Amp training:

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --model visformer_small --batch-size 64 --data-path /path/to/imagenet --output_dir /path/to/save --amp --qk-scale-factor=-0.5
python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --model visformer_tiny --batch-size 256 --drop-path 0.03 --data-path /path/to/imagenet --output_dir /path/to/save --amp --qk-scale-factor=-0.5

This change does not degrade training performance.

Using amp to evaluate the original pre-trained models:

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --model visformer_small --batch-size 64 --data-path /path/to/imagenet --output_dir /path/to/save --eval --resume /path/to/weights --amp
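
For a rough picture of what this evaluation does, here is a minimal sketch; the 'model' checkpoint key follows the DeiT convention and, together with the placeholder batch, is an assumption rather than the exact code in main.py.

# Minimal sketch of amp evaluation of a pre-trained checkpoint (assumptions noted above);
# the real entry point is main.py with --eval --resume --amp.
import torch
from timm.models import create_model

model = create_model('visformer_small', pretrained=False)  # registered by this repo's model file (or recent timm)
checkpoint = torch.load('/path/to/weights', map_location='cpu')
model.load_state_dict(checkpoint['model'])                  # 'model' key is an assumption (DeiT convention)
model = model.cuda().eval()

images = torch.randn(8, 3, 224, 224).cuda()                 # placeholder batch of preprocessed images
with torch.no_grad(), torch.cuda.amp.autocast():
    logits = model(images)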

Citing

@inproceedings{chen2021visformer,
  title={Visformer: The vision-friendly transformer},
  author={Chen, Zhengsu and Xie, Lingxi and Niu, Jianwei and Liu, Xuefeng and Wei, Longhui and Tian, Qi},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={589--598},
  year={2021}
}


Issues

Accuracy improvement from visformer_tiny to visformer_tiny_V2

Hi, I noticed that visformer_tiny is provided in both model.py and swin_model.py, while visformer_tiny_V2 is only provided in swin_model.py. Is the roughly one-point accuracy improvement of visformer_tiny_V2 due to the adjusted network structure, or due to the swin_model.py style of implementation?

Pre-trained weights?

Hi, I want to extend the model to my own task. Will you release pre-trained weights?

Implementation of step-wise patch embedding

Hi, where exactly in the code is the step-wise patch embedding described in the paper implemented? Is it realized by using patch embeddings with different patch sizes at different stages?

Configuration of block numbers in different stages

Visformer has 7, 4, 4 blocks in stages S1, S2, S3, which is somewhat different from common configurations like ResNets. ResNets put the most blocks in stage 3, where the feature map is 14x14, while Visformer puts the most blocks in S1, where the feature map is 28x28. Did you conduct experiments showing that your configuration is better than the ResNet-style one in terms of performance and accuracy? I would appreciate it if you could share those experimental results or your thoughts on this. Thank you!

111

Can you provide a training example for the base setting?
