sail-sg / poolformer

PoolFormer: MetaFormer Is Actually What You Need for Vision (CVPR 2022 Oral)

Home Page: https://arxiv.org/abs/2111.11418

License: Apache License 2.0

Shell 0.34% Jupyter Notebook 53.59% Python 46.07%
transformer mlp pooling image-classification pytorch

poolformer's Introduction


🔥 🔥 Our follow-up work "MetaFormer Baselines for Vision" (code: metaformer) introduces more MetaFormer baselines, including:

  • IdentityFormer, whose token mixer is the identity mapping, surprisingly achieves >80% accuracy.
  • RandFormer achieves >81% accuracy with random token mixing, demonstrating that MetaFormer works well with arbitrary token mixers.
  • ConvFormer, whose token mixer is separable convolution, outperforms ConvNeXt by a large margin.
  • CAFormer, whose token mixers are separable convolutions and vanilla self-attention, sets a new record on ImageNet-1K.

This is a PyTorch implementation of PoolFormer proposed by our paper "MetaFormer Is Actually What You Need for Vision" (CVPR 2022 Oral).

Note: Rather than designing a complicated token mixer to achieve SOTA performance, the goal of this work is to demonstrate that the competence of Transformer models largely stems from the general architecture MetaFormer. Pooling and PoolFormer are just the tools to support this claim.

Figure 1: MetaFormer and performance of MetaFormer-based models on the ImageNet-1K validation set. We argue that the competence of Transformer/MLP-like models primarily stems from the general architecture MetaFormer rather than from the specific token mixers they are equipped with. To demonstrate this, we exploit an embarrassingly simple non-parametric operator, pooling, to conduct extremely basic token mixing. Surprisingly, the resulting model, PoolFormer, consistently outperforms DeiT and ResMLP as shown in (b), which supports the claim that MetaFormer is actually what we need to achieve competitive performance. RSB-ResNet in (b) means the results are from "ResNet Strikes Back", where ResNet is trained with an improved training procedure for 300 epochs.

Figure 2: (a) The overall framework of PoolFormer. (b) The architecture of the PoolFormer block. Compared with a Transformer block, it replaces attention with an extremely simple non-parametric operator, pooling, to conduct only basic token mixing.
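
For readers who want the token mixer in code, here is a minimal PyTorch sketch of the pooling operator described in the caption; it is intended to mirror the Pooling module in this repository's models package, but it is simplified and should be treated as illustrative.

import torch
import torch.nn as nn


class Pooling(nn.Module):
    """PoolFormer token mixer: average pooling minus the identity.

    Subtracting the input compensates for the residual connection of the block,
    so the mixing branch effectively contributes pool(x) - x.
    """

    def __init__(self, pool_size=3):
        super().__init__()
        self.pool = nn.AvgPool2d(
            pool_size, stride=1, padding=pool_size // 2, count_include_pad=False)

    def forward(self, x):  # x: [N, C, H, W]
        return self.pool(x) - x


# Shapes are preserved, so the operator can stand in for attention in a block.
x = torch.randn(1, 64, 56, 56)
print(Pooling()(x).shape)  # torch.Size([1, 64, 56, 56])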

Bibtex

@inproceedings{yu2022metaformer,
  title={Metaformer is actually what you need for vision},
  author={Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={10819--10829},
  year={2022}
}

Detection and instance segmentation on COCO: configs and trained models are here.

Semantic segmentation on ADE20K: configs and trained models are here.

The code to visualize Grad-CAM activation maps of PoolFormer, DeiT, ResMLP, ResNet and Swin is here.

The code to measure MACs is here.

Image Classification

1. Requirements

torch>=1.7.0; torchvision>=0.8.0; pyyaml; apex-amp (if you want to use fp16); timm (pip install git+https://github.com/rwightman/pytorch-image-models.git@9d6aad44f8fd32e89e5cca503efe3ada5071cc2a)

Data preparation: ImageNet with the following folder structure; you can extract ImageNet using this script.

│imagenet/
├──train/
│  ├── n01440764
│  │   ├── n01440764_10026.JPEG
│  │   ├── n01440764_10027.JPEG
│  │   ├── ......
│  ├── ......
├──val/
│  ├── n01440764
│  │   ├── ILSVRC2012_val_00000293.JPEG
│  │   ├── ILSVRC2012_val_00002138.JPEG
│  │   ├── ......
│  ├── ......

2. PoolFormer Models

| Model | #Params | Image resolution | #MACs* | Top-1 Acc | Download |
| --- | --- | --- | --- | --- | --- |
| poolformer_s12 | 12M | 224 | 1.8G | 77.2 | here |
| poolformer_s24 | 21M | 224 | 3.4G | 80.3 | here |
| poolformer_s36 | 31M | 224 | 5.0G | 81.4 | here |
| poolformer_m36 | 56M | 224 | 8.8G | 82.1 | here |
| poolformer_m48 | 73M | 224 | 11.6G | 82.5 | here |

All the pretrained models can also be downloaded from Baidu Yun (password: esac). * For convenient comparison with future models, we report the MACs counted with the fvcore library (example code), which are also reported in the new arXiv version.
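
The linked example code is the repository's own measurement script; as a rough cross-check, here is a minimal sketch using fvcore's FlopCountAnalysis. It assumes the repo root is the working directory so that `import models` registers the poolformer variants with timm.

# Minimal MACs count with fvcore (fvcore counts multiply-accumulates for conv/linear ops).
import torch
from fvcore.nn import FlopCountAnalysis
from timm.models import create_model

import models  # noqa: F401  (assumption: this repo's model registry is importable)

model = create_model('poolformer_s12')
model.eval()

with torch.no_grad():
    macs = FlopCountAnalysis(model, torch.randn(1, 3, 224, 224)).total()
print(f'{macs / 1e9:.2f} GMACs')  # should be close to the 1.8G listed above for s12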

Web Demo

Integrated into Huggingface Spaces 🤗 using Gradio. Try out the Web Demo: Hugging Face Spaces

Usage

We also provide a Colab notebook that runs the steps to perform inference with PoolFormer: Colab
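
If you prefer a plain script over the notebook, the following minimal sketch performs the same inference steps; the checkpoint path and image file are placeholders, and `import models` assumes the repository root is on the Python path.

import torch
from PIL import Image
from timm.models import create_model
from timm.data import resolve_data_config, create_transform

import models  # noqa: F401  (assumption: registers poolformer_* with timm)

model = create_model('poolformer_s12', checkpoint_path='poolformer_s12.pth.tar')
model.eval()

# Build the evaluation transform (224x224, ImageNet normalization) from the model config.
config = resolve_data_config({}, model=model)
transform = create_transform(**config)

img = transform(Image.open('example.jpg').convert('RGB')).unsqueeze(0)
with torch.no_grad():
    probs = model(img).softmax(dim=-1)
print(probs.argmax(dim=-1).item())  # predicted ImageNet-1K class index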

3. Validation

To evaluate our PoolFormer models, run:

MODEL=poolformer_s12 # poolformer_{s12, s24, s36, m36, m48}
python3 validate.py /path/to/imagenet  --model $MODEL -b 128 \
  --pretrained # or --checkpoint /path/to/checkpoint 

4. Train

We show how to train PoolFormer models on 8 GPUs. The learning rate scales with the total batch size as lr = bs / 1024 × 1e-3; for example, with the total batch size of 1024 used below (8 GPUs × 128 per GPU), the learning rate is set to 1e-3 (for a batch size of 1024, setting the learning rate to 2e-3 sometimes gives better performance).

MODEL=poolformer_s12 # poolformer_{s12, s24, s36, m36, m48}
DROP_PATH=0.1 # drop path rates [0.1, 0.1, 0.2, 0.3, 0.4] corresponding to models [s12, s24, s36, m36, m48]
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./distributed_train.sh 8 /path/to/imagenet \
  --model $MODEL -b 128 --lr 1e-3 --drop-path $DROP_PATH --apex-amp

5. Visualization

[image: Grad-CAM activation maps]

The code to visualize Grad-CAM activation maps of PoolFormer, DeiT, ResMLP, ResNet and Swin is here.

Acknowledgment

Our implementation is mainly based on the following codebases. We sincerely thank the authors for their wonderful work.

pytorch-image-models, mmdetection, mmsegmentation.

Besides, Weihao Yu would like to thank the TPU Research Cloud (TRC) program for supporting part of the computational resources.

poolformer's People

Contributors

ak391, amrzv, ir1d, temps1101, yuweihao


poolformer's Issues

PoolFormer pretrained using MAE

Hi,

I really enjoyed reading the paper and am doing a 2D pose estimation project using PoolFormer as the backbone; I also love the idea of MetaFormer. Have you thought about pretraining the model using MAE? Would you expect a performance boost as ViT gets? Thanks in advance.

About Normalization

Hi, thanks for your excellent work.
In your ablation studies (section 4.4), you compared Group Normalization (group number is set as 1 for simplicity), Layer Normalization, and Batch Normalization. The conclusion is that Group Normalization is 0.7% or 0.8% higher than Layer Normalization or Batch Normalization.
But when the number of groups is 1, Group Normalization is equivalent to Layer Normalization, right?
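
For reference, a minimal check (not code from this repository) of what the two PyTorch modules compute on a [N, C, H, W] tensor: nn.GroupNorm with a single group computes its statistics over the channel and spatial dimensions, while the Transformer-style nn.LayerNorm computes them over the channel dimension only.

import torch
import torch.nn as nn

x = torch.randn(2, 6, 4, 4)

gn = nn.GroupNorm(1, 6, affine=False)            # statistics over (C, H, W) per sample
ln = nn.LayerNorm(6, elementwise_affine=False)   # statistics over C, per spatial location

y_gn = gn(x)
y_ln = ln(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

print(torch.allclose(y_gn, y_ln))  # False in general: the two normalize over different dims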

Why use pool(x) - x?

I have a question about the use of pool(x) - x.

My guess at the answer: pooling smooths x, and the subtraction makes the model ignore values that already match the smoothed x within the pool window and focus on neighbors that are smaller or larger than the smoothed x, reinforcing the relation within the 8-neighborhood.

So the pooling captures local relations, and then the conv/MLP mixer exchanges information globally.

Is that the reason? Thanks for your reply.

Some confusion about random mixing

Hi~ Many thanks for your excellent work and codebase. I still have some questions about the random mixing operator:

  1. In MetaFormer v1, I noticed that random mixing is followed by a softmax, while in v2 there is no softmax.
  2. As I understand it, the class spatialfc corresponds to random mixing, but its weights don't seem to be frozen in the codebase.

I would appreciate it if you could explain how random mixing works!

Why is the speed slower than pvtv2-b1?

Recently I trained a transformer-based instance segmentation model and tested it with different backbones; here are the results and the speed test:

[image]

The batch size shown is the training batch size. Why is PoolFormer the slowest one? Is that normal?

It is also slower than pvtv2-b1, and its precision is lower...

I can't load only m48 somehow.

Thank you for sharing your good code.
I have two questions.

1.
I downloaded poolformer_m48.pth.tar and poolformer_m36.pth.tar and loaded them, but somehow I can't load only m48.
The parameters being loaded differ from the saved ones.
I created the model using the code in train.py, like this:

args.model = 'poolformer_m48'
model = create_model(
    args.model,
    pretrained=args.pretrained,
    num_classes=args.num_classes,
    drop_rate=args.drop,
    drop_connect_rate=args.drop_connect,  # DEPRECATED, use drop_path
    drop_path_rate=args.drop_path,
    drop_block_rate=args.drop_block,
    global_pool=args.gp,
    bn_tf=args.bn_tf,
    bn_momentum=args.bn_momentum,
    bn_eps=args.bn_eps,
    scriptable=args.torchscript,
    checkpoint_path=args.initial_checkpoint)

Is anything wrong?
The args are left at their default values except for checkpoint_path and pretrained.

2.
Also, how can I set parameters such as dropout (other than drop path) when creating the model?
You've described it for running training, like this:

DROP_PATH=0.1 # drop path rates [0.1, 0.1, 0.2, 0.3, 0.4] corresponding to models [s12, s24, s36, m36, m48]
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./distributed_train.sh 8 /path/to/imagenet \
  --model $MODEL -b 128 --lr 1e-3 --drop-path $DROP_PATH --apex-amp

Why use use_layer_scale?

Thanks for your great contribution!
In the implementation of PoolFormerBlock, there is a layer_scale applied after the token_mixer. What is the impact of this operation?

On the use of Apex AMP and hybrid stages

Is there a specific reason why you used Apex AMP instead of the native AMP provided by PyTorch? Have you tried native AMP?

I tried to train poolformer_s12 and poolformer_s24 with solo-learn; with native fp16 the loss goes to nan after a few epochs, while with fp32 it works fine. Did you experience similar behavior?

On a side note, can you provide the implementation and the hyperparameters for the hybrid stage [Pool, Pool, Attention, Attention]? It seems very interesting!

Bug when transferring PoolFormer to DETR

Hi, I got this bug when transferring PoolFormer to a DETR-like model (simply replacing the backbone). It might be because DETR uses a single feature level, but I don't know exactly why; could you help take a look?

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by 
making sure all `forward` function outputs participate in calculating loss. 
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameters which did not receive grad for rank 0: backbone.norm0.weight, backbone.norm0.bias
Parameter indices which did not receive grad for rank 0: 248 249

This is my config:

model = dict(
    # backbone=dict(
    #     type='PyramidVisionTransformerV2',
    #     embed_dims=64,
    #     _delete_=True,
    #     out_indices=(0, 1, 2, 3,),
    #     num_layers=[2, 2, 2, 2],
    #     init_cfg=dict(checkpoint='https://github.com/whai362/PVT/'
    #                   'releases/download/v2/pvt_v2_b1.pth')),
    backbone=dict(
        type='poolformer_s24_feat',
        style='pytorch',
        out_indices=(0, 1, 2, 3,),
        norm_cfg=dict(type='BN', requires_grad=False),
        init_cfg=dict(
            type='Pretrained',
            checkpoint='https://github.com/sail-sg/poolformer/releases/download/v1.0/poolformer_s24.pth.tar',
        ),
    ),
    neck=dict(
        type='SMCAFPN',
        in_channels=[64, 128, 320, 512],
        out_channels=256,
        start_level=1,
        num_outs=5,
        relu_before_extra_convs=True),
    bbox_head=dict(
        in_channels=512,
    ))
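
For reference, MMDetection configs support a top-level find_unused_parameters flag that addresses exactly the situation described in the error message; a minimal sketch follows (an alternative is to output only the feature levels the neck actually consumes).

# Added at the top level of the MMDetection config: lets DistributedDataParallel
# tolerate parameters (here backbone.norm0.*) that never receive gradients
# because the corresponding feature level is not consumed by the neck/head.
find_unused_parameters = True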

About MLN(Modified Layer Normalization)

This paper provides new perspectives on the Transformer block, but I have some questions about one of the details.
As far as I know, the LayerNorm officially provided by PyTorch implements the same function as the MLN, which computes the mean and variance along the token and channel dimensions. So where is the improvement?
[image]
The official example:
#Image Example
N, C, H, W = 20, 5, 10, 10
input = torch.randn(N, C, H, W)
#Normalize over the last three dimensions (i.e. the channel and spatial dimensions)
#as shown in the image below
layer_norm = nn.LayerNorm([C, H, W])
output = layer_norm(input)

Object detection training

For the pre-trained weights, can I use my local path instead of the GitHub URL, like this?

[image]

When I run the training code, I get:
[image]

Can this be ignored?

Thank you in advance!

Inquiry about the Hybrid design

Hello, thank you for sharing the code of the paper.

Could you please release the code of the hybrid design?

Also, I have a question about the hybrid design in Table 6. Did you replace a whole pooling stage with an attention or SpatialFC stage, or just the last block of each stage?

Thank you.

Error: About self.pool(x)

Hello, I am very interested in the PoolFormer you proposed, but an error occurred when using PoolFormerBlock, as follows:
Traceback (most recent call last):
File "train.py", line 545, in
train(hyp, opt, device, tb_writer)
File "train.py", line 89, in train
model = Model(opt.cfg or ckpt['model'].yaml, ch=3, nc=nc, anchors=hyp.get('anchors')).to(device) # create
File "E:\Work\yolov5\models\yolo.py", line 106, in init
m.stride = torch.tensor([s / x.shape[-2] for x in self.forward(torch.zeros(1, ch, s, s))]) # forward
File "E:\Work\yolov5\models\yolo.py", line 138, in forward
return self.forward_once(x, profile) # single-scale inference, train
File "E:\Work\yolov5\models\yolo.py", line 157, in forward_once
x = m(x) # run # execute the network component operation
File "C:\conda\conda\envs\torch17\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "E:\Work\yolov5_T23\models\common.py", line 194, in forward
n = self.token_mixer(m)
File "C:\conda\conda\envs\torch17\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "E:\Work\yolov5_T23\models\Confor_VC.py", line 93, in forward
x1 = self.pool(x) - x # x1 = self.pool(x) - x
File "C:\conda\conda\envs\torch17\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "C:\conda\conda\envs\torch17\lib\site-packages\torch\nn\modules\pooling.py", line 594, in forward
return F.avg_pool2d(input, self.kernel_size, self.stride,
TypeError: avg_pool2d(): argument 'kernel_size' (position 2) must be tuple of ints, not bool

I want to put the PoolFormer block after a ConvBlock, and the above problem occurred.
Thank you!
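
The traceback suggests a bool reached the kernel_size argument of AvgPool2d, which typically happens when the block is constructed with positional arguments in the wrong order; below is a minimal sketch of a correctly constructed pooling mixer (illustrative, not the yolov5 integration code).

import torch
import torch.nn as nn

# nn.AvgPool2d needs an integer (or tuple of ints) kernel size; passing a flag such
# as True/False positionally into the pool_size slot triggers the TypeError above.
pool = nn.AvgPool2d(3, stride=1, padding=1, count_include_pad=False)

x = torch.randn(1, 8, 16, 16)
print((pool(x) - x).shape)  # torch.Size([1, 8, 16, 16]) -- no TypeError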

About subtract in pooling

Hi, thank you for publishing such a nice paper. I just have one question: I do not understand the subtraction of the input in Eq. 4. Is it necessary? What would happen if we just did the average pooling without subtracting the input?

Invitation of making PR for OpenMMLab / MMSegmentation.

Hi, first, congrats on the acceptance at CVPR 2022. This work deserves it because it is very good.

I am a member of OpenMMLab and mainly work on developing MMSegmentation. I think that if PoolFormer were supported officially, many more people would use it as a benchmark, which would promote research in the computer vision area.

Would you like to make a PR for OpenMMLab? We could discuss refactoring your code together and use our own GPUs to train and re-implement it.

I think it would be pretty cool because it would let more researchers and community members use this excellent work! Here is our re-implementation work: ConvNeXt.

We do hope PoolFormer could also be added as a backbone in our codebase so that many researchers could use it directly for downstream tasks.

Looking forward to your reply!

Best,

What makes pooling competitive with, or even better than, attention?

In this paper, you confirm that the success of ViT does not come from the attention token mixer but from the general architecture defined as MetaFormer. The striking thing is that you just need to replace attention with a super simple pooling operator and it gives SOTA performance. So the question is: what makes pooling competitive with, or even better than, attention?

Checkpoints of the Ablation study

Hi, thanks for your amazing work.
I am reading Table 6, and I am surprised because the method is so simple yet very effective, especially when pooling is replaced with identity mapping: 74.3 top-1 on ImageNet-1K with only Conv1x1 and Norm layers. I am thrilled...
Can you release this checkpoint so that we can verify it? Thanks again.
[image]

Welcome update to OpenMMLab 2.0

I am Vansin, the technical operator of OpenMMLab. In September of last year, we announced the release of OpenMMLab 2.0 at the World Artificial Intelligence Conference in Shanghai. We invite you to upgrade your algorithm library to OpenMMLab 2.0 using MMEngine, which can be used for both research and commercial purposes. If you have any questions, please feel free to join us on the OpenMMLab Discord at https://discord.gg/amFNsyUBvm or add me on WeChat (van-sin) and I will invite you to the OpenMMLab WeChat group.

Here are the OpenMMLab 2.0 repos branches:

| Repo | OpenMMLab 1.0 branch | OpenMMLab 2.0 branch |
| --- | --- | --- |
| MMEngine | | 0.x |
| MMCV | 1.x | 2.x |
| MMDetection | 0.x, 1.x, 2.x | 3.x |
| MMAction2 | 0.x | 1.x |
| MMClassification | 0.x | 1.x |
| MMSegmentation | 0.x | 1.x |
| MMDetection3D | 0.x | 1.x |
| MMEditing | 0.x | 1.x |
| MMPose | 0.x | 1.x |
| MMDeploy | 0.x | 1.x |
| MMTracking | 0.x | 1.x |
| MMOCR | 0.x | 1.x |
| MMRazor | 0.x | 1.x |
| MMSelfSup | 0.x | 1.x |
| MMRotate | 1.x | 1.x |
| MMYOLO | | 0.x |

Attention: please create a new virtual environment for OpenMMLab 2.0.

When will the segmentation configs be released?

Many thanks for your excellent work.
I have tried the pretrained PoolFormer-S12 with the Semantic FPN config from MMSegmentation under the default setting (160k iterations). But the model only achieved 36+ mIoU (single-scale testing) on ADE20K, which is much lower than the result (37.2) in the main paper.
Could you please share the segmentation configs?

s12 model Reproduction experiment

Using the s12 model with only four cards and a batch size of 240 per card, the final top-1 accuracy is 76. Without eight cards, how can the accuracy reach 80? Other parameters are left at their defaults. Besides speeding up training, --apex-amp can also greatly affect the accuracy.

Some questions about LayerNorm and GroupNorm

Thanks for your good work in the CV area.
I have some questions about GroupNorm and LayerNorm. GroupNorm with group_num = 1 is equivalent to LayerNorm. Why does GroupNorm outperform LayerNorm in your ablation study (Table 6)?

A simple example from https://pytorch.org/docs/stable/generated/torch.nn.GroupNorm.html

input = torch.randn(20, 6, 10, 10)
# Separate 6 channels into 3 groups
m = nn.GroupNorm(3, 6)
# Separate 6 channels into 6 groups (equivalent with InstanceNorm)
m = nn.GroupNorm(6, 6)
# Put all 6 channels into a single group (equivalent with LayerNorm)
m = nn.GroupNorm(1, 6)
# Activating the module
output = m(input)

Looking forward to your reply!

Design on positional embedding?

Hello authors,

I appreciate a lot your current work, which inspired the community. I am here to raise a very simple and quick question after checking the code and architecture design.

I observed that for networks using pooling, MLP, or identity mapping as the token mixer, you do not include a positional embedding, and you only consider this component when you use MHA. What is the reasoning behind this design, and why do the other models not rely on this embedding?

Best,

How to measure MACs?

Hi, thanks for your nice work :)
I also watched your recorded presentation from the conference.

I want to apply PoolFormer in my work; may I ask how you measured the MACs of the architectures introduced in your paper?
Or, if it's not too much trouble, could you share your measurement code?

Addition of the Organization on HuggingFace Transformers

Hello PoolFormer team!

I have been working on porting the implementation of PoolFormer to HuggingFace Transformers library (you can see my PR here) and I was wondering if I can go ahead and add Sea AI labs as an organization to the HuggingFace models hub.

This will allow all model checkpoints to be uploaded onto the hub as well as model cards, etc.

Kind regards,
Tanay Mehta

About clip_norm

Thanks for the excellent work and Happy Chinese New Year!

I noticed that you didn't use the clip_norm that is commonly used in your previous works. Why?

PoolFormer for Segmentation task

Hello!
When applying PoolFormer to a segmentation task, using the provided pth file shows that there are no weights for the hierarchical features used for dense prediction.

Since I use a pretrained model for classification tasks, I think this is an expected result.

Did you also train PoolFormer for the segmentation task without the weights for the hierarchical features?

Thank you.

Why not using DW conv

Hi, thanks for the paper.
While your paper does show again that almost any mixing in the spatial domain can work in CV, from a practical point of view there is a large issue with using AvgPool2d. At inference it is not faster than a depthwise conv, yet it uses a fixed filter instead of a learned one, which leads to much lower network capacity. Have you tried using a DW 3x3 conv instead of AvgPool?
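
For concreteness, here is a sketch of the depthwise 3x3 token mixer the question refers to: a learned per-channel filter that is shape-compatible with the pooling mixer. This is not code from this repository.

import torch
import torch.nn as nn


class DWConvMixer(nn.Module):
    """Learned 3x3 depthwise convolution as a token mixer (same I/O shape as pooling)."""

    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim, bias=False)

    def forward(self, x):  # x: [N, C, H, W]
        return self.dwconv(x)


x = torch.randn(1, 64, 56, 56)
print(DWConvMixer(64)(x).shape)  # torch.Size([1, 64, 56, 56])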

No module named 'mmcv_custom.runner.optimizer'

Hello, I haven't modified the code, but I can't run it. The error message is as follows:
No module named 'mmcv_custom.runner.optimizer'
I checked the mmdet version and found it to be correct.

About poolformer as a tool for demonstration of MetaFormer

Hi, thanks for the wonderful work; I am really impressed by the proposed 'MetaFormer' concept and the experimental results you have provided! While reading the paper, some questions came up regarding PoolFormer and the MetaFormer concept that I wanted to share with you.

  1. As far as I understand, MetaFormer basically consists of 'input embedding + iteration of blocks with [norm - token mixer - residual connection - norm - channel mixer - residual connection]'. Does MetaFormer then make no assumption about non-overlapping patches or a sequence of flattened patches? If so, is the combination of the token mixer and channel mixer with the other components basically what defines 'MetaFormer', regardless of the hierarchical structure of the network or the shape of the inputs?
  2. PoolFormer uses non-parametric 2D pooling as the token mixer, which is extremely simple compared to previous token mixers. However, the patch embedding inserted between the blocks seems to perform implicit token mixing, since it is a convolution with a stride smaller than its kernel size and therefore yields overlapping patches. Given overlapping patches, I believe the resulting patches share information from the same spatial locations.

Thanks!
