sail-sg / poolformer

PoolFormer: MetaFormer Is Actually What You Need for Vision (CVPR 2022 Oral)

Home Page: https://arxiv.org/abs/2111.11418

License: Apache License 2.0

Shell 0.34% Jupyter Notebook 53.59% Python 46.07%
transformer mlp pooling image-classification pytorch

poolformer's Introduction


🔥 🔥 Our follow-up work "MetaFormer Baselines for Vision" (code: metaformer) introduces more MetaFormer baselines, including:

  • IdentityFormer, whose token mixer is the identity mapping, surprisingly achieves >80% accuracy.
  • RandFormer achieves >81% accuracy with random token mixing, demonstrating that MetaFormer works well with arbitrary token mixers.
  • ConvFormer, whose token mixer is separable convolution, outperforms ConvNeXt by a large margin.
  • CAFormer, whose token mixers are separable convolutions and vanilla self-attention, sets a new record on ImageNet-1K.

This is a PyTorch implementation of PoolFormer proposed by our paper "MetaFormer Is Actually What You Need for Vision" (CVPR 2022 Oral).

Note: Rather than designing a complicated token mixer to achieve SOTA performance, the goal of this work is to demonstrate that the competence of Transformer models largely stems from the general architecture MetaFormer. Pooling and PoolFormer are just the tools to support this claim.

Figure 1: MetaFormer and performance of MetaFormer-based models on the ImageNet-1K validation set. We argue that the competence of Transformer/MLP-like models primarily stems from the general architecture MetaFormer rather than from the specific token mixers they are equipped with. To demonstrate this, we exploit an embarrassingly simple non-parametric operator, pooling, to conduct extremely basic token mixing. Surprisingly, the resulting model, PoolFormer, consistently outperforms DeiT and ResMLP as shown in (b), which supports the claim that MetaFormer is actually what we need to achieve competitive performance. RSB-ResNet in (b) means the results are from "ResNet Strikes Back", where ResNet is trained with an improved training procedure for 300 epochs.

Figure 2: (a) The overall framework of PoolFormer. (b) The architecture of the PoolFormer block. Compared with a Transformer block, it replaces attention with an extremely simple non-parametric operator, pooling, to conduct only basic token mixing.
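
For readers who want the token mixer in code, here is a minimal PyTorch sketch of the pooling operator described in the caption; it is intended to mirror the Pooling module in this repository's models package, but it is simplified and should be treated as illustrative.

import torch
import torch.nn as nn


class Pooling(nn.Module):
    """PoolFormer token mixer: average pooling minus the identity.

    Subtracting the input compensates for the residual connection of the block,
    so the mixing branch effectively contributes pool(x) - x.
    """

    def __init__(self, pool_size=3):
        super().__init__()
        self.pool = nn.AvgPool2d(
            pool_size, stride=1, padding=pool_size // 2, count_include_pad=False)

    def forward(self, x):  # x: [N, C, H, W]
        return self.pool(x) - x


# Shapes are preserved, so the operator can stand in for attention in a block.
x = torch.randn(1, 64, 56, 56)
print(Pooling()(x).shape)  # torch.Size([1, 64, 56, 56])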

Bibtex

@inproceedings{yu2022metaformer,
  title={Metaformer is actually what you need for vision},
  author={Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={10819--10829},
  year={2022}
}

Detection and instance segmentation on COCO: configs and trained models are here.

Semantic segmentation on ADE20K: configs and trained models are here.

The code to visualize Grad-CAM activation maps of PoolFormer, DeiT, ResMLP, ResNet and Swin is here.

The code to measure MACs is here.

Image Classification

1. Requirements

torch>=1.7.0; torchvision>=0.8.0; pyyaml; apex-amp (if you want to use fp16); timm (pip install git+https://github.com/rwightman/pytorch-image-models.git@9d6aad44f8fd32e89e5cca503efe3ada5071cc2a)

Data preparation: ImageNet with the following folder structure; you can extract ImageNet using this script.

│imagenet/
├──train/
│  ├── n01440764
│  │   ├── n01440764_10026.JPEG
│  │   ├── n01440764_10027.JPEG
│  │   ├── ......
│  ├── ......
├──val/
│  ├── n01440764
│  │   ├── ILSVRC2012_val_00000293.JPEG
│  │   ├── ILSVRC2012_val_00002138.JPEG
│  │   ├── ......
│  ├── ......

2. PoolFormer Models

| Model | #Params | Image resolution | #MACs* | Top-1 Acc | Download |
| --- | --- | --- | --- | --- | --- |
| poolformer_s12 | 12M | 224 | 1.8G | 77.2 | here |
| poolformer_s24 | 21M | 224 | 3.4G | 80.3 | here |
| poolformer_s36 | 31M | 224 | 5.0G | 81.4 | here |
| poolformer_m36 | 56M | 224 | 8.8G | 82.1 | here |
| poolformer_m48 | 73M | 224 | 11.6G | 82.5 | here |

All the pretrained models can also be downloaded from Baidu Yun (password: esac). * For convenient comparison with future models, we report the MACs counted with the fvcore library (example code), which are also reported in the new arXiv version.
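
The linked example code is the repository's own measurement script; as a rough cross-check, here is a minimal sketch using fvcore's FlopCountAnalysis. It assumes the repo root is the working directory so that `import models` registers the poolformer variants with timm.

# Minimal MACs count with fvcore (fvcore counts multiply-accumulates for conv/linear ops).
import torch
from fvcore.nn import FlopCountAnalysis
from timm.models import create_model

import models  # noqa: F401  (assumption: this repo's model registry is importable)

model = create_model('poolformer_s12')
model.eval()

with torch.no_grad():
    macs = FlopCountAnalysis(model, torch.randn(1, 3, 224, 224)).total()
print(f'{macs / 1e9:.2f} GMACs')  # should be close to the 1.8G listed above for s12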

Web Demo

Integrated into Huggingface Spaces 🤗 using Gradio. Try out the Web Demo: Hugging Face Spaces

Usage

We also provide a Colab notebook that runs the steps to perform inference with PoolFormer: Colab
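
If you prefer a plain script over the notebook, the following minimal sketch performs the same inference steps; the checkpoint path and image file are placeholders, and `import models` assumes the repository root is on the Python path.

import torch
from PIL import Image
from timm.models import create_model
from timm.data import resolve_data_config, create_transform

import models  # noqa: F401  (assumption: registers poolformer_* with timm)

model = create_model('poolformer_s12', checkpoint_path='poolformer_s12.pth.tar')
model.eval()

# Build the evaluation transform (224x224, ImageNet normalization) from the model config.
config = resolve_data_config({}, model=model)
transform = create_transform(**config)

img = transform(Image.open('example.jpg').convert('RGB')).unsqueeze(0)
with torch.no_grad():
    probs = model(img).softmax(dim=-1)
print(probs.argmax(dim=-1).item())  # predicted ImageNet-1K class index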

3. Validation

To evaluate our PoolFormer models, run:

MODEL=poolformer_s12 # poolformer_{s12, s24, s36, m36, m48}
python3 validate.py /path/to/imagenet  --model $MODEL -b 128 \
  --pretrained # or --checkpoint /path/to/checkpoint 

4. Train

We show how to train PoolFormer models on 8 GPUs. The learning rate scales with the total batch size as lr = bs / 1024 × 1e-3; for example, with the total batch size of 1024 used below (8 GPUs × 128 per GPU), the learning rate is set to 1e-3 (for a batch size of 1024, setting the learning rate to 2e-3 sometimes gives better performance).

MODEL=poolformer_s12 # poolformer_{s12, s24, s36, m36, m48}
DROP_PATH=0.1 # drop path rates [0.1, 0.1, 0.2, 0.3, 0.4] corresponding to models [s12, s24, s36, m36, m48]
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./distributed_train.sh 8 /path/to/imagenet \
  --model $MODEL -b 128 --lr 1e-3 --drop-path $DROP_PATH --apex-amp

5. Visualization

[image: Grad-CAM activation maps]

The code to visualize Grad-CAM activation maps of PoolFormer, DeiT, ResMLP, ResNet and Swin is here.

Acknowledgment

Our implementation is mainly based on the following codebases. We sincerely thank the authors for their wonderful work.

pytorch-image-models, mmdetection, mmsegmentation.

Besides, Weihao Yu would like to thank the TPU Research Cloud (TRC) program for supporting part of the computational resources.

poolformer's People

Contributors

ak391, amrzv, ir1d, temps1101, yuweihao


poolformer's Issues

PoolFormer pretrained using MAE

Hi,

I really enjoyed reading the paper and am doing a 2D pose estimation project using PoolFormer as the backbone; I also love the idea of MetaFormer. Have you thought about pretraining the model using MAE? Would you expect a performance boost as ViT gets? Thanks in advance.

About Normalization

Hi, thanks for your excellent work.
In your ablation studies (section 4.4), you compared Group Normalization (group number is set as 1 for simplicity), Layer Normalization, and Batch Normalization. The conclusion is that Group Normalization is 0.7% or 0.8% higher than Layer Normalization or Batch Normalization.
But when the number of groups is 1, Group Normalization is equivalent to Layer Normalization, right?
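
For reference, a minimal check (not code from this repository) of what the two PyTorch modules compute on a [N, C, H, W] tensor: nn.GroupNorm with a single group computes its statistics over the channel and spatial dimensions, while the Transformer-style nn.LayerNorm computes them over the channel dimension only.

import torch
import torch.nn as nn

x = torch.randn(2, 6, 4, 4)

gn = nn.GroupNorm(1, 6, affine=False)            # statistics over (C, H, W) per sample
ln = nn.LayerNorm(6, elementwise_affine=False)   # statistics over C, per spatial location

y_gn = gn(x)
y_ln = ln(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

print(torch.allclose(y_gn, y_ln))  # False in general: the two normalize over different dims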

Why use pool(x) - x?

I have a question about the use of pool(x) - x.

My guess at the answer: pooling smooths x, and the subtraction makes the model ignore values that already match the smoothed x within the pool window and focus on neighbors that are smaller or larger than the smoothed x, reinforcing the relation within the 8-neighborhood.

So the pooling captures local relations, and then the conv/MLP mixer exchanges information globally.

Is that the reason? Thanks for your reply.

Some confusion about random mixing

Hi~ Many thanks for your excellent work and codebase. I still have some questions about the random mixing operator:

  1. In MetaFormer v1, I noticed that random mixing is followed by a softmax, while in v2 there is no softmax.
  2. As I understand it, the class spatialfc corresponds to random mixing, but its weights don't seem to be frozen in the codebase.

I would appreciate it if you could explain how random mixing works!

Why is the speed slower than pvtv2-b1?

Recently I trained a transformer-based instance segmentation model and tested it with different backbones; here are the results and the speed test:

[image]

The batch size shown is the training batch size. Why is PoolFormer the slowest one? Is that normal?

It is also slower than pvtv2-b1, and its precision is lower...

I can't load only m48 somehow.

Thank you for sharing your good code.
I have two questions.

1.
I downloaded poolformer_m48.pth.tar and poolformer_m36.pth.tar and loaded them, but somehow I can't load only m48.
The parameters being loaded differ from the saved ones.
I created the model using the code in train.py, like this:

args.model = 'poolformer_m48'
model = create_model(
    args.model,
    pretrained=args.pretrained,
    num_classes=args.num_classes,
    drop_rate=args.drop,
    drop_connect_rate=args.drop_connect,  # DEPRECATED, use drop_path
    drop_path_rate=args.drop_path,
    drop_block_rate=args.drop_block,
    global_pool=args.gp,
    bn_tf=args.bn_tf,
    bn_momentum=args.bn_momentum,
    bn_eps=args.bn_eps,
    scriptable=args.torchscript,
    checkpoint_path=args.initial_checkpoint)

Is anything wrong?
The args are left at their default values except for checkpoint_path and pretrained.

2.
Also, how can I set parameters such as dropout (other than drop path) when creating the model?
You've described it for running training, like this:

DROP_PATH=0.1 # drop path rates [0.1, 0.1, 0.2, 0.3, 0.4] corresponding to models [s12, s24, s36, m36, m48]
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./distributed_train.sh 8 /path/to/imagenet \
  --model $MODEL -b 128 --lr 1e-3 --drop-path $DROP_PATH --apex-amp

Why use use_layer_scale?

Thanks for your great contribution!
In the implementation of PoolFormerBlock, there is a layer_scale applied after the token_mixer. What is the impact of this operation?

On the use of Apex AMP and hybrid stages

Is there a specific reason why you used Apex AMP instead of the native AMP provided by PyTorch? Have you tried native AMP?

I tried to train poolformer_s12 and poolformer_s24 with solo-learn; with native fp16 the loss goes to nan after a few epochs, while with fp32 it works fine. Did you experience similar behavior?

On a side note, can you provide the implementation and the hyperparameters for the hybrid stage [Pool, Pool, Attention, Attention]? It seems very interesting!

Bug when transferring PoolFormer to DETR

Hi, I got this bug when transferring PoolFormer to a DETR-like model (simply replacing the backbone). It might be because DETR uses a single feature level, but I don't know exactly why; could you help take a look?

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by 
making sure all `forward` function outputs participate in calculating loss. 
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameters which did not receive grad for rank 0: backbone.norm0.weight, backbone.norm0.bias
Parameter indices which did not receive grad for rank 0: 248 249

This is my config:

model = dict(
    # backbone=dict(
    #     type='PyramidVisionTransformerV2',
    #     embed_dims=64,
    #     _delete_=True,
    #     out_indices=(0, 1, 2, 3,),
    #     num_layers=[2, 2, 2, 2],
    #     init_cfg=dict(checkpoint='https://github.com/whai362/PVT/'
    #                   'releases/download/v2/pvt_v2_b1.pth')),
    backbone=dict(
        type='poolformer_s24_feat',
        style='pytorch',
        out_indices=(0, 1, 2, 3,),
        norm_cfg=dict(type='BN', requires_grad=False),
        init_cfg=dict(
            type='Pretrained',
            checkpoint='https://github.com/sail-sg/poolformer/releases/download/v1.0/poolformer_s24.pth.tar',
        ),
    ),
    neck=dict(
        type='SMCAFPN',
        in_channels=[64, 128, 320, 512],
        out_channels=256,
        start_level=1,
        num_outs=5,
        relu_before_extra_convs=True),
    bbox_head=dict(
        in_channels=512,
    ))
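
For reference, MMDetection configs support a top-level find_unused_parameters flag that addresses exactly the situation described in the error message; a minimal sketch follows (an alternative is to output only the feature levels the neck actually consumes).

# Added at the top level of the MMDetection config: lets DistributedDataParallel
# tolerate parameters (here backbone.norm0.*) that never receive gradients
# because the corresponding feature level is not consumed by the neck/head.
find_unused_parameters = True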

About MLN(Modified Layer Normalization)

This paper provides new perspectives on the Transformer block, but I have some questions about one of the details.
As far as I know, the LayerNorm officially provided by PyTorch implements the same function as the MLN, which computes the mean and variance along the token and channel dimensions. So where is the improvement?
[image]
The official example:
#Image Example
N, C, H, W = 20, 5, 10, 10
input = torch.randn(N, C, H, W)
#Normalize over the last three dimensions (i.e. the channel and spatial dimensions)
#as shown in the image below
layer_norm = nn.LayerNorm([C, H, W])
output = layer_norm(input)

Object detection training

For the pre-trained weights, can I use my local path instead of the GitHub URL, like this?

[image]

When I run the training code, I get:
[image]

Can this be ignored?

Thank you in advance!

Inquiry about the Hybrid design

Hello, thank you for sharing the code of the paper.

Could you please release the code of the hybrid design?

Also, I have a question about the hybrid design in Table 6. Did you replace a whole pooling stage with an attention or SpatialFC stage, or just the last block of each stage?

Thank you.

Error: About self.pool(x)

Hello, I am very interested in the PoolFormer you proposed, but an error occurred when using PoolFormerBlock, as follows:
Traceback (most recent call last):
File "train.py", line 545, in
train(hyp, opt, device, tb_writer)
File "train.py", line 89, in train
model = Model(opt.cfg or ckpt['model'].yaml, ch=3, nc=nc, anchors=hyp.get('anchors')).to(device) # create
File "E:\Work\yolov5\models\yolo.py", line 106, in init
m.stride = torch.tensor([s / x.shape[-2] for x in self.forward(torch.zeros(1, ch, s, s))]) # forward
File "E:\Work\yolov5\models\yolo.py", line 138, in forward
return self.forward_once(x, profile) # single-scale inference, train
File "E:\Work\yolov5\models\yolo.py", line 157, in forward_once
x = m(x) # run # execute the network component operation
File "C:\conda\conda\envs\torch17\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "E:\Work\yolov5_T23\models\common.py", line 194, in forward
n = self.token_mixer(m)
File "C:\conda\conda\envs\torch17\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "E:\Work\yolov5_T23\models\Confor_VC.py", line 93, in forward
x1 = self.pool(x) - x # x1 = self.pool(x) - x
File "C:\conda\conda\envs\torch17\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "C:\conda\conda\envs\torch17\lib\site-packages\torch\nn\modules\pooling.py", line 594, in forward
return F.avg_pool2d(input, self.kernel_size, self.stride,
TypeError: avg_pool2d(): argument 'kernel_size' (position 2) must be tuple of ints, not bool

I want to put the PoolFormer block after a ConvBlock, and the above problem occurred.
Thank you!
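
The traceback suggests a bool reached the kernel_size argument of AvgPool2d, which typically happens when the block is constructed with positional arguments in the wrong order; below is a minimal sketch of a correctly constructed pooling mixer (illustrative, not the yolov5 integration code).

import torch
import torch.nn as nn

# nn.AvgPool2d needs an integer (or tuple of ints) kernel size; passing a flag such
# as True/False positionally into the pool_size slot triggers the TypeError above.
pool = nn.AvgPool2d(3, stride=1, padding=1, count_include_pad=False)

x = torch.randn(1, 8, 16, 16)
print((pool(x) - x).shape)  # torch.Size([1, 8, 16, 16]) -- no TypeError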

About subtract in pooling

Hi, thank you for publishing such a nice paper. I just have one question: I do not understand the subtraction of the input in Eq. 4. Is it necessary? What would happen if we just did the average pooling without subtracting the input?

Invitation of making PR for OpenMMLab / MMSegmentation.

Hi, first, congrats on the acceptance at CVPR 2022. This work deserves it because it is very good.

I am a member of OpenMMLab and mainly work on developing MMSegmentation. I think that if PoolFormer were supported officially, many more people would use it as a benchmark, which would promote research in the computer vision area.

Would you like to make a PR for OpenMMLab? We could discuss refactoring your code together and use our own GPUs to train and re-implement it.

I think it would be pretty cool because it would let more researchers and community members use this excellent work! Here is our re-implementation work: ConvNeXt.

We do hope PoolFormer could also be added as a backbone in our codebase so that many researchers could use it directly for downstream tasks.

Looking forward to your reply!

Best,

What makes pooling competitive with, or even better than, attention?

In this paper, you confirm that the success of ViT does not come from the attention token mixer but from the general architecture defined as MetaFormer. The striking thing is that you just need to replace attention with a super simple pooling operator and it gives SOTA performance. So the question is: what makes pooling competitive with, or even better than, attention?

Checkpoints of the Ablation study

Hi, thanks for your amazing work.
I am reading Table 6, and I am surprised because the method is so simple yet very effective, especially when pooling is replaced with identity mapping: 74.3 top-1 on ImageNet-1K with only Conv1x1 and Norm layers. I am thrilled...
Can you release this checkpoint so that we can verify it? Thanks again.
[image]

Welcome update to OpenMMLab 2.0

I am Vansin, the technical operator of OpenMMLab. In September of last year, we announced the release of OpenMMLab 2.0 at the World Artificial Intelligence Conference in Shanghai. We invite you to upgrade your algorithm library to OpenMMLab 2.0 using MMEngine, which can be used for both research and commercial purposes. If you have any questions, please feel free to join us on the OpenMMLab Discord at https://discord.gg/amFNsyUBvm or add me on WeChat (van-sin) and I will invite you to the OpenMMLab WeChat group.

Here are the OpenMMLab 2.0 repos branches:

| Repo | OpenMMLab 1.0 branch | OpenMMLab 2.0 branch |
| --- | --- | --- |
| MMEngine | | 0.x |
| MMCV | 1.x | 2.x |
| MMDetection | 0.x, 1.x, 2.x | 3.x |
| MMAction2 | 0.x | 1.x |
| MMClassification | 0.x | 1.x |
| MMSegmentation | 0.x | 1.x |
| MMDetection3D | 0.x | 1.x |
| MMEditing | 0.x | 1.x |
| MMPose | 0.x | 1.x |
| MMDeploy | 0.x | 1.x |
| MMTracking | 0.x | 1.x |
| MMOCR | 0.x | 1.x |
| MMRazor | 0.x | 1.x |
| MMSelfSup | 0.x | 1.x |
| MMRotate | 1.x | 1.x |
| MMYOLO | | 0.x |

Attention: please create a new virtual environment for OpenMMLab 2.0.

When will the segmentation configs be released?

Many thanks for your excellent work.
I have tried the pretrained PoolFormer-S12 with the Semantic FPN config from MMSegmentation under the default setting (160k iterations). But the model only achieved 36+ mIoU (single-scale testing) on ADE20K, which is much lower than the result (37.2) in the main paper.
Could you please share the segmentation configs?

s12 model Reproduction experiment

Using the s12 model with only four cards and a batch size of 240 per card, the final top-1 accuracy is 76. Without eight cards, how can the accuracy reach 80? Other parameters are left at their defaults. Besides speeding up training, --apex-amp can also greatly affect the accuracy.

Some questions about LayerNorm and GroupNorm

Thanks for your good work in the CV area.
I have some questions about GroupNorm and LayerNorm. GroupNorm with group_num = 1 is equivalent to LayerNorm. Why does GroupNorm outperform LayerNorm in your ablation study (Table 6)?

A simple example from https://pytorch.org/docs/stable/generated/torch.nn.GroupNorm.html

input = torch.randn(20, 6, 10, 10)
# Separate 6 channels into 3 groups
m = nn.GroupNorm(3, 6)
# Separate 6 channels into 6 groups (equivalent with InstanceNorm)
m = nn.GroupNorm(6, 6)
# Put all 6 channels into a single group (equivalent with LayerNorm)
m = nn.GroupNorm(1, 6)
# Activating the module
output = m(input)

Looking forward to your reply!

Design on positional embedding?

Hello authors,

I appreciate a lot your current work, which inspired the community. I am here to raise a very simple and quick question after checking the code and architecture design.

I observed that for networks using pooling, MLP, or identity mapping as the token mixer, you do not include a positional embedding, and you only consider this component when you use MHA. What is the reasoning behind this design, and why do the other models not rely on this embedding?

Best,

How to measure MACs?

Hi, thanks for your nice work :)
I also watched your recorded presentation from the conference.

I want to apply PoolFormer in my work; may I ask how you measured the MACs of the architectures introduced in your paper?
Or, if it's not too much trouble, could you share your measurement code?

Addition of the Organization on HuggingFace Transformers

Hello PoolFormer team!

I have been working on porting the implementation of PoolFormer to HuggingFace Transformers library (you can see my PR here) and I was wondering if I can go ahead and add Sea AI labs as an organization to the HuggingFace models hub.

This will allow all model checkpoints to be uploaded onto the hub as well as model cards, etc.

Kind regards,
Tanay Mehta

About clip_norm

Thanks for the excellent work and Happy Chinese New Year!

I noticed that you didn't use the clip_norm that is commonly used in your previous works. Why?

PoolFormer for Segmentation task

Hello!
When applying PoolFormer to a segmentation task, using the provided pth file shows that there are no weights for the hierarchical features used for dense prediction.

Since I use a pretrained model for classification tasks, I think this is an expected result.

Did you also train PoolFormer for the segmentation task without the weights for the hierarchical features?

Thank you.

Why not using DW conv

Hi, thanks for the paper.
While your paper does show again that almost any mixing in the spatial domain can work in CV, from a practical point of view there is a large issue with using AvgPool2d. At inference it is not faster than a depthwise conv, yet it uses a fixed filter instead of a learned one, which leads to much lower network capacity. Have you tried using a DW 3x3 conv instead of AvgPool?
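
For concreteness, here is a sketch of the depthwise 3x3 token mixer the question refers to: a learned per-channel filter that is shape-compatible with the pooling mixer. This is not code from this repository.

import torch
import torch.nn as nn


class DWConvMixer(nn.Module):
    """Learned 3x3 depthwise convolution as a token mixer (same I/O shape as pooling)."""

    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim, bias=False)

    def forward(self, x):  # x: [N, C, H, W]
        return self.dwconv(x)


x = torch.randn(1, 64, 56, 56)
print(DWConvMixer(64)(x).shape)  # torch.Size([1, 64, 56, 56])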

No module named 'mmcv_custom.runner.optimizer'

Hello, I haven't modified the code, but I can't run it. The error message is as follows:
No module named 'mmcv_custom.runner.optimizer'
I checked the mmdet version and found it to be correct.

About poolformer as a tool for demonstration of MetaFormer

Hi, thanks for the wonderful work; I am really impressed by the proposed 'MetaFormer' concept and the experimental results you have provided! While reading the paper, some questions came up regarding PoolFormer and the MetaFormer concept that I wanted to share with you.

  1. As far as I understand, MetaFormer basically consists of 'input embedding + iteration of blocks with [norm - token mixer - residual connection - norm - channel mixer - residual connection]'. Does MetaFormer then make no assumption about non-overlapping patches or a sequence of flattened patches? If so, is the combination of the token mixer and channel mixer with the other components basically what defines 'MetaFormer', regardless of the hierarchical structure of the network or the shape of the inputs?
  2. PoolFormer uses non-parametric 2D pooling as the token mixer, which is extremely simple compared to previous token mixers. However, the patch embedding inserted between the blocks seems to perform implicit token mixing, since it is a convolution with a stride smaller than its kernel size and therefore yields overlapping patches. Given overlapping patches, I believe the resulting patches share information from the same spatial locations.

Thanks!
