shoufachen / adaptformer

[NeurIPS 2022] Implementation of "AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition"

Home Page: https://arxiv.org/abs/2205.13535

License: MIT License

Topics: adapter, recognition, vision-transformer, visual-adapter, neurips-2022

adaptformer's Introduction

[NeurIPS 2022] AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition

[teaser figure]

This is a PyTorch implementation of the paper AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition.

Shoufa Chen1*, Chongjian Ge1*, Zhan Tong2, Jiangliu Wang2,3, Yibing Song2, Jue Wang2, Ping Luo1
1The University of Hong Kong, 2Tencent AI Lab, 3The Chinese University of Hong Kong
*denotes equal contribution

Catalog

  • Video code
  • Image code

Usage

Install

  • Tesla V100 (32G): CUDA 10.1 + PyTorch 1.6.0 + torchvision 0.7.0
  • timm 0.4.8
  • einops
  • easydict

Data Preparation

See DATASET.md.

Training

Start

# video
OMP_NUM_THREADS=1 python3 -m torch.distributed.launch \
    --nproc_per_node=8 --nnodes=8 \
    --node_rank=$1 --master_addr=$2 --master_port=22234 \
    --use_env main_video.py \
    --finetune /path/to/pre_trained/checkpoints \
    --output_dir /path/to/output \
    --batch_size 16 --epochs 90 --blr 0.1 --weight_decay 0.0 --dist_eval \
    --data_path /path/to/SSV2 --data_set SSV2 \
    --ffn_adapt

Run this on each of the 8 nodes. --master_addr is set to the IP of node 0, and --node_rank is 0, 1, ..., 7 for the corresponding node.

# image
python3 -m torch.distributed.launch --nproc_per_node=8 --use_env main_image.py \
    --batch_size 128 --cls_token \
    --finetune /path/to/pre_trained/mae_pretrain_vit_b.pth \
    --dist_eval --data_path /path/to/data \
    --output_dir /path/to/output  \
    --drop_path 0.0  --blr 0.1 \
    --dataset cifar100 --ffn_adapt

To obtain the pre-trained checkpoint, see PRETRAIN.md.

Acknowledgement

The project is based on MAE, VideoMAE, timm, and MAM. Thanks for their awesome work.

Citation

@article{chen2022adaptformer,
      title={AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition},
      author={Chen, Shoufa and Ge, Chongjian and Tong, Zhan and Wang, Jiangliu and Song, Yibing and Wang, Jue and Luo, Ping},
      journal={arXiv preprint arXiv:2205.13535},
      year={2022}
}

License

This project is under the MIT license. See LICENSE for details.

adaptformer's People

Contributors

shoufachen

adaptformer's Issues

Inconsistency in pretrained MAE ViT-B/16 weights

Hi
I have a question about your pretrained MAE ViT-B/16 weights.
When I use "https://github.com/ShoufaChen/AdaptFormer/releases/download/v0.1/mae_pretrain_vit_b.pth", the checkpoint released on your GitHub repo, and train on CIFAR-100 for 1 epoch, I generally get
Acc@1 37.330 Acc@5 67.820 loss 2.768
(the command is python main_image.py --batch_size=128 --cls_token --epochs=1 --finetune=mae_pretrain_vit_b.pth --dist_eval --data_path=cifar100 --output_dir=output --num_workers=16 --drop_path=0.0 --blr=0.1 --dataset=cifar100 --ffn_adapt)
But when I use "https://dl.fbaipublicfiles.com/mae/pretrain/mae_pretrain_vit_base.pth", the checkpoint referenced in your paper's text, and train on CIFAR-100 for 1 epoch, I generally get
Acc@1 5.660 Acc@5 21.350 loss 4.295
(the command is python main_image.py --batch_size=128 --cls_token --epochs=1 --finetune=mae_pretrain_vit_base.pth --dist_eval --data_path=cifar100 --output_dir=output --num_workers=16 --drop_path=0.0 --blr=0.1 --dataset=cifar100 --ffn_adapt)
These results are consistent over multiple runs.
My questions are: what is the difference between these two pretrained weights, and what causes this huge difference in results?

Code for NUS-WIDE

Hello,

Could you please share the code on the NUS-WIDE dataset? Or, can you indicate what key adjustments I need to make if I adapt the existing open-source code to fit the NUS-WIDE dataset?

Thanks in advance!

How to solve the problem of NaN loss?

Hi,

I reproduced this code on the SSv2 dataset. I follow blr 0.1 and use 2 GPUs with a batch size of 7 (an effective total batch size of 14), but the loss becomes NaN at epoch 14. How can I solve this problem? Thanks~

Why should we even use adapters in image ViT?

The average running time per epoch is 2 minutes for full-tuning and 2:30 for AdaptFormer on a single RTX 3090, and the results (as we can see in your paper's tables) aren't significantly better; most of the time full-tuning even works better. This raises the question of why we should care about using adapters and a "small" number of tunable parameters in image ViT. Even the memory difference isn't that significant (10 GB for the adapter vs. 15 GB for full-tuning).
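For reference, the tunable-parameter count that this trade-off hinges on can be checked directly; a minimal PyTorch sketch (the freezing rule keyed on the module name "adaptmlp" and the example numbers in the comment are assumptions for illustration):

import torch.nn as nn

def freeze_all_but_adapter(model: nn.Module, adapter_key: str = "adaptmlp") -> None:
    # Freeze every parameter whose name does not contain adapter_key.
    for name, param in model.named_parameters():
        param.requires_grad = adapter_key in name

def count_params(model: nn.Module):
    # Return (trainable, total) parameter counts.
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total

# Usage, assuming `model` is a ViT whose adapter modules are named "adaptmlp":
#   freeze_all_but_adapter(model)
#   print(count_params(model))  # on the order of 1M trainable vs. ~86M total for ViT-B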

[ Preprocessing SSv2 ]

Could you please share the pre-processing script for SSv2? It is not available in the provided VideoMAE repo.

Could you also share the csv and txt files for the SSv2 splits, if possible?

Thanks !

Evaluation Results are not Consistent in Consecutive Evaluations & Sensitivity to Batch Size

Thank you for sharing this wonderful work! Could you help to look into the following two issues:

  1. I tested the code on the HMDB51 dataset; the results can be inconsistent across two consecutive evaluations (running line 414 of main_video.py, test_stats = evaluate(data_loader_val, model, device), twice).

  2. For fine-tuning with Swin Transformer, I ran the code with a smaller batch size (i.e., 32) on 4 RTX 3090 GPUs several times, and the results for tuning the linear layer are around 71+%. Is the larger batch size what makes the difference from the reported 74%? (See the note below.)

Thank you very much in advance!
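A note on the second point: MAE-derived training scripts, which this repository builds on, typically compute the actual learning rate from --blr and the effective batch size, so changing the total batch size also changes the learning rate unless --lr is set explicitly. A minimal sketch of that scaling rule, assuming the MAE convention (function and argument names are illustrative):

def effective_lr(blr: float, batch_size_per_gpu: int, num_gpus: int, accum_iter: int = 1) -> float:
    # MAE-style linear scaling: lr = blr * effective_batch_size / 256.
    eff_batch_size = batch_size_per_gpu * accum_iter * num_gpus
    return blr * eff_batch_size / 256

# Example: blr = 0.1 with a total batch size of 128 gives lr = 0.05,
# while a total batch size of 512 gives lr = 0.2, a 4x difference that
# could plausibly account for part of the gap.
print(effective_lr(0.1, 32, 4))  # 0.05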

generalize to more downstream tasks

Hi, thanks a lot for the wonderful work~!
I notice that there is no fine-tuning result on ImageNet, nor on downstream tasks like semantic segmentation or object detection.
Have you tried the tasks mentioned above? Can the method generalize to them?

Thanks again for your kind reply.

Data split for HMDB51 dataset

The paper says "HMDB51 is composed of 6,849 videos with 51 categories, making a split of 3.5k/1.5k train/val videos." Do you conduct a random split on the 6,849 videos? And how can I get the test videos?

Two questions about the experimental results in Table 1 of the paper.

Hi, I would like to ask you two questions about the experimental results in the paper's Table 1.
Where was the full-tuning accuracy of 53.97 on SSv2 obtained?
When I read VideoMAE, I found that pre-training on SSv2 and then fine-tuning on SSv2 can reach 69.3. I know your paper uses K400 pre-trained parameters, but in my own experiments I can reach 65+ with 50 epochs of fine-tuning on SSv2.

  1. So my first question is: where does the 53.97 come from?
  2. My second question: I could not find the numbers in the screenshot below anywhere in the table; are they written incorrectly?

[screenshot]

Some questions.

This is very interesting work and it has inspired me a lot; thank you for sharing the paper!
After reading, I have some questions. Could you help me with them? Thank you very much!
(1) Table 2c gives an ablation on the scale factor s, but the article does not explain why s is needed. The residual forms I have seen elsewhere are directly additive, which is equivalent to s = 1, so why use s here? (See the sketch below.)
(2) Why is full-tuning inferior to AdaptFormer in the video experiments (Table 1)? In other words, what properties of AdaptFormer give it such good performance?

Thank you!
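On question (1), the role of s can be seen from the scaled residual itself; a minimal sketch (names are illustrative):

import torch

def adapter_residual(x: torch.Tensor, adapter_out: torch.Tensor, s: float = 0.1) -> torch.Tensor:
    # s = 1 recovers the plain additive residual used elsewhere; a smaller s
    # limits how strongly the newly added, lightly trained branch perturbs the
    # frozen backbone's features, which is the trade-off Table 2c ablates.
    return x + s * adapter_out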

Converting VideoMAE weights

Please confirm: if we are using VideoMAE weights, do we need to use 'model' or 'module' in line 9 of convert.py? It seems it can't find any key named 'model'.
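One way to settle this for a given file is to inspect the checkpoint's top-level keys before deciding what line 9 of convert.py should read; a small diagnostic sketch (the file name is a placeholder):

import torch

ckpt = torch.load("videomae_checkpoint.pth", map_location="cpu")
print(list(ckpt.keys()))  # look for a wrapper key such as 'model', 'module', or 'state_dict'

# Use whichever wrapper key actually exists; otherwise treat the file as a raw state_dict.
for key in ("model", "module", "state_dict"):
    if key in ckpt:
        state_dict = ckpt[key]
        break
else:
    state_dict = ckpt
print(list(state_dict.keys())[:5])  # sample a few parameter names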

ViT-B IN21K weights

Can you please share the converted weights of IN21K that can be used to finetune for action recognition?

Which ViT IN21K weights did you use?

Thanks for sharing your great work. 🚀
Can you please provide/point to the pre-trained weights used in section A.4 of the paper? i.e. the weights of the ViT trained on ImageNet21K.

Thank you

question about reproducing the results

Hi,
Thank you for your great work.
I want to know how I can reproduce the CIFAR-100 full-tuning result (85.90); I don't know the specific hyperparameters needed to get it.
I only get 85.48:
{"train_lr": 6.4497597627254405e-06, "train_loss": 0.19667177986449155, "test_loss": 0.7042211120641684, "test_acc1": 85.48, "test_acc5": 97.35, "epoch": 99, "n_parameters": 85875556}
My experiment setting is:
python3 -m torch.distributed.launch --nproc_per_node=1 --use_env main_image.py \
    --batch_size 128 --cls_token \
    --finetune xxxx \
    --dist_eval --data_path xxxx \
    --output_dir xxxx \
    --drop_path 0.0 --blr 0.1 \
    --dataset cifar100 --fulltune

Also, for AdaptFormer-64, I get "test_acc1": 85.8, "test_acc5": 97.92; how can I get 85.90?
python3 -m torch.distributed.launch --nproc_per_node=1 --use_env main_image.py \
    --batch_size 128 --cls_token \
    --finetune xxx \
    --dist_eval --data_path xxxx \
    --output_dir xxxx \
    --drop_path 0.0 --blr 0.1 \
    --dataset cifar100 --ffn_adapt

missing keys when load pre-trained checkpoint

Thanks for sharing the code for your excellent work! When I run main_image.py with the provided pre-trained checkpoint, I get the following info:

[screenshot]

For example, one of the keys in the model's state_dict is blocks.0.fc1.weight, but it is blocks.0.mlp.fc1.weight in the pre-trained checkpoint. I also get poor results when I train on CIFAR-100.

I suspect that I am doing something wrong. Could you help me? Thank you very much!
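If the only mismatch is the extra "mlp." segment, a key remapping along the following lines usually lets the checkpoint load cleanly (a hedged sketch, not the repository's own loading code; whether it also explains the accuracy drop is a separate question):

import torch
import torch.nn as nn

def load_mae_checkpoint(model: nn.Module, ckpt_path: str) -> None:
    # Load an MAE checkpoint whose MLP keys carry an extra "mlp." segment,
    # e.g. blocks.0.mlp.fc1.weight in the file vs. blocks.0.fc1.weight in the model.
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state_dict = ckpt.get("model", ckpt)
    remapped = {k.replace(".mlp.fc", ".fc"): v for k, v in state_dict.items()}
    msg = model.load_state_dict(remapped, strict=False)
    print("missing keys:", msg.missing_keys)
    print("unexpected keys:", msg.unexpected_keys)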

Questions about layernorm

Thanks for your excellent work.

I find that the implementation does not match the paper (Fig. 2b):

In line 79 of custom_modules.py, the input of adaptmlp is not processed by layernorm.

I have also checked the instantiation of the Adapter class; it seems that adapter_layernorm_option is always set to "None" according to tuning_config in main_image.py and main_video.py, so the input is not normalized inside Adapter either.

I wonder whether the code is correct or Fig. 2b needs to be revised.
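For readers following along, the behaviour described above corresponds to an adapter whose internal LayerNorm is optional and disabled by default; a hedged sketch of that pattern (option values, module names, and defaults here are illustrative rather than the repository's exact ones):

import torch
import torch.nn as nn

class Adapter(nn.Module):
    # Bottleneck adapter with an optional internal LayerNorm:
    # "in" normalizes the adapter input (closer to Fig. 2b), "out" normalizes
    # the adapter output, and None skips normalization entirely (the default
    # the issue observes via tuning_config).
    def __init__(self, dim, bottleneck=64, scale=0.1, adapter_layernorm_option=None):
        super().__init__()
        self.option = adapter_layernorm_option
        self.norm = nn.LayerNorm(dim) if adapter_layernorm_option in ("in", "out") else None
        self.down_proj = nn.Linear(dim, bottleneck)
        self.act = nn.ReLU()
        self.up_proj = nn.Linear(bottleneck, dim)
        self.scale = scale

    def forward(self, x):
        if self.option == "in":
            x = self.norm(x)
        h = self.up_proj(self.act(self.down_proj(x))) * self.scale
        if self.option == "out":
            h = self.norm(h)
        return h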

Evaluation

Hi, I can train with this command:

python3 -m torch.distributed.launch --nproc_per_node=2 --use_env main_image.py \
    --batch_size 64 --cls_token \
    --finetune ./pretrained/mae_pretrain_vit_b.pth \
    --dist_eval --data_path ./data/ \
    --output_dir ./output/ \
    --drop_path 0.0 --blr 0.1 \
    --dataset cifar100 --ffn_adapt

After training, I got a lot of checkpoint .pth files. However, I could not find a test command. Could you give some suggestions for test commands?

How to add adapt-mlp to swin-transformer?

Thanks for sharing such great work!
I have a question about how to use AdaptMLP in Swin: since the channel dimension differs across Swin's stages, how should we set the middle (bottleneck) channel in this case?
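For what it's worth, two common choices, assumptions here rather than anything the paper or repository prescribes, are to keep the bottleneck width fixed across stages or to scale it with each stage's channel width; a hypothetical sketch:

import torch.nn as nn

def make_stage_adapters(stage_dims=(96, 192, 384, 768), bottleneck=64, proportional=False):
    # One bottleneck adapter per Swin stage. stage_dims are Swin-T-style channel
    # widths; either keep a fixed bottleneck or scale it with the stage width.
    adapters = nn.ModuleList()
    for dim in stage_dims:
        mid = max(1, dim // 12) if proportional else bottleneck
        adapters.append(nn.Sequential(nn.Linear(dim, mid), nn.ReLU(), nn.Linear(mid, dim)))
    return adapters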

Hyper-parameter for VPT

Thanks a lot for the great work! I am having trouble reproducing the results for VPT [46], so I wonder whether you could share the hyper-parameters used for the VPT results reported in Table 5 under supervised pre-training.

Kind regards,
Haoyu

[ main_video.py : SSv2 dataset ]

Hi,

Wonderful work, thanks for sharing the code.

I'm trying to re-train the model on the SSv2 dataset using AdaptFormer, but I wasn't able to use the provided converted checkpoint (videomae_pretrain_vit_b_1600.pth).

It raises the warning "Warning: Double check" (line 318, main_video.py), since the checkpoint does not match any of the if-blocks in main_video.py.

What should the label for this checkpoint be so that the model loads the weights and starts training?

Thanks

Question about AdaptFormer architecture

Hello. I am deeply grateful to you for releasing the code of your amazing research.
However, I noticed that the structure of AdaptFormer differs slightly between Figure 2(b) of the paper and the released code.
In the paper's figure, the trainable layers (AdaptMLP) are shown receiving data after the second LayerNorm.
On the other hand, in the code (line 79 in models/custom_modules.py), adaptmlp receives the output of multi-head attention as its input.
Which of the two is correct?
Please let me know if I misunderstand.
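As a point of reference, the code path described above corresponds to a block in which the adapter branch runs in parallel to the frozen MLP and consumes the post-attention features (i.e. before the second LayerNorm). A simplified sketch of that reading follows; module names and the residual bookkeeping are illustrative, not the repository's exact code:

import torch
import torch.nn as nn

class ParallelAdapterBlock(nn.Module):
    # Transformer block with an AdaptMLP-style branch parallel to the frozen MLP.
    def __init__(self, dim, num_heads, bottleneck=64, scale=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.adaptmlp = nn.Sequential(nn.Linear(dim, bottleneck), nn.ReLU(), nn.Linear(bottleneck, dim))
        self.scale = scale

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # The adapter sees x right after the attention residual, without norm2,
        # which is the behaviour the issue describes for line 79.
        adapt = self.adaptmlp(x) * self.scale
        return x + self.mlp(self.norm2(x)) + adapt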

Questions about data split.

Thanks for your great work. I would like to know on which data split (val or test) the performance in your paper is reported. I notice that the final testing seems to be conducted on the validation set:

mode = 'test'
anno_path = os.path.join(args.data_path, 'val.csv') 

question about the initialization

Hello, thank you for doing such a great job. Regarding the parameter initialization in AdaptFormer: why are the weight and bias of up_proj initialized to 0? In the paper you mention this is done to ensure stability during early training, but if these parameters are initialized to 0, the output of the adapter branch is 0, so it seems the added adapter would have no impact on the model's training. Looking forward to your reply!
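One way to see why the branch is not dead: zero-initialising up_proj makes the adapter output exactly zero at step 0, so the network starts out identical to the frozen pre-trained backbone, yet up_proj still receives non-zero gradients because the bottleneck activations feeding it are non-zero. A small self-contained check (dimensions chosen only for illustration):

import torch
import torch.nn as nn

dim, bottleneck = 768, 64
down_proj = nn.Linear(dim, bottleneck)
up_proj = nn.Linear(bottleneck, dim)
nn.init.zeros_(up_proj.weight)
nn.init.zeros_(up_proj.bias)

x = torch.randn(2, 197, dim)
hidden = torch.relu(down_proj(x))
out = up_proj(hidden)
print(out.abs().max())  # 0 at initialization: the adapter does not change the backbone's output yet

# The branch still learns: dL/dW_up depends on the non-zero `hidden`, so W_up
# receives non-zero gradients on the first update and the adapter starts
# contributing from the next step onward.
loss = (out - torch.randn_like(out)).pow(2).mean()
loss.backward()
print(up_proj.weight.grad.abs().max() > 0)  # True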
