shoufachen / adaptformer

[NeurIPS 2022] Implementation of "AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition"

Home Page: https://arxiv.org/abs/2205.13535

License: MIT License

Topics: adapter, recognition, vision-transformer, visual-adapter, neurips-2022

adaptformer's Introduction

[NeurIPS 2022] AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition

[teaser figure]

This is a PyTorch implementation of the paper AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition.

Shoufa Chen1*, Chongjian Ge1*, Zhan Tong2, Jiangliu Wang2,3, Yibing Song2, Jue Wang2, Ping Luo1
1The University of Hong Kong, 2Tencent AI Lab, 3The Chinese University of Hong Kong
*denotes equal contribution

Catalog

  • Video code
  • Image code

Usage

Install

  • Tesla V100 (32G): CUDA 10.1 + PyTorch 1.6.0 + torchvision 0.7.0
  • timm 0.4.8
  • einops
  • easydict

Data Preparation

See DATASET.md.

Training

Start

# video
OMP_NUM_THREADS=1 python3 -m torch.distributed.launch \
    --nproc_per_node=8 --nnodes=8 \
    --node_rank=$1 --master_addr=$2 --master_port=22234 \
    --use_env main_video.py \
    --finetune /path/to/pre_trained/checkpoints \
    --output_dir /path/to/output \
    --batch_size 16 --epochs 90 --blr 0.1 --weight_decay 0.0 --dist_eval \
    --data_path /path/to/SSV2 --data_set SSV2 \
    --ffn_adapt

Run this on each of the 8 nodes. --master_addr is set to the IP of node 0, and --node_rank is 0, 1, ..., 7 for the corresponding node.

# image
python3 -m torch.distributed.launch --nproc_per_node=8 --use_env main_image.py \
    --batch_size 128 --cls_token \
    --finetune /path/to/pre_trained/mae_pretrain_vit_b.pth \
    --dist_eval --data_path /path/to/data \
    --output_dir /path/to/output  \
    --drop_path 0.0  --blr 0.1 \
    --dataset cifar100 --ffn_adapt

To obtain the pre-trained checkpoint, see PRETRAIN.md.

Acknowledgement

The project is based on MAE, VideoMAE, timm, and MAM. Thanks for their awesome work.

Citation

@article{chen2022adaptformer,
      title={AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition},
      author={Chen, Shoufa and Ge, Chongjian and Tong, Zhan and Wang, Jiangliu and Song, Yibing and Wang, Jue and Luo, Ping},
      journal={arXiv preprint arXiv:2205.13535},
      year={2022}
}

License

This project is under the MIT license. See LICENSE for details.

adaptformer's People

Contributors

shoufachen

adaptformer's Issues

Inconsistency in pretrained MAE ViT-B/16 weights

Hi
I have a question about your pretrained MAE ViT-B/16 weights.
When I use "https://github.com/ShoufaChen/AdaptFormer/releases/download/v0.1/mae_pretrain_vit_b.pth", the checkpoint released on your GitHub repo, and train on CIFAR-100 for 1 epoch, I generally get
Acc@1 37.330 Acc@5 67.820 loss 2.768
(the command is python main_image.py --batch_size=128 --cls_token --epochs=1 --finetune=mae_pretrain_vit_b.pth --dist_eval --data_path=cifar100 --output_dir=output --num_workers=16 --drop_path=0.0 --blr=0.1 --dataset=cifar100 --ffn_adapt)
But when I use "https://dl.fbaipublicfiles.com/mae/pretrain/mae_pretrain_vit_base.pth", the checkpoint referenced in your paper's text, and train on CIFAR-100 for 1 epoch, I generally get
Acc@1 5.660 Acc@5 21.350 loss 4.295
(the command is python main_image.py --batch_size=128 --cls_token --epochs=1 --finetune=mae_pretrain_vit_base.pth --dist_eval --data_path=cifar100 --output_dir=output --num_workers=16 --drop_path=0.0 --blr=0.1 --dataset=cifar100 --ffn_adapt)
These results are consistent over multiple runs.
My questions are: what is the difference between these two pretrained weights, and what causes this huge difference in results?

Code for NUS-WIDE

Hello,

Could you please share the code on the NUS-WIDE dataset? Or, can you indicate what key adjustments I need to make if I adapt the existing open-source code to fit the NUS-WIDE dataset?

Thanks in advance!

How to solve the problem of NaN loss?

Hi,

I reproduced this code on the SSv2 dataset. I follow blr 0.1 and use 2 GPUs with a batch size of 7 (an effective total batch size of 14), but the loss becomes NaN at epoch 14. How can I solve this problem? Thanks~

Why should we even use adapters in image ViT?

The average running time per epoch is 2 minutes for full-tuning and 2:30 for AdaptFormer on a single RTX 3090, and the results (as we can see in your paper's tables) aren't significantly better; most of the time full-tuning even works better. This raises the question of why we should care about using adapters and a "small" number of tunable parameters in image ViT. Even the memory difference isn't that significant (10 GB for the adapter vs. 15 GB for full-tuning).
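For reference, the tunable-parameter count that this trade-off hinges on can be checked directly; a minimal PyTorch sketch (the freezing rule keyed on the module name "adaptmlp" and the example numbers in the comment are assumptions for illustration):

import torch.nn as nn

def freeze_all_but_adapter(model: nn.Module, adapter_key: str = "adaptmlp") -> None:
    # Freeze every parameter whose name does not contain adapter_key.
    for name, param in model.named_parameters():
        param.requires_grad = adapter_key in name

def count_params(model: nn.Module):
    # Return (trainable, total) parameter counts.
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total

# Usage, assuming `model` is a ViT whose adapter modules are named "adaptmlp":
#   freeze_all_but_adapter(model)
#   print(count_params(model))  # on the order of 1M trainable vs. ~86M total for ViT-B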

[ Preprocessing SSv2 ]

Could you please share the pre-processing script for SSv2? It is not available in the provided VideoMAE repo.

Could you also share the csv and txt files for the SSv2 splits, if possible?

Thanks !

Evaluation Results are not Consistent in Consecutive Evaluations & Sensitivity to Batch Size

Thank you for sharing this wonderful work! Could you help to look into the following two issues:

  1. I tested the code on the HMDB51 dataset; the results can be inconsistent across two consecutive evaluations (running line 414 of main_video.py, test_stats = evaluate(data_loader_val, model, device), twice).

  2. For fine-tuning with Swin Transformer, I ran the code with a smaller batch size (i.e., 32) on 4 RTX 3090 GPUs several times, and the results for tuning the linear layer are around 71+%. Is the larger batch size what makes the difference from the reported 74%? (See the note below.)

Thank you very much in advance!
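A note on the second point: MAE-derived training scripts, which this repository builds on, typically compute the actual learning rate from --blr and the effective batch size, so changing the total batch size also changes the learning rate unless --lr is set explicitly. A minimal sketch of that scaling rule, assuming the MAE convention (function and argument names are illustrative):

def effective_lr(blr: float, batch_size_per_gpu: int, num_gpus: int, accum_iter: int = 1) -> float:
    # MAE-style linear scaling: lr = blr * effective_batch_size / 256.
    eff_batch_size = batch_size_per_gpu * accum_iter * num_gpus
    return blr * eff_batch_size / 256

# Example: blr = 0.1 with a total batch size of 128 gives lr = 0.05,
# while a total batch size of 512 gives lr = 0.2, a 4x difference that
# could plausibly account for part of the gap.
print(effective_lr(0.1, 32, 4))  # 0.05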

generalize to more downstream tasks

Hi, thanks a lot for the wonderful work~!
I notice that there is no fine-tuning result on ImageNet, nor on downstream tasks like semantic segmentation or object detection.
Have you tried the tasks mentioned above? Can the method generalize to them?

Thanks again for your kind reply.

Data split for HMDB51 dataset

The paper says "HMDB51 is composed of 6,849 videos with 51 categories, making a split of 3.5k/1.5k train/val videos." Do you conduct a random split on the 6,849 videos? And how can I get the test videos?

Two questions about the experimental results in Table 1 of the paper.

Hi, I would like to ask you two questions about the experimental results in the paper's Table 1.
Where was the full-tuning accuracy of 53.97 on SSv2 obtained?
When I read VideoMAE, I found that pre-training on SSv2 and then fine-tuning on SSv2 can reach 69.3. I know your paper uses K400 pre-trained parameters, but in my own experiments I can reach 65+ with 50 epochs of fine-tuning on SSv2.

  1. So my first question is: where does the 53.97 come from?
  2. My second question: I could not find the numbers in the screenshot below anywhere in the table; are they written incorrectly?

[screenshot]

Some questions.

This is very interesting work and it has inspired me a lot; thank you for sharing the paper!
After reading, I have some questions. Could you help me with them? Thank you very much!
(1) Table 2c gives an ablation on the scale factor s, but the article does not explain why s is needed. The residual forms I have seen elsewhere are directly additive, which is equivalent to s = 1, so why use s here? (See the sketch below.)
(2) Why is full-tuning inferior to AdaptFormer in the video experiments (Table 1)? In other words, what properties of AdaptFormer give it such good performance?

Thank you!
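On question (1), the role of s can be seen from the scaled residual itself; a minimal sketch (names are illustrative):

import torch

def adapter_residual(x: torch.Tensor, adapter_out: torch.Tensor, s: float = 0.1) -> torch.Tensor:
    # s = 1 recovers the plain additive residual used elsewhere; a smaller s
    # limits how strongly the newly added, lightly trained branch perturbs the
    # frozen backbone's features, which is the trade-off Table 2c ablates.
    return x + s * adapter_out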

Converting VideoMAE weights

Please confirm: if we are using VideoMAE weights, do we need to use 'model' or 'module' in line 9 of convert.py? It seems it can't find any key named 'model'.
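One way to settle this for a given file is to inspect the checkpoint's top-level keys before deciding what line 9 of convert.py should read; a small diagnostic sketch (the file name is a placeholder):

import torch

ckpt = torch.load("videomae_checkpoint.pth", map_location="cpu")
print(list(ckpt.keys()))  # look for a wrapper key such as 'model', 'module', or 'state_dict'

# Use whichever wrapper key actually exists; otherwise treat the file as a raw state_dict.
for key in ("model", "module", "state_dict"):
    if key in ckpt:
        state_dict = ckpt[key]
        break
else:
    state_dict = ckpt
print(list(state_dict.keys())[:5])  # sample a few parameter names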

ViT-B IN21K weights

Can you please share the converted weights of IN21K that can be used to finetune for action recognition?

Which ViT IN21K weights did you use?

Thanks for sharing your great work. 🚀
Can you please provide/point to the pre-trained weights used in section A.4 of the paper? i.e. the weights of the ViT trained on ImageNet21K.

Thank you

question about reproducing the results

Hi,
Thank you for your great work.
I want to know how I can reproduce the CIFAR-100 full-tuning result (85.90); I don't know the specific hyperparameters needed to get it.
I only get 85.48:
{"train_lr": 6.4497597627254405e-06, "train_loss": 0.19667177986449155, "test_loss": 0.7042211120641684, "test_acc1": 85.48, "test_acc5": 97.35, "epoch": 99, "n_parameters": 85875556}
My experiment setting is:
python3 -m torch.distributed.launch --nproc_per_node=1 --use_env main_image.py \
    --batch_size 128 --cls_token \
    --finetune xxxx \
    --dist_eval --data_path xxxx \
    --output_dir xxxx \
    --drop_path 0.0 --blr 0.1 \
    --dataset cifar100 --fulltune

Also, for AdaptFormer-64, I get "test_acc1": 85.8, "test_acc5": 97.92; how can I get 85.90?
python3 -m torch.distributed.launch --nproc_per_node=1 --use_env main_image.py \
    --batch_size 128 --cls_token \
    --finetune xxx \
    --dist_eval --data_path xxxx \
    --output_dir xxxx \
    --drop_path 0.0 --blr 0.1 \
    --dataset cifar100 --ffn_adapt

missing keys when load pre-trained checkpoint

Thanks for sharing the code for your excellent work! When I run main_image.py with the provided pre-trained checkpoint, I get the following info:

[screenshot]

For example, one of the keys in the model's state_dict is blocks.0.fc1.weight, but it is blocks.0.mlp.fc1.weight in the pre-trained checkpoint. I also get poor results when I train on CIFAR-100.

I suspect that I am doing something wrong. Could you help me? Thank you very much!
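If the only mismatch is the extra "mlp." segment, a key remapping along the following lines usually lets the checkpoint load cleanly (a hedged sketch, not the repository's own loading code; whether it also explains the accuracy drop is a separate question):

import torch
import torch.nn as nn

def load_mae_checkpoint(model: nn.Module, ckpt_path: str) -> None:
    # Load an MAE checkpoint whose MLP keys carry an extra "mlp." segment,
    # e.g. blocks.0.mlp.fc1.weight in the file vs. blocks.0.fc1.weight in the model.
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state_dict = ckpt.get("model", ckpt)
    remapped = {k.replace(".mlp.fc", ".fc"): v for k, v in state_dict.items()}
    msg = model.load_state_dict(remapped, strict=False)
    print("missing keys:", msg.missing_keys)
    print("unexpected keys:", msg.unexpected_keys)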

Questions about layernorm

Thanks for your excellent work.

I find that the implementation does not match the paper (Fig. 2b):

In line 79 of custom_modules.py, the input of adaptmlp is not processed by layernorm.

I have also checked the instantiation of the Adapter class; it seems that adapter_layernorm_option is always set to "None" according to tuning_config in main_image.py and main_video.py, so the input is not normalized inside Adapter either.

I wonder whether the code is correct or Fig. 2b needs to be revised.
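For readers following along, the behaviour described above corresponds to an adapter whose internal LayerNorm is optional and disabled by default; a hedged sketch of that pattern (option values, module names, and defaults here are illustrative rather than the repository's exact ones):

import torch
import torch.nn as nn

class Adapter(nn.Module):
    # Bottleneck adapter with an optional internal LayerNorm:
    # "in" normalizes the adapter input (closer to Fig. 2b), "out" normalizes
    # the adapter output, and None skips normalization entirely (the default
    # the issue observes via tuning_config).
    def __init__(self, dim, bottleneck=64, scale=0.1, adapter_layernorm_option=None):
        super().__init__()
        self.option = adapter_layernorm_option
        self.norm = nn.LayerNorm(dim) if adapter_layernorm_option in ("in", "out") else None
        self.down_proj = nn.Linear(dim, bottleneck)
        self.act = nn.ReLU()
        self.up_proj = nn.Linear(bottleneck, dim)
        self.scale = scale

    def forward(self, x):
        if self.option == "in":
            x = self.norm(x)
        h = self.up_proj(self.act(self.down_proj(x))) * self.scale
        if self.option == "out":
            h = self.norm(h)
        return h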

Evaluation

Hi, I can train with this command:

python3 -m torch.distributed.launch --nproc_per_node=2 --use_env main_image.py \
    --batch_size 64 --cls_token \
    --finetune ./pretrained/mae_pretrain_vit_b.pth \
    --dist_eval --data_path ./data/ \
    --output_dir ./output/ \
    --drop_path 0.0 --blr 0.1 \
    --dataset cifar100 --ffn_adapt

After training, I got a lot of checkpoint .pth files. However, I could not find a test command. Could you give some suggestions for test commands?

How to add adapt-mlp to swin-transformer?

Thanks for sharing such great work!
I have a question about how to use AdaptMLP in Swin: since the channel dimension differs across Swin's stages, how should we set the middle (bottleneck) channel in this case?
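For what it's worth, two common choices, assumptions here rather than anything the paper or repository prescribes, are to keep the bottleneck width fixed across stages or to scale it with each stage's channel width; a hypothetical sketch:

import torch.nn as nn

def make_stage_adapters(stage_dims=(96, 192, 384, 768), bottleneck=64, proportional=False):
    # One bottleneck adapter per Swin stage. stage_dims are Swin-T-style channel
    # widths; either keep a fixed bottleneck or scale it with the stage width.
    adapters = nn.ModuleList()
    for dim in stage_dims:
        mid = max(1, dim // 12) if proportional else bottleneck
        adapters.append(nn.Sequential(nn.Linear(dim, mid), nn.ReLU(), nn.Linear(mid, dim)))
    return adapters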

Hyper-parameter for VPT

Thanks a lot for the great work! I am having trouble reproducing the results for VPT [46], so I wonder whether you could share the hyper-parameters used for the VPT results reported in Table 5 under supervised pre-training.

Kind regards,
Haoyu

[ main_video.py : SSv2 dataset ]

Hi,

Wonderful work, thanks for sharing the code.

I'm trying to re-train the model on the SSv2 dataset using AdaptFormer, but I wasn't able to use the provided converted checkpoint (videomae_pretrain_vit_b_1600.pth).

It raises the warning "Warning: Double check" (line 318, main_video.py), since the checkpoint does not match any of the if-blocks in main_video.py.

What should the label for this checkpoint be so that the model loads the weights and starts training?

Thanks

Question about AdaptFormer architecture

Hello. I am deeply grateful to you for releasing the code of your amazing research.
However, I noticed that the structure of AdaptFormer differs slightly between Figure 2(b) of the paper and the released code.
In the paper's figure, the trainable layers (AdaptMLP) are shown receiving data after the second LayerNorm.
On the other hand, in the code (line 79 in models/custom_modules.py), adaptmlp receives the output of multi-head attention as its input.
Which of the two is correct?
Please let me know if I misunderstand.
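As a point of reference, the code path described above corresponds to a block in which the adapter branch runs in parallel to the frozen MLP and consumes the post-attention features (i.e. before the second LayerNorm). A simplified sketch of that reading follows; module names and the residual bookkeeping are illustrative, not the repository's exact code:

import torch
import torch.nn as nn

class ParallelAdapterBlock(nn.Module):
    # Transformer block with an AdaptMLP-style branch parallel to the frozen MLP.
    def __init__(self, dim, num_heads, bottleneck=64, scale=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.adaptmlp = nn.Sequential(nn.Linear(dim, bottleneck), nn.ReLU(), nn.Linear(bottleneck, dim))
        self.scale = scale

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # The adapter sees x right after the attention residual, without norm2,
        # which is the behaviour the issue describes for line 79.
        adapt = self.adaptmlp(x) * self.scale
        return x + self.mlp(self.norm2(x)) + adapt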

Questions about data split.

Thanks for your great work. I would like to know on which data split (val or test) the performance in your paper is reported. I notice that the final testing seems to be conducted on the validation set:

mode = 'test'
anno_path = os.path.join(args.data_path, 'val.csv') 

question about the initialization

Hello, thank you for doing such a great job. Regarding the parameter initialization in AdaptFormer: why are the weight and bias of up_proj initialized to 0? In the paper you mention this is done to ensure stability during early training, but if these parameters are initialized to 0, the output of the adapter branch is 0, so it seems the added adapter would have no impact on the model's training. Looking forward to your reply!
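One way to see why the branch is not dead: zero-initialising up_proj makes the adapter output exactly zero at step 0, so the network starts out identical to the frozen pre-trained backbone, yet up_proj still receives non-zero gradients because the bottleneck activations feeding it are non-zero. A small self-contained check (dimensions chosen only for illustration):

import torch
import torch.nn as nn

dim, bottleneck = 768, 64
down_proj = nn.Linear(dim, bottleneck)
up_proj = nn.Linear(bottleneck, dim)
nn.init.zeros_(up_proj.weight)
nn.init.zeros_(up_proj.bias)

x = torch.randn(2, 197, dim)
hidden = torch.relu(down_proj(x))
out = up_proj(hidden)
print(out.abs().max())  # 0 at initialization: the adapter does not change the backbone's output yet

# The branch still learns: dL/dW_up depends on the non-zero `hidden`, so W_up
# receives non-zero gradients on the first update and the adapter starts
# contributing from the next step onward.
loss = (out - torch.randn_like(out)).pow(2).mean()
loss.backward()
print(up_proj.weight.grad.abs().max() > 0)  # True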
