
mae-pytorch's People

Contributors

flishwang, pengzhiliang, tikboahit


mae-pytorch's Issues

About the next plan

Next, will you consider implementing downstream tasks such as classification or segmentation?

How is fine-tuning performed actually?

From my understanding, the authors pretrained the MAE on ImageNet and then used the encoder for fine-tuning. The paper does not explain how the masking is handled at that stage. During fine-tuning, are the images masked? If not, how do they fit an encoder that accepts only a portion of the input? I wonder how you handle it; can you kindly show me in your code?
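
For reference, a minimal sketch of the difference between the two stages, with toy tensors standing in for patch embeddings; the shapes and the all-or-nothing masking behaviour below are assumptions about how this repo follows the paper, not its exact API.

    import torch

    # Toy numbers for ViT-Base/16 at 224x224: 14*14 = 196 patches, 75% masked.
    num_patches, dim, mask_ratio = 196, 768, 0.75
    tokens = torch.randn(1, num_patches, dim)   # stand-in for the patch embeddings

    # Pre-training: build a boolean mask and feed ONLY the visible tokens,
    # so the encoder sees a sequence of length 49, not 196.
    num_masked = int(mask_ratio * num_patches)
    bool_masked_pos = torch.zeros(1, num_patches, dtype=torch.bool)
    bool_masked_pos[0, torch.randperm(num_patches)[:num_masked]] = True
    visible = tokens[~bool_masked_pos].reshape(1, -1, dim)
    print(visible.shape)                        # torch.Size([1, 49, 768])

    # Fine-tuning: no mask at all. The pretrained encoder weights are loaded
    # into a plain ViT (modeling_finetune) whose forward consumes the full
    # 196-token sequence, so nothing in the architecture depends on the
    # subset size used during pre-training.
    full_sequence = tokens                      # shape (1, 196, 768)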

Why isn't the k_bias trained?

Hi, I found that the k_bias of every attention layer is set to not require gradients, while the q_bias and v_bias do require gradients; see the code.

Is there any reason behind this? In common implementations, all the biases are learnable.
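
For context, a minimal sketch of the BEiT-style bias handling this implementation appears to follow; the class below is illustrative, not the repo's exact code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AttnWithSplitBias(nn.Module):
        """Sketch of a qkv projection with learnable q/v biases and a fixed k bias."""
        def __init__(self, dim):
            super().__init__()
            self.qkv = nn.Linear(dim, dim * 3, bias=False)
            self.q_bias = nn.Parameter(torch.zeros(dim))   # learnable
            self.v_bias = nn.Parameter(torch.zeros(dim))   # learnable
            # k_bias is kept as a constant zero vector (requires_grad=False):
            # adding the same bias b to every key only shifts each row of
            # q @ k.T by a constant (q @ b), and the softmax over keys is
            # invariant to that shift, so a learnable k bias is redundant.
            self.register_buffer("k_bias", torch.zeros(dim))

        def forward(self, x):
            qkv_bias = torch.cat((self.q_bias, self.k_bias, self.v_bias))
            return F.linear(x, self.qkv.weight, qkv_bias)  # qkv projection only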

Managing 512×512 input size

Thanks for your code! It's really great work! I want to train MAE on a dataset with 512×512 images. How should I adjust the model structure? Thanks for your help.
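
A rough sketch of what changes at 512×512, assuming the model builders accept a timm-style img_size keyword; check the actual constructor signature in modeling_pretrain.py before relying on it.

    # The patch size stays 16x16, so only the number of patches (and with it the
    # length of the sine-cosine positional table) grows with the resolution.
    img_size, patch_size = 512, 16
    num_patches = (img_size // patch_size) ** 2
    print(num_patches)        # 1024 tokens instead of (224 // 16) ** 2 = 196

    # Hypothetical call; verify the keyword against the real model builder:
    # model = create_model('pretrain_mae_base_patch16_224', pretrained=False,
    #                      img_size=512)
    # Note that self-attention cost grows roughly quadratically with the token
    # count, so expect a large increase in memory and compute.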

Training Equipment

I am very interested in what type of GPU you used for training and how long the training took.

Hello, a little question about this code

I am a beginner and really excited about MAE. Regarding the pretrained model pretrain_mae_vit_base_mask_0.75_400e.pth downloaded from Baidu Yun, I really want to know where to load it in the code. Can it be used with modeling_pretrain.py?
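
A hedged sketch of loading the downloaded weights into the pre-training model defined in modeling_pretrain.py, assuming the registered builder can be called directly and that the weights sit under a 'model' key, which is the usual layout for checkpoints saved by this kind of training script; adjust if your file differs. run_mae_vis.py also takes such a checkpoint path for visualization.

    import torch
    from modeling_pretrain import pretrain_mae_base_patch16_224

    ckpt_path = 'pretrain_mae_vit_base_mask_0.75_400e.pth'
    model = pretrain_mae_base_patch16_224(pretrained=False)

    checkpoint = torch.load(ckpt_path, map_location='cpu')
    # Assumption: weights are nested under a 'model' key; fall back to the raw
    # dict otherwise. Inspect the file if loading reports many missing keys.
    state_dict = checkpoint.get('model', checkpoint)
    msg = model.load_state_dict(state_dict, strict=False)
    print('missing:', msg.missing_keys, 'unexpected:', msg.unexpected_keys)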

How to process the dataset...

I want to know how to process the dataset.
There are a lot of tar files.

Is there a command or a Python script for this?
It is my first time doing this.

Sorry for the basic question.
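
For the ImageNet-1k training split, the usual layout is one big tar containing 1000 per-class tars, each holding that class's JPEGs. Below is a hedged sketch of unpacking it into the folder-per-class layout that a torchvision ImageFolder-style loader (and presumably this repo's --data_path argument) expects; the file names are the standard ILSVRC2012 ones, not something this repo ships.

    import os
    import tarfile

    train_tar = 'ILSVRC2012_img_train.tar'   # standard ImageNet-1k file name
    out_dir = 'imagenet/train'
    os.makedirs(out_dir, exist_ok=True)

    with tarfile.open(train_tar) as outer:
        for member in outer:                  # each member is e.g. n01440764.tar
            class_name = os.path.splitext(member.name)[0]
            class_dir = os.path.join(out_dir, class_name)
            os.makedirs(class_dir, exist_ok=True)
            # Extract the inner per-class tar straight from the outer archive.
            with tarfile.open(fileobj=outer.extractfile(member)) as inner:
                inner.extractall(class_dir)

    # The validation tar is flat; it additionally needs the devkit's label
    # mapping to be sorted into per-class folders.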

A question about pretrained model

Hi, I have a question: if I want to extract features with the pretrained model, should I just run the encoder with the mask set to all zeros?
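
A sketch of that idea with an all-zeros (all-visible) mask, under the assumption that the pre-training model exposes its encoder as model.encoder with an (x, bool_masked_pos) forward; verify both against modeling_pretrain.py.

    import torch
    from modeling_pretrain import pretrain_mae_base_patch16_224

    model = pretrain_mae_base_patch16_224(pretrained=False)  # load weights as usual
    model.eval()

    imgs = torch.randn(1, 3, 224, 224)
    num_patches = (224 // 16) ** 2
    bool_masked_pos = torch.zeros(1, num_patches, dtype=torch.bool)  # mask nothing

    with torch.no_grad():
        # Assumed encoder attribute and signature: with no patch masked out,
        # this should return features for all 196 patch tokens.
        features = model.encoder(imgs, bool_masked_pos)
    print(features.shape)

    # Alternative: load the encoder weights into modeling_finetune's ViT and
    # use its feature-extraction path, which never expects a mask.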

bugs

File "MAE-pytorch/modeling_pretrain.py", line 296, in pretrain_mae_base_patch16_224
    **kwargs)
TypeError: __init__() got an unexpected keyword argument 'num_classes'

[Attention] Information leak in visualization

Good job.

Hi, I think this operation leaks information from the original input image (the mean and variance of each patch).

rec_img = rec_img * (img_squeeze.var(dim=-2, unbiased=True, keepdim=True).sqrt() + 1e-6) + img_squeeze.mean(dim=-2, keepdim=True)

Anyone who uses the model trained with the normalized loss for visualization should pay attention to this operation.

I also suggest the author add a comment on this line. @pengzhiliang

Need to interpolate positional embedding to work at higher resolutions

Hi again, sorry for the slow response in issue #26. I have some more clarifications and visualizations here.

I agree that the sine-cosine embeddings are not learnable. However, it seems they still need to be interpolated for the model to work well. I suspect this is at least partly because they are 1D, so the model has to learn the number of rows/columns. For example, it cannot express "look one patch down" directly, but rather has to express it as "look X patches forward", and X changes if we change the resolution.

I have attached attention visualizations that show what happens if you run on higher res with or without interpolating the positional embedding. As you can see, the non-interpolated version looks much worse and has weird diagonal stripes.

This is not a major issue to me, but I wanted to let you (and anyone else that has the same problem) know about this. I think the best solution is what I mentioned before: to simply include the positional embeddings in the checkpoint even though they are not learnable parameters.

[Attached attention visualizations: the original resolution, higher resolution with interpolation, and higher resolution without interpolation.]

What compute did you use?

The original paper used a 128-core TPUv3, which is far beyond anything I can reach. May I ask what compute you used? I would like to assess whether the hardware I have is fit for pretraining.

The problem of the ImageNet dataset

Hi, first of all, congratulations!
However, ImageNet has too many images, and I cannot decide which ones to use.
Can you tell me the details of the dataset? It would be best to upload it to Baidu Disk or Google Drive.
Thank you!

Weights & Biases logger for MAE-pytorch

Hi @pengzhiliang

I am an ML Engineer at Weights & Biases, and I wanted to know whether you are actively reviewing PRs at the moment. We would love to make a PR to add Weights & Biases experiment tracking and image logging, if you have the time to review it.

I think experiment tracking will help you and your users keep the fine-tuning experiments organized, and image logging will be helpful for visualizing and storing the image reconstructions for better model evaluation :)

We have built integrations into transformers, YOLOv5, PyTorch Lightning, etc., so we should be able to make a quick and clean PR that shouldn't take too much of your time to review.

Running with pretrain_mae_vit_base_mask_0.75_400e.pth gives a model initialization error

Traceback (most recent call last):
  File "run_mae_vis.py", line 138, in <module>
    main(opts)
  File "run_mae_vis.py", line 79, in main
    model = get_model(args)
  File "run_mae_vis.py", line 63, in get_model
    model = create_model(
  File "/home/ub/miniconda3/envs/torch1.8/lib/python3.6/site-packages/timm/models/factory.py", line 57, in create_model
    model = create_fn(**model_args, **kwargs)
  File "/home/ub/bwj/MAE-pytorch/modeling_pretrain.py", line 317, in pretrain_mae_base_patch16_224
    **kwargs)
TypeError: __init__() got an unexpected keyword argument 'num_classes'

Positional embedding not stored in checkpoints - problem for tuning/inference at higher resolution

Hi,

I'm very impressed with the quick reproduction, nice work!

I have tried running inference with the provided models and noticed that the current checkpoints do not contain the positional embedding. This is not an issue when running on the same resolution (224,224). However, it makes it tricky to run inference at higher resolutions, since there is no positional encoding to interpolate.

I have been using the following hard-coded workaround but as far as I can see, the only solution is to change the model so that the positional encoding is stored as part of the checkpoint. Here is the hard-coded solution for loading current models and running tuning/inference:

# this replaces the code from line 334 in run_class_finetuning.py

        # Maybe interpolate position embedding
        old_n_positions = int((224/16)**2)
        if model.pos_embed.shape[1] != old_n_positions:
            embedding_size = model.pos_embed.shape[-1]
            old_pos_embed = modeling_finetune.get_sinusoid_encoding_table(old_n_positions, embedding_size)
            num_patches = model.patch_embed.num_patches
            num_extra_tokens = model.pos_embed.shape[-2] - num_patches
            assert num_extra_tokens == 0, "No support for class tokens"
            # height (== width) for the checkpoint position embedding
            orig_size = int((old_pos_embed.shape[-2] - num_extra_tokens) ** 0.5)
            # height (== width) for the new position embedding
            new_size = int(num_patches ** 0.5)
            # class_token and dist_token are kept unchanged
            if orig_size != new_size:
                print("Position interpolate from %dx%d to %dx%d" % (orig_size, orig_size, new_size, new_size))
                extra_tokens = old_pos_embed[:, :num_extra_tokens]
                # only the position tokens are interpolated
                pos_tokens = old_pos_embed[:, num_extra_tokens:]
                pos_tokens = pos_tokens.reshape(-1, orig_size, orig_size, embedding_size).permute(0, 3, 1, 2)
                pos_tokens = torch.nn.functional.interpolate(
                    pos_tokens, size=(new_size, new_size), mode='bicubic', align_corners=False)
                pos_tokens = pos_tokens.permute(0, 2, 3, 1).flatten(1, 2)
                new_pos_embed = torch.cat((extra_tokens, pos_tokens), dim=1)
                model.pos_embed = new_pos_embed

TypeError: __init__() got an unexpected keyword argument 'in_chans'

Sharing an issue: at first I used the requirements.txt updated on 11.22 and got the following error:

File"/MAE-pytorch/modeling_pretrain.py", line 319, in pretrain_mae_base_patch16_224
**kwargs)
**kwargs)
TypeError: init() got an unexpected keyword argument 'in_chans'

It seems to be caused by timm==0.3.2; upgrading to 0.4.12 solved the problem.
I tested on a V100 (CentOS) and an A100 (Ubuntu), and the issue occurs on both.
Uninstalling timm also seems to let it run... This is my first time reading CV code, so I don't know very much yet.
By the way, did you use a batch size of 64 on the V100?

A little question about the visualization

Hi, thanks a lot for the excellent work!!

When I executed run_mae_vis.py, I noticed a strange phenomenon: the model is able to reconstruct the original image without losing much information even when only one patch is given.

As shown in the following example, the mask_ratio is set to 0.999, ensuring that the model can only see one patch. However, it is really strange that the model can almost reconstruct the original image (e.g. the shape of the bird, the yellow feathers...).

I think it should be impossible for the model to do this. Therefore, may I ask whether some information about the original image, other than the visible patch, is fed into the model? Really looking forward to your reply! Thanks!
[Example reconstruction attached.]

Grad Norm Becomes Inf

[Screenshot of the training log attached.]
On two GPUs.

Epoch: [24] [1230/1251] eta: 0:00:06 lr: 0.000375 min_lr: 0.000375 loss: 0.6870 (0.6848) loss_scale: 2097152.0000 (2046895.3111) weight_decay: 0.0500 (0.0500) grad_norm: 0.0929 (0.0969) time: 0.3023 data: 0.0010 max mem: 8361
Epoch: [24] [1240/1251] eta: 0:00:03 lr: 0.000375 min_lr: 0.000375 loss: 0.6877 (0.6848) loss_scale: 2097152.0000 (2047300.2804) weight_decay: 0.0500 (0.0500) grad_norm: 0.0942 (0.0971) time: 0.2731 data: 0.0018 max mem: 8361
Epoch: [24] [1250/1251] eta: 0:00:00 lr: 0.000375 min_lr: 0.000375 loss: 0.6856 (0.6849) loss_scale: 2097152.0000 (2047698.7754) weight_decay: 0.0500 (0.0500) grad_norm: 0.0942 (0.0971) time: 0.2560 data: 0.0012 max mem: 8361
Epoch: [24] Total time: 0:06:23 (0.3067 s / it)
Averaged stats: lr: 0.000375 min_lr: 0.000375 loss: 0.6856 (0.6851) loss_scale: 2097152.0000 (2047698.7754) weight_decay: 0.0500 (0.0500) grad_norm: 0.0942 (0.0971)
Epoch: [25] [   0/1251] eta: 1:25:25 lr: 0.000375 min_lr: 0.000375 loss: 0.6770 (0.6770) loss_scale: 2097152.0000 (2097152.0000) weight_decay: 0.0500 (0.0500) grad_norm: 0.0918 (0.0918) time: 4.0974 data: 3.7792 max mem: 8361
Epoch: [25] [  10/1251] eta: 0:13:50 lr: 0.000375 min_lr: 0.000375 loss: 0.6854 (0.6838) loss_scale: 2097152.0000 (2097152.0000) weight_decay: 0.0500 (0.0500) grad_norm: 0.0910 (0.0949) time: 0.6694 data: 0.3704 max mem: 8361

How does this phenomenon occur?

Reported result for ViT-Large

Hi,

I notice the reported performance of ViT-Large is 84.5%. I was wondering, does this correspond to the baseline in the paper?

In other words, can we say the reproduced vs. official result is 84.5% vs. 84.9%?

Thanks

Info about the pretrained models

Can you please give more info about the pretrained models?

  • How many warmup epochs did you use?
  • How many GPUs, what kind, and what was the batch size and the learning rate?

Thank you very much!

normalize

Thanks for your implementation!

The input seems to be normalized twice: first in DataAugmentationForMAE, and then again in train_one_epoch.
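
For what it's worth, the two normalizations appear to serve different purposes if this follows the paper's normalized-pixel loss: the transform normalizes the encoder input with dataset statistics, while train_one_epoch builds the regression target by normalizing each patch with its own mean and variance. A simplified per-channel sketch follows (the repo appears to group all three channels into one patch vector, so treat the exact shapes as assumptions):

    import torch

    # 1) Dataset-level normalization (DataAugmentationForMAE): whitens the
    #    ENCODER INPUT with the ImageNet channel mean/std.
    mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
    std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)
    img = torch.rand(1, 3, 224, 224)
    model_input = (img - mean) / std

    # 2) Per-patch normalization (train_one_epoch): builds the REGRESSION
    #    TARGET by normalizing each 16x16 patch with its own mean/variance,
    #    i.e. the normalized-pixel loss variant of MAE.
    patches = img.unfold(2, 16, 16).unfold(3, 16, 16).reshape(1, 3, -1, 256)
    target = (patches - patches.mean(dim=-1, keepdim=True)) / (
        patches.var(dim=-1, unbiased=True, keepdim=True).sqrt() + 1e-6)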

the problem of result

I'm sorry to bother you again, and I appreciate the previous help. The program now runs, but the result is not good. I am using a dataset that I created myself. I would like to know: if I use a larger dataset, will the result be better? I also see that the patch size you set is 16×16; if I choose a smaller one, such as 6×6, will the result be better and clearer?
Thank you for your attention. @pengzhiliang

Positional embedding interpolation

I don't understand these lines of code in the fine-tuning script.

num_patches = model.patch_embed.num_patches
num_extra_tokens = model.pos_embed.shape[-2] - num_patches
orig_size = int((pos_embed_checkpoint.shape[-2] - num_extra_tokens) ** 0.5)

What is the meaning of the extra tokens?
Is there any assumption that the new image size is smaller or larger than the original image size? The same goes for the patch size.
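
A small worked example of those lines, using toy numbers and assuming one cls token; for the released MAE checkpoints the sinusoidal table appears to have no extra tokens at all, which matches the assert num_extra_tokens == 0 in the workaround posted in the earlier issue.

    # Toy numbers: a checkpoint trained at 224x224 and a new model built for
    # 384x384, each with one cls token in front of the patch positions.
    # "Extra tokens" are exactly those non-patch positions (cls/dist tokens).
    pos_embed_checkpoint_len = 1 + (224 // 16) ** 2      # 197 = [cls] + 14*14
    num_patches = (384 // 16) ** 2                       # 576 patches in the new model
    new_model_pos_len = 1 + num_patches                  # 577 = [cls] + 24*24

    num_extra_tokens = new_model_pos_len - num_patches   # 1, the cls token
    orig_size = int((pos_embed_checkpoint_len - num_extra_tokens) ** 0.5)  # 14
    new_size = int(num_patches ** 0.5)                   # 24

    # Bicubic interpolation maps the 14x14 grid to 24x24 and works just as well
    # in the other direction, so no smaller/larger assumption is needed; the
    # patch size only enters through the grid sizes computed above.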

forward() takes 2 positional arguments but 3 were given

Is this a bug?
with torch.cuda.amp.autocast():
    outputs = model(images, bool_masked_pos)
    loss = loss_func(input=outputs, target=labels)

Traceback (most recent call last):
  File "E:/ImageProject/MAE-OCT/run_mae_pretraining.py", line 264, in <module>
    main(opts)
  File "E:/ImageProject/MAE-OCT/run_mae_pretraining.py", line 238, in main
    normlize_target=args.normlize_target,
  File "E:\ImageProject\MAE-OCT\engine_for_pretraining.py", line 66, in train_one_epoch
    outputs = model(images, bool_masked_pos)
  File "D:\anaconda3\envs\mea\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
TypeError: forward() takes 2 positional arguments but 3 were given

get an error

Thank you for your contribution. When I try to run this code, I get the following error:
main(opts)
  File "run_mae_pretraining.py", line 150, in main
    model = get_model(args)
  File "run_mae_pretraining.py", line 125, in get_model
    model = create_model(
  File "/opt/conda/lib/python3.8/site-packages/timm/models/factory.py", line 57, in create_model
    model = create_fn(**model_args, **kwargs)
  File "/work/MAE-pytorch/modeling_pretrain.py", line 305, in pretrain_mae_base_patch16_224
    model = PretrainVisionTransformer(
TypeError: __init__() got an unexpected keyword argument 'in_chans'

How can I solve this problem?

help

Hello, I'm a novice and I want to ask: if I directly download the ImageNet ILSVRC2012 dataset, can I obtain the pretrained model by following your pretraining procedure? If not, how can I get it? Also, is it possible to pretrain with just two GPUs?

Why use EMA in fine-tuning?

Thanks for your great work!
I am confused: I don't think the authors mention EMA in their paper, but I found it in your implementation. Could you explain why you use it? And is it OK for me not to use it?

Why do you mention that you cannot reproduce the results reported in the paper?

Thank you for your contribution in implementing the code for MAE. I notice you only provide the ViT-L and ViT-B models pretrained for 400 epochs, but you say you cannot reproduce the results reported in the paper. Maybe if you continue to pre-train the models for longer, you can reach the results in the MAE paper.

finetune code about cls token

Hi, thanks for your great repo!
You fixed this line:

    # return self.fc_norm(x[:, 1:].mean(1))
    return self.fc_norm(x.mean(1))

but did not fix the "else" branch; why? (Maybe that branch will never be used?)

    return x[:, 0]
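
For context, a hedged sketch of the two pooling branches being discussed; the shapes and the use_mean_pooling flag below are illustrative, so check modeling_finetune.py for the real code path.

    import torch
    import torch.nn as nn

    B, N, D, num_classes = 2, 197, 768, 1000
    x = torch.randn(B, N, D)            # token sequence coming out of the encoder
    fc_norm = nn.LayerNorm(D)
    head = nn.Linear(D, num_classes)

    use_mean_pooling = True
    if use_mean_pooling:
        # The fixed line: global average pooling over the token sequence,
        # followed by fc_norm, then the classification head.
        feat = fc_norm(x.mean(1))
    else:
        # The untouched "else" branch: classic ViT pooling, return the first
        # (cls) token. It is only reached when use_mean_pooling is False, so if
        # the released configs always use mean pooling, this branch is simply
        # never exercised, which would explain why it was left unchanged.
        feat = x[:, 0]
    logits = head(feat)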

Pretrained models

Thanks a lot for the code!
However, I only have 4 GPUs, and it would take me about 33 days to obtain the ViT-Large pretrained model.
Would you mind uploading the pretrained model? The ViT-Base pretrained model alone would also be appreciated.

Fine-tuning code

Thank you very much for your work. Could you share when the fine-tuning code will be released? Thanks!

the config

Hello,
You trained for a total of 400 epochs. Was the epoch parameter in the configuration set to 1600 and you took the model from the 400th epoch, or did you directly set --epochs to 400?

Why do you use cosine weight decay?

Hi,

Thanks for providing the code. I found that you also use a cosine annealing curve to decay the weight decay. The paper does not seem to mention this. Would you please tell me why you use it?
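
For reference, a minimal sketch of a cosine schedule of the kind that can be applied to weight decay as well as to the learning rate; this practice appears to come from BEiT/DINO-style codebases rather than the MAE paper, and the function below is an illustrative stand-in, not this repo's exact scheduler utility.

    import math

    def cosine_schedule(base_value, final_value, total_steps):
        """Cosine anneal from base_value at step 0 to final_value at the end."""
        return [
            final_value + 0.5 * (base_value - final_value)
            * (1 + math.cos(math.pi * step / total_steps))
            for step in range(total_steps)
        ]

    # Example: weight decay swept from 0.05 to 0.5 over training.
    wd_values = cosine_schedule(0.05, 0.5, total_steps=100)
    print(wd_values[0], wd_values[-1])   # ~0.05 at the start, ~0.5 at the end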

knn eval of MAE

I evaluated the vit_base checkpoint (500/1600 pretraining epochs on ImageNet-1k) with a kNN metric. Loading all the pretrained parameters and using the ViT GAP method (no cls token needed), the 20-NN result is 33.4 on the ImageNet-100 dataset, which is very low and does not match the linear-probing accuracy.

ViT-Small

Any plan to pre-train a ViT-Small version?
