
mae-pytorch's People

Contributors

flishwang, pengzhiliang, tikboahit


mae-pytorch's Issues

About the next plan

Next, will you consider implementing downstream tasks such as classification or segmentation?

How is fine-tuning performed actually?

From my understanding, the authors pretrained the MAE on ImageNet and then used the encoder for fine-tuning. The paper does not explain how the masking is handled at that stage. During fine-tuning, are the images masked? If not, how do they fit an encoder that accepts only a portion of the input? I wonder how you handle it; can you kindly show me in your code?
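
For reference, a minimal sketch of the difference between the two stages, with toy tensors standing in for patch embeddings; the shapes and the all-or-nothing masking behaviour below are assumptions about how this repo follows the paper, not its exact API.

    import torch

    # Toy numbers for ViT-Base/16 at 224x224: 14*14 = 196 patches, 75% masked.
    num_patches, dim, mask_ratio = 196, 768, 0.75
    tokens = torch.randn(1, num_patches, dim)   # stand-in for the patch embeddings

    # Pre-training: build a boolean mask and feed ONLY the visible tokens,
    # so the encoder sees a sequence of length 49, not 196.
    num_masked = int(mask_ratio * num_patches)
    bool_masked_pos = torch.zeros(1, num_patches, dtype=torch.bool)
    bool_masked_pos[0, torch.randperm(num_patches)[:num_masked]] = True
    visible = tokens[~bool_masked_pos].reshape(1, -1, dim)
    print(visible.shape)                        # torch.Size([1, 49, 768])

    # Fine-tuning: no mask at all. The pretrained encoder weights are loaded
    # into a plain ViT (modeling_finetune) whose forward consumes the full
    # 196-token sequence, so nothing in the architecture depends on the
    # subset size used during pre-training.
    full_sequence = tokens                      # shape (1, 196, 768)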

Why isn't the k_bias trained?

Hi, I found that the k_bias of every attention layer is set to not require gradients, while the q_bias and v_bias do require gradients; see the code.

Is there any reason behind this? In common implementations, all the biases are learnable.
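
For context, a minimal sketch of the BEiT-style bias handling this implementation appears to follow; the class below is illustrative, not the repo's exact code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AttnWithSplitBias(nn.Module):
        """Sketch of a qkv projection with learnable q/v biases and a fixed k bias."""
        def __init__(self, dim):
            super().__init__()
            self.qkv = nn.Linear(dim, dim * 3, bias=False)
            self.q_bias = nn.Parameter(torch.zeros(dim))   # learnable
            self.v_bias = nn.Parameter(torch.zeros(dim))   # learnable
            # k_bias is kept as a constant zero vector (requires_grad=False):
            # adding the same bias b to every key only shifts each row of
            # q @ k.T by a constant (q @ b), and the softmax over keys is
            # invariant to that shift, so a learnable k bias is redundant.
            self.register_buffer("k_bias", torch.zeros(dim))

        def forward(self, x):
            qkv_bias = torch.cat((self.q_bias, self.k_bias, self.v_bias))
            return F.linear(x, self.qkv.weight, qkv_bias)  # qkv projection only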

Managing 512×512 input size

Thanks for your code! It's really great work! I want to train MAE on a dataset with 512×512 images. How should I adjust the model structure? Thanks for your help.
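
A rough sketch of what changes at 512×512, assuming the model builders accept a timm-style img_size keyword; check the actual constructor signature in modeling_pretrain.py before relying on it.

    # The patch size stays 16x16, so only the number of patches (and with it the
    # length of the sine-cosine positional table) grows with the resolution.
    img_size, patch_size = 512, 16
    num_patches = (img_size // patch_size) ** 2
    print(num_patches)        # 1024 tokens instead of (224 // 16) ** 2 = 196

    # Hypothetical call; verify the keyword against the real model builder:
    # model = create_model('pretrain_mae_base_patch16_224', pretrained=False,
    #                      img_size=512)
    # Note that self-attention cost grows roughly quadratically with the token
    # count, so expect a large increase in memory and compute.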

Training Equipment

I am very interested in what type of GPU you used for training and how long the training took.

Hello, a little question about this code

I am a beginner and really excited about MAE. Regarding the pretrained model pretrain_mae_vit_base_mask_0.75_400e.pth downloaded from Baidu Yun, I really want to know where to load it in the code. Can it be used with modeling_pretrain.py?
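
A hedged sketch of loading the downloaded weights into the pre-training model defined in modeling_pretrain.py, assuming the registered builder can be called directly and that the weights sit under a 'model' key, which is the usual layout for checkpoints saved by this kind of training script; adjust if your file differs. run_mae_vis.py also takes such a checkpoint path for visualization.

    import torch
    from modeling_pretrain import pretrain_mae_base_patch16_224

    ckpt_path = 'pretrain_mae_vit_base_mask_0.75_400e.pth'
    model = pretrain_mae_base_patch16_224(pretrained=False)

    checkpoint = torch.load(ckpt_path, map_location='cpu')
    # Assumption: weights are nested under a 'model' key; fall back to the raw
    # dict otherwise. Inspect the file if loading reports many missing keys.
    state_dict = checkpoint.get('model', checkpoint)
    msg = model.load_state_dict(state_dict, strict=False)
    print('missing:', msg.missing_keys, 'unexpected:', msg.unexpected_keys)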

How to process the dataset...

I want to know how to process the dataset.
There are a lot of tar files.

Is there a command or a Python script for this?
It is my first time doing this.

Sorry for the basic question.
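
For the ImageNet-1k training split, the usual layout is one big tar containing 1000 per-class tars, each holding that class's JPEGs. Below is a hedged sketch of unpacking it into the folder-per-class layout that a torchvision ImageFolder-style loader (and presumably this repo's --data_path argument) expects; the file names are the standard ILSVRC2012 ones, not something this repo ships.

    import os
    import tarfile

    train_tar = 'ILSVRC2012_img_train.tar'   # standard ImageNet-1k file name
    out_dir = 'imagenet/train'
    os.makedirs(out_dir, exist_ok=True)

    with tarfile.open(train_tar) as outer:
        for member in outer:                  # each member is e.g. n01440764.tar
            class_name = os.path.splitext(member.name)[0]
            class_dir = os.path.join(out_dir, class_name)
            os.makedirs(class_dir, exist_ok=True)
            # Extract the inner per-class tar straight from the outer archive.
            with tarfile.open(fileobj=outer.extractfile(member)) as inner:
                inner.extractall(class_dir)

    # The validation tar is flat; it additionally needs the devkit's label
    # mapping to be sorted into per-class folders.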

A question about pretrained model

Hi, I have a question: if I want to extract features with the pretrained model, should I just run the encoder with the mask set to all zeros?
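
A sketch of that idea with an all-zeros (all-visible) mask, under the assumption that the pre-training model exposes its encoder as model.encoder with an (x, bool_masked_pos) forward; verify both against modeling_pretrain.py.

    import torch
    from modeling_pretrain import pretrain_mae_base_patch16_224

    model = pretrain_mae_base_patch16_224(pretrained=False)  # load weights as usual
    model.eval()

    imgs = torch.randn(1, 3, 224, 224)
    num_patches = (224 // 16) ** 2
    bool_masked_pos = torch.zeros(1, num_patches, dtype=torch.bool)  # mask nothing

    with torch.no_grad():
        # Assumed encoder attribute and signature: with no patch masked out,
        # this should return features for all 196 patch tokens.
        features = model.encoder(imgs, bool_masked_pos)
    print(features.shape)

    # Alternative: load the encoder weights into modeling_finetune's ViT and
    # use its feature-extraction path, which never expects a mask.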

bugs

File "MAE-pytorch/modeling_pretrain.py", line 296, in pretrain_mae_base_patch16_224
    **kwargs)
TypeError: __init__() got an unexpected keyword argument 'num_classes'

[Attention] Information leak in visualization

Good job.

Hi, I think this operation leaks information from the original input image (the mean and variance of each patch).

rec_img = rec_img * (img_squeeze.var(dim=-2, unbiased=True, keepdim=True).sqrt() + 1e-6) + img_squeeze.mean(dim=-2, keepdim=True)

Anyone who uses the model trained with the normalized loss for visualization should pay attention to this operation.

I also suggest the author add a comment on this line. @pengzhiliang

Need to interpolate positional embedding to work at higher resolutions

Hi again, sorry for the slow response in issue #26. I have some more clarifications and visualizations here.

I agree that the sine-cosine embeddings are not learnable. However, it seems they still need to be interpolated for the model to work well. I suspect this is at least partly because they are 1D, so the model has to learn the number of rows/columns. For example, it cannot express "look one patch down" directly, but rather has to express it as "look X patches forward", and X changes if we change the resolution.

I have attached attention visualizations that show what happens if you run on higher res with or without interpolating the positional embedding. As you can see, the non-interpolated version looks much worse and has weird diagonal stripes.

This is not a major issue to me, but I wanted to let you (and anyone else that has the same problem) know about this. I think the best solution is what I mentioned before: to simply include the positional embeddings in the checkpoint even though they are not learnable parameters.

[Attached attention visualizations: the original resolution, higher resolution with interpolation, and higher resolution without interpolation.]

What compute did you use?

The original paper used a 128-core TPUv3, which is far beyond anything I can reach. May I ask what compute you used? I would like to assess whether the hardware I have is fit for pretraining.

The problem of the ImageNet dataset

Hi, first of all, congratulations!
However, ImageNet has too many images, and I cannot decide which ones to use.
Can you tell me the details of the dataset? It would be best to upload it to Baidu Disk or Google Drive.
Thank you!

Weights & Biases logger for MAE-pytorch

Hi @pengzhiliang

I am an ML Engineer at Weights & Biases, and I wanted to know whether you are actively reviewing PRs at the moment. We would love to make a PR to add Weights & Biases experiment tracking and image logging, if you have the time to review it.

I think experiment tracking will help you and your users keep the fine-tuning experiments organized, and image logging will be helpful for visualizing and storing the image reconstructions for better model evaluation :)

We have built integrations into transformers, YOLOv5, PyTorch Lightning, etc., so we should be able to make a quick and clean PR that shouldn't take too much of your time to review.

Running with pretrain_mae_vit_base_mask_0.75_400e.pth gives a model initialization error

Traceback (most recent call last):
  File "run_mae_vis.py", line 138, in <module>
    main(opts)
  File "run_mae_vis.py", line 79, in main
    model = get_model(args)
  File "run_mae_vis.py", line 63, in get_model
    model = create_model(
  File "/home/ub/miniconda3/envs/torch1.8/lib/python3.6/site-packages/timm/models/factory.py", line 57, in create_model
    model = create_fn(**model_args, **kwargs)
  File "/home/ub/bwj/MAE-pytorch/modeling_pretrain.py", line 317, in pretrain_mae_base_patch16_224
    **kwargs)
TypeError: __init__() got an unexpected keyword argument 'num_classes'

Positional embedding not stored in checkpoints - problem for tuning/inference at higher resolution

Hi,

I'm very impressed with the quick reproduction, nice work!

I have tried running inference with the provided models and noticed that the current checkpoints do not contain the positional embedding. This is not an issue when running on the same resolution (224,224). However, it makes it tricky to run inference at higher resolutions, since there is no positional encoding to interpolate.

I have been using the following hard-coded workaround but as far as I can see, the only solution is to change the model so that the positional encoding is stored as part of the checkpoint. Here is the hard-coded solution for loading current models and running tuning/inference:

# this replaces the code from line 334 in run_class_finetuning.py

        # Maybe interpolate position embedding
        old_n_positions = int((224/16)**2)
        if model.pos_embed.shape[1] != old_n_positions:
            embedding_size = model.pos_embed.shape[-1]
            old_pos_embed = modeling_finetune.get_sinusoid_encoding_table(old_n_positions, embedding_size)
            num_patches = model.patch_embed.num_patches
            num_extra_tokens = model.pos_embed.shape[-2] - num_patches
            assert num_extra_tokens == 0, "No support for class tokens"
            # height (== width) for the checkpoint position embedding
            orig_size = int((old_pos_embed.shape[-2] - num_extra_tokens) ** 0.5)
            # height (== width) for the new position embedding
            new_size = int(num_patches ** 0.5)
            # class_token and dist_token are kept unchanged
            if orig_size != new_size:
                print("Position interpolate from %dx%d to %dx%d" % (orig_size, orig_size, new_size, new_size))
                extra_tokens = old_pos_embed[:, :num_extra_tokens]
                # only the position tokens are interpolated
                pos_tokens = old_pos_embed[:, num_extra_tokens:]
                pos_tokens = pos_tokens.reshape(-1, orig_size, orig_size, embedding_size).permute(0, 3, 1, 2)
                pos_tokens = torch.nn.functional.interpolate(
                    pos_tokens, size=(new_size, new_size), mode='bicubic', align_corners=False)
                pos_tokens = pos_tokens.permute(0, 2, 3, 1).flatten(1, 2)
                new_pos_embed = torch.cat((extra_tokens, pos_tokens), dim=1)
                model.pos_embed = new_pos_embed

TypeError: __init__() got an unexpected keyword argument 'in_chans'

Sharing an issue: at first I used the requirements.txt updated on 11.22 and got the following error:

File"/MAE-pytorch/modeling_pretrain.py", line 319, in pretrain_mae_base_patch16_224
**kwargs)
**kwargs)
TypeError: init() got an unexpected keyword argument 'in_chans'

It seems to be caused by timm==0.3.2; upgrading to 0.4.12 solved the problem.
I tested on a V100 (CentOS) and an A100 (Ubuntu), and the issue occurs on both.
Uninstalling timm also seems to let it run... This is my first time reading CV code, so I don't know very much yet.
By the way, did you use a batch size of 64 on the V100?

A little question about the visualization

Hi, thanks a lot for the excellent work!!

When I executed run_mae_vis.py, I noticed a strange phenomenon: the model is able to reconstruct the original image without losing much information even when only one patch is given.

As shown in the following example, the mask_ratio is set to 0.999, ensuring that the model can only see one patch. However, it is really strange that the model can almost reconstruct the original image (e.g. the shape of the bird, the yellow feathers...).

I think it should be impossible for the model to do this. Therefore, may I ask whether some information about the original image, other than the visible patch, is fed into the model? Really looking forward to your reply! Thanks!
[Example reconstruction attached.]

Grad Norm Becomes Inf

[Screenshot of the training log attached.]
On two GPUs.

Epoch: [24] [1230/1251] eta: 0:00:06 lr: 0.000375 min_lr: 0.000375 loss: 0.6870 (0.6848) loss_scale: 2097152.0000 (2046895.3111) weight_decay: 0.0500 (0.0500) grad_norm: 0.0929 (0.0969) time: 0.3023 data: 0.0010 max mem: 8361
Epoch: [24] [1240/1251] eta: 0:00:03 lr: 0.000375 min_lr: 0.000375 loss: 0.6877 (0.6848) loss_scale: 2097152.0000 (2047300.2804) weight_decay: 0.0500 (0.0500) grad_norm: 0.0942 (0.0971) time: 0.2731 data: 0.0018 max mem: 8361
Epoch: [24] [1250/1251] eta: 0:00:00 lr: 0.000375 min_lr: 0.000375 loss: 0.6856 (0.6849) loss_scale: 2097152.0000 (2047698.7754) weight_decay: 0.0500 (0.0500) grad_norm: 0.0942 (0.0971) time: 0.2560 data: 0.0012 max mem: 8361
Epoch: [24] Total time: 0:06:23 (0.3067 s / it)
Averaged stats: lr: 0.000375 min_lr: 0.000375 loss: 0.6856 (0.6851) loss_scale: 2097152.0000 (2047698.7754) weight_decay: 0.0500 (0.0500) grad_norm: 0.0942 (0.0971)
Epoch: [25] [   0/1251] eta: 1:25:25 lr: 0.000375 min_lr: 0.000375 loss: 0.6770 (0.6770) loss_scale: 2097152.0000 (2097152.0000) weight_decay: 0.0500 (0.0500) grad_norm: 0.0918 (0.0918) time: 4.0974 data: 3.7792 max mem: 8361
Epoch: [25] [  10/1251] eta: 0:13:50 lr: 0.000375 min_lr: 0.000375 loss: 0.6854 (0.6838) loss_scale: 2097152.0000 (2097152.0000) weight_decay: 0.0500 (0.0500) grad_norm: 0.0910 (0.0949) time: 0.6694 data: 0.3704 max mem: 8361

How does this phenomenon occur?

Reported result for ViT-Large

Hi,

I notice the reported performance of ViT-Large is 84.5%. I was wondering, does this correspond to the baseline in the paper?

In other words, can we say the reproduced vs. official result is 84.5% vs. 84.9%?

Thanks

Info about the pretrained models

Can you please give more info about the pretrained models?

  • How many warmup epochs did you use?
  • How many GPUs, what kind, and what was the batch size and the learning rate?

Thank you very much!

normalize

Thanks for your implementation!

The input seems to be normalized twice: first in DataAugmentationForMAE, and then again in train_one_epoch.
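
For what it's worth, the two normalizations appear to serve different purposes if this follows the paper's normalized-pixel loss: the transform normalizes the encoder input with dataset statistics, while train_one_epoch builds the regression target by normalizing each patch with its own mean and variance. A simplified per-channel sketch follows (the repo appears to group all three channels into one patch vector, so treat the exact shapes as assumptions):

    import torch

    # 1) Dataset-level normalization (DataAugmentationForMAE): whitens the
    #    ENCODER INPUT with the ImageNet channel mean/std.
    mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
    std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)
    img = torch.rand(1, 3, 224, 224)
    model_input = (img - mean) / std

    # 2) Per-patch normalization (train_one_epoch): builds the REGRESSION
    #    TARGET by normalizing each 16x16 patch with its own mean/variance,
    #    i.e. the normalized-pixel loss variant of MAE.
    patches = img.unfold(2, 16, 16).unfold(3, 16, 16).reshape(1, 3, -1, 256)
    target = (patches - patches.mean(dim=-1, keepdim=True)) / (
        patches.var(dim=-1, unbiased=True, keepdim=True).sqrt() + 1e-6)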

the problem of result

I'm sorry to bother you again, and I appreciate the previous help. The program now runs, but the result is not good. I am using a dataset that I created myself. I would like to know: if I use a larger dataset, will the result be better? I also see that the patch size you set is 16×16; if I choose a smaller one, such as 6×6, will the result be better and clearer?
Thank you for your attention. @pengzhiliang

Positional embedding interpolation

I don't understand these lines of code in the fine-tuning script.

num_patches = model.patch_embed.num_patches
num_extra_tokens = model.pos_embed.shape[-2] - num_patches
orig_size = int((pos_embed_checkpoint.shape[-2] - num_extra_tokens) ** 0.5)

What is the meaning of the extra tokens?
Is there any assumption that the new image size is smaller or larger than the original image size? The same goes for the patch size.
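
A small worked example of those lines, using toy numbers and assuming one cls token; for the released MAE checkpoints the sinusoidal table appears to have no extra tokens at all, which matches the assert num_extra_tokens == 0 in the workaround posted in the earlier issue.

    # Toy numbers: a checkpoint trained at 224x224 and a new model built for
    # 384x384, each with one cls token in front of the patch positions.
    # "Extra tokens" are exactly those non-patch positions (cls/dist tokens).
    pos_embed_checkpoint_len = 1 + (224 // 16) ** 2      # 197 = [cls] + 14*14
    num_patches = (384 // 16) ** 2                       # 576 patches in the new model
    new_model_pos_len = 1 + num_patches                  # 577 = [cls] + 24*24

    num_extra_tokens = new_model_pos_len - num_patches   # 1, the cls token
    orig_size = int((pos_embed_checkpoint_len - num_extra_tokens) ** 0.5)  # 14
    new_size = int(num_patches ** 0.5)                   # 24

    # Bicubic interpolation maps the 14x14 grid to 24x24 and works just as well
    # in the other direction, so no smaller/larger assumption is needed; the
    # patch size only enters through the grid sizes computed above.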

forward() takes 2 positional arguments but 3 were given

Is this a bug?
with torch.cuda.amp.autocast():
    outputs = model(images, bool_masked_pos)
    loss = loss_func(input=outputs, target=labels)

Traceback (most recent call last):
  File "E:/ImageProject/MAE-OCT/run_mae_pretraining.py", line 264, in <module>
    main(opts)
  File "E:/ImageProject/MAE-OCT/run_mae_pretraining.py", line 238, in main
    normlize_target=args.normlize_target,
  File "E:\ImageProject\MAE-OCT\engine_for_pretraining.py", line 66, in train_one_epoch
    outputs = model(images, bool_masked_pos)
  File "D:\anaconda3\envs\mea\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
TypeError: forward() takes 2 positional arguments but 3 were given

get an error

Thank you for your contribution. When I try to run this code, I get the following error:
main(opts)
  File "run_mae_pretraining.py", line 150, in main
    model = get_model(args)
  File "run_mae_pretraining.py", line 125, in get_model
    model = create_model(
  File "/opt/conda/lib/python3.8/site-packages/timm/models/factory.py", line 57, in create_model
    model = create_fn(**model_args, **kwargs)
  File "/work/MAE-pytorch/modeling_pretrain.py", line 305, in pretrain_mae_base_patch16_224
    model = PretrainVisionTransformer(
TypeError: __init__() got an unexpected keyword argument 'in_chans'

How can I solve this problem?

help

Hello, I'm a novice and I want to ask: if I directly download the ImageNet ILSVRC2012 dataset, can I obtain the pretrained model by following your pretraining procedure? If not, how can I get it? Also, is it possible to pretrain with just two GPUs?

Why use EMA in fine-tuning?

Thanks for your great work!
I am confused: I don't think the authors mention EMA in their paper, but I found it in your implementation. Could you explain why you use it? And is it OK for me not to use it?

Why do you mention that you cannot reproduce the results reported in the paper?

Thank you for your contribution in implementing the code for MAE. I notice you only provide the ViT-L and ViT-B models pretrained for 400 epochs, but you say you cannot reproduce the results reported in the paper. Maybe if you continue to pre-train the models for longer, you can reach the results in the MAE paper.

finetune code about cls token

Hi, thanks for your great repo!
You fixed this line:

    # return self.fc_norm(x[:, 1:].mean(1))
    return self.fc_norm(x.mean(1))

but did not fix the "else" branch; why? (Maybe that branch will never be used?)

    return x[:, 0]
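
For context, a hedged sketch of the two pooling branches being discussed; the shapes and the use_mean_pooling flag below are illustrative, so check modeling_finetune.py for the real code path.

    import torch
    import torch.nn as nn

    B, N, D, num_classes = 2, 197, 768, 1000
    x = torch.randn(B, N, D)            # token sequence coming out of the encoder
    fc_norm = nn.LayerNorm(D)
    head = nn.Linear(D, num_classes)

    use_mean_pooling = True
    if use_mean_pooling:
        # The fixed line: global average pooling over the token sequence,
        # followed by fc_norm, then the classification head.
        feat = fc_norm(x.mean(1))
    else:
        # The untouched "else" branch: classic ViT pooling, return the first
        # (cls) token. It is only reached when use_mean_pooling is False, so if
        # the released configs always use mean pooling, this branch is simply
        # never exercised, which would explain why it was left unchanged.
        feat = x[:, 0]
    logits = head(feat)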

Pretrained models

Thanks a lot for the code!
However, I only have 4 GPUs, and it would take me about 33 days to obtain the ViT-Large pretrained model.
Would you mind uploading the pretrained model? The ViT-Base pretrained model alone would also be appreciated.

Fine-tuning code

Thank you very much for your work. Could you share when the fine-tuning code will be released? Thanks!

the config

Hello,
You trained for a total of 400 epochs. Was the epoch parameter in the configuration set to 1600 and you took the model from the 400th epoch, or did you directly set --epochs to 400?

Why do you use cosine weight decay?

Hi,

Thanks for providing the code. I found that you also use a cosine annealing curve to decay the weight decay. The paper does not seem to mention this. Would you please tell me why you use it?
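
For reference, a minimal sketch of a cosine schedule of the kind that can be applied to weight decay as well as to the learning rate; this practice appears to come from BEiT/DINO-style codebases rather than the MAE paper, and the function below is an illustrative stand-in, not this repo's exact scheduler utility.

    import math

    def cosine_schedule(base_value, final_value, total_steps):
        """Cosine anneal from base_value at step 0 to final_value at the end."""
        return [
            final_value + 0.5 * (base_value - final_value)
            * (1 + math.cos(math.pi * step / total_steps))
            for step in range(total_steps)
        ]

    # Example: weight decay swept from 0.05 to 0.5 over training.
    wd_values = cosine_schedule(0.05, 0.5, total_steps=100)
    print(wd_values[0], wd_values[-1])   # ~0.05 at the start, ~0.5 at the end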

knn eval of MAE

I evaluated the vit_base checkpoint (500/1600 pretraining epochs on ImageNet-1k) with a kNN metric. Loading all the pretrained parameters and using the ViT GAP method (no cls token needed), the 20-NN result is 33.4 on the ImageNet-100 dataset, which is very low and does not match the linear-probing accuracy.

ViT-Small

Any plan to pre-train a ViT-Small version?
