
esvit's Introduction

Efficient Self-Supervised Vision Transformers (EsViT)

PWC

[Paper] [Slides]

PyTorch implementation for EsViT (accepted at ICLR 2022), built with two techniques:

  • A multi-stage Transformer architecture. Three multi-stage Transformer variants are implemented under the folder models.
  • A non-contrastive, region-level matching pre-training task. The region-level matching task is implemented in the class DDINOLoss(nn.Module) (Line 648) in main_esvit.py. Please pass --use_dense_prediction True; otherwise only the view-level task is used. (A minimal sketch of the matching idea is given after the figure below.)
Figure: Efficiency vs. accuracy comparison under the linear classification protocol on ImageNet. Left: throughput of all SoTA SSL vision systems; circle size indicates model parameter count. Right: performance over varying parameter counts for models with a moderate throughput/#parameters ratio. Please refer to Section 4.1 for details.
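To make the region-level task concrete, below is a minimal, simplified sketch of the idea behind DDINOLoss, not the repository's exact implementation: each projected student region feature is matched to its most similar teacher region by cosine similarity, and a DINO-style cross-entropy is computed against the matched teacher region. Function name, tensor shapes, temperatures, and the reuse of the projected features for both matching and the loss are illustrative assumptions.

```python
# Simplified sketch of region-level matching (NOT the exact DDINOLoss in main_esvit.py).
import torch
import torch.nn.functional as F

def region_matching_loss(student_regions, teacher_regions,
                         student_temp=0.1, teacher_temp=0.04):
    """student_regions: (B, N_s, D), teacher_regions: (B, N_t, D) projected region features."""
    # 1. Match each student region to its most similar teacher region (cosine similarity).
    sim = torch.bmm(F.normalize(student_regions, dim=-1),
                    F.normalize(teacher_regions, dim=-1).transpose(1, 2))  # (B, N_s, N_t)
    match_idx = sim.argmax(dim=-1)                                         # (B, N_s)

    # 2. Gather the matched teacher regions.
    idx = match_idx.unsqueeze(-1).expand(-1, -1, teacher_regions.size(-1))
    matched_teacher = torch.gather(teacher_regions, 1, idx)               # (B, N_s, D)

    # 3. DINO-style cross-entropy between the sharpened teacher distribution
    #    (stop-gradient) and the student distribution, averaged over regions.
    teacher_probs = F.softmax(matched_teacher / teacher_temp, dim=-1).detach()
    student_logp = F.log_softmax(student_regions / student_temp, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()
```

The actual DDINOLoss additionally applies teacher centering and pairs multiple global/local crops, following DINO.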

Updates

[Workshop]  [IC Challenge]  [OD Challenge]

Pretrained models

You can download the full checkpoints (trained with both view-level and region-level tasks, batch size 512, on ImageNet-1K), which contain backbone and projection head weights for both the student and teacher networks.

Note: The data is stored on Azure Blob Storage, and a SAS token with Read permission is provided. Please append the following SAS token to the end of each link to download:

?sp=r&st=2023-08-28T01:36:35Z&se=3023-08-28T09:36:35Z&sv=2022-11-02&sr=c&sig=coos9vSl4Xk6S6KvqZffkVCUb7Ug%2FFR9cfyc3xacMJI%3D
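As a hedged illustration of using the SAS token programmatically (the blob URL below is a placeholder, not one of the links in the tables), simply append the token to a checkpoint URL before downloading:

```python
# Placeholder example: append the read-only SAS token to a checkpoint URL, then download.
import urllib.request

SAS = ("?sp=r&st=2023-08-28T01:36:35Z&se=3023-08-28T09:36:35Z&sv=2022-11-02"
       "&sr=c&sig=coos9vSl4Xk6S6KvqZffkVCUb7Ug%2FFR9cfyc3xacMJI%3D")

# Replace with one of the "full ckpt" links from the tables below.
ckpt_url = "https://<account>.blob.core.windows.net/<container>/<path>/checkpoint.pth"

urllib.request.urlretrieve(ckpt_url + SAS, "checkpoint.pth")
```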
  • EsViT (Swin) with network configurations of increased model capacities, pre-trained with both view-level and region-level tasks. ResNet-50 trained with both tasks is shown as a reference.
| arch | params | tasks | linear | k-NN | download | logs |
|------|--------|-------|--------|------|----------|------|
| ResNet-50 | 23M | V+R | 75.7% | 71.3% | full ckpt | train / linear / knn |
| EsViT (Swin-T, W=7) | 28M | V+R | 78.0% | 75.7% | full ckpt | train / linear / knn |
| EsViT (Swin-S, W=7) | 49M | V+R | 79.5% | 77.7% | full ckpt | train / linear / knn |
| EsViT (Swin-B, W=7) | 87M | V+R | 80.4% | 78.9% | full ckpt | train / linear / knn |
| EsViT (Swin-T, W=14) | 28M | V+R | 78.7% | 77.0% | full ckpt | train / linear / knn |
| EsViT (Swin-S, W=14) | 49M | V+R | 80.8% | 79.1% | full ckpt | train / linear / knn |
| EsViT (Swin-B, W=14) | 87M | V+R | 81.3% | 79.3% | full ckpt | train / linear / knn |
  • EsViT with view-level task only
| arch | params | tasks | linear | k-NN | download | logs |
|------|--------|-------|--------|------|----------|------|
| ResNet-50 | 23M | V | 75.0% | 69.1% | full ckpt | train / linear / knn |
| EsViT (Swin-T, W=7) | 28M | V | 77.0% | 74.2% | full ckpt | train / linear / knn |
| EsViT (Swin-S, W=7) | 49M | V | 79.2% | 76.9% | full ckpt | train / linear / knn |
| EsViT (Swin-B, W=7) | 87M | V | 79.6% | 77.7% | full ckpt | train / linear / knn |
  • EsViT (Swin-T, W=7) with different pre-train datasets (view-level task only)
| arch | params | batch size | pre-train dataset | linear | k-NN | download | logs |
|------|--------|------------|-------------------|--------|------|----------|------|
| EsViT | 28M | 1024 | ImageNet-1K | 77.1% | 73.7% | full ckpt | train / linear / knn |
| EsViT | 28M | 1024 | WebVision-v1 | 75.4% | 69.4% | full ckpt | train / linear / knn |
| EsViT | 28M | 1024 | OpenImages-v4 | 69.6% | 60.3% | full ckpt | train / linear / knn |
| EsViT | 28M | 1024 | ImageNet-22K | 73.5% | 66.1% | full ckpt | train / linear / knn |
  • EsViT with more multi-stage vision Transformer architectures, pre-trained with the view-level task only (V) or with both view-level and region-level tasks (V+R).
| arch | params | pre-train task | linear | k-NN | download | logs |
|------|--------|----------------|--------|------|----------|------|
| EsViT (ViL, W=7) | 28M | V | 77.3% | 73.9% | full ckpt | train / linear / knn |
| EsViT (ViL, W=7) | 28M | V+R | 77.5% | 74.5% | full ckpt | train / linear / knn |
| EsViT (CvT, W=7) | 29M | V | 77.6% | 74.8% | full ckpt | train / linear / knn |
| EsViT (CvT, W=7) | 29M | V+R | 78.5% | 76.7% | full ckpt | train / linear / knn |

Pre-training

One-node training

To train on 1 node with 16 GPUs for Swin-T model size:

PROJ_PATH=your_esvit_project_path
DATA_PATH=$PROJ_PATH/project/data/imagenet

OUT_PATH=$PROJ_PATH/output/esvit_exp/ssl/swin_tiny_imagenet/
python -m torch.distributed.launch --nproc_per_node=16 main_esvit.py --arch swin_tiny --data_path $DATA_PATH/train --output_dir $OUT_PATH --batch_size_per_gpu 32 --epochs 300 --teacher_temp 0.07 --warmup_epochs 10 --warmup_teacher_temp_epochs 30 --norm_last_layer false --use_dense_prediction True --cfg experiments/imagenet/swin/swin_tiny_patch4_window7_224.yaml 

The main training script is main_esvit.py; it runs the training loop and takes the following arguments (among others):

  • --use_dense_prediction: whether or not to use the region matching task in pre-training
  • --arch: switches between different sparse self-attention mechanisms in the multi-stage Transformer architecture. Example architecture choices for EsViT training include [swin_tiny, swin_small, swin_base, swin_large, cvt_tiny, vil_2262]. The configuration files should be adjusted accordingly; we provide examples below. One may specify the network configuration by editing the YAML file under experiments/imagenet/*/*.yaml. The default window size is 7; to use a multi-stage architecture with window size 14, choose the YAML files with window14 in their filenames.

To train on 1 node with 16 GPUs for Convolutional vision Transformer (CvT) models:

python -m torch.distributed.launch --nproc_per_node=16 main_esvit.py --arch cvt_tiny --data_path $DATA_PATH/train --output_dir $OUT_PATH --batch_size_per_gpu 32 --epochs 300 --teacher_temp 0.07 --warmup_epochs 10 --warmup_teacher_temp_epochs 30 --norm_last_layer false --use_dense_prediction True --aug-opt dino_aug --cfg experiments/imagenet/cvt_v4/s1.yaml

To train on 1 node with 16 GPUs for Vision Longformer (ViL) models:

python -m torch.distributed.launch --nproc_per_node=16 main_esvit.py --arch vil_2262 --data_path $DATA_PATH/train --output_dir $OUT_PATH --batch_size_per_gpu 32 --epochs 300 --teacher_temp 0.07 --warmup_epochs 10 --warmup_teacher_temp_epochs 30 --norm_last_layer false --use_dense_prediction True --aug-opt dino_aug --cfg experiments/imagenet/vil/vil_small/base.yaml MODEL.SPEC.MSVIT.ARCH 'l1,h3,d96,n2,s1,g1,p4,f7,a0_l2,h6,d192,n2,s1,g1,p2,f7,a0_l3,h12,d384,n6,s0,g1,p2,f7,a0_l4,h24,d768,n2,s0,g0,p2,f7,a0' MODEL.SPEC.MSVIT.MODE 1 MODEL.SPEC.MSVIT.VIL_MODE_SWITCH 0.75

Multi-node training

To train on 2 nodes with 16 GPUs each (total 32 GPUs) for Swin-Small model size:

OUT_PATH=$PROJ_PATH/exp_output/esvit_exp/swin/swin_small/bl_lr0.0005_gpu16_bs16_multicrop_epoch300_dino_aug_window14
python main_evsit_mnodes.py --num_nodes 2 --num_gpus_per_node 16 --data_path $DATA_PATH/train --output_dir $OUT_PATH/continued_from0200_dense --batch_size_per_gpu 16 --arch swin_small --zip_mode True --epochs 300 --teacher_temp 0.07 --warmup_epochs 10 --warmup_teacher_temp_epochs 30 --norm_last_layer false --cfg experiments/imagenet/swin/swin_small_patch4_window14_224.yaml --use_dense_prediction True --pretrained_weights_ckpt $OUT_PATH/checkpoint0200.pth

Evaluation

k-NN and Linear classification on ImageNet

To train a supervised linear classifier on frozen weights on a single node with 4 GPUs, run eval_linear.py. To train a k-NN classifier on frozen weights on a single node with 4 GPUs, run eval_knn.py. Please specify --arch, --cfg and --pretrained_weights to choose a pre-trained checkpoint. To evaluate the last checkpoint of EsViT with Swin-T, you can run, for example:

PROJ_PATH=your_esvit_project_path
DATA_PATH=$PROJ_PATH/project/data/imagenet

OUT_PATH=$PROJ_PATH/exp_output/esvit_exp/swin/swin_tiny/bl_lr0.0005_gpu16_bs32_dense_multicrop_epoch300
CKPT_PATH=$PROJ_PATH/exp_output/esvit_exp/swin/swin_tiny/bl_lr0.0005_gpu16_bs32_dense_multicrop_epoch300/checkpoint.pth

python -m torch.distributed.launch --nproc_per_node=4 eval_linear.py --data_path $DATA_PATH --output_dir $OUT_PATH/lincls/epoch0300 --pretrained_weights $CKPT_PATH --checkpoint_key teacher --batch_size_per_gpu 256 --arch swin_tiny --cfg experiments/imagenet/swin/swin_tiny_patch4_window7_224.yaml --n_last_blocks 4 --num_labels 1000 MODEL.NUM_CLASSES 0

python -m torch.distributed.launch --nproc_per_node=4 eval_knn.py --data_path $DATA_PATH --dump_features $OUT_PATH/features/epoch0300 --pretrained_weights $CKPT_PATH --checkpoint_key teacher --batch_size_per_gpu 256 --arch swin_tiny --cfg experiments/imagenet/swin/swin_tiny_patch4_window7_224.yaml MODEL.NUM_CLASSES 0
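For intuition, here is a rough sketch of the temperature-weighted k-NN voting used in DINO-style k-NN evaluation. It is a simplified, single-shot version (the actual eval_knn.py extracts features with the frozen backbone and processes the data in chunks), and the function name and defaults are assumptions:

```python
# Simplified sketch of DINO-style temperature-weighted k-NN voting
# (not the exact eval_knn.py code; function name and defaults are assumptions).
import torch

def knn_predict(test_feats, train_feats, train_labels, k=20, T=0.07, num_classes=1000):
    """Features are assumed L2-normalized, shape (N, D); train_labels has shape (N_train,)."""
    sim = test_feats @ train_feats.t()                    # cosine similarity to every train sample
    topk_sim, topk_idx = sim.topk(k, dim=1)               # k nearest neighbours per test sample
    topk_labels = train_labels[topk_idx]                  # (N_test, k) neighbour labels
    weights = (topk_sim / T).exp()                        # temperature-scaled vote weights
    votes = torch.zeros(test_feats.size(0), num_classes, device=test_feats.device)
    votes.scatter_add_(1, topk_labels, weights)           # accumulate weighted votes per class
    return votes.argmax(dim=1)                            # predicted class per test sample
```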

Analysis/Visualization of correspondence and attention maps

You can analyze the learned models by running python run_analysis.py. An example of analyzing EsViT (Swin-T) is shown below.

For an individual image (with path --image_path $IMG_PATH), we visualize the attention maps and correspondence of the last layer:

python run_analysis.py --arch swin_tiny --image_path $IMG_PATH --output_dir $OUT_PATH --pretrained_weights $CKPT_PATH --learning ssl --seed $SEED --cfg experiments/imagenet/swin/swin_tiny_patch4_window7_224.yaml --vis_attention True --vis_correspondence True MODEL.NUM_CLASSES 0 

For an image dataset (with path --data_path $DATA_PATH), we quantitatively measure the correspondence:

python run_analysis.py --arch swin_tiny --data_path $DATA_PATH --output_dir $OUT_PATH --pretrained_weights $CKPT_PATH --learning ssl --seed $SEED --cfg experiments/imagenet/swin/swin_tiny_patch4_window7_224.yaml  --measure_correspondence True MODEL.NUM_CLASSES 0 

For more examples, please see scripts/scripts_local/run_analysis.sh.
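If you only need the backbone for analysis or feature extraction on your own data, below is a minimal sketch of pulling the teacher backbone out of a full checkpoint. The key names ('teacher', 'module.', 'backbone.', 'head') are assumptions based on the --checkpoint_key teacher convention used above, so verify them against your own checkpoint:

```python
# Minimal sketch (not an official API): keep only the teacher backbone weights
# from a full EsViT checkpoint. Key names are assumptions; inspect your checkpoint first.
import torch

state = torch.load("checkpoint.pth", map_location="cpu")
teacher = state["teacher"]

backbone_sd = {
    k.replace("module.", "").replace("backbone.", ""): v
    for k, v in teacher.items()
    if "head" not in k  # drop the DINO/EsViT projection-head weights
}
print(f"kept {len(backbone_sd)} backbone tensors")
# These weights can then be loaded into a matching Swin/ViL/CvT backbone with
# model.load_state_dict(backbone_sd, strict=False).
```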

Citation

If you find this repository useful, please consider giving a star ⭐ and citation 🍺:

@article{li2021esvit,
  title={Efficient Self-supervised Vision Transformers for Representation Learning},
  author={Li, Chunyuan and Yang, Jianwei and Zhang, Pengchuan and Gao, Mei and Xiao, Bin and Dai, Xiyang and Yuan, Lu and Gao, Jianfeng},
  journal={International Conference on Learning Representations (ICLR)},
  year={2022}
}

Related Projects/Codebase

[Swin Transformers] [Vision Longformer] [Convolutional vision Transformers (CvT)] [Focal Transformers]

Acknowledgement

Our implementation is built partly upon the packages [DINO] and [Timm].

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.


esvit's Issues

can't load swin-tiny checkpoint right

Hi, I use swin_transformer.py to load the Swin-T model pre-trained on ImageNet-1K, and the message is:
msg: _IncompatibleKeys(missing_keys=['layers.0.blocks.1.attn_mask', 'layers.1.blocks.1.attn_mask', 'layers.2.blocks.1.attn_mask', 'layers.2.blocks.3.attn_mask', 'layers.2.blocks.5.attn_mask', 'head.weight', 'head.bias'], unexpected_keys=['head.mlp.0.weight', 'head.mlp.0.bias', 'head.mlp.2.weight', 'head.mlp.2.bias', 'head.mlp.4.weight', 'head.mlp.4.bias', 'head.last_layer.weight_g', 'head.last_layer.weight_v'])
Why are there missing keys here?

PublicAccessNotPermitted

Hi, when I try to download the checkpoints and training logs, I get the error below. Could you please help?

This XML file does not appear to have any style information associated with it. The document tree is shown below. <Error> <Code>PublicAccessNotPermitted</Code> <Message>Public access is not permitted on this storage account. RequestId:9a468ee0-501e-00ad-1329-9977d2000000 Time:2023-06-07T10:18:37.1823499Z</Message> </Error>

Mode Collapse on Custom Dataset

Hi! First of all kudos on the great work!

So, I am experimenting on a custom dataset of about 70k images consisting of 7 different classes. However, the model seems to collapse after 3-4 epochs of training. I have tried playing around with different embedding dimensions for the out_dim parameter and lower values for teacher_temp to increase sharpening, but in vain.

Have you experimented with smaller datasets? Would you be able to provide any suggestions in this case?

Thanks!

Mixup & Cutmix during Pre-Training

Hi @ChunyuanLI, I've noticed the use of mixup and cutmix during pre-training, which is not included in DINO. I'm wondering about the performance gain brought by applying mixup & cutmix. Have you run any related experiments pre-trained without mixup? I'm especially interested in vanilla DINO with Swin-T/Swin-B as the backbone, i.e., EsViT with only the view-level task and without mixup & cutmix. It would be nice if you could share those results.

Missing requirements

Hi!

I am trying to load esvit on Google Colaboratory with the following code:

!git clone https://github.com/microsoft/esvit.git
!pip install -r ./esvit/requirements.txt

import models.vision_transformer as vits

I got the following error:

...
/usr/local/lib/python3.7/dist-packages/timm/models/layers/helpers.py in <module>
      4 """
      5 from itertools import repeat
----> 6 from torch._six import container_abcs
      7 
      8
ImportError: cannot import name 'container_abcs' from 'torch._six' (/usr/local/lib/python3.7/dist-packages/torch/_six.py)

which seems to be related to the torch version. However, when I downgrade torch (<1.11.0) I get errors on other torch imports.

Is there a testing notebook available?

Questions about downstream COCO detection

Hi, I'm wondering if you can provide a recipe to reproduce the COCO detection results.
I've tried to use your pre-trained checkpoint to train the downstream task with Mask R-CNN, but cannot get the results reported in the paper. Not sure if something went wrong during training. Could you please provide more details? Thank you!

evaluation for custom dataset

I am unable to find the run_analysis.py file in the repository. How do I use the pretrained model to get representations for my custom dataset?

Question about the Learning Rate used for pretraining

Hello.

Thank you for the wonderful work!
I have some questions about the learning rate used to pretrain the Swin model in Table 1.
As the logs show, the learning rate for the Swin-T model is 0.0005180447994195404 at epoch 201, while the learning rate for the Swin-S/B models is 0.00025939212681290886 at epoch 201. However, the parameters shown under the 'args' keyword in the pre-trained models are the same.

Could you please tell me why there is a difference in learning rate in the training log?

Thanks in advance.

Throughput comparison (Table 1)

Hello,
I have read your paper and found it very interesting. I was particularly intrigued by Table 1, where you compare throughput against other methods, including DINO with DeiT-T and a patch size of 16. From the table, EsViT with Swin-T (W=7) has a throughput of 808 and DINO with DeiT-T/16 has 1007, so I expected EsViT to be roughly 20% slower. Yet when I run both, I do not get this. I attached both logs below.

DINO

arch: deit_tiny
batch_size_per_gpu: 200
clip_grad: 3.0
data_path: /ilsvrc2012/ILSVRC2012_img_train
dist_url: env://
epochs: 100
freeze_last_layer: 1
global_crops_scale: (0.4, 1.0)
gpu: 0
local_crops_number: 8
local_crops_scale: (0.05, 0.4)
local_rank: 0
lr: 0.0005
min_lr: 1e-06
momentum_teacher: 0.996
norm_last_layer: True
num_workers: 24
optimizer: adamw
out_dim: 65536
output_dir: output_dir
patch_size: 16
rank: 0
saveckp_freq: 10
seed: 0
teacher_temp: 0.04
use_bn_in_head: False
use_fp16: True
warmup_epochs: 10
warmup_teacher_temp: 0.04
warmup_teacher_temp_epochs: 0
weight_decay: 0.04
weight_decay_end: 0.4
world_size: 4
Data loaded: there are 1281167 images.
Student and Teacher are built: they are both deit_tiny network.
Loss, optimizer and schedulers ready.
Starting DINO training !

Epoch: [0/100] Total time: 0:38:22 (1.438374 s / it)
Averaged stats: loss: 6.691907e+00 (8.885959e+00)  lr: 1.551861e-04 (7.808108e-05)  wd: 4.008760e-02 (4.002958e-02)

EsViT

aa: rand-m9-mstd0.5-inc1
arch: swin_tiny
aug_opt: dino_aug
batch_size_per_gpu: 48
cfg: experiments/imagenet/swin/swin_tiny_patch4_window7_224.yaml
clip_grad: 3.0
color_jitter: 0.4
cutmix: 1.0
cutmix_minmax: None
data_path: /ilsvrc2012/ILSVRC2012_img_train
dataset: imagenet1k
dist_url: env://
epochs: 100
freeze_last_layer: 1
global_crops_scale: (0.4, 1.0)
gpu: 0
local_crops_number: (8,)
local_crops_scale: (0.05, 0.4)
local_crops_size: (96,)
local_rank: 0
lr: 0.0005
min_lr: 1e-06
mixup: 0.8
mixup_mode: batch
mixup_prob: 1.0
mixup_switch_prob: 0.5
momentum_teacher: 0.996
norm_last_layer: False
num_mixup_views: 10
num_workers: 10
optimizer: adamw
opts: []
out_dim: 65536
output_dir: output_dir
patch_size: 16
pretrained_weights_ckpt: 
rank: 0
recount: 1
remode: pixel
reprob: 0.25
resplit: False
sampler: distributed
saveckp_freq: 5
seed: 0
smoothing: 0.0
teacher_temp: 0.07
train_interpolation: bicubic
tsv_mode: False
use_bn_in_head: False
use_dense_prediction: True
use_fp16: True
use_mixup: False
warmup_epochs: 10
warmup_teacher_temp: 0.04
warmup_teacher_temp_epochs: 30
weight_decay: 0.04
weight_decay_end: 0.4
world_size: 4
zip_mode: False
Data loaded: there are 1281167 images.
=> merge config from experiments/imagenet/swin/swin_tiny_patch4_window7_224.yaml
Unknow architecture: swin_tiny
Student and Teacher are built: they are both swin_tiny network.
Loss, optimizer and schedulers ready.
Starting training of EsViT ! from epoch 0

Epoch: [0/100] Total time: 2:09:19 (1.162958 s / it)
Averaged stats: loss: 4.714716 (6.780889)  lr: 0.000037 (0.000019)  wd: 0.040089 (0.040030)

So EsViT (with swin_tiny, W=7) is about 3 times slower than DINO (with deit_tiny and P=16). This was run on a machine with 4x V100 GPUs. In both cases, I set the batch size to roughly the highest value I could without out-of-memory exceptions.

Is it the case that my run of EsViT should be this row in table 1?

EsViT, Swin-T 28 808 78.1 75.7

If so, do you know why I am getting such contradictory results?

Thank you!

Missing Linear Evaluation Code for ResNet50

Hi, I was playing around with a custom dataset with SwinTiny and ResNet50.
SwinTiny works great (both training and Linear Evaluation). However, it seems like ResNet50 isn't supported in the Linear Evaluation.
Having seen that you have uploaded the logs for the ResNet50 training, would you mind updating the Linear Evaluation code, as well?

Thanks in advance :)

EDIT: I've added it myself, and I am running the Linear Evaluation now, but it would be good to have it in the official code, as well. Just a suggestion :)

[QUESTION] Results on correspondence learning

Hello,
I cannot seem to find in the paper which features are used for the correspondence matching in the appendix. Is it the last-layer features (coarse-grained), the first-layer features (fine-grained), or a combination of features at all depths (and if so, how are they combined)?
Thanks!

Allow arbitrary image sizes and upstream changes from Swin-Transformer-Object-Detection

It is useful in object detection context to allow arbitrary sizes by doing dynamic mask computation (probably possible only with relative position encoding).

These kinds of edits were done in https://github.com/SwinTransformer/Swin-Transformer-Object-Detection and in https://github.com/megvii-research/SOLQ/. It would be nice if you upstreamed these changes. This would simplify trying out EsViT checkpoints as pretraining for object detection.

Also, fyi I created a similar issue in SimMIM: microsoft/SimMIM#13. Overall, having some stable version of swin_transformer.py somewhere (maybe even in main SwinTransformer/Swin-Transformer repo?) supporting dynamic masking would help a lot :)

Thanks!

Results without multi-crop

Hello,
Thanks for the code. I have noticed that the multi-crop trick can boost performance by about 5% top-1 accuracy (on DINO, SwAV). Since your code base supports disabling this trick, did you conduct experiments without multi-crop, and would you be so kind as to share the results on ImageNet?

args used for first table in README

Hello,
Could you please provide the args used for running main_esvit.py for each run in the table below (the first table in the README)? Are the args different for each entry?

  • EsViT (Swin) with network configurations of increased model capacities, pre-trained with both view-level and region-level tasks. ResNet-50 trained with both tasks is shown as a reference.
| arch | params | linear | k-NN | download | logs |
|------|--------|--------|------|----------|------|
| ResNet-50 | 23M | 75.7% | 71.3% | full ckpt | train / linear / knn |
| EsViT (Swin-T, W=7) | 28M | 78.0% | 75.7% | full ckpt | train / linear / knn |
| EsViT (Swin-S, W=7) | 49M | 79.5% | 77.7% | full ckpt | train / linear / knn |
| EsViT (Swin-B, W=7) | 87M | 80.4% | 78.9% | full ckpt | train / linear / knn |
| EsViT (Swin-T, W=14) | 28M | 78.7% | 77.0% | full ckpt | train / linear / knn |
| EsViT (Swin-S, W=14) | 49M | 80.8% | 79.1% | full ckpt | train / linear / knn |
| EsViT (Swin-B, W=14) | 87M | 81.3% | 79.3% | full ckpt | train / linear / knn |

Thank you!

Unable to reproduce the KNN results

Hi,
I am trying to reproduce the knn results but fail to do so. I am using the pretrained model from the checkpoint on ImageNet-1K following the script provided.

I got the following results:

10-NN classifier result: Top1: 1.876, Top5: 3.462
20-NN classifier result: Top1: 1.872, Top5: 3.912
100-NN classifier result: Top1: 1.85, Top5: 4.884
200-NN classifier result: Top1: 1.834, Top5: 5.352

Is there any chance that the model checkpoint is incorrect?

Thanks!

Training on custom dataset

What should a custom dataset structure look like, and how do I train on it?
Let's say I have a binary dataset with two class folders: 1. Has cat, 2. No cat. Each sub-folder contains images.
What changes to the code and dataset should I make?
Thanks in advance.

Loss stops decreasing

Hi,

I'm training EsViT from scratch on a custom dataset (1.7M images) with Swin-T, W=14, a batch size of 64, the default lr and wd, and the following hyperparameters:
--teacher_temp 0.04
--warmup_teacher_temp 0.03
--momentum_teacher 0.9996
--warmup_epochs 10
--warmup_teacher_temp_epochs 30
--use_dense_prediction True
--use_fp16 True
--out_dim 65536
--epochs 300 \

The loss does not decrease from epoch 70 onwards.

Which hyperparameters would you recommend tuning when resuming from, let's say, epoch 70?

Thanks


Some questions in paper

[Screenshot of the slide describing pre-train task 2: region-level]

Hello, I have been studying your article recently. I noticed that your slides describe pre-train task 2 (region-level) as shown in the picture above. However, the actual code does not appear to input local images into the teacher model. In addition, I am not quite clear about the region-level loss function: does it compute the similarity matrix between the local features output by the student model and the global features of the teacher model? I hope you can answer my two questions at your convenience.
