swintransformer / video-swin-transformer

This project forked from open-mmlab/mmaction2


This is an official implementation of "Video Swin Transformer".

Home Page: https://arxiv.org/abs/2106.13230

License: Apache License 2.0

Python 97.58% Dockerfile 0.30% Shell 2.06% Makefile 0.03% Batchfile 0.03%
swin-transformer video-recognition

video-swin-transformer's People

Contributors

carolinecheng233, congee524, dreamerlin, hellock, hust-nj, hypnosxc, innerlee, irvingzhang0512, jackytown, jin-s13, joannalxy, kennymckormick, magicdream2222, mmeendez8, parskatt, rlleshi, sczwangxiao, sebastienlinker, shoufachen, sunnyxiaohu, tangh, wangruohui, wjn922, wwdok, xwen99, yaochaorui, yrquni, yuta1125tp, yzfly, zeliu98


video-swin-transformer's Issues

KeyError: "Recognizer3D: 'SwinTransformer3D is not in the models registry'"

Describe the bug

While running the training script tools/train.py, this error occurs.

Reproduction
Run the command:

python Video-Swin-Transformer/tools/train.py _Video-Swin-Transformer/configs/recognition/swin/swin_base_patch244_window877_kinetics600_22k.py
  1. Did you make any modifications on the code or config? - No. Did you understand what you have modified? - No
  2. What dataset did you use? - Kinetics600

Environment

  1. Please run PYTHONPATH=${PWD}:$PYTHONPATH python mmaction/utils/collect_env.py to collect necessary environment information and paste it here.
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
sys.platform: linux
Python: 3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 16:07:37) [GCC 9.3.0]
CUDA available: True
GPU 0: Tesla P100-PCIE-16GB
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.0_bu.TC445_37.28845127_0
GCC: gcc (Debian 8.3.0-6) 8.3.0
PyTorch: 1.7.0
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 10.2
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75
  - CuDNN 7.6.5
  - Magma 2.5.2
  - Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

TorchVision: 0.10.0+cu102
OpenCV: 4.5.3
MMCV: 1.3.12
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.2
MMAction2: 0.17.0+
  2. You may add additional information that may be helpful for locating the problem, such as
    • How you installed PyTorch [e.g., pip, conda, source] -- using pip
    • Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.) - None

Error traceback

{'type': 'Recognizer3D', 'backbone': {'type': 'SwinTransformer3D', 'patch_size': (2, 4, 4), 'embed_dim': 128, 'depths': [2, 2, 18, 2], 'num_heads': [4, 8, 16, 32], 'window_size': (8, 7, 7), 'mlp_ratio': 4.0, 'qkv_bias': True, 'qk_scale': None, 'drop_rate': 0.0, 'attn_drop_rate': 0.0, 'drop_path_rate': 0.2, 'patch_norm': True}, 'cls_head': {'type': 'I3DHead', 'in_channels': 1024, 'num_classes': 600, 'spatial_type': 'avg', 'dropout_ratio': 0.5}, 'test_cfg': {'average_clips': 'prob', 'max_testing_views': 2}}
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/mmcv/utils/registry.py", line 52, in build_from_cfg
    return obj_cls(**args)
  File "/opt/conda/lib/python3.7/site-packages/mmaction/models/recognizers/base.py", line 75, in __init__
    self.backbone = builder.build_backbone(backbone)
  File "/opt/conda/lib/python3.7/site-packages/mmaction/models/builder.py", line 29, in build_backbone
    return BACKBONES.build(cfg)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/utils/registry.py", line 212, in build
    return self.build_func(*args, **kwargs, registry=self)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/cnn/builder.py", line 27, in build_model_from_cfg
    return build_from_cfg(cfg, registry, default_args)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/utils/registry.py", line 45, in build_from_cfg
    f'{obj_type} is not in the {registry.name} registry')
KeyError: 'SwinTransformer3D is not in the models registry'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "Video-Swin-Transformer/tools/train.py", line 196, in <module>
    main()
  File "Video-Swin-Transformer/tools/train.py", line 154, in main
    model = build_model(cfg.model,train_cfg=cfg.get('train_cfg'),test_cfg=cfg.get('test_cfg'))
  File "/opt/conda/lib/python3.7/site-packages/mmaction/models/builder.py", line 70, in build_model
    return build_localizer(cfg)
  File "/opt/conda/lib/python3.7/site-packages/mmaction/models/builder.py", line 62, in build_localizer
    return LOCALIZERS.build(cfg)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/utils/registry.py", line 212, in build
    return self.build_func(*args, **kwargs, registry=self)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/cnn/builder.py", line 27, in build_model_from_cfg
    return build_from_cfg(cfg, registry, default_args)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/utils/registry.py", line 55, in build_from_cfg
    raise type(e)(f'{obj_cls.__name__}: {e}')
KeyError: "Recognizer3D: 'SwinTransformer3D is not in the models registry'"

Other packages versions

mmcv-full == 1.3.12
pytorch==1.7.0
mmaction2==0.18.0
mmdet == 2.16.0
scipy==1.6.3
numpy==1.19.5
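
One thing worth checking, judging from the traceback above: the model is built from the mmaction package under /opt/conda/.../site-packages, i.e. the pip-installed mmaction2, rather than from this fork's local mmaction package, which is the one that registers SwinTransformer3D. A minimal diagnostic sketch (an assumption about the likely cause, not an official fix):

import mmaction
from mmaction.models import BACKBONES

print(mmaction.__file__)                             # repo checkout vs. site-packages copy
print('SwinTransformer3D' in BACKBONES.module_dict)  # False would explain the KeyError

If the second line prints False, making sure the fork's own mmaction is the one imported (for example by running from the repo root with PYTHONPATH pointing at it, or installing the fork in editable mode) is a likely remedy.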

The model's behavior is different from the picture in the paper.

Hello. Thank you for providing a good paper with good code.

I had a question while experimenting with video swin transformer.

Input size: (1, 3, 8, 384, 384)
SwinTransformer3D(patch_size=(2, 4, 4), all other settings default)

I measured the output size per layer in the forward pass.
The result is:
after 1 layer output shape : torch.Size([1, 192, 4, 48, 48])
after 2 layer output shape : torch.Size([1, 384, 4, 24, 24])
after 3 layer output shape : torch.Size([1, 768, 4, 12, 12])
after 4 layer output shape : torch.Size([1, 768, 4, 12, 12])

As the paper illustrates,
after 1 layer output shape : torch.Size([1, 96, 4, 96, 96])
after 2 layer output shape : torch.Size([1, 192, 4, 48, 48])
after 3 layer output shape : torch.Size([1, 384, 4, 24, 24])
after 4 layer output shape : torch.Size([1, 768, 4, 12, 12])
I think this is right.

I know it's hard work, but can I ask you to check it out?
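
For reference, the per-stage output shapes can be checked directly with forward hooks; a minimal sketch, assuming the fork's backbone is importable as below and exposes its stages as model.layers, as in the released code:

import torch
from mmaction.models.backbones.swin_transformer import SwinTransformer3D

model = SwinTransformer3D(patch_size=(2, 4, 4))  # all other settings default
for i, layer in enumerate(model.layers):
    layer.register_forward_hook(
        lambda mod, inp, out, i=i: print(f'after layer {i + 1}: {tuple(out.shape)}'))

model.eval()
with torch.no_grad():
    model(torch.randn(1, 3, 8, 384, 384))  # same input size as above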

Severe overfitting occurred.

Dear author:
I trained a lite-base version of Video Swin Transformer, but I noticed a very severe overfitting phenomenon, as shown below:

, data_time: 0.001, memory: 20882, top1_acc: 0.7600, top5_acc: 0.9206, loss_cls: 0.9247, loss: 0.9247
2022-02-15 10:24:18,650 - mmaction - INFO - Epoch [13][5860/5929]	lr: 2.714e-05, eta: 2 days, 3:27:33, time: 0.669, data_time: 0.001, memory: 20882, top1_acc: 0.7569, top5_acc: 0.9269, loss_cls: 0.9281, loss: 0.9281
2022-02-15 10:24:31,952 - mmaction - INFO - Epoch [13][5880/5929]	lr: 2.714e-05, eta: 2 days, 3:27:20, time: 0.664, data_time: 0.000, memory: 20882, top1_acc: 0.7462, top5_acc: 0.9313, loss_cls: 0.9472, loss: 0.9472
2022-02-15 10:24:45,297 - mmaction - INFO - Epoch [13][5900/5929]	lr: 2.714e-05, eta: 2 days, 3:27:07, time: 0.668, data_time: 0.001, memory: 20882, top1_acc: 0.7556, top5_acc: 0.9250, loss_cls: 0.9117, loss: 0.9117
2022-02-15 10:24:58,546 - mmaction - INFO - Epoch [13][5920/5929]	lr: 2.714e-05, eta: 2 days, 3:26:53, time: 0.662, data_time: 0.001, memory: 20882, top1_acc: 0.7506, top5_acc: 0.9256, loss_cls: 0.9624, loss: 0.9624
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 33663/33663, 139.1 task/s, elapsed: 242s, ETA:     0s

2022-02-15 10:29:10,037 - mmaction - INFO - Evaluating top_k_accuracy ...
2022-02-15 10:29:12,502 - mmaction - INFO - 
top1_acc	0.5948
top5_acc	0.8161
2022-02-15 10:29:12,502 - mmaction - INFO - Evaluating mean_class_accuracy ...
2022-02-15 10:29:12,608 - mmaction - INFO - 
mean_acc	0.5943
2022-02-15 10:29:12,626 - mmaction - INFO - Epoch(val) [13][421]	top1_acc: 0.5948, top5_acc: 0.8161, mean_class_accuracy: 0.5943

After training for 30 epochs, the training top-1 reached 90+%, but the validation accuracy stayed at ~59%.

I followed most of the settings of Swin-B:

        drop_rate=0.,
        attn_drop_rate=0.,
        drop_path_rate=0.2,
        patch_norm=True),


    cls_head=dict(
        type='I3DHead',
        in_channels=1024,
        num_classes=700,
        spatial_type='avg',
        dropout_ratio=0.5),


# optimizer
optimizer = dict(type='AdamW', lr=3e-4, betas=(0.9, 0.999), weight_decay=0.05,
                 paramwise_cfg=dict(custom_keys={'absolute_pos_embed': dict(decay_mult=0.),
                                                 'relative_position_bias_table': dict(decay_mult=0.),
                                                 'norm': dict(decay_mult=0.),
                                                 'backbone': dict(lr_mult=0.1)}))

Does anyone have the same issue? Could anyone give some tips? Thank you.

Inaccessible Download Links

The download links for the Kinetics 400 pretrained models are on pan.baidu.com. Many people are not able to download these at all, because you need to create an account (with a phone number) to download files from that site. If you are in Germany or the UK, like me, it is not possible to create an account to download these. Please host them somewhere else to make them available to the general public.

KeyError: 'patch_embed.proj.weight'

Describe the bug
When trying to fine-tune a pretrained model, the following error occurs:
KeyError: 'patch_embed.proj.weight'
For the line:
state_dict['patch_embed.proj.weight'] = state_dict['patch_embed.proj.weight'].unsqueeze(2).repeat(1,1,self.patch_size[0],1,1) / self.patch_size[0]

Reproduction

  1. What command or script did you run?
python3 tools/train.py configs/recognition/swin/swin_small_patch244_window877_kinetics400_1k.py --cfg-options model.backbone.pretrained=pretrained/swin_small_patch244_window877_kinetics400_1k.pth model.backbone.use_checkpoint=True
  2. Did you make any modifications on the code or config? Did you understand what you have modified?
  • Only changed the following:
--- a/mmaction/models/backbones/swin_transformer.py
+++ b/mmaction/models/backbones/swin_transformer.py
-        state_dict = checkpoint['model']
+       state_dict = checkpoint['state_dict'] #checkpoint['model']

  3. What dataset did you use?
     kinetics-based

Environment

  1. Please run PYTHONPATH=${PWD}:$PYTHONPATH python mmaction/utils/collect_env.py to collect necessary environment information and paste it here.
    sys.platform: linux
    Python: 3.8.5 (default, Jan 27 2021, 15:41:15) [GCC 9.3.0]
    CUDA available: True
    GPU 0,1,2,3: Quadro RTX 8000
    CUDA_HOME: /usr/local/cuda
    NVCC: Build cuda_11.1.TC455_06.29190527_0
    GCC: gcc (Ubuntu 9.3.0-10ubuntu2) 9.3.0
    PyTorch: 1.9.0+cu102
    PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v2.1.2 (Git Hash 98be7e8afa711dc9b66c8ff3504129cb82013cdb)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 10.2
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70
  • CuDNN 7.6.5
  • Magma 2.5.2
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PT
    HREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-
    field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-st
    rict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -falig
    ned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TO
    RCH_VERSION=1.9.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.10.0+cu102
OpenCV: 4.5.3
MMCV: 1.3.13
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.2
MMAction2: 0.15.0+db018fb

Error traceback


2021-09-12 22:44:38,014 - mmaction - INFO - load model from: pretrained/swin_small_patch244_window877_kinetics400_1k.pth
Traceback (most recent call last):
  File "<venv dir>/lib/python3.8/site-packages/mmcv/utils/registry.py", line 52, in build_from_cfg
    return obj_cls(**args)
  File "Video-Swin-Transformer/mmaction/models/recognizers/base.py", line 109, in __init__
    self.init_weights()
  File "Video-Swin-Transformer/mmaction/models/recognizers/base.py", line 126, in init_weights
    self.backbone.init_weights()
  File "Video-Swin-Transformer/mmaction/models/backbones/swin_transformer.py", line 641, in init_weights
    self.inflate_weights(logger)
  File "Video-Swin-Transformer/mmaction/models/backbones/swin_transformer.py", line 588, in inflate_weights
    state_dict['patch_embed.proj.weight'] = state_dict['patch_embed.proj.weight'].unsqueeze(2).repeat(1,1,self.patch_size[0],1,1) / self.patch_size[0]
KeyError: 'patch_embed.proj.weight'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tools/train.py", line 201, in <module>
    main()
  File "tools/train.py", line 156, in main
    model = build_model(
  File "Video-Swin-Transformer/mmaction/models/builder.py", line 70, in build_model
    return build_localizer(cfg)
  File "Video-Swin-Transformer/mmaction/models/builder.py", line 62, in build_localizer
    return LOCALIZERS.build(cfg)
  File "<venv_dir>/lib/python3.8/site-packages/mmcv/utils/registry.py", line 212, in build
    return self.build_func(*args, **kwargs, registry=self)
  File "<venv dir>/lib/python3.8/site-packages/mmcv/cnn/builder.py", line 27, in build_model_from_cfg
    return build_from_cfg(cfg, registry, default_args)
  File "<venv dir>/lib/python3.8/site-packages/mmcv/utils/registry.py", line 55, in build_from_cfg
    raise type(e)(f'{obj_cls.__name__}: {e}')
KeyError: "Recognizer3D: 'patch_embed.proj.weight'"

Bug fix

Looks like the pretrained models are compatible with an older version of mmaction2, but I couldn't find which one.

Thanks!
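
A small diagnostic that can help here (a sketch, assuming the checkpoint file is available locally) is to inspect what the checkpoint actually stores, since inflate_weights expects to find the key patch_embed.proj.weight in whichever dict it loads:

import torch

ckpt = torch.load('pretrained/swin_small_patch244_window877_kinetics400_1k.pth',
                  map_location='cpu')
print(list(ckpt.keys()))  # e.g. ['meta', 'state_dict'] or ['model', ...]
sd = ckpt.get('state_dict', ckpt.get('model', ckpt))
print([k for k in sd if 'patch_embed' in k])  # keys may carry a 'backbone.' prefix

If the keys show up as backbone.patch_embed.proj.weight, the checkpoint already holds 3D recognizer weights, so loading it via load_from rather than model.backbone.pretrained (which triggers 2D-to-3D inflation) may be the more appropriate route; this is an inference from the traceback, not a confirmed fix.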

AttributeError: 'Recognizer2D' object has no attribute 'demo/label_map_k400'

I installed as instructed.

The following line
inference_recognizer(model, 'demo/demo.mp4', 'demo/label_map_k400.txt')

Gave me an error
AttributeError: 'Recognizer2D' object has no attribute 'demo/label_map_k400'

My guess is that the Kinetics-400 dataset is not installed properly. I'm not sure how to install it as needed.

Thank You
Tom

Which version of Kinetics400 do you use?

There are many different versions of Kinetics-400, and some have more videos than others. Can I ask which version you use and what the statistics of your train and test sets are, i.e., how many train and test videos you have?

AttributeError: Module tools/data/kinetics/label_map_k400.txt not found

I got to the end of the installation and ran the script that checks whether the installation was done correctly.

In the open_mmlab environment, the very last line of the script
inference_recognizer(model, 'demo/demo.mp4', 'demo/label_map_k400.txt')
received an error
AttributeError: 'Recognizer2D' object has no attribute 'demo/label_map_k400'

How to proceed?

Thank You
Tom

How long does it take to train an epoch with SWIN-B?

I used Swin-B to train on the EPIC-KITCHENS dataset, but it takes me almost 27 hours for one training epoch (mixed precision was already applied).
I used 4 V100 GPUs with batch_size=8.
Is this the normal training time?

About the 3D relative position bias

In the subsection "3D relative position bias" of your paper, a bias is added in the self-attention computation.
I don't fully understand it.
[screenshot of the paper's equation: Attention(Q, K, V) = SoftMax(QK^T/sqrt(d) + B)V]

According to your description, Q, K, V are all matrices with P*M^2 rows and d columns, so QK^T will be a square matrix with P*M^2 rows and P*M^2 columns. To make the summation valid, the 3D relative position bias B should also be a square matrix with P*M^2 rows and P*M^2 columns. So how are the values in B set? Specifically, how is the entry B(i, j) set?
I can't see the link between B and the parameterized bias matrix B-hat in R^{(2P-1)x(2M-1)x(2M-1)} shown in the second screenshot.
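
For what it's worth, here is a minimal sketch (following the released 2D Swin code, not taken from this repository) of how B can be realized: a learnable table with (2P-1)(2M-1)(2M-1) entries per head, indexed by the relative (frame, row, column) offset between token i and token j, so that B(i, j) = table[index[i, j]]:

import torch

def relative_position_index_3d(P, Mh, Mw):
    # coordinates of every token in a (P, Mh, Mw) window
    coords = torch.stack(torch.meshgrid(
        torch.arange(P), torch.arange(Mh), torch.arange(Mw)))   # 3, P, Mh, Mw
    coords = torch.flatten(coords, 1)                            # 3, N  (N = P*Mh*Mw)
    rel = coords[:, :, None] - coords[:, None, :]                # 3, N, N  pairwise offsets
    rel = rel.permute(1, 2, 0).contiguous()                      # N, N, 3
    rel[:, :, 0] += P - 1                                        # shift offsets to start at 0
    rel[:, :, 1] += Mh - 1
    rel[:, :, 2] += Mw - 1
    rel[:, :, 0] *= (2 * Mh - 1) * (2 * Mw - 1)                  # flatten the 3D offset
    rel[:, :, 1] *= (2 * Mw - 1)                                 # into a single table index
    return rel.sum(-1)                                           # N, N integer index matrix

index = relative_position_index_3d(8, 7, 7)                      # window size (8, 7, 7)
# B[i, j] (one value per head) = bias_table[index[i, j]], with bias_table of
# shape ((2*8-1)*(2*7-1)*(2*7-1), num_heads)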

Any plan for spatial temporal localization?

Hi~ Thanks for your great work!
Do you have any plan to run experiments on the spatial-temporal localization task (such as on AVA)?
I'm curious about the comparison of Swin and MViT on the Spatial-Temporal Localization task.
Looking forward to your reply~ Thanks a lot~

Are you interested in creating a PR under MMAction2?

Firstly, congratulations on the work “Video Swin Transformer”, and thanks for open-sourcing the code of this project. Are you interested in creating a PR under MMAction2?

Also, if you find MMAction2 useful in your research, please consider cite:

@misc{2020mmaction2,
    title={OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark},
    author={MMAction2 Contributors},
    howpublished = {\url{https://github.com/open-mmlab/mmaction2}},
    year={2020}
}

Can "demo/demo.py" run on MAC without GPU?

Thanks for sharing your great work!

I want to run the demo on my Mac and have installed the needed packages, but there are always some errors in demo.py.
So, can I run "demo/demo.py" on a Mac without a GPU?

Thanks in advance.
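
For reference, the high-level API accepts a device argument, so CPU-only inference should mostly be a matter of passing device='cpu'; a minimal sketch with placeholder paths (the checkpoint location below is hypothetical):

from mmaction.apis import init_recognizer, inference_recognizer

config_file = 'configs/recognition/swin/swin_tiny_patch244_window877_kinetics400_1k.py'
checkpoint_file = 'checkpoints/swin_tiny_patch244_window877_kinetics400_1k.pth'  # hypothetical local path

model = init_recognizer(config_file, checkpoint_file, device='cpu')
results = inference_recognizer(model, 'demo/demo.mp4')  # older versions also take a label-map path

Expect CPU inference to be much slower than on a GPU, of course.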

Where can I find the <PRETRAIN_MODEL>?

Hi, thanks for this fascinating work!
I want to follow the instructions bash tools/dist_train.sh <CONFIG_FILE> <GPU_NUM> --cfg-options load_from=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments] to run the program, but I don't know where I can find the pretrained model.
So I need some help, thanks to all of you!

Swin-L weight

Dear researchers,

Thank you for this very nice piece of work.

Can you also provide the weights of Swin-L as described in the paper?

Best regards,

lr is very small??

Dear all:
During training of Video Swin, I noticed that the printed learning rate is very small, e.g., 9.6e-5.

2022-02-11 11:25:41,507 - mmaction - INFO - Epoch [8][9220/11857]	lr: 9.668e-05, eta: 4 days, 12:07:34, time: 0.665, data_time: 0.000, memory: 20882, top1_acc: 0.6312, top5_acc: 0.8387, loss_cls: 1.5623, loss: 1.5623

However, I find that in the Swin-B config file the lr is set to 1e-3. So is this normal? Thank you.
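
A hedged reading of why the printed value differs from the nominal one (an illustration based on the settings quoted in other issues here, not a verified excerpt of the official config): the base lr is scaled by paramwise multipliers and by the warmup/cosine schedule, so a momentary backbone lr around 1e-4 is plausible.

# illustrative fragment; key values are assumptions, not a copy of the config
optimizer = dict(type='AdamW', lr=1e-3, weight_decay=0.05,
                 paramwise_cfg=dict(custom_keys={'backbone': dict(lr_mult=0.1)}))
lr_config = dict(policy='CosineAnnealing', min_lr=0,
                 warmup='linear', warmup_by_epoch=True, warmup_iters=2.5)
# backbone lr = 1e-3 * 0.1 = 1e-4 before cosine decay, already close to 9.6e-5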

Performance Reproducing of Swin-S

Hi,
Thanks for your great work.
I'm trying to reproduce the performance of Swin-S on K-400. Using the released checkpoint for evaluation, I got 80.11% accuracy; evaluating the Swin-S model trained by myself, I got 80.35% accuracy (still ~0.2% worse than the paper-reported one).
I wonder if anything is wrong. I suspect different validation data causes this, as some videos are missing in the current K-400 dataset. My validation set contains 19,870 videos and my training set contains 239,687 videos; how about the ones you use?
Thanks a lot in advance.
Best.

About the factorized spatiotemporal model as in Table 4

Thank you for your work and the code.

In addition to your released model and weights, I'm wondering if you can also release the model and pretrained weights for factorized spatiotemporal attention (Video-Swin-T), as discussed in Table 4 in your paper.

Reproducing results

Hi,

Thanks for the great work.

I'm having the same issue as #5 even when I tested the models with the same val split.

I played with Swin-T and Swin-B, and both of them gave 0.4%~0.5% lower top-1 accuracy than reported. They are still pretty neat, but I just want to make sure I am not doing anything wrong.

Would you confirm that the models and the split files uploaded are the correct ones?

Also, if anyone has successfully reproduced the results, please kindly comment here about whether there is anything else I need to do besides downloading the models and configs and running the test scripts.

Thanks,

the training iteration is abnormally large

I used 4 GPUs (2080 Ti) to train swin_small with the config swin_small_patch244_window877_kinetics400_1k.py. The dataset I used is HACS (about 500k videos). The following is part of my training log:

[training log screenshot omitted]

I find that the number of training iterations per epoch is abnormally large given my config (dataset size ~500k, batch_size 8), which leads to a long training time. Is that normal?
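
A quick back-of-the-envelope check (a sketch, assuming iterations per epoch = samples / (GPUs x videos_per_gpu)):

num_videos = 500_000            # HACS, roughly 500k training videos
num_gpus, videos_per_gpu = 4, 8
iters_per_epoch = num_videos // (num_gpus * videos_per_gpu)
print(iters_per_epoch)          # 15625, so a very large iteration count is expected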

Do you have any result on other video dataset, like Charades?

I finetuned the swin-base on Charades with the setting as follow:

  1. optimizer: I used AdamW with lr=75e-6, betas=(0.9, 0.999), weight_decay=5e-2; other settings just follow the config that you provided.
  2. learning policy: CosineAnnealing with linear warmup for 2.5 epochs.
  3. loss function: AsymmetricLoss [1] with neg=4 and pos=1
  4. train_pipeline: clip_len=32, frame_interval=2, num_clips=1, with RandomRescale (256, 340) following the setting of the SlowFast network, RandomResizedCrop, Resize(224, 224) and Flip(0.5)
  5. val_pipeline: clip_len=32, frame_interval=2, num_clips=10, Resize(-1, 256), CenterCrop(256), Flip(0.5)
    When the total epoch count is 30, I got a final val mAP of 44.96.
    When the total epoch count is 60, I got a final val mAP of 45.88.
    Is my result correct? Do you have any suggestions about fine-tuning Swin on other datasets?

ref: [1] Ben-Baruch, E., Ridnik, T., Zamir, N., Noy, A., Friedman, I., Protter, M., & Zelnik-Manor, L. (2020). Asymmetric loss for multi-label classification. arXiv preprint arXiv:2009.14119.
code: https://github.com/Alibaba-MIIL/ASL

area_range in RandomResizedCrop?

Dear author:
I noticed the default area_range of RandomResizedCrop is (0.08, 1.0), which may not be appropriate for video recognition because the lower bound is too small.
I guess we need to specify the area_range like this:

dict(type='RandomResizedCrop', area_range=(0.75, 1.0))

Has anyone else realized this, or am I wrong? Thank you.

RuntimeError: Unable to find a valid cuDNN algorithm to run convolution

I was using the command python tools/test.py ./configs/recognition/swin/swin_small_patch244_window877_kinetics400_1k.py ./swin_small_patch244_window877_kinetics400_1k.pth --eval top_k_accuracy to do inference when the error occurred. Someone says that I need to decrease the batch size, but I didn't find the corresponding parameter.

ETA:Traceback (most recent call last):
File "/root/obelisk/Collection/Video-Swin-Transformer-master/tools/test.py", line 364, in
main()
File "/root/obelisk/Collection/Video-Swin-Transformer-master/tools/test.py", line 349, in main
outputs = inference_pytorch(args, cfg, distributed, data_loader)
File "/root/obelisk/Collection/Video-Swin-Transformer-master/tools/test.py", line 160, in inference_pytorch
outputs = single_gpu_test(model, data_loader)
File "/root/anaconda3/lib/python3.9/site-packages/mmcv/engine/test.py", line 33, in single_gpu_test
result = model(return_loss=False, **data)
File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/anaconda3/lib/python3.9/site-packages/mmcv/parallel/data_parallel.py", line 50, in forward
return super().forward(*inputs, **kwargs)
File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
return self.module(*inputs[0], **kwargs[0])
File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/obelisk/Collection/Video-Swin-Transformer-master/mmaction/models/recognizers/base.py", line 258, in forward
return self.forward_test(imgs, **kwargs)
File "/root/obelisk/Collection/Video-Swin-Transformer-master/mmaction/models/recognizers/recognizer3d.py", line 90, in forward_test
return self._do_test(imgs).cpu().numpy()
File "/root/obelisk/Collection/Video-Swin-Transformer-master/mmaction/models/recognizers/recognizer3d.py", line 47, in _do_test
x = self.extract_feat(batch_imgs)
File "/root/anaconda3/lib/python3.9/site-packages/mmcv/runner/fp16_utils.py", line 98, in new_func
return old_func(*args, **kwargs)
File "/root/obelisk/Collection/Video-Swin-Transformer-master/mmaction/models/recognizers/base.py", line 157, in extract_feat
x = self.backbone(imgs)
File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/obelisk/Collection/Video-Swin-Transformer-master/mmaction/models/backbones/swin_transformer.py", line 652, in forward
x = self.patch_embed(x)
File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/obelisk/Collection/Video-Swin-Transformer-master/mmaction/models/backbones/swin_transformer.py", line 449, in forward
x = self.proj(x) # B C D Wh Ww
File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 590, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 585, in _conv_forward
return F.conv3d(
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
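
For reference, the test-time memory footprint is usually governed by the data section of the config (mmaction2 configs are plain Python); a hedged sketch of the fields one would lower, with key names that may differ slightly across versions:

# illustrative override, not a verified excerpt of this repo's config
data = dict(
    videos_per_gpu=1,                        # clips per GPU
    test_dataloader=dict(videos_per_gpu=1),  # value used only during testing
)
model = dict(test_cfg=dict(max_testing_views=2))  # forward at most 2 views at a time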

Official Pytorch API or model?

Hi!

I'm a researcher planning to use this to classify time-lapse biomedical data. Is there any official PyTorch API with pretrained weights?

I'm currently using ResNet 3D that is available off-the-shelf in Pytorch
https://pytorch.org/vision/stable/models.html#video-classification

But I believe transformers will give me better results.

There are also these repos:
https://github.com/haofanwang/video-swin-transformer-pytorch
https://github.com/berniwal/swin-transformer-pytorch

But I'm having trouble getting them to work; I'd like to use the official code if possible. I have also searched here without results:
https://paperswithcode.com/paper/video-swin-transformer#code

We only have grayscale images, so it would be great if it were possible to choose the number of channels (and classes).
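
On the grayscale point, one low-effort workaround that avoids touching the model definition (a sketch, assuming input laid out as N, C, T, H, W) is to repeat the single channel three times before feeding the backbone:

import torch

gray = torch.randn(2, 1, 8, 224, 224)    # a batch of grayscale clips
rgb_like = gray.repeat(1, 3, 1, 1, 1)    # (N, 3, T, H, W), accepted by 3-channel models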

ValueError: batch_size should be a positive integer value, but got batch_size=0

If you feel we have helped you, give us a STAR! 😆

Notice

There are several common situations in the reimplementation issues as below

  1. Reimplement a model in the model zoo using the provided configs
  2. Reimplement a model in the model zoo on another dataset (e.g., custom datasets)
  3. Reimplement a custom model but all the components are implemented in MMAction2
  4. Reimplement a custom model with new modules implemented by yourself

There are several things to do for different cases as below.

  • For case 1 & 3, please follow the steps in the following sections so we can quickly identify the issue.
  • For case 2 & 4, please understand that we are not able to help much here, because we usually do not know the full code and users are responsible for the code they write.
  • One suggestion for case 2 & 4 is that users should first check whether the bug lies in the self-implemented code or in the original code. For example, users can first make sure that the same model runs well on supported datasets. If you still need help, please describe what you have done and what you obtained in the issue, follow the steps in the following sections, and be as clear as possible so that we can better help you.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The issue has not been fixed in the latest version.

Describe the issue

The problem of CUDA out of memory appeared during model reimplementation. I adjusted videos_per_gpu to 1 (https://github.com/SwinTransformer/Video-Swin-Transformer/blob/db018fb8896251711791386bbd2127562fd8d6a6/configs/recognitionow py#L66), and a new problem occurred.

Reproduction

  1. What command or script did you run?
     python tools/train.py 'configs/recognition/swin/swin_base_patch244_window1677_sthv2.py'
  2. What config dir you run?
     configs/recognition/swin/swin_base_patch244_window1677_sthv2.py
  3. Did you make any modifications on the code or config? Did you understand what you have modified?
     I adjusted videos_per_gpu to 1 (https://github.com/SwinTransformer/Video-Swin-Transformer/blob/db018fb8896251711791386bbd2127562fd8d6a6/configs/recognitionow py#L66)
  4. What dataset did you use?
     sthv2

Environment

  1. Please run PYTHONPATH=${PWD}:$PYTHONPATH python mmaction/utils/collect_env.py to collect necessary environment information and paste it here.

fatal: Not a git repository (or any parent up to mount point /home)
    Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
    sys.platform: linux
    Python: 3.6.10 (default, Dec 19 2019, 23:04:32) [GCC 5.4.0 20160609]
    CUDA available: True
    GPU 0,1,2,3,4,5: TITAN Xp
    CUDA_HOME: /usr/local/cuda-10.2
    NVCC: Cuda compilation tools, release 10.2, V10.2.89
    GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
    PyTorch: 1.6.0
    PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) Math Kernel Library Version 2019.0.5 Product Build 20190808 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v1.5.0 (Git Hash e2ac1fac44c5078ca927cb9b90e1b3066a0b2ed0)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 10.2
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75
  • CuDNN 7.6.5
  • Magma 2.5.2
  • Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,

TorchVision: 0.7.0
OpenCV: 4.4.0
MMCV: 1.3.14
MMCV Compiler: GCC 5.4
MMCV CUDA Compiler: 10.2
MMAction2: 0.18.0+

  2. You may add additional information that may be helpful for locating the problem, such as
    1. How you installed PyTorch [e.g., pip, conda, source]
    2. Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)

Results

"but got batch_size={}".format(batch_size))
ValueError: batch_size should be a positive integer value, but got batch_size=0


Issue fix

If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here; that would be much appreciated!

INSTALL

Dear the Authors,

I would like to ask how we can install Video-Swin-Transformer, and whether there is a notebook tutorial for training?

Thank you very much.

the common dataset setting

Dear author:
The error is "ValueError: VideoDataset: too many values to unpack (expected 2)".
Here are my thoughts on this error:
The input data is raw frames (I have already extracted the frames), but the dataset_type in configs/recognition/swin/~~.py is dataset_type = 'VideoDataset'. Should I change the dataset_type? If so, to what type?
Thank you!

if results['frame_inds'].ndim != 1: KeyError: 'frame_inds'

KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/envs/pytorch/lib/python3/site-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
data = fetcher.fetch(index)
File "/home/envs/pytorch/lib/python3/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/envs/pytorch/lib/python3/site-packages/torch/utils/data/_utils/fetch.py", line 44, in
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/Video-Swin-Transformer-master/mmaction/datasets/base.py", line 287, in getitem
return self.prepare_train_frames(idx)
File "/home/Video-Swin-Transformer-master/mmaction/datasets/rawframe_dataset.py", line 168, in prepare_train_frames
return self.pipeline(results)
File "/home/Video-Swin-Transformer-master/mmaction/datasets/pipelines/compose.py", line 41, in call
data = t(data)
File "/home/ideo-Swin-Transformer-master/mmaction/datasets/pipelines/loading.py", line 1153, in call
if results['frame_inds'].ndim != 1:
KeyError: 'frame_inds'

Details about input frames

Hi there,

Could you please explain "we sample a clip of 32 frames from each full-length video using a temporal stride of 2 and spatial size of 224×224, resulting in 16×56×56 input 3D tokens" in detail? How do you sample a clip? Does the temporal stride of 2 mean 2 FPS?
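
As a rough sketch of what a sampler like SampleFrames(clip_len=32, frame_interval=2, num_clips=1) does (an illustration, not the library's exact code): it picks a random start and then takes every second frame, so one clip covers 64 consecutive frames rather than a fixed 2 FPS, and with a (2, 4, 4) patch size the 32x224x224 clip becomes 16x56x56 tokens.

import numpy as np

def sample_clip(num_frames, clip_len=32, frame_interval=2):
    span = clip_len * frame_interval                 # 64 frames of raw video
    start = np.random.randint(0, max(num_frames - span + 1, 1))
    inds = start + np.arange(clip_len) * frame_interval
    return np.clip(inds, 0, num_frames - 1)          # guard against short videos

print(sample_clip(300)[:5])  # e.g. [17 19 21 23 25]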

THUMOS14 fetch_tag_proposal.sh doesn't work

Checklist

  • I have searched related issues but cannot get the expected help.
  • The bug has not been fixed in the latest version.

Describe the bug

fetch_tag_proposal.sh for the THUMOS14 dataset doesn't work because the following links are forbidden:

https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/filelist/thumos14_tag_val_normalized_proposal_list.txt
https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/filelist/thumos14_tag_test_normalized_proposal_list.txt

Reproduction

  1. What command or script did you run?
     cd $MMACTION2/tools/data/thumos14/
     bash fetch_tag_proposals.sh
  2. Did you make any modifications on the code or config? Did you understand what you have modified?
     Links of thumos14_tag_val_normalized_proposal_list.txt and thumos14_tag_test_normalized_proposal_list.txt are invalid.
  3. What dataset did you use?
     THUMOS14

Environment

  1. Please run PYTHONPATH=${PWD}:$PYTHONPATH python mmaction/utils/collect_env.py to collect necessary environment information and paste it here.
sys.platform: linux
Python: 3.8.5 (default, Sep  4 2020, 07:30:14) [GCC 7.3.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA Quadro RTX 8000
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.0_bu.TC445_37.28845127_0
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.7.1
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.0
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.0.5
  - Magma 2.5.2
  - Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

TorchVision: 0.8.2
OpenCV: 4.5.3
MMCV: 1.3.6
MMCV Compiler: GCC 7.5
MMCV CUDA Compiler: 11.0
MMAction2: 0.17.0+acce52d

Error traceback


root@###########:/mmaction2/tools/data/thumos14# bash fetch_tag_proposals.sh
../../../data/thumos14/proposals does not exist. Creating
--2021-09-10 00:25:51--  https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/filelist/thumos14_tag_val_normalized_proposal_list.txt
Resolving open-mmlab.s3.ap-northeast-2.amazonaws.com (open-mmlab.s3.ap-northeast-2.amazonaws.com)... 52.219.60.147
Connecting to open-mmlab.s3.ap-northeast-2.amazonaws.com (open-mmlab.s3.ap-northeast-2.amazonaws.com)|52.219.60.147|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2021-09-10 00:25:51 ERROR 403: Forbidden.

--2021-09-10 00:25:51--  https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/filelist/thumos14_tag_test_normalized_proposal_list.txt
Resolving open-mmlab.s3.ap-northeast-2.amazonaws.com (open-mmlab.s3.ap-northeast-2.amazonaws.com)... 52.219.56.71
Connecting to open-mmlab.s3.ap-northeast-2.amazonaws.com (open-mmlab.s3.ap-northeast-2.amazonaws.com)|52.219.56.71|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2021-09-10 00:25:51 ERROR 403: Forbidden.

Thank you.


modify the config file

When I use swin_base_patch244_window877_kinetics400_22k to train on my dataset, the config file is:

# dataset settings
dataset_type = 'VideoDataset'
data_root = 'data/kinetics400/train'
data_root_val = 'data/kinetics400/val'
ann_file_train = 'data/kinetics400/kinetics400_train_list.txt'
ann_file_val = 'data/kinetics400/kinetics400_val_list.txt'
ann_file_test = 'data/kinetics400/kinetics400_val_list.txt'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_bgr=False)
train_pipeline = [
    dict(type='DecordInit'),
    dict(type='SampleFrames', clip_len=32, frame_interval=2, num_clips=1),
    dict(type='DecordDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(type='RandomResizedCrop'),
    dict(type='Resize', scale=(224, 224), keep_ratio=False),
    dict(type='Flip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs', 'label'])
]

because my frame annotation txt format is:

    some/directory-1 163 1
    some/directory-2 122 1
    some/directory-3 258 2
    some/directory-4 234 2
    some/directory-5 295 3
    some/directory-6 121 3

I want to change dataset_type to 'RawframeDataset'; do I need to modify dict(type='DecordInit')?
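
For what it's worth, a hedged sketch of a raw-frame variant of the pipeline above (RawFrameDecode replaces the Decord video ops; names follow the mmaction2 raw-frame pipeline, img_norm_cfg is reused from the config shown above, and the annotation lines then need the "directory total_frames label" form quoted above):

dataset_type = 'RawframeDataset'
train_pipeline = [
    dict(type='SampleFrames', clip_len=32, frame_interval=2, num_clips=1),
    dict(type='RawFrameDecode'),  # instead of DecordInit + DecordDecode
    dict(type='Resize', scale=(-1, 256)),
    dict(type='RandomResizedCrop'),
    dict(type='Resize', scale=(224, 224), keep_ratio=False),
    dict(type='Flip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs', 'label'])
]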

the input is the video?

Before raising a question, you may need to check the following listed items.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. I have read the FAQ documentation but cannot get the expected help.

demo.py is not the newest, please update.

Original in your docs: inference_recognizer(model, 'demo/demo.mp4', 'demo/label_map_k400.txt')
The newest: inference_recognizer(model, 'demo/demo.mp4')
If it is not updated, when you run demo.py an error occurs at the line inference_recognizer(model, 'demo/demo.mp4', 'demo/label_map_k400.txt').

Drop path rate

Hi,

model=dict(backbone=dict(patch_size=(2,4,4), drop_path_rate=0.1), test_cfg=dict(max_testing_views=4))

The code says Swin-S uses a drop path rate of 0.1, but does that match the paper, which reports 0.2?
Swin-T and Swin-B use 0.1 and 0.3 respectively, as follows:

model=dict(backbone=dict(patch_size=(2,4,4), drop_path_rate=0.1), test_cfg=dict(max_testing_views=4))

model=dict(backbone=dict(patch_size=(2,4,4), drop_path_rate=0.3), test_cfg=dict(max_testing_views=4))

Thanks,

Keeping the temporal dimension

Hi, thanks for your fascinating work!

I want to use the Video Swin Transformer as a backbone, but my model should produce an output for each input frame.
Thus I want to keep the temporal dimension of the input after the forward pass.

So I'm thinking of changing the parameter like this: patch_size=(1, 4, 4), but I am concerned about whether this could violate the authors' intention of building spatio-temporal features.

Apart from the memory usage issue, is it okay to make the temporal size of the patch embedding 1?
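
A minimal shape check along those lines (a sketch, assuming the fork's backbone is importable as below; nothing here is an endorsement from the authors):

import torch
from mmaction.models.backbones.swin_transformer import SwinTransformer3D

backbone = SwinTransformer3D(patch_size=(1, 4, 4))  # temporal patch size 1
backbone.eval()
with torch.no_grad():
    out = backbone(torch.randn(1, 3, 8, 224, 224))  # N, C, T, H, W
print(out.shape)  # the temporal dimension should stay at 8 instead of being halved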

What head to use ?

Hi!

There is a problem with the Video Swin Transformer code at the moment, as it is written in a way that makes it impossible to change the number of target classes in an end-to-end fashion. If I want to use your model on another dataset containing, for example, 10 or 50 classes, the network only gives me the output that would be fed into a head.

I build a model:

model_VST = SwinTransformer3D()
model_VST.cuda()

You can see that I don't pass any class-number argument in the constructor parentheses; indeed, your code doesn't take that as an argument. Here are the arguments that your Video Swin Transformer model takes as input:
There is nothing for the number of target classes. Right now for an input shape of torch.Size([1, 8, 3, 64, 64]), I get an output shape of torch.Size([1, 768, 2, 2, 2]) from the model_VST (which is the SwinTransformer3D).

I understand that I need to add a head to it, but it is not clear at all in your code how to properly manage that. What head should I use?

Maybe you can do something like Facebook did with their TimeSformer: they made an end-to-end version of it for video classification.
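
In the meantime, a plain classification head along the lines of what I3DHead does can be bolted on manually; a minimal sketch (my own wrapper, not the authors' API), assuming backbone features of shape (N, 768, T', H', W'):

import torch
import torch.nn as nn

class VideoClsHead(nn.Module):
    def __init__(self, in_channels=768, num_classes=10, dropout=0.5):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)       # average over T', H', W'
        self.drop = nn.Dropout(dropout)
        self.fc = nn.Linear(in_channels, num_classes)

    def forward(self, x):                         # x: (N, C, T', H', W')
        x = self.pool(x).flatten(1)               # (N, C)
        return self.fc(self.drop(x))              # (N, num_classes) class scores

head = VideoClsHead(num_classes=50)
scores = head(torch.randn(1, 768, 2, 2, 2))       # matches the shape quoted above
print(scores.shape)                               # torch.Size([1, 50])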

KeyError: "Recognizer3D: 'SwinTransformer3D is not in the models registry'"

When I use:
python tools/train.py configs/recognition/swin/swin_base_patch244_window877_kinetics400_22k.py

an error occurred:

Traceback (most recent call last):
File "tools/train.py", line 199, in
main()
File "tools/train.py", line 154, in main
model = build_model(
File "/home/pytorch/lib/python3/site-packages/mmaction/models/builder.py", line 70, in build_model
return build_localizer(cfg)
File "/home/pytorch/lib/python3/site-packages/mmaction/models/builder.py", line 62, in build_localizer
return LOCALIZERS.build(cfg)
File "/home/pytorch/lib/python3/site-packages/mmcv/utils/registry.py", line 210, in build
return self.build_func(*args, **kwargs, registry=self)
File "/home/pytorch/lib/python3/site-packages/mmcv/cnn/builder.py", line 26, in build_model_from_cfg
return build_from_cfg(cfg, registry, default_args)
File "/home/pytorch/lib/python3/site-packages/mmcv/utils/registry.py", line 54, in build_from_cfg
raise type(e)(f'{obj_cls.name}: {e}')
KeyError: "Recognizer3D: 'SwinTransformer3D is not in the models registry'"

How to solve it?
