swintransformer / video-swin-transformer

This project forked from open-mmlab/mmaction2


This is an official implementation of "Video Swin Transformer".

Home Page: https://arxiv.org/abs/2106.13230

License: Apache License 2.0

Python 97.58% Dockerfile 0.30% Shell 2.06% Makefile 0.03% Batchfile 0.03%
swin-transformer video-recognition

video-swin-transformer's People

Contributors

carolinecheng233, congee524, dreamerlin, hellock, hust-nj, hypnosxc, innerlee, irvingzhang0512, jackytown, jin-s13, joannalxy, kennymckormick, magicdream2222, mmeendez8, parskatt, rlleshi, sczwangxiao, sebastienlinker, shoufachen, sunnyxiaohu, tangh, wangruohui, wjn922, wwdok, xwen99, yaochaorui, yrquni, yuta1125tp, yzfly, zeliu98


video-swin-transformer's Issues

KeyError: "Recognizer3D: 'SwinTransformer3D is not in the models registry'"

Describe the bug

While running the training script tools/train.py, this error occurs.

Reproduction
Run the command:

python Video-Swin-Transformer/tools/train.py _Video-Swin-Transformer/configs/recognition/swin/swin_base_patch244_window877_kinetics600_22k.py
  1. Did you make any modifications on the code or config? - No. Did you understand what you have modified? - No
  2. What dataset did you use? - Kinetics600

Environment

  1. Please run PYTHONPATH=${PWD}:$PYTHONPATH python mmaction/utils/collect_env.py to collect necessary environment information and paste it here.
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
sys.platform: linux
Python: 3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 16:07:37) [GCC 9.3.0]
CUDA available: True
GPU 0: Tesla P100-PCIE-16GB
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.0_bu.TC445_37.28845127_0
GCC: gcc (Debian 8.3.0-6) 8.3.0
PyTorch: 1.7.0
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 10.2
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75
  - CuDNN 7.6.5
  - Magma 2.5.2
  - Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

TorchVision: 0.10.0+cu102
OpenCV: 4.5.3
MMCV: 1.3.12
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.2
MMAction2: 0.17.0+
  2. You may add additional information that may be helpful for locating the problem, such as
    • How you installed PyTorch [e.g., pip, conda, source] -- using pip
    • Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.) - None

Error traceback

{'type': 'Recognizer3D', 'backbone': {'type': 'SwinTransformer3D', 'patch_size': (2, 4, 4), 'embed_dim': 128, 'depths': [2, 2, 18, 2], 'num_heads': [4, 8, 16, 32], 'window_size': (8, 7, 7), 'mlp_ratio': 4.0, 'qkv_bias': True, 'qk_scale': None, 'drop_rate': 0.0, 'attn_drop_rate': 0.0, 'drop_path_rate': 0.2, 'patch_norm': True}, 'cls_head': {'type': 'I3DHead', 'in_channels': 1024, 'num_classes': 600, 'spatial_type': 'avg', 'dropout_ratio': 0.5}, 'test_cfg': {'average_clips': 'prob', 'max_testing_views': 2}}
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/mmcv/utils/registry.py", line 52, in build_from_cfg
    return obj_cls(**args)
  File "/opt/conda/lib/python3.7/site-packages/mmaction/models/recognizers/base.py", line 75, in __init__
    self.backbone = builder.build_backbone(backbone)
  File "/opt/conda/lib/python3.7/site-packages/mmaction/models/builder.py", line 29, in build_backbone
    return BACKBONES.build(cfg)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/utils/registry.py", line 212, in build
    return self.build_func(*args, **kwargs, registry=self)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/cnn/builder.py", line 27, in build_model_from_cfg
    return build_from_cfg(cfg, registry, default_args)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/utils/registry.py", line 45, in build_from_cfg
    f'{obj_type} is not in the {registry.name} registry')
KeyError: 'SwinTransformer3D is not in the models registry'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "Video-Swin-Transformer/tools/train.py", line 196, in <module>
    main()
  File "Video-Swin-Transformer/tools/train.py", line 154, in main
    model = build_model(cfg.model,train_cfg=cfg.get('train_cfg'),test_cfg=cfg.get('test_cfg'))
  File "/opt/conda/lib/python3.7/site-packages/mmaction/models/builder.py", line 70, in build_model
    return build_localizer(cfg)
  File "/opt/conda/lib/python3.7/site-packages/mmaction/models/builder.py", line 62, in build_localizer
    return LOCALIZERS.build(cfg)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/utils/registry.py", line 212, in build
    return self.build_func(*args, **kwargs, registry=self)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/cnn/builder.py", line 27, in build_model_from_cfg
    return build_from_cfg(cfg, registry, default_args)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/utils/registry.py", line 55, in build_from_cfg
    raise type(e)(f'{obj_cls.__name__}: {e}')
KeyError: "Recognizer3D: 'SwinTransformer3D is not in the models registry'"

Other packages versions

mmcv-full == 1.3.12
pytorch==1.7.0
mmaction2==0.18.0
mmdet == 2.16.0
scipy==1.6.3
numpy==1.19.5
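
One thing worth checking, judging from the traceback above: the model is built from the mmaction package under /opt/conda/.../site-packages, i.e. the pip-installed mmaction2, rather than from this fork's local mmaction package, which is the one that registers SwinTransformer3D. A minimal diagnostic sketch (an assumption about the likely cause, not an official fix):

import mmaction
from mmaction.models import BACKBONES

print(mmaction.__file__)                             # repo checkout vs. site-packages copy
print('SwinTransformer3D' in BACKBONES.module_dict)  # False would explain the KeyError

If the second line prints False, making sure the fork's own mmaction is the one imported (for example by running from the repo root with PYTHONPATH pointing at it, or installing the fork in editable mode) is a likely remedy.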

The model's behavior is different from the picture in the paper.

Hello. Thank you for providing a good paper with good code.

I had a question while experimenting with video swin transformer.

Input size: (1, 3, 8, 384, 384)
SwinTransformer3D(patch_size=(2, 4, 4), all other settings default)

I measured the output size per layer in the forward pass.
The result is:
after 1 layer output shape : torch.Size([1, 192, 4, 48, 48])
after 2 layer output shape : torch.Size([1, 384, 4, 24, 24])
after 3 layer output shape : torch.Size([1, 768, 4, 12, 12])
after 4 layer output shape : torch.Size([1, 768, 4, 12, 12])

As the paper illustrates,
after 1 layer output shape : torch.Size([1, 96, 4, 96, 96])
after 2 layer output shape : torch.Size([1, 192, 4, 48, 48])
after 3 layer output shape : torch.Size([1, 384, 4, 24, 24])
after 4 layer output shape : torch.Size([1, 768, 4, 12, 12])
I think this is right.

I know it's hard work, but can I ask you to check it out?
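
For reference, the per-stage output shapes can be checked directly with forward hooks; a minimal sketch, assuming the fork's backbone is importable as below and exposes its stages as model.layers, as in the released code:

import torch
from mmaction.models.backbones.swin_transformer import SwinTransformer3D

model = SwinTransformer3D(patch_size=(2, 4, 4))  # all other settings default
for i, layer in enumerate(model.layers):
    layer.register_forward_hook(
        lambda mod, inp, out, i=i: print(f'after layer {i + 1}: {tuple(out.shape)}'))

model.eval()
with torch.no_grad():
    model(torch.randn(1, 3, 8, 384, 384))  # same input size as above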

Severe overfitting occurred.

Dear author:
I trained a lite-base version of Video Swin Transformer, but I noticed a very severe overfitting phenomenon, as shown below:

, data_time: 0.001, memory: 20882, top1_acc: 0.7600, top5_acc: 0.9206, loss_cls: 0.9247, loss: 0.9247
2022-02-15 10:24:18,650 - mmaction - INFO - Epoch [13][5860/5929]	lr: 2.714e-05, eta: 2 days, 3:27:33, time: 0.669, data_time: 0.001, memory: 20882, top1_acc: 0.7569, top5_acc: 0.9269, loss_cls: 0.9281, loss: 0.9281
2022-02-15 10:24:31,952 - mmaction - INFO - Epoch [13][5880/5929]	lr: 2.714e-05, eta: 2 days, 3:27:20, time: 0.664, data_time: 0.000, memory: 20882, top1_acc: 0.7462, top5_acc: 0.9313, loss_cls: 0.9472, loss: 0.9472
2022-02-15 10:24:45,297 - mmaction - INFO - Epoch [13][5900/5929]	lr: 2.714e-05, eta: 2 days, 3:27:07, time: 0.668, data_time: 0.001, memory: 20882, top1_acc: 0.7556, top5_acc: 0.9250, loss_cls: 0.9117, loss: 0.9117
2022-02-15 10:24:58,546 - mmaction - INFO - Epoch [13][5920/5929]	lr: 2.714e-05, eta: 2 days, 3:26:53, time: 0.662, data_time: 0.001, memory: 20882, top1_acc: 0.7506, top5_acc: 0.9256, loss_cls: 0.9624, loss: 0.9624
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 33663/33663, 139.1 task/s, elapsed: 242s, ETA:     0s

2022-02-15 10:29:10,037 - mmaction - INFO - Evaluating top_k_accuracy ...
2022-02-15 10:29:12,502 - mmaction - INFO - 
top1_acc	0.5948
top5_acc	0.8161
2022-02-15 10:29:12,502 - mmaction - INFO - Evaluating mean_class_accuracy ...
2022-02-15 10:29:12,608 - mmaction - INFO - 
mean_acc	0.5943
2022-02-15 10:29:12,626 - mmaction - INFO - Epoch(val) [13][421]	top1_acc: 0.5948, top5_acc: 0.8161, mean_class_accuracy: 0.5943

After training for 30 epochs, the training top-1 reached 90+%, but the validation accuracy stayed at ~59%.

I followed most of the settings of Swin-B:

        drop_rate=0.,
        attn_drop_rate=0.,
        drop_path_rate=0.2,
        patch_norm=True),


    cls_head=dict(
        type='I3DHead',
        in_channels=1024,
        num_classes=700,
        spatial_type='avg',
        dropout_ratio=0.5),


# optimizer
optimizer = dict(type='AdamW', lr=3e-4, betas=(0.9, 0.999), weight_decay=0.05,
                 paramwise_cfg=dict(custom_keys={'absolute_pos_embed': dict(decay_mult=0.),
                                                 'relative_position_bias_table': dict(decay_mult=0.),
                                                 'norm': dict(decay_mult=0.),
                                                 'backbone': dict(lr_mult=0.1)}))

Does anyone have the same issue? Could anyone give some tips? Thank you.

Inaccessible Download Links

The download links for the Kinetics 400 pretrained models are on pan.baidu.com. Many people are not able to download these at all, because you need to create an account (with a phone number) to download files from that site. If you are in Germany or the UK, like me, it is not possible to create an account to download these. Please host them somewhere else to make them available to the general public.

KeyError: 'patch_embed.proj.weight'

Describe the bug
When trying to fine-tune a pretrained model, the following error occurs:
KeyError: 'patch_embed.proj.weight'
For the line:
state_dict['patch_embed.proj.weight'] = state_dict['patch_embed.proj.weight'].unsqueeze(2).repeat(1,1,self.patch_size[0],1,1) / self.patch_size[0]

Reproduction

  1. What command or script did you run?
python3 tools/train.py configs/recognition/swin/swin_small_patch244_window877_kinetics400_1k.py --cfg-options model.backbone.pretrained=pretrained/swin_small_patch244_window877_kinetics400_1k.pth model.backbone.use_checkpoint=True
  2. Did you make any modifications on the code or config? Did you understand what you have modified?
  • Only changed the following:
--- a/mmaction/models/backbones/swin_transformer.py
+++ b/mmaction/models/backbones/swin_transformer.py
-        state_dict = checkpoint['model']
+       state_dict = checkpoint['state_dict'] #checkpoint['model']

  3. What dataset did you use?
     kinetics-based

Environment

  1. Please run PYTHONPATH=${PWD}:$PYTHONPATH python mmaction/utils/collect_env.py to collect necessary environment information and paste it here.
    sys.platform: linux
    Python: 3.8.5 (default, Jan 27 2021, 15:41:15) [GCC 9.3.0]
    CUDA available: True
    GPU 0,1,2,3: Quadro RTX 8000
    CUDA_HOME: /usr/local/cuda
    NVCC: Build cuda_11.1.TC455_06.29190527_0
    GCC: gcc (Ubuntu 9.3.0-10ubuntu2) 9.3.0
    PyTorch: 1.9.0+cu102
    PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v2.1.2 (Git Hash 98be7e8afa711dc9b66c8ff3504129cb82013cdb)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 10.2
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70
  • CuDNN 7.6.5
  • Magma 2.5.2
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PT
    HREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-
    field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-st
    rict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -falig
    ned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TO
    RCH_VERSION=1.9.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.10.0+cu102
OpenCV: 4.5.3
MMCV: 1.3.13
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.2
MMAction2: 0.15.0+db018fb

Error traceback


2021-09-12 22:44:38,014 - mmaction - INFO - load model from: pretrained/swin_small_patch244_window877_kinetics400_1k.pth
Traceback (most recent call last):
  File "<venv dir>/lib/python3.8/site-packages/mmcv/utils/registry.py", line 52, in build_from_cfg
    return obj_cls(**args)
  File "Video-Swin-Transformer/mmaction/models/recognizers/base.py", line 109, in __init__
    self.init_weights()
  File "Video-Swin-Transformer/mmaction/models/recognizers/base.py", line 126, in init_weights
    self.backbone.init_weights()
  File "Video-Swin-Transformer/mmaction/models/backbones/swin_transformer.py", line 641, in init_weights
    self.inflate_weights(logger)
  File "Video-Swin-Transformer/mmaction/models/backbones/swin_transformer.py", line 588, in inflate_weights
    state_dict['patch_embed.proj.weight'] = state_dict['patch_embed.proj.weight'].unsqueeze(2).repeat(1,1,self.patch_size[0],1,1) / self.patch_size[0]
KeyError: 'patch_embed.proj.weight'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tools/train.py", line 201, in <module>
    main()
  File "tools/train.py", line 156, in main
    model = build_model(
  File "Video-Swin-Transformer/mmaction/models/builder.py", line 70, in build_model
    return build_localizer(cfg)
  File "Video-Swin-Transformer/mmaction/models/builder.py", line 62, in build_localizer
    return LOCALIZERS.build(cfg)
  File "<venv_dir>/lib/python3.8/site-packages/mmcv/utils/registry.py", line 212, in build
    return self.build_func(*args, **kwargs, registry=self)
  File "<venv dir>/lib/python3.8/site-packages/mmcv/cnn/builder.py", line 27, in build_model_from_cfg
    return build_from_cfg(cfg, registry, default_args)
  File "<venv dir>/lib/python3.8/site-packages/mmcv/utils/registry.py", line 55, in build_from_cfg
    raise type(e)(f'{obj_cls.__name__}: {e}')
KeyError: "Recognizer3D: 'patch_embed.proj.weight'"

Bug fix

Looks like the pretrained models are compatible with an older version of mmaction2, but I couldn't find which one.

Thanks!
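
A small diagnostic that can help here (a sketch, assuming the checkpoint file is available locally) is to inspect what the checkpoint actually stores, since inflate_weights expects to find the key patch_embed.proj.weight in whichever dict it loads:

import torch

ckpt = torch.load('pretrained/swin_small_patch244_window877_kinetics400_1k.pth',
                  map_location='cpu')
print(list(ckpt.keys()))  # e.g. ['meta', 'state_dict'] or ['model', ...]
sd = ckpt.get('state_dict', ckpt.get('model', ckpt))
print([k for k in sd if 'patch_embed' in k])  # keys may carry a 'backbone.' prefix

If the keys show up as backbone.patch_embed.proj.weight, the checkpoint already holds 3D recognizer weights, so loading it via load_from rather than model.backbone.pretrained (which triggers 2D-to-3D inflation) may be the more appropriate route; this is an inference from the traceback, not a confirmed fix.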

AttributeError: 'Recognizer2D' object has no attribute 'demo/label_map_k400'

I installed as instructed.

The following line
inference_recognizer(model, 'demo/demo.mp4', 'demo/label_map_k400.txt')

Gave me an error
AttributeError: 'Recognizer2D' object has no attribute 'demo/label_map_k400'

My guess is that the Kinetics-400 dataset is not installed properly. I'm not sure how to install it as needed.

Thank You
Tom

Which version of Kinetics400 do you use?

There are many different versions of Kinetics-400, and some have more videos than others. Can I ask which version you use and what the statistics of your train and test sets are, i.e., how many train and test videos you have?

AttributeError: Module tools/data/kinetics/label_map_k400.txt not found

I got to the end of the installation and ran the script that checks whether the installation was done correctly.

In the open_mmlab environment, the very last line of the script
inference_recognizer(model, 'demo/demo.mp4', 'demo/label_map_k400.txt')
received an error
AttributeError: 'Recognizer2D' object has no attribute 'demo/label_map_k400'

How to proceed?

Thank You
Tom

How long does it take to train an epoch with SWIN-B?

I used Swin-B to train on the EPIC-KITCHENS dataset, but it takes me almost 27 hours for one training epoch (mixed precision was already applied).
I used 4 V100 GPUs with batch_size=8.
Is this the normal training time?

About the 3D relative position bias

In the subsection "3D relative position bias" of your paper, a bias is added in the self-attention computation.
I don't fully understand it.
[screenshot of the paper's equation: Attention(Q, K, V) = SoftMax(QK^T/sqrt(d) + B)V]

According to your description, Q, K, V are all matrices with P*M^2 rows and d columns, so QK^T will be a square matrix with P*M^2 rows and P*M^2 columns. To make the summation valid, the 3D relative position bias B should also be a square matrix with P*M^2 rows and P*M^2 columns. So how are the values in B set? Specifically, how is the entry B(i, j) set?
I can't see the link between B and the parameterized bias matrix B-hat in R^{(2P-1)x(2M-1)x(2M-1)} shown in the second screenshot.
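
For what it's worth, here is a minimal sketch (following the released 2D Swin code, not taken from this repository) of how B can be realized: a learnable table with (2P-1)(2M-1)(2M-1) entries per head, indexed by the relative (frame, row, column) offset between token i and token j, so that B(i, j) = table[index[i, j]]:

import torch

def relative_position_index_3d(P, Mh, Mw):
    # coordinates of every token in a (P, Mh, Mw) window
    coords = torch.stack(torch.meshgrid(
        torch.arange(P), torch.arange(Mh), torch.arange(Mw)))   # 3, P, Mh, Mw
    coords = torch.flatten(coords, 1)                            # 3, N  (N = P*Mh*Mw)
    rel = coords[:, :, None] - coords[:, None, :]                # 3, N, N  pairwise offsets
    rel = rel.permute(1, 2, 0).contiguous()                      # N, N, 3
    rel[:, :, 0] += P - 1                                        # shift offsets to start at 0
    rel[:, :, 1] += Mh - 1
    rel[:, :, 2] += Mw - 1
    rel[:, :, 0] *= (2 * Mh - 1) * (2 * Mw - 1)                  # flatten the 3D offset
    rel[:, :, 1] *= (2 * Mw - 1)                                 # into a single table index
    return rel.sum(-1)                                           # N, N integer index matrix

index = relative_position_index_3d(8, 7, 7)                      # window size (8, 7, 7)
# B[i, j] (one value per head) = bias_table[index[i, j]], with bias_table of
# shape ((2*8-1)*(2*7-1)*(2*7-1), num_heads)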

Any plan for spatial temporal localization?

Hi~ Thanks for your great work!
Do you have any plan to run experiments on the spatial-temporal localization task (such as on AVA)?
I'm curious about the comparison of Swin and MViT on the Spatial-Temporal Localization task.
Looking forward to your reply~ Thanks a lot~

Are you interested in creating a PR under MMAction2?

Firstly, congratulations on the work “Video Swin Transformer”, and thanks for open-sourcing the code of this project. Are you interested in creating a PR under MMAction2?

Also, if you find MMAction2 useful in your research, please consider cite:

@misc{2020mmaction2,
    title={OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark},
    author={MMAction2 Contributors},
    howpublished = {\url{https://github.com/open-mmlab/mmaction2}},
    year={2020}
}

Can "demo/demo.py" run on MAC without GPU?

Thanks for sharing your great work!

I want to run the demo on my Mac and have installed the needed packages, but there are always some errors in demo.py.
So, can I run "demo/demo.py" on a Mac without a GPU?

Thanks in advance.
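
For reference, the high-level API accepts a device argument, so CPU-only inference should mostly be a matter of passing device='cpu'; a minimal sketch with placeholder paths (the checkpoint location below is hypothetical):

from mmaction.apis import init_recognizer, inference_recognizer

config_file = 'configs/recognition/swin/swin_tiny_patch244_window877_kinetics400_1k.py'
checkpoint_file = 'checkpoints/swin_tiny_patch244_window877_kinetics400_1k.pth'  # hypothetical local path

model = init_recognizer(config_file, checkpoint_file, device='cpu')
results = inference_recognizer(model, 'demo/demo.mp4')  # older versions also take a label-map path

Expect CPU inference to be much slower than on a GPU, of course.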

Where can I find the <PRETRAIN_MODEL>?

Hi, thanks for this fascinating work!
I want to follow the instructions bash tools/dist_train.sh <CONFIG_FILE> <GPU_NUM> --cfg-options load_from=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments] to run the program, but I don't know where I can find the pretrained model.
So I need some help, thanks to all of you!

Swin-L weight

Dear researchers,

Thank you for this very nice piece of work.

Can you also provide the weights of Swin-L as described in the paper?

Best regards,

lr is very small??

Dear all:
During training of Video Swin, I noticed that the printed learning rate is very small, e.g., 9.6e-5.

2022-02-11 11:25:41,507 - mmaction - INFO - Epoch [8][9220/11857]	lr: 9.668e-05, eta: 4 days, 12:07:34, time: 0.665, data_time: 0.000, memory: 20882, top1_acc: 0.6312, top5_acc: 0.8387, loss_cls: 1.5623, loss: 1.5623

However, I find that in the Swin-B config file the lr is set to 1e-3. So is this normal? Thank you.
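
A hedged reading of why the printed value differs from the nominal one (an illustration based on the settings quoted in other issues here, not a verified excerpt of the official config): the base lr is scaled by paramwise multipliers and by the warmup/cosine schedule, so a momentary backbone lr around 1e-4 is plausible.

# illustrative fragment; key values are assumptions, not a copy of the config
optimizer = dict(type='AdamW', lr=1e-3, weight_decay=0.05,
                 paramwise_cfg=dict(custom_keys={'backbone': dict(lr_mult=0.1)}))
lr_config = dict(policy='CosineAnnealing', min_lr=0,
                 warmup='linear', warmup_by_epoch=True, warmup_iters=2.5)
# backbone lr = 1e-3 * 0.1 = 1e-4 before cosine decay, already close to 9.6e-5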

Performance Reproducing of Swin-S

Hi,
Thanks for your great work.
I'm trying to reproduce the performance of Swin-S on K-400. Using the released checkpoint for evaluation, I got 80.11% accuracy; evaluating the Swin-S model trained by myself, I got 80.35% accuracy (still ~0.2% worse than the paper-reported one).
I wonder if anything is wrong. I suspect different validation data causes this, as some videos are missing in the current K-400 dataset. My validation set contains 19,870 videos and my training set contains 239,687 videos; how about the ones you use?
Thanks a lot in advance.
Best.

About the factorized spatiotemporal model as in Table 4

Thank you for your work and the code.

In addition to your released model and weights, I'm wondering if you can also release the model and pretrained weights for factorized spatiotemporal attention (Video-Swin-T), as discussed in Table 4 in your paper.

Reproducing results

Hi,

Thanks for the great work.

I'm having the same issue as #5 even when I tested the models with the same val split.

I played with Swin-T and Swin-B, and both of them gave 0.4%~0.5% lower top-1 accuracy than reported. They are still pretty neat, but I just want to make sure I am not doing anything wrong.

Would you confirm that the models and the split files uploaded are the correct ones?

Also, if anyone has successfully reproduced the results, please kindly comment here about whether there is anything else I need to do besides downloading the models and configs and running the test scripts.

Thanks,

the training iteration is abnormally large

I used 4 GPUs (2080 Ti) to train swin_small with the config swin_small_patch244_window877_kinetics400_1k.py. The dataset I used is HACS (about 500k videos). The following is part of my training log:

[training log screenshot omitted]

I find that the number of training iterations per epoch is abnormally large given my config (dataset size ~500k, batch_size 8), which leads to a long training time. Is that normal?
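
A quick back-of-the-envelope check (a sketch, assuming iterations per epoch = samples / (GPUs x videos_per_gpu)):

num_videos = 500_000            # HACS, roughly 500k training videos
num_gpus, videos_per_gpu = 4, 8
iters_per_epoch = num_videos // (num_gpus * videos_per_gpu)
print(iters_per_epoch)          # 15625, so a very large iteration count is expected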

Do you have any result on other video dataset, like Charades?

I finetuned the swin-base on Charades with the setting as follow:

  1. optimizer: I used AdamW with lr=75e-6, betas=(0.9, 0.999), weight_decay=5e-2; other settings just follow the config that you provided.
  2. learning policy: CosineAnnealing with linear warmup for 2.5 epochs.
  3. loss function: AsymmetricLoss [1] with neg=4 and pos=1
  4. train_pipeline: clip_len=32, frame_interval=2, num_clips=1, with RandomRescale (256, 340) following the setting of the SlowFast network, RandomResizedCrop, Resize(224, 224) and Flip(0.5)
  5. val_pipeline: clip_len=32, frame_interval=2, num_clips=10, Resize(-1, 256), CenterCrop(256), Flip(0.5)
    When the total epoch count is 30, I got a final val mAP of 44.96.
    When the total epoch count is 60, I got a final val mAP of 45.88.
    Is my result correct? Do you have any suggestions about fine-tuning Swin on other datasets?

ref: [1] Ben-Baruch, E., Ridnik, T., Zamir, N., Noy, A., Friedman, I., Protter, M., & Zelnik-Manor, L. (2020). Asymmetric loss for multi-label classification. arXiv preprint arXiv:2009.14119.
code: https://github.com/Alibaba-MIIL/ASL

area_range in RandomResizedCrop?

Dear author:
I noticed the default area_range of RandomResizedCrop is (0.08, 1.0), which may not be appropriate for video recognition because the lower bound is too small.
I guess we need to specify the area_range like this:

dict(type='RandomResizedCrop', area_range=(0.75, 1.0))

Has anyone else realized this, or am I wrong? Thank you.

RuntimeError: Unable to find a valid cuDNN algorithm to run convolution

I was using the command python tools/test.py ./configs/recognition/swin/swin_small_patch244_window877_kinetics400_1k.py ./swin_small_patch244_window877_kinetics400_1k.pth --eval top_k_accuracy to do inference when the error occurred. Someone says that I need to decrease the batch size, but I didn't find the corresponding parameter.

ETA:Traceback (most recent call last):
File "/root/obelisk/Collection/Video-Swin-Transformer-master/tools/test.py", line 364, in
main()
File "/root/obelisk/Collection/Video-Swin-Transformer-master/tools/test.py", line 349, in main
outputs = inference_pytorch(args, cfg, distributed, data_loader)
File "/root/obelisk/Collection/Video-Swin-Transformer-master/tools/test.py", line 160, in inference_pytorch
outputs = single_gpu_test(model, data_loader)
File "/root/anaconda3/lib/python3.9/site-packages/mmcv/engine/test.py", line 33, in single_gpu_test
result = model(return_loss=False, **data)
File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/anaconda3/lib/python3.9/site-packages/mmcv/parallel/data_parallel.py", line 50, in forward
return super().forward(*inputs, **kwargs)
File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
return self.module(*inputs[0], **kwargs[0])
File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/obelisk/Collection/Video-Swin-Transformer-master/mmaction/models/recognizers/base.py", line 258, in forward
return self.forward_test(imgs, **kwargs)
File "/root/obelisk/Collection/Video-Swin-Transformer-master/mmaction/models/recognizers/recognizer3d.py", line 90, in forward_test
return self._do_test(imgs).cpu().numpy()
File "/root/obelisk/Collection/Video-Swin-Transformer-master/mmaction/models/recognizers/recognizer3d.py", line 47, in _do_test
x = self.extract_feat(batch_imgs)
File "/root/anaconda3/lib/python3.9/site-packages/mmcv/runner/fp16_utils.py", line 98, in new_func
return old_func(*args, **kwargs)
File "/root/obelisk/Collection/Video-Swin-Transformer-master/mmaction/models/recognizers/base.py", line 157, in extract_feat
x = self.backbone(imgs)
File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/obelisk/Collection/Video-Swin-Transformer-master/mmaction/models/backbones/swin_transformer.py", line 652, in forward
x = self.patch_embed(x)
File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/obelisk/Collection/Video-Swin-Transformer-master/mmaction/models/backbones/swin_transformer.py", line 449, in forward
x = self.proj(x) # B C D Wh Ww
File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 590, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/root/anaconda3/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 585, in _conv_forward
return F.conv3d(
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
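
For reference, the test-time memory footprint is usually governed by the data section of the config (mmaction2 configs are plain Python); a hedged sketch of the fields one would lower, with key names that may differ slightly across versions:

# illustrative override, not a verified excerpt of this repo's config
data = dict(
    videos_per_gpu=1,                        # clips per GPU
    test_dataloader=dict(videos_per_gpu=1),  # value used only during testing
)
model = dict(test_cfg=dict(max_testing_views=2))  # forward at most 2 views at a time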

Official Pytorch API or model?

Hi!

I'm a researcher planning to use this to classify time-lapse biomedical data. Is there any official PyTorch API with pretrained weights?

I'm currently using ResNet 3D that is available off-the-shelf in Pytorch
https://pytorch.org/vision/stable/models.html#video-classification

But I believe transformers will give me better results.

There are also these repos:
https://github.com/haofanwang/video-swin-transformer-pytorch
https://github.com/berniwal/swin-transformer-pytorch

But I'm having trouble getting them to work; I'd like to use the official code if possible. I have also searched here without results:
https://paperswithcode.com/paper/video-swin-transformer#code

We only have grayscale images, so it would be great if it were possible to choose the number of channels (and classes).
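
On the grayscale point, one low-effort workaround that avoids touching the model definition (a sketch, assuming input laid out as N, C, T, H, W) is to repeat the single channel three times before feeding the backbone:

import torch

gray = torch.randn(2, 1, 8, 224, 224)    # a batch of grayscale clips
rgb_like = gray.repeat(1, 3, 1, 1, 1)    # (N, 3, T, H, W), accepted by 3-channel models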

ValueError: batch_size should be a positive integer value, but got batch_size=0

If you feel we have helped you, give us a STAR! 😆

Notice

There are several common situations in the reimplementation issues as below

  1. Reimplement a model in the model zoo using the provided configs
  2. Reimplement a model in the model zoo on another dataset (e.g., custom datasets)
  3. Reimplement a custom model but all the components are implemented in MMAction2
  4. Reimplement a custom model with new modules implemented by yourself

There are several things to do for different cases as below.

  • For case 1 & 3, please follow the steps in the following sections so we can quickly identify the issue.
  • For case 2 & 4, please understand that we are not able to help much here, because we usually do not know the full code and users are responsible for the code they write.
  • One suggestion for case 2 & 4 is that users should first check whether the bug lies in the self-implemented code or in the original code. For example, users can first make sure that the same model runs well on supported datasets. If you still need help, please describe what you have done and what you obtained in the issue, follow the steps in the following sections, and be as clear as possible so that we can better help you.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The issue has not been fixed in the latest version.

Describe the issue

The problem of CUDA out of memory appeared during model reimplementation. I adjusted videos_per_gpu to 1 (https://github.com/SwinTransformer/Video-Swin-Transformer/blob/db018fb8896251711791386bbd2127562fd8d6a6/configs/recognitionow py#L66), and a new problem occurred.

Reproduction

  1. What command or script did you run?
     python tools/train.py 'configs/recognition/swin/swin_base_patch244_window1677_sthv2.py'
  2. What config dir you run?
     configs/recognition/swin/swin_base_patch244_window1677_sthv2.py
  3. Did you make any modifications on the code or config? Did you understand what you have modified?
     I adjusted videos_per_gpu to 1 (https://github.com/SwinTransformer/Video-Swin-Transformer/blob/db018fb8896251711791386bbd2127562fd8d6a6/configs/recognitionow py#L66)
  4. What dataset did you use?
     sthv2

Environment

  1. Please run PYTHONPATH=${PWD}:$PYTHONPATH python mmaction/utils/collect_env.py to collect necessary environment information and paste it here.

fatal: Not a git repository (or any parent up to mount point /home)
    Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
    sys.platform: linux
    Python: 3.6.10 (default, Dec 19 2019, 23:04:32) [GCC 5.4.0 20160609]
    CUDA available: True
    GPU 0,1,2,3,4,5: TITAN Xp
    CUDA_HOME: /usr/local/cuda-10.2
    NVCC: Cuda compilation tools, release 10.2, V10.2.89
    GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
    PyTorch: 1.6.0
    PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) Math Kernel Library Version 2019.0.5 Product Build 20190808 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v1.5.0 (Git Hash e2ac1fac44c5078ca927cb9b90e1b3066a0b2ed0)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 10.2
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75
  • CuDNN 7.6.5
  • Magma 2.5.2
  • Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,

TorchVision: 0.7.0
OpenCV: 4.4.0
MMCV: 1.3.14
MMCV Compiler: GCC 5.4
MMCV CUDA Compiler: 10.2
MMAction2: 0.18.0+

  2. You may add additional information that may be helpful for locating the problem, such as
    1. How you installed PyTorch [e.g., pip, conda, source]
    2. Other environment variables that may be related (such as $PATH, $LD_LIBRARY_PATH, $PYTHONPATH, etc.)

Results

"but got batch_size={}".format(batch_size))
ValueError: batch_size should be a positive integer value, but got batch_size=0


Issue fix

If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here; that would be much appreciated!

INSTALL

Dear the Authors,

I would like to ask how we can install Video-Swin-Transformer, and whether there is a notebook tutorial for training?

Thank you very much.

the common dataset setting

Dear author:
The error is "ValueError: VideoDataset: too many values to unpack (expected 2)".
Here are my thoughts on this error:
The input data is raw frames (I have already extracted the frames), but the dataset_type in configs/recognition/swin/~~.py is dataset_type = 'VideoDataset'. Should I change the dataset_type? If so, to what type?
Thank you!

if results['frame_inds'].ndim != 1: KeyError: 'frame_inds'

KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/envs/pytorch/lib/python3/site-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
data = fetcher.fetch(index)
File "/home/envs/pytorch/lib/python3/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/envs/pytorch/lib/python3/site-packages/torch/utils/data/_utils/fetch.py", line 44, in
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/Video-Swin-Transformer-master/mmaction/datasets/base.py", line 287, in getitem
return self.prepare_train_frames(idx)
File "/home/Video-Swin-Transformer-master/mmaction/datasets/rawframe_dataset.py", line 168, in prepare_train_frames
return self.pipeline(results)
File "/home/Video-Swin-Transformer-master/mmaction/datasets/pipelines/compose.py", line 41, in call
data = t(data)
File "/home/ideo-Swin-Transformer-master/mmaction/datasets/pipelines/loading.py", line 1153, in call
if results['frame_inds'].ndim != 1:
KeyError: 'frame_inds'

Details about input frames

Hi there,

Could you please explain "we sample a clip of 32 frames from each full-length video using a temporal stride of 2 and spatial size of 224×224, resulting in 16×56×56 input 3D tokens" in detail? How do you sample a clip? Does the temporal stride of 2 mean 2 FPS?
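
As a rough sketch of what a sampler like SampleFrames(clip_len=32, frame_interval=2, num_clips=1) does (an illustration, not the library's exact code): it picks a random start and then takes every second frame, so one clip covers 64 consecutive frames rather than a fixed 2 FPS, and with a (2, 4, 4) patch size the 32x224x224 clip becomes 16x56x56 tokens.

import numpy as np

def sample_clip(num_frames, clip_len=32, frame_interval=2):
    span = clip_len * frame_interval                 # 64 frames of raw video
    start = np.random.randint(0, max(num_frames - span + 1, 1))
    inds = start + np.arange(clip_len) * frame_interval
    return np.clip(inds, 0, num_frames - 1)          # guard against short videos

print(sample_clip(300)[:5])  # e.g. [17 19 21 23 25]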

THUMOS14 fetch_tag_proposal.sh doesn't work

Checklist

  • I have searched related issues but cannot get the expected help.
  • The bug has not been fixed in the latest version.

Describe the bug

fetch_tag_proposal.sh for the THUMOS14 dataset doesn't work because the following links are forbidden:

https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/filelist/thumos14_tag_val_normalized_proposal_list.txt
https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/filelist/thumos14_tag_test_normalized_proposal_list.txt

Reproduction

  1. What command or script did you run?
     cd $MMACTION2/tools/data/thumos14/
     bash fetch_tag_proposals.sh
  2. Did you make any modifications on the code or config? Did you understand what you have modified?
     Links of thumos14_tag_val_normalized_proposal_list.txt and thumos14_tag_test_normalized_proposal_list.txt are invalid.
  3. What dataset did you use?
     THUMOS14

Environment

  1. Please run PYTHONPATH=${PWD}:$PYTHONPATH python mmaction/utils/collect_env.py to collect necessary environment information and paste it here.
sys.platform: linux
Python: 3.8.5 (default, Sep  4 2020, 07:30:14) [GCC 7.3.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA Quadro RTX 8000
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.0_bu.TC445_37.28845127_0
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.7.1
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.0
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.0.5
  - Magma 2.5.2
  - Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

TorchVision: 0.8.2
OpenCV: 4.5.3
MMCV: 1.3.6
MMCV Compiler: GCC 7.5
MMCV CUDA Compiler: 11.0
MMAction2: 0.17.0+acce52d

Error traceback


root@###########:/mmaction2/tools/data/thumos14# bash fetch_tag_proposals.sh
../../../data/thumos14/proposals does not exist. Creating
--2021-09-10 00:25:51--  https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/filelist/thumos14_tag_val_normalized_proposal_list.txt
Resolving open-mmlab.s3.ap-northeast-2.amazonaws.com (open-mmlab.s3.ap-northeast-2.amazonaws.com)... 52.219.60.147
Connecting to open-mmlab.s3.ap-northeast-2.amazonaws.com (open-mmlab.s3.ap-northeast-2.amazonaws.com)|52.219.60.147|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2021-09-10 00:25:51 ERROR 403: Forbidden.

--2021-09-10 00:25:51--  https://open-mmlab.s3.ap-northeast-2.amazonaws.com/mmaction/filelist/thumos14_tag_test_normalized_proposal_list.txt
Resolving open-mmlab.s3.ap-northeast-2.amazonaws.com (open-mmlab.s3.ap-northeast-2.amazonaws.com)... 52.219.56.71
Connecting to open-mmlab.s3.ap-northeast-2.amazonaws.com (open-mmlab.s3.ap-northeast-2.amazonaws.com)|52.219.56.71|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2021-09-10 00:25:51 ERROR 403: Forbidden.

Thank you.


modify the config file

When I use swin_base_patch244_window877_kinetics400_22k to train on my dataset, the config file is:

# dataset settings
dataset_type = 'VideoDataset'
data_root = 'data/kinetics400/train'
data_root_val = 'data/kinetics400/val'
ann_file_train = 'data/kinetics400/kinetics400_train_list.txt'
ann_file_val = 'data/kinetics400/kinetics400_val_list.txt'
ann_file_test = 'data/kinetics400/kinetics400_val_list.txt'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_bgr=False)
train_pipeline = [
    dict(type='DecordInit'),
    dict(type='SampleFrames', clip_len=32, frame_interval=2, num_clips=1),
    dict(type='DecordDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(type='RandomResizedCrop'),
    dict(type='Resize', scale=(224, 224), keep_ratio=False),
    dict(type='Flip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs', 'label'])
]

because my frame annotation txt format is:

    some/directory-1 163 1
    some/directory-2 122 1
    some/directory-3 258 2
    some/directory-4 234 2
    some/directory-5 295 3
    some/directory-6 121 3

I want to change dataset_type to 'RawframeDataset'; do I need to modify dict(type='DecordInit')?
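
For what it's worth, a hedged sketch of a raw-frame variant of the pipeline above (RawFrameDecode replaces the Decord video ops; names follow the mmaction2 raw-frame pipeline, img_norm_cfg is reused from the config shown above, and the annotation lines then need the "directory total_frames label" form quoted above):

dataset_type = 'RawframeDataset'
train_pipeline = [
    dict(type='SampleFrames', clip_len=32, frame_interval=2, num_clips=1),
    dict(type='RawFrameDecode'),  # instead of DecordInit + DecordDecode
    dict(type='Resize', scale=(-1, 256)),
    dict(type='RandomResizedCrop'),
    dict(type='Resize', scale=(224, 224), keep_ratio=False),
    dict(type='Flip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='Collect', keys=['imgs', 'label'], meta_keys=[]),
    dict(type='ToTensor', keys=['imgs', 'label'])
]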

the input is the video?

Before raising a question, you may need to check the following listed items.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. I have read the FAQ documentation but cannot get the expected help.

demo.py is not the newest, please update.

Original in your docs: inference_recognizer(model, 'demo/demo.mp4', 'demo/label_map_k400.txt')
The newest: inference_recognizer(model, 'demo/demo.mp4')
If it is not updated, when you run demo.py an error occurs at the line inference_recognizer(model, 'demo/demo.mp4', 'demo/label_map_k400.txt').

Drop path rate

Hi,

model=dict(backbone=dict(patch_size=(2,4,4), drop_path_rate=0.1), test_cfg=dict(max_testing_views=4))

The code says Swin-S uses a drop path rate of 0.1, but does that match the paper, which reports 0.2?
Swin-T and Swin-B use 0.1 and 0.3 respectively, as follows:

model=dict(backbone=dict(patch_size=(2,4,4), drop_path_rate=0.1), test_cfg=dict(max_testing_views=4))

model=dict(backbone=dict(patch_size=(2,4,4), drop_path_rate=0.3), test_cfg=dict(max_testing_views=4))

Thanks,

Keeping the temporal dimension

Hi, thanks for your fascinating work!

I want to use the Video Swin Transformer as a backbone, but my model should produce an output for each input frame.
Thus I want to keep the temporal dimension of the input after the forward pass.

So I'm thinking of changing the parameter like this: patch_size=(1, 4, 4), but I am concerned about whether this could violate the authors' intention of building spatio-temporal features.

Apart from the memory usage issue, is it okay to make the temporal size of the patch embedding 1?
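
A minimal shape check along those lines (a sketch, assuming the fork's backbone is importable as below; nothing here is an endorsement from the authors):

import torch
from mmaction.models.backbones.swin_transformer import SwinTransformer3D

backbone = SwinTransformer3D(patch_size=(1, 4, 4))  # temporal patch size 1
backbone.eval()
with torch.no_grad():
    out = backbone(torch.randn(1, 3, 8, 224, 224))  # N, C, T, H, W
print(out.shape)  # the temporal dimension should stay at 8 instead of being halved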

What head to use ?

Hi!

There is a problem with the Video Swin Transformer code at the moment, as it is written in a way that makes it impossible to change the number of target classes in an end-to-end fashion. If I want to use your model on another dataset containing, for example, 10 or 50 classes, the network only gives me the output that would be fed into a head.

I build a model:

model_VST = SwinTransformer3D()
model_VST.cuda()

You can see that I don't pass any class-number argument in the constructor parentheses; indeed, your code doesn't take that as an argument. Here are the arguments that your Video Swin Transformer model takes as input:
There is nothing for the number of target classes. Right now for an input shape of torch.Size([1, 8, 3, 64, 64]), I get an output shape of torch.Size([1, 768, 2, 2, 2]) from the model_VST (which is the SwinTransformer3D).

I understand that I need to add a head to it, but it is not clear at all in your code how to properly manage that. What head should I use?

Maybe you can do something like Facebook did with their TimeSformer: they made an end-to-end version of it for video classification.
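
In the meantime, a plain classification head along the lines of what I3DHead does can be bolted on manually; a minimal sketch (my own wrapper, not the authors' API), assuming backbone features of shape (N, 768, T', H', W'):

import torch
import torch.nn as nn

class VideoClsHead(nn.Module):
    def __init__(self, in_channels=768, num_classes=10, dropout=0.5):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)       # average over T', H', W'
        self.drop = nn.Dropout(dropout)
        self.fc = nn.Linear(in_channels, num_classes)

    def forward(self, x):                         # x: (N, C, T', H', W')
        x = self.pool(x).flatten(1)               # (N, C)
        return self.fc(self.drop(x))              # (N, num_classes) class scores

head = VideoClsHead(num_classes=50)
scores = head(torch.randn(1, 768, 2, 2, 2))       # matches the shape quoted above
print(scores.shape)                               # torch.Size([1, 50])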

KeyError: "Recognizer3D: 'SwinTransformer3D is not in the models registry'"

When I use:
python tools/train.py configs/recognition/swin/swin_base_patch244_window877_kinetics400_22k.py

an error occurred:

Traceback (most recent call last):
File "tools/train.py", line 199, in
main()
File "tools/train.py", line 154, in main
model = build_model(
File "/home/pytorch/lib/python3/site-packages/mmaction/models/builder.py", line 70, in build_model
return build_localizer(cfg)
File "/home/pytorch/lib/python3/site-packages/mmaction/models/builder.py", line 62, in build_localizer
return LOCALIZERS.build(cfg)
File "/home/pytorch/lib/python3/site-packages/mmcv/utils/registry.py", line 210, in build
return self.build_func(*args, **kwargs, registry=self)
File "/home/pytorch/lib/python3/site-packages/mmcv/cnn/builder.py", line 26, in build_model_from_cfg
return build_from_cfg(cfg, registry, default_args)
File "/home/pytorch/lib/python3/site-packages/mmcv/utils/registry.py", line 54, in build_from_cfg
raise type(e)(f'{obj_cls.name}: {e}')
KeyError: "Recognizer3D: 'SwinTransformer3D is not in the models registry'"

How to solve it?
