Human Action Recognition - HMDB51 Dataset

In this repository, I propose a novel architecture that combines a 3D CNN with an attention mechanism as a video classifier, and train it on the HMDB51 dataset. The effect of adding cross-attention and self-attention modules was studied through ablation experiments.

The architecture is based mostly on two prior works, R(2+1)D (a factorized 3D CNN) and Perceiver, with three motivations:

  • Enhance the learning capacity for long temporal dependencies in videos
  • Integrate the inductive bias of convolution filters into the attention mechanism
  • Make the video classifier more computationally efficient

Method

The figure in the repository illustrates my proposed architecture. It consists of two main modules: a CNN module and an Attention module.

CNN Module:

  • A pretrained factorized 3D CNN, R(2+1)D-18, is used for this module (a sketch of the (2+1)D factorization follows this list).
    • It has 4 residual layers; each layer has 2 blocks of factorized 3D convolutions, and each block has 5 CNN kernels.
    • The number of layers is ablated in my experiments.
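
The idea behind the factorization: each full 3D convolution is replaced by a 2D spatial convolution followed by a 1D temporal convolution. Below is a minimal PyTorch sketch of one such (2+1)D block; the channel and kernel sizes are illustrative assumptions, not the exact torchvision implementation.

```python
# Minimal sketch of a (2+1)D factorized convolution block (illustrative
# hyperparameters; not the exact torchvision R(2+1)D implementation).
import torch
import torch.nn as nn

class Conv2Plus1D(nn.Sequential):
    def __init__(self, in_ch, out_ch, mid_ch, stride=1):
        super().__init__(
            # 2D spatial convolution over (H, W), applied to every frame
            nn.Conv3d(in_ch, mid_ch, kernel_size=(1, 3, 3),
                      stride=(1, stride, stride), padding=(0, 1, 1), bias=False),
            nn.BatchNorm3d(mid_ch),
            nn.ReLU(inplace=True),
            # 1D temporal convolution over T, mixing information across frames
            nn.Conv3d(mid_ch, out_ch, kernel_size=(3, 1, 1),
                      stride=(stride, 1, 1), padding=(1, 0, 0), bias=False),
        )

x = torch.randn(2, 3, 16, 112, 112)   # (batch, channels, frames, height, width)
y = Conv2Plus1D(3, 64, mid_ch=45)(x)  # -> torch.Size([2, 64, 16, 112, 112])
```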

Attention module

  • Cross attention: uses $N$ trainable queries to perform an attention operation over the $M$ tokens from the CNN module. Complexity: $O(MN)$
  • Self attention: models the interactions among the tokens produced by the cross-attention layer. Complexity: $O(N^2)$
  • There are $L$ layers of self-attention per cross-attention layer, so the total complexity of one group of cross- and self-attention is $O(NM + LN^2)$ (a sketch of the full classifier follows this list).
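
To make the wiring of the two modules concrete, here is a minimal sketch of how such a classifier could be assembled in PyTorch on top of torchvision's pretrained R(2+1)D-18. The class name, number of queries, and head sizes are illustrative assumptions; this is not the exact code used in this repository.

```python
# Sketch of a Perceiver-style attention head on a pretrained R(2+1)D-18 backbone.
# Names and hyperparameters are assumptions, not this repository's exact code.
import torch
import torch.nn as nn
from torchvision.models.video import r2plus1d_18, R2Plus1D_18_Weights

class AttentionVideoClassifier(nn.Module):
    def __init__(self, num_classes=51, num_queries=64, dim=512, num_self_layers=1):
        super().__init__()
        backbone = r2plus1d_18(weights=R2Plus1D_18_Weights.KINETICS400_V1)
        # Keep the stem and the 4 residual layers; drop the average pool and fc head.
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])
        # N trainable latent queries for the cross-attention layer.
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.self_attns = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
             for _ in range(num_self_layers)]
        )
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video):                      # video: (B, 3, T, H, W)
        feats = self.cnn(video)                    # (B, 512, T', H', W')
        tokens = feats.flatten(2).transpose(1, 2)  # (B, M, 512): M spatio-temporal tokens
        q = self.queries.unsqueeze(0).expand(video.size(0), -1, -1)
        x, _ = self.cross_attn(q, tokens, tokens)  # O(MN) cross-attention
        for attn in self.self_attns:               # L layers of O(N^2) self-attention
            y, _ = attn(x, x, x)
            x = x + y
        return self.head(x.mean(dim=1))            # average the latents, then classify

model = AttentionVideoClassifier()
logits = model(torch.randn(2, 3, 16, 112, 112))    # -> shape (2, 51)
```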

Compared Methods

  • R(2+1)D: a computationally efficient network with good results
  • 3D CNN + BERT: replaces the temporal global average pooling with BERT, at the cost of a large model size
  • VideoMAE: a state-of-the-art, fully attention-based, self-supervised learning method, but heavy and requiring a large pre-training stage

Experiment Results

Ablation Study on Attention Depth

| Architecture | Train loss | Val loss | Val Acc | CNN params | Attention params |
|---|---|---|---|---|---|
| 8 CNN blocks + 1 Cross Attention + 6 Self Attention | 0.00676 | 1.585 | 0.6673 | 31.3M | 28.6M |
| 8 CNN blocks + 1 Cross Attention + 2 Self Attention | 0.02675 | 1.484 | 0.6745 | 31.3M | 11.8M |
| 8 CNN blocks + 1 Cross Attention + 1 Self Attention | 0.1199 | 1.271 | 0.7034 | 31.3M | 7.6M |
| Baseline - R(2+1)D | 0.5385 | 1.118 | 0.7270 | 31.5M | 0 |

In my initial experiments, I explore the impact of the attention module's size on the model's performance. The results in the table show that increasing the size of the attention module leads to more overfitting: while the training losses of the first two models approach zero, their validation losses are significantly higher than the baseline's, and their validation accuracy is notably lower. However, when I replace the classification head of the CNN module with 1 cross-attention and 1 self-attention layer, I obtain results that closely match the baseline.

Ablation Study on CNN module

| Architecture | Train loss | Val loss | Val Acc | CNN params | Attention params |
|---|---|---|---|---|---|
| 8 CNN blocks + 1 Cross Attention + 1 Self Attention | 0.1199 | 1.271 | 0.7034 | 31.3M | 7.6M |
| 7 CNN blocks + 1 Cross Attention + 1 Self Attention | 0.2204 | 1.856 | 0.6437 | 17.1M | 7.6M |
| 6 CNN blocks + 1 Cross Attention + 6 Self Attention | 0.0071 | 2.005 | 0.5827 | 7.8M | 28.6M |
| 4 CNN blocks + 1 Cross Attention + 6 Self Attention | 0.9097 | 2.342 | 0.4062 | 1.9M | 28.6M |
| 4 CNN blocks + 2 Cross Attention + 12 Self Attention | 2.715 | 3.22 | 0.2815 | 1.9M | 57.2M |
| Baseline - R(2+1)D | 0.5385 | 1.118 | 0.7270 | 31.5M | 0 |

The results of these experiments reveal a distinct pattern: the smaller the pretrained CNN, the poorer the overall performance. The model tends to overfit when using more than 6 CNN blocks, whereas it underfits when using only 4 CNN blocks, even though the attention module is enlarged to compensate for the reduced model size. These findings indicate that the attention module learns little when trained on a small dataset.

Ablation Study on Cross and Self Attention

| Architecture | Train loss | Val loss | Val Acc | CNN params | Attention params |
|---|---|---|---|---|---|
| 8 CNN blocks + 1 Cross Attention + 1 Self Attention | 0.1199 | 1.271 | 0.7034 | 31.3M | 7.6M |
| 8 CNN blocks + 1 Cross Attention | 0.2096 | 1.368 | 0.7047 | 31.3M | 3.4M |
| 8 CNN blocks + 1 Self Attention | 0.0205 | 1.215 | 0.6752 | 31.3M | 4.2M |

Removing the self-attention module did not affect the validation accuracy. However, omitting the cross-attention module resulted in a considerable drop in performance.

Analysis and Discussion

I fail to outperform the baseline.

My model overfits on the small training dataset because of the attention modules:

  • The attention modules have strong learning capacity: even a small attention module drives the training loss much lower, but it does not generalize well.
  • Reducing the attention depth and hidden dimension reduces overfitting.

The model underfits when many CNN blocks are removed. Compensating for the reduced model size with a larger attention module does not help, because the limited data provides little learning signal for the attention mechanism.

Cross-attention is essential for pooling important features: self-attention alone yields inferior results and is also more computationally expensive.

Future Improvements

  • Train on a larger dataset, then fine-tune on smaller tasks
  • Design a pre-training stage, e.g. following VideoMAE
  • Analyze the performance of a well-trained model on longer clips

Running Experiment

The most convenient way to train the model is to edit config.yaml beforehand, then run:

python train.py

I also provide a Jupyter notebook, train.ipynb, to run on cloud services such as Google Colab or Kaggle.
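
As a rough illustration of this config-driven workflow, the snippet below shows how a training script might read config.yaml before building the model; the keys are hypothetical placeholders, not this repository's actual config schema.

```python
# Hypothetical illustration of reading config.yaml before training;
# the keys below are placeholders, not the repository's actual schema.
import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg.get("epochs"), cfg.get("learning_rate"))  # assumed keys
```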

Citation

@misc{tran2018closer,
      title={A Closer Look at Spatiotemporal Convolutions for Action Recognition}, 
      author={Du Tran and Heng Wang and Lorenzo Torresani and Jamie Ray and Yann LeCun and Manohar Paluri},
      year={2018},
      eprint={1711.11248},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
@misc{jaegle2021perceiver,
      title={Perceiver: General Perception with Iterative Attention}, 
      author={Andrew Jaegle and Felix Gimeno and Andrew Brock and Andrew Zisserman and Oriol Vinyals and Joao Carreira},
      year={2021},
      eprint={2103.03206},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
@misc{kalfaoglu2020late,
      title={Late Temporal Modeling in 3D CNN Architectures with BERT for Action Recognition}, 
      author={M. Esat Kalfaoglu and Sinan Kalkan and A. Aydin Alatan},
      year={2020},
      eprint={2008.01232},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
@misc{tong2022videomae,
      title={VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training}, 
      author={Zhan Tong and Yibing Song and Jue Wang and Limin Wang},
      year={2022},
      eprint={2203.12602},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
