
Codes of the paper: PAN: Towards Fast Action Recognition via Learning Persistence of Appearance

Home Page: https://arxiv.org/abs/2008.03462

License: Apache License 2.0

Python 98.54% Shell 1.46%
action-recognition video-understanding motion-representation

pan-pytorch's Introduction

PAN: Persistent Appearance Network


PyTorch Implementation of paper:

PAN: Towards Fast Action Recognition via Learning Persistence of Appearance

Can Zhang, Yuexian Zou*, Guang Chen and Lei Gan.

[ArXiv]

Updates

[12 Aug 2020] We have released the codebase and models of PAN.

Main Contribution

Efficiently modeling dynamic motion information in videos is crucial for the action recognition task. Most state-of-the-art methods rely heavily on dense optical flow as the motion representation. Although combining optical flow with RGB frames as input achieves excellent recognition performance, extracting optical flow is very time-consuming, which undoubtedly works against real-time action recognition. In this paper, we shed light on fast action recognition by lifting the reliance on optical flow. We design a novel motion cue called Persistence of Appearance (PA) that focuses on distilling motion information at boundaries. Extensive experiments show that our PA is over 1000x faster (8196 fps vs. 8 fps) than conventional optical flow in terms of motion modeling speed.

Content

Dependencies

Please make sure the following libraries are installed successfully:

Data Preparation

Following common practice, we first extract videos into frames for fast reading. Please refer to the TSN repo for a detailed guide on data pre-processing. We have successfully trained on the Kinetics, UCF101, HMDB51, Something-Something-V1 & V2, and Jester datasets with this codebase. Basically, the processing of video data can be summarized into 3 steps:

  1. Extract frames from videos:

  2. Generate file lists needed for dataloader:

    • Each line of the list file contains a tuple of (extracted video frame folder name, video frame number, video ground-truth class); a minimal sketch for generating such a list is shown after this list. A list file looks like this:

      video_frame_folder 100 10
      video_2_frame_folder 150 31
      ...
      
    • Or you can use off-the-shelf tools provided by other repos:

  3. Add the information to ops/dataset_configs.py
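
As a reference for step 2 above, here is a minimal sketch (not part of this repo) that walks a directory of extracted frame folders and writes a list file in the expected format; the folder layout, label lookup, and output path are assumptions to adapt to your own dataset.

import os

def build_file_list(frames_root, labels, output_path):
    # Write one line per video: "<frame_folder> <num_frames> <class_id>"
    # frames_root: directory with one sub-folder of extracted frames per video (assumed layout)
    # labels: dict mapping frame-folder name -> integer class id (assumed to exist)
    with open(output_path, "w") as f:
        for folder in sorted(os.listdir(frames_root)):
            folder_path = os.path.join(frames_root, folder)
            if not os.path.isdir(folder_path):
                continue
            num_frames = len([name for name in os.listdir(folder_path)
                              if name.endswith((".jpg", ".png"))])
            f.write("{} {} {}\n".format(folder, num_frames, labels[folder]))

# Hypothetical usage:
# build_file_list("data/sthv1/frames", {"video_frame_folder": 10}, "train_videofolder.txt")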

Core Codes

PA Module

The PA module aims to speed up the motion modeling procedure. It can simply be injected at the bottom of the network to lift the reliance on optical flow.

import torch
from ops.PAN_modules import PA

PA_module = PA(n_length=4) # adjacent '4' frames are sampled for computing PA
# shape of x: [N*T*m, 3, H, W]
x = torch.randn(5*8*4, 3, 224, 224)
# shape of PA_out: [N*T, m-1, H, W]
PA_out = PA_module(x) # torch.Size([40, 3, 224, 224])

VAP Module

The VAP module aims to adaptively emphasize expressive features and suppress less informative ones by observing global information across various timescales. It is adopted at the top of the network to achieve long-term temporal modeling.

import torch
from ops.PAN_modules import VAP

VAP_module = VAP(n_segment=8, feature_dim=2048, num_class=174, dropout_ratio=0.5)
# shape of x: [N*T, D]
x = torch.randn(5*8, 2048)
# shape of VAP_out: [N, num_class]
VAP_out = VAP_module(x) # torch.Size([5, 174])
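
To illustrate how the two modules fit together, here is a minimal end-to-end sketch that feeds PA maps through a 2D backbone and aggregates segment-level features with VAP. The plain torchvision ResNet-50 and the feature dimension wiring are assumptions for illustration only; see ops/models.py for the repo's actual model construction.

import torch
import torchvision
from ops.PAN_modules import PA, VAP

N, T, m, num_class = 5, 8, 4, 174
PA_module = PA(n_length=m)                                 # motion cue at the bottom of the network
backbone = torchvision.models.resnet50(num_classes=2048)   # assumption: a plain 2D CNN stands in for the repo's backbone
VAP_module = VAP(n_segment=T, feature_dim=2048, num_class=num_class, dropout_ratio=0.5)

x = torch.randn(N * T * m, 3, 224, 224)   # N videos, T segments, m sampled frames per segment
pa = PA_module(x)                         # [N*T, m-1, H, W]; with m=4 this is 3 channels, so a standard RGB backbone accepts it
feat = backbone(pa)                       # [N*T, 2048] segment-level features
out = VAP_module(feat)                    # [N, num_class] -> torch.Size([5, 174])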

Pretrained Models

Here we provide pretrained PAN models on the Something-Something-V1 & V2 datasets. Recognizing actions in these datasets requires strong temporal modeling ability, as many action classes are symmetrical. PAN achieves state-of-the-art performance on these datasets. Notably, our method even surpasses optical-flow-based methods while using only RGB frames as input.

Something-Something-V1

Model      Backbone    FLOPs * views        Val Top-1  Val Top-5  Checkpoints
PAN_Lite   ResNet-50   35.7G * 1            48.0       76.1       [Google Drive] or [Weiyun]
PAN_Full   ResNet-50   67.7G * 1            50.5       79.2
PAN_En     ResNet-50   (46.6G+88.4G) * 2    53.4       81.1
PAN_En     ResNet-101  (85.6G+166.1G) * 2   55.3       82.8       [Google Drive] or [Weiyun]

Something-Something-V2

Model      Backbone    FLOPs * views        Val Top-1  Val Top-5  Checkpoints
PAN_Lite   ResNet-50   35.7G * 1            60.8       86.7       [Google Drive] or [Weiyun]
PAN_Full   ResNet-50   67.7G * 1            63.8       88.6
PAN_En     ResNet-50   (46.6G+88.4G) * 2    66.2       90.1
PAN_En     ResNet-101  (85.6G+166.1G) * 2   66.5       90.6       [Google Drive] or [Weiyun]

Testing

For example, to test the PAN models on Something-Something-V1, you can first put the downloaded .pth.tar files into the "pretrained" folder and then run:

# test PAN_Lite
bash scripts/test/sthv1/Lite.sh

# test PAN_Full
bash scripts/test/sthv1/Full.sh

# test PAN_En
bash scripts/test/sthv1/En.sh

Training

We provide several scripts to train PAN with this repo; please refer to the "scripts" folder for more details. For example, to train PAN on Something-Something-V1, you can run:

# train PAN_Lite
bash scripts/train/sthv1/Lite.sh

# train PAN_Full RGB branch
bash scripts/train/sthv1/Full_RGB.sh

# train PAN_Full PA branch
bash scripts/train/sthv1/Full_PA.sh

Note that you should scale up the learning rate with the batch size. For example, if you use a batch size of 256, you should set the learning rate to 0.04.
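
For reference, a minimal sketch of that linear scaling rule; the base learning rate of 0.01 at a base batch size of 64 is an assumption chosen so that a batch size of 256 maps to 0.04, matching the example above.

def scale_lr(batch_size, base_lr=0.01, base_batch_size=64):
    # Linear scaling: the learning rate grows proportionally with the batch size.
    # Assumed base: lr 0.01 at batch size 64, so batch size 256 -> 0.01 * 256 / 64 = 0.04.
    return base_lr * batch_size / base_batch_size

print(scale_lr(256))  # 0.04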

Other Info

References

This repository is built upon the following baseline implementations for the action recognition task.

Citation

Please [★star] this repo and [cite] the following arXiv paper if you find our PAN useful for your research:

@misc{zhang2020pan,
    title={PAN: Towards Fast Action Recognition via Learning Persistence of Appearance},
    author={Can Zhang and Yuexian Zou and Guang Chen and Lei Gan},
    year={2020},
    eprint={2008.03462},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

Or, if you prefer a formal publication, you can cite our preliminary work at ACM MM 2019:

@inproceedings{zhang2019pan,
  title={PAN: Persistent Appearance Network with an Efficient Motion Cue for Fast Action Recognition},
  author={Zhang, Can and Zou, Yuexian and Chen, Guang and Gan, Lei},
  booktitle={Proceedings of the 27th ACM International Conference on Multimedia},
  pages={500--509},
  year={2019}
}

Contact

For any questions, please feel free to open an issue or contact:

Can Zhang: [email protected]

pan-pytorch's Issues

Should I use tools/vid2img_sthv2.py to extract frames of JESTER dataset?

First of all, thank you for a great paper! :)

I am working on implementing your paper on the 20BN-Jester hand gesture recognition dataset, and I have a few questions.

  1. To extract frames of the Jester dataset, should I use (refer to):

  2. Do you happen to have any pre-trained model checkpoints for the Jester dataset?

  3. Could I ask how long it took to train the different models (resnet50, resnet101, PAN_lite, PAN_full) on the Jester dataset?

Thank you so much for your help.

Lucrece

RuntimeError: shape '[-1, 3, 224, 224]' is invalid for input of size 12288

When testing on the Something-Something-V2 dataset, I hit the following RuntimeError; it occurs with all three test scripts (En.sh, Full.sh, Lite.sh).

Traceback (most recent call last):
  File "C:\Users\ganjunsi\Downloads\PAN\test_models.py", line 289, in <module>
    rst = eval_video((i, data, label), net, n_seg, modality)
  File "C:\Users\ganjunsi\Downloads\PAN\test_models.py", line 128, in eval_video
    rst = net(data_in)
  File "C:\Users\ganjunsi\.conda\envs\gjs-pan\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\ganjunsi\.conda\envs\gjs-pan\lib\site-packages\torch\nn\parallel\data_parallel.py", line 166, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "C:\Users\ganjunsi\.conda\envs\gjs-pan\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\ganjunsi\Downloads\PAN\ops\models.py", line 290, in forward
    PA = self.PA(input)
  File "C:\Users\ganjunsi\.conda\envs\gjs-pan\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\ganjunsi\Downloads\PAN\ops\PAN_modules.py", line 28, in forward
    PA = d.view(-1, 1*(self.n_length-1), h, w)
RuntimeError: shape '[-1, 3, 224, 224]' is invalid for input of size 12288

How to create a webcam demo with real-time video feed?

Firstly, many thanks for making the codebase public.
Now, I want to make a webcam-based demo (buffering the latest N video frames and recognizing actions for them) with models trained on the Something-Something-V2 dataset in the 'RGB' and 'PA' modalities. I am not able to find a way to do this using the existing code... :( Please help!
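
For what it's worth, the buffering part of such a demo could look like the minimal sketch below, assuming OpenCV for capture; the buffer length, resizing, and the inference step are placeholders and not part of this repo.

import collections
import cv2

N_FRAMES = 32                         # keep the latest N frames (placeholder value)
frame_buffer = collections.deque(maxlen=N_FRAMES)

cap = cv2.VideoCapture(0)             # default webcam
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frame_buffer.append(cv2.resize(frame, (224, 224)))
    if len(frame_buffer) == N_FRAMES:
        # Sample segments from frame_buffer, preprocess them the same way the
        # training pipeline does, and run the trained PAN model here (omitted).
        pass
cap.release()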

About PA as attention

Thank you for your great work! I am really interested in it!

The Lite modality is computed as Equation (7) [image omitted], but the released code does not match Equation (7).

elif self.modality == 'Lite':
    input = input.view((-1, sample_len) + input.size()[-2:])
    PA = self.PA(input)
    RGB = input.view((-1, self.data_length, sample_len) + input.size()[-2:])[:, 0, :, :, :]
    base_out = torch.cat((RGB, PA), 1)  # does not match Equation (7)
    base_out = self.base_model(base_out)

So could you release the code for Equation (7)?

About the 'VAP module'

Is it effective in other types of temporal modeling tasks? What kinds of tasks do you think this module is suitable for?
Looking forward to your answer!

What is the label start number?

Hello, zhang-can

Thanks for your code sharing.
It was very helpful.

I have a question while using this code.
I made the UCF-101 file list like this...

In the UCF-101 dataset, ApplyEyeMakeup is the first label, so I wrote training.txt like this:
v_ApplyEyeMakeup_g01_c01 163 1
Am I wrong?

Thank you for your reply in advance.

Missing and mismatched weight dimensions when loading the BNInception graph

I am trying to train PAN_Lite with the BNInception backbone, but I get this error when loading the weights into the graph. I downloaded the weights and load them from the downloaded file:
File "main.py", line 419, in
main()
File "main.py", line 91, in main
non_local=args.non_local, data_length=data_length, has_VAP=args.VAP)
File "/workspace/quantt1/PAN-PyTorch/ops/models.py", line 76, in init
self.base_model = self._construct_pa_model(self.base_model)
File "/workspace/quantt1/PAN-PyTorch/ops/models.py", line 361, in _construct_pa_model
base_model.load_state_dict(torch.load("/workspace/quantt1/PAN-PyTorch/BNInceptionFlow-ef652051.pth.tar"))
File "/home/me/.conda/envs/quan/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1052, in load_state_dict
self.class.name, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for BNInception:
Missing key(s) in state_dict: "fc.TES.0.weight", "fc.TES.2.weight", "fc.pred.weight", "fc.pred.bias".
size mismatch for conv1_7x7_s2.weight: copying a param with shape torch.Size([64, 10, 7, 7]) from checkpoint, the shape in current model is torch.Size([64, 6, 7, 7]).
Can anyone help me? Thanks.

How to generate visualizations of PAN

In the repository you show how PAN works for different actions (e.g. Yoyo). This is a very useful insight and I was wondering how I can generate similar visuals from data that is given to the model for inference.

Environments

Could you please tell me about the running environment?
Python version, CUDA version, PyTorch version, and so on...
