
Codes of the paper: PAN: Towards Fast Action Recognition via Learning Persistence of Appearance

Home Page: https://arxiv.org/abs/2008.03462

License: Apache License 2.0

Python 98.54% Shell 1.46%
action-recognition video-understanding motion-representation

pan-pytorch's Introduction

PAN: Persistent Appearance Network


PyTorch Implementation of paper:

PAN: Towards Fast Action Recognition via Learning Persistence of Appearance

Can Zhang, Yuexian Zou*, Guang Chen and Lei Gan.

[ArXiv]

Updates

[12 Aug 2020] We have released the codebase and models of PAN.

Main Contribution

Efficiently modeling dynamic motion information in videos is crucial for the action recognition task. Most state-of-the-art methods rely heavily on dense optical flow as the motion representation. Although combining optical flow with RGB frames as input achieves excellent recognition performance, extracting optical flow is very time-consuming, which undoubtedly works against real-time action recognition. In this paper, we shed light on fast action recognition by lifting the reliance on optical flow. We design a novel motion cue called Persistence of Appearance (PA) that focuses on distilling motion information at boundaries. Extensive experiments show that our PA is over 1000x faster (8196 fps vs. 8 fps) than conventional optical flow in terms of motion modeling speed.

Content

Dependencies

Please make sure the following libraries are installed successfully:

Data Preparation

Following common practice, we first extract videos into frames for fast reading. Please refer to the TSN repo for a detailed guide on data pre-processing. We have successfully trained on the Kinetics, UCF101, HMDB51, Something-Something-V1 & V2, and Jester datasets with this codebase. Basically, the processing of video data can be summarized into 3 steps:

  1. Extract frames from videos:

  2. Generate file lists needed for dataloader:

    • Each line of the list file contains a tuple of (extracted video frame folder name, video frame number, video ground-truth class); a minimal sketch for generating such a list is shown after this list. A list file looks like this:

      video_frame_folder 100 10
      video_2_frame_folder 150 31
      ...
      
    • Or you can use off-the-shelf tools provided by other repos:

  3. Add the information to ops/dataset_configs.py
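
As a reference for step 2 above, here is a minimal sketch (not part of this repo) that walks a directory of extracted frame folders and writes a list file in the expected format; the folder layout, label lookup, and output path are assumptions to adapt to your own dataset.

import os

def build_file_list(frames_root, labels, output_path):
    # Write one line per video: "<frame_folder> <num_frames> <class_id>"
    # frames_root: directory with one sub-folder of extracted frames per video (assumed layout)
    # labels: dict mapping frame-folder name -> integer class id (assumed to exist)
    with open(output_path, "w") as f:
        for folder in sorted(os.listdir(frames_root)):
            folder_path = os.path.join(frames_root, folder)
            if not os.path.isdir(folder_path):
                continue
            num_frames = len([name for name in os.listdir(folder_path)
                              if name.endswith((".jpg", ".png"))])
            f.write("{} {} {}\n".format(folder, num_frames, labels[folder]))

# Hypothetical usage:
# build_file_list("data/sthv1/frames", {"video_frame_folder": 10}, "train_videofolder.txt")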

Core Codes

PA Module

The PA module aims to speed up the motion modeling procedure. It can simply be injected at the bottom of the network to lift the reliance on optical flow.

import torch
from ops.PAN_modules import PA

PA_module = PA(n_length=4) # adjacent '4' frames are sampled for computing PA
# shape of x: [N*T*m, 3, H, W]
x = torch.randn(5*8*4, 3, 224, 224)
# shape of PA_out: [N*T, m-1, H, W]
PA_out = PA_module(x) # torch.Size([40, 3, 224, 224])

VAP Module

The VAP module aims to adaptively emphasize expressive features and suppress less informative ones by observing global information across various timescales. It is adopted at the top of the network to achieve long-term temporal modeling.

import torch
from ops.PAN_modules import VAP

VAP_module = VAP(n_segment=8, feature_dim=2048, num_class=174, dropout_ratio=0.5)
# shape of x: [N*T, D]
x = torch.randn(5*8, 2048)
# shape of VAP_out: [N, num_class]
VAP_out = VAP_module(x) # torch.Size([5, 174])
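
To illustrate how the two modules fit together, here is a minimal end-to-end sketch that feeds PA maps through a 2D backbone and aggregates segment-level features with VAP. The plain torchvision ResNet-50 and the feature dimension wiring are assumptions for illustration only; see ops/models.py for the repo's actual model construction.

import torch
import torchvision
from ops.PAN_modules import PA, VAP

N, T, m, num_class = 5, 8, 4, 174
PA_module = PA(n_length=m)                                 # motion cue at the bottom of the network
backbone = torchvision.models.resnet50(num_classes=2048)   # assumption: a plain 2D CNN stands in for the repo's backbone
VAP_module = VAP(n_segment=T, feature_dim=2048, num_class=num_class, dropout_ratio=0.5)

x = torch.randn(N * T * m, 3, 224, 224)   # N videos, T segments, m sampled frames per segment
pa = PA_module(x)                         # [N*T, m-1, H, W]; with m=4 this is 3 channels, so a standard RGB backbone accepts it
feat = backbone(pa)                       # [N*T, 2048] segment-level features
out = VAP_module(feat)                    # [N, num_class] -> torch.Size([5, 174])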

Pretrained Models

Here we provide pretrained PAN models on the Something-Something-V1 & V2 datasets. Recognizing actions in these datasets requires strong temporal modeling ability, as many action classes are symmetrical. PAN achieves state-of-the-art performance on these datasets. Notably, our method even surpasses optical-flow-based methods while using only RGB frames as input.

Something-Something-V1

Model      Backbone    FLOPs * views        Val Top-1  Val Top-5  Checkpoints
PAN_Lite   ResNet-50   35.7G * 1            48.0       76.1       [Google Drive] or [Weiyun]
PAN_Full   ResNet-50   67.7G * 1            50.5       79.2
PAN_En     ResNet-50   (46.6G+88.4G) * 2    53.4       81.1
PAN_En     ResNet-101  (85.6G+166.1G) * 2   55.3       82.8       [Google Drive] or [Weiyun]

Something-Something-V2

Model      Backbone    FLOPs * views        Val Top-1  Val Top-5  Checkpoints
PAN_Lite   ResNet-50   35.7G * 1            60.8       86.7       [Google Drive] or [Weiyun]
PAN_Full   ResNet-50   67.7G * 1            63.8       88.6
PAN_En     ResNet-50   (46.6G+88.4G) * 2    66.2       90.1
PAN_En     ResNet-101  (85.6G+166.1G) * 2   66.5       90.6       [Google Drive] or [Weiyun]

Testing

For example, to test the PAN models on Something-Something-V1, you can first put the downloaded .pth.tar files into the "pretrained" folder and then run:

# test PAN_Lite
bash scripts/test/sthv1/Lite.sh

# test PAN_Full
bash scripts/test/sthv1/Full.sh

# test PAN_En
bash scripts/test/sthv1/En.sh

Training

We provide several scripts to train PAN with this repo; please refer to the "scripts" folder for more details. For example, to train PAN on Something-Something-V1, you can run:

# train PAN_Lite
bash scripts/train/sthv1/Lite.sh

# train PAN_Full RGB branch
bash scripts/train/sthv1/Full_RGB.sh

# train PAN_Full PA branch
bash scripts/train/sthv1/Full_PA.sh

Note that you should scale up the learning rate with the batch size. For example, if you use a batch size of 256, you should set the learning rate to 0.04.
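
For reference, a minimal sketch of that linear scaling rule; the base learning rate of 0.01 at a base batch size of 64 is an assumption chosen so that a batch size of 256 maps to 0.04, matching the example above.

def scale_lr(batch_size, base_lr=0.01, base_batch_size=64):
    # Linear scaling: the learning rate grows proportionally with the batch size.
    # Assumed base: lr 0.01 at batch size 64, so batch size 256 -> 0.01 * 256 / 64 = 0.04.
    return base_lr * batch_size / base_batch_size

print(scale_lr(256))  # 0.04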

Other Info

References

This repository is built upon the following baseline implementations for the action recognition task.

Citation

Please [★star] this repo and [cite] the following arXiv paper if you find our PAN useful for your research:

@misc{zhang2020pan,
    title={PAN: Towards Fast Action Recognition via Learning Persistence of Appearance},
    author={Can Zhang and Yuexian Zou and Guang Chen and Lei Gan},
    year={2020},
    eprint={2008.03462},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

Or, if you prefer a formal publication, you can cite our preliminary work at ACM MM 2019:

@inproceedings{zhang2019pan,
  title={PAN: Persistent Appearance Network with an Efficient Motion Cue for Fast Action Recognition},
  author={Zhang, Can and Zou, Yuexian and Chen, Guang and Gan, Lei},
  booktitle={Proceedings of the 27th ACM International Conference on Multimedia},
  pages={500--509},
  year={2019}
}

Contact

For any questions, please feel free to open an issue or contact:

Can Zhang: [email protected]

pan-pytorch's Issues

Should I use tools/vid2img_sthv2.py to extract frames of JESTER dataset?

First of all, thank you for a great paper! :)

I am working on implementing your paper on the 20BN-Jester hand gesture recognition dataset, and I have a few questions.

  1. To extract frames of the Jester dataset, should I use (refer to):

  2. Do you happen to have any pre-trained model checkpoints for the Jester dataset?

  3. Could I ask how long it took to train the different models (resnet50, resnet101, PAN_lite, PAN_full) on the Jester dataset?

Thank you so much for your help.

Lucrece

RuntimeError: shape '[-1, 3, 224, 224]' is invalid for input of size 12288

When testing on the Something-Something-V2 dataset, I hit the following RuntimeError; it occurs with all three test scripts (En.sh, Full.sh, Lite.sh).

Traceback (most recent call last):
  File "C:\Users\ganjunsi\Downloads\PAN\test_models.py", line 289, in <module>
    rst = eval_video((i, data, label), net, n_seg, modality)
  File "C:\Users\ganjunsi\Downloads\PAN\test_models.py", line 128, in eval_video
    rst = net(data_in)
  File "C:\Users\ganjunsi\.conda\envs\gjs-pan\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\ganjunsi\.conda\envs\gjs-pan\lib\site-packages\torch\nn\parallel\data_parallel.py", line 166, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "C:\Users\ganjunsi\.conda\envs\gjs-pan\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\ganjunsi\Downloads\PAN\ops\models.py", line 290, in forward
    PA = self.PA(input)
  File "C:\Users\ganjunsi\.conda\envs\gjs-pan\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\ganjunsi\Downloads\PAN\ops\PAN_modules.py", line 28, in forward
    PA = d.view(-1, 1*(self.n_length-1), h, w)
RuntimeError: shape '[-1, 3, 224, 224]' is invalid for input of size 12288

How to create a webcam demo with real-time video feed?

Firstly, many thanks for making the codebase public.
Now, I want to make a webcam-based demo (buffering the latest N video frames and recognizing actions for them) with models trained on the Something-Something-V2 dataset in the 'RGB' and 'PA' modalities. I am not able to find a way to do this using the existing code... :( Please help!
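
For what it's worth, the buffering part of such a demo could look like the minimal sketch below, assuming OpenCV for capture; the buffer length, resizing, and the inference step are placeholders and not part of this repo.

import collections
import cv2

N_FRAMES = 32                         # keep the latest N frames (placeholder value)
frame_buffer = collections.deque(maxlen=N_FRAMES)

cap = cv2.VideoCapture(0)             # default webcam
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frame_buffer.append(cv2.resize(frame, (224, 224)))
    if len(frame_buffer) == N_FRAMES:
        # Sample segments from frame_buffer, preprocess them the same way the
        # training pipeline does, and run the trained PAN model here (omitted).
        pass
cap.release()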

About PA as attention

Thank you for your great work! I am really interested in it!

The Lite modality is computed as Equation (7) [image omitted], but the released code does not match Equation (7).

elif self.modality == 'Lite':
    input = input.view((-1, sample_len) + input.size()[-2:])
    PA = self.PA(input)
    RGB = input.view((-1, self.data_length, sample_len) + input.size()[-2:])[:, 0, :, :, :]
    base_out = torch.cat((RGB, PA), 1)  # does not match Equation (7)
    base_out = self.base_model(base_out)

So could you release the code for Equation (7)?

About the 'VAP module'

Is it effective in other types of temporal modeling tasks? What kinds of tasks do you think this module is suitable for?
Looking forward to your answer!

What is the label start number?

Hello, zhang-can

Thanks for your code sharing.
It was very helpful.

I have a question while using this code.
I made the UCF-101 file list like this...

In the UCF-101 dataset, ApplyEyeMakeup is the first label, so I wrote training.txt like this:
v_ApplyEyeMakeup_g01_c01 163 1
Am I wrong?

Thank you for your reply in advance.

Missing and mismatched weight dimensions when loading the BNInception graph

I am trying to train PAN_Lite with the BNInception backbone, but I get this error when loading the weights into the graph. I downloaded the weights and load them from the downloaded file:
File "main.py", line 419, in
main()
File "main.py", line 91, in main
non_local=args.non_local, data_length=data_length, has_VAP=args.VAP)
File "/workspace/quantt1/PAN-PyTorch/ops/models.py", line 76, in init
self.base_model = self._construct_pa_model(self.base_model)
File "/workspace/quantt1/PAN-PyTorch/ops/models.py", line 361, in _construct_pa_model
base_model.load_state_dict(torch.load("/workspace/quantt1/PAN-PyTorch/BNInceptionFlow-ef652051.pth.tar"))
File "/home/me/.conda/envs/quan/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1052, in load_state_dict
self.class.name, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for BNInception:
Missing key(s) in state_dict: "fc.TES.0.weight", "fc.TES.2.weight", "fc.pred.weight", "fc.pred.bias".
size mismatch for conv1_7x7_s2.weight: copying a param with shape torch.Size([64, 10, 7, 7]) from checkpoint, the shape in current model is torch.Size([64, 6, 7, 7]).
Can anyone help me? Thanks.

How to generate visualizations of PAN

In the repository you show how PAN works for different actions (e.g. Yoyo). This is a very useful insight and I was wondering how I can generate similar visuals from data that is given to the model for inference.

Environments

Could you please tell me about the running environment?
Python version, CUDA version, PyTorch version, and so on...
