Hi Fan Yang. Would it be possible to live demo the JHMDB model using

Demo from Camera Input using Openpose about dd-net HOT 8 CLOSED

fandulu commented on July 20, 2024

Demo from Camera Input using Openpose

from dd-net.

Comments (8)

anhminh3105 commented on July 20, 2024 1

I tried training DD-Net on the whole dataset by concatenating the splits into one, in order to get better training results (94-95% val_acc). Yet the results on camera input weren't good: predictions sometimes flickered on slightly harder poses that the dataset perhaps didn't contain. For example, when sitting in a relaxed, laid-back pose rather than upright, the model would flicker to 'stand', 'wave', etc. I suspected that training on the combined splits didn't cut it and that the training results weren't that representative. I also tried training on a subset of the dataset with a few selected classes (e.g. walk, stand, sit), but the problem remained.

I also tried to improve per-split training performance on the selected class subsets, and I managed to get the val_acc of each split over 80% by using class weights to offset the imbalance, since 'walk' has ~3 times more data than the other classes. So I guessed this would be more representative of how the model performs in real life. In testing, of course, the inference performance still wasn't there.
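For reference, one common way to derive such class weights is to weight each class inversely to its frequency. The sketch below is an assumption about how the weighting could be done, not the code actually used here; the label list and the helper name `balanced_class_weights` are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes get weights above 1, frequent classes below 1.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical label list in which 'walk' has 3x the samples of the others.
labels = ["walk"] * 30 + ["stand"] * 10 + ["sit"] * 10
weights = balanced_class_weights(labels)
```

With Keras, a dictionary like this can be passed to `model.fit(..., class_weight=...)` once the keys are mapped to integer class indices.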

Note: because the 'pos_world' data was normalised with the puppetflow scale, I couldn't obtain that data with just OpenPose, or at least I don't know how to, so I used the 'pos_img' data instead and normalised it by its mean (using your norm_scale() function).

Do you have any suggestions to improve it?

Br.

fandulu commented on July 20, 2024 1

I really appreciate that many of you are helping me improve my code! I am not an expert on action recognition and am still learning, but I would like to share what I know. Although this work uses skeletons, I find that RGB usually helps more for action recognition performance, because skeleton estimation easily introduces noise and we also lose the context information, which is very useful for identifying actions. For some real applications, you can even use a simple but decent method: https://www.pyimagesearch.com/2019/07/15/video-classification-with-keras-and-deep-learning/

fandulu commented on July 20, 2024

Hi, you could use a temporal window (e.g., window_size = 3s, step = 0.5s) to convert the online stream into offline clips. The prediction results of several clips (e.g., 3 clips) could then be averaged to obtain reasonable accuracy on the online stream.
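A minimal sketch of this idea, assuming frames arrive one at a time and that a `predict_fn` (hypothetical, standing in for the DD-Net forward pass plus preprocessing) maps a list of frames to a list of class scores. The class name `SlidingWindowSmoother` and the 30 fps figure are illustrative assumptions.

```python
from collections import deque


class SlidingWindowSmoother:
    """Turn a frame stream into overlapping clips and average recent clip scores."""

    def __init__(self, window_frames, step_frames, num_clips, predict_fn):
        self.frames = deque(maxlen=window_frames)  # the current temporal window
        self.scores = deque(maxlen=num_clips)      # scores of the latest clips
        self.step_frames = step_frames
        self.predict_fn = predict_fn
        self._since_last = 0                       # frames since the last prediction

    def push(self, frame):
        """Add one frame; return averaged class scores, or None if no clip is due."""
        self.frames.append(frame)
        self._since_last += 1
        window_full = len(self.frames) == self.frames.maxlen
        if not window_full or self._since_last < self.step_frames:
            return None
        self._since_last = 0
        self.scores.append(self.predict_fn(list(self.frames)))
        n = len(self.scores)
        # Element-wise mean over the stored clip scores.
        return [sum(col) / n for col in zip(*self.scores)]


# Assuming a 30 fps camera: a 3 s window is 90 frames, a 0.5 s step is 15 frames.
def dummy_predict(clip):
    # Placeholder scores for two classes; a real model call would go here.
    return [1.0, 0.0]


smoother = SlidingWindowSmoother(window_frames=90, step_frames=15,
                                 num_clips=3, predict_fn=dummy_predict)
```

Averaging over more clips smooths flicker further but delays the reaction to a real change of action, so `num_clips` is a trade-off.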

anhminh3105 commented on July 20, 2024

Thanks for your reply.

Apologies, I don't fully understand your idea. Could you explain a bit more about using a temporal window to convert the input stream into offline clips? Is it similar to using a 3D convolutional layer?

I am thinking of feeding DD-Net a pose keypoint volume of shape (num_people_poses, 32, 15, 2), collected over 32 frames of the input stream. The action labels would then be assigned to the people's poses and visualised. Do you think that would also work?
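As a rough sketch of what building that volume could look like, assuming each frame yields an array of shape (num_people, 15, 2) and that people keep the same index across frames (which in practice needs some form of pose tracking). The function name `stack_pose_frames` is hypothetical.

```python
import numpy as np


def stack_pose_frames(frames, clip_len=32):
    """Stack the latest clip_len per-frame poses into (num_people, clip_len, 15, 2).

    frames: list of arrays, each shaped (num_people, 15, 2), oldest first.
    Assumes person i in one frame is the same person i in every other frame.
    """
    if len(frames) < clip_len:
        raise ValueError(f"need at least {clip_len} frames, got {len(frames)}")
    clip = np.stack(frames[-clip_len:], axis=0)   # (clip_len, num_people, 15, 2)
    return np.transpose(clip, (1, 0, 2, 3))       # (num_people, clip_len, 15, 2)


# Dummy stream: 40 frames, 2 people, every keypoint set to the frame index.
frames = [np.full((2, 15, 2), t, dtype=np.float32) for t in range(40)]
volume = stack_pose_frames(frames)
```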

Br.

fandulu commented on July 20, 2024

(1) A simple way to use the temporal window: suppose you want the pose class at time T. To use temporal information, you can include the poses from a few moments earlier: with a temporal window W, you collect the pose information from time T-W to T and feed it into the model. How far back should the window reach? If it is too long, the action may already have changed within the window; if it is too short, there is not enough temporal information to use. That is something you need to balance. Once you have a window, how frequently do you want to classify? With a step L, your next window spans T-W+L to T+L. If you assume the action class stays the same within N steps, you can average the predicted action class scores over those N steps (a span of N*L).

(2) For the multi-person case, you can take statistics (e.g., mean, max, min) of the features across the persons, and then use another network to fuse them.
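The pooling step in (2) could be sketched as follows, assuming each person has already been reduced to a fixed-length feature vector (for example, a middle-layer activation). The function name `pool_person_features` is hypothetical, and the fusion network that would consume the result is not shown.

```python
import numpy as np


def pool_person_features(features):
    """Pool per-person features into one fixed-size group descriptor.

    features: array-like of shape (num_people, feat_dim).
    Returns a vector of shape (3 * feat_dim,): [mean | max | min] across people.
    """
    features = np.asarray(features, dtype=np.float32)
    return np.concatenate([
        features.mean(axis=0),
        features.max(axis=0),
        features.min(axis=0),
    ])


# Two people with 2-dimensional features.
pooled = pool_person_features([[1.0, 4.0], [3.0, 2.0]])
```

Because the statistics are taken across the person axis, the output size does not depend on how many people are in the scene.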

anhminh3105 commented on July 20, 2024

Thank you for your helpful suggestion and detailed explanation, I really appreciate it.

I suppose that (2) is for a multi-person action, am I correct? For predicting multiple people each performing their own action, I suppose I would just need to average the predicted action for each person, and the fusing network would not be needed, right?

Br.

fandulu commented on July 20, 2024

Sorry for misunderstanding your point. When I saw OpenPose I thought you were doing group activity recognition, but you can also use it for individuals via pose tracking. For a multi-person action, the idea is not to average the final predictions but to average the middle-layer features. Anyhow, it seems to be unrelated to your case.

anhminh3105 commented on July 20, 2024

Many thanks for sharing. I'm going to look into it.

Br.
