
Comments (8)

anhminh3105 commented on July 20, 2024

I tried training DD-Net on the whole dataset by concatenating the splits into one, in order to obtain better training results (94-95% val_acc). However, the results when testing on camera input weren't good: predictions sometimes flickered on somewhat difficult poses that the dataset may not contain. For example, when sitting in a relaxed, laid-back pose rather than upright, the model would flicker to 'stand', 'wave', etc. I suspect that training on the combined splits didn't cut it and that the training results weren't very representative. I also tried training on a subset of the dataset with a few selected classes (e.g. walk, stand, sit), but the problem remained.

I also tried to improve training performance on the individual splits with selected class subsets, and I managed to push the val_acc of each split above 80% by using class weights to alleviate the skew, since 'walk' has ~3 times more data than the other classes. I guessed this would be more representative of how the model performs in real life. Of course, when testing, the inference performance still wasn't there.
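For readers following along, the class weighting described above could look roughly like this. It is a minimal sketch, not the code actually used in this thread: `balanced_class_weights` is a hypothetical helper that weights each class by inverse frequency, and the label counts are made up to mirror the "~3 times more 'walk' data" situation.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class_id: weight} with weight = n_samples / (n_classes * count).

    Rare classes get a weight above 1, frequent classes a weight below 1.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Illustrative labels only: class 0 ('walk') has 3x the samples of classes 1 and 2.
labels = [0] * 300 + [1] * 100 + [2] * 100
class_weight = balanced_class_weights(labels)
```

A dictionary of this form is what Keras accepts as the `class_weight` argument of `Model.fit`, so each 'walk' sample here would count one third as much as a 'stand' or 'sit' sample in the loss.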

Note: because the 'pos_world' data was normalised with the scale of the puppetflow, I couldn't obtain that data with just OpenPose, or at least I don't know how to. So I used the 'pos_img' data instead and normalised it by its mean (using your norm_scale() function).

Do you have any suggestions to improve it?

Br.

from dd-net.

fandulu commented on July 20, 2024

I really appreciate that many of you are helping me improve my code! I am not an expert in action recognition and am still learning, but I would like to share what I know. Although this work uses skeletons, I find that RGB usually helps more in obtaining better action recognition performance, because skeleton estimation easily introduces noise and we also lose the context information, which is very useful for identifying actions. For some real applications, you can even use a simple but decent method: https://www.pyimagesearch.com/2019/07/15/video-classification-with-keras-and-deep-learning/


fandulu commented on July 20, 2024

Hi, you could use a temporal window (e.g., window_size = 3 s, step = 0.5 s) to convert the online stream into offline clips. The prediction results of several clips (e.g., 3 clips) can then be averaged to obtain reasonable accuracy on the online stream.
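The windowing suggested here can be sketched as a small generator. This is only an illustration under stated assumptions: `sliding_clips` is a hypothetical helper, and the window and step are expressed in frames rather than seconds (at an assumed 30 fps, 3 s would be 90 frames and 0.5 s would be 15 frames).

```python
from collections import deque


def sliding_clips(frames, window_size, step):
    """Yield overlapping clips (lists of frames) from an iterable of frames.

    The first clip is emitted as soon as `window_size` frames are buffered;
    after that, a new clip is emitted every `step` incoming frames.
    """
    buffer = deque(maxlen=window_size)  # oldest frames drop out automatically
    new_frames = 0                      # frames received since the last clip
    emitted = False
    for frame in frames:
        buffer.append(frame)
        new_frames += 1
        if len(buffer) < window_size:
            continue
        if not emitted or new_frames >= step:
            emitted = True
            new_frames = 0
            yield list(buffer)


# Integers stand in for per-frame pose data.
clips = list(sliding_clips(range(10), window_size=4, step=2))
# clips == [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

Each yielded clip can then be treated like an offline sample and passed to the classifier.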


anhminh3105 commented on July 20, 2024

Thanks for your reply.

Apologies, I don't fully understand your idea. Could you explain a bit more about making a temporal window to convert the input stream into offline clips? Is it similar to using a 3D convolutional layer?

I am thinking of feeding DD-Net a pose keypoint volume of shape (num_people_poses, 32, 15, 2), collected over 32 frames of the input stream. The action labels would then be assigned to the people's poses and visualised. Do you think that would also work?
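One way to assemble such a volume is to keep a rolling buffer per tracked person. The sketch below is an assumption-laden illustration, not part of DD-Net: `PoseClipBuffer` is a hypothetical class, it assumes that some tracker already supplies stable person ids for the detected poses, and it uses nested lists to stay dependency-free (in practice the batch would be converted to an array before inference).

```python
from collections import defaultdict, deque

CLIP_LEN = 32    # frames per clip, as proposed above
NUM_JOINTS = 15  # keypoints per pose
COORDS = 2       # (x, y) image coordinates


class PoseClipBuffer:
    """Keeps the most recent `clip_len` poses for each tracked person."""

    def __init__(self, clip_len=CLIP_LEN):
        self.clip_len = clip_len
        self.history = defaultdict(lambda: deque(maxlen=clip_len))

    def add_frame(self, poses_by_id):
        """poses_by_id maps person_id -> pose (NUM_JOINTS x COORDS nested list)."""
        for person_id, pose in poses_by_id.items():
            self.history[person_id].append(pose)

    def ready_batch(self):
        """Return (ids, batch) for every person with a full clip.

        batch has shape (num_people, clip_len, NUM_JOINTS, COORDS).
        """
        ids = [pid for pid, h in self.history.items() if len(h) == self.clip_len]
        batch = [list(self.history[pid]) for pid in ids]
        return ids, batch
```

The returned `ids` list lets each predicted label be mapped back to the person it belongs to for visualisation. People who have been visible for fewer than 32 frames are simply left out of the batch.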

Br.


fandulu commented on July 20, 2024

(1) A simple way to use the temporal window: suppose you want the pose class at time T. To utilize temporal information, you can use the pose information from a few moments earlier, with a temporal window W: you collect pose information from time T-W to T, which can be fed into this model. How far back should the old information reach? If the window is too long, the action may already have changed; if it is too short, there is not enough temporal information to use. That is something you need to balance. Once you have a window, how frequently do you want to classify actions? You use a step L, so your next window runs from T-W+L to T+L. If you assume the action class stays similar within N steps, you can average the predicted action class scores over those N windows (a span of about N*L).
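The averaging over N windows described in (1) can be sketched as follows. `ScoreSmoother` is a hypothetical helper, and the score vectors are invented for illustration; the model that produces them is not shown.

```python
from collections import deque


class ScoreSmoother:
    """Averages the class-score vectors of the last `n_windows` windows."""

    def __init__(self, n_windows):
        self.recent = deque(maxlen=n_windows)

    def update(self, scores):
        """Add one window's scores; return (best_class_index, mean_scores)."""
        self.recent.append(list(scores))
        # Average each class score across the windows currently buffered.
        mean_scores = [sum(col) / len(self.recent) for col in zip(*self.recent)]
        best = max(range(len(mean_scores)), key=mean_scores.__getitem__)
        return best, mean_scores


smoother = ScoreSmoother(n_windows=3)
first, _ = smoother.update([0.7, 0.3])   # window 1 favours class 0
second, _ = smoother.update([0.4, 0.6])  # window 2 alone would flip to class 1
```

In this made-up example the second window on its own would predict class 1, but the averaged scores still favour class 0, which is the kind of flicker suppression the averaging is meant to provide. A larger N smooths more but reacts more slowly when the action really changes.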

(2) For the multi-person case, you can take statistics (e.g., mean, max, min) of the features across the persons and then use another network to fuse them.
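The statistics in (2) can be illustrated with a small pooling function. This is a sketch under assumptions: `pool_person_features` is a hypothetical helper, the per-person feature vectors are assumed to come from some intermediate layer of the model, and the fusing network that would consume the pooled vector is not shown.

```python
def pool_person_features(features):
    """Pool per-person feature vectors into one fixed-size group descriptor.

    features: non-empty list of equal-length feature vectors, one per person.
    Returns the concatenation [mean..., max..., min...] over the persons,
    so the output length is 3x the feature length regardless of group size.
    """
    if not features:
        raise ValueError("need at least one person's features")
    columns = list(zip(*features))  # one tuple per feature dimension
    mean = [sum(col) / len(col) for col in columns]
    maximum = [max(col) for col in columns]
    minimum = [min(col) for col in columns]
    return mean + maximum + minimum


# Two people, two feature dimensions (values are illustrative only).
pooled = pool_person_features([[1.0, 4.0], [3.0, 2.0]])
# pooled == [2.0, 3.0, 3.0, 4.0, 1.0, 2.0]
```

Because the pooled vector has the same length for any number of people, a single downstream network can handle groups of varying size.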


anhminh3105 commented on July 20, 2024

Thank you for your helpful suggestions and detailed explanation; I really appreciate it.

I suppose that (2) is for actions that involve multiple people, am I correct? For predicting multiple people who each perform their own action, I suppose I would just need to average the predicted action scores for each person, and the fusing network would not be needed, right?

Br.


fandulu commented on July 20, 2024

Sorry for misunderstanding your point. When I saw OpenPose, I thought you were doing group activity recognition, but you could use it for individuals with pose tracking. For multi-person actions, you would not average the final action predictions, but you could average the middle-layer features. Anyhow, that seems to be unrelated to your case.


anhminh3105 commented on July 20, 2024

Many thanks for sharing. I'm going to look into it.

Br.

