
fandulu / dd-net


A lightweight network for body/hand action recognition

License: MIT License

Languages: Jupyter Notebook 99.30%, Python 0.70%
Topics: skeleton-application, action-recognition, jhmdb, shrec, hand-gesture-recognition, simple-tutorial

dd-net's Introduction

👋 Hi, I am Fan Yang.

  • 🔭 I’m currently working on multiple object tracking (MOT) and action recognition.
  • ✍️ I am writing code and papers to introduce some of my ideas on these topics.

dd-net's People

Contributors

fandulu, yltsai0609


dd-net's Issues

Demo from Camera Input using OpenPose

Hi Fan Yang.

Would it be possible to do a live demo of the JHMDB model using pose output from OpenPose on a webcam input stream?

I find your project very fascinating and have been playing around with it recently. I would like to combine your action recognition model with the OpenPose pose estimation model and make a live demo program.

Since OpenPose uses a slightly different output schema from that of the JHMDB dataset, I may need to modify the preprocessing procedure in your code and re-train the model to make the two models compatible with each other.

However, I don't quite understand what you did when processing the data before feeding it to the model (specifically the zoom() function in data_generator()) and the normalisation done by the dataset (specifically the 'pos_world' data). Could you elaborate on it a bit? Also, I suspect that the normalisation technique used by JHMDB (normalising w.r.t. the frame size and the puppet scale) might be too specialised and cannot be applied to OpenPose's pose output. I hope I'm wrong here, and if so, could you shed some light on how to do that as well?

Thanks in advance.
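
For context, here is a minimal sketch of what fixed-length frame resampling with scipy's zoom typically looks like. This is an assumption about what the zoom() call inside data_generator() does (stretching or compressing a variable-length pose sequence to a fixed number of frames), not the repository's exact code:

import numpy as np
from scipy.ndimage import zoom

def resample_to_frames(pose_seq, target_l=32):
    # Resample a (T, J, D) pose sequence to target_l frames along the time axis.
    pose_seq = np.asarray(pose_seq, dtype=np.float32)
    factor = target_l / pose_seq.shape[0]
    return zoom(pose_seq, (factor, 1, 1), order=1)  # interpolate only along time

# Example: a 47-frame clip of 15 two-dimensional joints becomes a 32-frame clip.
clip = np.random.rand(47, 15, 2)
print(resample_to_frames(clip).shape)  # (32, 15, 2)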

Scaling and selection of features?

Hi, thanks for sharing your great repo. I'd like to apply this to a custom dataset with keypoints from a different domain, using a model pre-trained on the JHMDB dataset, since I have 2D skeleton information.

  • How can I scale all my keypoints similarly to how the JHMDB input was processed, or can I pass raw keypoints through the zoom function?
  • Does the order of the keypoints matter?
  • In the snippet below, what are feat_d (why 105?) and filters? Do I need to modify them?
  • What does the zoom function do?
class Config():
    def __init__(self):
        self.frame_l = 32 # the length of frames
        self.joint_n = 15 # the number of joints
        self.joint_d = 2 # the dimension of joints
        self.clc_num = 21 # the number of class
        self.feat_d = 105
        self.filters = 64
C = Config()
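
For reference, 105 is the number of unordered joint pairs for 15 joints (15 × 14 / 2 = 105), which presumably corresponds to the pairwise joint-distance (JCD) feature dimension described in the paper. A small illustration of how such a feature could be computed (a sketch under that assumption, not necessarily the repository's exact function):

import numpy as np
from scipy.spatial.distance import pdist

def joint_distances(pose):
    # pose: (J, D) joint coordinates -> condensed vector of J*(J-1)/2 pairwise distances
    return pdist(pose)

pose = np.random.rand(15, 2)         # 15 joints in 2D
print(joint_distances(pose).shape)   # (105,), matching feat_d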

SHREC dataset link

Hi mate, I wonder if anyone can share another link to the SHREC dataset. The Google Drive link attached in the article seems to be too old and can no longer be used, thanks.

method to average results across J-HMDB splits

Hi
I wish to understand your method to average the results obtained from 3 J-HMDB splits in detail.
One way that i think this average could be taken is:

  1. Run the model for 600 epochs (say), independently 10 times (say) on each split, and record the best accuracy achieved in any epoch for every run. This gives us 10 best accuracies, one for each run.
  2. Average the 10 best test accuracies recorded in step 1 for each split.
  3. Now we have 3 values; each value is the average of 10 runs on split 1, 2, and 3, respectively.
  4. Average these 3 values to get the final average across the 3 splits.

Is this method correct? If not, could you please elaborate your method?

Thank you

Application of DD-Net on OpenPose/AlphaPose/... results

Hey,

thanks for the amazing paper.
Since only 2D skeletons may be available in many real-world applications (an issue that a number of state-of-the-art models are dealing with), I wonder if it would be possible to train your model on the results of OpenPose/AlphaPose applied to RGB videos.

I read through the JHMDB documentation and went through the dataset you provided (thanks again).
It seems that there are 433 individuals, each represented by an (X, 15, 2) array, X being some number like 35, 38, 40, ...
Could you elaborate a little what that array contains?
It would help me to rearrange the outcome of OpenPose/AlphaPose accordingly to apply DD-Net.

My guess is that X is the number of frames and the 15 x 2 array gives the x,y-coordinates of the skeleton-keypoints.
Also I learned that before applying DD-Net, I need to normalize the input as:
pos_world(1,:,:) = (pos_img(1,:,:)/W-0.5)*W/H./scale;
pos_world(2,:,:) = (pos_img(2,:,:)/H-0.5)./scale;
where W and H are the width and height of the frame, and scale is given by the spine length. Is that correct?

What exactly is meant by pos_world(1,:,:), pos_img(1,:,:), pos_world(2,:,:), and pos_img(2,:,:) ?
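
For what it's worth, a direct Python reading of the MATLAB-style formula above, assuming pos_img(1,:,:) and pos_img(2,:,:) are the x- and y-coordinate planes of a (2, J, T) array (that layout is an assumption based on the MATLAB indexing, not something confirmed by the authors):

import numpy as np

def to_pos_world(pos_img, W, H, scale):
    # pos_img: (2, J, T) pixel coordinates, row 0 = x, row 1 = y (assumed layout)
    # W, H: frame width/height; scale: puppet scale, scalar or shape (T,)
    pos_world = np.empty_like(pos_img, dtype=np.float64)
    pos_world[0] = (pos_img[0] / W - 0.5) * W / H / scale
    pos_world[1] = (pos_img[1] / H - 0.5) / scale
    return pos_world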

Thanks a lot,
JFM

Running predict gives the same error...


NameError Traceback (most recent call last)
in ()
8 print(X_1[0].shape)
9 model = build_DD_Net(C)
---> 10 model = load_model("jhmdb_split1_wn.h5")
11 layer_name = 'dense_5'
12 # data = [X_0,X_1]

~/anaconda3/lib/python3.6/site-packages/keras/models.py in load_model(filepath, custom_objects, compile)
268 raise ValueError('No model found in config file.')
269 model_config = json.loads(model_config.decode('utf-8'))
--> 270 model = model_from_config(model_config, custom_objects=custom_objects)
271
272 # set weights

~/anaconda3/lib/python3.6/site-packages/keras/models.py in model_from_config(config, custom_objects)
345 'Maybe you meant to use '
346 'Sequential.from_config(config)?')
--> 347 return layer_module.deserialize(config, custom_objects=custom_objects)
348
349

~/anaconda3/lib/python3.6/site-packages/keras/layers/__init__.py in deserialize(config, custom_objects)
53 module_objects=globs,
54 custom_objects=custom_objects,
---> 55 printable_module_name='layer')

~/anaconda3/lib/python3.6/site-packages/keras/utils/generic_utils.py in deserialize_keras_object(identifier, module_objects, custom_objects, printable_module_name)
142 return cls.from_config(config['config'],
143 custom_objects=dict(list(_GLOBAL_CUSTOM_OBJECTS.items()) +
--> 144 list(custom_objects.items())))
145 with CustomObjectScope(custom_objects):
146 return cls.from_config(config['config'])

~/anaconda3/lib/python3.6/site-packages/keras/engine/topology.py in from_config(cls, config, custom_objects)
2523 # First, we create all layers and enqueue nodes to be processed
2524 for layer_data in config['layers']:
-> 2525 process_layer(layer_data)
2526 # Then we process nodes in order of layer depth.
2527 # Nodes that cannot yet be processed (if the inbound node

~/anaconda3/lib/python3.6/site-packages/keras/engine/topology.py in process_layer(layer_data)
2509
2510 layer = deserialize_layer(layer_data,
-> 2511 custom_objects=custom_objects)
2512 created_layers[layer_name] = layer
2513

~/anaconda3/lib/python3.6/site-packages/keras/layers/__init__.py in deserialize(config, custom_objects)
53 module_objects=globs,
54 custom_objects=custom_objects,
---> 55 printable_module_name='layer')

~/anaconda3/lib/python3.6/site-packages/keras/utils/generic_utils.py in deserialize_keras_object(identifier, module_objects, custom_objects, printable_module_name)
142 return cls.from_config(config['config'],
143 custom_objects=dict(list(_GLOBAL_CUSTOM_OBJECTS.items()) +
--> 144 list(custom_objects.items())))
145 with CustomObjectScope(custom_objects):
146 return cls.from_config(config['config'])

~/anaconda3/lib/python3.6/site-packages/keras/engine/topology.py in from_config(cls, config, custom_objects)
2533 if layer in unprocessed_nodes:
2534 for node_data in unprocessed_nodes.pop(layer):
-> 2535 process_node(layer, node_data)
2536
2537 name = config.get('name')

~/anaconda3/lib/python3.6/site-packages/keras/engine/topology.py in process_node(layer, node_data)
2490 if input_tensors:
2491 if len(input_tensors) == 1:
-> 2492 layer(input_tensors[0], **kwargs)
2493 else:
2494 layer(input_tensors, **kwargs)

~/anaconda3/lib/python3.6/site-packages/keras/engine/topology.py in __call__(self, inputs, **kwargs)
617
618 # Actually call the layer, collecting output(s), mask(s), and shape(s).
--> 619 output = self.call(inputs, **kwargs)
620 output_mask = self.compute_mask(inputs, previous_mask)
621

~/anaconda3/lib/python3.6/site-packages/keras/layers/core.py in call(self, inputs, mask)
683 if has_arg(self.function, 'mask'):
684 arguments['mask'] = mask
--> 685 return self.function(inputs, **arguments)
686
687 def compute_mask(self, inputs, mask=None):

~/anaconda3/lib/python3.6/site-packages/keras/layers/core.py in <lambda>(x)
6
7 def pose_motion(P,frame_l):
----> 8 P_diff_slow = Lambda(lambda x: poses_diff(x))(P)
9 P_diff_slow = Reshape((frame_l,-1))(P_diff_slow)
10 P_fast = Lambda(lambda x: x[:,::2,...])(P)

NameError: name 'poses_diff' is not defined
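
For context, this NameError usually means the Lambda layer's function (poses_diff) is not in scope when the saved model is deserialized. A common workaround, offered here as an assumption rather than the authors' confirmed fix, is either to pass the missing function via custom_objects or to rebuild the architecture and load only the weights:

from keras.models import load_model

# Option 1: make the Lambda's helper function visible during deserialization
# (assumes poses_diff is defined/imported in the current session).
model = load_model("jhmdb_split1_wn.h5", custom_objects={'poses_diff': poses_diff})

# Option 2: rebuild the architecture in code and load the weights only.
model = build_DD_Net(C)
model.load_weights("jhmdb_split1_wn.h5")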

confusion matrix doesn't match

Hello DD-Net team,
I was really excited to discover this great research paper on GitHub.

I tried to run your code and was expecting the same confusion matrix when running your Jupyter notebooks with the provided pickle files. The accuracy of more than 95% is really something.

Unfortunately, I don't achieve any results near 95%. The best result I got was 60% by running your SHREC 1D lite/heavy Jupyter notebooks.

Could you tell me what I need to do or to change to get similar results as in the research paper ?

Best regards

Tflite model

Is there a possibility of a TFLite model in the Colab?

Apply this model?

Thanks for sharing this nice model.

I'm pretty new to Machine Learning and I'm trying to classify several desired hand gestures in real-time. I'm wondering how to implement this model to do that.

An error of poses_diff() function

def poses_diff(x):
    H, W = x.get_shape()[1], x.get_shape()[2]
    x = tf.subtract(x[:, :1, ...], x[:, :-1, ...])
    x = tf.image.resize_nearest_neighbor(x, size=[H.value, W.value], align_corners=False)  # should not alignment here
    return x

The subtract operation subtracts x[:, :1, ...] and x[:, :-1, ...], but the first input might be wrong; it should be x[:, 1:, ...]. The current operation in fact subtracts the first position from all frames via array broadcasting.

I have tested training the lite network on the SHREC coarse data; the validation accuracy is almost the same, while the training accuracy is lower than the validation accuracy, which might be due to too much dropout. There might still be some room for improvement.

If I'm wrong, please let me know.
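
For reference, a corrected version along the lines the issue suggests might look like this (a sketch of the proposed fix, written for the TF1-era API used above, not the authors' confirmed patch):

import tensorflow as tf

def poses_diff(x):
    H, W = x.get_shape()[1], x.get_shape()[2]
    # difference between consecutive frames instead of broadcasting frame 0
    x = tf.subtract(x[:, 1:, ...], x[:, :-1, ...])
    # resize back to the original temporal length so the output shape is unchanged
    x = tf.image.resize_nearest_neighbor(x, size=[H.value, W.value], align_corners=False)
    return x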

Can't find GT_splits

Dear, thanks for making this code available.
I couldn't access those files; it seems the website changed the structure of the files.
Could you please help with it?

GT_split_lists = glob.glob(C.data_dir + 'GT_splits/*.txt')
GT_pose_list = glob.glob(C.data_dir + 'GT_joint_positions/*/*')

pretrained model

Would you mind sharing your pretrained model or a demo for further exploration?

custom skeleton dataset and puppet scale

Hi, thanks for your nice work!
Currently, I'm working on skeleton-based action recognition from the video clip.
The dataset used in your paper, JHMDB, is a well-annotated dataset with a puppet scaling factor for all skeletons.
The explanation on the JHMDB website is:

1. See a sample annotation of the 15 positions at http://jhmdb.is.tue.mpg.de/puppet_tool
2. Due to the nature of the puppet annotation tool, all 15 joint positions are available even if they are not annotated when they are occluded or outside the frame.
In this case, the joints are in the neutral puppet positions.
3. The right and left correspond to the right and left side of the annotated person. For example, a person facing the camera has his right side on the left side of the image, and a person back-facing the camera has his right side on the right side of the image.

(4) pos_world is the normalization of pos_img with respect to the frame size and puppet scale; the formula is as below:

pos_world(1,:,:) = (pos_img(1,:,:)/W-0.5)*W/H./scale;
pos_world(2,:,:) = (pos_img(2,:,:)/H-0.5)./scale;

W and H are the width and height of the frame, respectively.

But in a custom dataset extracted from a video clip, there is no puppet scaling factor.
In my experiments, removing puppet scaling on JHMDB dropped the accuracy from 77% to 62%, which is huge.

So, is there any suggestion (tuning, keyword, formula, idea) for simulating the puppet scaling factor?

Thanks!
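
One possible heuristic, offered purely as an editorial assumption and not as the authors' method, is to approximate the per-frame scale with a body-proportion measure such as the neck-to-hip (spine) distance of the detected skeleton:

import numpy as np

# Hypothetical helper: NECK_IDX and HIP_IDX depend on your keypoint layout.
NECK_IDX, HIP_IDX = 1, 8

def approx_scale(pos_img):
    # pos_img: (T, J, 2) pixel coordinates -> per-frame spine length, shape (T,)
    spine = np.linalg.norm(pos_img[:, NECK_IDX] - pos_img[:, HIP_IDX], axis=-1)
    return np.clip(spine, 1e-6, None)  # avoid division by zero in the normalization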

88% accuracy on JHMDB

Hi, as you mentioned in README.md
We got an improvement from 77% to 88% by some bug fixing.
But when I look at JHMDB/jhmdb_1D_heavy.ipynb, it is still 77%.
Do I miss anything?

How to calculate puppet scale in custom dataset

Thank you for the great project, but I have a question. JHMDB has a "scale" key in the dataset, so how can I get it for frames extracted from a video? And I see that "pos_world" is data normalized from "pos_img"; how can I normalize it?

Unable to create conda env

Nice work, authors. But I'm not able to create the env following the instructions.

$ conda env create -n ddnet -f DD-Net_env.yml
Collecting package metadata (repodata.json): done
Solving environment: failed

ResolvePackageNotFound:                                                       
  - libuuid==1.0.3=1                                                        
  - jpeg==9c=h470a237_0                                                     
  - nbconvert==5.3.1=py_1                                                   
  - qt==5.6.2=h50c60fd_8  

I'm running on Ubuntu 18.04 with conda 4.7.12. Any chance you ran on a different platform?

Can't get the accuracy in the paper

Hi @fandulu, I ran the code you provided on the SHREC dataset, but I can't get the accuracy described in the paper. In particular, I got 90.1% on 14 gestures and 75% on 28 gestures using DD-Net (filters=16), which is lower than the accuracy in the paper. What could be the problem?

training new gestures with unfixed length of input frames

Hello DD - Net team,

I am using your network to detect hand gestures for a really promising project. The gestures from the SHREC dataset are too complicated for my application, therefore I will be using my own gestures.

Right now I am recording my own gestures, but I have encountered an issue regarding the number of hand samples that I feed into the network.
Some gestures have fewer than 32 frames, others more than 32.
32 is the frame length that I feed into the network.

Could you please tell me how I can feed gestures with fewer than 32 frames into the network?
How did you train your model on the SHREC dataset when some of the gestures had over 100 samples (rows)?

Do you do re-sampling of the input data?

Best regards,

chillcloud-dev
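
For context, a minimal sketch of temporal re-sampling that handles both shorter and longer gestures by linearly interpolating each joint coordinate onto a fixed 32-frame grid (an illustration of one possible approach, not the repository's confirmed preprocessing):

import numpy as np

def resample_gesture(seq, target_l=32):
    # seq: (T, J, D) hand-skeleton sequence with any T -> (target_l, J, D)
    seq = np.asarray(seq, dtype=np.float32)
    T = seq.shape[0]
    src = np.linspace(0.0, 1.0, T)
    dst = np.linspace(0.0, 1.0, target_l)
    flat = seq.reshape(T, -1)
    out = np.stack([np.interp(dst, src, flat[:, k]) for k in range(flat.shape[1])], axis=1)
    return out.reshape(target_l, *seq.shape[1:])

short = np.random.rand(17, 22, 3)   # a 17-frame gesture with 22 hand joints
long_ = np.random.rand(110, 22, 3)  # a 110-frame gesture
print(resample_gesture(short).shape, resample_gesture(long_).shape)  # (32, 22, 3) (32, 22, 3)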

accuracy changed in paper?

Hi

I saw that previously the best accuracy reported in your paper was 78% on J-HMDB, but now it is 77.2%. Can you explain why this was changed?

Thank you

JHMDB dataset size

Hello, I noticed that in your JHMDB notebook you only have 433+176 samples. However, JHMDB has 928 samples in total. I wonder why the size is smaller?

Apply model to static gesture

Hi, thank you for your nice work. I am working on hand gesture recognition, specifically static gestures. However, DD-Net only accepts multiple frames as input, so how should I modify the model for static hand gesture recognition? As far as I can tell, I could capture multiple frames of the same hand gesture for training; then, when applying the model, I would capture a series of frames matching DD-Net's frame count and take the final result as the label of the static hand gesture.
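
As a rough illustration of the idea described above (an editorial sketch, not an endorsed approach), a single static pose could simply be repeated to fill DD-Net's fixed-length input:

import numpy as np

pose = np.random.rand(22, 2)      # one static hand pose (22 joints, 2D); placeholder data
clip = np.tile(pose, (32, 1, 1))  # repeat it into a (32, 22, 2) pseudo-sequence
print(clip.shape)                 # (32, 22, 2)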

Getting keypoints from video

Thank you for the great work and for sharing the code!
Do you happen to know a framework that can be used to extract keypoints in the format used in the code?

SHREC 2017 Dataset

Hello, the official website of the SHREC17 dataset, "http://www-rech.telecom-lille.fr/shrec2017-hand/", is no longer accessible.

Could you provide the following files from the SHREC17 dataset? It would be a great help to my research. Specifically, the entire set of folders from "gesture_1" to "gesture_14".

The overall folder contains the following files:
+---gesture_1
|   +---finger_1
|   |   +---subject_1
|   |   |   +---essai_1
|   |   |   |       depth_0.png
|   |   |   |       depth_1.png
|   |   |   |       ...
|   |   |   |       depth_N-1.png
|   |   |   |       general_informations.txt
|   |   |   |       skeletons_image.txt
|   |   |   |       skeletons_world.txt
|   |   |   +---essai_2
|   |   |   ...
|   |   |   \---essai_5
|   |   +---subject_2
|   |   ...
|   |   \---subject_20
|   +---finger_2
|   ...
\---gesture_14

Maybe an error for GT_split_lists

I noticed in jhmdb_data_preprocessing.ipynb this snippet

GT_lists_1 = []
GT_lists_2 = []
GT_lists_3 = []
for file in GT_split_lists:
    if file.split('/')[-1].split('.')[0].split('_')[2] == 'split1':
        GT_lists_1.append(file) 
    elif file.split('/')[-1].split('.')[0].split('_')[2] == 'split2':
        GT_lists_2.append(file)
    elif file.split('/')[-1].split('.')[0].split('_')[2] == 'split3':
        GT_lists_3.append(file)

fetches only one-word labels (e.g. catch, stand, push, etc.) and skips two-word labels (e.g. climb_stairs, kick_ball, shoot_gun, etc.). Therefore, the pickled dataset contains just 14 of the 21 actions of the JHMDB dataset. If this is intentional, you can close this issue; otherwise, I made the following changes to preprocess the data for all 21 actions.

GT_lists_1 = []
GT_lists_2 = []
GT_lists_3 = []
for file in GT_split_lists:
    if file.split('/')[-1].split('.')[0].split('_')[-1] == 'split1':
        GT_lists_1.append(file) 
    elif file.split('/')[-1].split('.')[0].split('_')[-1] == 'split2':
        GT_lists_2.append(file)
    elif file.split('/')[-1].split('.')[0].split('_')[-1] == 'split3':
        GT_lists_3.append(file)

jhmdb acc?

How do you train so that the network accuracy is above 77%? The network I train reaches only about 70%. The dataset has 21 categories, not 14.
I hope you can explain the training method.

Frame Rate - Real Time Prediction

Hi @fandulu,

I would like to ask you about my problem.

I've got a camera that only produces 5 FPS video files. In comparison, the JHMDB dataset uses clips of about 32 frames. Based on your wide range of knowledge, is this a serious problem when applying your model to my data?

I've tried to use OpenPose to get the joint information and then preprocess it as input to your model. However, I got really bad performance, at around 50% accuracy.

Furthermore, I would like to turn your model into a real-time prediction with window = 8. However, I don't know what values are suitable for setting it up. Do you have any advice?

Thank you so much for reading my silly questions. I hope that I can receive valuable suggestions from you.
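
For context, a minimal sketch of sliding-window real-time inference with a new prediction every 8 frames (the buffer handling and the make_features helper are assumptions for illustration; DD-Net variants also take a joint-distance feature input, which that helper would have to produce):

from collections import deque
import numpy as np
from scipy.ndimage import zoom

WINDOW = 8                       # new frames per prediction step (assumed setting)
FRAME_L = 32                     # DD-Net's fixed temporal length
buffer = deque(maxlen=FRAME_L)   # rolling buffer of the most recent poses

def resample(seq, target_l=FRAME_L):
    # Stretch/compress a (T, J, 2) pose sequence to target_l frames.
    seq = np.asarray(seq, dtype=np.float32)
    return zoom(seq, (target_l / seq.shape[0], 1, 1), order=1)

def step(new_frames, model, make_features):
    # new_frames: list of (J, 2) poses from the pose estimator for the latest WINDOW frames.
    # make_features: hypothetical helper that builds the model inputs (e.g. JCD + poses).
    buffer.extend(new_frames)
    if len(buffer) < WINDOW:
        return None                  # not enough data yet
    clip = resample(list(buffer))    # (FRAME_L, J, 2)
    return model.predict(make_features(clip))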
