youngwoo-yoon / co-speech_gesture_generation

Home Page: https://sites.google.com/view/youngwoo-yoon/projects/co-speech-gesture-generation

Co-Speech Gesture Generator

This is an implementation of Robots learn social skills: End-to-end learning of co-speech gesture generation for humanoid robots (Paper, Project Page)

The original paper used the TED dataset, but in this repository the code is modified to use the Talking With Hands 16.2M dataset for the GENEA Challenge 2022. The model is also changed to estimate rotation matrices for upper-body joints instead of Cartesian joint coordinates.

Environment

The code was developed with Python 3.8 on Ubuntu 18.04, using PyTorch 1.5.0.

Prepare

  1. Install dependencies

    pip install -r requirements.txt
    
  2. Download the FastText vectors from here and put crawl-300d-2M-subword.bin into the resource folder (resource/crawl-300d-2M-subword.bin). A quick sanity check of the download is sketched below.
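To sanity-check the downloaded vectors, a minimal sketch follows; it assumes the fasttext Python package is installed, which is not necessarily listed in requirements.txt.

    # Quick check of the FastText binary (assumes `pip install fasttext`).
    import fasttext

    model = fasttext.load_model('resource/crawl-300d-2M-subword.bin')
    vec = model.get_word_vector('hello')
    print(vec.shape)  # expect (300,) for crawl-300d-2M-subword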

Train

  1. Make LMDB

    cd scripts
    python twh_dataset_to_lmdb.py [PATH_TO_DATASET]
    
  2. Update paths and parameters in config/seq2seq.yml and run train.py

    python train.py --config=../config/seq2seq.yml
    

Inference

  1. Do training or use the pretrained model (output/train_seq2seq/baseline_icra19_checkpoint_100.bin). When you use the pretrained model, please put the vocab_cache.pkl file into the LMDB train path.

  2. Run inference. It outputs a BVH motion file from speech text (a TSV transcript file); a sketch of loading such a transcript is shown after the command below.

    python inference.py [PATH_TO_MODEL_CHECKPOINT] [PATH_TO_TSV_FILE]
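The exact transcript format is not documented here. As an illustrative sketch only, assuming a word-level TSV with start time, end time, and word per row (an assumption, so check your data), it could be loaded like this:

    # Hypothetical loader for a word-level transcript TSV.
    # Column order (start_time, end_time, word) is an assumption, not documented by this repo.
    import csv

    def load_transcript(tsv_path):
        words = []
        with open(tsv_path, newline='') as f:
            for row in csv.reader(f, delimiter='\t'):
                if len(row) >= 3:
                    words.append((float(row[0]), float(row[1]), row[2]))
        return words

    print(load_transcript('val_2022_v1_006.tsv')[:5])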
    

Sample result

Result video for val_2022_v1_006.tsv, rendered with the challenge visualization server.

val_2022_v1_006_generated.mp4

Remarks

  • I found this model was not successful when all the joints were considered, so I trained the model only with upper-body joints, excluding fingers, and used fixed values for the remaining joints (using JointSelector in PyMo). You can easily try a different set of joints (e.g., full body including fingers) by specifying joint names in the target_joints variable in twh_dataset_to_lmdb.py. Please update data_mean and data_std in the config file if you change target_joints. You can find the data mean and std values in the console output of the Make LMDB step above; a sketch for computing them yourself is shown below.
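If you want to recompute these statistics yourself, a minimal sketch follows. It assumes you have stacked the processed pose vectors into a NumPy array of shape (num_frames, num_dims); that array and the name all_poses are illustrative assumptions, not objects provided by this repository.

    # Recompute per-dimension data_mean / data_std for the config file.
    # `all_poses` is assumed to hold stacked pose vectors, shape (num_frames, num_dims).
    import numpy as np

    def pose_statistics(all_poses):
        mean = np.mean(all_poses, axis=0)
        std = np.std(all_poses, axis=0)
        std = np.clip(std, a_min=1e-5, a_max=None)  # guard against zero std before normalization
        return mean.tolist(), std.tolist()

    # Paste the two printed lists into data_mean and data_std in config/seq2seq.yml.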

License

Please see LICENSE.md

Citation

@INPROCEEDINGS{
  yoonICRA19,
  title={Robots Learn Social Skills: End-to-End Learning of Co-Speech Gesture Generation for Humanoid Robots},
  author={Yoon, Youngwoo and Ko, Woo-Ri and Jang, Minsu and Lee, Jaeyeon and Kim, Jaehong and Lee, Geehyuk},
  booktitle={Proc. of the International Conference on Robotics and Automation (ICRA)},
  year={2019}
}

co-speech_gesture_generation's People

Contributors

dependabot[bot], youngwoo-yoon

co-speech_gesture_generation's Issues

Why do the 15 joints of one sample have shape [30, 135]?

In the file scripts/data_loader/lmdb_data_loader.py, I ran line 69: word_seq, pose_seq, audio, aux_info = sample

I got pose_seq with shape [30, 135]. I know 30 is the number of frames in each example and 135 corresponds to the 15 selected joints; since 135/15 = 9, each joint has 9 values. What do those 9 values mean, respectively?

Thank you in advance.
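A plausible but unconfirmed interpretation: since the README states that the model estimates rotation matrices, the 9 values per joint may be a flattened 3x3 rotation matrix. A minimal sketch under that assumption:

    # Assumption: 9 values per joint = a flattened 3x3 rotation matrix (not confirmed by the repo).
    # pose_seq has shape (30, 135) = (frames, 15 joints * 9 values).
    import numpy as np

    pose_seq = np.zeros((30, 135))             # placeholder; use the tensor from the data loader
    rotations = pose_seq.reshape(30, 15, 3, 3)
    print(rotations.shape)                     # (30, 15, 3, 3)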

Missing file for pretrained model

The args.train_data_path[0] saved in checkpoint_and_model points to a missing file, vocab_cache.pkl.


  File "/snap/pycharm-community/281/plugins/python-ce/helpers/pydev/pydevd.py", line 1491, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script

  File "/snap/pycharm-community/281/plugins/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)

  File "/.../Co-Speech_Gesture_Generation/scripts/inference.py", line 174, in <module>
    main(args.ckpt_path, args.transcript_path)

  File "/.../Co-Speech_Gesture_Generation/scripts/inference.py", line 106, in main
    with open(vocab_cache_path, 'rb') as f:

FileNotFoundError: [Errno 2] No such file or directory: '/mnt/work3/GENEA22_dataset/v1/trn/lmdb/vocab_cache.pkl'
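As the Inference section of the README notes, the fix is to place vocab_cache.pkl in the LMDB train path. A minimal sketch; the source path below is hypothetical and should point at wherever the Make LMDB step wrote the file:

    # Copy the vocabulary cache produced during "Make LMDB" into the train LMDB folder.
    import shutil

    src = 'path/to/generated/vocab_cache.pkl'        # hypothetical location
    dst = '/mnt/work3/GENEA22_dataset/v1/trn/lmdb/'  # the train LMDB path from the error above
    shutil.copy(src, dst)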

pymo 'to position' for Trinity

Hello. For the Trinity dataset from the GENEA Challenge 2020, the _to_pos function in pymo's preprocessing.py does not seem to work: I can't get it to run successfully using ('param', MocapParameterizer('position')) in the pipeline.

How can this be solved? Thanks!

TRANSCRIPT demo

Hi guys, would you please provide a TRANSCRIPT demo? I want to run a test. Thanks.

dim of output

Hello,
I notice that the output of the model has dimension 216, which is 18x12 and can be converted to 18x6 when going from rotation matrices to Euler angles, where 18 is the number of joints. To my knowledge, the Euler-angle representation of each joint is a 1x3 vector, so I am confused about the meaning of your output. Can you please provide more information, e.g., the origin of each joint, and whether each rotation is relative to its parent joint or absolute?
Thanks in advance!

Question about trinity_data_to_lmdb

The dataset I received after applying for the Trinity Dataset seems to consist of .fbx and .wav files,

but trinity_data_to_lmdb.py appears to require .bvh motion files and .json subtitles.

  1. Should I convert the .fbx files to .bvh before using them?

  2. Could you tell me where I can get the .json files?

That is all.

Visualization BVH

Hello. After inference step, we can get the .bvh.

Is there any way or code to visualize it and get the same effect as in the paper?

I found this, but I do not have the username and password.

meet error "KeyError: 'LeftUpLeg'" when run the function "process_bvh()"?

this is the debug information.

Thanks in advance.


KeyError                                  Traceback (most recent call last)
in <module>
----> 1 out_data = process_bvh(base_path + 'Recording_001.bvh')

in process_bvh(gesture_filename)
     20     ])
     21
---> 22     out_data = data_pipe.fit_transform(data_all)
     23
     24

~\.conda\envs\tf2\lib\site-packages\sklearn\pipeline.py in fit_transform(self, X, y, **fit_params)
    383         """
    384         last_step = self._final_estimator
--> 385         Xt, fit_params = self._fit(X, y, **fit_params)
    386         with _print_elapsed_time('Pipeline',
    387                                  self._log_message(len(self.steps) - 1)):

~\.conda\envs\tf2\lib\site-packages\sklearn\pipeline.py in _fit(self, X, y, **fit_params)
    313             message_clsname='Pipeline',
    314             message=self._log_message(step_idx),
--> 315             **fit_params_steps[name])
    316         # Replace the transformer of the step with the fitted
    317         # transformer. This is necessary when loading the transformer

~\.conda\envs\tf2\lib\site-packages\joblib\memory.py in __call__(self, *args, **kwargs)
    353
    354     def __call__(self, *args, **kwargs):
--> 355         return self.func(*args, **kwargs)
    356
    357     def call_and_shelve(self, *args, **kwargs):

~\.conda\envs\tf2\lib\site-packages\sklearn\pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    726     with _print_elapsed_time(message_clsname, message):
    727         if hasattr(transformer, 'fit_transform'):
--> 728             res = transformer.fit_transform(X, y, **fit_params)
    729         else:
    730             res = transformer.fit(X, y, **fit_params).transform(X)

~\.conda\envs\tf2\lib\site-packages\sklearn\base.py in fit_transform(self, X, y, **fit_params)
    569         if y is None:
    570             # fit method of arity 1 (unsupervised transformation)
--> 571             return self.fit(X, **fit_params).transform(X)
    572         else:
    573             # fit method of arity 2 (supervised transformation)

~\.conda\envs\tf2\lib\site-packages\pymo-0.0.1-py3.7.egg\pymo\preprocessing.py in transform(self, X, y)
     36             return X
     37         elif self.param_type == 'position':
---> 38             return self._to_pos(X)
     39         elif self.param_type == 'expmap2pos':
     40             return self._expmap_to_pos(X)

~\.conda\envs\tf2\lib\site-packages\pymo-0.0.1-py3.7.egg\pymo\preprocessing.py in _to_pos(self, X)
    108
    109         for joint in track.traverse():
--> 110             parent = track.skeleton[joint]['parent']
    111             rot_order = track.skeleton[joint]['order']
    112

KeyError: 'LeftUpLeg'

TSV_FILE

Where can I get TSV files, or at least an example of a TSV file containing transcripts?

upper body pre-processing and visualization

Hello, thank you for the code.

I noticed that when using the data pipeline to select upper-body joints and then visualizing with the GENEA visualizer, the motions look unnatural. It seems all joint rotations are global. Could you shed some light on how to make the upper-body rotations relative so that the final animation looks more natural?

Thank you!
David

About PyMo

I compared the pipeline of processing BVH with and without ('root', RootTransformer('hip_centric')).

and computed the element-wise difference between the two results:

(1) 1st frame: out_data[0,0,:] - out_data_no_hip_centerlize[0,0,:]

[ -8.70151 -92.36 22.8466 -10.39751256 -91.92271148
17.93777897 -9.9069114 -92.01331065 18.83401856 -8.85100984
-92.27675895 21.76176833 -7.41557735 -92.6173533 25.48583606
-7.21810007 -93.1461758 33.03037571 -7.20744117 -93.44068702
37.31824573 -7.11192502 -93.68603708 40.81519583 -7.50652435
-93.56490843 39.39021581 -41.48320158 -90.6755312 26.71166777
-59.24560098 -88.82517754 15.12506976 -65.7639758 -89.55378643
31.41154973 -7.50652435 -93.56490843 39.39021581 27.106017
-94.99550065 30.23240403 38.51608676 -95.07362667 21.47251716
34.77629702 -96.09374976 39.60158339]

(2) 2nd frame: out_data[0,1,:] - out_data_no_hip_centerlize[0,1,:]

[ -8.7089 -92.3608 22.8468 -10.40049018 -91.9227756
17.93282527 -9.90991655 -92.01329658 18.82668927 -8.8534612
-92.27710578 21.75373938 -7.41920296 -92.61791248 25.4751766
-7.22715858 -93.14742825 33.01414725 -7.21539753 -93.44282783
37.30217912 -7.11850819 -93.68885993 40.79844885 -7.51008948
-93.56788589 39.37607352 -41.49073148 -90.68246661 26.70952861
-59.2436811 -88.83288051 15.11401733 -65.75475154 -89.56565497
31.39371763 -7.51008948 -93.56788589 39.37607352 27.10272821
-94.9907481 30.2164225 38.50972614 -95.06565673 21.4647733
34.77442387 -96.09076935 39.6032621 ]

They are similar but not identical, so I wonder what RootTransformer('hip_centric') actually does and how it works.

I looked at pymo's preprocessing.py; it seems to only set hip_x = 0, hip_y = 0, hip_z = 0, but the result is not what I expected. Why?

Thanks in advance.
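For context on what a hip-centric root transform does conceptually: it expresses each frame relative to the hip rather than the world origin. The sketch below shows only that general idea (subtracting the root position from every joint position per frame); it is not PyMo's actual RootTransformer implementation, which modifies the root channels of the parsed BVH before positions are computed.

    # General idea of a hip-centric transform (illustrative, not PyMo's implementation).
    import numpy as np

    def hip_centric(positions, hip_index=0):
        """positions: array of shape (frames, joints, 3)."""
        hip = positions[:, hip_index:hip_index + 1, :]  # (frames, 1, 3)
        return positions - hip                          # the hip becomes the origin in every frame

    frames = np.random.rand(2, 18, 3)
    centered = hip_centric(frames)
    print(np.allclose(centered[:, 0], 0.0))  # True: the hip sits at the origin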

No audio input

Dear author,
This project only supports text input. It seems that you did not use the audio data.

 in_text = in_text.to(device)
 in_audio = in_audio.to(device)
 target_vec = target_vec.to(device)
 loss = train_iter_seq2seq(args, epoch, in_text, text_lengths, target_vec, generator, gen_optimizer)

Loss is nan

Hello, thanks for the great work and code, but I seem to be having some problems.

While I was training, the printout didn't seem right:
2022-05-17 16:12:20,277: (loss terms) l1 nan, cont nan, var nan
2022-05-17 16:12:24,159: EP 42 ( 75) | 32m 55s, 1139 samples/s | loss: nan,
2022-05-17 16:12:33,106: (loss terms) l1 nan, cont nan, var nan
2022-05-17 16:12:34,003: EP 42 (150) | 33m 4s, 841 samples/s | loss: nan,
And I checked the three loss terms:
tensor(nan, device='cuda:1', grad_fn=<MulBackward0>) tensor(nan, device='cuda:1', grad_fn=<MulBackward0>) tensor(nan, device='cuda:1', grad_fn=<MulBackward0>) loss

And the output of the net:
tensor([[[-1.1089e-02,  2.0337e+00,  1.5299e+00,  ..., -7.2902e-01, -9.6424e-04,  8.3192e-02],
         [        nan,         nan,         nan,  ...,         nan,         nan,         nan],
         [        nan,         nan,         nan,  ...,         nan,         nan,         nan],
         ...
         [        nan,         nan,         nan,  ...,         nan,         nan,         nan],
         [        nan,         nan,         nan,  ...,         nan,         nan,         nan],
         [        nan,         nan,         nan,  ...,         nan,         nan,         nan]],

        [[ 3.3295e-02,  3.7979e-01,  4.9924e+00,  ..., -3.4652e+00, -1.3517e-03, -6.6605e-02],
         [        nan,         nan,         nan,  ...,         nan,         nan,         nan],
         ...

Is there something wrong with the code?
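A common way to narrow down where the NaNs first appear, using only standard PyTorch utilities (generic debugging, not something provided by this repository):

    # Generic NaN debugging, not specific to this codebase.
    import torch

    torch.autograd.set_detect_anomaly(True)  # reports the op that produced NaN/Inf during backward

    def check_finite(name, tensor):
        if not torch.isfinite(tensor).all():
            raise RuntimeError(name + ' contains NaN or Inf')

    # Call inside the training loop, before the forward/backward pass, e.g.:
    # check_finite('in_text', in_text)
    # check_finite('target_vec', target_vec)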
