mingyuan-zhang / MotionDiffuse
MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model
Home Page: https://mingyuan-zhang.github.io/projects/MotionDiffuse.html
License: Other
Hi Mingyuan,
Why do you zero out the parameters of the "self.out" projection module in transformers.py?
Thanks,
Jeremy
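For context, a minimal sketch of the zero-initialization pattern the question refers to, following the common diffusion-model practice (e.g. in guided-diffusion) of zeroing the final projection so each residual block initially acts as an identity; names here are illustrative, not the authors' code:

import torch.nn as nn

def zero_module(module: nn.Module) -> nn.Module:
    # zero all parameters so the module's initial output is 0
    for p in module.parameters():
        p.detach().zero_()
    return module

# e.g. self.out = zero_module(nn.Linear(latent_dim, latent_dim));
# the block then contributes nothing to the residual stream at init,
# which tends to stabilize early diffusion training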
Hi Mingyuan,
I am wondering what "xf_proj" is for in https://github.com/mingyuan-zhang/MotionDiffuse/blob/main/text2motion/models/transformer.py#L394.
Why do you select the word with the maximum embedding value?
Thanks,
Jeremy
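For context, this indexing exploits a CLIP tokenizer convention rather than embedding magnitudes: the end-of-text (EOT) token has the largest id in CLIP's vocabulary, so text.argmax(dim=-1) locates the EOT position in each sequence. A sketch, assuming xf_out is (T, B, D) at this point as in the quoted code:

eot_pos = text.argmax(dim=-1)                            # (B,) index of the EOT token
pooled = xf_out[eot_pos, torch.arange(xf_out.shape[1])]  # (B, D) sentence-level feature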
Hi,
Thanks for sharing this excellent work. My questions are mainly about part-aware text control.
First, how does the 'noise interpolation' work? Is it only conducted during the sampling procedure?
Second, will you release the code for the part-aware motion generation and the time-varied motion generation?
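My reading of the paper's part-aware control, sketched under the assumption that it is applied only at sampling time: each body-part prompt yields its own noise prediction, and the predictions are blended with a binary part mask before the reverse-diffusion step. All names below are illustrative, not the authors' code:

# x_t: (B, T, D) noisy motion; part_mask: (1, 1, D), 1 on upper-body dims
eps_upper = model(x_t, t, text_upper)  # noise prediction for the upper-body prompt
eps_lower = model(x_t, t, text_lower)  # noise prediction for the lower-body prompt
eps = part_mask * eps_upper + (1 - part_mask) * eps_lower
x_prev = ddpm_step(x_t, eps, t)        # standard reverse update (assumed helper)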
Normally in the inference process we only provide the text to guide the generation, and the generated motion can contain zero padding, since we add padding during training. My question is: how can we remove the predicted padding from the generated motion?
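A minimal sketch of one common workaround, assuming the desired length is known at sampling time (as when a per-sample length is passed to generation):

# motions: (B, T_max, D) generated batch; m_lens: (B,) requested lengths (assumed known)
trimmed = [motions[i, :m_lens[i]] for i in range(motions.shape[0])]  # drop padded tails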
When I read the code in GaussianDiffusion, I think there is some difference between the paper and the code. Here, the model_output is the predicted noise, while the target is the … If so, it (…)
It seems some paths for evaluation are hard-coded? I encountered the following error:
FileNotFoundError: [Errno 2] No such file or directory: './data/glove/our_vab_data.npy'
Hi Mingyuan,
What is "opt.times" for? I see it is multiplied on the real dataset length in len() method, why do this? Why not just increase the epoch number?
Thanks,
Jeremy
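For context, a sketch of the pattern being asked about (names assumed): inflating the reported dataset length makes one "epoch" sweep the data several times, amortizing per-epoch overhead such as evaluation and checkpointing, instead of raising the epoch count:

from torch.utils.data import Dataset

class RepeatedMotionDataset(Dataset):
    def __init__(self, samples, times):
        self.samples = samples
        self.times = times  # e.g. opt.times

    def __len__(self):
        # one "epoch" now covers the data `times` times
        return len(self.samples) * self.times

    def __getitem__(self, idx):
        return self.samples[idx % len(self.samples)]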
Thanks a lot for your paper and code!
In your implementation, you didn't set an attention mask for the text sequence, neither in the textTransformer layers nor in the LinearTemporalCrossAttention layers. Why doesn't this have any effect? Below is the related code.
def encode_text(self, text, device):
    with torch.no_grad():
        text = clip.tokenize(text, truncate=True).to(device)
        x = self.clip.token_embedding(text).type(self.clip.dtype)  # [batch_size, n_ctx, latent_dim]
        x = x + self.clip.positional_embedding.type(self.clip.dtype)
        x = x.permute(1, 0, 2)  # NLD -> LND
        x = self.clip.transformer(x)
        x = self.clip.ln_final(x).type(self.clip.dtype)

    # T, B, D
    x = self.text_pre_proj(x)
    xf_out = self.textTransEncoder(x)
    xf_out = self.text_ln(xf_out)
    xf_proj = self.text_proj(xf_out[text.argmax(dim=-1), torch.arange(xf_out.shape[1])])
    # B, T, D
    xf_out = xf_out.permute(1, 0, 2)
    return xf_proj, xf_out
class LinearTemporalCrossAttention(nn.Module):

    def __init__(self, seq_len, latent_dim, text_latent_dim, num_head, dropout, time_embed_dim):
        super().__init__()
        self.num_head = num_head
        self.norm = nn.LayerNorm(latent_dim)
        self.text_norm = nn.LayerNorm(text_latent_dim)
        self.query = nn.Linear(latent_dim, latent_dim)
        self.key = nn.Linear(text_latent_dim, latent_dim)
        self.value = nn.Linear(text_latent_dim, latent_dim)
        self.dropout = nn.Dropout(dropout)
        self.proj_out = StylizationBlock(latent_dim, time_embed_dim, dropout)

    def forward(self, x, xf, emb):
        """
        x: B, T, D
        xf: B, N, L
        """
        B, T, D = x.shape
        N = xf.shape[1]
        H = self.num_head
        # B, T, D
        query = self.query(self.norm(x))
        # B, N, D
        key = self.key(self.text_norm(xf))
        query = F.softmax(query.view(B, T, H, -1), dim=-1)
        key = F.softmax(key.view(B, N, H, -1), dim=1)
        # B, N, H, HD
        value = self.value(self.text_norm(xf)).view(B, N, H, -1)
        # B, H, HD, HD
        attention = torch.einsum('bnhd,bnhl->bhdl', key, value)
        y = torch.einsum('bnhd,bhdl->bnhl', query, attention).reshape(B, T, D)
        y = x + self.proj_out(y, emb)
        return y
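For reference, a hypothetical masked variant of the key softmax above; mask is an assumed (B, N) boolean tensor, True at real text tokens. Setting padded positions to -inf before the softmax over the token axis gives them (near-)zero weight:

key = self.key(self.text_norm(xf))                       # (B, N, D)
key = key.masked_fill(~mask[:, :, None], float('-inf'))  # block padded tokens
key = F.softmax(key.view(B, N, H, -1), dim=1)            # padded rows get ~0 weight
value = self.value(self.text_norm(xf)) * mask[:, :, None]
value = value.view(B, N, H, -1)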
Nice job! Thanks for sharing. I have a question about how you process the motion data. In your args I found an option called 'opt.feat_bias', which you set to 25 and use to scale the root motion features and foot contacts. Why is this needed? Comparing with MotionGPT, it seems they do not do this.
Any chance for a google colab notebook to test this out?
Hello
First of all thank you very much for posting this repository.
My reproduction produces a skeleton model of the human body. How can I obtain the skinned SMPL model shown on your project page?
Thanks in advance
Hi,
Thanks for your great work! I wonder, do you plan to provide pretrained models?
Brilliant work, as shown in this project! I'd appreciate it if you could share your GPU type and total training time.
Thank you for your work, it's really interesting.
I have one question: how can I get the SMPL format from the generated pose?
I tried to recover SMPL from cont6d_params (data[..., 67:193]) using smplpytorch,
but the pose looks different from your visualization based on joint positions.
Hi Mingyuan,
Why is the positional encoding of the motion sequence a randomly initialized learnable tensor (self.sequence_embedding)?
h = h + self.sequence_embedding.unsqueeze(0)[:, :T, :]
Thanks,
Jeremy
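For context, a minimal sketch of the learned positional-embedding pattern being asked about (shapes assumed); unlike fixed sinusoidal encodings, the per-position offsets here are trained along with the rest of the network:

import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    # trainable (max_len, D) table, as with self.sequence_embedding
    def __init__(self, max_len, latent_dim):
        super().__init__()
        self.table = nn.Parameter(torch.randn(max_len, latent_dim))

    def forward(self, h):  # h: (B, T, D)
        T = h.shape[1]
        return h + self.table[None, :T, :]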
Hi
I noticed that generation is limited to 196 frames. Is this just so that it provides a quick result? If it is a limitation of the current model, would it be possible to train a model to produce more frames? I had a quick look but couldn't find a limit anywhere in the training code.
Also, am I correct in understanding that the dim_pose variable is the number of unique poses in the dataset?
Thanks
When I tried to run it, I first got this warning:
MovieWriter ffmpeg unavailable; using Pillow instead
Then I installed ffmpeg-python with pip, but the warning is still the same and the file is not saved in .mp4 format.
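For context, an assumption about the cause: matplotlib's MovieWriter needs the ffmpeg executable, not the ffmpeg-python pip package. If the binary is installed but not on PATH, it can be pointed to explicitly:

import matplotlib
# matplotlib resolves the writer via the ffmpeg *binary*; the ffmpeg-python
# package does not provide one. The path below is an assumed example.
matplotlib.rcParams['animation.ffmpeg_path'] = '/usr/bin/ffmpeg'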
Hi,
First of all thank you very much for submitting your code here. :)
I was wondering if it's possible to export the sequence of 3D poses that a text prompt generates. I could then use those poses to blend them with another model, or just to experiment a little.
Thanks in advance,
Hi, thanks for sharing the great work!
I tried to run the evaluation procedure following the instructions in the install.md file and the Evaluation section.
However, the process fails at the FID calculation after reporting a NaN value.
(The matching score is NaN even for the ground-truth data...)
I did download all the pretrained models and placed them where they should be.
Do you know what the cause could be? Thank you in advance!
Can the output be exported to BVH? If so, what skeleton does the BVH use? Can it be used with mocap data?
Hey there, total noob here :)
Is there a way to export the animation data for further processing in Blender for example?
Thanks for the paper and code! I'm curious how causal masking and bidirectional masking perform differently in MotionDiffuse.
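For reference, the two masking schemes in question in standard PyTorch form (sequence length assumed); True entries are blocked, matching nn.MultiheadAttention's attn_mask convention:

import torch

T = 196  # assumed sequence length
# causal: position t may only attend to positions <= t
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
# bidirectional: no restriction at all
bidir_mask = torch.zeros(T, T, dtype=torch.bool)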
Hello, I'm curious about the std processing in your dataset. I found that you divide the std of the root rotational velocity, root linear velocity, root height, and foot contacts by 25 in your implementation. Could you tell me the motivation for doing this? Is there any reason behind the choice of feat_bias?
# root_rot_velocity (B, seq_len, 1)
std[0:1] = std[0:1] / FEAT_BIAS
# root_linear_velocity (B, seq_len, 2)
std[1:3] = std[1:3] / FEAT_BIAS
# root_y (B, seq_len, 1)
std[3:4] = std[3:4] / FEAT_BIAS
# foot contact (B, seq_len, 4)
std[4 + (JOINTS_NUM - 1) * 9 + JOINTS_NUM * 3:] = \
    std[4 + (JOINTS_NUM - 1) * 9 + JOINTS_NUM * 3:] / FEAT_BIAS
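For context (my reading, not confirmed by the authors): normalization divides by the stored std, so shrinking the std by FEAT_BIAS scales those channels up 25x after normalization, effectively up-weighting the low-variance root and contact features:

FEAT_BIAS = 25.0
# x, mean, std: raw feature, dataset mean, dataset std (assumed arrays)
z = (x - mean) / (std / FEAT_BIAS)  # == FEAT_BIAS * (x - mean) / std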
Hi Mingyuan,
How can I run with multiple GPUs on a single server?
Thanks,
Jeremy
Hi,
In your paper (https://arxiv.org/pdf/2208.15001.pdf), I found Section 3.5 on fine-grained controlling interesting and relevant to my work. However, I cannot fully understand what is described, so I tried to look for the implementation in the code. Unfortunately, I can't find any related part. Could you point out the relevant lines of code?
Regards
If I want to train the model on multiple GPUs and I don't have srun and slurm on my system, how can I run the code?
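A hedged sketch of a slurm-free alternative, assuming the training script can initialize torch.distributed from environment variables (which torchrun sets for each worker):

# launch with:  torchrun --nproc_per_node=4 tools/train.py <your args>
import os
import torch
import torch.distributed as dist

def setup_distributed():
    # torchrun exports RANK, WORLD_SIZE and LOCAL_RANK per process
    dist.init_process_group(backend='nccl')
    torch.cuda.set_device(int(os.environ['LOCAL_RANK']))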
Hi Mingyuan,
Do you know how to get the 251-dimensional motion vectors provided in the KIT dataset?
I am computing the FID on my own dataset, but our data only has two channels (x, y) instead of 251, so I wonder how to map a low-dimensional motion sequence to the 251-dimensional motion vectors.
Thanks,
Jeremy
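For reference, the 251-dimensional KIT representation follows the HumanML3D feature layout with 21 joints; the arithmetic below shows how the dimensions add up (mapping 2-channel data would still require constructing each of these quantities):

JOINTS_NUM = 21  # KIT-ML skeleton
dim = (
    1                       # root rotational velocity
    + 2                     # root linear velocity (x, z)
    + 1                     # root height
    + (JOINTS_NUM - 1) * 3  # joint positions relative to the root
    + (JOINTS_NUM - 1) * 6  # 6D continuous joint rotations (cont6d)
    + JOINTS_NUM * 3        # joint velocities
    + 4                     # foot contact labels
)
assert dim == 251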
Hi,
Is it possible to generate a single character from the pose for about 5 seconds?
I have a video of poses (OpenPose + hands + face) and I was wondering if it is possible to generate a 5-second output video with a consistent character/avatar that performs a dance, etc., driven by the controlled (pose) input.
I have a video of OpenPose + hands + face and I want to generate a human-like animation (no matter what, just a consistent character/avatar).
Sample Video
P.S. Any model that supports pose + hands + face can be used!
Thanks
Best regards