
mingyuan-zhang / motiondiffuse

804 stars · 29 watchers · 70 forks · 30.49 MB

MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model

Home Page: https://mingyuan-zhang.github.io/projects/MotionDiffuse.html

License: Other

Python 100.00%
3d-generation diffusion-model motion-generation text-driven


motiondiffuse's Issues

What is "zero_module" for?

Hi Mingyuan,

Why zero out the parameters of the "self.out" projection module in transformers.py?

Thanks,
Jeremy
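
For context, zero_module in diffusion codebases usually refers to a small helper that zeroes a layer's parameters at initialization, so the residual branch it feeds starts as an identity mapping and training is more stable early on. A sketch of the usual definition (the exact code in transformers.py may differ):

import torch.nn as nn

def zero_module(module: nn.Module) -> nn.Module:
    # Zero out the parameters of a module and return it, so the block it
    # terminates initially contributes nothing to the residual stream.
    for p in module.parameters():
        p.detach().zero_()
    return module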

Questions about the part-aware text controlling

Hi,
Thanks for sharing this excellent work. My questions are mainly about the part-aware text controlling.

First, how does the 'noise interpolation' work? Is it only conducted during the sampling procedure?

Second, will you release the code for the part-aware motion generation and the time-varied motion generation?
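
For what it's worth, here is a rough sketch of how the paper's "noise interpolation" for part-aware control could work at sampling time, combining per-part noise predictions through body-part masks. This is an assumption about the released code, and the model signature below is hypothetical:

import torch

def combine_part_noise(model, x_t, t, texts, part_masks):
    # texts: one prompt per body part; part_masks: 0/1 masks over the feature dims.
    # Each part's noise prediction contributes only where its mask is 1, so the
    # interpolation happens purely inside the sampling loop, not during training.
    eps = torch.zeros_like(x_t)
    for text, mask in zip(texts, part_masks):
        eps = eps + mask * model(x_t, t, text)  # hypothetical model signature
    return eps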

How to recognize the padding in generation process

Normally in the inference process, we only provide the text to guide the generation, and the generated motion can contain zero padding, since we add padding during training. My question is: how can we remove the predicted padding from the generated motion?
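
One simple heuristic (an assumption, not something the repository provides) is to treat trailing frames with near-zero feature magnitude as padding and trim them:

import numpy as np

def strip_trailing_padding(motion, eps=1e-4):
    # motion: generated feature array of shape (T, D); padded frames are ~zero.
    frame_energy = np.abs(motion).sum(axis=-1)   # (T,)
    valid = np.where(frame_energy > eps)[0]
    return motion if len(valid) == 0 else motion[:valid[-1] + 1]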

Implementation of the training loss.

When I read the code in GaussianDiffusion, I noticed a difference between the paper and the code.

Here, the model_output is the predicted noise $\epsilon_{\theta}(x_t, t, \text{text})$ and the target is $\tilde{\mu}_{t}(x_t, x_0)$. Is that right?

If so, the resulting term $\epsilon_{\theta}(x_t, t, \text{text}) - \tilde{\mu}_{t}(x_t, x_0)$ is not the same as what Equation 4 states.
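
For reference, in standard DDPM notation the two quantities differ only by a fixed rescaling (my reconstruction of the usual formulas, not taken from the repository):

$\tilde{\mu}_{t}(x_t, x_0) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon\right), \qquad \mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0,\epsilon,t}\left[\left\|\epsilon - \epsilon_{\theta}(x_t, t, \text{text})\right\|^2\right]$

so directly comparing a predicted noise against $\tilde{\mu}_{t}$ would indeed not match the $\epsilon$-prediction loss of Equation 4 unless one of the two is converted through this relation.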

Evaluation Error

It seems some paths used for evaluation are hard-coded? I encountered the following error:

FileNotFoundError: [Errno 2] No such file or directory: './data/glove/our_vab_data.npy'

What is "opt.times" for?

Hi Mingyuan,

What is "opt.times" for? I see it is multiplied on the real dataset length in len() method, why do this? Why not just increase the epoch number?

Thanks,
Jeremy
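
For context, a minimal sketch of the pattern being asked about: reporting a length of real_length * times makes one "epoch" sweep the data several times, which amortizes per-epoch overhead such as evaluation, checkpointing, and dataloader restarts. This is a generic illustration, not the repository's dataset class:

import torch

class RepeatedDataset(torch.utils.data.Dataset):
    def __init__(self, samples, times=1):
        self.samples = samples   # the real data
        self.times = times       # analogous to opt.times

    def __len__(self):
        # One "epoch" now iterates the real data `times` times.
        return len(self.samples) * self.times

    def __getitem__(self, idx):
        return self.samples[idx % len(self.samples)]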

About text mask

Thanks a lot for your paper and code!
In your implementation, you didn't set an attention mask for the text sequence in either the text transformer layers or the LinearTemporalCrossAttention layers. Why doesn't this cause any problems? Below is the related code.

def encode_text(self, text, device):
    with torch.no_grad():
        text = clip.tokenize(text, truncate=True).to(device)
        x = self.clip.token_embedding(text).type(self.clip.dtype)  # [batch_size, n_ctx, latent_dim]
        x = x + self.clip.positional_embedding.type(self.clip.dtype)
        x = x.permute(1, 0, 2)  # NLD -> LND
        x = self.clip.transformer(x)
        x = self.clip.ln_final(x).type(self.clip.dtype)

    # T, B, D
    x = self.text_pre_proj(x)
    xf_out = self.textTransEncoder(x)
    xf_out = self.text_ln(xf_out)
    xf_proj = self.text_proj(xf_out[text.argmax(dim=-1), torch.arange(xf_out.shape[1])])
    # B, T, D
    xf_out = xf_out.permute(1, 0, 2)
    return xf_proj, xf_out

class LinearTemporalCrossAttention(nn.Module):

    def __init__(self, seq_len, latent_dim, text_latent_dim, num_head, dropout, time_embed_dim):
        super().__init__()
        self.num_head = num_head
        self.norm = nn.LayerNorm(latent_dim)
        self.text_norm = nn.LayerNorm(text_latent_dim)
        self.query = nn.Linear(latent_dim, latent_dim)
        self.key = nn.Linear(text_latent_dim, latent_dim)
        self.value = nn.Linear(text_latent_dim, latent_dim)
        self.dropout = nn.Dropout(dropout)
        self.proj_out = StylizationBlock(latent_dim, time_embed_dim, dropout)

    def forward(self, x, xf, emb):
        """
        x: B, T, D
        xf: B, N, L
        """
        B, T, D = x.shape
        N = xf.shape[1]
        H = self.num_head
        # B, T, D
        query = self.query(self.norm(x))
        # B, N, D
        key = self.key(self.text_norm(xf))
        query = F.softmax(query.view(B, T, H, -1), dim=-1)
        key = F.softmax(key.view(B, N, H, -1), dim=1)
        # B, N, H, HD
        value = self.value(self.text_norm(xf)).view(B, N, H, -1)
        # B, H, HD, HD
        attention = torch.einsum('bnhd,bnhl->bhdl', key, value)
        y = torch.einsum('bnhd,bhdl->bnhl', query, attention).reshape(B, T, D)
        y = x + self.proj_out(y, emb)
        return y
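
For illustration, a minimal sketch (not the repository's code) of how a key-padding mask could be passed to the text encoder if one wanted to mask padded CLIP tokens. The shapes and the rule "token id 0 means padding" are assumptions:

import torch
import torch.nn as nn

B, N, D = 2, 77, 256                              # batch, CLIP context length, text latent dim
text_tokens = torch.randint(1, 100, (B, N))       # stand-in for CLIP token ids
text_tokens[:, 20:] = 0                           # pretend everything after token 20 is padding
x_embedded = torch.randn(N, B, D)                 # sequence-first features, as in encode_text

pad_mask = text_tokens == 0                       # [B, N], True where padded
encoder_layer = nn.TransformerEncoderLayer(d_model=D, nhead=4)
text_encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

# src_key_padding_mask makes every self-attention layer ignore padded positions.
xf_out = text_encoder(x_embedded, src_key_padding_mask=pad_mask)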

About motion feature process

Nice job! Thanks for sharing. I have a question about how you process the motion data. In your args I found an option called 'opt.feat_bias', set to 25, which you use to scale some motion features and the foot contact. Why is this needed? Comparing with MotionGPT, it seems they do not do this.

Google Colab?

Any chance for a google colab notebook to test this out?

How to implement SMPL skin model?

Hello

First of all thank you very much for posting this repository.

The result of my reproduction is a skeleton model of the human body. How can the SMPL skinned model shown on your project page be obtained?

Thanks in advance

Training

I ran the training code on a single GPU, but it seems to produce NaN values. Is anything wrong?

[attached screenshot]

About training time and hardware

Brilliant work, as shown in this project! I would appreciate it if you could share the GPU type and total training time you used.

Performance on the KIT Dataset

Hi, excellent work and thanks for sharing the code! I tried out the training code on the KIT dataset. After training, the evaluation results are not as good as those in the paper:
[screenshot of evaluation results]

This is my training setting:
[screenshot of training settings]

Could you help me? Thanks a lot!

Question on number of frames

Hi
I noticed that for generating the animation there is a limit of 196 frames. Is this just so that it provides a quick result? If it is a limitation of the current model, would it be possible to train a model to handle more frames? I had a quick look but couldn't find a limit anywhere in the training code.

Also, am I correct in understanding that the dim_pose variable is the number of unique poses in the dataset?
Thanks

Cannot save the result as .mp4

When I tried to run it, it first gave this warning:
MovieWriter ffmpeg unavailable; using Pillow instead
I then installed ffmpeg-python using pip, but the error stays the same and the file is not saved in .mp4 format.
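
A likely cause (stated as an assumption, since the environment isn't shown): pip install ffmpeg-python only installs Python bindings, while matplotlib's MovieWriter needs the ffmpeg executable itself (e.g. conda install -c conda-forge ffmpeg or apt-get install ffmpeg). A quick check:

import matplotlib
import matplotlib.animation as animation

# False means matplotlib cannot find an ffmpeg binary on the PATH.
print(animation.writers.is_available("ffmpeg"))

# If ffmpeg is installed somewhere non-standard, point matplotlib at it:
# matplotlib.rcParams["animation.ffmpeg_path"] = "/path/to/ffmpeg"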

Are poses exportable?

Hi,

First of all thank you very much for submitting your code here. :)

I was wondering if it's possible to export the sequence of 3D poses that a text prompt generates. I could then use those poses to blend them with another model, or just to play around a little.

Thanks in advance,
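
A minimal sketch of one way to do this, assuming you already have the (T, J, 3) joint-position array that the plotting script animates (the array name and shape here are placeholders):

import numpy as np

joints = np.zeros((60, 22, 3), dtype=np.float32)  # placeholder: T frames, J joints, xyz
np.save("generated_motion.npy", joints)           # export for Blender scripts, other models, etc.

loaded = np.load("generated_motion.npy")          # load it back elsewhere, shape (T, J, 3)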

Evaluation Error

Hi, thanks for sharing the great work!

I tried to run the evaluation procedure following the instructions in the install.md file and the Evaluation section.
However, it seems that the process fails at the FID calculation after reporting a NaN value.
(The matching score is also NaN, even for the ground-truth data...)
I did download all the pretrained models and placed them where they should be.
Do you know what the cause could be? Thank you in advance!

Export animation

Hey there, total noob here :)

Is there a way to export the animation data for further processing in Blender for example?

Question on attention mask type

Thanks for the paper and code! I'm curious how causal masking and bidirectional masking perform differently in MotionDiffuse.

About FEAT_BIAS in dataset.

Hello, I'm curious about the std processing in your dataset. I found that you divide the root rotational velocity, root linear velocity, root y, and foot contact by 25 in your implementation. Could you tell me the motivation for doing this? Is there any particular reason for the chosen feat_bias value?

# root_rot_velocity (B, seq_len, 1)
std[0:1] = std[0:1] / FEAT_BIAS
# root_linear_velocity (B, seq_len, 2)
std[1:3] = std[1:3] / FEAT_BIAS
# root_y (B, seq_len, 1)
std[3:4] = std[3:4] / FEAT_BIAS
# foot contact (B, seq_len, 4)
std[4 + (JOINTS_NUM - 1) * 9 + JOINTS_NUM * 3:] = std[4 + (JOINTS_NUM - 1) * 9 + JOINTS_NUM * 3:] / FEAT_BIAS
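
One common reading of this (an interpretation, not confirmed by the authors): features are later normalized as (x - mean) / std, so dividing std by FEAT_BIAS effectively multiplies those low-magnitude root and foot-contact channels by 25 after normalization, giving them more weight relative to the joint features. A tiny illustration with made-up numbers:

import numpy as np

FEAT_BIAS = 25.0
raw = np.array([0.02, 0.50])                 # e.g. a root velocity and a joint feature
mean = np.zeros_like(raw)
std = np.array([0.04, 0.50])

std_biased = std.copy()
std_biased[0] = std_biased[0] / FEAT_BIAS    # only the root/foot-contact channels get this

plain = (raw - mean) / std                   # -> [0.5, 1.0]
boosted = (raw - mean) / std_biased          # -> [12.5, 1.0]: the biased channel is 25x larger
print(plain, boosted)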

Where can I find the fine-grained controlling part

Hi,

In your paper (https://arxiv.org/pdf/2208.15001.pdf), I found Section 3.5 on fine-grained controlling interesting and relevant to my work. However, I cannot fully understand what is described, so I tried to look for the implementation in the code. Unfortunately, I can't find any related part. Could you point out the related lines of code?


Regards

About DDP training

If I want to train the model on multiple GPUs but I don't have srun or Slurm on my system, how can I run the code?
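
As a generic PyTorch alternative (not a documented launch mode of this repository), scripts that initialize torch.distributed from environment variables can be launched with torchrun instead of srun, e.g. torchrun --nproc_per_node=4 train.py <args>, with an init along these lines:

import os
import torch
import torch.distributed as dist

def setup_distributed():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE in the environment.
    dist.init_process_group(backend="nccl", init_method="env://")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return local_rank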

Evaluation on My Dataset: How to Get the 251-dimensional Motion Vectors?

Hi Mingyuan,

Do you know how to get the 251-dimensional motion vectors as provided in the KIT dataset?

I am computing the FID on my dataset, but our data only has two channels (x, y) instead of 251. Therefore, I wonder how to map the low-dimensional motion sequence to 251-dimensional motion vectors.

Thanks,
Jeremy
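
For reference, the 251 dimensions follow the motion representation of Guo et al.'s text-to-motion pipeline (my breakdown of that format, not something specific to this repository): with J = 21 KIT joints, the vector concatenates 4 root values, (J-1)*3 joint positions, (J-1)*6 6D rotations, J*3 local velocities, and 4 foot-contact labels. Mapping a 2-channel (x, y) sequence into this space is not a simple projection; you would essentially need full 3D joint data first.

J = 21                                            # KIT-ML joints (J = 22 and 263 dims for HumanML3D)
dim = 4 + (J - 1) * 3 + (J - 1) * 6 + J * 3 + 4   # root + positions + 6D rotations + velocities + foot contact
assert dim == 251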

Hand + Face of Human Pose

Hi,
Is it possible to generate a single character from the Pose for about 5 seconds?

I have a video of pose data (OpenPose + hands + face) and I was wondering if it is possible to generate a 5-second output video with a consistent character/avatar that performs the dance, etc., from the controlled (pose) input.

I have a video of OpenPose + hands + face and I want to generate a human-like animation (no matter which, just a consistent character/avatar).
Sample Video

P.S. Any model that supports pose + hands + face can be used!

Thanks
Best regards
