mingyuan-zhang / MotionDiffuse
MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model
Home Page: https://mingyuan-zhang.github.io/projects/MotionDiffuse.html
License: Other
Hi Mingyuan,
Why do you zero out the parameters of the "self.out" projection module in transformers.py?
Thanks,
Jeremy
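For context, a minimal sketch of the zero-initialization pattern the question refers to, following the common diffusion-model practice (e.g. in guided-diffusion) of zeroing the final projection so each residual block initially acts as an identity; names here are illustrative, not the authors' code:

import torch.nn as nn

def zero_module(module: nn.Module) -> nn.Module:
    # zero all parameters so the module's initial output is 0
    for p in module.parameters():
        p.detach().zero_()
    return module

# e.g. self.out = zero_module(nn.Linear(latent_dim, latent_dim));
# the block then contributes nothing to the residual stream at init,
# which tends to stabilize early diffusion training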
Hi Mingyuan,
I am wondering what "xf_proj" is for in https://github.com/mingyuan-zhang/MotionDiffuse/blob/main/text2motion/models/transformer.py#L394.
Why do you select the word with the maximum embedding value?
Thanks,
Jeremy
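For context, this indexing exploits a CLIP tokenizer convention rather than embedding magnitudes: the end-of-text (EOT) token has the largest id in CLIP's vocabulary, so text.argmax(dim=-1) locates the EOT position in each sequence. A sketch, assuming xf_out is (T, B, D) at this point as in the quoted code:

eot_pos = text.argmax(dim=-1)                            # (B,) index of the EOT token
pooled = xf_out[eot_pos, torch.arange(xf_out.shape[1])]  # (B, D) sentence-level feature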
Hi,
Thanks for sharing this excellent work. My questions are mainly about part-aware text control.
First, how does the 'noise interpolation' work? Is it only conducted during the sampling procedure?
Second, will you release the code for the part-aware motion generation and the time-varied motion generation?
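My reading of the paper's part-aware control, sketched under the assumption that it is applied only at sampling time: each body-part prompt yields its own noise prediction, and the predictions are blended with a binary part mask before the reverse-diffusion step. All names below are illustrative, not the authors' code:

# x_t: (B, T, D) noisy motion; part_mask: (1, 1, D), 1 on upper-body dims
eps_upper = model(x_t, t, text_upper)  # noise prediction for the upper-body prompt
eps_lower = model(x_t, t, text_lower)  # noise prediction for the lower-body prompt
eps = part_mask * eps_upper + (1 - part_mask) * eps_lower
x_prev = ddpm_step(x_t, eps, t)        # standard reverse update (assumed helper)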
Normally in the inference process we only provide the text to guide the generation, and the generated motion can contain zero padding, since we add padding during training. My question is: how can we remove the predicted padding from the generated motion?
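A minimal sketch of one common workaround, assuming the desired length is known at sampling time (as when a per-sample length is passed to generation):

# motions: (B, T_max, D) generated batch; m_lens: (B,) requested lengths (assumed known)
trimmed = [motions[i, :m_lens[i]] for i in range(motions.shape[0])]  # drop padded tails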
When I read the code in GaussianDiffusion, I think there is some difference between the paper and the code. Here, the model_output is the predicted noise, while the target is the … If so, it (…)
It seems some paths for evaluation are hard-coded? I encountered the following error:
FileNotFoundError: [Errno 2] No such file or directory: './data/glove/our_vab_data.npy'
Hi Mingyuan,
What is "opt.times" for? I see it is multiplied on the real dataset length in len() method, why do this? Why not just increase the epoch number?
Thanks,
Jeremy
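For context, a sketch of the pattern being asked about (names assumed): inflating the reported dataset length makes one "epoch" sweep the data several times, amortizing per-epoch overhead such as evaluation and checkpointing, instead of raising the epoch count:

from torch.utils.data import Dataset

class RepeatedMotionDataset(Dataset):
    def __init__(self, samples, times):
        self.samples = samples
        self.times = times  # e.g. opt.times

    def __len__(self):
        # one "epoch" now covers the data `times` times
        return len(self.samples) * self.times

    def __getitem__(self, idx):
        return self.samples[idx % len(self.samples)]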
Thanks a lot for your paper and code!
In your implementation, you didn't set an attention mask for the text sequence, neither in the textTransformer layers nor in the LinearTemporalCrossAttention layers. Why doesn't this have any effect? Below is the related code.
def encode_text(self, text, device):
    with torch.no_grad():
        text = clip.tokenize(text, truncate=True).to(device)
        x = self.clip.token_embedding(text).type(self.clip.dtype)  # [batch_size, n_ctx, latent_dim]
        x = x + self.clip.positional_embedding.type(self.clip.dtype)
        x = x.permute(1, 0, 2)  # NLD -> LND
        x = self.clip.transformer(x)
        x = self.clip.ln_final(x).type(self.clip.dtype)

    # T, B, D
    x = self.text_pre_proj(x)
    xf_out = self.textTransEncoder(x)
    xf_out = self.text_ln(xf_out)
    xf_proj = self.text_proj(xf_out[text.argmax(dim=-1), torch.arange(xf_out.shape[1])])
    # B, T, D
    xf_out = xf_out.permute(1, 0, 2)
    return xf_proj, xf_out
class LinearTemporalCrossAttention(nn.Module):

    def __init__(self, seq_len, latent_dim, text_latent_dim, num_head, dropout, time_embed_dim):
        super().__init__()
        self.num_head = num_head
        self.norm = nn.LayerNorm(latent_dim)
        self.text_norm = nn.LayerNorm(text_latent_dim)
        self.query = nn.Linear(latent_dim, latent_dim)
        self.key = nn.Linear(text_latent_dim, latent_dim)
        self.value = nn.Linear(text_latent_dim, latent_dim)
        self.dropout = nn.Dropout(dropout)
        self.proj_out = StylizationBlock(latent_dim, time_embed_dim, dropout)

    def forward(self, x, xf, emb):
        """
        x: B, T, D
        xf: B, N, L
        """
        B, T, D = x.shape
        N = xf.shape[1]
        H = self.num_head
        # B, T, D
        query = self.query(self.norm(x))
        # B, N, D
        key = self.key(self.text_norm(xf))
        query = F.softmax(query.view(B, T, H, -1), dim=-1)
        key = F.softmax(key.view(B, N, H, -1), dim=1)
        # B, N, H, HD
        value = self.value(self.text_norm(xf)).view(B, N, H, -1)
        # B, H, HD, HD
        attention = torch.einsum('bnhd,bnhl->bhdl', key, value)
        y = torch.einsum('bnhd,bhdl->bnhl', query, attention).reshape(B, T, D)
        y = x + self.proj_out(y, emb)
        return y
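For reference, a hypothetical masked variant of the key softmax above; mask is an assumed (B, N) boolean tensor, True at real text tokens. Setting padded positions to -inf before the softmax over the token axis gives them (near-)zero weight:

key = self.key(self.text_norm(xf))                       # (B, N, D)
key = key.masked_fill(~mask[:, :, None], float('-inf'))  # block padded tokens
key = F.softmax(key.view(B, N, H, -1), dim=1)            # padded rows get ~0 weight
value = self.value(self.text_norm(xf)) * mask[:, :, None]
value = value.view(B, N, H, -1)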
Nice job! Thanks for sharing. I have a question about how you process the motion data. In your args I found an option called 'opt.feat_bias', which you set to 25 and use to scale the root motion features and foot contacts. Why is this needed? Comparing with MotionGPT, it seems they do not do this.
Any chance for a google colab notebook to test this out?
Hello
First of all thank you very much for posting this repository.
My reproduction produces a skeleton model of the human body. How can I obtain the skinned SMPL model shown on your project page?
Thanks in advance
Hi,
Thanks for your great work! I wonder, do you plan to provide pretrained models?
Brilliant work, as shown in this project! I'd appreciate it if you could share your GPU type and total training time.
Thank you for your work, it's really interesting.
I have one question: how can I get the SMPL format from the generated pose?
I tried to recover SMPL from cont6d_params (data[..., 67:193]) using smplpytorch,
but the pose looks different from your visualization based on joint positions.
Hi Mingyuan,
Why is the positional encoding of the motion sequence a randomly initialized learnable tensor (self.sequence_embedding)?
h = h + self.sequence_embedding.unsqueeze(0)[:, :T, :]
Thanks,
Jeremy
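For context, a minimal sketch of the learned positional-embedding pattern being asked about (shapes assumed); unlike fixed sinusoidal encodings, the per-position offsets here are trained along with the rest of the network:

import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    # trainable (max_len, D) table, as with self.sequence_embedding
    def __init__(self, max_len, latent_dim):
        super().__init__()
        self.table = nn.Parameter(torch.randn(max_len, latent_dim))

    def forward(self, h):  # h: (B, T, D)
        T = h.shape[1]
        return h + self.table[None, :T, :]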
Hi
I noticed that generation is limited to 196 frames. Is this just so that it provides a quick result? If it is a limitation of the current model, would it be possible to train a model to produce more frames? I had a quick look but couldn't find a limit anywhere in the training code.
Also, am I correct in understanding that the dim_pose variable is the number of unique poses in the dataset?
Thanks
When I tried to run it, I first got this warning:
MovieWriter ffmpeg unavailable; using Pillow instead
Then I installed ffmpeg-python with pip, but the warning is still the same and the file is not saved in .mp4 format.
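For context, an assumption about the cause: matplotlib's MovieWriter needs the ffmpeg executable, not the ffmpeg-python pip package. If the binary is installed but not on PATH, it can be pointed to explicitly:

import matplotlib
# matplotlib resolves the writer via the ffmpeg *binary*; the ffmpeg-python
# package does not provide one. The path below is an assumed example.
matplotlib.rcParams['animation.ffmpeg_path'] = '/usr/bin/ffmpeg'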
Hi,
First of all thank you very much for submitting your code here. :)
I was wondering if it's possible to export the sequence of 3D poses that a text prompt generates. I could then use those poses to blend them with another model, or just to experiment a little.
Thanks in advance,
Hi, thanks for sharing the great work!
I tried to run the evaluation procedure following the instructions in the install.md file and the Evaluation section.
However, the process fails at the FID calculation after reporting a NaN value.
(The matching score is NaN even for the ground-truth data...)
I did download all the pretrained models and placed them where they should be.
Do you know what the cause could be? Thank you in advance!
Can the output be exported to BVH? If so, what skeleton does the BVH use? Can it be used with mocap data?
Hey there, total noob here :)
Is there a way to export the animation data for further processing in Blender for example?
Thanks for the paper and code! I'm curious how causal masking and bidirectional masking perform differently in MotionDiffuse.
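For reference, the two masking schemes in question in standard PyTorch form (sequence length assumed); True entries are blocked, matching nn.MultiheadAttention's attn_mask convention:

import torch

T = 196  # assumed sequence length
# causal: position t may only attend to positions <= t
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
# bidirectional: no restriction at all
bidir_mask = torch.zeros(T, T, dtype=torch.bool)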
Hello, I'm curious about the std processing in your dataset. I found that you divide the std of the root rotational velocity, root linear velocity, root height, and foot contacts by 25 in your implementation. Could you tell me the motivation for doing this? Is there any reason behind the choice of feat_bias?
# root_rot_velocity (B, seq_len, 1)
std[0:1] = std[0:1] / FEAT_BIAS
# root_linear_velocity (B, seq_len, 2)
std[1:3] = std[1:3] / FEAT_BIAS
# root_y (B, seq_len, 1)
std[3:4] = std[3:4] / FEAT_BIAS
# foot contact (B, seq_len, 4)
std[4 + (JOINTS_NUM - 1) * 9 + JOINTS_NUM * 3:] = \
    std[4 + (JOINTS_NUM - 1) * 9 + JOINTS_NUM * 3:] / FEAT_BIAS
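For context (my reading, not confirmed by the authors): normalization divides by the stored std, so shrinking the std by FEAT_BIAS scales those channels up 25x after normalization, effectively up-weighting the low-variance root and contact features:

FEAT_BIAS = 25.0
# x, mean, std: raw feature, dataset mean, dataset std (assumed arrays)
z = (x - mean) / (std / FEAT_BIAS)  # == FEAT_BIAS * (x - mean) / std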
Hi Mingyuan,
How can I run with multiple GPUs on a single server?
Thanks,
Jeremy
Hi,
In your paper (https://arxiv.org/pdf/2208.15001.pdf), I found Section 3.5 on fine-grained controlling interesting and relevant to my work. However, I cannot fully understand what is described, so I tried to look for the implementation in the code. Unfortunately, I can't find any related part. Could you point out the relevant lines of code?
Regards
If I want to train the model on multiple GPUs and I don't have srun and slurm on my system, how can I run the code?
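A hedged sketch of a slurm-free alternative, assuming the training script can initialize torch.distributed from environment variables (which torchrun sets for each worker):

# launch with:  torchrun --nproc_per_node=4 tools/train.py <your args>
import os
import torch
import torch.distributed as dist

def setup_distributed():
    # torchrun exports RANK, WORLD_SIZE and LOCAL_RANK per process
    dist.init_process_group(backend='nccl')
    torch.cuda.set_device(int(os.environ['LOCAL_RANK']))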
Hi Mingyuan,
Do you know how to get the 251-dimensional motion vectors provided in the KIT dataset?
I am computing the FID on my own dataset, but our data only has two channels (x, y) instead of 251, so I wonder how to map a low-dimensional motion sequence to the 251-dimensional motion vectors.
Thanks,
Jeremy
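For reference, the 251-dimensional KIT representation follows the HumanML3D feature layout with 21 joints; the arithmetic below shows how the dimensions add up (mapping 2-channel data would still require constructing each of these quantities):

JOINTS_NUM = 21  # KIT-ML skeleton
dim = (
    1                       # root rotational velocity
    + 2                     # root linear velocity (x, z)
    + 1                     # root height
    + (JOINTS_NUM - 1) * 3  # joint positions relative to the root
    + (JOINTS_NUM - 1) * 6  # 6D continuous joint rotations (cont6d)
    + JOINTS_NUM * 3        # joint velocities
    + 4                     # foot contact labels
)
assert dim == 251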
Hi,
Is it possible to generate a single character from the pose for about 5 seconds?
I have a video of poses (OpenPose + hands + face) and I was wondering if it is possible to generate a 5-second output video with a consistent character/avatar that performs a dance, etc., driven by the controlled (pose) input.
I have a video of OpenPose + hands + face and I want to generate a human-like animation (no matter what, just a consistent character/avatar).
Sample Video
P.S. Any model that supports pose + hands + face can be used!
Thanks
Best regards