A question about gradient (diffusion-lm, open issue)

xiangli1999 commented on July 4, 2024

A question about gradient

Comments (3)

XiangLi1999 commented on July 4, 2024

Hi, this is not a bug. We need to backpropagate the gradient signal from these two losses to the embedding function so that the embeddings are trained jointly with the rest of the model.
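
To illustrate the point about gradients reaching the embedding function, here is a minimal sketch in plain Python. It is not the repository's code. The toy embedding table `emb`, the fixed `noise` sample, the constant `pred_x_start` standing in for the network output, and the use of tied embeddings as the decoder are all assumptions made for illustration. It estimates the gradient of each loss with respect to one embedding entry by finite differences and shows that both are non-zero.

```python
import math

def softmax_nll(logits, target):
    # Negative log-likelihood of `target` under softmax(logits).
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def losses(emb, token, noise, pred_x_start):
    # x_start: the token's embedding plus a fixed noise sample (assumption).
    x_start = [e + n for e, n in zip(emb[token], noise)]
    # MSE between the stand-in network output and x_start.
    mse = sum((p - x) ** 2 for p, x in zip(pred_x_start, x_start)) / len(x_start)
    # Decoder logits from tied embeddings (assumption): dot product with each row.
    logits = [sum(x * e for x, e in zip(x_start, row)) for row in emb]
    return mse, softmax_nll(logits, token)

def grad_wrt_entry(emb, row, col, loss_index, token, noise, pred, eps=1e-6):
    # Central finite difference of one loss with respect to emb[row][col].
    plus = [r[:] for r in emb]
    minus = [r[:] for r in emb]
    plus[row][col] += eps
    minus[row][col] -= eps
    hi = losses(plus, token, noise, pred)[loss_index]
    lo = losses(minus, token, noise, pred)[loss_index]
    return (hi - lo) / (2 * eps)

emb = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # toy vocabulary of 3 tokens
token = 0
noise = [0.1, -0.2]
pred_x_start = [0.5, 0.5]  # hypothetical network output, held constant

grad_mse = grad_wrt_entry(emb, 0, 0, 0, token, noise, pred_x_start)
grad_nll = grad_wrt_entry(emb, 0, 0, 1, token, noise, pred_x_start)
print("d(mse)/d(emb[0][0]) =", grad_mse)
print("d(nll)/d(emb[0][0]) =", grad_nll)
```

In this toy setup both losses depend on the embedding table, so neither gradient vanishes. Detaching the embeddings in either term would remove that term's contribution to their update.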

summmeer commented on July 4, 2024

@XiangLi1999 Hi, I'm curious about the loss function too. I don't understand why decoder_nll is computed from x_start, as in decoder_nll = self.token_discrete_loss(x_start, get_logits, input_ids). As I understand it, x_start is the word embedding with extra noise added, so this decoder loss is just trying to undo that noise, which has nothing to do with the diffusion model. Besides, it is inconsistent with the \log p_\theta(w|x_0) term in the formulation of the loss function. Why can't we replace x_start with the predicted model_out_x_start? Wouldn't that be more reasonable? (The experimental results with it are not good, though.)
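
The two options being compared can be sketched as follows. This is a toy illustration in plain Python, not the repository's implementation. The helper `token_nll`, the embedding table `emb`, and the vectors `model_out_a` and `model_out_b` (two hypothetical network predictions) are assumptions. It only shows what each variant depends on: the NLL computed from x_start is unaffected by the network's prediction, while the NLL computed from the predicted x_start changes with it.

```python
import math

def token_nll(x, emb, target):
    # NLL of `target` when logits are dot products of x with each embedding row
    # (tied-embedding decoder, assumed for illustration).
    logits = [sum(a * b for a, b in zip(x, row)) for row in emb]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

emb = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # toy vocabulary of 3 tokens
target = 0
x_start = [1.1, -0.2]      # embedding of token 0 plus a small noise sample
model_out_a = [0.9, 0.1]   # hypothetical prediction close to the true embedding
model_out_b = [0.0, 0.8]   # hypothetical prediction far from it

# Variant 1: decoder NLL from x_start (independent of the network output).
nll_from_x_start = token_nll(x_start, emb, target)

# Variant 2: decoder NLL from the predicted x_start (depends on the network).
nll_from_pred_a = token_nll(model_out_a, emb, target)
nll_from_pred_b = token_nll(model_out_b, emb, target)

print(nll_from_x_start, nll_from_pred_a, nll_from_pred_b)
```

In this sketch, variant 1 can only send gradient to the embedding and decoder parameters, whereas variant 2 would also send gradient through whatever produced the prediction.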

AlonzoLeeeooo commented on July 4, 2024

Hi @summmeer,
I have the same concern. decoder_nll seems only weakly related to training the diffusion model. How did it behave in your experiments? When I trained the model, the NLL loss stayed at zero for a long time (about 8k iterations), and at around 10k training steps it started to increase. Have you run into anything similar?

Thanks for your reply in advance. It would help me a lot.

Best,
