
improved-voice-conversion-with-conditional-dsvae's Introduction

Towards Improved Voice Conversion with Conditional DSVAE

Paper for this work: https://arxiv.org/abs/2205.05227

This is based on our previous work:

@inproceedings{D-DSVAE-VC,
  title={Robust Disentangled Variational Speech Representation Learning for Zero-shot Voice Conversion},
  author={Lian, Jiachen and Zhang, Chunlei and Yu, Dong},
  booktitle={IEEE ICASSP},
  year={2022},
  organization={IEEE}
}

The previous demo is here https://jlian2.github.io/Robust-Voice-Style-Transfer/.

Demo for this work: https://jlian2.github.io/Improved-Voice-Conversion-with-Conditional-DSVAE/.

improved-voice-conversion-with-conditional-dsvae's People

Contributors

jlian2


Forkers

janfschr wmaiga

improved-voice-conversion-with-conditional-dsvae's Issues

Additional Questions: Performance outside the training dataset

Hi @jlian2! Thanks for the great work. I have a question about the experimental setup for data usage (training and testing). As I understand it, the VCTK dataset has recordings of 111 speakers made with two microphones (mic1 and mic2). I have the following three questions:

  • Did you use mic1 or mic2 from the VCTK dataset to train the model? Why choose one over the other?
  • Did you use mic1 or mic2 from the VCTK dataset to test the model? Why choose one over the other?
  • If the same mic was used for both training and testing, did you try testing on a different mic while experimenting?

Content Encoder Output in Paper

Hi @jlian2. According to the paper, the content encoder gives a 64-dimensional vector. Am I correct in assuming that 1.6 seconds of content is encoded in this 64-dimensional vector?

Clarifications regarding how the conditional prior is modelled


@jlian2 Thanks for the amazing work! I am implementing the paper and have the following questions about the part of the loss that focuses on the content embeddings.

[equation image: the KL term for the content embeddings]

  • How is this conditional prior modeled? As I understand it (for C-DSVAE-WavLM), we have 50 different possibilities of Y(X) for the whole dataset.

  • For the case of DSVAE, pθ(zc) was modeled using a randomly initialized autoregressive LSTM. How was sampling done for DSVAE using the autoregressive LSTM? (Usually, encoders output a mean/std and we sample, treating them as the parameters of a diagonal multivariate Gaussian.) I have drawn the LSTM and the two-layer NN below, but I don't understand how the sampling happens (see also the sketch after this list).
    [image: my drawing of the autoregressive LSTM and the two-layer NN]

  • Now, for C-DSVAE, for a segment of audio (100 frames), we have 100 Y(X)'s, one corresponding to each frame. How are these content biases used to condition the above prior model?
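To make the second question concrete, here is a minimal sketch of how I imagine sampling from an autoregressive LSTM prior could work; the module names, sizes, and structure are my own guesses, not the authors' code:

import torch
import torch.nn as nn

class ARPrior(nn.Module):
    # Sketch of an autoregressive LSTM prior over z_c (my guess, not the authors' code).
    def __init__(self, z_dim=64, hidden_dim=256):
        super().__init__()
        self.cell = nn.LSTMCell(z_dim, hidden_dim)
        # two-layer NN mapping the hidden state to the Gaussian parameters of the next frame
        self.to_mu = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
                                   nn.Linear(hidden_dim, z_dim))
        self.to_logvar = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
                                       nn.Linear(hidden_dim, z_dim))
        self.z_dim, self.hidden_dim = z_dim, hidden_dim

    def sample(self, batch_size, num_frames):
        z = torch.zeros(batch_size, self.z_dim)            # start input: all zeros
        h = torch.zeros(batch_size, self.hidden_dim)
        c = torch.zeros(batch_size, self.hidden_dim)
        zs = []
        for _ in range(num_frames):
            h, c = self.cell(z, (h, c))
            mu, logvar = self.to_mu(h), self.to_logvar(h)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterised draw
            zs.append(z)                                   # fed back as the next LSTM input
        return torch.stack(zs, dim=1)                      # (batch, num_frames, z_dim)

In this picture the two-layer NN only produces (mu_t, logvar_t) per frame, and the "sampling" is just the reparameterised draw at each step, which is fed back into the LSTM. For the conditional case, I imagine the per-frame label Y(X) (e.g., an embedding of one of the 50 clusters) being concatenated to the LSTM input at each step, but that is also only my guess. Is this roughly right?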

Clarification: Decoder pre/post nets?

[image: decoder architecture from the paper]

Hi @jlian2, how is the decoder able to create an output of shape (80 x 100) with the final layer of the PostNet? It seems to me that the PreNet is what is capable of creating an output of that shape instead.

I was assuming DecoderPrenet and DecoderPostnet are two parts of the whole decoder: the PreNet comes first and the PostNet is closer to the reconstructed output. But based on the decoder architecture, I don't understand how it can reconstruct the mels.
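For what it's worth, my current mental model is an AutoVC/Tacotron-style residual post-net, roughly like the sketch below. This is only my assumption, not necessarily what the paper does: the pre-net/decoder stack already emits a coarse (80 x T) mel, and the post-net only predicts a residual correction of the same shape.

import torch
import torch.nn as nn

class PostNet(nn.Module):
    # Sketch of a residual post-net (my assumption, not the authors' code).
    def __init__(self, n_mels=80, channels=512, n_layers=5):
        super().__init__()
        layers, in_ch = [], n_mels
        for _ in range(n_layers - 1):
            layers += [nn.Conv1d(in_ch, channels, kernel_size=5, padding=2),
                       nn.BatchNorm1d(channels), nn.Tanh()]
            in_ch = channels
        layers += [nn.Conv1d(in_ch, n_mels, kernel_size=5, padding=2)]  # back to 80 mel channels
        self.net = nn.Sequential(*layers)

    def forward(self, mel_coarse):                 # (batch, 80, T), e.g. T = 100
        return mel_coarse + self.net(mel_coarse)   # residual refinement, same (80, T) shape

Under that reading, the final post-net layer does not have to create the (80 x 100) output on its own; it only refines what the earlier layers already produced. Please correct me if the paper's decoder works differently.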

About the speaker KLD loss

Hi, very great paper!
I've tried to implement this model and use it for singing voice conversion, training it on my private singing dataset.

I have a question about the speaker KL-divergence loss. When the speaker prior is kept as a Gaussian prior with zero mean and unit variance, should I apply some constraint or activation function (like LeakyReLU) to the outputs of the speaker mean and variance linear layers?

I ask because during training I found that the speaker KLD loss could not be optimized (it does not get smaller), while the content KLD loss and the reconstruction loss keep going down.
[image: training loss curves]
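For reference, this is the closed-form KL term I am computing against the N(0, I) prior, where mu and logvar are the raw outputs of the mean and log-variance linear layers with no activation applied (my own code, so the problem may well be here):

import torch

def speaker_kld(mu, logvar):
    # KL( N(mu, diag(exp(logvar))) || N(0, I) ), averaged over the batch
    return (-0.5 * (1.0 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1)).mean()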

Looking forward to your response.
Much appreciated!

Simple question

Really great paper here! I read it all. Will the code also be released for the improved version, i.e. this repo (conditional DSVAE), or only for the style transfer one?

Thanks in advance

InstanceNorm question

Hi @jlian2,
I have a general question about the model:
[image: model architecture]
As you said in the paper, the encoder part borrows from DSVAE (the authors there applied the model to video frames, with Conv2d on (frames x batch_size, channels, width x height)). So in your case, for audio, would it be Conv1d on a tensor of shape (batch x frame, channels, frequency), and after that you reshape to (batch, channels, frame, frequency) for InstanceNorm2d?
On the other hand, for the decoder part you said that you used AutoVC as a template, but there the authors use Conv1d on (batch, frequency, frame), so the frequency axis plays the role of the channels. If I want to use InstanceNorm2d there, should I create something like (batch, 1, frame, frequency) from that?
When I did that I got bad results, so I think I have a misunderstanding somewhere. Moreover, when I use instance norm without the parameter affine=True, my reconstruction loss doesn't even decrease.
There is also the issue that my content KLD loss is always near zero.

I would be glad if you could point out where my understanding goes wrong.
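To be concrete, this is the tensor layout I am currently using in the encoder (my own code, so the mistake may be here):

import torch
import torch.nn as nn

B, T, F = 8, 100, 80                       # batch, frames, mel bins
x = torch.randn(B, T, 1, F)                # each frame treated as a 1-channel slice

conv = nn.Conv1d(1, 256, kernel_size=5, padding=2)
h = conv(x.reshape(B * T, 1, F))           # Conv1d on (batch * frame, channels, frequency)
h = h.reshape(B, T, 256, F).permute(0, 2, 1, 3)   # -> (batch, channels, frame, frequency)
h = nn.InstanceNorm2d(256, affine=True)(h)        # instance norm over (frame, frequency)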

Clarification: Compute Requirements

Hi @jlian2, thank you for your clarifications. I need some additional help to understand the compute resources required to train the model. Could you please answer the following questions so that I can reproduce the paper's results?

  • How many GPUs (and which) were used to train the model?
  • How many epochs were used to train the final model?

Clarification: Encoder Implementation Details

Hi @jlian2.

  • While going through the implementation details, I don't understand what is meant by pooling.
  • Is it torch.nn.AvgPool2d? If so, what are the hyperparameters used?

[image: encoder architecture]

  • Additionally, in the shared encoder, how is InstanceNorm2d used? Reading the documentation on PyTorch's website (image below), it says it is meant for 4D data, but as I understand it, the batches in the VAE model would be 3-dimensional: (batch_size, channels, sequence_length). (See also the sketch after this list.)

[image: PyTorch InstanceNorm2d documentation]
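On the last bullet, these are the two ways I can imagine applying instance normalization to a 3D (batch_size, channels, sequence_length) tensor; both are my own guesses, not the authors' code:

import torch
import torch.nn as nn

x = torch.randn(8, 256, 100)   # (batch_size, channels, sequence_length)

# (a) use the 1-D variant, which accepts 3D input directly
y1 = nn.InstanceNorm1d(256, affine=True)(x)

# (b) add a dummy spatial dimension so InstanceNorm2d sees a 4D input, then squeeze it back
y2 = nn.InstanceNorm2d(256, affine=True)(x.unsqueeze(-1)).squeeze(-1)

Which of these (if either) matches what the paper does?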

Clarification: WavLM requirements during cloning

Do we need WavLM outputs when performing cloning? As I understand it, we directly pass the extracted latents into the trained decoder, so we would not need to run WavLM when performing zero-shot voice cloning.
