
improved-voice-conversion-with-conditional-dsvae's Introduction

Towards Improved Voice Conversion with Conditional DSVAE

Paper for this work: https://arxiv.org/abs/2205.05227

This is based on our previous work:

@inproceedings{D-DSVAE-VC,
  title={Robust Disentangled Variational Speech Representation Learning for Zero-shot Voice Conversion},
  author={Lian, Jiachen and Zhang, Chunlei and Yu, Dong},
  booktitle={IEEE ICASSP},
  year={2022},
  organization={IEEE}
}

The previous demo is here https://jlian2.github.io/Robust-Voice-Style-Transfer/.

Demo for this work: https://jlian2.github.io/Improved-Voice-Conversion-with-Conditional-DSVAE/.

improved-voice-conversion-with-conditional-dsvae's People

Contributors

jlian2


Forkers

janfschr wmaiga

improved-voice-conversion-with-conditional-dsvae's Issues

Additional Questions: Performance outside the training dataset

Hi @jlian2! Thanks for the great work. I have a question about the experimental setup for data usage (training and testing). As I understand it, the VCTK dataset has recordings of 111 speakers made with two microphones (mic1 and mic2). I have the following three questions:

  • Did you use mic1 or mic2 from the VCTK dataset to train the model? Why choose one over the other?
  • Did you use mic1 or mic2 from the VCTK dataset to test the model? Why choose one over the other?
  • If the same mic was used for both training and testing, did you try testing on a different mic while experimenting?

Content Encoder Output in Paper

Hi @jlian2. According to the paper, the content encoder gives a 64-dimensional vector. Am I correct in assuming that 1.6 seconds of content is encoded in this 64-dimensional vector?

Clarifications regarding how the conditional prior is modelled


@jlian2 Thanks for the amazing work! I am implementing the paper and have the following questions about the part of the loss that focuses on the content embeddings.

[equation image: the KL term for the content embeddings]

  • How is this conditional prior modeled? As I understand it (for C-DSVAE-WavLM), we have 50 different possibilities of Y(X) for the whole dataset.

  • For the case of DSVAE, pθ(zc) was modeled using a randomly initialized autoregressive LSTM. How was sampling done for DSVAE using the autoregressive LSTM? (Usually, encoders output a mean/std and we sample, treating them as the parameters of a diagonal multivariate Gaussian.) I have drawn the LSTM and the two-layer NN below, but I don't understand how the sampling happens (see also the sketch after this list).
    [image: my drawing of the autoregressive LSTM and the two-layer NN]

  • Now, for C-DSVAE, for a segment of audio (100 frames), we have 100 Y(X)'s, one corresponding to each frame. How are these content biases used to condition the above prior model?
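To make the second question concrete, here is a minimal sketch of how I imagine sampling from an autoregressive LSTM prior could work; the module names, sizes, and structure are my own guesses, not the authors' code:

import torch
import torch.nn as nn

class ARPrior(nn.Module):
    # Sketch of an autoregressive LSTM prior over z_c (my guess, not the authors' code).
    def __init__(self, z_dim=64, hidden_dim=256):
        super().__init__()
        self.cell = nn.LSTMCell(z_dim, hidden_dim)
        # two-layer NN mapping the hidden state to the Gaussian parameters of the next frame
        self.to_mu = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
                                   nn.Linear(hidden_dim, z_dim))
        self.to_logvar = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
                                       nn.Linear(hidden_dim, z_dim))
        self.z_dim, self.hidden_dim = z_dim, hidden_dim

    def sample(self, batch_size, num_frames):
        z = torch.zeros(batch_size, self.z_dim)            # start input: all zeros
        h = torch.zeros(batch_size, self.hidden_dim)
        c = torch.zeros(batch_size, self.hidden_dim)
        zs = []
        for _ in range(num_frames):
            h, c = self.cell(z, (h, c))
            mu, logvar = self.to_mu(h), self.to_logvar(h)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterised draw
            zs.append(z)                                   # fed back as the next LSTM input
        return torch.stack(zs, dim=1)                      # (batch, num_frames, z_dim)

In this picture the two-layer NN only produces (mu_t, logvar_t) per frame, and the "sampling" is just the reparameterised draw at each step, which is fed back into the LSTM. For the conditional case, I imagine the per-frame label Y(X) (e.g., an embedding of one of the 50 clusters) being concatenated to the LSTM input at each step, but that is also only my guess. Is this roughly right?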

Clarification: Decoder pre/post nets?

[image: decoder architecture from the paper]

Hi @jlian2, how is the decoder able to create an output of shape (80 x 100) with the final layer of the PostNet? It seems to me that the PreNet is what is capable of creating an output of that shape instead.

I was assuming DecoderPrenet and DecoderPostnet are two parts of the whole decoder: the PreNet comes first and the PostNet is closer to the reconstructed output. But based on the decoder architecture, I don't understand how it can reconstruct the mels.
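For what it's worth, my current mental model is an AutoVC/Tacotron-style residual post-net, roughly like the sketch below. This is only my assumption, not necessarily what the paper does: the pre-net/decoder stack already emits a coarse (80 x T) mel, and the post-net only predicts a residual correction of the same shape.

import torch
import torch.nn as nn

class PostNet(nn.Module):
    # Sketch of a residual post-net (my assumption, not the authors' code).
    def __init__(self, n_mels=80, channels=512, n_layers=5):
        super().__init__()
        layers, in_ch = [], n_mels
        for _ in range(n_layers - 1):
            layers += [nn.Conv1d(in_ch, channels, kernel_size=5, padding=2),
                       nn.BatchNorm1d(channels), nn.Tanh()]
            in_ch = channels
        layers += [nn.Conv1d(in_ch, n_mels, kernel_size=5, padding=2)]  # back to 80 mel channels
        self.net = nn.Sequential(*layers)

    def forward(self, mel_coarse):                 # (batch, 80, T), e.g. T = 100
        return mel_coarse + self.net(mel_coarse)   # residual refinement, same (80, T) shape

Under that reading, the final post-net layer does not have to create the (80 x 100) output on its own; it only refines what the earlier layers already produced. Please correct me if the paper's decoder works differently.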

About the speaker KLD loss

Hi, very great paper!
I've tried to implement this model and use it for singing voice conversion, training it on my private singing dataset.

I have a question about the speaker KL-divergence loss. When the speaker prior is kept as a Gaussian prior with zero mean and unit variance, should I apply some constraint or activation function (like LeakyReLU) to the outputs of the speaker mean and variance linear layers?

I ask because during training I found that the speaker KLD loss could not be optimized (it does not get smaller), while the content KLD loss and the reconstruction loss keep going down.
[image: training loss curves]
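For reference, this is the closed-form KL term I am computing against the N(0, I) prior, where mu and logvar are the raw outputs of the mean and log-variance linear layers with no activation applied (my own code, so the problem may well be here):

import torch

def speaker_kld(mu, logvar):
    # KL( N(mu, diag(exp(logvar))) || N(0, I) ), averaged over the batch
    return (-0.5 * (1.0 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1)).mean()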

Looking forward to your response.
Much appreciated!

Simple question

Really great paper here! I read it all. Will the code also be released for the improved version, i.e. this repo (conditional DSVAE), or only for the style transfer one?

Thanks in advance

InstanceNorm question

Hi @jlian2,
I have a general question about the model:
[image: model architecture]
As you said in the paper, the encoder part borrows from DSVAE (the authors there applied the model to video frames, with Conv2d on (frames x batch_size, channels, width x height)). So in your case, for audio, would it be Conv1d on a tensor of shape (batch x frame, channels, frequency), and after that you reshape to (batch, channels, frame, frequency) for InstanceNorm2d?
On the other hand, for the decoder part you said that you used AutoVC as a template, but there the authors use Conv1d on (batch, frequency, frame), so the frequency axis plays the role of the channels. If I want to use InstanceNorm2d there, should I create something like (batch, 1, frame, frequency) from that?
When I did that I got bad results, so I think I have a misunderstanding somewhere. Moreover, when I use instance norm without the parameter affine=True, my reconstruction loss doesn't even decrease.
There is also the issue that my content KLD loss is always near zero.

I would be glad if you could point out where my understanding goes wrong.
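To be concrete, this is the tensor layout I am currently using in the encoder (my own code, so the mistake may be here):

import torch
import torch.nn as nn

B, T, F = 8, 100, 80                       # batch, frames, mel bins
x = torch.randn(B, T, 1, F)                # each frame treated as a 1-channel slice

conv = nn.Conv1d(1, 256, kernel_size=5, padding=2)
h = conv(x.reshape(B * T, 1, F))           # Conv1d on (batch * frame, channels, frequency)
h = h.reshape(B, T, 256, F).permute(0, 2, 1, 3)   # -> (batch, channels, frame, frequency)
h = nn.InstanceNorm2d(256, affine=True)(h)        # instance norm over (frame, frequency)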

Clarification: Compute Requirements

Hi @jlian2, thank you for your clarifications. I need some additional help to understand the compute resources required to train the model. Could you please answer the following questions so that I can reproduce the paper's results?

  • How many GPUs (and which) were used to train the model?
  • How many epochs were used to train the final model?

Clarification: Encoder Implementation Details

Hi @jlian2.

  • While going through the implementation details, I don't understand what is meant by pooling.
  • Is it torch.nn.AvgPool2d? If so, what are the hyperparameters used?

[image: encoder architecture]

  • Additionally, in the shared encoder, how is InstanceNorm2d used? Reading the documentation on PyTorch's website (image below), it says it is meant for 4D data, but as I understand it, the batches in the VAE model would be 3-dimensional: (batch_size, channels, sequence_length). (See also the sketch after this list.)

[image: PyTorch InstanceNorm2d documentation]
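On the last bullet, these are the two ways I can imagine applying instance normalization to a 3D (batch_size, channels, sequence_length) tensor; both are my own guesses, not the authors' code:

import torch
import torch.nn as nn

x = torch.randn(8, 256, 100)   # (batch_size, channels, sequence_length)

# (a) use the 1-D variant, which accepts 3D input directly
y1 = nn.InstanceNorm1d(256, affine=True)(x)

# (b) add a dummy spatial dimension so InstanceNorm2d sees a 4D input, then squeeze it back
y2 = nn.InstanceNorm2d(256, affine=True)(x.unsqueeze(-1)).squeeze(-1)

Which of these (if either) matches what the paper does?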

Clarification: WavLM requirements during cloning

Do we need WavLM outputs when performing cloning? As I understand it, we directly pass the extracted latents into the trained decoder, so we would not need to run WavLM when performing zero-shot voice cloning.
