
jxzhanggg / nonparaseq2seqvc_code


Implementation code of non-parallel sequence-to-sequence VC

License: MIT License

Python 99.15% Shell 0.85%
voice-conversion deep-learning text-to-speech pytorch-implementation

nonparaseq2seqvc_code's Introduction

Non-parallel Seq2seq Voice Conversion

Implementation code of Non-Parallel Sequence-to-Sequence Voice Conversion with Disentangled Linguistic and Speaker Representations.

For audio samples, please visit our demo page.

[Figure: structure overview of the model]

Dependencies

  • Python 3.6
  • PyTorch 1.0.1
  • CUDA 10.0

Data

It is recommended that you download the VCTK and CMU-ARCTIC datasets.

Usage

Installation

Install Python dependencies.

$ pip install -r requirements.txt

Feature Extraction

Extract Mel-Spectrograms, Spectrograms and Phonemes

You can use extract_features.py for this step.
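
For a rough idea of what this step produces, here is a minimal sketch of log-spectrogram and log-mel-spectrogram extraction with librosa; the 16 kHz sampling rate and STFT parameters follow a configuration discussed in the issues below and are assumptions, not necessarily the exact settings used by extract_features.py.

import numpy as np
import librosa

def extract_log_features(filename):
    # Load audio; the 16 kHz sampling rate is an assumption.
    y, sr = librosa.load(filename, sr=16000)
    # Magnitude spectrogram (STFT parameters are illustrative).
    spec = np.abs(librosa.stft(y, n_fft=2048, hop_length=200, win_length=800,
                               window='hann', center=True, pad_mode='reflect'))
    log_spec = np.log(spec).astype(np.float32)
    # 80-band mel-spectrogram derived from the same magnitude spectrogram.
    mel = librosa.feature.melspectrogram(S=spec, sr=sr, n_mels=80, power=1.0)
    log_mel = np.log(mel).astype(np.float32)
    return log_spec, log_mel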

Customize data reader

Write a snippet of code that walks through the dataset and generates list files for the train, validation, and test sets; a sketch of such a script is given below.
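
A minimal sketch of such a script, assuming spectrograms are stored as .npy arrays of shape (n_bins, n_frames) and phoneme transcriptions as whitespace-separated .phones files (the directory layout, extensions, and function name are assumptions; the "path frame_count phone_count" line format follows the example given in the issues below):

import os
import numpy as np

def write_file_list(spec_dir, phone_dir, out_path):
    # Each output line: <spectrogram_path> <acoustic_frame_number> <phone_number>
    with open(out_path, 'w') as out:
        for speaker in sorted(os.listdir(spec_dir)):
            spk_dir = os.path.join(spec_dir, speaker)
            for name in sorted(os.listdir(spk_dir)):
                if not name.endswith('.npy'):
                    continue
                spec_path = os.path.join(spk_dir, name)
                n_frames = np.load(spec_path).shape[1]
                phone_path = os.path.join(phone_dir, speaker,
                                          name.replace('.npy', '.phones'))
                with open(phone_path) as pf:
                    n_phones = len(pf.read().split())
                out.write('{} {} {}\n'.format(spec_path, n_frames, n_phones))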

Then you will need to modify the data reader to read your training data. The following are scripts you will need to modify.

For pre-training:

For fine-tuning:

Pre-train the model

Add correct paths to your local data, and run the bash script:

$ cd pre-train
$ bash run.sh

Run the inference code to generate audio samples on the multi-speaker dataset. During inference, our model can be run in either TTS mode (using text inputs) or VC mode (using mel-spectrogram inputs).

$ python inference.py

Fine-tune the model

Fine-tune the model and generate audio samples for a conversion pair. During inference, our model can be run in either TTS mode (using text inputs) or VC mode (using mel-spectrogram inputs).

$ cd fine-tune
$ bash run.sh

Training Time

On a single NVIDIA 1080 Ti GPU, with a batch size of 32, pre-training on VCTK takes approximately 64 hours of wall-clock time. Fine-tuning on two speakers (500 utterances per speaker) with a batch size of 8 takes approximately 6 hours of wall-clock time.

Citation

If you use this code, please cite:

@article{zhangnonpara2020, 
author={Jing-Xuan {Zhang} and Zhen-Hua {Ling} and Li-Rong {Dai}}, 
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing}, 
title={Non-Parallel Sequence-to-Sequence Voice Conversion with Disentangled Linguistic and Speaker Representations}, 
year={2020}, 
volume={28}, 
number={1}, 
pages={540-552}}

Acknowledgements

Part of the code was adapted from the following project:

nonparaseq2seqvc_code's People

Contributors

jrmeyer, jxzhanggg


nonparaseq2seqvc_code's Issues

GPU memory requirements

Hello,
What are the GPU memory requirements to use the model? I am using a GeForce GTX TITAN X with 12G of RAM and I got the following error:
RuntimeError: CUDA out of memory. Tried to allocate 132.88 MiB (GPU 0; 11.93 GiB total capacity; 11.03 GiB already allocated; 107.44 MiB free; 308.29 MiB cached)

Do you have any suggestions about how to overcome this problem?
Thank you.

Training the model for a different language

Hello @jxzhanggg,
First of all, thank you for your helpful replies to the previous issues I posted.
I would like to adapt this voice conversion model to European Portuguese. The thing is, I do not have a dataset as large as VCTK in terms of the number of utterances per speaker. I do have enough training data for at least 5-6 speakers (more than 500 utterances per speaker), sampled at 16 kHz. I tried several configurations, with batch sizes of 8, 16 and 32 for pre-training, but never managed to generate intelligible speech (decoder alignments did not converge). I changed the phonemizer backend in extract_features.py from Festival to Espeak, so that I could obtain phoneme transcriptions in Portuguese. I noticed that the total number of different phonemes increased substantially, from 41 (in English) to 66 (in Portuguese); I assume this makes the decoding task more difficult. Also, I experimented with the fine-tune model and the results improved a little (sometimes one or two words are intelligible, but the utterances are still unintelligible overall).

My questions are the following:

  • Should I try to use the pre-train model, even with only 5-6 speakers, or should I use only the fine-tune model instead?
  • What would you suggest in order to solve the decoder alignment problem?

Thank you very much

Interpreting results

Hi @jxzhanggg,

I think our discussion will be interesting to others, so I'm posting this as a Github issue. If there's another place to better discuss this, let me know.

I would like to hear your thoughts on the results I've gotten from VCTK so far. It's promising, but definitely doesn't sound as good as what your demo page shows. I've pre-trained on VCTK, and now I'm inspecting the output of pre-trained/inference.py.

Training Info

  • Trained on 94 of the VCTK speakers
  • Batch size of 16
  • Single GPU
  • did not use spectrograms (only mel-spectrograms)
  • did not use mean / std normalization
  • Trained for 413,000 iterations (resulting in checkpoint_413000)

Inference Info

  • Griffin-Lim vocoder
  • did not use spectrograms (only mel-spectrograms)
  • did not use mean / std normalization
  • tested on 2 VCTK speakers unseen in training (though they did appear in the validation set)

Results

  • I can hear a muffled human voice, but it is not clear enough to understand
  • alignment looks promising, but not complete

What are your thoughts on this? How can I achieve a better result?

Thank you!

Wav__ref_p374_VC.zip
Ali__ref_p374_VC.pdf
Hid__ref_p374_VC.pdf

Feature Request

Hi Guys,

First of all, thanks for sharing this great research.
I have a question: is it possible to offer a way to train / fine-tune the model directly from Python code?
For me personally, some examples of how to do it would help a lot.

Best regards
Christian Klose

Why consistent_loss_w=0.0 ?

self.consi_w = hparams.consistent_loss_w
if self.consi_w == 0.:
    consist_loss = torch.tensor(0.).cuda()
else:
    consist_loss = self.MSELoss(text_hidden, mel_hidden)
    mask = text_mask.unsqueeze(2).expand(-1, -1, text_hidden.size(2))
    consist_loss = torch.sum(consist_loss * mask) / torch.sum(mask)

I see the weight is set to zero here. Why not calculate this loss?
Does that mean the linguistic representations extracted from audio signals and from phoneme sequences are not similar?

Documentation: Training Time

In the README there should be an estimate of training time. For example:

On 4 NVIDIA Titan GPUs, with a batch size of 32, pre-training on VCTK takes approximately 24 hours of wall-clock time. Fine-tuning on two speakers with this set-up takes approximately 2 hours of wall-clock time.

I made up those numbers because I still haven't finished pre-training and fine-tuning. I have been pre-training for >15 hours on a GeForce RTX 2080 Ti with 12G of RAM, and I'm not sure when it's going to finish. I had to cut the batch size from 32 down to 16 to avoid OOM errors.

Upgrade to Python3

Warning when installing PyTorch with Python 2.7:

DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7. More details about Python 2 support in pip, can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support

Upgrade to TensorFlow 2.0 (or not?)

I'm getting some warnings about TF contrib being deprecated... it's not obvious to me that the repo should be upgraded to TF 2.0, but it should be considered.

(venv) josh@thor:~/git/nonparaSeq2seqVC_code/pre-train$ bash run.sh 
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

Fine-tuning help

It's unclear, given the lack of comments in run.sh and the lack of a README in the fine-tuning folder, how to set up the ARCTIC data or other voice data for fine-tuning. The order seems clear: get the embedding, train using the embedding, and then run inference on unseen source voice recordings. Would it be possible to create a README for fine-tuning that describes the data setup and process in more detail?

Did you use a silence symbol?

What do 'pau' and 'SOS/EOS' in the symbol list mean? Does 'pau' mean a silent pause, and 'SOS/EOS' the start or end of a sentence? How are 'pau' and 'SOS/EOS' used?

Multi-GPU training

Hello,
Could you please specify the steps to enable multi-GPU training?
I set distributed_run=True in hparams.py and then set --n_gpus=2 and CUDA_VISIBLE_DEVICES=0,3 in run.sh to select GPUs 0 and 3, respectively. After doing this, the code seems to enter some kind of deadlock and never starts training.
Thank you.

Transition agent in forward attention

Hi,
I wanted to modify the speaking rate of the converted speech. As mentioned in your paper on forward attention, I was looking for the transition agent to modify the speaking rate, but I could not find it in the code. Can you please let me know how to change the transition agent / speaking rate in the code? It would be very helpful for me.

Thanks and regards,
Narendra

Speaker and linguistic embedding visualizations do not look good as in the paper

Hi @jxzhanggg ,

I trained your model and the converted speech sounds promising (I also attached some samples below).
Then I tried to visualize the speaker and linguistic embeddings. However, they do not overlap as cleanly as in the paper. Moreover, there are still some outliers lying where they should not be (you can observe this in the figures below).
[Figure: speaker and linguistic embedding visualization]
So I'm wondering whether this is due to poorly chosen parameters for the t-SNE visualization function (e.g. perplexity, iterations, learning_rate, etc.) or something else.

Could you give me some comments on this?
Thank you!

samples.zip

Documentation: Audio + Text Feature Extraction

The usage instructions are missing some information on the feature preprocessing step.

  • It would be helpful to give more exact instructions on how to extract spectrograms and phoneme features. Does the code expect phoneme .lab files from Festival, as indicated in r9y9's deepvoice code? https://github.com/r9y9/deepvoice3_pytorch/tree/master/vctk_preprocess
  • Is it possible to use the deepvoice code to get the exact features expected by nonparaSeq2SeqVC? If so, which scripts are needed?

I'm going to try out the code now, and I'll send PRs on documentation when I'm confident I can add something.

Thanks!

> They are my prepared training list.

They are my prepared training list.
Each line looks like this:
spectrogram_path acoustic_frame_number phone_number
For example:
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_090.npy 135 22
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_118.npy 145 23
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_014.npy 365 52
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_179.npy 103 11
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_309.npy 57 7
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_353.npy 142 24
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_012.npy 310 49
...

How did you get the acoustic_frame_number (e.g. 135, 145, 365, 103, 57, ...)? I used the extract_mel_spec function in extract_features.py but got different frame numbers, although the phone_number is the same. Here is my result:

/VCTK/spec/p225/log-spec-p225_090.npy 346 22
/VCTK/spec/p225/log-spec-p225_118.npy 315 23
/VCTK/spec/p225/log-spec-p225_014.npy 533 52
/VCTK/spec/p225/log-spec-p225_179.npy 250 11

and my parameters are:
y, sample_rate = librosa.load(filename, sr=16000)
spec = librosa.core.stft(y=y, n_fft=2048, hop_length=200, win_length=800, window='hann', center=True, pad_mode='reflect')
spec = librosa.magphase(spec)[0]
log_spectrogram = np.log(spec).astype(np.float32)
mel_spectrogram = librosa.feature.melspectrogram(S=spec, sr=sample_rate, n_mels=80, power=1.0, fmin=0.0, fmax=None, htk=False, norm=1)
log_mel_spectrogram = np.log(mel_spectrogram).astype(np.float32)

Originally posted by @Alphadone in #2 (comment)
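
As a general note (not a statement about the exact settings used for the list above): with center=True, librosa's STFT produces about 1 + len(y) // hop_length frames, so a mismatch in frame numbers for the same utterance usually comes down to a different hop_length or sampling rate. A quick sanity check, with a hypothetical file path:

import librosa

filename = 'p225_090.wav'  # hypothetical path to one of your wav files
y, sr = librosa.load(filename, sr=16000)
hop_length = 200
expected_frames = 1 + len(y) // hop_length
spec = librosa.stft(y, n_fft=2048, hop_length=hop_length, center=True)
print(spec.shape[1], expected_frames)  # these should match, or be very close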

ValueError: too many values to unpack (expected 3)

Hello, I tried the code but I'm facing a bug:

nonparaSeq2seqVC_code-master/pre-train/reader/reader.py", line 18, in read_text
    start, end, phone = line.strip().split()
ValueError: too many values to unpack (expected 3)

It seems that the read_text function needs the ".phones" files; however, the phone file produced by "extract_features.py" looks like this:

ae s k hh er t ax b r ih ng dh iy z th ih ng z w ih dh hh er f r ah m dh ax s t ao r

How can I get the right format of phones?
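
For what it's worth, the traceback in this issue shows read_text splitting each line into exactly three fields (start, end, phone), which suggests the expected .phones file is a time-aligned label file with one "start end phone" triple per line rather than a flat phoneme string. A purely hypothetical example of that layout (the time values and their units are made up):

0 1450000 pau
1450000 2300000 ae
2300000 3100000 s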

Alignment problem

When I start training on VCTK, it seems something is wrong with the alignment.
step-28000-tts.pdf
step-28000-vc.pdf
I met the same problem when training Tacotron 2, and the author suggested using a pre-trained model. Does this code have a pre-trained model? Or is there any other method to deal with this problem?

list file format

In the readme, you say, "Write a snippet of code to walk through the dataset for generating list file for train, valid and test set." What is the format of the list file? Can you post an example list file?

Normalizing features

Hi @jxzhanggg ,
I tried to do feature extraction without mel normalization and then ran the pre-train code. However, the results were not very good; you can listen to one of my samples below.
As you said, it is recommended to normalize the mel-spectrogram beforehand so that the model converges properly, so I'm trying to do that.

But I'm confused about how you calculated the mean and std, because each sample has a different length; for example, the extracted mel-spectrogram of one utterance has shape (80, 335) while another has shape (80, 500), so the shapes differ.
Could you please explain how you do it? And, if possible, could you please share that part of the code?
Thank you!
Wav_275_146_ref_p293_VC.zip
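
For context, here is a minimal sketch of one common way to compute per-mel-bin mean and std over variable-length utterances (an illustration of the general technique, not necessarily what this repo's script does): concatenate all training frames along the time axis, so differing utterance lengths do not matter.

import numpy as np

def compute_mel_mean_std(mel_paths):
    # Each file holds an (n_mels, n_frames) array; n_frames varies per utterance.
    frames = np.concatenate([np.load(p) for p in mel_paths], axis=1)
    mean = frames.mean(axis=1, keepdims=True)  # shape (n_mels, 1)
    std = frames.std(axis=1, keepdims=True)    # shape (n_mels, 1)
    return mean, std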

Include pre-trained models

Hi Jing-Xuan,

Thanks for open-sourcing this!

If you include some pre-trained models (maybe hosted on Git LFS), it would be very useful for the research / open-source community.

-josh

Keeping prosodic features of reference Speaker

Hi @jxzhanggg,

I am trying to achieve voice conversion with this algorithm applied to prosody training. This means that I want to convert a reference audio (Speaker A) to the voice of a user (Speaker B), while maintaining the original phone durations and the pitch contour (with a different mean f0) of the reference speaker (Speaker A).

Right now I have managed to pre-train and fine-tune the model, and the voice conversion works well: the output is very similar to the target. But all the prosodic features of the reference were lost.

Do you have any idea where I may need to tweak to achieve this result, even at a slight cost in audio quality? Did you ever attempt this, or have an idea of which parameters need to be changed?

Thanks in advance!

Pedro Sousa

Pre-train model results

Hello,

I trained the pre-train model with the following specs:

  • mel_mean_std, spec_mean_std for feature normalization, and phonemes were obtained by running the script extract_features.py;
  • 99 speakers were used, and train/evaluation/test sets were created according to the paper (10 utterances per speaker in eval_set, 20 utterances per speaker in the test set and the rest in the training set);
  • learning rate decay of 0.95 every 1000 steps;
  • batch size = 32.

I obtained intelligible but poor results in terms of voice conversion and quality in general. Also, I noticed that the generated VC speech seems to be slower than the original source utterances. In addition, many of the generated samples (typically containing 2-4 seconds of speech) have large sections of silence, sometimes more than 20 seconds long. I have attached some of the samples (after 200k training steps) and source utterances below. What could explain these problems?
samples_checkpoint_200000.zip

Additionally, I would like to ask if the following issues could be some of the reasons for these bad results:

  • Since I ran extract_features.py, mel_mean_std and spec_mean_std were obtained from all 109 speakers in the dataset, but I use only 99 speakers; should I compute mel_mean_std and spec_mean_std from only the 99 speakers I use? Furthermore, should mel_mean_std and spec_mean_std be obtained only from data in the training set?
  • Also, I plotted the speaker embedding for some utterances in the training set (10 speakers, 12 utterances per speaker) and although the clusters seem quite good, they are not linearly separable in terms of speaker gender (male/female), as the paper suggests. In the plot below, triangles represent female speakers, and circles represent male speakers.
    [Figure: speaker embedding plot at checkpoint 200000]

Thank you very much

assignment statement of y_tts_pred & y_vc_pred

Hi, when I had prepared everything for pre-training, it raised an error saying "local variable 'y_tts' referenced before assignment". So I added a statement before the 'for' loop, setting y_tts and y_tts_pred to lists. But another error came: "ValueError: not enough values to unpack (expected 12, got 0)", which referred to y_pred.
I think there may be a need to add initialization statements for y_tts, y_vc, y_tts_pred and y_vc_pred. Besides, is only one of y_tts_pred or y_vc_pred assigned at a time? Does that mean the other is set to None?
