wendison / vqmivc

Official implementation of VQMIVC: One-shot (any-to-any) Voice Conversion @ Interspeech 2021 + Online playing demo!

License: MIT License

voice-conversion speech speech-generation one-shot disentanglement-learning

vqmivc's Introduction

VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion (Interspeech 2021)


Integrated into Hugging Face Spaces with Gradio. See the Gradio Web Demo.

Pre-trained models: google-drive or here | Paper demo

This paper proposes a speech representation disentanglement framework for one-shot/any-to-any voice conversion, which performs conversion across arbitrary speakers with only a single target-speaker utterance as reference. Vector quantization with contrastive predictive coding (VQCPC) is used for content encoding, and mutual information (MI) is introduced as the correlation metric during training to achieve proper disentanglement of content, speaker, and pitch representations by reducing their inter-dependencies in an unsupervised manner.
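For orientation, here is a minimal sketch of the conversion flow. It mirrors the module names and tensor shapes that appear in the inference code discussed in the issues below (Encoder, Encoder_lf0, SpeakerEncoder, Decoder_ac); convert_example.py is the authoritative entry point and may differ in detail.

# Conceptual sketch only (module names and shapes assumed from the inference code quoted in the issues below).
import torch
from model_encoder import Encoder, Encoder_lf0
from model_encoder import SpeakerEncoder as Encoder_spk
from model_decoder import Decoder_ac

encoder = Encoder(in_channels=80, channels=512, n_embeddings=512, z_dim=64, c_dim=256)
encoder_lf0 = Encoder_lf0()
encoder_spk = Encoder_spk()
decoder = Decoder_ac(dim_neck=64)

src_mel = torch.randn(1, 80, 200)   # (1, n_mels, frames): log-mel of the source utterance
src_lf0 = torch.randn(1, 200)       # normalized log-F0 of the source utterance
ref_mel = torch.randn(1, 80, 180)   # log-mel of the single reference utterance (target speaker)

with torch.no_grad():
    z, *_ = encoder.encode(src_mel)                # content representation (VQCPC)
    lf0_embs = encoder_lf0(src_lf0)                # pitch representation
    spk_emb = encoder_spk(ref_mel)                 # speaker representation from one utterance
    converted_mel = decoder(z, lf0_embs, spk_emb)  # mel spectrogram to be vocoded by ParallelWaveGAN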

📢 Update

Many thanks to ericguizzo & AK391!

  1. A Replicate demo is available online, so you can try our pre-trained models there. Have fun!
  2. VQMIVC can now be trained and tested inside a Docker environment via Cog.
  3. A Gradio Web Demo is available as another online demo!

TODO

  • Add more details on how to use Cog for development

Requirements

Python 3.6 is used. Optionally install apex to speed up training; the other requirements are listed in 'requirements.txt':

pip install -r requirements.txt
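apex is only needed if you want mixed-precision training (the checkpoint names such as useAmpTrue suggest an amp switch). The snippet below is an illustrative sketch of the usual apex amp pattern, not the exact wiring in train.py:

# Illustrative apex amp usage (generic pattern; train.py's actual code may differ).
import torch
from apex import amp

model = torch.nn.Linear(80, 80).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")  # enable mixed precision

x = torch.randn(16, 80).cuda()
loss = model(x).pow(2).mean()
with amp.scale_loss(loss, optimizer) as scaled_loss:  # scale the loss so fp16 gradients don't underflow
    scaled_loss.backward()
optimizer.step()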

Quick start with pre-trained models

ParallelWaveGAN is used as the vocoder, so please first install ParallelWaveGAN to try the pre-trained models:

python convert_example.py -s {source-wav} -r {reference-wav} -c {converted-wavs-save-path} -m {model-path} 

For example:

python convert_example.py -s test_wavs/p225_038.wav -r test_wavs/p334_047.wav -c converted -m checkpoints/useCSMITrue_useCPMITrue_usePSMITrue_useAmpTrue/VQMIVC-model.ckpt-500.pt 

The converted wav is written to the 'converted' directory.
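To convert several pairs in one go, a small wrapper script (hypothetical, not part of the repo) can simply loop over the same CLI flags:

# Hypothetical batch wrapper around convert_example.py, using the flags documented above.
import subprocess

ckpt = "checkpoints/useCSMITrue_useCPMITrue_usePSMITrue_useAmpTrue/VQMIVC-model.ckpt-500.pt"
pairs = [
    ("test_wavs/p225_038.wav", "test_wavs/p334_047.wav"),
    # add more (source, reference) pairs here
]
for src, ref in pairs:
    subprocess.run(
        ["python", "convert_example.py", "-s", src, "-r", ref, "-c", "converted", "-m", ckpt],
        check=True,  # raise if a conversion fails
    )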

Training and inference:

  • Step1. Data preparation & preprocessing
  1. Put VCTK corpus under directory: 'Dataset/'

  2. Training/testing speakers split & feature (mel+lf0) extraction:

     python preprocess.py
    
  • Step2. Model training:
  1. Training with mutual information minimization (MIM):

     python train.py use_CSMI=True use_CPMI=True use_PSMI=True
    
  2. Training without MIM:

     python train.py use_CSMI=False use_CPMI=False use_PSMI=False 
    
  • Step3. Model testing:
  1. Put the PWG vocoder under the 'vocoder/' directory (convert.py invokes the parallel-wavegan-decode CLI; see the note after this list)

  2. Inference with model trained with MIM:

     python convert.py checkpoint=checkpoints/useCSMITrue_useCPMITrue_usePSMITrue_useAmpTrue/model.ckpt-500.pt
    
  3. Inference with model trained without MIM:

     python convert.py checkpoint=checkpoints/useCSMIFalse_useCPMIFalse_usePSMIFalse_useAmpTrue/model.ckpt-500.pt
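As the tracebacks in the issues below show, convert.py synthesizes waveforms by calling the parallel-wavegan-decode command-line tool through subprocess, so ParallelWaveGAN must be installed in the same environment and its CLI must be resolvable on PATH. A quick (hypothetical) sanity check:

# Hypothetical check: convert.py shells out to parallel-wavegan-decode, so it must be on PATH.
import shutil

if shutil.which("parallel-wavegan-decode") is None:
    raise RuntimeError(
        "parallel-wavegan-decode not found; install ParallelWaveGAN in this environment."
    )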
    

Citation

If the code is used in your research, please Star our repo and cite our paper:

@inproceedings{wang21n_interspeech,
  author={Disong Wang and Liqun Deng and Yu Ting Yeung and Xiao Chen and Xunying Liu and Helen Meng},
  title={{VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-Shot Voice Conversion}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={1344--1348},
  doi={10.21437/Interspeech.2021-283}
}

Acknowledgements:

  • The content encoder is borrowed from VectorQuantizedCPC, which also inspires the within-utterance negative sampling for CPC;
  • The speaker encoder is borrowed from AdaIN-VC;
  • The decoder is modified from AutoVC;
  • Estimation of mutual information is modified from CLUB;
  • Speech features extraction is based on espnet and Pyworld.

vqmivc's People

Contributors

ak391, andreasjansson, ericguizzo, wendison


vqmivc's Issues

"Buffer has wrong number of dimensions (expected 1, got 2)" error when running demo on Replicate

Hello, when I run the demo on Replicate with 2 custom files I get a "Buffer has wrong number of dimensions (expected 1, got 2)" error. This is the error log:
Running predict()...
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/cog/server/redis_queue.py", line 132, in start
self.handle_message(response_queue, message, cleanup_functions)
File "/usr/local/lib/python3.6/site-packages/cog/server/redis_queue.py", line 188, in handle_message
result = run_prediction(self.predictor, inputs, cleanup_functions)
File "/usr/local/lib/python3.6/site-packages/cog/predictor.py", line 62, in run_prediction
result = predictor.predict(**inputs)
File "/usr/local/lib/python3.6/site-packages/cog/input.py", line 80, in wraps
return f(self, **kwargs)
File "/usr/local/lib/python3.6/site-packages/cog/input.py", line 80, in wraps
return f(self, **kwargs)
File "/src/predict.py", line 113, in predict
src_mel, src_lf0 = extract_logmel(src_wav_path, self.mean, self.std)
File "/src/predict.py", line 49, in extract_logmel
f0, timeaxis = pw.dio(wav.astype("float64"), fs, frame_period=frame_period)
File "pyworld/pyworld.pyx", line 93, in pyworld.pyworld.dio
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
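pw.dio expects a one-dimensional float64 signal, so a two-channel (stereo) input produces exactly this ValueError. A minimal sketch of a fix, assuming the custom files are stereo, is to down-mix before feature extraction:

# Minimal sketch (assumes the failing input is a stereo file): down-mix to mono so that
# pw.dio receives the 1-D buffer it expects.
import numpy as np
import soundfile as sf
import pyworld as pw

wav, fs = sf.read("input.wav")   # stereo files come back with shape (samples, channels)
if wav.ndim > 1:
    wav = wav.mean(axis=1)       # average the channels -> shape (samples,)
f0, timeaxis = pw.dio(wav.astype("float64"), fs, frame_period=10.0)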

Voice conversion does not happen after fine-tuning with the pretrained model

Hi @Wendison

Thank you so much for this great work.

I fine-tuned (resumed) the pretrained model (use_CSMI=True use_CPMI=True use_PSMI=True) on the indicTTS dataset (20 speakers, each with 1 hour of audio).

The model was trained for 1000 epochs.

Quality gets better for the target speaker, but the source speaker's modulation is not converted.

Can you please give your suggestions?

Thanks

preprocess issue

After downloading the VCTK Corpus and copying the files under 'Dataset/' (and creating a directory 'Dataset/VCTK-Corpus/' that includes the file speaker-info.txt), I run preprocess.py and get the following result. How can I fix this?

(voice-clone) C:\Python\VQMIVC>python preprocess.py
all_spks: ['257', '294', '304', '297', '226', '282', '247', '330', '361', '252', '293', '306', '340', '231', '268', '283', '243', '334', '315', '269', '285', '310', '230', '311', '374', '307', '286', '323', '245', '227', '239', '240', '363', '284', '251', '318', '246', '265', '244', '228', '333', '276', '255', '225', '308', '260', '339', '312', '336', '347', '345', '258', '335', '270', '376', '237', '316', '326', '364', '273', '263', '259', '267', '292', '232', '229', '254', '264', '287', '278', '236', '317', '272', '233', '234', '248', '249', '305', '299', '281', '302', '329', '262', '351', '288', '298', '250', '343', '256', '300', '275', '341', '279', '277', '271', '241', '303', '274', '313', '266', '301', '253', '261', '314', '295', '360', '362', '238']
len(spk_wavs): 0
len(spk_wavs): 0
len(spk_wavs): 0
.
.
.
len(spk_wavs): 0
len(spk_wavs): 0
len(spk_wavs): 0
0
0
0
extract log-mel...
0it [00:00, ?it/s]
normalize log-mel...
Traceback (most recent call last):
File "preprocess.py", line 141, in
mels = np.concatenate(mels, 0)
File "<array_function internals>", line 6, in concatenate
ValueError: need at least one array to concatenate
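The repeated len(spk_wavs): 0 lines mean that no audio files were collected for any speaker before feature extraction, so there is nothing to concatenate. As a later issue notes, preprocess.py appears to expect *.flac files (as in VCTK-Corpus-0.92), so a quick check like the following (hypothetical paths) shows what is actually present under 'Dataset/':

# Hypothetical sanity check: confirm audio files exist where preprocess.py looks for them,
# and with the extension it expects (*.flac for VCTK-Corpus-0.92).
from pathlib import Path

root = Path("Dataset/VCTK-Corpus")
print("flac files:", len(list(root.rglob("*.flac"))))
print("wav files: ", len(list(root.rglob("*.wav"))))
# If only wav files show up, either convert them to flac or adapt the glob pattern that
# preprocess.py uses to collect each speaker's utterances.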

Other vocoder

Hello, I tried using another vocoder, HiFi-GAN, with your model,
but I ran into a problem: the output audio is noisy or silent.

I transposed the log-mel output into the regular HiFi-GAN input shape [1, 80, X].
With int16 I get very noisy audio; with int32 I get silence.

My inference code:

import torch
import numpy as np
from scipy.io.wavfile import write

from hifi_gan.env import AttrDict
from hifi_gan.meldataset import mel_spectrogram, MAX_WAV_VALUE, load_wav
from hifi_gan.models_hifi import Generator
import soundfile as sf
from model_encoder import Encoder, Encoder_lf0
from model_decoder import Decoder_ac
from model_encoder import SpeakerEncoder as Encoder_spk
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '1'
import subprocess
from spectrogram import logmelspectrogram
import resampy
import pyworld as pw
import argparse

def extract_logmel(wav_path, mean, std, sr=16000):
    # wav, fs = librosa.load(wav_path, sr=sr)
    wav, fs = sf.read(wav_path)
    if fs != sr:
        wav = resampy.resample(wav, fs, sr, axis=0)
        fs = sr
    #wav, _ = librosa.effects.trim(wav, top_db=15)
    # duration = len(wav)/fs
    assert fs == 16000
    peak = np.abs(wav).max()
    if peak > 1.0:
        wav /= peak
    mel = logmelspectrogram(
                x=wav,
                fs=fs,
                n_mels=80,
                n_fft=400,
                n_shift=160,
                win_length=400,
                window='hann',
                fmin=80,
                fmax=7600,
            )
    
    mel = (mel - mean) / (std + 1e-8)
    tlen = mel.shape[0]
    frame_period = 160/fs*1000
    f0, timeaxis = pw.dio(wav.astype('float64'), fs, frame_period=frame_period)
    f0 = pw.stonemask(wav.astype('float64'), f0, timeaxis, fs)
    f0 = f0[:tlen].reshape(-1).astype('float32')
    nonzeros_indices = np.nonzero(f0)
    lf0 = f0.copy()
    lf0[nonzeros_indices] = np.log(f0[nonzeros_indices]) # for f0(Hz), lf0 > 0 when f0 != 0
    mean, std = np.mean(lf0[nonzeros_indices]), np.std(lf0[nonzeros_indices])
    lf0[nonzeros_indices] = (lf0[nonzeros_indices] - mean) / (std + 1e-8)
    return mel, lf0

### Load Vocoder
import json
config_file = '../FastPitch/hifi/config.json'
hifi = '../FastPitch/hifi/g_02500000'

with open(config_file) as f:
        data = f.read()
json_config = json.loads(data)
h = AttrDict(json_config)

generator_hifi = Generator(h).to('cuda')
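# NOTE: load_checkpoint is not defined or imported in this snippet; it is presumably the
# helper from the HiFi-GAN inference utilities (roughly torch.load(filepath, map_location=device)).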
state_dict_g = load_checkpoint(hifi, 'cuda')
generator_hifi.load_state_dict(state_dict_g['generator'])
generator_hifi.eval()
generator_hifi.remove_weight_norm()

checkpoint_path = 'All_model.ckpt-350.pt'
### load_model
encoder = Encoder(in_channels=80, channels=512, n_embeddings=512, z_dim=64, c_dim=256)
encoder_lf0 = Encoder_lf0()
encoder_spk = Encoder_spk()
decoder = Decoder_ac(dim_neck=64)
encoder.to('cuda')
encoder_lf0.to('cuda')
encoder_spk.to('cuda')
decoder.to('cuda')

checkpoint = torch.load(checkpoint_path, map_location=lambda storage, loc: storage)
encoder.load_state_dict(checkpoint["encoder"])
encoder_spk.load_state_dict(checkpoint["encoder_spk"])
decoder.load_state_dict(checkpoint["decoder"])

encoder.eval();
encoder_spk.eval();
decoder.eval();


def convert(src_wav_path, ref_wav_path, generator_hifi):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
    mel_stats = np.load('./mel_stats/stats.npy')
    mean = mel_stats[0]
    std = mel_stats[1]
    src_mel, src_lf0 = extract_logmel(src_wav_path, mean, std)
    ref_mel, _ = extract_logmel(ref_wav_path, mean, std)
    src_mel = torch.FloatTensor(src_mel.T).unsqueeze(0).to(device)
    src_lf0 = torch.FloatTensor(src_lf0).unsqueeze(0).to(device)
    ref_mel = torch.FloatTensor(ref_mel.T).unsqueeze(0).to(device)
    out_filename = os.path.basename(src_wav_path).split('.')[0] 
    with torch.no_grad():
        z, _, _, _ = encoder.encode(src_mel)
        lf0_embs = encoder_lf0(src_lf0)
        spk_emb = encoder_spk(ref_mel)
        output = decoder(z, lf0_embs, spk_emb)
        output = output.transpose(1,2)
        # output[0] = 2.718281**output[0]
    
    print('synthesize waveform...')
    
    with torch.no_grad():
        #mel = torch.FloatTensor(mel.cpu()).to(device)
        y_g_hat = generator_hifi(output)
        audio = y_g_hat.squeeze()
        audio = audio * MAX_WAV_VALUE
        audio = audio.cpu().numpy().astype('int16')
        
    return audio

audio = convert('wav1.wav', 'wav2.wav',
        generator_hifi)
write('test.wav', 16000, audio)

How can I calculate suitable parameters?

Hello, I tried to run training on a large dataset with more than 1,000,000 speakers, but the loss values were very bad. Can you please tell me how to choose the parameters for training to achieve good results? (batch_size, n_prediction_steps, n_negatives, etc.)
The results on small data (VCTK + LibriTTS) were good.
You have a formula in your code: warmup_epochs = 2000 // (len(dataset)//cfg.training.batch_size)
Why 2000? What is the optimal number of warmup epochs?
I tried to increase the batch size to speed up training, but during the experiments it turned out that this worsened the results.
I will be glad for any advice!
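As an aside on the quoted formula: since len(dataset) // cfg.training.batch_size is the number of optimizer steps per epoch, the expression pins the warm-up to roughly a fixed budget of steps (on the order of 2000) rather than a fixed number of epochs, assuming len(dataset) counts training utterances. A worked example with made-up sizes:

# Worked example (dataset sizes are made up for illustration).
for n_utts, batch_size in [(40_000, 256), (400_000, 256)]:
    steps_per_epoch = n_utts // batch_size
    warmup_epochs = 2000 // steps_per_epoch
    # total warm-up steps stay on the order of 2000 (coarser for very large datasets
    # because of the integer divisions)
    print(n_utts, warmup_epochs, warmup_epochs * steps_per_epoch)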

Vocoder issue during inference

Hi Sir,

First of all, thank you for sharing your work.

Now I am running into an issue with inference, as shown below:

Traceback (most recent call last):
File "convert.py", line 201, in
convert(config)
File "convert.py", line 194, in convert
subprocess.call(cmd)
File "/home/tts/xxxx/softWare/miniConda/miniconda3/envs/ft_tts/lib/python3.6/subprocess.py", line 287, in call
with Popen(*popenargs, **kwargs) as p:
File "/home/tts/xxxx/softWare/miniConda/miniconda3/envs/ft_tts/lib/python3.6/subprocess.py", line 729, in init
restore_signals, start_new_session)
File "/home/tts/xxxx/softWare/miniConda/miniconda3/envs/ft_tts/lib/python3.6/subprocess.py", line 1364, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
PermissionError: [Errno 13] Permission denied: 'parallel-wavegan-decode'

What can I do to solve this problem? The pretrained vocoder has been put in the vocoder directory.

(tts) [xxxx@training VQMIVC]$ ll vocoder/
total 4
lrwxrwxrwx 1 xxxx xxxx 53 Jun 25 10:50 checkpoint-3000000steps.pkl -> ../pretrain_model/vocoder/checkpoint-3000000steps.pkl
lrwxrwxrwx 1 xxxx xxxx 36 Jun 25 10:50 config.yml -> ../pretrain_model/vocoder/config.yml
-rw-r--r-- 1 xxxx xxxx 39 Jun 24 17:53 README.md
lrwxrwxrwx 1 xxxx xxxx 34 Jun 25 10:50 stats.h5 -> ../pretrain_model/vocoder/stats.h5

How to solve this problem?

Dear Dr. Wang:
When I run convert.py, I hit this problem and cannot solve it. Can you give me some suggestions? Thank you very much!
Error:
Traceback (most recent call last):
File "convert.py", line 168, in convert
'--feats-scp', f'{str(out_dir)}/feats.1.scp', '--outdir', str(out_dir)])
File "/home/liyp/anaconda3/envs/xll/lib/python3.6/subprocess.py", line 287, in call
with Popen(*popenargs, **kwargs) as p:
File "/home/liyp/anaconda3/envs/xll/lib/python3.6/subprocess.py", line 729, in init
restore_signals, start_new_session)
File "/home/liyp/anaconda3/envs/xll/lib/python3.6/subprocess.py", line 1364, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'parallel-wavegan-decode': 'parallel-wavegan-decode'

Training loss plunged

Hello, thank you for your awesome work. I trained the model on the VCTK dataset, but I ran into an issue:
the perplexity term of the training loss plunged after epoch 149. Have you seen the same issue?

[attached image: training-loss curves]

Btw, could you help me check whether my loss is normal?
Thank you!

How to solve this problem?

Dear Dr. Wang:
I have already installed ParallelWaveGAN; however, when I run egs/vctk/voc1/run.sh, it fails with the error:
Stage 0: Data preparation
Successfully split data directory.
ERROR: num_first + num_second must be the same # utts in src. (20 vs 10)
I wonder if you can help me solve this problem. Thank you very much.

Demo: the source in female-to-male does not sound right

I was listening to your samples, and the source-speaker samples in the "Female-to-male" section sounded rather... damaged (especially the first one).
At first, I thought it might be some sort of artifact from downsampling, but the other source-speaker samples sounded just fine.
So I was just wondering if that's how they are supposed to sound, or if there are some issues with those specific samples.

Thank you very much.

Where can I get the silence-trimmed VCTK corpus?

Hi,

Thank you for sharing your code!
I wonder where I can get the silence-trimmed VCTK corpus.
Since the VCTK dataset I have only contains *.wav files, while your preprocess.py script seems to expect all audio files in *.flac format, I cannot run the script.

Question about the model

I tried to train the model again. After I finished the process, I used the model I trained myself for voice conversion, but I got nothing. Could you give me some advice? I have done everything following the README.

Training Loss Abnormal

@andreasjansson @Wendison Hello, sorry to interrupt you! I'm a rookie at voice models. I trained the model on the VCTK-Corpus-0.92.zip dataset with "python3 train.py use_CSMI=True use_CPMI=True use_PSMI=True" on an NVIDIA V100S. But after 65 epochs, the training losses are as follows:
[attached image: training-loss log after 65 epochs]
Could you give me some advice? Thank you very much!

loglikeli and mi_est

I understand that in mi_first_forward, the network to be optimized is CLUBSample_*, which is trained to estimate mu and logvar. In mi_second_forward, the core network (the VC part) is optimized. In mi_first_forward, loglikeli is computed, while in mi_second_forward, mi_est is computed. I am confused:

Are the values of loglikeli and mi_est both used for representation disentanglement?
Why is loglikeli used instead of mi_est in mi_first_forward?

Question about different embeddings/representations

Dear Team,
Thank you very much for your paper and your code. I have a query about the representations.
I understand that MI minimization makes the representations distinct from each other and they embed separate information.
However, it is not clear to me how it ensures that the speaker encoder encodes only speaker information and the pitch encoder encodes only pitch information. Is there any loss other than the ones mentioned in the paper?
Sorry if I am missing something. I will be grateful if you can shed a bit of light on that.

Thanks
Arnab

Questions about the evaluation

Thanks for sharing the code. I have the following questions for which I hope you could clarify for me, if possible.

  • According to def select_wavs(paths, min_dur=2, max_dur=8):, the files are filtered based on this duration criterion. Is this applied to the test set for any of the objective/subjective evaluations reported in the paper?
  • Table 2 is explained with the speaker representation capturing content information. Because the speaker representation is a single global vector, do you have an intuition for what content information the global vector can capture besides length of the input sequence (although this information is unlikely due to the wrap padding)?
  • Table 3: Because the data splits have non-overlapping speakers, how exactly do you train the speaker classifier? Normally, one can train the classifier given features derived from a trained encoder using the training set, then evaluate the classifier with the test set. That is, the train/test split is consistent for both the classifier and the encoder (or VQMIVC), and this can be done because the train and test sets share the same vocabularies or classes. However, we have non-overlapping speaker classes here, so I wonder how the speaker classifier is trained and evaluated.
  • I wonder if you have considered speaker verification (SV) using a pre-trained speaker encoder as an alternative for evaluating the speaker representation. The speaker encoder of a VC model can learn discriminative speaker embedding, and still perform poorly in terms of speaker conversion, because the decoder can't utilise the information well. As a result, SV might be a good complementary metric.
  • Based on your experiments, how sensitive is VQMIVC to the parameters like loss weights, and the CPC parameters? I am trying to implement the model and make it consistent with other baselines, which deviates from the parameters you set.

How to solve this problem?

Dear Dr. Wang:
I am trying to train a vocoder. I have installed ParallelWaveGAN and run the command run.sh; however, it fails with the traceback:
Traceback (most recent call last):
File "/home/liyp/anaconda3/envs/xll/bin/parallel-wavegan-preprocess", line 11, in
load_entry_point('parallel-wavegan', 'console_scripts', 'parallel-wavegan-preprocess')()
File "/data2/hcy/VQMIVC-main/vocoder/ParallelWaveGAN/parallel_wavegan/bin/preprocess.py", line 186, in main
), f"{utt_id} seems to have a different sampling rate."

I find that the sampling rate is 24000 Hz; however, the sampling rate of VQMIVC is 16000 Hz. Could you tell me how to modify the sampling rate?

Loss is zero

Hi, thank you for your great work. Recently I trained VQMIVC on the JVS dataset, but I got strange loss values; please look at this:
epoch:506, global step:201, recon loss:3.645, cpc loss:2.398, vq loss:0.000, perpexlity:1.000, lld cs loss:31.000, mi cs loss:0.000E+00, lld ps loss:0.271, mi ps loss:-0.000, lld cp loss:-63.999, mi cp loss:0.000, used time:0.396s
Is it normal that some losses turn into zero?
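For reference, the perplexity logged here is the usual VQ codebook-usage diagnostic: exp of the entropy of the average code assignment (this is the standard definition; the repo's exact implementation has not been checked here). A value stuck at 1.000 therefore suggests that essentially a single codebook entry is being selected:

# Sketch of the usual VQ "perplexity" diagnostic (assumed definition, not repo code).
import torch

n_codes = 512                                   # matches n_embeddings: 512 in the model config
indices = torch.zeros(1000, dtype=torch.long)   # every frame quantized to code 0
usage = torch.bincount(indices, minlength=n_codes).float()
probs = usage / usage.sum()
perplexity = torch.exp(-(probs * torch.log(probs + 1e-10)).sum())
print(perplexity)                               # ~1.0, i.e. only one code in use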

Question About the Embedding

Hi, I have a question about a configuration parameter called encoder_lf0_type = 'no_emb'.

[attached image]

I'm training the model with a subset of 13 speakers, 4 of which aren't from the dataset. I let it run for 500 epochs, but the results with some of those speakers aren't as good as in the paper. It is probably due to the small number of speakers used for training, but I have to ask:

  1. What is the encoder_lf0_type parameter about?
  2. Can you give me some advice on the training process? (minimum number of speakers for a decent result, number of epochs, etc.)

Thank you.

The CPCLoss

I read the related papers, but still do not understand the CPC loss computation.

    labels = torch.zeros(
        self.n_speakers_per_batch * self.n_utterances_per_speaker, length,
        dtype=torch.long, device=z.device
    )

    loss = F.cross_entropy(f, labels)

Can someone explain it to me? Why are labels of zeros and cross_entropy used here?
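For context: implementations in the VectorQuantizedCPC lineage (from which this repo's content encoder is borrowed) usually arrange the InfoNCE logits so that the score of the true future frame sits in column 0, followed by the scores of the negative samples. The all-zeros label tensor then simply says "the correct class is always index 0", and cross-entropy over those logits is the CPC loss. A minimal illustrative sketch (not the repo's exact code):

# Illustrative InfoNCE/CPC pattern (shapes and variable names are assumptions, not repo code).
import torch
import torch.nn.functional as F

batch, steps, n_negatives = 4, 6, 10
pos_score = torch.randn(batch, steps, 1)             # similarity with the true future frame
neg_scores = torch.randn(batch, steps, n_negatives)  # similarities with sampled negatives
logits = torch.cat([pos_score, neg_scores], dim=-1)  # column 0 holds the positive

labels = torch.zeros(batch, steps, dtype=torch.long)  # the target index is always 0
loss = F.cross_entropy(logits.reshape(-1, 1 + n_negatives), labels.reshape(-1))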

lf0 question about the conversion phase

Hi,
I wonder why you normalize the f0 series before feeding it to the f0 encoder in convert.py.
However, this kind of f0 normalization isn't used in the preprocessing phase.

How to calculate results

How did you calculate the MOS values used to compare your model with others? Can you elaborate on the calculation steps?

What do z_dim and c_dim stand for?

Dear Dr. Wang:
Could you tell me what z_dim: 64 and c_dim: 256 in config/model/default stand for? And what does n_embeddings: 512 in config/model/default stand for? Thank you very much.

Improper converted audio when source = reference

Hi,
I tried using python convert_example.py -s test_wavs/jane3.wav -r test_wavs/jane3.wav -c converted -m checkpoints/useCSMITrue_useCPMITrue_usePSMITrue_useAmpTrue/VQMIVC-model.ckpt-500.pt to check how the results are when the source audio and reference audio are the same. But the output is mostly silent. Am I missing something? To reproduce the results, the audio files and vocoder are uploaded here:

Source and reference: https://drive.google.com/file/d/1bPAQ9UaKJF1gNNCtkeDmySxLv_uXW1HN/view?usp=sharing
Converted: https://drive.google.com/file/d/1TmxjpHx3WY3nKRwy5lz04LWfKAo69qwW/view?usp=sharing
CC: @Wendison

PyTorch version 1.3.1

The requirements.txt asks to install pytorch==1.3.1. However, there is no 1.3.1 among the previous versions of PyTorch. Can you clarify that?

When executing train.py without MIM, it fails to load c10_cuda.dll. I guess this is related to the PyTorch version. The message:

(voice-clone) C:\Python\VQMIVC>python train.py use_CSMI=False use_CPMI=False use_PSMI=False
Traceback (most recent call last):
File "train.py", line 7, in import torch
File "C:\Users\charl.conda\envs\voice-clone\lib\site-packages\torch_init_.py", line 124, in
raise err
OSError: [WinError 127] The specified procedure could not be found. Error loading "C:\Users\charl.conda\envs\voice-clone\lib\site-packages\torch\lib\c10_cuda.dll" or one of its dependencies.

Question About Batch Size, number of Epochs and Learning Rate

Hi @Wendison, I've already trained some models (with VCTK subsets and external speakers) and noticed that a bigger batch size doesn't necessarily result in better audio quality for the same 500 epochs; in some cases, audio quality can be worse (for male references). My question is:

Do you have any reports or experiments with different batch sizes, numbers of epochs (why 500 and not 600 or more), and learning rates for different batch sizes?

If not, what advice could you provide regarding the batch size and the number of epochs? The bigger the better?

For complex data like this there should be an improvement with bigger batches, but the learning rate or number of epochs should be tuned.

Thank You.

checkpoint file not full

I see only these keys in the checkpoint:

dict_keys(['encoder', 'encoder_spk', 'decoder'])

But more keys are needed to resume training from the pretrained model.
Can you provide the full pretrained file?

Loss value and reconstruction gender change

Hi, great work! I trained and tested the model and I have 2 questions to clarify.

  1. I am not sure whether the loss values behave well. The loss values at epoch 500 are shown below. Are they in the normal range?
    [attached image: loss values at epoch 500]

  2. When I perform reconstruction, meaning the input wavs for the content, speaker and pitch encoders are the same, the reconstructed wav sounds as if the speaker's gender changed. I tested p316_003.wav. I have tried both the pretrained ckpt you provided and a ckpt I trained myself, and the phenomenon exists in both. The reconstructed wavs are here.

Mel stats and Vocoder

Hi,
I am trying to reproduce your paper and I ran into a problem with the mel stats and the vocoder.
When I use your pre-trained vocoder and mel stats, the speech synthesis quality is quite good.
However, when I run the preprocessing code and get new mel stats, the synthesis quality degrades with the same pre-trained vocoder.
Thus, my questions are:

1.) If I compute new mel stats, is it necessary to train the vocoder again?
2.) Do you use the mel stats from the preprocessing code for vocoder input normalization?

Thank you

Training for Indian Multi-Speaker/Multi-lingual VC

Hi @Wendison, thank you so much for your excellent work. Very nice paper.

When I saw your replies on the issues below, it motivated me to go further.

#14 (comment)

#17 (comment)

I am trying training with Common Voice Indian English multi-speakers and VCTK. I need a few suggestions from you.

Steps:

  1. I add Common Voice Indian English multi-speakers (40 speakers, each with 30 minutes of data) along with the 109 VCTK speakers, and start training with use_CSMI=True use_CPMI=True use_PSMI=True.

  2. After the model is trained with good accuracy, I will fine-tune it with other Indian regional languages from Common Voice (Tamil, Hindi, Urdu, etc.).

Is this approach good?

@Wendison, I kindly request your suggestions. Thanks.
