liusongxiang / ppg-vc

PPG-Based Voice Conversion

License: Apache License 2.0

Python 99.50% Shell 0.39% Makefile 0.11%
voice-conversion one-shot speech-synthesis phonetic-posteriorgram ppg conformer ppg-vc

ppg-vc's Introduction

One-shot Phonetic PosteriorGram (PPG)-Based Voice Conversion (PPG-VC): Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling (TASLP 2021)

This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq) based, non-parallel voice conversion approach. In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq based synthesis module. During the training stage, an encoder-decoder based hybrid connectionist-temporal-classification-attention (CTC-attention) phoneme recognizer is trained, whose encoder has a bottle-neck layer. A BNE is obtained from the phoneme recognizer and is utilized to extract speaker-independent, dense and rich linguistic representations from spectral features. Then a multi-speaker location-relative attention based seq2seq synthesis model is trained to reconstruct spectral features from the bottle-neck features, conditioning on speaker representations for speaker identity control in the generated speech. To mitigate the difficulties of using seq2seq based models to align long sequences, we down-sample the input spectral features along the temporal dimension and equip the synthesis model with a discretized mixture of logistic (MoL) attention mechanism. Since the phoneme recognizer is trained with a large speech recognition corpus, the proposed approach can conduct any-to-many voice conversion. Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity. Ablation studies are conducted to confirm the effectiveness of the feature selection and model design strategies in the proposed approach. The proposed VC approach can readily be extended to support any-to-any VC (also known as one/few-shot VC), and achieves high performance according to objective and subjective evaluations.

Diagram of the BNE-Seq2seqMoL system.

This repo implements an updated version of PPG-based VC models.

Notes:

  • The PPG model provided in conformer_ppg_model is based on a hybrid CTC-attention phoneme recognizer, trained on LibriSpeech (960 hours). The PPGs have a frame shift of 10 ms and a dimensionality of 144. This model is very similar to the one used in this paper.

  • This repo uses HiFi-GAN V1 as the vocoder model; the sampling rate of the synthesized audio is 24 kHz.
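For orientation, the conversion pipeline chains the three components described above: the conformer PPG (bottleneck) model extracts linguistic features from the source wav, the seq2seq PPG2Mel model generates a mel-spectrogram conditioned on the reference speaker's embedding, and HiFi-GAN V1 turns the mel into a 24 kHz waveform. The sketch below is only a conceptual outline under these assumptions; the method names are hypothetical stand-ins rather than the repo's actual API (convert_from_wav.py is the real entry point).

import torch

@torch.no_grad()
def convert_one(src_wav, ref_wav, ppg_model, ppg2mel_model, speaker_encoder, vocoder):
    """Conceptual PPG-VC inference chain (hypothetical interfaces, not the repo's code)."""
    # 1. Speaker-independent linguistic content: 144-dim PPGs at a 10 ms frame shift.
    ppg = ppg_model.extract(src_wav)                        # (T, 144); hypothetical method

    # 2. Target speaker identity: an embedding (d-vector) from the reference utterance.
    spk_embed = speaker_encoder.embed(ref_wav)              # (D,); hypothetical method

    # 3. Seq2seq synthesis with MoL attention: PPGs (plus log-F0 in the repo) -> mel-spectrogram.
    mel = ppg2mel_model.generate(ppg, spk_embed=spk_embed)  # (T', 80); hypothetical method

    # 4. HiFi-GAN V1 vocoder: mel -> 24 kHz waveform.
    wav = vocoder(mel.transpose(0, 1).unsqueeze(0)).squeeze()
    return wav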

Updates!

  • We provide an audio sample uttered by Barack Obama (link); you can convert any voice into Obama's voice using this sample as the reference. Please have a try!
  • The BNE-Seq2seqMoL one-shot VC model has been uploaded (link).
  • The BiLSTM-based one-shot VC model has been uploaded (link).

How to use

Setup with virtualenv

$ cd tools
$ make

Note: If you want to specify the Python, CUDA, or PyTorch version, run, for example:

$ make PYTHON=3.7 CUDA_VERSION=10.1 PYTORCH_VERSION=1.6

Conversion with a pretrained model

  1. Download a model from here; we recommend first trying the model bneSeq2seqMoL-vctk-libritts460-oneshot. Put the config file and the checkpoint file in a folder <model_dir>.
  2. Prepare a source wav directory <source_wav_dir> containing the wavs you want to convert.
  3. Prepare a reference audio sample (i.e., the target voice you want to convert to) <ref_wavpath>.
  4. Run test.sh as:
sh test.sh <model_dir>/seq2seq_mol_ppg2mel_vctk_libri_oneshotvc_r4_normMel_v2.yaml <model_dir>/best_loss_step_304000.pth \
  <source_wav_dir> <ref_wavpath>

The converted wavs are saved in the folder vc_gen_wavs.

Data preprocessing

Activate the virtual env with source tools/venv/bin/activate, then:

  • Please run 1_compute_ctc_att_bnf.py to compute PPG features.
  • Please run 2_compute_f0.py to compute fundamental frequency.
  • Please run 3_compute_spk_dvecs.py to compute speaker d-vectors.
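These scripts produce, per utterance, the PPG/bottleneck features and the F0 track, plus a d-vector per speaker, which the training stage consumes. As a minimal sketch of how the outputs line up, assuming each script writes one .npy file per utterance or speaker (the file layout and shapes here are assumptions; check the scripts and the training config for the actual paths and formats):

import numpy as np
from pathlib import Path

# Assumed per-utterance layout (an illustration, not the repo's guaranteed format):
#   <ppg_dir>/<fid>.npy   - bottleneck/PPG features from 1_compute_ctc_att_bnf.py, shape (T, 144)
#   <f0_dir>/<fid>.npy    - fundamental frequency from 2_compute_f0.py, shape (T,)
#   <dvec_dir>/<spk>.npy  - speaker d-vector from 3_compute_spk_dvecs.py, shape (D,)

def load_utterance_features(fid: str, spk: str, ppg_dir: Path, f0_dir: Path, dvec_dir: Path):
    ppg = np.load(ppg_dir / f"{fid}.npy")
    f0 = np.load(f0_dir / f"{fid}.npy")
    dvec = np.load(dvec_dir / f"{spk}.npy")
    return ppg, f0, dvec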

Training

  • Please refer to run.sh
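For context, the second-stage (PPG2Mel) model is trained to reconstruct an utterance's mel-spectrogram from its own bottleneck features, conditioned on the speaker representation, using an MSE loss between the predicted and ground-truth mel (see the abstract above). Below is a minimal sketch of one such training step with a hypothetical model/batch interface, not the actual trainer invoked by run.sh.

import torch
import torch.nn.functional as F

def training_step(ppg2mel_model, batch, optimizer):
    """One conceptual stage-2 update (hypothetical interfaces, not the repo's trainer)."""
    # Features precomputed by the preprocessing scripts, plus the target mel of the same utterance.
    ppg, lf0, spk_embed, mel_target = batch
    mel_pred = ppg2mel_model(ppg, lf0, spk_embed=spk_embed)
    loss = F.mse_loss(mel_pred, mel_target)   # MSE between predicted and ground-truth mel
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()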

Citations

If you use this repo for your research, please consider citing the following related papers.

@ARTICLE{liu2021any,
  author={Liu, Songxiang and Cao, Yuewen and Wang, Disong and Wu, Xixin and Liu, Xunying and Meng, Helen},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing}, 
  title={Any-to-Many Voice Conversion With Location-Relative Sequence-to-Sequence Modeling}, 
  year={2021},
  volume={29},
  number={},
  pages={1717-1728},
  doi={10.1109/TASLP.2021.3076867}
}

@inproceedings{Liu2018,
  author={Songxiang Liu and Jinghua Zhong and Lifa Sun and Xixin Wu and Xunying Liu and Helen Meng},
  title={Voice Conversion Across Arbitrary Speakers Based on a Single Target-Speaker Utterance},
  year={2018},
  booktitle={Proc. Interspeech 2018},
  pages={496--500},
  doi={10.21437/Interspeech.2018-1504},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1504}
}


ppg-vc's Issues

question about the training of encoder-decoder

Hi, the paper mentions an MSE loss between the predicted mel-spectrogram and the ground-truth mel-spectrogram. I am wondering if the example below is correct.
A, our source speaker, has an audio clip saying 12345. B, our target speaker, also has an audio clip saying 12345, plus some other clips. During training, A's 12345 will be converted into B's voice using one of B's clips (any clip). Then the output will be compared with B's 12345 to compute the MSE loss.

Fine tuning the vocoder

Hi, is it possible for you to share the pretrained vocoder discriminator model so that I can fine-tune it on a specific voice?

Thank you.

Unable to run convert_from_wav.py; Module nnsp is missing a submodule?


Text:

(base) F:\ppg-vc>python convert_from_wav.py --ppg2mel_model_train_config .\bilstm-vctk-libritts460-oneshot\bilstm_ppg2mel_vctk_libri_oneshotvc.yaml --ppg2mel_model_file .\bilstm-vctk-libritts460-oneshot\step_250000.pth --src_wav_dir F:\p5_characters\Ann\audio --ref_wav_path F:\p5_characters\Akechi\audio\ve370_002_00083.wav -o .\output_ann_to_akechi

Traceback (most recent call last):
  File "convert_from_wav.py", line 11, in <module>
    from src.mel_decoder_mol_encAddlf0 import MelDecoderMOL
  File "F:\ppg-vc\src\__init__.py", line 4, in <module>
    from .mel_decoder_lsa import MelDecoderLSA
  File "F:\ppg-vc\src\mel_decoder_lsa.py", line 20, in <module>
    from .rnn_decoder_lsa import Decoder
  File "F:\ppg-vc\src\rnn_decoder_lsa.py", line 5, in <module>
    from nnsp.ctc_seq2seq_ppg_vc.lsa_attention import LocationSensitiveAttention
ModuleNotFoundError: No module named 'nnsp.ctc_seq2seq_ppg_vc'

I'm on Python 3.8; the latest version of nnsp is installed via pip.

missing files

What are
/home/shaunxliu/data/vctk/fidlists/train_fidlist.new,
/home/shaunxliu/data/vctk/fidlists/dev_fidlist.new, and
/home/shaunxliu/data/vctk/fidlists/eval_fidlist.txt?

How can I generate these kinds of files for my own dataset?

fine tune asr

Is there code to train or fine-tune the ASR model to produce better PPGs?

Help with "any to many voice conversion with location relative seq2seq modeling" paper

Dear Dr. Songxiang Liu,

I am trying to use your code for the one-shot VC and to train the linglf02mel seq2seq models. However, I am getting a runtime error in the solver.exec() function. I think the error is caused by the fact that the if self.step > self.max_step == 1 loop does not terminate, and the code continues to run.

I think the error may be caused by incorrect data in the config file for the train fid lists. I would like to ask for your guidance on the following:

  1. Is the train fid list file a Python file or a text file that contains the paths to the vctk data? Are the vctk data audio or text?

  2. Is this the same for the dev and eval fid lists?

  3. Are the vctk_ppg_dir and libri_ppg_dir directories the output of the compute_ctc_att_bnf file?

  4. Are the vctk_f0_dir and libri_f0_dir directories the output of the compute_f0 file?

  5. Are the vctk_wav_dir and libri_wav_dir directories also audio datasets?

  6. Are the libri_spk_dvec_dir and libri_spk_devc_dir directories the output of the compute_spk_dvecs file?

I would be very grateful if you could provide me with any guidance that you can. I will never forget your help.

Thank you,
Negin Vahidi

Training strategy

Hey, I am back again with another question :P

Can I interpret the two-stage training scheme as:

  • The training of the CTC-Attention phoneme recognizer, the speaker encoder, and the vocoder. These three can be trained separately on their own.
  • The training of the Seq2seqMoL model, which needs the outputs of the CTC-Attention phoneme recognizer and the speaker encoder. Each training instance is like (A's sentence_1, B's sentence_x, B's sentence_1); the MSE is computed between the model's output for B's sentence_1 and the ground-truth B's sentence_1.

Please correct me if I am wrong.

Thanks!

about MFCC feature extraction

Hello, and thank you for your work.

I used your pretrained model to obtain the mel-spectrogram of an audio clip, but I noticed that it seems to contain values below zero, which looks different from the mel-spectrograms I have seen before (extracted with the librosa library).
Would it be feasible to use it to further extract MFCC features? If so, could you provide some brief guidance?
I noticed that your code allows specifying min_mel and max_mel to adjust the value range, but after setting min_mel above zero I do not know how to choose max_mel, and I cannot be sure whether the result is a standard MFCC feature.

Thank you very much for reading.

Hello Dr. Liu, I saw the statement here that "Source codes should be modified correspondingly for VC applications"

Thank you for the questions.
For Q1:
I adapted espnet a lot; it seems that espnet ASR models always downsample the encoder input along the temporal axis by more than 4x and do not support phonemes as output symbols. Source codes should be modified correspondingly for VC applications. But the basic steps of the training process are very similar to those presented in the espnet ASR recipes, including data preparation and file organization. The run.sh should be modified a little, e.g., the language model can be skipped. Sufficient familiarity with the espnet source code is necessary if you want to train a content encoder using your own data.

For Q2:
Please refer to this paper for your questions: TTS Skins: Speaker Conversion via ASR.
Good VC performance validates the speaker-independence property of the bottleneck feature obtained in this way. The paper listed above says that BNFs are better than PPG features, but this could really be a model-selection thing.

Hope this can help.
Songxiang Liu

Originally posted by @liusongxiang in #4 (comment)

Hello Dr. Liu, I saw the statement here that "Source codes should be modified correspondingly for VC applications", and I also noticed that the code provides an espnet training config for en_conformer_ctc_att. Can the PPG part be trained by directly using this config?

why std.sqrt() is performed twice in utterance_mvn?

I notice that in utterance_mvn(), when norm_means is True, sqrt() is applied twice (var.sqrt() followed by std.sqrt()).
Is there any explanation for this?

if norm_means:
    x -= mean

    if norm_vars:
        var = x.pow(2).sum(dim=1, keepdim=True) / ilens_
        std = torch.clamp(var.sqrt(), min=eps)
        x = x / std.sqrt()
    return x, ilens
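For comparison, a conventional utterance-level mean-variance normalization divides by the standard deviation exactly once; the snippet below is a minimal reference sketch of that convention (not the repo's utterance_mvn), assuming x has shape (batch, time, dim) and ilens holds the valid frame counts.

import torch

def utterance_mvn_reference(x, ilens, norm_vars=True, eps=1.0e-20):
    # Per-utterance mean over the time axis.
    ilens_ = ilens.to(x.device, x.dtype).view(-1, 1, 1)
    mean = x.sum(dim=1, keepdim=True) / ilens_
    x = x - mean
    if norm_vars:
        var = x.pow(2).sum(dim=1, keepdim=True) / ilens_
        std = torch.clamp(var.sqrt(), min=eps)
        x = x / std                # divide by the std once (no second .sqrt())
    return x, ilens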

Vocoder model

Hi,

In the paper, the WaveRNN network is used as the neural vocoder. Is there any specific reason it was replaced by HiFi-GAN in this repository?

Thanks!

training code for conformer + ctc: missing files

Hello, I compared the conformer implementation in this project against espnet and found some mismatches. After ruling out version differences, I found that some parts of the code differ a lot. If I train conformer + ctc directly with the latest espnet code, is there anything that needs to be modified, especially the subsample part?

finetuning vocoder

Hi,
I am new to the field of VC. I am trying to fine-tune the model for specific speakers so that the output sounds more like those speakers.
After going through the repo, I thought the best approach would be to fine-tune the vocoder (suggested here: #11).
From there I went to this repo (as suggested in the readme):
https://github.com/jik876/hifi-gan
There, when I tested other pretrained models, such as LJ_V3 and LJ_FT_T2_V3, the output I got was just noise.
output: https://drive.google.com/file/d/19xXGF_u0EBtaiFDlg5CkAiEi8Xzc_dd3/view?usp=share_link

--> Can you elaborate on how to fine-tune on custom data (specific speakers)? I have 15-20 min of data for each of 10 speakers; will that be sufficient?
--> Is fine-tuning the vocoder sufficient for getting speech for specific speakers?
--> How do I use other models available in the hifi-gan repo within the ppg-vc repo? What config changes do I need to make?

ModuleNotFoundError: No module named 'nnsp.layers'

I installed nnsp with pip install nnsp and it reported success, but when I ran test.sh I hit a missing nnsp.layers error. The error message is as follows:

Traceback (most recent call last):
  File "convert_from_wav.py", line 11, in <module>
    from src.mel_decoder_mol_encAddlf0 import MelDecoderMOL
  File "/code/ppg-vc-main/src/__init__.py", line 4, in <module>
    from .mel_decoder_lsa import MelDecoderLSA
  File "/code/ppg-vc-main/src/mel_decoder_lsa.py", line 20, in <module>
    from .rnn_decoder_lsa import Decoder
  File "/code/ppg-vc-main/src/rnn_decoder_lsa.py", line 4, in <module>
    from .lsa_attention import LocationSensitiveAttention
  File "/code/ppg-vc-main/src/lsa_attention.py", line 4, in <module>
    from nnsp.layers.basic_layers import Conv1d, Linear
ModuleNotFoundError: No module named 'nnsp.layers'

What could be causing this?

ZeroDivisionError: float division by zero

I'm trying to run test.sh but keep getting the same error;
it seems I can't get the right utterances.
Can anyone kindly give me some advice?

Experiment name: seq2seq_mol_ppg2mel_vctk_libri_oneshotvc_r4_normMel_v2
Load PPG-model, PPG2Mel-model, Vocoder-model...
Removing weight norm...
Loaded the voice encoder model on cuda in 0.02 seconds.
Number of source utterances: 0.
0it [00:00, ?it/s]
RTF:
Traceback (most recent call last):
  File "convert_from_wav.py", line 216, in <module>
    main()
  File "convert_from_wav.py", line 212, in main
    convert(args)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "convert_from_wav.py", line 167, in convert
    print(total_rtf / cnt)
ZeroDivisionError: float division by zero
