liusongxiang / ppg-vc

PPG-Based Voice Conversion

License: Apache License 2.0

Python 99.50% Shell 0.39% Makefile 0.11%
voice-conversion one-shot speech-synthesis phonetic-posteriorgram ppg conformer ppg-vc

ppg-vc's Introduction

One-shot Phonetic PosteriorGram (PPG)-Based Voice Conversion (PPG-VC): Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling (TASLP 2021)

This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq) based, non-parallel voice conversion approach. In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq based synthesis module. During the training stage, an encoder-decoder based hybrid connectionist-temporal-classification-attention (CTC-attention) phoneme recognizer is trained, whose encoder has a bottle-neck layer. A BNE is obtained from the phoneme recognizer and is utilized to extract speaker-independent, dense and rich linguistic representations from spectral features. Then a multi-speaker location-relative attention based seq2seq synthesis model is trained to reconstruct spectral features from the bottle-neck features, conditioning on speaker representations for speaker identity control in the generated speech. To mitigate the difficulties of using seq2seq based models to align long sequences, we down-sample the input spectral features along the temporal dimension and equip the synthesis model with a discretized mixture of logistic (MoL) attention mechanism. Since the phoneme recognizer is trained with a large speech recognition corpus, the proposed approach can conduct any-to-many voice conversion. Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity. Ablation studies are conducted to confirm the effectiveness of the feature selection and model design strategies in the proposed approach. The proposed VC approach can readily be extended to support any-to-any VC (also known as one/few-shot VC), and achieves high performance according to objective and subjective evaluations.

Diagram of the BNE-Seq2seqMoL system.

This repo implements an updated version of PPG-based VC models.

Notes:

  • The PPG model provided in conformer_ppg_model is based on a hybrid CTC-attention phoneme recognizer, trained on LibriSpeech (960 hours). The PPGs have a frame shift of 10 ms and a dimensionality of 144. This model is very similar to the one used in this paper.

  • This repo uses HiFi-GAN V1 as the vocoder model; the sampling rate of the synthesized audio is 24 kHz.
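For orientation, the conversion pipeline chains the three components described above: the conformer PPG (bottleneck) model extracts linguistic features from the source wav, the seq2seq PPG2Mel model generates a mel-spectrogram conditioned on the reference speaker's embedding, and HiFi-GAN V1 turns the mel into a 24 kHz waveform. The sketch below is only a conceptual outline under these assumptions; the method names are hypothetical stand-ins rather than the repo's actual API (convert_from_wav.py is the real entry point).

import torch

@torch.no_grad()
def convert_one(src_wav, ref_wav, ppg_model, ppg2mel_model, speaker_encoder, vocoder):
    """Conceptual PPG-VC inference chain (hypothetical interfaces, not the repo's code)."""
    # 1. Speaker-independent linguistic content: 144-dim PPGs at a 10 ms frame shift.
    ppg = ppg_model.extract(src_wav)                        # (T, 144); hypothetical method

    # 2. Target speaker identity: an embedding (d-vector) from the reference utterance.
    spk_embed = speaker_encoder.embed(ref_wav)              # (D,); hypothetical method

    # 3. Seq2seq synthesis with MoL attention: PPGs (plus log-F0 in the repo) -> mel-spectrogram.
    mel = ppg2mel_model.generate(ppg, spk_embed=spk_embed)  # (T', 80); hypothetical method

    # 4. HiFi-GAN V1 vocoder: mel -> 24 kHz waveform.
    wav = vocoder(mel.transpose(0, 1).unsqueeze(0)).squeeze()
    return wav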

Updates!

  • We provide an audio sample uttered by Barack Obama (link); you can convert any voice into Obama's voice using this sample as the reference. Please have a try!
  • The BNE-Seq2seqMoL one-shot VC model has been uploaded (link).
  • The BiLSTM-based one-shot VC model has been uploaded (link).

How to use

Setup with virtualenv

$ cd tools
$ make

Note: If you want to specify the Python, CUDA, or PyTorch version, run, for example:

$ make PYTHON=3.7 CUDA_VERSION=10.1 PYTORCH_VERSION=1.6

Conversion with a pretrained model

  1. Download a model from here; we recommend first trying the model bneSeq2seqMoL-vctk-libritts460-oneshot. Put the config file and the checkpoint file in a folder <model_dir>.
  2. Prepare a source wav directory <source_wav_dir> containing the wavs you want to convert.
  3. Prepare a reference audio sample (i.e., the target voice you want to convert to) <ref_wavpath>.
  4. Run test.sh as:
sh test.sh <model_dir>/seq2seq_mol_ppg2mel_vctk_libri_oneshotvc_r4_normMel_v2.yaml <model_dir>/best_loss_step_304000.pth \
  <source_wav_dir> <ref_wavpath>

The converted wavs are saved in the folder vc_gen_wavs.

Data preprocessing

Activate the virtual env with source tools/venv/bin/activate, then:

  • Please run 1_compute_ctc_att_bnf.py to compute PPG features.
  • Please run 2_compute_f0.py to compute fundamental frequency.
  • Please run 3_compute_spk_dvecs.py to compute speaker d-vectors.
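These scripts produce, per utterance, the PPG/bottleneck features and the F0 track, plus a d-vector per speaker, which the training stage consumes. As a minimal sketch of how the outputs line up, assuming each script writes one .npy file per utterance or speaker (the file layout and shapes here are assumptions; check the scripts and the training config for the actual paths and formats):

import numpy as np
from pathlib import Path

# Assumed per-utterance layout (an illustration, not the repo's guaranteed format):
#   <ppg_dir>/<fid>.npy   - bottleneck/PPG features from 1_compute_ctc_att_bnf.py, shape (T, 144)
#   <f0_dir>/<fid>.npy    - fundamental frequency from 2_compute_f0.py, shape (T,)
#   <dvec_dir>/<spk>.npy  - speaker d-vector from 3_compute_spk_dvecs.py, shape (D,)

def load_utterance_features(fid: str, spk: str, ppg_dir: Path, f0_dir: Path, dvec_dir: Path):
    ppg = np.load(ppg_dir / f"{fid}.npy")
    f0 = np.load(f0_dir / f"{fid}.npy")
    dvec = np.load(dvec_dir / f"{spk}.npy")
    return ppg, f0, dvec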

Training

  • Please refer to run.sh
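For context, the second-stage (PPG2Mel) model is trained to reconstruct an utterance's mel-spectrogram from its own bottleneck features, conditioned on the speaker representation, using an MSE loss between the predicted and ground-truth mel (see the abstract above). Below is a minimal sketch of one such training step with a hypothetical model/batch interface, not the actual trainer invoked by run.sh.

import torch
import torch.nn.functional as F

def training_step(ppg2mel_model, batch, optimizer):
    """One conceptual stage-2 update (hypothetical interfaces, not the repo's trainer)."""
    # Features precomputed by the preprocessing scripts, plus the target mel of the same utterance.
    ppg, lf0, spk_embed, mel_target = batch
    mel_pred = ppg2mel_model(ppg, lf0, spk_embed=spk_embed)
    loss = F.mse_loss(mel_pred, mel_target)   # MSE between predicted and ground-truth mel
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()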

Citations

If you use this repo for your research, please consider citing the following related papers.

@ARTICLE{liu2021any,
  author={Liu, Songxiang and Cao, Yuewen and Wang, Disong and Wu, Xixin and Liu, Xunying and Meng, Helen},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing}, 
  title={Any-to-Many Voice Conversion With Location-Relative Sequence-to-Sequence Modeling}, 
  year={2021},
  volume={29},
  number={},
  pages={1717-1728},
  doi={10.1109/TASLP.2021.3076867}
}

@inproceedings{Liu2018,
  author={Songxiang Liu and Jinghua Zhong and Lifa Sun and Xixin Wu and Xunying Liu and Helen Meng},
  title={Voice Conversion Across Arbitrary Speakers Based on a Single Target-Speaker Utterance},
  year={2018},
  booktitle={Proc. Interspeech 2018},
  pages={496--500},
  doi={10.21437/Interspeech.2018-1504},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1504}
}


ppg-vc's Issues

question about the training of encoder-decoder

Hi, the paper mentions an MSE loss between the predicted mel-spectrogram and the ground-truth mel-spectrogram. I am wondering if the example below is correct.
A, our source speaker, has an audio clip saying 12345. B, our target speaker, also has an audio clip saying 12345, plus some other clips. During training, A's 12345 will be converted into B's voice using one of B's clips (any clip). Then the output will be compared with B's 12345 to compute the MSE loss.

Fine tuning the vocoder

Hi, is it possible for you to share the pretrained vocoder discriminator model so that I can fine-tune it on a specific voice?

Thank you.

Unable to run convert_from_wav.py; Module nnsp is missing a submodule?


Text:

(base) F:\ppg-vc>python convert_from_wav.py --ppg2mel_model_train_config .\bilstm-vctk-libritts460-oneshot\bilstm_ppg2mel_vctk_libri_oneshotvc.yaml --ppg2mel_model_file .\bilstm-vctk-libritts460-oneshot\step_250000.pth --src_wav_dir F:\p5_characters\Ann\audio --ref_wav_path F:\p5_characters\Akechi\audio\ve370_002_00083.wav -o .\output_ann_to_akechi

Traceback (most recent call last):
  File "convert_from_wav.py", line 11, in <module>
    from src.mel_decoder_mol_encAddlf0 import MelDecoderMOL
  File "F:\ppg-vc\src\__init__.py", line 4, in <module>
    from .mel_decoder_lsa import MelDecoderLSA
  File "F:\ppg-vc\src\mel_decoder_lsa.py", line 20, in <module>
    from .rnn_decoder_lsa import Decoder
  File "F:\ppg-vc\src\rnn_decoder_lsa.py", line 5, in <module>
    from nnsp.ctc_seq2seq_ppg_vc.lsa_attention import LocationSensitiveAttention
ModuleNotFoundError: No module named 'nnsp.ctc_seq2seq_ppg_vc'

I'm on Python 3.8; the latest version of nnsp is installed via pip.

missing files

What are
/home/shaunxliu/data/vctk/fidlists/train_fidlist.new,
/home/shaunxliu/data/vctk/fidlists/dev_fidlist.new, and
/home/shaunxliu/data/vctk/fidlists/eval_fidlist.txt?

How can I generate these kinds of files for my own dataset?

fine tune asr

Is there code to train or fine-tune the ASR model to produce better PPGs?

Help with "any to many voice conversion with location relative seq2seq modeling" paper

Dear Dr. Songxiang Liu,

I am trying to use your code for the one-shot VC and to train the linglf02mel seq2seq models. However, I am getting a runtime error in the solver.exec() function. I think the error is caused by the fact that the if self.step > self.max_step == 1 loop does not terminate, and the code continues to run.

I think the error may be caused by incorrect data in the config file for the train fid lists. I would like to ask for your guidance on the following:

  1. Is the train fid list file a Python file or a text file that contains the paths to the vctk data? Are the vctk data audio or text?

  2. Is this the same for the dev and eval fid lists?

  3. Are the vctk_ppg_dir and libri_ppg_dir directories the output of the compute_ctc_att_bnf file?

  4. Are the vctk_f0_dir and libri_f0_dir directories the output of the compute_f0 file?

  5. Are the vctk_wav_dir and libri_wav_dir directories also audio datasets?

  6. Are the libri_spk_dvec_dir and libri_spk_devc_dir directories the output of the compute_spk_dvecs file?

I would be very grateful if you could provide me with any guidance that you can. I will never forget your help.

Thank you,
Negin Vahidi

Training strategy

Hey, I am back again with another question :P

Can I interpret the two-stage training scheme as:

  • The training of the CTC-Attention phoneme recognizer, the speaker encoder, and the vocoder. These three can be trained separately on their own.
  • The training of the Seq2seqMoL model, which needs the outputs of the CTC-Attention phoneme recognizer and the speaker encoder. Each training instance is like (A's sentence_1, B's sentence_x, B's sentence_1); the MSE is computed between the model's output for B's sentence_1 and the ground-truth B's sentence_1.

Please correct me if I am wrong.

Thanks!

about MFCC feature extraction

Hello, and thank you for your work.

I used your pretrained model to obtain the mel-spectrogram of an audio clip, but I noticed that it seems to contain values below zero, which looks different from the mel-spectrograms I have seen before (extracted with the librosa library).
Would it be feasible to use it to further extract MFCC features? If so, could you provide some brief guidance?
I noticed that your code allows specifying min_mel and max_mel to adjust the value range, but after setting min_mel above zero I do not know how to choose max_mel, and I cannot be sure whether the result is a standard MFCC feature.

Thank you very much for reading.

Hello Dr. Liu, I saw the statement here that "Source codes should be modified correspondingly for VC applications"

Thank you for the questions.
For Q1:
I adapted espnet a lot; it seems that espnet ASR models always downsample the encoder input along the temporal axis by more than 4x and do not support phonemes as output symbols. Source codes should be modified correspondingly for VC applications. But the basic steps of the training process are very similar to those presented in the espnet ASR recipes, including data preparation and file organization. The run.sh should be modified a little, e.g., the language model can be skipped. Sufficient familiarity with the espnet source code is necessary if you want to train a content encoder using your own data.

For Q2:
Please refer to this paper for your questions: TTS Skins: Speaker Conversion via ASR.
Good VC performance validates the speaker-independence property of the bottleneck feature obtained in this way. The paper listed above says that BNFs are better than PPG features, but this could really be a model-selection thing.

Hope this can help.
Songxiang Liu

Originally posted by @liusongxiang in #4 (comment)

Hello Dr. Liu, I saw the statement here that "Source codes should be modified correspondingly for VC applications", and I also noticed that the code provides an espnet training config for en_conformer_ctc_att. Can the PPG part be trained by directly using this config?

why std.sqrt() is performed twice in utterance_mvn?

I notice that in utterance_mvn(), when norm_means is True, sqrt() is applied twice (var.sqrt() followed by std.sqrt()).
Is there any explanation for this?

if norm_means:
    x -= mean

    if norm_vars:
        var = x.pow(2).sum(dim=1, keepdim=True) / ilens_
        std = torch.clamp(var.sqrt(), min=eps)
        x = x / std.sqrt()
    return x, ilens
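For comparison, a conventional utterance-level mean-variance normalization divides by the standard deviation exactly once; the snippet below is a minimal reference sketch of that convention (not the repo's utterance_mvn), assuming x has shape (batch, time, dim) and ilens holds the valid frame counts.

import torch

def utterance_mvn_reference(x, ilens, norm_vars=True, eps=1.0e-20):
    # Per-utterance mean over the time axis.
    ilens_ = ilens.to(x.device, x.dtype).view(-1, 1, 1)
    mean = x.sum(dim=1, keepdim=True) / ilens_
    x = x - mean
    if norm_vars:
        var = x.pow(2).sum(dim=1, keepdim=True) / ilens_
        std = torch.clamp(var.sqrt(), min=eps)
        x = x / std                # divide by the std once (no second .sqrt())
    return x, ilens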

Vocoder model

Hi,

In the paper, the WaveRNN network is used as the neural vocoder. Is there any specific reason it was replaced by HiFi-GAN in this repository?

Thanks!

training code for conformer + ctc: missing files

Hello, I compared the conformer implementation in this project against espnet and found some mismatches. After ruling out version differences, I found that some parts of the code differ a lot. If I train conformer + ctc directly with the latest espnet code, is there anything that needs to be modified, especially the subsample part?

finetuning vocoder

Hi,
I am new to the field of VC. I am trying to fine-tune the model for specific speakers so that the output sounds more like those speakers.
After going through the repo, I thought the best approach would be to fine-tune the vocoder (suggested here: #11).
From there I went to this repo (as suggested in the readme):
https://github.com/jik876/hifi-gan
There, when I tested other pretrained models, such as LJ_V3 and LJ_FT_T2_V3, the output I got was just noise.
output: https://drive.google.com/file/d/19xXGF_u0EBtaiFDlg5CkAiEi8Xzc_dd3/view?usp=share_link

--> Can you elaborate on how to fine-tune on custom data (specific speakers)? I have 15-20 min of data for each of 10 speakers; will that be sufficient?
--> Is fine-tuning the vocoder sufficient for getting speech for specific speakers?
--> How do I use other models available in the hifi-gan repo within the ppg-vc repo? What config changes do I need to make?

ModuleNotFoundError: No module named 'nnsp.layers'

I installed nnsp with pip install nnsp and it reported success, but when I ran test.sh I hit a missing nnsp.layers error. The error message is as follows:

Traceback (most recent call last):
  File "convert_from_wav.py", line 11, in <module>
    from src.mel_decoder_mol_encAddlf0 import MelDecoderMOL
  File "/code/ppg-vc-main/src/__init__.py", line 4, in <module>
    from .mel_decoder_lsa import MelDecoderLSA
  File "/code/ppg-vc-main/src/mel_decoder_lsa.py", line 20, in <module>
    from .rnn_decoder_lsa import Decoder
  File "/code/ppg-vc-main/src/rnn_decoder_lsa.py", line 4, in <module>
    from .lsa_attention import LocationSensitiveAttention
  File "/code/ppg-vc-main/src/lsa_attention.py", line 4, in <module>
    from nnsp.layers.basic_layers import Conv1d, Linear
ModuleNotFoundError: No module named 'nnsp.layers'

What could be causing this?

ZeroDivisionError: float division by zero

I'm trying to run test.sh but keep getting the same error;
it seems I can't get the right utterances.
Can anyone kindly give me some advice?

Experiment name: seq2seq_mol_ppg2mel_vctk_libri_oneshotvc_r4_normMel_v2
Load PPG-model, PPG2Mel-model, Vocoder-model...
Removing weight norm...
Loaded the voice encoder model on cuda in 0.02 seconds.
Number of source utterances: 0.
0it [00:00, ?it/s]
RTF:
Traceback (most recent call last):
  File "convert_from_wav.py", line 216, in <module>
    main()
  File "convert_from_wav.py", line 212, in main
    convert(args)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "convert_from_wav.py", line 167, in convert
    print(total_rtf / cnt)
ZeroDivisionError: float division by zero
