bshall / knn-vc Goto Github PK

View Code? Open in Web Editor NEW

436.0 14.0 64.0 283 KB

Voice Conversion With Just Nearest Neighbors

Home Page: https://bshall.github.io/knn-vc/

License: Other

Python 96.69% Jupyter Notebook 3.31%

any-to-any knn pytorch self-supervised-learning speech speech-synthesis voice-conversion

knn-vc's People

Contributors

Stargazers

Watchers

Forkers

splinter21 maxmax2016 muruganr96 shansiliu95 entn-at ishine ryu1845 whitefu hongwen-sun xiaozhuo12138 os-netizen joycemind aydous shaun95 qoboty vsydorskyy eccheng davvo mr-nobody-dey lazeyu wancaiyan gaurangbharti1 kaidduong yanlancai jaedukseo bandanban1 sirbitesalot wilson97 qiaolinwang keikinn brdhunga nzpeng youhjjhhhjj matrixy zhuyi1159 tarepan mkfxxl jackiexiao reregin eyaler littlerookiegithub mikkolehtimaki tathiyennhi kevinwang676 sitholedavid saradark jerry1331 nicolvisser render-ai youjile tardis-forever codeghees 731why oytunturk farahhuifanyang timherzig wangtao201919 wjm0621 wallacerao boringtaskai delijingyic navezjt ego wladgrm

knn-vc's Issues

Maybe mention memory consumption in readme.md?

I just tried to use and test your model, unfortunately I only have a GPU with 16GB of RAM. Apparently WavLM takes about 12 GB and HifiGAN needs another 5 GB so you need at least 20GB of RAM to run inference. Would be nice to clarify that in the requirements section :)

Discriminator checkpoint

Hey! Thank you for sharing your work, I really like ur idea!

I am trying to finetune vocoder for a specific voice to see whether it would improve voice matching, because voice doesn't match well at zero-shot setting. Could you please share discriminator weights of vocoder with prematching?

Question about WavLM layer choice

In your paper, you say:

Recent work confirms that later layers give poorer predictions of pitch, prosody, and speaker identity. Based on these observations, we found that using a layer with high correlation with speaker identification – layer 6 in WavLM-Large – was necessary for good speaker similarity and retention of the prosody information from the source utterance.

The reference associated with that passage, though, doesn't seem to examine WavLM-Large, only Base, and my reading of it is that WavLM-Base's earlier layers (0-2) are more correlated with pitch and energy reconstruction, common speaker ID features.

I'm wondering how you came to use layer 6 of the Large model and whether you tried other layers. I'm having trouble locating other research that dives into layerwise feature correlations for these models, so any pointers you can provide are helpful.

Thanks!

Some questions about implementation

Since the structure of the LibriSpeech dataset is root/subset/speaker/chapter/file, should

knn-vc/prematch_dataset.py

Line 57 in 7b59579

uttrs_from_same_spk = sorted(list(path.parent.rglob('**/*.flac')))

be modified to uttrs_from_same_spk = sorted(list(path.parent.parent.rglob('**/*.flac')))to return other utterances of the same speaker?
Since matching and synthesis do not necessarily use the same features, can I use ASR features (like from Whisper) for matching and its corresponding original Mel spectrogram for synthesis?

Request

Do you think you could make this 48000hz compatible?

torchaudio version

Hi, When I execute the program according to the Quickstart in README.md, the following error is reported:

AttributeError: module 'torchaudio.functional' has no attribute 'loudness'

I guess it is not supported in the torchaudio version I am using, the version I am using is: pytorch==1.12.1 torchaudio==0.12.1.

Can you tell me which version you are using? Thanks.

prematch_dataset run very slow

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1362 G /usr/lib/xorg/Xorg 53MiB |
| 0 N/A N/A 2014 G /usr/lib/xorg/Xorg 119MiB |
| 0 N/A N/A 2144 G /usr/bin/gnome-shell 54MiB |
| 0 N/A N/A 384873 C python 3659MiB |
+-----------------------------------------------------------------------------+
`

It takes up video memory, but it should not be used, and the power has not increased. Is there a problem?

Conversion output has very strong similarity to source audio.

I have experimented quite a lot and I have to acknowledge that the output audio is very similar to the source audio tone.

What strategies or changes would you suggest to increase the similarity to the references?

Using another batch size in training

I encountered a strange bug or rather a strange behaviour, which I can not really pinpoint to the exact issue.
I used the standard training, as you described and it worked fine. However when I changed the batch_size parameter to 12 in config_v1_wavlm.json the train.py was only executed until line 136 for i, batch in pb:. Its not an memory issue as I still have more than 12GB free on my GPU but it seems for some reason the script skips the for loop if you increase the batch size in the json file.

Considering context around source features

Hi,

I had an idea, wanted to run it by you. So right now, for each source feature, you are doing k-means with the reference features. I'm thinking that the surrounding source features also might have useful information that could help you better nail down the correct reference feature.

So for example my source features are [s1 s2 ... s100] and reference features (lets just assume k = 1) are [r1 r2 ... r100]. If you consider the sources features by themselves, maybe s1 maps to r22 and s2 maps to r77. But if you were to consider s1 and s2 together, they combined would map to [r23, r24] which is more correct.

Let me know what you think about this. Does this make sense/is my scenario plausible?

Thank you.

loss issues encountered in fine-tuning the model

Hello author, this project is great. I am trying to add some Chinese speeches for fine-tuning, but my ’validation/mel_spec_error‘ has almost stopped decreasing at 15k, and ’training/gen_loss_total' has also increased. I would like to ask if this loss change is normal. Thank you so much.

Output is a bit shaky, how to fix that?

Thanks for the great work and making code with all weights available!
Really appreciate it..

Can you please guide me on how to improve the output further?
If we change the vocoder to HiFIGAN V2 or train on more data, how do you think output will change?

Also, how much time does it take to train on train-100 data from librispeech?

Some question about features KNN

Have you tried using Wavlm, which has been fine-tuned on an ASR dataset, to extract semantic features for querying KNN instead of directly using SSL features? Using KNN to obtain timestamps only, then using the timestamps of the reference Wavlm SSL to generate the output.

Error in quickstart

Hi, I'd like to test the KNN-VC model, but I'm getting an error at the very beginning:

Python 3.8.10 (default, May 26 2023, 14:05:08)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'2.0.1+cu117'
>>> import torch, torchaudio
>>> knn_vc = torch.hub.load('bshall/knn-vc', 'knn_vc', prematched=True, trust_repo=True, pretrained=True)
Downloading: "https://github.com/bshall/knn-vc/zipball/master" to /local/home/fa125436/.cache/torch/hub/master.zip
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/data1/is156025/fa125436/N2D2/env/lib/python3.8/site-packages/torch/hub.py", line 558, in load
    model = _load_local(repo_or_dir, model, *args, **kwargs)
  File "/data1/is156025/fa125436/N2D2/env/lib/python3.8/site-packages/torch/hub.py", line 584, in _load_local
    hub_module = _import_module(MODULE_HUBCONF, hubconf_path)
  File "/data1/is156025/fa125436/N2D2/env/lib/python3.8/site-packages/torch/hub.py", line 98, in _import_module
    spec.loader.exec_module(module)
  File "<frozen importlib._bootstrap_external>", line 848, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/local/home/fa125436/.cache/torch/hub/bshall_knn-vc_master/hubconf.py", line 15, in <module>
    from matcher import KNeighborsVC
  File "/local/home/fa125436/.cache/torch/hub/bshall_knn-vc_master/matcher.py", line 32, in <module>
    class KNeighborsVC(nn.Module):
  File "/local/home/fa125436/.cache/torch/hub/bshall_knn-vc_master/matcher.py", line 58, in KNeighborsVC
    def get_matching_set(self, wavs: list[Path] | list[Tensor], weights=None, vad_trigger_level=7) -> Tensor:
TypeError: 'type' object is not subscriptable
>>>

Can you please help? Thanks a lot

Question about the Used Hardware

Thanks for the great work.
Knn-VC produces great results without even the need of training.
One thing I noticed is, I canned use more than 3 minutes of refernce audio.
If I use like 5 minutes of audio, pytorch tries to allocate 15 GB of GPU memory.
I tried it once with 11 minutes of refernce audio, but than over 70 GB of memory space is needed.
What kind of GPU did you use for interfernce? Is it normal that so much memory is requested, or is there an error in how I use the toolbox?
Do you have any usefull tips on how the audio should be provided for the framework? E.g. file format, frequency, one long file or sevreral tiny snippets ...

Would be great to hear from you.

Kind regards
Mr Maure

An error when check input type

thx share the great work
I tried the inference procedure, but can't pass pathlib.Path as input.
it may be caused by type check here
https://github.com/interspeech2023blind/knn-vc/blob/cfba7e0808fcf3df1fe9ed2ff10e2031b25d9954/matcher.py#L89
We should use isinstance() and check pathlib.Path instead of path

Matching pool empty

I sometimes experience a bug when performing matching with big datasets (20k samples+).

This is the Stacktrace:

Feature has shape:  torch.Size([445, 1024])---------------------------------------------------------| 0.02% [1/5293 00:10<15:54:29 train_clean2/102/102-83.flac]
Feature has shape:  torch.Size([400, 1024])---------------------------------------------------------| 0.04% [2/5293 00:15<11:44:33 train_clean2/102/102-60.flac]
Done 1,000/5,293████-------------------------------------------------------------------------------| 18.89% [1000/5293 05:40<24:20 train_clean2/14/14-55.flac]c]
Traceback (most recent call last):-----------------------------------------------------------------| 21.56% [1141/5293 06:36<24:02 train_clean2/15/15-5.flac]]]
  File "/raid/nils/projects/knn-vc/prematch_dataset.py", line 172, in <module>
    main(args)
  File "/raid/nils/projects/knn-vc/prematch_dataset.py", line 51, in main
    extract(ls_df, wavlm, args.device, Path(args.librispeech_path), Path(args.out_path), SYNTH_WEIGHTINGS, MATCH_WEIGHTINGS)
  File "/raid/nils/projects/knn-vc/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/raid/nils/projects/knn-vc/prematch_dataset.py", line 128, in extract
    matching_pool, synth_pool = path2pools(row.path, wavlm, match_weights, synth_weights, device)
  File "/raid/nils/projects/knn-vc/prematch_dataset.py", line 75, in path2pools
    matching_pool = torch.concat(matching_pool, dim=0)
RuntimeError: torch.cat(): expected a non-empty list of Tensors

However the problem seems not to be the file itself. When I change the matching directory to train_clean2/15 the algorithm runs through without any problems. I used the German Distant Speech Data Corpus 2014 / 2015 to run this experiment. I wonder what the root of this error might be, I had directories of 5000+ files run through without a problem but sometimes this bug still appears. For some reason it seems not to find a matching vector.

Hints on improvements for training and matching

First of all thanks for the great model! I tested it extensively by now and ran across a few problems and performance issues which you might can help with.

Matching takes a lot of time with big datasets (1000+ 2min files), since it is not multi-gpu, do you intend to change that in the future?
General behaviour: For training in general, it seems to be better to have a few big files rather than many small files (2min vs 10sec). I think this might be related to the overhead introduced by all the small .pt "models". Can you confirm this or believe this is plausible?
My biggest issue so far is when I try to fine tune the hifi-gan vocoder. My Notebook with a A4500 seems to be on par, even outperforming my DGX Station with 4 x V100 32GB GPUs, which is strange.

I identified the following things:

During validation the operation is performed on all files rather than in a batch (1000+ files). The station and the notebook are both about equally fast in validating all files. However the station uses 4 GPUs which are working at 100% all the time and should be a lot faster. Since this is really slow how often do you think should I perform validation?

During batch training the notebook also outperforms the station by a little bit completing one epoch in about 40sec (station 48sec).
However when I look at nvidia-smi on the station the GPU usage is at 0% all the time.

Unfortunately there seem to be some serious issues with a multi-gpu approach. If I only use one GPU on the station one epoch takes about 17 seconds. Maybe you have an idea what goes wrong here?

Edit:
When I monitored the epochs in a multi-gpu setting it seems the epoch itself was trained really fast in 5 seconds. However before the progressbar appears there seems to happen some loading which takes the other 50 seconds. Do you know which process is responsible for that time gap or how to minmize it?

What I tried so far:
I tried to use different batch sizes and adjusted the number of workers in the config, but it did no really change results that much,

An error when loading models

I try:
import torch, torchaudio
knn_vc = torch.hub.load('/home/knn-vc', 'knn_vc', prematched=True, trust_repo=True, pretrained=True)

There is an error when loading models:
Traceback (most recent call last):
File "/home/knn-vc/inf.py", line 3, in
knn_vc = torch.hub.load('knn-vc', 'knn_vc', prematched=True, trust_repo=True, pretrained=True)
File "/home/miniconda3/envs/knvc/lib/python3.10/site-packages/torch/hub.py", line 555, in load
repo_or_dir = _get_cache_or_reload(repo_or_dir, force_reload, trust_repo, "load",
File "/home/miniconda3/envs/knvc/lib/python3.10/site-packages/torch/hub.py", line 199, in _get_cache_or_reload
repo_owner, repo_name, ref = _parse_repo_info(github)
File "/home/miniconda3/envs/knvc/lib/python3.10/site-packages/torch/hub.py", line 135, in _parse_repo_info
repo_owner, repo_name = repo_info.split('/')
ValueError: not enough values to unpack (expected 2, got 1)

Training HiFiGAN on higher quality data

Hey, I was wondering what sort of changes it would take to the training script to be able to train HiFiGAN on higher quality data like LibriTTS or LibriTTS-R. The dataset uses wav files instead of flac files and is 24kHz sampling rate. I can preprocess the dataset to be 16kHz and make changes to the files in data_splits to work with wav files, but I wanted to know what the best way to work with this kind of data would be. If there are other ways to help improve the general quality of the outputs, I'd be happy to explore those too. Any help would be great, thanks!

bigvgan as vocoder

Is it possible to use bigvgan instead of hifigan for the vocoder?

SoX effect fails on Windows with SoundFile backend

Traceback (most recent call last):
  File "\.cache\torch\hub\bshall_knn-vc_master\matcher.py", line 72, in get_matching_set
    feats.append(self.get_features(p, weights=self.weighting if weights is None else weights, vad_trigger_level=vad_trigger_level))
  File "\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "\.cache\torch\hub\bshall_knn-vc_master\matcher.py", line 105, in get_features
    waveform_reversed, sr = apply_effects_tensor(x_front_trim, sr, [["reverse"]])
  File "\lib\site-packages\torchaudio\_internal\module_utils.py", line 73, in wrapped
    raise RuntimeError(f"{func.__module__}.{func.__name__} {message}")
RuntimeError: torchaudio.sox_effects.sox_effects.apply_effects_tensor requires sox extension, but TorchAudio is not compiled with it. Please build TorchAudio with libsox support.

It seems to work correctly after patching the apply_effects_tensor method since reverse is the only effect used but probably not the most elegant solution.

torchaudio.sox_effects.apply_effects_tensor = lambda waveform, sample_rate, _: (
    torch.flip(waveform, (-1,)),
    sample_rate,
)

Will this work with 44100hz audio?

Just wondering because this project seems great but 16000hz is a bit too low frequency for my needs.

How to plug-in new finetuned HiFiGAN?

Hey! I know you wrote about this here: #23

I think I'm going to go ahead and try those steps now

However, I did find this 48K HiFiGAN model someone trained (seemingly I could be wrong): https://github.com/vtuber-plan/hifi-gan/releases/tag/v0.3.1

Is it possible to plug that .pt checkpoint into knn-vc as is or does it need to still be trained further?

Using knn-vc for audio streams

Hey!

I wonder if knn-vc could work for audio streams (say, in chunks of audio of 10-500ms) instead of whole audio files. Has this been explored? Could it work? I could not find any info online. I imagine that if the two successive audio chunks would split a phoneme in two this could cause problems?

out_wav is a wav file?

How can I save the out_wav to a wav file? thanks very much.

Size mismatch error

Hi! I'm trying to run basic quickstart script, but it gives me

Traceback (most recent call last):
  File "/data/code_jb/knn-vc/test_run.py", line 11, in <module>
    out_wav = knn_vc.match(query_seq, matching_set, topk=4)
  File "/data/SOFT/miniconda/envs/ml2/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/i.beskrovnyy/.cache/torch/hub/bshall_knn-vc_master/matcher.py", line 158, in match
    dists = fast_cosine_dist(query_seq, matching_set, device=device)
  File "/home/i.beskrovnyy/.cache/torch/hub/bshall_knn-vc_master/matcher.py", line 25, in fast_cosine_dist
    dotprod = -torch.cdist(source_feats[None].to(device), matching_pool[None], p=2)[0]**2 + source_norms[:, None]**2 + matching_norms[None]**2
RuntimeError: The size of tensor a (782) must match the size of tensor b (543) at non-singleton dimension 2

My src and target files are different in samplerate and length, can it be the problem?

prematch argument

Hi,

Thank you for this great repo!

In the readme file you mention that prematch option applies prematching the features. But in the code I see that prematch saves all the source features AS-IS without matching them one to each other. Can you please elaborate, I think I'm missing something.

Thanks

Torch Hub CPU inference support

Currently it seems your repository only supports running on GPU, and gives the error

knn_vc = torch.hub.load('bshall/knn-vc', 'knn_vc', prematched=True, trust_repo=True, pretrained=True) 
Using cache found in C:\Users\Skyler/.cache\torch\hub\bshall_knn-vc_master
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\Users\Skyler\Documents\startai_tts_foss\.conda\lib\site-packages\torch\hub.py", line 558, in load
    model = _load_local(repo_or_dir, model, *args, **kwargs)
  File "c:\Users\Skyler\Documents\startai_tts_foss\.conda\lib\site-packages\torch\hub.py", line 587, in _load_local
    model = entry(*args, **kwargs)
  File "C:\Users\Skyler/.cache\torch\hub\bshall_knn-vc_master\hubconf.py", line 20, in knn_vc
    hifigan, hifigan_cfg = hifigan_wavlm(pretrained, progress, prematched, device)
  File "C:\Users\Skyler/.cache\torch\hub\bshall_knn-vc_master\hubconf.py", line 36, in hifigan_wavlm
    generator = HiFiGAN(h).to(device)
  File "c:\Users\Skyler\Documents\startai_tts_foss\.conda\lib\site-packages\torch\nn\modules\module.py", line 1145, in to
    return self._apply(convert)
  File "c:\Users\Skyler\Documents\startai_tts_foss\.conda\lib\site-packages\torch\nn\modules\module.py", line 797, in _apply
    module._apply(fn)
  File "c:\Users\Skyler\Documents\startai_tts_foss\.conda\lib\site-packages\torch\nn\modules\module.py", line 820, in _apply
    param_applied = fn(param)
  File "c:\Users\Skyler\Documents\startai_tts_foss\.conda\lib\site-packages\torch\nn\modules\module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "c:\Users\Skyler\Documents\startai_tts_foss\.conda\lib\site-packages\torch\cuda\__init__.py", line 239, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

Can you modify the hubconf to support cpu only systems too.

Using better neural vocoder GAN (BIGVGAN, RVQGAN ...)

Anyone tried to use a different, more recent (and supposedly) better vocoder ?

HiFI-Gan is already a bit old and better options appeared like BIGVGAN and maybe RVQGAN.

Will this work for singing voice conversion (svc)?

Great repo! Ran some tests with it and it sounds good for speech, but the limited testing I did for singing didn't sound too great. Is this expected / is there a way to adapt it to work well with singing? Perhaps switch it to use NSF-HiFiGAN as so-vits-svc does?

P.S. I especially like the zero-shot any-to-any nature of this model, not sure if there are other projects out there now for zero shot svc.

WavLM Base+ over Large?

First, thanks for the paper and the code, this is very interesting!
Did you happen to do any testing with other versions of WavLM, such as Base or Base+? I was wondering if it would be possible to make this lighter without impacting the quality too much.

Link to paper

Hi, I was wondering if you can provide me the link to the research paper, thanks

Choice for k

Hi, the paper goes over the choice for k very briefly, so I was wondering if you could share some results of the preliminary experiments. It says "when more reference audio is available (e.g. ≥10 mins), the conversion quality may even be improved by using larger values of k (in the order of k = 20)"; does the quality keeps getting better past k=20, or does it start degrading after certain point? Also, did you try k=1, which happens to be the approach this project uses? If so, what were the results?

extend to other SSL model features

Hi authors,

This is an interesting work on VC! Have you tried applying the same idea on codec latents as well? I read that you've tried on hubert features and it worked too, but I'm wondering if you tested on models like encodec / soundstream, or if you have any insights on them. Thanks!

Best,
Dongyao