bshall / soft-vc

Soft speech units for voice conversion

Home Page: https://bshall.github.io/soft-vc/

License: MIT License

Jupyter Notebook 100.00%
self-supervised-learning speech-synthesis voice-conversion

soft-vc's Issues

is real-time voice conversion possible?

Hi, I'm very impressed by the VC framework. It's very fast and accurate.
I'm wondering whether real-time conversion is possible. I have a simple WebSocket server that receives audio, but when I push the data through soft-vc, the end result is just noise. In the code below, I save the input stream to confirm that the audio is being received correctly (which it is).
Here is a snippet of my code:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Use the selected device rather than hard-coding CUDA
hubert = torch.hub.load("bshall/hubert:main", "hubert_soft").to(device)

acoustic_load_path = "./pretrained_models/acoustic.pt"
checkpoint = torch.load(acoustic_load_path, map_location=device)["acoustic-model"]
acoustic = AcousticModel().to(device)
acoustic.load_state_dict(checkpoint)
acoustic.eval()

# load custom vocoder
hifigan_load_path = "./pretrained_models/hifigan.pt"
checkpoint = torch.load(hifigan_load_path, map_location=device)[
    "generator"]["model"]
hifigan = HifiganGenerator().to(device)
consume_prefix_in_state_dict_if_present(checkpoint, "module.")
hifigan.load_state_dict(checkpoint)
hifigan.eval()
hifigan.remove_weight_norm()



inputs = []
outputs = []
while True:
    data = None
    try:
        data = await websocket.recv()
    except:
        break
    if isinstance(data, str):
        print(f"string -> {data}")
        continue

    source = torch.from_numpy(numpy.frombuffer(
        data, dtype=numpy.int16).astype('float32') / 32767)
    source = source.reshape((1, -1))
    source = source.unsqueeze(0).to(device)  # shape (1, 1, num_samples)
    # Convert to the target speaker
    with torch.inference_mode():
        # Extract speech units
        units = hubert.units(source)
        # Generate target spectrogram
        mel = acoustic.generate(units).transpose(1, 2)
        # Generate audio waveform
        target = hifigan(mel)
        inputs.append(source.squeeze(0).cpu())
        outputs.append(target.squeeze(0).cpu())
        await ws.send(data)

print(f"saving files...")
input_result = torch.cat(inputs, dim=1)
torchaudio.save("inputs.wav", input_result, sample_rate=16_000)
output_result = torch.cat(outputs, dim=1)
torchaudio.save("outputs.wav", output_result, sample_rate=16_000)
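The loop above converts every websocket message on its own. One possible contributor to the noisy output, not confirmed anywhere in this thread, is that individual messages are too short to give the models useful context. A minimal sketch of one way to test that idea is to buffer the incoming int16 PCM bytes and only release fixed-size windows for conversion. The class name, the one-second window, and the pure-Python normalization are all hypothetical choices for illustration:

```python
from array import array

SAMPLE_RATE = 16_000          # rate used when saving inputs.wav / outputs.wav above
WINDOW_SAMPLES = SAMPLE_RATE  # hypothetical choice: convert one second at a time


class PcmWindowBuffer:
    """Collects int16 PCM byte chunks and releases fixed-size float windows."""

    def __init__(self, window_samples=WINDOW_SAMPLES):
        self.window_samples = window_samples
        self._pending = bytearray()

    def push(self, chunk):
        """Add one message; return a list of complete windows (lists of floats)."""
        self._pending.extend(chunk)
        window_bytes = self.window_samples * 2  # 2 bytes per int16 sample
        windows = []
        while len(self._pending) >= window_bytes:
            samples = array("h")  # native-endian signed 16-bit, like numpy.int16
            samples.frombytes(bytes(self._pending[:window_bytes]))
            del self._pending[:window_bytes]
            # Same scaling as the snippet above: int16 -> roughly [-1, 1]
            windows.append([s / 32767 for s in samples])
        return windows
```

Each returned window could then be turned into a tensor of shape (1, 1, num_samples) and passed through hubert.units, acoustic.generate and hifigan as in the loop above. Leftover samples stay in the buffer until the next message arrives.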

skipped phonemes in generated audio

Hi, thank you for sharing your code.

I am trying to do voice conversion from English speech to a Vietnamese speaker. To do that, I took the following steps:

  • extract units for both English and Vietnamese dataset
  • train kmeans on both types of units & extract discrete labels
  • train soft encoder
  • extract soft units
  • train acoustic model
  • train hifigan on Vietnamese dataset

The output for Vietnamese speech (where the input audio is Vietnamese, from a different speaker) is okay, but the output for English input is not as good: phonemes are often skipped or mispronounced. Do you have any suggestions on how I can improve the results?
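For readers following the steps listed above, the "extract discrete labels" step amounts to assigning each frame-level feature vector to its nearest k-means centroid. The sketch below shows only that assignment in plain Python, with toy 2-D vectors standing in for real HuBERT features. The function name and the data are hypothetical, and this is not the training code used by the repository:

```python
def nearest_centroid_labels(features, centroids):
    """Return, for each feature vector, the index of the closest centroid."""

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(centroids)), key=lambda k: squared_distance(f, centroids[k]))
        for f in features
    ]


# Toy example: two centroids, three frames
centroids = [[0.0, 0.0], [10.0, 10.0]]
frames = [[1.0, 0.0], [9.0, 11.0], [4.0, 4.0]]
labels = nearest_centroid_labels(frames, centroids)
```

When the centroids are fit on a mix of English and Vietnamese features, frames from both languages share one label inventory, which is what the pipeline above relies on.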

Can you share the training scripts?

I have tried the inference example and the result is exciting.

Could you share the training scripts with me, or open-source them?

This will help me a lot.

Thanks!

About fine-tuning.

Hi @bshall. I got impressive results when converting singing voice samples, so I am trying to understand how to fine-tune on a specific singer's voice. Do I need to train each individual component for this?

K-means training

What accuracy did you achieve while training the k-means model in the content encoder?

speech resynthesis?

Hi, thanks for the sample code! It is very easy to use and gives impressive results. I am wondering whether your model can be used to resynthesize the speaker's own voice, rather than for voice conversion?

Bug: TypeError: hubert_soft() got an unexpected keyword argument 'trust_repo'

I would like to ask if anyone has encountered this problem while doing experiments? How was it solved?

Using cache found in /data0/home/Liqy/.cache/torch/hub/bshall_hubert_main
Traceback (most recent call last):
  File "/tmp/pycharm_project_363/test.py", line 4, in <module>
    hubert = torch.hub.load("bshall/hubert:main", "hubert_soft", trust_repo=True).cuda()
  File "/Nas/Liqy/hubert-main/lib/python3.10/site-packages/torch/hub.py", line 404, in load
    model = _load_local(repo_or_dir, model, *args, **kwargs)
  File "/Nas/Liqy/hubert-main/lib/python3.10/site-packages/torch/hub.py", line 433, in _load_local
    model = entry(*args, **kwargs)
TypeError: hubert_soft() got an unexpected keyword argument 'trust_repo'

Is it caused by my PyTorch version? Could you tell me which PyTorch version you use, for reference? Thank you very much.
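The traceback shows torch/hub.py forwarding trust_repo to hubert_soft(), which suggests that the installed PyTorch does not treat trust_repo as an argument of torch.hub.load itself (as far as I can tell, that argument was added in PyTorch 1.12). Upgrading PyTorch or dropping trust_repo=True would be the likely fixes. A minimal sketch of a version-tolerant call follows; the helper name and the two stand-in loaders are hypothetical and only mimic the behaviour seen in the traceback:

```python
import inspect


def supported_kwargs(load_fn, **wanted):
    """Keep only the options that load_fn declares as named parameters."""
    named = inspect.signature(load_fn).parameters
    return {k: v for k, v in wanted.items() if k in named}


# Hypothetical stand-ins for the model entry point and two loader versions.
def hubert_soft_stub(pretrained=True):
    return "model"


def old_load(repo, model, *args, force_reload=False, **kwargs):
    # Unknown keyword arguments fall through to the entry point, as in the traceback.
    return hubert_soft_stub(*args, **kwargs)


def new_load(repo, model, *args, trust_repo=None, force_reload=False, **kwargs):
    # trust_repo is consumed here and never reaches the entry point.
    return hubert_soft_stub(*args, **kwargs)


for load in (old_load, new_load):
    options = supported_kwargs(load, trust_repo=True)
    print(load.__name__, options, load("bshall/hubert:main", "hubert_soft", **options))
```

With a real installation, the same idea would look like torch.hub.load("bshall/hubert:main", "hubert_soft", **supported_kwargs(torch.hub.load, trust_repo=True)).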

Difference between SSL and PPG-based methods?

Hi, I really appreciate your work; the demo sounds great.
I also read papers about PPG-based VC, which uses ASR for PPG extraction. I just wonder about the difference between SSL and PPG-based methods. It seems they both extract some information about linguistics. Have you ever compared them?
Thank you!

Discrete content encoder example

@bshall, this repo gives a simple inference example based on the soft content encoder. Could you also give an inference example based on the discrete encoder? I ran into a problem when trying to use it.

When execution reaches mel = acoustic.generate(units).transpose(1, 2),
a dimension mismatch occurs. The traceback and my code are below. Looking forward to your suggestions, thanks a lot.

  File "D:\Software-location\Anaconda\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\YoungTown/.cache\torch\hub\bshall_acoustic-model_main\acoustic\model.py", line 23, in generate
    x = self.encoder(x)
  File "D:\Software-location\Anaconda\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\YoungTown/.cache\torch\hub\bshall_acoustic-model_main\acoustic\model.py", line 49, in forward
    x = self.convs(x.transpose(1, 2))
IndexError: Dimension out of range (expected to be in range of [-2, 1], but got 2)

import torch, torchaudio
hubert = torch.hub.load("bshall/hubert:main", "hubert_discrete").cuda()
acoustic = torch.hub.load("bshall/acoustic-model:main", "hubert_discrete").cuda()
hifigan = torch.hub.load("bshall/hifigan:main", "hifigan_hubert_discrete").cuda()

source, sr = torchaudio.load("./huohuo.wav")
source = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)(source)
source = source[0].unsqueeze(0).unsqueeze(0).cuda()

with torch.inference_mode():
    units = hubert.units(source)
    mel = acoustic.generate(units).transpose(1, 2)
    target = hifigan(mel)

    # Drop singleton axes, then restore a channel axis for torchaudio.save
    target = target.squeeze().unsqueeze(0).cpu()
    torchaudio.save('./huohuo_new.wav', target, 16000)
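The error message "expected to be in range of [-2, 1]" says that the tensor reaching x.transpose(1, 2) has only two dimensions. One plausible explanation, an assumption that is not confirmed in this thread, is that the discrete hubert.units call returns labels of shape (T,) with no batch axis, so that an embedding lookup yields (T, dim) instead of (1, T, dim). The sketch below only does shape bookkeeping with tuples to illustrate that reasoning. The helper names, the embedding size of 256, and the frame count are hypothetical:

```python
def embed(shape, dim=256):
    """An embedding lookup appends a feature axis: (...,) -> (..., dim)."""
    return shape + (dim,)


def transpose(shape, a, b):
    """Swap two axes, failing the way the traceback does when an axis is missing."""
    n = len(shape)
    for axis in (a, b):
        if not -n <= axis < n:
            raise IndexError(
                f"Dimension out of range (expected to be in range of "
                f"[{-n}, {n - 1}], but got {axis})"
            )
    out = list(shape)
    out[a], out[b] = out[b], out[a]
    return tuple(out)


T = 50  # hypothetical number of frames

# Units without a batch axis: (T,) -> (T, 256), so axis 2 does not exist.
try:
    transpose(embed((T,)), 1, 2)
except IndexError as err:
    print(err)

# Units with a batch axis, e.g. after units.unsqueeze(0): (1, T) -> (1, 256, T).
print(transpose(embed((1, T)), 1, 2))
```

If that assumption holds, calling units = units.unsqueeze(0) before acoustic.generate(units) would be worth trying.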

Interesting work!

Hi Benjamin,

Looked at your demo page, it looks nice!
Will the paper be on arxiv?
Looking forward to it.
