
serp-ai / bark-with-voice-clone


This project forked from suno-ai/bark


🔊 Text-prompted Generative Audio Model - With the ability to clone voices

Home Page: https://serp.ai/tools/bark-text-to-speech-ai-voice-clone-app

License: Other

Python 33.09% Jupyter Notebook 66.91%
ai-text-to-speech-tools ai-rapper-voice-generator ai-singing-voice-generator ai-voice-changers ai-voice-generator-celebrity ai-voice-generator-free ai-voice-reader voice-ai-platform ai-voice-clone ai-voice-cloning-app ai-voice-clonining

bark-with-voice-clone's Introduction

๐Ÿถ BARK AI: but with the ability to use voice cloning on custom audio samples

For RVC, git clone https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI and train your own model, or point the code at a model you already have (the RVC repo must be cloned inside the bark-with-voice-clone directory).

If you want to clone a voice, follow the clone_voice.ipynb notebook. If you want to generate audio from text, follow the generate.ipynb notebook.

To create a voice clone sample, you need an audio sample of around 5-12 seconds.

You will get the best results by making generations with your cloned voice until you find one that is really close to the source, then using that generation as the new history prompt (since it comes from the model itself, it should theoretically be more consistent).
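A minimal sketch of that workflow, assuming generate_audio's output_full flag returns the full generation alongside the audio array (the .npz keys below match what clone_voice.ipynb saves; paths are placeholders):

import numpy as np
from bark.api import generate_audio

# Generate with an existing cloned voice and keep the full token history
# (assumes output_full=True returns (full_generation, audio_array)).
full_generation, audio_array = generate_audio(
    "A short test sentence in my cloned voice.",
    history_prompt="bark/assets/prompts/my_clone.npz",  # hypothetical clone
    text_temp=0.7,
    waveform_temp=0.7,
    output_full=True,
)

# If this generation sounds close to the source, save it as the new prompt,
# using the same .npz keys the clone_voice.ipynb notebook writes.
np.savez(
    "bark/assets/prompts/my_clone_v2.npz",
    semantic_prompt=full_generation["semantic_prompt"],
    coarse_prompt=full_generation["coarse_prompt"],
    fine_prompt=full_generation["fine_prompt"],
)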

Contributors

Huge shoutout & thank you to:

gitmylo for the solution to the semantic token generation for better voice clones and finetunes (HuBERT, etc.)


francislabountyjr gkucsko kmfreyberg Vaibhavs10 devinschumacher mcamac fiq zygi jn-jairo gitmylo alyxdow mikeyshulman

Original README.md

🤖 Usage

from bark import SAMPLE_RATE, generate_audio, preload_models
from IPython.display import Audio

# download and load all models
preload_models()

# generate audio from text
text_prompt = """
     Hello, my name is Serpy. And, uh — and I like pizza. [laughs]
     But I also have other interests such as playing tic tac toe.
"""
audio_array = generate_audio(text_prompt)

# play text in notebook
Audio(audio_array, rate=SAMPLE_RATE)
pizza.webm

To save audio_array as a WAV file:

from scipy.io.wavfile import write as write_wav

write_wav("/path/to/audio.wav", SAMPLE_RATE, audio_array)

🌎 Foreign Language

Bark supports various languages out-of-the-box and automatically determines language from input text. When prompted with code-switched text, Bark will attempt to employ the native accent for the respective languages. English quality is best for the time being, and we expect other languages to further improve with scaling.

text_prompt = """
    Buenos días Miguel. Tu colega piensa que tu alemán es extremadamente malo.
    But I suppose your english isn't terrible.
"""
audio_array = generate_audio(text_prompt)
miguel.webm

🎶 Music

Bark can generate all types of audio, and, in principle, doesn't see a difference between speech and music. Sometimes Bark chooses to generate text as music, but you can help it out by adding music notes around your lyrics.

text_prompt = """
    ♪ In the jungle, the mighty jungle, the lion barks tonight ♪
"""
audio_array = generate_audio(text_prompt)
lion.webm

🎤 Voice Presets and Voice/Audio Cloning

Bark has the capability to fully clone voices - including tone, pitch, emotion and prosody. The model also attempts to preserve music, ambient noise, etc. from input audio. However, to mitigate misuse of this technology, we limit the audio history prompts to a set of fully synthetic, Suno-provided options for each language. Specify following the pattern: {lang_code}_speaker_{0-9}.

text_prompt = """
    I have a silky smooth voice, and today I will tell you about 
    the exercise regimen of the common sloth.
"""
audio_array = generate_audio(text_prompt, history_prompt="en_speaker_1")
sloth.webm

Note: since Bark recognizes languages automatically from input text, it is possible to use, for example, a German history prompt with English text. This usually leads to English audio with a German accent.

👥 Speaker Prompts

You can provide certain speaker prompts such as NARRATOR, MAN, WOMAN, etc. Please note that these are not always respected, especially if a conflicting audio history prompt is given.

text_prompt = """
    WOMAN: I would like an oatmilk latte please.
    MAN: Wow, that's expensive!
"""
audio_array = generate_audio(text_prompt)
latte.webm

💻 Installation

pip install git+https://github.com/suno-ai/bark.git

or

git clone https://github.com/suno-ai/bark
cd bark && pip install . 

๐Ÿ› ๏ธ Hardware and Inference Speed

Bark has been tested and works on both CPU and GPU (pytorch 2.0+, CUDA 11.7 and CUDA 12.0). Running Bark requires running >100M parameter transformer models. On modern GPUs and PyTorch nightly, Bark can generate audio in roughly realtime. On older GPUs, default colab, or CPU, inference time might be 10-100x slower.
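If you are on CPU or a modest GPU, the model-loading flags help; a minimal sketch, assuming the preload_models keyword arguments used elsewhere in this repo:

from bark.generation import preload_models

# Load everything on CPU (works without CUDA, but is slow). On a capable GPU,
# flip the *_use_gpu flags to True; the *_use_small flags trade quality for memory.
preload_models(
    text_use_gpu=False, text_use_small=False,
    coarse_use_gpu=False, coarse_use_small=False,
    fine_use_gpu=False, fine_use_small=False,
    codec_use_gpu=False,
    force_reload=False,
)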

โš™๏ธ Details

Similar to Vall-E and some other amazing work in the field, Bark uses GPT-style models to generate audio from scratch. Different from Vall-E, the initial text prompt is embedded into high-level semantic tokens without the use of phonemes. It can therefore generalize to arbitrary instructions beyond speech that occur in the training data, such as music lyrics, sound effects or other non-speech sounds. A subsequent second model is used to convert the generated semantic tokens into audio codec tokens to generate the full waveform. To enable the community to use Bark via public code we used the fantastic EnCodec codec from Facebook to act as an audio representation.
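To make those stages concrete, here is a sketch of the lower-level pipeline; the function names come from bark/generation.py as used in this repo's notebooks, and the sampling values are illustrative only:

from bark.generation import (
    generate_text_semantic,  # text -> semantic tokens
    generate_coarse,         # semantic tokens -> coarse EnCodec tokens
    generate_fine,           # coarse tokens -> fine EnCodec tokens
    codec_decode,            # fine tokens -> waveform
)

text = "Hello, this sentence is generated stage by stage."
x_semantic = generate_text_semantic(text, temp=0.7, top_k=50, top_p=0.95)
x_coarse = generate_coarse(x_semantic, temp=0.7, top_k=50, top_p=0.95)
x_fine = generate_fine(x_coarse, temp=0.5)
audio_array = codec_decode(x_fine)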

Below is a list of some known non-speech sounds

  • [laughter]
  • [laughs]
  • [sighs]
  • [music]
  • [gasps]
  • [clears throat]
  • — or ... for hesitations
  • ♪ for song lyrics
  • capitalization for emphasis of a word
  • MAN/WOMAN: for bias towards speaker

Supported Languages

Language Status
English (en) ✅
German (de) ✅
Spanish (es) ✅
French (fr) ✅
Hindi (hi) ✅
Italian (it) ✅
Japanese (ja) ✅
Korean (ko) ✅
Polish (pl) ✅
Portuguese (pt) ✅
Russian (ru) ✅
Turkish (tr) ✅
Chinese, simplified (zh) ✅
Arabic Coming soon!
Bengali Coming soon!
Telugu Coming soon!


bark-with-voice-clone's Issues

Inconsistent generation

Hello, I successfully cloned my voice but the results are pretty inconsistent.
I tried cloning with samples of 2 seconds, 3 seconds, 5 and even 7, but nothing seems to work.
To explain better: after I cloned my voice, if I try to generate an audio file one of these things happens:

  1. The audio is understandable but it's definitely not my voice
  2. I get random noises like high whistles, music or buzz sounds
  3. I hear something very close to my voice but it just emits long "hmmmmm....." like sounds, no matter what I write

What is your experience with cloning voices? Are there parameters we can set, or specific phrases, that would help make the voice cloning process better?

I think we should create a space like a subreddit or a Discord to share our prompts and experience in order to refine the voice cloning process.
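One knob worth trying first is simply lowering the two temperatures passed to generate_audio, so sampling stays closer to the cloned history prompt; a rough sketch (values are illustrative, the prompt path is a placeholder):

from bark.api import generate_audio

# Lower temperatures tend to stay closer to the history prompt.
audio_array = generate_audio(
    "A short, clearly articulated test sentence.",
    history_prompt="bark/assets/prompts/my_clone.npz",  # placeholder path
    text_temp=0.6,
    waveform_temp=0.6,
)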

Voice cloning - unexpected results

Hi, Francis and all contributors!

I'm trying to clone my voice from a .wav file to get pronunciation of the text in Russian, following https://github.com/serp-ai/bark-with-voice-clone/blob/main/clone_voice.ipynb. The main problem is that the resulting .wav contains speech in English, not Russian. I am running the code on a MacBook with an M2 chip (on CPU).

Am I missing something? 😞

Here is my code:

import sys
sys.path.append('bark-voice-cloning-HuBERT-quantizer')

from encodec.utils import convert_audio
from hubert.hubert_manager import HuBERTManager
from hubert.pre_kmeans_hubert import CustomHubert
from hubert.customtokenizer import CustomTokenizer
from bark.api import generate_audio
from bark.generation import SAMPLE_RATE, preload_models, load_codec_model
import torchaudio
import torch
import os
import numpy as np
from scipy.io.wavfile import write as write_wav

device = torch.device('cpu')
model = load_codec_model(use_gpu=False)

hubert_manager = HuBERTManager()
hubert_manager.make_sure_hubert_installed()
hubert_manager.make_sure_tokenizer_installed()

# Load the HuBERT model
hubert_model = CustomHubert(checkpoint_path='data/models/hubert/hubert.pt').to(device)

# Load the CustomTokenizer model
tokenizer = CustomTokenizer.load_from_checkpoint('data/models/hubert/tokenizer.pth').to(device)  # Automatically uses the right layers

audio_filepath = '/Users/brasd99/Downloads/original.wav' # the audio you want to clone (under 13 seconds)
wav, sr = torchaudio.load(audio_filepath)
wav = convert_audio(wav, sr, model.sample_rate, model.channels)
wav = wav.to(device)

semantic_vectors = hubert_model.forward(wav, input_sample_hz=model.sample_rate)
semantic_tokens = tokenizer.get_token(semantic_vectors)

# Extract discrete codes from EnCodec
with torch.no_grad():
    encoded_frames = model.encode(wav.unsqueeze(0))
codes = torch.cat([encoded[0] for encoded in encoded_frames], dim=-1).squeeze()  # [n_q, T]

# move codes to cpu
codes = codes.cpu().numpy()
# move semantic tokens to cpu
semantic_tokens = semantic_tokens.cpu().numpy()

voice_filename = 'ru_speaker_1'

current_path = os.getcwd()
voice_name = os.path.join(current_path, f'{voice_filename}.npz')

np.savez(voice_name, fine_prompt=codes, coarse_prompt=codes[:2, :], semantic_prompt=semantic_tokens)

# Enter your prompt and speaker here
text_prompt = "ะŸั€ะธะฒะตั‚! ะšะฐะบ ั‚ะฒะพะธ ะดะตะปะฐ, ะดั€ัƒะณ? [laughs]"

# download and load all models
preload_models(
    text_use_gpu=False,
    text_use_small=False,
    coarse_use_gpu=False,
    coarse_use_small=False,
    fine_use_gpu=False,
    fine_use_small=False,
    codec_use_gpu=False,
    force_reload=False
)

# simple generation
audio_array = generate_audio(text_prompt, history_prompt=voice_name, text_temp=0.7, waveform_temp=0.7)

# save audio
filepath = "out.wav" # change this to your desired output path
write_wav(filepath, SAMPLE_RATE, audio_array)

AssertionError no matter if it's 4s or 40s or 40 minutes

Hello, I tried with 3 different audio lengths; even at 4s it gives this error.

AssertionError                            Traceback (most recent call last)
Cell In[29], line 10
      1 # generation with more control
      2 x_semantic = generate_text_semantic(
      3     text_prompt,
      4     history_prompt=voice_name,
   (...)
      7     top_p=0.95,
      8 )
---> 10 x_coarse_gen = generate_coarse(
     11     x_semantic,
     12     history_prompt=voice_name,
     13     temp=0.7,
     14     top_k=50,
     15     top_p=0.95,
     16 )
     17 x_fine_gen = generate_fine(
     18     x_coarse_gen,
     19     history_prompt=voice_name,
     20     temp=0.5,
     21 )
     22 audio_array = codec_decode(x_semantic)

File ~\bark-with-voice-clone\bark\generation.py:521, in generate_coarse(x_semantic, history_prompt, temp, top_k, top_p, use_gpu, silent, max_coarse_history, sliding_window_len, model, use_kv_caching)
    519 x_semantic_history = x_history["semantic_prompt"]
    520 x_coarse_history = x_history["coarse_prompt"]
--> 521 assert (
    522     isinstance(x_semantic_history, np.ndarray)
    523     and len(x_semantic_history.shape) == 1
    524     and len(x_semantic_history) > 0
    525     and x_semantic_history.min() >= 0
    526     and x_semantic_history.max() <= SEMANTIC_VOCAB_SIZE - 1
    527     and isinstance(x_coarse_history, np.ndarray)
    528     and len(x_coarse_history.shape) == 2
    529     and x_coarse_history.shape[0] == N_COARSE_CODEBOOKS
    530     and x_coarse_history.shape[-1] >= 0
    531     and x_coarse_history.min() >= 0
    532     and x_coarse_history.max() <= CODEBOOK_SIZE - 1
    533     and (
    534         round(x_coarse_history.shape[-1] / len(x_semantic_history), 1)
    535         == round(semantic_to_coarse_ratio / N_COARSE_CODEBOOKS, 1)
    536     )
    537 )
    538 x_coarse_history = _flatten_codebooks(x_coarse_history) + SEMANTIC_VOCAB_SIZE
    539 # trim histories correctly

AssertionError: 

Please help if you can, thank you so much.

Generation failed with clone_voice.ipynb

I cloned voice like this.

import numpy as np
voice_name = 'ryoppippi' # whatever you want the name of the voice to be
output_path = 'bark/assets/prompts/' + voice_name + '.npz'
np.savez(output_path, fine_prompt=codes, coarse_prompt=codes[:2, :], semantic_prompt=semantic_tokens)

Then, I tried to generate my voice following the notebook like this

from bark.api import generate_audio
from bark.generation import SAMPLE_RATE
text_prompt = "Hello, my name is Suno. And, uh โ€” and I like pizza. [laughs]"
voice_name = "ryoppippi" # use your custom voice name here if you have one

# simple generation
audio_array = generate_audio(text_prompt, history_prompt=voice_name, text_temp=0.7, waveform_temp=0.7)

And it causes error

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[18], line 2
      1 # simple generation
----> 2 audio_array = generate_audio(text_prompt, history_prompt=voice_name, text_temp=0.7, waveform_temp=0.7)

File /workspace/bark-with-voice-clone/bark/api.py:78, in generate_audio(text, history_prompt, text_temp, waveform_temp)
     66 """Generate audio array from input text.
     67 
     68 Args:
   (...)
     75     numpy audio array at sample frequency 24khz
     76 """
     77 x_semantic = text_to_semantic(text, history_prompt=history_prompt, temp=text_temp)
---> 78 audio_arr = semantic_to_waveform(x_semantic, history_prompt=history_prompt, temp=waveform_temp)
     79 return audio_arr

File /workspace/bark-with-voice-clone/bark/api.py:46, in semantic_to_waveform(semantic_tokens, history_prompt, temp)
     31 def semantic_to_waveform(
     32     semantic_tokens: np.ndarray,
     33     history_prompt: Optional[str] = None,
     34     temp: float = 0.7,
     35 ):
     36     """Generate audio array from semantic input.
     37 
     38     Args:
   (...)
     44         numpy audio array at sample frequency 24khz
     45     """
---> 46     x_coarse_gen = generate_coarse(
     47         semantic_tokens,
     48         history_prompt=history_prompt,
     49         temp=temp,
     50     )
     51     x_fine_gen = generate_fine(
     52         x_coarse_gen,
     53         history_prompt=history_prompt,
     54         temp=0.5,
     55     )
     56     audio_arr = codec_decode(x_fine_gen)

File /workspace/bark-with-voice-clone/bark/generation.py:477, in generate_coarse(x_semantic, history_prompt, temp, top_k, top_p, use_gpu, silent, max_coarse_history, sliding_window_len, model)
    475 x_semantic_history = x_history["semantic_prompt"]
    476 x_coarse_history = x_history["coarse_prompt"]
--> 477 assert (
    478     isinstance(x_semantic_history, np.ndarray)
    479     and len(x_semantic_history.shape) == 1
    480     and len(x_semantic_history) > 0
    481     and x_semantic_history.min() >= 0
    482     and x_semantic_history.max() <= SEMANTIC_VOCAB_SIZE - 1
    483     and isinstance(x_coarse_history, np.ndarray)
    484     and len(x_coarse_history.shape) == 2
    485     and x_coarse_history.shape[0] == N_COARSE_CODEBOOKS
    486     and x_coarse_history.shape[-1] >= 0
    487     and x_coarse_history.min() >= 0
    488     and x_coarse_history.max() <= CODEBOOK_SIZE - 1
    489     and (
    490         round(x_coarse_history.shape[-1] / len(x_semantic_history), 1)
    491         == round(semantic_to_coarse_ratio / N_COARSE_CODEBOOKS, 1)
    492     )
    493 )
    494 x_coarse_history = _flatten_codebooks(x_coarse_history) + SEMANTIC_VOCAB_SIZE
    495 # trim histories correctly

AssertionError: 

I'm not sure what is happening.

[laughs] causing repetition of sentence or random repetitive sounds even with the same notebook

I ran the generate.ipynb notebook without changing anything, and the audio produced repeats the second part of the sentence 3 times instead of a [laughs]. I tried changing the temperature but it does not produce the correct audio in any case.
The issue only appears when I use the coarse and fine generation path:

      x_semantic = generate_text_semantic(
          text_prompt,
          history_prompt=voice_name,
          temp=0.1,
          top_k=50,
          top_p=0.95,
      )
      
      x_coarse_gen = generate_coarse(
          x_semantic,
          history_prompt=voice_name,
          temp=0.1,
          top_k=50,
          top_p=0.95,
      )
      x_fine_gen = generate_fine(
          x_coarse_gen,
          history_prompt=voice_name,
          temp=0.1,
      )
      audio_array = codec_decode(x_fine_gen)

If I don't use this part and instead directly call audio_array = generate_audio(text_prompt, history_prompt=voice_name, text_temp=0.1, waveform_temp=0.7), I get the correct speech corresponding to the text.

Fine-tune with LORA, high GPU usage

Hi guys, thank you for this cool work. I am doing some experiments with fine-tuning on a certain speaker recently. The related parameters are set as:

lora_dim = 64
optimize_lora_params_only = True

However, I've noticed that the GPU memory usage and overall training time remain virtually unchanged, regardless of whether optimize_lora_params_only is set to True or False.

Is this expected? I was under the impression that using LoRA for fine-tuning would significantly decrease both GPU memory consumption and training time. Could I be overlooking something? I appreciate your response.

Non-free LICENSE

Non-commercial (NC) clauses are non-free: works under them are not free software according to the FSF, not open source according to the OSI, and not free culture according to Freedom Defined. I would recommend instead using CC-BY-SA-4.0, CC-BY-4.0, or CC0-1.0, which are free-culture Creative Commons licenses.

generate_with_settings Hyperparameter Tuning

Hi,

Several parameters (such as semantic_temp, semantic_top_k, coarse_temp, etc.) have default values, and it is unclear what kind of result you will get if you tweak them. I experiment with them based on their literal meaning, just as I would with an LLM.

Did anyone come up with relatively good parameter values, or a way to tune them for generating voices more consistently? Any clue would be great! Thanks.
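For reference, those parameter names map onto the lower-level calls used in the repo's notebooks; a sketch with commonly used starting values (nothing here is a tuned recommendation, and min_eos_p is assumed to be accepted by generate_text_semantic):

from bark.generation import generate_text_semantic, generate_coarse, generate_fine, codec_decode

voice_name = "bark/assets/prompts/my_clone.npz"  # placeholder history prompt

# semantic_temp / semantic_top_k / semantic_top_p (min_eos_p controls how eagerly
# the semantic stage stops, which affects trailing babble)
x_semantic = generate_text_semantic(
    "A test sentence.", history_prompt=voice_name,
    temp=0.7, top_k=50, top_p=0.95, min_eos_p=0.05,
)
# coarse_temp / coarse_top_k / coarse_top_p
x_coarse = generate_coarse(x_semantic, history_prompt=voice_name, temp=0.7, top_k=50, top_p=0.95)
# fine_temp
x_fine = generate_fine(x_coarse, history_prompt=voice_name, temp=0.5)
audio_array = codec_decode(x_fine)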

CUDA problem in jupyter notebook

Hi there,

When using the "clone_voice" or "generate" notebook in Jupyter, why am I unable to utilize CUDA and PyTorch with my GPU, resulting in the need to use CPU instead, which leads to slower performance?

thanks for this project!
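A quick diagnostic (not a fix) to run in the same Jupyter kernel, to check whether PyTorch can actually see a CUDA device:

import torch

# If this prints False, the kernel is running a CPU-only PyTorch build or
# cannot see the GPU; installing a CUDA-enabled torch in that environment
# is the usual remedy.
print("CUDA available:", torch.cuda.is_available())
print("Torch built with CUDA:", torch.version.cuda)
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))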

No Model Download in new Version

In the old version everything worked fine, but I downloaded the newest version today and in generate.ipynb it won't download the models; it just keeps loading. Same with generate_chunked.ipynb.

Clone_voice.ipynb spitting out import issues for codec

I've been trying to run the clone_voice.ipynb file but I keep getting stuck at the codec import, and there have been some other errors after that. I have the codec installed, but I'm getting the error messages "cannot import name 'codec_encode' from 'bark.generation'", "name 'torchaudio' is not defined", and "name 'torch' is not defined".

ValueError: history prompt not found

I encountered the problem ValueError: history prompt not found while executing clone_voice.ipynb at this part:

# simple generation
audio_array = generate_audio(text_prompt, history_prompt=voice_name, text_temp=0.7, waveform_temp=0.7)

Please help me resolve this problem.
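This error usually means generate_audio cannot find the .npz file. Depending on the version of bark/generation.py, history prompts may only be looked up inside bark/assets/prompts/, so the safest workaround is to save the prompt there and pass just its base name; a hedged sketch (paths are placeholders, and codes/semantic_tokens come from the earlier cloning cells):

import numpy as np
from bark.api import generate_audio

# Save the cloned prompt where generation.py looks for named prompts.
np.savez("bark/assets/prompts/my_voice.npz",
         fine_prompt=codes, coarse_prompt=codes[:2, :], semantic_prompt=semantic_tokens)

# Then reference it by name (no path, no .npz extension).
audio_array = generate_audio("A short test sentence.", history_prompt="my_voice",
                             text_temp=0.7, waveform_temp=0.7)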

preload_models() got an unexpected keyword argument 'path'

After generating the .npz output file, I got an error running this cell:

# download and load all models
preload_models(
    text_use_gpu=True,
    text_use_small=False,
    coarse_use_gpu=True,
    coarse_use_small=False,
    fine_use_gpu=True,
    fine_use_small=False,
    codec_use_gpu=True,
    force_reload=False,
    path="models"
)

The error is "TypeError: preload_models() got an unexpected keyword argument 'path'".

I tried changing the default "models" to path="./output.npz", but that didn't work. What should the path be?


Loss after fine-tune vs loss when load fine-tuned checkpoint different

Thank you for this cool work. I am doing some experiments with fine-tuning on a different language (Vietnamese). I followed the notebook train_semantic.ipynb to train text to semantic. First, I checked the default pretrained model, and the validation loss was 10.90. After training, the validation loss in the notebook was 2.59, and I had the latest checkpoint in semantic_output/pytorch_model.bin. However, when I loaded the latest checkpoint for inference, the validation loss of text_to_semantic was now 6.35. The same problem occurs with semantic_to_coarse. Coarse_to_fine does not face this problem.

When I inferred with generate.ipynb, the quality of the audio was so poor, and it did not correspond to the text. Am I doing something wrong?

Confused model loading behavior, LORA is not used at all?

Hi guys, in generation.py I noticed the following code snippet. It looks like LoRA is not used for inference at all, or is there anything I missed? Thank you

    unwanted_prefix = "_orig_mod."
    for k, v in list(state_dict.items()):
        if k.startswith(unwanted_prefix):
            state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
    unwanted_suffixes = [
        "lora_right_weight",
        "lora_left_weight",
        "lora_right_bias",
        "lora_left_bias",
    ]
    for k, v in list(state_dict.items()):
        for suffix in unwanted_suffixes:
            if k.endswith(suffix):
                state_dict.pop(k)
    # super hacky - should probably refactor this
    if state_dict.get('lm_head.0.weight', None) is not None:
        state_dict['lm_head.weight'] = state_dict.pop('lm_head.0.weight')
    if state_dict.get('lm_heads.0.0.weight', None) is not None:
        state_dict['lm_heads.0.weight'] = state_dict.pop('lm_heads.0.0.weight')
    if state_dict.get('lm_heads.1.0.weight', None) is not None:
        state_dict['lm_heads.1.weight'] = state_dict.pop('lm_heads.1.0.weight')
    if state_dict.get('lm_heads.2.0.weight', None) is not None:
        state_dict['lm_heads.2.weight'] = state_dict.pop('lm_heads.2.0.weight')
    if state_dict.get('lm_heads.3.0.weight', None) is not None:
        state_dict['lm_heads.3.weight'] = state_dict.pop('lm_heads.3.0.weight')
    if state_dict.get('lm_heads.4.0.weight', None) is not None:
        state_dict['lm_heads.4.weight'] = state_dict.pop('lm_heads.4.0.weight')
    if state_dict.get('lm_heads.5.0.weight', None) is not None:
        state_dict['lm_heads.5.weight'] = state_dict.pop('lm_heads.5.0.weight')
    if state_dict.get('lm_heads.6.0.weight', None) is not None:
        state_dict['lm_heads.6.weight'] = state_dict.pop('lm_heads.6.0.weight')

Loss does not change during finetuning

Hey! I am trying to finetune bark with your code to generate children's voices. However, coarse and fine module losses do not decrease during the finetuning. I am using default parameters as in this repo. Am I doing something wrong? Below, you can see the coarse and fine loss graphs respectively.

[graphs: coarse loss curve, fine loss curve]

fine tuning on dataset with multiple speakers

I want to improve the quality of the Russian language output. I have a large dataset, but it contains audio from different speakers.
How should I organize the data from many speakers, and run training?

Error when trying to export Audio

I can generate audio without problems, but when I try to export it, I get errors.

The following error occurs in generate.ipynb:

NameError Traceback (most recent call last)
Cell In[1], line 4
2 # save audio
3 filepath = "/output/audio.wav" # change this to your desired output path
----> 4 write_wav(filepath, SAMPLE_RATE, audio_array)

NameError: name 'SAMPLE_RATE' is not defined

And when I use the generate_chunked.ipynb I get this Error:
FileNotFoundError Traceback (most recent call last)
Cell In[8], line 63
60 audio_array = np.concatenate(all_parts, axis=-1)
62 # save audio
---> 63 write_wav(out_filepath, SAMPLE_RATE, audio_array)
65 # play audio
66 Audio(audio_array, rate=SAMPLE_RATE)

File c:\ProgramData\Anaconda3\lib\site-packages\scipy\io\wavfile.py:767, in write(filename, rate, data)
765 fid = filename
766 else:
--> 767 fid = open(filename, 'wb')
769 fs = rate
771 try:

FileNotFoundError: [Errno 2] No such file or directory: 'audio/audio.wav'
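Both failures are setup issues in the export cell rather than model problems: SAMPLE_RATE has to be imported in the cell that writes the file, and the output directory must exist before scipy can open it. A sketch (audio_array comes from the generation cell above):

import os
from scipy.io.wavfile import write as write_wav
from bark.generation import SAMPLE_RATE

filepath = "audio/audio.wav"  # example output path

# Create the target directory first; otherwise scipy raises FileNotFoundError.
os.makedirs(os.path.dirname(filepath), exist_ok=True)
write_wav(filepath, SAMPLE_RATE, audio_array)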

Assertion error, when generating after cloning

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[15], line 2
      1 # simple generation
----> 2 audio_array = generate_audio(text_prompt, history_prompt=voice_name, text_temp=0.7, waveform_temp=0.7)

File f:\WBC\bark-with-voice-clone\bark\api.py:113, in generate_audio(text, history_prompt, text_temp, waveform_temp, silent, output_full)
     94 """Generate audio array from input text.
     95 
     96 Args:
   (...)
    105     numpy audio array at sample frequency 24khz
    106 """
    107 semantic_tokens = text_to_semantic(
    108     text,
    109     history_prompt=history_prompt,
    110     temp=text_temp,
    111     silent=silent,
    112 )
--> 113 out = semantic_to_waveform(
    114     semantic_tokens,
    115     history_prompt=history_prompt,
    116     temp=waveform_temp,
    117     silent=silent,
    118     output_full=output_full,
...
    570 )
    571 x_coarse_history = _flatten_codebooks(x_coarse_history) + SEMANTIC_VOCAB_SIZE
    572 # trim histories correctly

AssertionError: 

ModuleNotFoundError: No module named 'hubert'

When trying to run 'clone_voice.ipynb', I get:

ModuleNotFoundError Traceback (most recent call last)
in <cell line: 1>()
----> 1 from hubert.hubert_manager import HuBERTManager
2 hubert_manager = HuBERTManager()
3 hubert_manager.make_sure_hubert_installed()
4 hubert_manager.make_sure_tokenizer_installed()

ModuleNotFoundError: No module named 'hubert'
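The hubert package is not on PyPI; it lives in gitmylo's quantizer repo, which the cloning notebook expects to be cloned next to it and added to sys.path, for example:

# First, in the working directory:
#   git clone https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer
import sys
sys.path.append('bark-voice-cloning-HuBERT-quantizer')

from hubert.hubert_manager import HuBERTManager

hubert_manager = HuBERTManager()
hubert_manager.make_sure_hubert_installed()
hubert_manager.make_sure_tokenizer_installed()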

Bad Performance of Voice Cloning

I am using the https://github.com/serp-ai/bark-with-voice-clone/blob/main/clone_voice.ipynb Notebook to generate audio clips similar to one provided by me.

While the code ran well, the resulting audio file was not really very good. I am using speakers with common American and British accents.

Any tips for tuning the model to get correct results, or any parameters to play with?


import sys
sys.path.append('./bark-voice-cloning-HuBERT-quantizer')
import os
from pydub import AudioSegment
from scipy.io.wavfile import write as write_wav
import numpy as np
import torch
import torchaudio
from bark.api import generate_audio
from bark.generation import SAMPLE_RATE, preload_models, load_codec_model
from encodec.utils import convert_audio
from bark_hubert_quantizer.customtokenizer import CustomTokenizer
from bark_hubert_quantizer.hubert_manager import HuBERTManager
from bark_hubert_quantizer.pre_kmeans_hubert import CustomHubert

preload_models(
    text_use_gpu=True,
    text_use_small=False,
    coarse_use_gpu=True,
    coarse_use_small=False,
    fine_use_gpu=True,
    fine_use_small=False,
    codec_use_gpu=True,
    force_reload=False
)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = load_codec_model(use_gpu=True if device == 'cuda' else False)

hubert_manager = HuBERTManager()
hubert_manager.make_sure_hubert_installed()
hubert_manager.make_sure_tokenizer_installed()

# Load the HuBERT model
hubert_model = CustomHubert(checkpoint_path='data/models/hubert/hubert.pt').to(device)

# Load the CustomTokenizer model
tokenizer = CustomTokenizer.load_from_checkpoint('data/models/hubert/tokenizer.pth', map_location=device).to(device)

"""# Inference"""

text_prompt = 'Hello! How are you?, I am Monster from Monster. I make AI Models for all of you here at Blocks and I am really excited about it. I make Generative AI accessible to all' #@param {type:"string"}
audio_filepath = r'/home/qblocks/Cloning/CA_AG_Kamala_Harris_2013_CADEM_Convention.webm' #@param {type:"string"}

def trim_and_convert_audio(input_path, output_path, target_duration_ms=30000):
    # Load the audio file
    print("Loading Audio File:", input_path)
    audio = AudioSegment.from_file(input_path)
    # Get the duration of the audio in milliseconds
    audio_duration = len(audio)
    # Trim the audio to the target duration
    if audio_duration > target_duration_ms:
        trimmed_audio = audio[:target_duration_ms]
    else:
        trimmed_audio = audio
    # Save the trimmed audio as a WAV file
    trimmed_audio.export(output_path, format="wav")
    print("Trimmed audio saved as:", output_path)

output_audio_path = "converted_audio.wav"  
trim_and_convert_audio(audio_filepath, output_audio_path)

if not os.path.isfile(audio_filepath):
  raise ValueError(f"Audio file not exists ({output_audio_path})")

wav, sr = torchaudio.load(output_audio_path)
wav = convert_audio(wav, sr, model.sample_rate, model.channels)
wav = wav.to(device)

semantic_vectors = hubert_model.forward(wav, input_sample_hz=model.sample_rate)
semantic_tokens = tokenizer.get_token(semantic_vectors)

# Extract discrete codes from EnCodec
with torch.no_grad():
    encoded_frames = model.encode(wav.unsqueeze(0))
codes = torch.cat([encoded[0] for encoded in encoded_frames], dim=-1).squeeze()

# move codes to cpu
codes = codes.cpu().numpy()
# move semantic tokens to cpu
semantic_tokens = semantic_tokens.cpu().numpy()

voice_filename = 'output3.npz'
current_path = os.getcwd()
voice_name = os.path.join(current_path, voice_filename)

np.savez(voice_name, fine_prompt=codes, coarse_prompt=codes[:2, :], semantic_prompt=semantic_tokens)

# simple generation
audio_array = generate_audio(text_prompt, history_prompt=voice_name, text_temp=0.8, waveform_temp=0.8)

# save audio
filepath = "out5.wav" # change this to your desired output path
write_wav(filepath,SAMPLE_RATE,audio_array)

How to add other foreign languages?

It is not an issue, but a feature request. I am wondering: is there any simple way to train or fine-tune the model for other languages?

Unusable - Any help?

All the default English voices sound really robotic, and cloning... while it does work, it's still very robotic / low quality, and in some cases it doesn't even speak the words in my prompt; it says something completely different (still in English).

gpu memory changes in recent updates

I have been running bark on my 1650 Ti machine for a couple of months, 6 GB VRAM; it works fine without the small models option.

Then I installed it on my 2070 and I get the memory error discussed in another issue thread.

I took the advice in that thread and just enabled small models, but the audio was 10 times worse than what my 1650 Ti was producing. Just garbled audio every other generation or so.

So then I figured, screw it, I'll copy this working directory from my 1650 Ti PC over to the 2070 and try it, what the heck, right? And sure enough it works fine: it uses all 8 GB of VRAM on my 2070 and the audio garble is gone. It runs much slower now, of course.

I uninstalled PyTorch a hundred times thinking it had to be a PyTorch issue, because why would it run on my 6 GB card and not my 8 GB card? I haven't had time to look through the code to see what changed yet, but I thought I would mention it, as I didn't see this covered in the resolved issues. There are definitely people in other threads treating this as a memory issue, and I guess it is (the current git seems to want 12 GB), but the older version was and is running just fine on my 6 GB and 8 GB cards, and I am NOT running small models; I checked and double-checked. The current git only uses 2 to 4 GB of VRAM and I have to enable small models, but the old git uses all 6 or 8 GB without small models enabled.

Random tail speech generated from fine-tuned model

Hi guys, I have recently tested model fine-tuning on several specific speakers, and I would say the improvement is significant compared to the original base model. Thank you for this great work again. However:

  1. For all the fine-tuned models I tried, regardless of the length of the input text (even as short as "Good morning"), the system always tries to generate audio up to the maximum length (15 seconds). This behavior leads to a lot of randomly generated content at the end of the speech.
  2. When using a fine-tuned model, this is observed not just with the fine-tuned speaker, but also with the default speakers in Bark.

I tried modifying the temperature and the minimum EOS probability, but it did not help. Any thoughts or guidance would be great.

ModuleNotFoundError: No module named 'audiolm_pytorch'

This issue was created on DagsHub by:
tangzeq

When running:
 
# From https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer 
# Load HuBERT for semantic tokens
from hubert.pre_kmeans_hubert import CustomHubert
from hubert.customtokenizer import CustomTokenizer

# Load the HuBERT model
hubert_model = CustomHubert(checkpoint_path='data/models/hubert/hubert.pt').to(device)

# Load the CustomTokenizer model
tokenizer = CustomTokenizer.load_from_checkpoint('data/models/hubert/tokenizer.pth').to(device)  # Automatically uses the right layers

then:

Traceback (most recent call last)
in <module>:3

     1 # From https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer
     2 # Load HuBERT for semantic tokens
-->  3 from hubert.pre_kmeans_hubert import CustomHubert
     4 from hubert.customtokenizer import CustomTokenizer
     5
     6 # Load the HuBERT model

/mnt/workspace/hubert/pre_kmeans_hubert.py:20 in <module>

    17
    18 from torchaudio.functional import resample
    19
--> 20 from audiolm_pytorch.utils import curtail_to_multiple
    21
    22 import logging
    23 logging.root.setLevel(logging.ERROR)
ModuleNotFoundError: No module named 'audiolm_pytorch'
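pre_kmeans_hubert.py imports a helper from lucidrains' audiolm-pytorch package, which is not installed by default; installing it should resolve the error (package name assumed from the import path):

# Run once in the environment that executes the notebook:
#   pip install audiolm-pytorch
#
# Afterwards the import used by hubert/pre_kmeans_hubert.py should resolve:
from audiolm_pytorch.utils import curtail_to_multiple  # noqa: F401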

Can't generate audio in Google Colab

I was modifying the notebook so that it can run on Colab, and training seems to work perfectly fine, but I can't get it to generate any sound. When I get to the generation cell I get this error:

AssertionError                            Traceback (most recent call last)
[<ipython-input-6-18c098f645af>](https://localhost:8080/#) in <cell line: 5>()
      3      But I also have other interests such as playing tic tac toe.
      4 """
----> 5 audio_array = generate_audio(text_prompt, history_prompt="output")
      6 Audio(audio_array, rate=SAMPLE_RATE)

2 frames
[/usr/local/lib/python3.9/dist-packages/bark/generation.py](https://localhost:8080/#) in generate_coarse(x_semantic, history_prompt, temp, top_k, top_p, use_gpu, silent, max_coarse_history, sliding_window_len, model)
    475         x_semantic_history = x_history["semantic_prompt"]
    476         x_coarse_history = x_history["coarse_prompt"]
--> 477         assert (
    478             isinstance(x_semantic_history, np.ndarray)
    479             and len(x_semantic_history.shape) == 1

AssertionError:

Not sure if I did something wrong with training or if something was installed incorrectly.

Clone_voice.ipynb generate_audio AssertionError

I uploaded clone_voice.ipynb to Colab and got an error.

simple generation

audio_array = generate_audio(text_prompt, history_prompt=voice_name, text_temp=0.7, waveform_temp=0.7)

100%|██████████| 100/100 [00:09<00:00, 10.93it/s]

AssertionError Traceback (most recent call last)
in <cell line: 2>()
1 # simple generation
----> 2 audio_array = generate_audio(text_prompt, history_prompt=voice_name, text_temp=0.7, waveform_temp=0.7)

2 frames
/usr/local/lib/python3.10/dist-packages/bark/generation.py in generate_coarse(x_semantic, history_prompt, temp, top_k, top_p, silent, max_coarse_history, sliding_window_len, use_kv_caching)
565 and x_coarse_history.max() <= CODEBOOK_SIZE - 1
566 and (
--> 567 round(x_coarse_history.shape[-1] / len(x_semantic_history), 1)
568 == round(semantic_to_coarse_ratio / N_COARSE_CODEBOOKS, 1)
569 )

AssertionError:

OutOfMemoryError: CUDA out of memory. Tried to allocate 12.00 MiB (GPU 0; 8.00 GiB total capacity; 7.30 GiB already allocated; 0 bytes free; 7.33 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I have cloned the repo and am trying to use my GPU (RTX 3060 Ti, 8 GB) with it, but I am getting OutOfMemoryError.

The error is occurring at line 283 in generation.py

Error line ==> File [d:\ML\bark-with-voice-clone\bark\generation.py:283, in _load_model(ckpt_path, device, use_small, model_type)
281 logger.info(f"model loaded: {round(n_params/1e6,1)}M params, {round(val_loss,3)} loss")
282 model.eval()
--> 283 model.to(device)
284 del checkpoint, state_dict
285 _clear_cuda_cache()

Error message ==> OutOfMemoryError: CUDA out of memory. Tried to allocate 12.00 MiB (GPU 0; 8.00 GiB total capacity; 7.30 GiB already allocated; 0 bytes free; 7.33 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I looked it up on the internet and it seems that reducing the batch size might resolve this issue. However, I am not sure where to make changes and don't want to break it by trying things out on my own.

PS: It worked fine without the GPU, but was slow (of course). I have tried both the notebook and the script snippet from here, but am stuck with the same error when using the GPU.

What would I need to do here onwards in this case?
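Two low-effort things to try before editing the code, both sketches rather than guaranteed fixes: follow the allocator hint from the error message, and/or load the small model variants so everything fits in 8 GB.

import os

# 1) Apply the allocator hint from the error message *before* importing torch/bark
#    (the value is illustrative).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

from bark.generation import preload_models

# 2) Load the smaller checkpoints; quality drops a little, VRAM use drops a lot.
preload_models(
    text_use_gpu=True, text_use_small=True,
    coarse_use_gpu=True, coarse_use_small=True,
    fine_use_gpu=True, fine_use_small=True,
    codec_use_gpu=True,
    force_reload=False,
)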

AssertionError

AssertionError                            Traceback (most recent call last)
Cell In[3], line 2
      1 # simple generation
----> 2 audio_array = generate_audio(text_prompt, history_prompt=voice_name, text_temp=0.7, waveform_temp=0.7)

File f:\WBC\bark-with-voice-clone\bark\api.py:113, in generate_audio(text, history_prompt, text_temp, waveform_temp, silent, output_full)
     94 """Generate audio array from input text.
     95 
     96 Args:
   (...)
    105     numpy audio array at sample frequency 24khz
    106 """
    107 semantic_tokens = text_to_semantic(
    108     text,
    109     history_prompt=history_prompt,
    110     temp=text_temp,
    111     silent=silent,
    112 )
--> 113 out = semantic_to_waveform(
    114     semantic_tokens,
    115     history_prompt=history_prompt,
    116     temp=waveform_temp,
    117     silent=silent,
    118     output_full=output_full,
...
    537 )
    538 x_coarse_history = _flatten_codebooks(x_coarse_history) + SEMANTIC_VOCAB_SIZE
    539 # trim histories correctly

AssertionError: 

bad performance. need help

You will get the best results by making generations with your cloned voice until you find one that is really close to the source. Then use that as the new history prompt (comes from the model so should theoretically be more consistent)
Please add such an example to the notebook if possible. Also, if I have 5 minutes of recordings, how can I fine-tune the model?

EinopsError

Hello,

Just also to point out this error that often pops up for no reason when trying to clone.

EinopsError Traceback (most recent call last)
File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\einops\einops.py:412, in reduce(tensor, pattern, reduction, **axes_lengths)
411 recipe = _prepare_transformation_recipe(pattern, reduction, axes_lengths=hashable_axes_lengths)
--> 412 return _apply_recipe(recipe, tensor, reduction_type=reduction)
413 except EinopsError as e:

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\einops\einops.py:235, in _apply_recipe(recipe, tensor, reduction_type)
233 backend = get_backend(tensor)
234 init_shapes, reduced_axes, axes_reordering, added_axes, final_shapes =
--> 235 _reconstruct_from_shape(recipe, backend.shape(tensor))
236 tensor = backend.reshape(tensor, init_shapes)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\einops\einops.py:165, in _reconstruct_from_shape_uncached(self, shape)
164 if len(shape) != len(self.input_composite_axes):
--> 165 raise EinopsError('Expected {} dimensions, got {}'.format(len(self.input_composite_axes), len(shape)))
167 ellipsis_shape: List[int] = []

EinopsError: Expected 3 dimensions, got 2

During handling of the above exception, another exception occurred:

EinopsError Traceback (most recent call last)
Cell In[13], line 22
10 x_coarse_gen = generate_coarse(
11 x_semantic,
12 history_prompt=voice_name,
(...)
15 top_p=0.95,
16 )
17 x_fine_gen = generate_fine(
18 x_coarse_gen,
19 history_prompt=voice_name,
20 temp=0.5,
21 )
---> 22 audio_array = codec_decode(x_semantic)

File ~\bark-with-voice-clone\bark\generation.py:764, in codec_decode(fine_tokens, model, use_gpu)
762 arr = arr.to(device)
763 arr = arr.transpose(0, 1)
--> 764 emb = model.quantizer.decode(arr)
765 out = model.decoder(emb)
766 audio_arr = out.detach().cpu().numpy().squeeze()

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\encodec\quantization\vq.py:115, in ResidualVectorQuantizer.decode(self, codes)
112 def decode(self, codes: torch.Tensor) -> torch.Tensor:
113 """Decode the given codes to the quantized representation.
114 """
--> 115 quantized = self.vq.decode(codes)
116 return quantized

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\encodec\quantization\core_vq.py:365, in ResidualVectorQuantization.decode(self, q_indices)
363 for i, indices in enumerate(q_indices):
364 layer = self.layers[i]
--> 365 quantized = layer.decode(indices)
366 quantized_out = quantized_out + quantized
367 return quantized_out

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\encodec\quantization\core_vq.py:291, in VectorQuantization.decode(self, embed_ind)
289 quantize = self._codebook.decode(embed_ind)
290 quantize = self.project_out(quantize)
--> 291 quantize = rearrange(quantize, "b n d -> b d n")
292 return quantize

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\einops\einops.py:483, in rearrange(tensor, pattern, **axes_lengths)
481 raise TypeError("Rearrange can't be applied to an empty list")
482 tensor = get_backend(tensor[0]).stack_on_zeroth_dimension(tensor)
--> 483 return reduce(cast(Tensor, tensor), pattern, reduction='rearrange', **axes_lengths)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\einops\einops.py:420, in reduce(tensor, pattern, reduction, **axes_lengths)
418 message += '\n Input is list. '
419 message += 'Additional info: {}.'.format(axes_lengths)
--> 420 raise EinopsError(message + '\n {}'.format(e))

EinopsError: Error while processing rearrange-reduction pattern "b n d -> b d n".
Input tensor shape: torch.Size([1, 128]). Additional info: {}.
Expected 3 dimensions, got 2

error when training with long audios.

When cloning with a 5s-7s audio clip it works, but with a 15s-20s clip I get the following error:

100%|██████████| 100/100 [00:03<00:00, 27.04it/s]

AssertionError Traceback (most recent call last)
in <cell line: 7>()
5 """
6 voice_name = "/content/bark-with-voice-clone/bark/assets/prompts/output5.npz" # use your custom voice name here if you have one
----> 7 audio_array = generate_audio(text_prompt, history_prompt=voice_name)

2 frames
/usr/local/lib/python3.9/dist-packages/bark/generation.py in generate_coarse(x_semantic, history_prompt, temp, top_k, top_p, use_gpu, silent, max_coarse_history, sliding_window_len, model)
522 else:
523
--> 524 x_history = np.load(
525 os.path.join(CUR_PATH, "assets", "prompts", f"{history_prompt}.npz")
526 )

AssertionError:
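A workaround consistent with the note at the top of this README (samples of roughly 5-12 seconds): trim the reference clip before encoding it, for example:

import torchaudio

# Trim a long reference clip to ~12 seconds before running the cloning cells
# (file names are placeholders).
wav, sr = torchaudio.load("long_reference.wav")
max_seconds = 12
wav = wav[:, : sr * max_seconds]
torchaudio.save("short_reference.wav", wav, sr)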

Training Model (Semantics)

I have data consisting of multiple ~8 second audio clips (.wav). If I understand correctly, do I need to generate the semantic, fine, and coarse outputs to be able to train on my own dataset? And will training on my own dataset produce natural-sounding synthesized audio?
