voicecraft's People

Contributors

approximetal, chenxwh, datitran, derekjhunt, fakerybakery, ganeshkrishnan1, jasonppy, keeo, nielsrogge, pgosar, ph0rk0z, rmcc3, sewlell, ubergarm, wauplin, yoesak, zuev-stepan

voicecraft's Issues

Finetuning on custom voice

Hi, thanks for your amazing work. Can't wait to try it out.
I am wondering if it's possible to finetune your pretrained model on a custom voice and, if so, whether you could upload a notebook to follow.
I am reading the training section, but I'm not sure I completely understand how I could finetune on a custom voice. A notebook would be nice.

Thank you again.

train.txt and validation.txt generation from extracted_codes_and_phonemes

Thanks for this amazing work to benefit the speech research community.

Just wondering: are the provided train.txt and validation.txt extracted from the XL split of GigaSpeech? In the manifest file, are the three columns "0 name codec_number"? Could you maybe also provide the script that generates them from the processed feature folder path/to/store_extracted_codes_and_phonemes, in case someone wants to test on a smaller dataset split or on a different dataset? Thank you.
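
In the meantime, here is a rough sketch of what such a script might look like. The subfolder name and the meaning of the third column are assumptions on my part (I'm guessing it is the number of codec frames), so adjust to whatever the extraction step actually wrote:

import os, random

def build_manifests(root, out_dir, val_fraction=0.01, seed=1):
    # Assumed layout: one codes file per utterance under this subfolder.
    code_dir = os.path.join(root, "encodec_16khz_4codebooks")
    rows = []
    for fn in sorted(os.listdir(code_dir)):
        if not fn.endswith(".txt"):
            continue
        name = fn[:-len(".txt")]
        with open(os.path.join(code_dir, fn)) as f:
            # Assumed: frames counted from the first codebook row of the file.
            n_frames = len(f.readline().split())
        rows.append(f"0\t{name}\t{n_frames}")
    random.Random(seed).shuffle(rows)
    n_val = max(1, int(len(rows) * val_fraction))
    for split, split_rows in [("validation", rows[:n_val]), ("train", rows[n_val:])]:
        with open(os.path.join(out_dir, split + ".txt"), "w") as f:
            f.write("\n".join(split_rows) + "\n")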

Supported emotions

How many emotions are supported?

Can we get a list of these?

How do we trigger specific emotional audio? Maybe some prompt engineering?

Validation loss Divergence?

(screenshot: training and validation loss curves, 2024-03-28)

Thanks for your great work!
Now I'm training the 100M VoiceCraft on LJSpeech and custom data (maybe 32 hours?).
But I've run into an issue with validation loss divergence.

I think the cause is delay stacking, which, as described in your paper, changes the sequence every epoch. If the train accuracy of all 4 codebooks reaches 1, I'd expect the validation loss to decrease.
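
For readers who haven't seen the term: "delay stacking" refers to the delayed codebook pattern from the paper, where codebook k is shifted right by k frames so all codebooks can be predicted in one autoregressive pass. A minimal sketch of the idea, not the repo's actual implementation:

import numpy as np

def delay_stack(codes, pad_token):
    # codes: [K, T] array of codec tokens (K codebooks, T frames).
    # Shift codebook k right by k frames, so at output step t the model sees
    # codebook 0 at frame t, codebook 1 at frame t-1, and so on.
    K, T = codes.shape
    out = np.full((K, T + K - 1), pad_token, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]
    return out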

For this reason, I have two questions.

  • Could you explain whether my training is on track or not (loss curve, analysis, etc.)?
  • Could you share your train and validation curves?

Best regards

Seung Woo Yu

Some voice editing problems

I have noticed some issues with voice editing in the tests and demos.
I would like to ask about editing the last part of the text. For example, in https://youtu.be/PJ2qSjycLcw?t=353, starting at 5:50, there is a problem with synthesis quality at the end of the sentence. The problem seems to occur not when the masked span is in the middle of the audio, but when the edit is at the end of the sentence. I found the same thing in your demo "this was george steers the son of a british naval captain and ship modeler who had become an american naval officer and was entrusted with the prestigious role of overseeing the operations at the renowned naval headquarters" when editing the end of the sentence: there are strange pauses between the last few words.

colab demo

Can someone share a Colab to test this?

Examples are using CPU as default

Hi! Thanks for releasing the weights; I'm having some fun with the model so far!

I wanted to ask if there's a particular reason why the examples are currently set to use the CPU instead of CUDA?

Thanks!

Error in loading your tuned EnCodec from Huggingface

Hi, @jasonppy, thanks for the model and open-sourcing the code to inspire ML Speech engineers to investigate it!

My question is about loading your pretrained EnCodec model stored on the HF Hub. I installed the env and main packages, downloaded the checkpoint from https://huggingface.co/pyp1/VoiceCraft/tree/main, and tried to use it, but got the following error:

### phonemization
# load tokenizer
# load the encodec model
from audiocraft.solvers import CompressionSolver
model = CompressionSolver.model_from_checkpoint("/home/jovyan/kda/VoiceCraft/exp/encodec_4cb2048_giga.th")
model = model.cuda()
model = model.eval()

Output:

MissingConfigException Traceback (most recent call last)
Cell In[9], line 5
1 ### phonemization
2 # load tokenizer
3 # load the encodec model
4 from audiocraft.solvers import CompressionSolver
----> 5 model = CompressionSolver.model_from_checkpoint("/home/jovyan/kda/VoiceCraft/exp/encodec_4cb2048_giga.th")
6 model = model.cuda()
7 model = model.eval()

File /home/user/conda/lib/python3.9/site-packages/audiocraft/solvers/compression.py:287, in CompressionSolver.model_from_checkpoint(checkpoint_path, device)
285 logger = logging.getLogger(name)
286 logger.info(f"Loading compression model from checkpoint: {checkpoint_path}")
--> 287 _checkpoint_path = checkpoint.resolve_checkpoint_path(checkpoint_path, use_fsdp=False)
288 assert _checkpoint_path is not None, f"Could not resolve compression model checkpoint path: {checkpoint_path}"
289 state = checkpoint.load_checkpoint(_checkpoint_path)

File /home/user/conda/lib/python3.9/site-packages/audiocraft/utils/checkpoint.py:68, in resolve_checkpoint_path(sig_or_path, name, use_fsdp)
56 def resolve_checkpoint_path(sig_or_path: tp.Union[Path, str], name: tp.Optional[str] = None,
57 use_fsdp: bool = False) -> tp.Optional[Path]:
58 """Resolve a given checkpoint path for a provided dora sig or path.
59
60 Args:
(...)
66 Path, optional: Resolved checkpoint path, if it exists.
67 """
---> 68 from audiocraft import train
69 xps_root = train.main.dora.dir / 'xps'
70 sig_or_path = str(sig_or_path)

File /home/user/conda/lib/python3.9/site-packages/audiocraft/train.py:131
126 logger.info("Changing tmpdir to %s", tmpdir)
127 os.environ['TMPDIR'] = str(tmpdir)
130 @hydra_main(config_path='../config', config_name='config', version_base='1.1')
--> 131 def main(cfg):
132 init_seed_and_system(cfg)
134 # Setup logging both to XP specific folder, and to stderr.

File /home/user/conda/lib/python3.9/site-packages/dora/hydra.py:308, in hydra_main.._decorator(main)
307 def _decorator(main: MainFun):
--> 308 return HydraMain(main, config_name=config_name, config_path=config_path,
309 **kwargs)

File /home/user/conda/lib/python3.9/site-packages/dora/hydra.py:161, in HydraMain.init(self, main, config_name, config_path, **kwargs)
158 self.full_config_path = self.full_config_path / config_path
160 self._initialized = False
--> 161 self._base_cfg = self._get_config()
162 self._config_groups = self._get_config_groups()
163 dora = self._get_dora()

File /home/user/conda/lib/python3.9/site-packages/dora/hydra.py:281, in HydraMain._get_config(self, overrides)
275 """
276 Internal method, returns the config for the given override,
277 but without the dora.sig field filled.
278 """
279 with initialize_config_dir(str(self.full_config_path), job_name=self._job_name,
280 **self.hydra_kwargs):
--> 281 return self._get_config_noinit(overrides)

File /home/user/conda/lib/python3.9/site-packages/dora/hydra.py:289, in HydraMain._get_config_noinit(self, overrides)
287 cfg = copy.deepcopy(cfg)
288 else:
--> 289 cfg = compose(self.config_name, overrides) # type: ignore
290 return cfg

File /home/user/conda/lib/python3.9/site-packages/hydra/compose.py:38, in compose(config_name, overrides, return_hydra_config, strict)
36 gh = GlobalHydra.instance()
37 assert gh.hydra is not None
---> 38 cfg = gh.hydra.compose_config(
39 config_name=config_name,
40 overrides=overrides,
41 run_mode=RunMode.RUN,
42 from_shell=False,
43 with_log_configuration=False,
44 )
45 assert isinstance(cfg, DictConfig)
47 if not return_hydra_config:

File /home/user/conda/lib/python3.9/site-packages/hydra/_internal/hydra.py:594, in Hydra.compose_config(self, config_name, overrides, run_mode, with_log_configuration, from_shell, validate_sweep_overrides)
576 def compose_config(
577 self,
578 config_name: Optional[str],
(...)
583 validate_sweep_overrides: bool = True,
584 ) -> DictConfig:
585 """
586 :param config_name:
587 :param overrides:
(...)
591 :return:
592 """
--> 594 cfg = self.config_loader.load_configuration(
595 config_name=config_name,
596 overrides=overrides,
597 run_mode=run_mode,
598 from_shell=from_shell,
599 validate_sweep_overrides=validate_sweep_overrides,
600 )
601 if with_log_configuration:
602 configure_log(cfg.hydra.hydra_logging, cfg.hydra.verbose)

File /home/user/conda/lib/python3.9/site-packages/hydra/_internal/config_loader_impl.py:142, in ConfigLoaderImpl.load_configuration(self, config_name, overrides, run_mode, from_shell, validate_sweep_overrides)
133 def load_configuration(
134 self,
135 config_name: Optional[str],
(...)
139 validate_sweep_overrides: bool = True,
140 ) -> DictConfig:
141 try:
--> 142 return self._load_configuration_impl(
143 config_name=config_name,
144 overrides=overrides,
145 run_mode=run_mode,
146 from_shell=from_shell,
147 validate_sweep_overrides=validate_sweep_overrides,
148 )
149 except OmegaConfBaseException as e:
150 raise ConfigCompositionException().with_traceback(sys.exc_info()[2]) from e

File /home/user/conda/lib/python3.9/site-packages/hydra/_internal/config_loader_impl.py:243, in ConfigLoaderImpl._load_configuration_impl(self, config_name, overrides, run_mode, from_shell, validate_sweep_overrides)
233 def _load_configuration_impl(
234 self,
235 config_name: Optional[str],
(...)
239 validate_sweep_overrides: bool = True,
240 ) -> DictConfig:
241 from hydra import version, version
--> 243 self.ensure_main_config_source_available()
244 parsed_overrides, caching_repo = self._parse_overrides_and_create_caching_repo(
245 config_name, overrides
246 )
248 if validate_sweep_overrides:

File /home/user/conda/lib/python3.9/site-packages/hydra/_internal/config_loader_impl.py:129, in ConfigLoaderImpl.ensure_main_config_source_available(self)
123 else:
124 msg = (
125 "Primary config directory not found.\nCheck that the"
126 f" config directory '{source.path}' exists and readable"
127 )
--> 129 self._missing_config_error(
130 config_name=None, msg=msg, with_search_path=False
131 )

File /home/user/conda/lib/python3.9/site-packages/hydra/_internal/config_loader_impl.py:102, in ConfigLoaderImpl._missing_config_error(self, config_name, msg, with_search_path)
99 else:
100 return msg
--> 102 raise MissingConfigException(
103 missing_cfg_file=config_name, message=add_search_path()
104 )

MissingConfigException: Primary config directory not found.
Check that the config directory '/home/user/conda/lib/python3.9/site-packages/audiocraft/../config' exists and readable

So, do you have any special config files for your EnCodec, or is this an error in the AudioCraft/Hydra packages?

RealEdit Dataset Release

Hi,

Thank you for sharing this remarkable work.
I am wondering if there are any plans to make the RealEdit dataset publicly available.
I am interested in utilizing the RealEdit dataset for academic research purposes.

RuntimeError: espeak not installed on your system

Environment

  • macOS Sonoma 14.3.1
  • M1 Max 64GB

Issue Description
I am attempting to run the inference_tts.ipynb notebook on an Apple Silicon Mac. As part of adapting the code for Apple Silicon, I replaced CUDA references with MPS (or CPU where MPS isn't an option). However, I encountered a runtime error related to espeak not being recognized by the system despite being installed.

Steps taken

  1. Installed espeak via Homebrew using brew install espeak.
  2. Confirmed espeak installation with espeak --version, which outputs 'eSpeak NG text-to-speech: 1.51.1'
  3. Ran the inference_tts.ipynb notebook after making necessary modifications for MPS compatibility.
  4. Encountered the RuntimeError: espeak not installed on your system upon executing Cell 4.

Behavior
Despite espeak being installed (confirmed via command line), a RuntimeError is thrown indicating that espeak is not installed on the system.

Troubleshooting Steps Taken

  • Verified that espeak is accessible via the command line and shows the installed version.
  • Attempted to reinstall espeak through Homebrew.
  • Checked the system's PATH to ensure it includes the directory where espeak is installed.

Full output with error

RuntimeError                              Traceback (most recent call last)
Cell In[4], line 31
     27 model.eval()
     29 phn2num = ckpt['phn2num']
---> 31 text_tokenizer = TextTokenizer(backend="espeak")
     32 audio_tokenizer = AudioTokenizer(signature=encodec_fn) # will also put the neural codec model on gpu
     34 # run the model to get the output

File ~/VoiceCraft_things/VoiceCraft-master/data/tokenizer.py:48, in TextTokenizer.__init__(self, language, backend, separator, preserve_punctuation, punctuation_marks, with_stress, tie, language_switch, words_mismatch)
     36 def __init__(
     37     self,
     38     language="en-us",
   (...)
     46     words_mismatch: WordMismatch = "ignore",
     47 ) -> None:
---> 48     phonemizer = EspeakBackend(
     49         language,
     50         punctuation_marks=punctuation_marks,
     51         preserve_punctuation=preserve_punctuation,
     52         with_stress=with_stress,
     53         tie=tie,
     54         language_switch=language_switch,
     55         words_mismatch=words_mismatch,
     56     )
...
     81 self._logger.info(
     82     'initializing backend %s-%s',
     83     'espeak', '.'.join(str(v) for v in self.version()))

RuntimeError: espeak not installed on your system


I would also really like to do this without e.g. Docker, because I want to try to use MPS for the Apple Silicon GPU.

Would appreciate any guidance on resolving this issue so that espeak is correctly recognized by the system and the notebook can run as intended.
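
One thing worth trying (an assumption on my part, not a confirmed fix for this repo): phonemizer locates espeak through its shared library rather than the command-line binary, so a Homebrew install may need the dylib path exported before the tokenizer is constructed:

import os

# Hypothetical path: point this at wherever Homebrew put libespeak-ng.dylib
# (often /opt/homebrew/lib on Apple Silicon). It must be set before phonemizer
# initializes its espeak backend.
os.environ["PHONEMIZER_ESPEAK_LIBRARY"] = "/opt/homebrew/lib/libespeak-ng.dylib"

from data.tokenizer import TextTokenizer
text_tokenizer = TextTokenizer(backend="espeak")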

seed - magic number

In the Jupyter inference notebook the seed is set but never used. As far as I can tell, setting the seed makes no difference to the end result. Or am I missing something?
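
For reference, if the notebook did wire the seed through, it would need to touch every RNG involved. A generic sketch (not the notebook's actual code):

import random
import numpy as np
import torch

def seed_everything(seed: int) -> None:
    random.seed(seed)                 # Python's built-in RNG
    np.random.seed(seed)              # NumPy RNG
    torch.manual_seed(seed)           # PyTorch CPU (and default CUDA) RNG
    torch.cuda.manual_seed_all(seed)  # explicit for multi-GPU setups

seed_everything(1)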

Generating long speeches

Would there be a way to generate long speeches?

Because right now, it requires at least 3 seconds of speech each time you want to run inference on something new. And if the desired generation is too long, it hallucinates and ends up producing gibberish.

One way to solve this would be to generate speech sentence by sentence (see the sketch below). One issue with that is that it would still require those 3 seconds of base speech each time. The other is the consistency of the generated speech, as the intonation could differ immensely between sentences.

Does anyone have an idea?
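
A rough sentence-by-sentence driver, just to make the idea concrete. Everything here is hypothetical: synthesize() stands in for whatever inference call the repo actually exposes, and the prompt handling is an assumption:

import re

def generate_long_speech(text, prompt_wav, prompt_transcript, synthesize):
    # Naive sentence split on terminal punctuation; enough for a sketch.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks = []
    for sentence in sentences:
        # Reuse the same >=3 s prompt for every sentence; keeping intonation
        # consistent across sentences is exactly the open problem noted above.
        chunks.append(synthesize(prompt_wav, prompt_transcript, sentence))
    return chunks  # concatenate with your audio library of choice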

Multi-language model

Hi, thank you for your excellent repo!
Do you have any plan to develop a multi-language model?

AttributeError: module 'torch' has no attribute 'compiler' and other various issues

System

Windows 11
NVIDIA MX130
i5-10210U
12GB RAM

Error Code

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[11], line 12
      8 prompt_end_frame = int(cut_off_sec * info.sample_rate)
     11 # # load model, tokenizer, and other necessary files
---> 12 from models import voicecraft
     13 voicecraft_name="giga830M.pth"
     14 ckpt_fn =f"./pretrained_models/{voicecraft_name}"

File c:\Users\NAME\TTS\src\audiocraft\audiocraft\models\__init__.py:10
      6 """
      7 Models for EnCodec, AudioGen, MusicGen, as well as the generic LMModel.
      8 """
      9 # flake8: noqa
---> 10 from . import builders, loaders
     11 from .encodec import (
     12     CompressionModel, EncodecModel, DAC,
     13     HFEncodecModel, HFEncodecCompressionModel)
     14 from .audiogen import AudioGen

File c:\Users\NAME\TTS\src\audiocraft\audiocraft\models\builders.py:14
      7 """
      8 All the functions to build the relevant models and modules
      9 from the Hydra config.
     10 """
     12 import typing as tp
---> 14 import audiocraft
     15 import omegaconf
     16 import torch

File c:\users\name\tts\src\audiocraft\audiocraft\__init__.py:24
      6 """
      7 AudioCraft is a general framework for training audio generative models.
      8 At the moment we provide the training code for:
   (...)
     20     improves the perceived quality and reduces the artifacts coming from adversarial decoders.
     21 """
     23 # flake8: noqa
---> 24 from . import data, modules, models
     26 __version__ = '1.0.0'

File c:\users\name\tts\src\audiocraft\audiocraft\data\__init__.py:10
      6 """Audio loading and writing support. Datasets for raw audio
      7 or also including some metadata."""
      9 # flake8: noqa
---> 10 from . import audio, audio_dataset, info_audio_dataset, music_dataset, sound_dataset

File c:\users\name\tts\src\audiocraft\audiocraft\data\info_audio_dataset.py:19
     17 from .audio_dataset import AudioDataset, AudioMeta
     18 from ..environment import AudioCraftEnvironment
---> 19 from ..modules.conditioners import SegmentWithAttributes, ConditioningAttributes
     22 logger = logging.getLogger(__name__)
     25 def _clusterify_meta(meta: AudioMeta) -> AudioMeta:

File c:\users\name\tts\src\audiocraft\audiocraft\modules\__init__.py:22
     20 from .lstm import StreamableLSTM
     21 from .seanet import SEANetEncoder, SEANetDecoder
---> 22 from .transformer import StreamingTransformer

File c:\users\name\tts\src\audiocraft\audiocraft\modules\transformer.py:23
     21 from torch.nn import functional as F
     22 from torch.utils.checkpoint import checkpoint as torch_checkpoint
---> 23 from xformers import ops
     25 from .rope import RotaryEmbedding
     26 from .streaming import StreamingModule

File c:\Users\NAME\miniconda3\envs\voicecraft\lib\site-packages\xformers\__init__.py:12
      9 import torch
     11 from . import _cpp_lib
---> 12 from .checkpoint import (  # noqa: E402, F401
     13     checkpoint,
     14     get_optimal_checkpoint_policy,
     15     list_operators,
     16     selective_checkpoint_wrapper,
     17 )
     19 try:
     20     from .version import __version__  # noqa: F401

File c:\Users\NAME\miniconda3\envs\voicecraft\lib\site-packages\xformers\checkpoint.py:464
    460         self.counter += 1
    461         return self.optim_output[count] == 1
--> 464 class SelectiveCheckpointWrapper(ActivationWrapper):
    465     def __init__(self, mod, memory_budget=None, policy_fn=None):
    466         if torch.__version__ < (2, 1):

File c:\Users\NAME\miniconda3\envs\voicecraft\lib\site-packages\xformers\checkpoint.py:481, in SelectiveCheckpointWrapper()
    476     # TODO: this should be enabled by default in PyTorch
    477     torch._dynamo.config._experimental_support_context_fn_in_torch_utils_checkpoint = (
    478         True
    479     )
--> 481 @torch.compiler.disable
    482 def _get_policy_fn(self, *args, **kwargs):
    483     if not torch.is_grad_enabled():
    484         # no need to compute a policy as it won't be used
    485         return []

AttributeError: module 'torch' has no attribute 'compiler'

Description

It drove me to insanity that almost every stage of inference_tts.ipynb has its own errors. I have tried troubleshooting using my knowledge of Python packaging and compatibility issues. Here is what I have encountered:

 from data.tokenizer import (
    AudioTokenizer,
    TextTokenizer,
)

It is unclear where inference_tts.ipynb is supposed to go; I assumed src/audiocraft/audiocraft/inference_tts.ipynb. That gives ImportError: attempted relative import beyond top-level package.

Adding an absolute import like the one below helps prevent that issue, but it raises another: ModuleNotFoundError: No module named 'AudioTokenizer'.

import sys
sys.path.append('C:\\Users\\NAME\\TTS\\src\\audiocraft\\audiocraft\\data')
# # load model, tokenizer, and other necessary files
from models import voicecraft
voicecraft_name="giga830M.pth"
ckpt_fn =f"./pretrained_models/{voicecraft_name}"
encodec_fn = "./pretrained_models/encodec_4cb2048_giga.th"
if not os.path.exists(ckpt_fn):
    os.system(f"wget https://huggingface.co/pyp1/VoiceCraft/resolve/main/{voicecraft_name}\?download\=true")
    os.system(f"mv {voicecraft_name}\?download\=true ./pretrained_models/{voicecraft_name}")
if not os.path.exists(encodec_fn):
    os.system(f"wget https://huggingface.co/pyp1/VoiceCraft/resolve/main/encodec_4cb2048_giga.th")
    os.system(f"mv encodec_4cb2048_giga.th ./pretrained_models/encodec_4cb2048_giga.th")

from models import voicecraft doesn't seem to work like it should. It's probably the same package issue as Stage 1's AudioTokenizer and TextTokenizer.

That is the error shown above under Error Code. It is... ridiculous. AttributeError: module 'torch' has no attribute 'compiler' is usually caused by a torch version that does not support torch.compiler (i.e. older than PyTorch 2.x).

But my transformers is 4.38.1, my xformers is 0.0.25.post1, and my torch is 2.2.2+cu121, which should support the compiler. There may be other causes, but I don't have any ideas.

A minor one: what is the deal with apt-get install ffmpeg and apt-get install espeak-ng? Neither is recognized on my system; I think they are supposed to be Linux commands?

Post-script

You may find this entire issue reads like a rant, but that is not how I mean it. Sure, it's a little hideous when all of this happens, but all things considered, this is an amazing project that could stand alongside Coqui and Tortoise, especially the zero-shot part. It would be much more popular than it is if someone hooked it up to a webui, like rsxdalv's TTS Generation WebUI.

Of course, it still needs fixing. You can ask me for more context or information if you want in order to fix this.

Update 1: fixing my wording

A few questions about the paper [EnCodec; inference speed; model parameters]

Hi @jasonppy , great work and samples, thanks for sharing the code!

The introduction of causal masking for TTS is an elegant approach to contextualization. Bravo!

I'm curious about a few aspects of your work at the moment:

  1. Did you train EnCodec as well? To my knowledge its parameters are publicly released, but looking into your code it seems you trained your own. I wonder what the reason might be. A hypothesis: no released parameters for a 16 kHz sampling rate?
  2. When it comes to inference, you mention that you run it multiple times. Can you share the inference speed for, say, a 10-second utterance on the 820M model?
  3. Is there any estimate of when the model parameters will be released?

Have a good one!
Best, Taras

gradio port

I did not like having to mess with Jupyter and having to run Whisper separately, so I made a Gradio version. I will submit a pull request eventually; you can try it out here for now.
Note that the conda env is slightly different in my fork:
https://github.com/friendlyFriend4000/VoiceCraft

mfa: command not found

This line of code fails:

os.system(f"mfa align -j 1 --output_format csv {temp_folder} english_us_arpa english_us_arpa {align_temp}")

What does 'mfa' stand for?

Thanks
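
For what it's worth, mfa here is the Montreal Forced Aligner command-line tool; a "command not found" usually means it isn't installed in the active environment. Mirroring the os.system style above, and assuming a conda setup with conda-forge reachable:

import os

# Install Montreal Forced Aligner into the active conda environment,
# then sanity-check that the `mfa` command resolves.
os.system("conda install -y -c conda-forge montreal-forced-aligner")
os.system("mfa version")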

espeak not working as backend on Windows OS

hi there,

Thanks for open-sourcing this. I have everything installed and building perfectly, but espeak isn't supported on Windows. Is there a way to use a different backend for the text tokenizer? I've tried nltk and failed :(

These two lines:

text_tokenizer = TextTokenizer(backend="espeak")
audio_tokenizer = AudioTokenizer(signature=encodec_fn)  # Will also put the neural codec model on GPU

Everything else is working perfectly.

Is it possible to run this without docker?

I get this error:

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
I blame docker.

Why would I need docker for this? What purpose does it serve?

pytorch version clash

The instructions say to install torch==2.0.1.

But while installing

pip install -e git+https://github.com/facebookresearch/audiocraft.git@c5157b5bf14bf83449c17ea1eeb66c19fb4bc7f0#egg=audiocraft

a different version, pytorch-2.2.2, is forcibly installed over the initially installed one.

Then import torchaudio results in the error

 undefined symbol: _ZN2at4_ops15sum_dim_IntList4callERKNS_6TensorEN3c1016OptionalArrayRefIlEEbNS5_8optionalINS5_10ScalarTypeEEE
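
A common workaround for this kind of clash (an assumption on my part, not something from the repo's docs) is to stop pip from resolving audiocraft's dependencies, then restore the pinned torch/torchaudio pair:

pip install --no-deps -e git+https://github.com/facebookresearch/audiocraft.git@c5157b5bf14bf83449c17ea1eeb66c19fb4bc7f0#egg=audiocraft
pip install torch==2.0.1 torchaudio==2.0.2   # reinstall the pinned pair in case it was replaced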

metadata-generation-failed during environment setup

During environment setup, after running
pip install -e git+https://github.com/facebookresearch/audiocraft.git@c5157b5bf14bf83449c17ea1eeb66c19fb4bc7f0#egg=audiocraft

I get the following problem:
(screenshot of the pip error: metadata-generation-failed)

Would you please help?
Thanks

License

Hi,
Thank you for releasing VoiceCraft! It's super cool and I'm really impressed by the quality. Do you have any plans to open source it by switching to an open source license? (Are the model weights based off of XTTS?)
Thanks!

performance

On top hardware and with compilation, inference speed is still too slow to be competitive or to support real-time applications; a long sentence can take anywhere from 4 to 10 seconds.

I will say the quality is quite good, and the zero-shot capability is impressive.

Does the licence allow YouTube voice-overs?

Hello there, I was reading the licence and I couldn't figure out whether I am allowed or not. My idea is to use the voice for doing voice-overs in YouTube videos. I could argue that the channel is not monetized and that the videos are not like audiobooks, which rely solely on the speech.
I raised the issue because, if you don't end up making the project completely open source (not saying you should), this clarification could be useful for the community.
Thanks for such an amazing project.

Usage instructions.

(Really, really impressed by the demo; so much further along than the best SOTA model I've found so far. Congrats on the great work.)

Running with Docker/Jupyter.

I followed the Docker/Jupyter instructions to the letter (I'm not at all familiar with Jupyter; very familiar with Docker).

It went mostly well.

I keep running cells and advancing, again and again, until I get to the bottom.

And then nothing? What's supposed to happen? I don't see any new instructions, no new files, nothing; I'm fairly lost.

Running as a script.

The Jupyter stuff is great for getting to know the project, but (unless I don't understand what Jupyter is) it won't really help with integrating VoiceCraft into my project, generating thousands of files, or "calling" VoiceCraft programmatically from my Node.js system.

In other projects, there is something like:

python3 voicecraft/bin/inference.py --text="Read this text" --model_path="voicecraft/model/file.something" --voice_sample="/tmp/voices/robert.wav" --output="/tmp/sample_voicecraft_output.wav" --device=cpu

What's the equivalent for VoiceCraft, and how do I get to the point where it'll agree to run? (Running inside Docker is fine, or outside Docker too; I just need to get it to run.)

I found main.py, and I think the command-line options are in config.py, but I don't know which options I need and which I don't, and I don't know how to use the script. I didn't find an example of how to use it, but I'll keep looking.

Intonation.

I might be getting a bit ahead of myself here since I don't have it running yet, but maybe you know: will intonation/style transfer through? Like, if my voice sample has the person whispering, will the output be whispering? Same for shouting, crying, etc. That's really the big thing missing from my system; is there any way to get that to work with VoiceCraft, do you know?

Thanks a lot in advance!
Cheers.

MFA not compatible with Hugging Face Spaces?

I've been working on creating a Hugging Face Space that uses VoiceCraft, but there seems to be an issue: MFA can't be installed via conda, since Hugging Face Spaces only let you install via apt-get and pip. Have you figured out how to work around this issue?

Request for a requirements.txt

Ran into several issues with imports failing after following the instructions in the README. Installation would have been smoother with exact versions of the offending packages.

train other languages

It would be great to have some tips on how to train on other languages. I have datasets in different languages and would be happy to train with them, but I don't know where to start.

License

Hey! Incredible work and results, and amazing due diligence in the paper; really appreciated. Putting together RealEdit for evaluating results and fairly training and comparing against other SoTA models is so nice to see. Thanks!

Wondering if you are planning on switching over to a more permissive license that would allow the use of your work in commercial settings? :-)

MFA alignment temp file

I am trying to clone my own voice, and when I use my own file, MFA outputs the default demo text ("but when I approaches....") instead of mine.

Bash Error with New inference_tts.ipynb

Hello, I was going through the README, and everything seemed to be working fine until I got an error in cell 2. I'm not sure if it's an environment error, so apologies if that's the case.

Platform: Windows WSL2
File: inference_tts.ipynb
Code:

# install MFA models and dictionaries if you haven't done so already
!source ~/.bashrc && \
    conda activate voicecraft && \
    mfa model download dictionary english_us_arpa && \
    mfa model download acoustic english_us_arpa

Error output:

/bin/bash: line 1: /home/zak/.bashrc: No such file or directory
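
A likely workaround, assuming the voicecraft conda env already exists: skip sourcing ~/.bashrc (which doesn't exist on this WSL setup) and let conda run activate the environment instead, mirroring the failing cell:

# same downloads, but without relying on ~/.bashrc for conda activation
!conda run -n voicecraft mfa model download dictionary english_us_arpa
!conda run -n voicecraft mfa model download acoustic english_us_arpa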

Installation on windows native

Hi,

This issue is about an installation solution for native Windows: preferably, if possible at all, without using WSL / Docker / conda. Just stock Python & pip, maybe a venv, maybe some PowerShell, but preferably a pure batch install.

In reference to previous attempts

#28
#29
