
open-mmlab / amphion

3.9K stars · 50 watchers · 316 forks · 10.57 MB

Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.

Home Page: https://openhlt.github.io/amphion/

License: MIT License

Python 91.87% Shell 3.74% Cython 0.05% Dockerfile 0.10% HTML 0.77% JavaScript 3.48%
audio-generation audio-synthesis audioldm hifi-gan music-generation naturalspeech2 singing-voice-conversion speech-synthesis text-to-audio text-to-speech

amphion's Introduction

Amphion: An Open-Source Audio, Music, and Speech Generation Toolkit


Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development. Amphion offers a unique feature: visualizations of classic models or architectures. We believe that these visualizations are beneficial for junior researchers and engineers who wish to gain a better understanding of the model.

The North-Star objective of Amphion is to offer a platform for studying the conversion of any inputs into audio. Amphion is designed to support individual generation tasks, including but not limited to,

  • TTS: Text to Speech (⛳ supported)
  • SVS: Singing Voice Synthesis (👨‍💻 developing)
  • VC: Voice Conversion (👨‍💻 developing)
  • SVC: Singing Voice Conversion (⛳ supported)
  • TTA: Text to Audio (⛳ supported)
  • TTM: Text to Music (👨‍💻 developing)
  • more…

In addition to the specific generation tasks, Amphion also includes several vocoders and evaluation metrics. A vocoder is an important module for producing high-quality audio signals, while evaluation metrics are critical for ensuring consistent and meaningful measurements across generation tasks.

Here is the Amphion v0.1 demo, in which the voice, audio effects, and singing voices are all generated by our models. Enjoy!

Amphion-Demo-EN.mp4

🚀 News

  • 2024/03/12: Amphion now supports NaturalSpeech3 FACodec and releases pretrained checkpoints. arXiv hf hf readme
  • 2024/02/22: The first Amphion visualization tool, SingVisio, is released. arXiv openxlab Video readme
  • 2023/12/18: Amphion v0.1 release. arXiv hf youtube readme
  • 2023/11/28: Amphion alpha release. readme

⭐ Key Features

TTS: Text to Speech

  • Amphion achieves state-of-the-art performance when compared with existing open-source repositories on text-to-speech (TTS) systems. It supports the following models or architectures:
    • FastSpeech2: A non-autoregressive TTS architecture that utilizes feed-forward Transformer blocks.
    • VITS: An end-to-end TTS architecture that utilizes a conditional variational autoencoder with adversarial learning.
    • VALL-E: A zero-shot TTS architecture that uses a neural codec language model with discrete codes.
    • NaturalSpeech2: An architecture for TTS that utilizes a latent diffusion model to generate natural-sounding voices.

SVC: Singing Voice Conversion

  • Amphion supports multiple content-based features from various pretrained models, including WeNet, Whisper, and ContentVec. Their specific roles in SVC have been investigated in our NeurIPS 2023 workshop paper. arXiv code
  • Amphion implements several state-of-the-art model architectures, including diffusion-, transformer-, VAE- and flow-based models. The diffusion-based architecture uses Bidirectional dilated CNN as a backend and supports several sampling algorithms such as DDPM, DDIM, and PNDM (a minimal DDIM step is sketched below for illustration). Additionally, it supports single-step inference based on the Consistency Model.
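
To make the sampling-algorithm support above concrete, here is a minimal, self-contained PyTorch sketch of one deterministic DDIM update step. It illustrates the general technique only and is not Amphion's implementation; the denoiser output, noise schedule, and tensor shapes below are placeholders.

import torch

def ddim_step(x_t, eps_pred, t, t_prev, alphas_cumprod):
    """One deterministic DDIM update (eta = 0) from step t to step t_prev."""
    a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    # Predict the clean sample x0 from the current noisy sample.
    x0_pred = (x_t - (1 - a_t).sqrt() * eps_pred) / a_t.sqrt()
    # Step to the less noisy sample along the deterministic DDIM trajectory.
    return a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps_pred

# Example: a linear beta schedule and one jump from step 100 to step 80.
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
x_t = torch.randn(1, 80, 200)            # e.g. a Mel-spectrogram-shaped sample
eps_pred = torch.randn_like(x_t)         # stand-in for the denoiser's prediction
x_prev = ddim_step(x_t, eps_pred, t=100, t_prev=80, alphas_cumprod=alphas_cumprod)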

TTA: Text to Audio

  • Amphion supports TTA with a latent diffusion model. It is designed like AudioLDM, Make-an-Audio, and AUDIT. It is also the official implementation of the text-to-audio generation part of our NeurIPS 2023 paper. arXiv code

Vocoder

Evaluation

Amphion provides a comprehensive objective evaluation of the generated audio. The evaluation metrics include (a worked example follows the list):

  • F0 Modeling: F0 Pearson Coefficients, F0 Periodicity Root Mean Square Error, F0 Root Mean Square Error, Voiced/Unvoiced F1 Score, etc.
  • Energy Modeling: Energy Root Mean Square Error, Energy Pearson Coefficients, etc.
  • Intelligibility: Character/Word Error Rate, which can be calculated based on Whisper and more.
  • Spectrogram Distortion: Frechet Audio Distance (FAD), Mel Cepstral Distortion (MCD), Multi-Resolution STFT Distance (MSTFT), Perceptual Evaluation of Speech Quality (PESQ), Short Time Objective Intelligibility (STOI), etc.
  • Speaker Similarity: Cosine similarity, which can be calculated based on RawNet3, Resemblyzer, WeSpeaker, WavLM and more.
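
As a worked example of one metric above, here is a short, self-contained sketch of computing an F0 Pearson Coefficient between a reference and a generated utterance. It is not Amphion's evaluation code; it assumes librosa and scipy are installed and uses pYIN for F0 extraction.

import librosa
import numpy as np
from scipy.stats import pearsonr

def f0_pearson(ref_path, gen_path, fs=24000):
    # Load both signals at the same sample rate.
    ref, _ = librosa.load(ref_path, sr=fs)
    gen, _ = librosa.load(gen_path, sr=fs)
    # Frame-level F0 with pYIN; unvoiced frames come back as NaN.
    kwargs = dict(fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=fs)
    f0_ref, _, _ = librosa.pyin(ref, **kwargs)
    f0_gen, _, _ = librosa.pyin(gen, **kwargs)
    # Truncate to a common length and keep frames voiced in both signals.
    n = min(len(f0_ref), len(f0_gen))
    f0_ref, f0_gen = f0_ref[:n], f0_gen[:n]
    voiced = ~np.isnan(f0_ref) & ~np.isnan(f0_gen)
    corr, _ = pearsonr(f0_ref[voiced], f0_gen[voiced])
    return corr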

Datasets

Amphion unifies the data preprocessing of open-source datasets including AudioCaps, LibriTTS, LJSpeech, M4Singer, Opencpop, OpenSinger, SVCC, VCTK, and more. The supported dataset list can be seen here (updating).

Visualization

Amphion provides visualization tools to interactively illustrate the internal processing mechanism of classic models. This provides an invaluable resource for educational purposes and for facilitating understandable research.

Currently, Amphion supports SingVisio, a visualization tool of the diffusion model for singing voice conversion. arXiv openxlab Video

📀 Installation

Amphion can be installed through either Setup Installer or Docker Image.

Setup Installer

git clone https://github.com/open-mmlab/Amphion.git
cd Amphion

# Install Python Environment
conda create --name amphion python=3.9.15
conda activate amphion

# Install Python Packages Dependencies
sh env.sh

Docker Image

  1. Install Docker, NVIDIA Driver, NVIDIA Container Toolkit, and CUDA.

  2. Run the following commands:

git clone https://github.com/open-mmlab/Amphion.git
cd Amphion

docker pull realamphion/amphion
docker run --runtime=nvidia --gpus all -it -v .:/app realamphion/amphion

Mounting the dataset with the -v argument is necessary when using Docker. Please refer to Mount dataset in Docker container and Docker Docs for more details.

🐍 Usage in Python

We detail the instructions for the different tasks in the following recipes:

👨‍💻 Contributing

We appreciate all contributions to improve Amphion. Please refer to CONTRIBUTING.md for the contributing guideline.

🙏 Acknowledgement

©️ License

Amphion is under the MIT License. It is free for both research and commercial use cases.

📚 Citations

@article{zhang2023amphion,
      title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit}, 
      author={Xueyao Zhang and Liumeng Xue and Yicheng Gu and Yuancheng Wang and Haorui He and Chaoren Wang and Xi Chen and Zihao Fang and Haopeng Chen and Junan Zhang and Tze Ying Tang and Lexiao Zou and Mingxuan Wang and Jun Han and Kai Chen and Haizhou Li and Zhizheng Wu},
      journal={arXiv},
      year={2024},
      volume={abs/2312.09911}
}

amphion's People

Contributors

adorable-qin, bakerbunker, chenx17, eltociear, harryhe11, hecheng0625, lmxue, lokshaw-chau, merakist, rmsnow, treya-lin, viewfinder-annn, vocodexelysium, wsywsywsywsywsy979, yasiendwieb, yuantuo666, zhizhengwu, zyingt


amphion's Issues

Issue Running FastSpeech2 Model - FileNotFoundError: 'data/LJSpeech/valid.json'

Hi, thank you for developing this excellent project.

I am attempting to execute the FastSpeech2 model using the provided instructions at:
https://github.com/open-mmlab/Amphion/tree/main/egs/tts/FastSpeech2

Upon running the process command sh egs/tts/FastSpeech2/run.sh --stage 1, I encountered the following error: FileNotFoundError: [Errno 2] No such file or directory: 'data/LJSpeech/valid.json'.

Could you please provide guidance on resolving this issue? Your assistance is much appreciated.

Full log:

(amphion) root@5d89psego5dhs-0:/zhangpai21/workspace/cgy/1_projects/7_Amphion# sh egs/tts/FastSpeech2/run.sh --stage 1
/zhangpai21/workspace/cgy/1_projects/7_Amphion/mfa
Exprimental Configuration File: /zhangpai21/workspace/cgy/1_projects/7_Amphion/egs/tts/FastSpeech2/exp_config.json
Preprocess LJSpeech...
Prepare alignment LJSpeech...
13100it [00:01, 10765.15it/s]
MFA results are save in data/LJSpeech/TextGrid
Setting up corpus information...
Number of speakers in corpus: 1, average number of utterances per speaker: 13100.0
Creating dictionary information...
Setting up corpus_data directory...
Generating base features (mfcc)...
Calculating CMVN...
Done with setup.
Done! Everything took 1712.1863117218018 seconds
----------
Dataset splits for LJSpeech...

No Data Augmentation.
----------
Preparing metadata...
Including: 
LJSpeech

  0%|                                                                                                                                 | 0/1 [00:00<?, ?it/s]Singer LJSpeech_LJSpeech: 1363.67 mins for training
---------- 

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  4.77it/s]
Extracting acoustic features for LJSpeech using 4 workers ...
types:  ['train', 'valid', 'test']
Traceback (most recent call last):
  File "/zhangpai21/workspace/cgy/1_projects/7_Amphion/bins/tts/preprocess.py", line 250, in <module>
    main()
  File "/zhangpai21/workspace/cgy/1_projects/7_Amphion/bins/tts/preprocess.py", line 246, in main
    preprocess(cfg, args)
  File "/zhangpai21/workspace/cgy/1_projects/7_Amphion/bins/tts/preprocess.py", line 178, in preprocess
    extract_acoustic_features(dataset, output_path, cfg, args.num_workers)
  File "/zhangpai21/workspace/cgy/1_projects/7_Amphion/bins/tts/preprocess.py", line 44, in extract_acoustic_features
    with open(dataset_file, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'data/LJSpeech/valid.json'

[BUG]: Parameter "--fs" does not work properly

Describe the bug

The behavior of the --fs parameter does not match what is described in the Evaluation README. The README states that it is optional, but if I do not provide it, it raises an error. Also, no matter whether I pass it an int value (i.e. 24000) or a string value (i.e. '24000'), it raises an error.

How To Reproduce

Config/File changes: No changes.

BUG 1: --fs is not actually optional.
Run

$ bash egs/metrics/run.sh --reference_folder compare/ref_dir --generated_folder compare/gen_dir --dump_folder compare/dump_dir --metrics "fpc"

Get error

usage: calc_metrics.py [-h] [--ref_dir REF_DIR] [--deg_dir DEG_DIR] [--dump_dir DUMP_DIR]
                       [--metrics METRICS [METRICS ...]] [--fs FS]
calc_metrics.py: error: argument --fs: expected one argument

BUG 2: When we fill in --fs, no matter whether we pass 24000 (int) or "24000" (string), it still does not work properly.
If we use "24000"

$ bash egs/metrics/run.sh --reference_folder compare/ref_dir --generated_folder compare/gen_dir --dump_folder compare/dump_dir --metrics "fpc" --fs "24000"

Get error

  0%|                                                                     | 0/3 [00:00<?, ?it/s]
  0%|                                                                     | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/mnt/c/users/lukec/pycharmprojects/Amphion/bins/calc_metrics.py", line 155, in <module>
    calc_metric(args.ref_dir, args.deg_dir, args.dump_dir, args.metrics, args.fs)
  File "/mnt/c/users/lukec/pycharmprojects/Amphion/bins/calc_metrics.py", line 110, in calc_metric
    score = METRIC_FUNC[metric](
  File "/mnt/c/users/lukec/pycharmprojects/Amphion/evaluation/metrics/f0/f0_pearson_coefficients.py", line 49, in extract_fpc
    audio_ref, _ = librosa.load(audio_ref, sr=fs)
  File "/home/luke/.conda/envs/Amphion/lib/python3.10/site-packages/librosa/core/audio.py", line 192, in load
    y = resample(y, orig_sr=sr_native, target_sr=sr, res_type=res_type)
  File "/home/luke/.conda/envs/Amphion/lib/python3.10/site-packages/librosa/core/audio.py", line 668, in resample
    y_hat = np.apply_along_axis(
  File "<__array_function__ internals>", line 180, in apply_along_axis
  File "/home/luke/.conda/envs/Amphion/lib/python3.10/site-packages/numpy/lib/shape_base.py", line 379, in apply_along_axis
    res = asanyarray(func1d(inarr_view[ind0], *args, **kwargs))
  File "/home/luke/.conda/envs/Amphion/lib/python3.10/site-packages/soxr/__init__.py", line 145, in resample
    if in_rate <= 0 or out_rate <= 0:
TypeError: '<=' not supported between instances of 'str' and 'int'

If we use 24000

$ bash egs/metrics/run.sh --reference_folder compare/ref_dir --generated_folder compare/gen_dir --dump_folder compare/dump_dir --metrics "fpc" --fs 24000
  0%|                                                                     | 0/3 [00:00<?, ?it/s]
  0%|                                                                     | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/mnt/c/users/lukec/pycharmprojects/Amphion/bins/calc_metrics.py", line 155, in <module>
    calc_metric(args.ref_dir, args.deg_dir, args.dump_dir, args.metrics, args.fs)
  File "/mnt/c/users/lukec/pycharmprojects/Amphion/bins/calc_metrics.py", line 110, in calc_metric
    score = METRIC_FUNC[metric](
  File "/mnt/c/users/lukec/pycharmprojects/Amphion/evaluation/metrics/f0/f0_pearson_coefficients.py", line 49, in extract_fpc
    audio_ref, _ = librosa.load(audio_ref, sr=fs)
  File "/home/luke/.conda/envs/Amphion/lib/python3.10/site-packages/librosa/core/audio.py", line 192, in load
    y = resample(y, orig_sr=sr_native, target_sr=sr, res_type=res_type)
  File "/home/luke/.conda/envs/Amphion/lib/python3.10/site-packages/librosa/core/audio.py", line 668, in resample
    y_hat = np.apply_along_axis(
  File "<__array_function__ internals>", line 180, in apply_along_axis
  File "/home/luke/.conda/envs/Amphion/lib/python3.10/site-packages/numpy/lib/shape_base.py", line 379, in apply_along_axis
    res = asanyarray(func1d(inarr_view[ind0], *args, **kwargs))
  File "/home/luke/.conda/envs/Amphion/lib/python3.10/site-packages/soxr/__init__.py", line 145, in resample
    if in_rate <= 0 or out_rate <= 0:
TypeError: '<=' not supported between instances of 'str' and 'int'

Expected behavior

Execute without error and dump results.

Screenshots

See Reproduce part.

Environment Information

  • Operating System: Debian 12
  • Python Version: Python 3.10.13
  • Driver & CUDA Version: NVIDIA-SMI 545.36, Driver Version: 546.33, CUDA Version: 12.3
  • Error Messages and Logs: See Reproduce part.
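
Editor's note: the tracebacks above show the sample rate reaching librosa.load as a string, which suggests run.sh forwards --fs without converting it to an integer. Below is a minimal sketch of the kind of fix, assuming bins/calc_metrics.py parses its flags with argparse; the repository's exact argument handling is an assumption here.

import argparse

parser = argparse.ArgumentParser()
# Coerce --fs to int and give it a real default, so the flag is truly optional
# and "--fs 24000" no longer reaches librosa as the string "24000".
parser.add_argument("--fs", type=int, default=None)
args = parser.parse_args()

# Passing sr=None lets librosa.load keep each file's native sample rate.
fs = args.fs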

An issue with the preprocessing part of LibriTTS.

Hello!🤗 When I'm working with the LibriTTS dataset, I noticed that the generated test.json and train.json files do not include the corresponding text for each audio sample. This absence of text information causes an error when trying to extract the phonemes later on. Additionally, I couldn't find any code related to text processing in the preprocessors/libritts.py file.


Thanks for the team's hard work and contributions!🎉

Monotonic align not found. Please make sure you have compiled it.

Hello, I would like to run fine-tuning training for the model, but there was an error. I have already completed the preprocessing with this command.
egs/svc/VitsSVC/run.sh --stage 1

Afterward, I intend to run fine-tuning with the following command.
According to the README, I have also uploaded the pretrained 400000.pt to the server and specified the absolute path of the model in my command as follows:

sh egs/svc/VitsSVC/run.sh --stage 2 --name tingting \
    --resume true \
    --resume_from_ckpt_path "/root/Amphion/pretrained/bigvgan/400000.pt" \
    --resume_type "finetune"

But in the command line output, the following error appeared first.
Monotonic align not found. Please make sure you have compiled it.
After that, a series of error messages occurred. How should I resolve this? Thank you.

TTA dataset and update readme

The AudioCaps dataset is missing. A public download link is needed, and the corresponding README needs to be updated.

How to retrain a model?

I have trained a VITS model, but whenever I run the training process again, it starts from epoch 0 instead of continuing from the last epoch. Do you have any solutions for this issue?

[BUG]: NaturalSpeech2 training issue

Describe the bug

Thank you so much for sharing this wonderful project. However, I have some problems with NaturalSpeech2 (NS2) TTS training.
./egs/tts/NaturalSpeech2/README.md suggests following other Amphion TTS recipes for the data processing. But after I extracted the features needed by NS2 using the FastSpeech2 and VALL-E data preprocessing scripts, I found that I could not run the NS2 training script successfully. In ./models/tts/naturalspeech2/ns2_dataset.py, some of the features seem to be obtained by referring to "phones" and "num_frames" in the metadata, which are NOT included in the train.txt file.
Is there anything else I can do to run NS2 training successfully? Or should I just wait for the official update of the NS2 preprocessing, as mentioned in another issue?
Can any of the authors tell me when the preprocessing script will be ready? Looking forward to your reply.

VALL-E (model_train_stage 2) error output "No such file pytorch_model.bin"

Hi,

When I train VALL-E, I successfully trained the AR model. However, when I tried to train the NAR model with "sh egs/tts/VALLE/run.sh --stage 2 --model_train_stage 2 --ar_model_ckpt_dir [ARModelPath] --name [YourExptName]", it output "No such file pytorch_model.bin" in ARModelPath. In ARModelPath, I find that only "ckpts.json model.safetensors optimizer.bin random_state_0.pkl scheduler.bin" exist; there is no "pytorch_model.bin".
I tried to debug the code but cannot find the code used to save "pytorch_model.bin".

Please help check it. Thanks in advance.

Does amphion support multi-GPU training now?

Thank you for the great work!
I want to train a model with more data, but I don't know whether Amphion supports multi-GPU training now.
If not, will it be supported in the future?

Where is the hifigan vocoder of TTA released?

Here is the bug report:

  File "D:\github\Amphion\models\tta\ldm\audioldm_inference.py", line 42, in __init__
    self.build_vocoder()
  File "D:\github\Amphion\models\tta\ldm\audioldm_inference.py", line 68, in build_vocoder
    with open(config_file) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'D:/github/Amphion/ckpts/tta/hifigan_checkpoints/config.json'

Phoneme extraction very slow

Hi I am new to Amphion and am just starting to try the VITS recipe.

I am doing the preprocessing stage by simply running:

bash egs/tts/VITS/run.sh --stage 1

However, the operation seems very slow, considering that LJSpeech is not a very big dataset, with only 13k audio clips. The weirdest part is the phoneme extraction; on my server it is as slow as this:

Extracting content features for LJSpeech...
Extracting phoneme sequence for LJSpeech...
 30%|█████████████████▋                                        | 3987/13100 [49:53<2:03:53,  1.23it/s]

I checked the code and it does not seem to be doing anything complicated: it just gets the phonemes from either the lexicon or g2p_en and then saves them as an independent phoneme file. So I wonder whether this speed is normal; how fast does it run in your local environment? If it is also very slow at your end, would you consider making it run concurrently?

phone_extractor.py

    for utt in tqdm(metadata):
        uid = utt["Uid"]
        text = utt["Text"]

        # One g2p call per utterance; this is the loop that runs at ~1.2 it/s above.
        phone_seq = phone_extractor.extract_phone(text)

        # Each utterance's phoneme sequence is written to its own .phone file.
        phone_path = os.path.join(out_path, uid + ".phone")
        with open(phone_path, "w") as fin:
            fin.write(" ".join(phone_seq))

g2p_module.py

    def preprocess_english(self, text):
        text = text.rstrip(punctuation)

        # A new G2p() model is constructed on every call, i.e. once per utterance;
        # this instantiation is far more expensive than the lookups below and is
        # the most likely cause of the slowness.
        g2p = G2p()
        phones = []
        words = re.split(r"([,;.\-\?\!\s+])", text)
        for w in words:
            if w.lower() in self.lexicon:
                phones += self.lexicon[w.lower()]
            else:
                phones += list(filter(lambda p: p != " ", g2p(w)))
        phones = "}{".join(phones)
        phones = re.sub(r"\{[^\w\s]?\}", "{sp}", phones)
        phones = phones.replace("}{", " ")

        return phones
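
Editor's sketch of two easy speed-ups, not Amphion's code: construct G2p() once instead of once per utterance, and spread the remaining per-utterance work over a process pool. The metadata fields are assumed to match the snippet above, and the lexicon lookup is omitted for brevity.

from multiprocessing import Pool

from g2p_en import G2p
from tqdm import tqdm

_g2p = G2p()  # loaded once per process instead of once per utterance

def text_to_phones(utt):
    # Keep every non-space phone that g2p_en produces for this utterance's text.
    phones = [p for p in _g2p(utt["Text"]) if p != " "]
    return utt["Uid"], " ".join(phones)

if __name__ == "__main__":
    # Stand-in for the list of {"Uid", "Text"} dicts the recipe iterates over.
    metadata = [{"Uid": "LJ001-0001", "Text": "in being comparatively modern."}]
    with Pool(processes=4) as pool:
        for uid, phone_seq in tqdm(pool.imap_unordered(text_to_phones, metadata)):
            with open(uid + ".phone", "w") as fout:
                fout.write(phone_seq)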

[BUG]: Unable to run stage 1 with FastSpeech2

Describe the bug

I followed the tutorial for the FastSpeech2 example recipe and could not get past the first stage.
This problem also occurs on my Windows laptop.

How To Reproduce

Steps to reproduce the behavior:

  1. Config/File changes: Only the local path of the dataset
  2. Run command: sh egs/tts/FastSpeech2/run.sh --stage 1

Expected behavior

Data Preparation failed and was interrupted.

Screenshots

See error:
(Amphion) harrywang@Harrys-MacBook-Air Amphion % sh egs/tts/FastSpeech2/run.sh --stage 1
/Users/harrywang/Amphion/mfa
Exprimental Configuration File: /Users/harrywang/Amphion/egs/tts/FastSpeech2/exp_config.json
Preprocess LJSpeech...
Prepare alignment LJSpeech...
0it [00:00, ?it/s]
Traceback (most recent call last):
File "/Users/harrywang/Amphion/bins/tts/preprocess.py", line 244, in
main()
File "/Users/harrywang/Amphion/bins/tts/preprocess.py", line 240, in main
preprocess(cfg, args)
File "/Users/harrywang/Amphion/bins/tts/preprocess.py", line 112, in preprocess
prepare_align(
File "/Users/harrywang/Amphion/preprocessors/processor.py", line 104, in prepare_align
ljspeech.prepare_align(dataset, dataset_path, cfg, output_path)
File "/Users/harrywang/Amphion/preprocessors/ljspeech.py", line 139, in prepare_align
wav, _ = librosa.load(wav_path, sampling_rate)
TypeError: load() takes 1 positional argument but 2 were given

Environment Information

  • Operating System: MacOS 14.2.1 (problem also occur on Windows 11)
  • Python Version: Python 3.9.15
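
Editor's note, not an official patch: in recent librosa releases, load() takes only the path positionally, so the call at preprocessors/ljspeech.py line 139 needs the sample rate as a keyword argument. A small sketch of the likely fix, with a hypothetical file path:

import librosa

sampling_rate = 22050  # whatever the recipe's exp_config.json specifies
# librosa >= 0.10 makes every argument after `path` keyword-only.
wav, _ = librosa.load("LJ001-0001.wav", sr=sampling_rate)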

Feature Alignment in SVC dataset

I am trying to use the latent feature from Encodec as the condition for the SVC diffusion network. However, I encountered some problems when aligning the length of the Encodec feature sequence to the length of the Mel spectrogram. Specifically, I tried to call the offline_align() function in __getitem__() of SVCDataset, but I am not sure how to calculate source_hop:

source_hop = (
    self.cfg.preprocess.whisper_frameshift
    * self.cfg.preprocess.whisper_downsample_rate
    * self.cfg.preprocess.sample_rate
)

So my questions are:

  1. Where do source_hop and target_hop come from? I am not sure whether neural codecs like Encodec or SpeechTokenizer have a "frameshift". How should I calculate source_hop in this case? (A small sketch follows this list.)
  2. It is said that the frameshifts of the content features and the Mel spectrogram should not differ much. Considering this, is it still reasonable to use Encodec features as the condition? (The strides in the Encodec encoder are [2, 4, 5, 8], so I suppose the downsample rate is 320.)
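
Editor's sketch of one way to reason about this (an assumption, not the maintainers' answer): a frame-based codec does have an effective frameshift, namely the product of its encoder strides divided by its sample rate, so source_hop can be derived the same way as for the Whisper features above. The Mel hop size and the alignment helper below are illustrative placeholders.

import numpy as np

# EnCodec (24 kHz) encoder strides and the resulting hop in samples.
encodec_strides = [2, 4, 5, 8]
encodec_hop = int(np.prod(encodec_strides))   # 320 samples per codec frame
encodec_frameshift = encodec_hop / 24000      # ~13.3 ms, i.e. 75 frames per second

# By analogy with the Whisper branch of offline_align(), both hops are in samples.
source_hop = encodec_hop                      # codec feature hop
target_hop = 256                              # hypothetical Mel hop_size from the config

def nearest_frame_align(feature, source_hop, target_hop, target_len):
    """Map a [T_src, D] codec feature onto target_len Mel frames by nearest frame."""
    idx = (np.arange(target_len) * target_hop) // source_hop
    return feature[np.minimum(idx, len(feature) - 1)]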

ImportError: cannot import name 'VariableSampler' from 'models.base.base_sampler'

When I run "sh egs/tts/VITS/run.sh --stage 2 --name ljs_base"
I had an issue with the following:

"""
Traceback (most recent call last):
File "/home/zhangy33/data1/lz/tts/Amphion/bins/tts/train.py", line 13, in
from models.tts.naturalspeech2.ns2_trainer import NS2Trainer
File "/home/zhangy33/data1/lz/tts/Amphion/models/tts/naturalspeech2/ns2_trainer.py", line 16, in
from models.base.base_sampler import VariableSampler
ImportError: cannot import name 'VariableSampler' from 'models.base.base_sampler'
"""

I couldn't find "VariableSampler" when I checked models/base/base_sampler.py.

Can you help me with this problem?

[BUG]: Fix for AssertionError When Running FastSpeech2 Preprocessing (run.sh --stage 1)

Describe the bug

When running the preprocessing stage of FastSpeech2 within the Amphion project, an AssertionError was encountered, stating that the Montreal Forced Aligner (MFA) tools were not found at the expected path.

How To Reproduce

Steps to reproduce the behavior:

  1. Config/File changes: cd Amphion , I run this command in the Amphion root path.
  2. Run command: sh egs/tts/FastSpeech2/run.sh --stage 1 to initiate the preprocessing stage.
  3. See error: 'AssertionError: Please download the MFA tools to Amphion/mfa/montreal-forced-aligner/bin/mfa_align firstly.'

Expected behavior

The expected behavior was that the preprocessing stage would complete without errors, provided that all the necessary tools and dependencies were correctly installed and configured.

Additional context

Upon investigating the issue, it was discovered that the AssertionError was due to a file path issue in Amphion/preprocessors/ljspeech.py at line 41, which in turn was caused by the result of os.path.exists(lexicon) returning False at line 39. The problem was traced back to line 28, where the lexicon path should be modified to
lexicon=os.path.join("text", "lexicon", "librispeech-lexicon.txt").
After making this change, the preprocessing stage ran successfully.

Data preparation for TTA example

Hi, thanks for your nice work.

Could you provide the script for extracting the acoustic feature for the TTA task?

The beginner recipe said there are four stages for this task.

  • Data preparation
  • Train VAE
  • Train LDM model
  • Inference

I can find scripts for three of these stages, but not the one for data preparation.

OSError: Model file not found: pretrained/contentvec/checkpoint_best_legacy_500.pt

Hello, I've encountered another strange issue, again while using VITS SVC.
I ran the following command:
./run.sh --stage 1
I've already placed the required model file at this path:
"/root/Amphion/pretrained/contentvec/checkpoint_best_legacy_500.pt"

Here is the error report:
Traceback (most recent call last):
File "/root/Amphion/bins/svc/preprocess.py", line 183, in <module>
main()
File "/root/Amphion/bins/svc/preprocess.py", line 179, in main
preprocess(cfg, args)
File "/root/Amphion/bins/svc/preprocess.py", line 165, in preprocess
extract_content_features(dataset, output_path, cfg, args.num_workers)
File "/root/Amphion/bins/svc/preprocess.py", line 64, in extract_content_features
content_extractor.extract_utt_content_features_dataloader(
File "/root/Amphion/processors/content_extractor.py", line 488, in extract_utt_content_features_dataloader
extractor.load_model()
File "/root/Amphion/processors/content_extractor.py", line 247, in load_model
models, saved_cfg, task = checkpoint_utils.load_model_ensemble_and_task(
File "/root/miniconda3/envs/amphion/lib/python3.9/site-packages/fairseq/checkpoint_utils.py", line 423, in load_model_ensemble_and_task
raise IOError("Model file not found: {}".format(filename))
OSError: Model file not found: pretrained/contentvec/checkpoint_best_legacy_500.pt

[Help]: about VALLE training

After training VALL-E on LibriTTS, we noticed a drop in the training loss and a slight increase in the validation loss. Also, when we trained the AR and NAR models for 20 epochs each, the synthesized speech quality wasn't great, and unfortunately, even after 100 epochs of training, the results were still disappointing. Did the team encounter an increase in validation loss during training? I only made use of the lexicon, without making any other changes.

Screenshots of the TensorBoard curves, the training INFO log, and ckpts.json were attached (images not shown here).

[Help]: Format of CustomSVCDataset

Problem Overview

I adjusted my dataset according to the CustomSVCDataset format requirements, but when I executed sh egs/svc/MultipleContentsSVC/run.sh --stage 1, I got an error. The error comes from this line in the metadata.py file: utterances = sorted(utterances, key=lambda x: x["Duration"]). So the dataset must have a "Duration" field; is there anything else needed to customize the dataset? (See the sketch below.)

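Editor's sketch, not the official custom-dataset specification: each utterance entry in the generated metadata needs at least a "Duration" field, which can be filled with librosa. The field names other than "Duration" are assumptions modeled on the other preprocessors.

import json

import librosa

utt = {
    "Dataset": "tingting",  # assumed field names and values
    "Singer": "tingting",
    "Uid": "298",
    "Path": "/root/Amphion/egs/datasets/tingting/tingting/298.wav",
}
# librosa >= 0.10 uses path=; older releases use filename= instead.
utt["Duration"] = librosa.get_duration(path=utt["Path"])
print(json.dumps(utt, indent=2))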

RuntimeError: Placeholder storage has not been allocated on MPS device!

I tried to run the script on my macOS (M2) as follows:
CUDA_VISIBLE_DEVICES='mps:0' accelerate launch bins/tts/inference.py \
--config "ckpts/tts/valle_libritts/args.json" \
--log_level debug \
--acoustics_dir ckpts/tts/valle_libritts \
--output_dir ckpts/tts/valle_libritts/result \
--mode "single" \
--text "his is a clip of generated speech with the given text from Amphion Vall-E mode" \
--text_prompt "many animals of even complex structure which live parasitically within others are wholly devoid of an alimentary cavity" \
--audio_prompt ckpts/tts/valle_libritts/prompt/LJ025-0076.wav \
--test_list_file None

but I got error messages (the attached screenshot did not finish uploading).

CosineAnnealingLR

The params of the CosineAnnealingLR scheduler in valle_trainer.py seem different from the PyTorch docs.

code:

            scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
                self.cfg.train.warmup_steps,
                self.optimizer,
                eta_min=self.cfg.train.base_lr,
            )

pytorch 2.0 Docs:
torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max, eta_min=0, last_epoch=- 1, verbose=False)

Should T_max be set to warmup_steps, or does it need another special setting?
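
For reference, the documented signature quoted above puts the optimizer first. Below is an editor's standalone sketch of the call with the arguments in that order; whether T_max should equal warmup_steps or the total number of training steps is a question for the authors, and the values here are placeholders.

import torch

model = torch.nn.Linear(10, 10)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer,        # the optimizer is the first positional argument
    T_max=10000,      # annealing horizon in steps (placeholder value)
    eta_min=1e-5,     # floor learning rate (valle_trainer.py uses cfg.train.base_lr here)
)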

Does it support multiple languages?

Hi, thanks for your nice work.

I have a 20k-hour Turkish dataset; can I train a non-English model with this repository?
I see that you use a g2p module, and it is only for English.

Thanks,

Are there plans to create scripts or .py files that can run on Windows in the future?

Hello, I was wondering if there are any plans to create .py files that can run on Windows?
Currently, all the scripts in the project are .sh files, which cannot be executed on Windows.
I'm also considering whether to spend time converting the .sh scripts into .py files. If someone is already working on this, I'll just wait for it.
Thank you!

ImportError: cannot import name 'quote' from 'urllib' (/root/miniconda3/lib/python3.8/urllib/__init__.py)

Hello, I encountered an error while using run.sh in VITS SVC. How can I resolve this?
Thanks!

root@autodl-container-06e841a5b8-eff3dbef:~/Amphion/egs/svc/VitsSVC# ./run.sh --stage 1
Exprimental Configuration File: /root/Amphion/egs/svc/VitsSVC/exp_config.json
2023-12-31 10:45:06 | INFO | fairseq.tasks.text_to_speech | Please install tensorboardX: pip install tensorboardX
Traceback (most recent call last):
File "/root/Amphion/bins/svc/preprocess.py", line 19, in <module>
from processors import acoustic_extractor, content_extractor, data_augment
File "/root/Amphion/processors/data_augment.py", line 12, in <module>
import parselmouth
File "/root/miniconda3/lib/python3.8/site-packages/parselmouth/__init__.py", line 22, in <module>
from parselmouth.base import Parselmouth
File "/root/miniconda3/lib/python3.8/site-packages/parselmouth/base.py", line 30, in <module>
from parselmouth.adapters.dfp.interface import DFPInterface
File "/root/miniconda3/lib/python3.8/site-packages/parselmouth/adapters/dfp/interface.py", line 17, in <module>
from urllib import quote
ImportError: cannot import name 'quote' from 'urllib' (/root/miniconda3/lib/python3.8/urllib/__init__.py)
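
Editor's note, inferred from the traceback rather than an official answer: the adapters/dfp path above belongs to the unrelated "parselmouth" ad-operations package on PyPI, while Amphion's data_augment.py expects praat-parselmouth, which installs under the same import name. A quick sanity check, assuming the wav path is replaced with a real file:

# pip uninstall parselmouth && pip install praat-parselmouth
import parselmouth

snd = parselmouth.Sound("example.wav")  # hypothetical path to any wav file
pitch = snd.to_pitch()                  # Praat functionality that only praat-parselmouth provides
print(pitch)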

When I run "sh egs/svc/MultipleContentsSVC/run.sh --stage 1", it doesn't work! Where can I get '[Opencpop dataset path]/segments/train.txt'?

Exprimental Configuration File: /home/ayit/Downloads/Amphion/egs/svc/MultipleContentsSVC/exp_config.json
Preprocess m4singer...

Preparing test samples for m4singer...

M4Singer: 20 singers, 20896 utterances (419 unique songs)
Singers:
Alto-1 Alto-2 Alto-3 Alto-4 Alto-5 Alto-6 Alto-7 Bass-1 Bass-2 Bass-3 Soprano-1 Soprano-2 Soprano-3 Tenor-1 Tenor-2 Tenor-3 Tenor-4 Tenor-5 Tenor-6 Tenor-7
0%| | 0/20 [00:00<?, ?it/s]/home/ayit/Downloads/Amphion/preprocessors/m4singer.py:106: FutureWarning: get_duration() keyword argument 'filename' has been renamed to 'path' in version 0.10.0.
This alias will be removed in version 1.0.
duration = librosa.get_duration(filename=res["Path"])
100%|█████████████████████████████████████████| 20/20 [00:05<00:00, 3.93it/s]
#Train = 20739, #Test = 122
#Train hours= 29.48246345867975, #Test hours= 0.16981873925264554
Preprocess opencpop...

Dataset splits for opencpop...

Traceback (most recent call last):
File "/home/ayit/Downloads/Amphion/bins/svc/preprocess.py", line 182, in
main()
File "/home/ayit/Downloads/Amphion/bins/svc/preprocess.py", line 178, in main
preprocess(cfg, args)
File "/home/ayit/Downloads/Amphion/bins/svc/preprocess.py", line 83, in preprocess
preprocess_dataset(
File "/home/ayit/Downloads/Amphion/preprocessors/processor.py", line 48, in preprocess_dataset
opencpop.main(dataset, output_path, dataset_path)
File "/home/ayit/Downloads/Amphion/preprocessors/opencpop.py", line 66, in main
res, hours = get_uid2utt(opencpop_path, dataset, dataset_type)
File "/home/ayit/Downloads/Amphion/preprocessors/opencpop.py", line 26, in get_uid2utt
lines = get_lines(file)
File "/home/ayit/Downloads/Amphion/preprocessors/opencpop.py", line 15, in get_lines
with open(file, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: '[Opencpop dataset path]/segments/train.txt'

During preprocessing, my custom dataset could not be found.

Hello, I've encountered a problem again. It might be a simple mistake, but I can't find a way to solve it.
When running ./run.sh --stage 1, it couldn't find the files in my dataset.
I've already checked the README, but I still can't resolve this error. I need more guidance, thank you!

My data is placed here →
/root/Amphion/egs/datasets/
My wav file is here →
/root/Amphion/egs/datasets/tingting/tingting/298.wav
The beginning part of my exp_config.json file→
{
"base_config": "config/vitssvc.json",
"model_type": "VitsSVC",
"dataset": [
"tingting"
],
"dataset_path": {
// TODO: Fill in your dataset path
"tingting": "/root/Amphion/egs/datasets/"
},

The following is the content of the error report→
(amphion) root@autodl-container-25b911bc3c-59272fc1:~/Amphion/egs/svc/VitsSVC# ./run.sh --stage 1
Exprimental Configuration File: /root/Amphion/egs/svc/VitsSVC/exp_config.json
Preprocess tingting...
No Data Augmentation.

Preparing metadata...
Including:
tingting
0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/root/Amphion/bins/svc/preprocess.py", line 183, in
main()
File "/root/Amphion/bins/svc/preprocess.py", line 179, in main
preprocess(cfg, args)
File "/root/Amphion/bins/svc/preprocess.py", line 108, in preprocess
cal_metadata(cfg)
File "/root/Amphion/preprocessors/metadata.py", line 27, in cal_metadata
assert os.path.exists(save_dir)
AssertionError

FileNotFoundError: [Errno 2] No such file or directory: '[M4Singer dataset path]/meta.json' when running sh egs/svc/MultipleContentsSVC/run.sh --stage 1

This is my terminal:
(amphion) ayit@aiexplorer:~/Downloads/Amphion$ sh egs/svc/MultipleContentsSVC/run.sh --stage 1
Exprimental Configuration File: /home/ayit/Downloads/Amphion/egs/svc/MultipleContentsSVC/exp_config.json
Preprocess m4singer...

Preparing test samples for m4singer...

Traceback (most recent call last):
File "/home/ayit/Downloads/Amphion/bins/svc/preprocess.py", line 182, in
main()
File "/home/ayit/Downloads/Amphion/bins/svc/preprocess.py", line 178, in main
preprocess(cfg, args)
File "/home/ayit/Downloads/Amphion/bins/svc/preprocess.py", line 83, in preprocess
preprocess_dataset(
File "/home/ayit/Downloads/Amphion/preprocessors/processor.py", line 50, in preprocess_dataset
m4singer.main(output_path, dataset_path)
File "/home/ayit/Downloads/Amphion/preprocessors/m4singer.py", line 71, in main
with open(meta_file, "r", encoding="utf-8") as f:
FileNotFoundError: [Errno 2] No such file or directory: '[M4Singer dataset path]/meta.json'

[Error] Please specify the running stage

When I run the command below (following the guide: https://huggingface.co/amphion/valle_libritts):
'''
sh egs/tts/VALLE/run.sh --stage 3 --gpu "0"
--config "ckpts/tts/valle_libritts/args.json"
--infer_expt_dir Amphion/ckpts/tts/valle_libritts
--infer_output_dir Amphion/ckpts/tts/valle_libritts/result
--infer_mode "single"
--infer_text "This is a clip of generated speech with the given text from Amphion Vall-E model."
--infer_text_prompt "But even the unsuccessful dramatist has his moments."
'''
I got the following error message:
[Error] Please specify the running stage

Unable to run training script of Natural Speech 2

Hi,

I ran into multiple issues trying to run the training script:
In ns2_dataset.py:

  • self.utt2phone[utt] = utt_info["phones"]: where do "phones" come from? I suspect we need to run the phonemizer first, but I don't see extract_phone=True in the config file.
  • utt_info["num_frames"] is utt_info["Duration"], right?

In exp_config_base.json:

  • Given use_code=true, use_pitch=true, and use_phone, should extract_acoustic_token=true, extract_pitch=true, and extract_phone=true also be set?
  • There seems to be some mismatch between tts/preprocessing.py and the config file. For example: code_dir should be acoustic_token_dir?
