
open-mmlab / amphion

3.9K stars · 50 watchers · 316 forks · 10.57 MB

Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.

Home Page: https://openhlt.github.io/amphion/

License: MIT License

Python 91.87% Shell 3.74% Cython 0.05% Dockerfile 0.10% HTML 0.77% JavaScript 3.48%
audio-generation audio-synthesis audioldm hifi-gan music-generation naturalspeech2 singing-voice-conversion speech-synthesis text-to-audio text-to-speech

amphion's Introduction

Amphion: An Open-Source Audio, Music, and Speech Generation Toolkit


Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development. Amphion offers a unique feature: visualizations of classic models or architectures. We believe that these visualizations are beneficial for junior researchers and engineers who wish to gain a better understanding of the model.

The North-Star objective of Amphion is to offer a platform for studying the conversion of any inputs into audio. Amphion is designed to support individual generation tasks, including but not limited to,

  • TTS: Text to Speech (⛳ supported)
  • SVS: Singing Voice Synthesis (👨‍💻 developing)
  • VC: Voice Conversion (👨‍💻 developing)
  • SVC: Singing Voice Conversion (⛳ supported)
  • TTA: Text to Audio (⛳ supported)
  • TTM: Text to Music (👨‍💻 developing)
  • more…

In addition to the specific generation tasks, Amphion also includes several vocoders and evaluation metrics. A vocoder is an important module for producing high-quality audio signals, while evaluation metrics are critical for ensuring consistent and meaningful measurements across generation tasks.

Here is the Amphion v0.1 demo, in which the voice, audio effects, and singing voices are all generated by our models. Enjoy!

Amphion-Demo-EN.mp4

🚀 News

  • 2024/03/12: Amphion now supports NaturalSpeech3 FACodec and releases pretrained checkpoints. arXiv hf hf readme
  • 2024/02/22: The first Amphion visualization tool, SingVisio, is released. arXiv openxlab Video readme
  • 2023/12/18: Amphion v0.1 release. arXiv hf youtube readme
  • 2023/11/28: Amphion alpha release. readme

⭐ Key Features

TTS: Text to Speech

  • Amphion achieves state-of-the-art performance when compared with existing open-source repositories on text-to-speech (TTS) systems. It supports the following models or architectures:
    • FastSpeech2: A non-autoregressive TTS architecture that utilizes feed-forward Transformer blocks.
    • VITS: An end-to-end TTS architecture that utilizes a conditional variational autoencoder with adversarial learning.
    • VALL-E: A zero-shot TTS architecture that uses a neural codec language model with discrete codes.
    • NaturalSpeech2: An architecture for TTS that utilizes a latent diffusion model to generate natural-sounding voices.

SVC: Singing Voice Conversion

  • Amphion supports multiple content-based features from various pretrained models, including WeNet, Whisper, and ContentVec. Their specific roles in SVC have been investigated in our NeurIPS 2023 workshop paper. arXiv code
  • Amphion implements several state-of-the-art model architectures, including diffusion-, transformer-, VAE- and flow-based models. The diffusion-based architecture uses Bidirectional dilated CNN as a backend and supports several sampling algorithms such as DDPM, DDIM, and PNDM (a minimal DDIM step is sketched below for illustration). Additionally, it supports single-step inference based on the Consistency Model.
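
To make the sampling-algorithm support above concrete, here is a minimal, self-contained PyTorch sketch of one deterministic DDIM update step. It illustrates the general technique only and is not Amphion's implementation; the denoiser output, noise schedule, and tensor shapes below are placeholders.

import torch

def ddim_step(x_t, eps_pred, t, t_prev, alphas_cumprod):
    """One deterministic DDIM update (eta = 0) from step t to step t_prev."""
    a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    # Predict the clean sample x0 from the current noisy sample.
    x0_pred = (x_t - (1 - a_t).sqrt() * eps_pred) / a_t.sqrt()
    # Step to the less noisy sample along the deterministic DDIM trajectory.
    return a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps_pred

# Example: a linear beta schedule and one jump from step 100 to step 80.
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
x_t = torch.randn(1, 80, 200)            # e.g. a Mel-spectrogram-shaped sample
eps_pred = torch.randn_like(x_t)         # stand-in for the denoiser's prediction
x_prev = ddim_step(x_t, eps_pred, t=100, t_prev=80, alphas_cumprod=alphas_cumprod)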

TTA: Text to Audio

  • Amphion supports TTA with a latent diffusion model. It is designed like AudioLDM, Make-an-Audio, and AUDIT. It is also the official implementation of the text-to-audio generation part of our NeurIPS 2023 paper. arXiv code

Vocoder

Evaluation

Amphion provides a comprehensive objective evaluation of the generated audio. The evaluation metrics include (a worked example follows the list):

  • F0 Modeling: F0 Pearson Coefficients, F0 Periodicity Root Mean Square Error, F0 Root Mean Square Error, Voiced/Unvoiced F1 Score, etc.
  • Energy Modeling: Energy Root Mean Square Error, Energy Pearson Coefficients, etc.
  • Intelligibility: Character/Word Error Rate, which can be calculated based on Whisper and more.
  • Spectrogram Distortion: Frechet Audio Distance (FAD), Mel Cepstral Distortion (MCD), Multi-Resolution STFT Distance (MSTFT), Perceptual Evaluation of Speech Quality (PESQ), Short Time Objective Intelligibility (STOI), etc.
  • Speaker Similarity: Cosine similarity, which can be calculated based on RawNet3, Resemblyzer, WeSpeaker, WavLM and more.
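
As a worked example of one metric above, here is a short, self-contained sketch of computing an F0 Pearson Coefficient between a reference and a generated utterance. It is not Amphion's evaluation code; it assumes librosa and scipy are installed and uses pYIN for F0 extraction.

import librosa
import numpy as np
from scipy.stats import pearsonr

def f0_pearson(ref_path, gen_path, fs=24000):
    # Load both signals at the same sample rate.
    ref, _ = librosa.load(ref_path, sr=fs)
    gen, _ = librosa.load(gen_path, sr=fs)
    # Frame-level F0 with pYIN; unvoiced frames come back as NaN.
    kwargs = dict(fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=fs)
    f0_ref, _, _ = librosa.pyin(ref, **kwargs)
    f0_gen, _, _ = librosa.pyin(gen, **kwargs)
    # Truncate to a common length and keep frames voiced in both signals.
    n = min(len(f0_ref), len(f0_gen))
    f0_ref, f0_gen = f0_ref[:n], f0_gen[:n]
    voiced = ~np.isnan(f0_ref) & ~np.isnan(f0_gen)
    corr, _ = pearsonr(f0_ref[voiced], f0_gen[voiced])
    return corr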

Datasets

Amphion unifies the data preprocessing of open-source datasets including AudioCaps, LibriTTS, LJSpeech, M4Singer, Opencpop, OpenSinger, SVCC, VCTK, and more. The supported dataset list can be seen here (updating).

Visualization

Amphion provides visualization tools to interactively illustrate the internal processing mechanism of classic models. This provides an invaluable resource for educational purposes and for facilitating understandable research.

Currently, Amphion supports SingVisio, a visualization tool of the diffusion model for singing voice conversion. arXiv openxlab Video

📀 Installation

Amphion can be installed through either Setup Installer or Docker Image.

Setup Installer

git clone https://github.com/open-mmlab/Amphion.git
cd Amphion

# Install Python Environment
conda create --name amphion python=3.9.15
conda activate amphion

# Install Python Packages Dependencies
sh env.sh

Docker Image

  1. Install Docker, NVIDIA Driver, NVIDIA Container Toolkit, and CUDA.

  2. Run the following commands:

git clone https://github.com/open-mmlab/Amphion.git
cd Amphion

docker pull realamphion/amphion
docker run --runtime=nvidia --gpus all -it -v .:/app realamphion/amphion

Mounting the dataset with the -v argument is necessary when using Docker. Please refer to Mount dataset in Docker container and Docker Docs for more details.

🐍 Usage in Python

We detail the instructions for the different tasks in the following recipes:

👨‍💻 Contributing

We appreciate all contributions to improve Amphion. Please refer to CONTRIBUTING.md for the contributing guideline.

🙏 Acknowledgement

©️ License

Amphion is under the MIT License. It is free for both research and commercial use cases.

📚 Citations

@article{zhang2023amphion,
      title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit}, 
      author={Xueyao Zhang and Liumeng Xue and Yicheng Gu and Yuancheng Wang and Haorui He and Chaoren Wang and Xi Chen and Zihao Fang and Haopeng Chen and Junan Zhang and Tze Ying Tang and Lexiao Zou and Mingxuan Wang and Jun Han and Kai Chen and Haizhou Li and Zhizheng Wu},
      journal={arXiv},
      year={2024},
      volume={abs/2312.09911}
}

amphion's People

Contributors

adorable-qin, bakerbunker, chenx17, eltociear, harryhe11, hecheng0625, lmxue, lokshaw-chau, merakist, rmsnow, treya-lin, viewfinder-annn, vocodexelysium, wsywsywsywsywsy979, yasiendwieb, yuantuo666, zhizhengwu, zyingt


amphion's Issues

Issue Running FastSpeech2 Model - FileNotFoundError: 'data/LJSpeech/valid.json'

Hi, thank you for developing this excellent project.

I am attempting to execute the FastSpeech2 model using the provided instructions at:
https://github.com/open-mmlab/Amphion/tree/main/egs/tts/FastSpeech2

Upon running the process command sh egs/tts/FastSpeech2/run.sh --stage 1, I encountered the following error: FileNotFoundError: [Errno 2] No such file or directory: 'data/LJSpeech/valid.json'.

Could you please provide guidance on resolving this issue? Your assistance is much appreciated.

Full log:

(amphion) root@5d89psego5dhs-0:/zhangpai21/workspace/cgy/1_projects/7_Amphion# sh egs/tts/FastSpeech2/run.sh --stage 1
/zhangpai21/workspace/cgy/1_projects/7_Amphion/mfa
Exprimental Configuration File: /zhangpai21/workspace/cgy/1_projects/7_Amphion/egs/tts/FastSpeech2/exp_config.json
Preprocess LJSpeech...
Prepare alignment LJSpeech...
13100it [00:01, 10765.15it/s]
MFA results are save in data/LJSpeech/TextGrid
Setting up corpus information...
Number of speakers in corpus: 1, average number of utterances per speaker: 13100.0
Creating dictionary information...
Setting up corpus_data directory...
Generating base features (mfcc)...
Calculating CMVN...
Done with setup.
Done! Everything took 1712.1863117218018 seconds
----------
Dataset splits for LJSpeech...

No Data Augmentation.
----------
Preparing metadata...
Including: 
LJSpeech

  0%|                                                                                                                                 | 0/1 [00:00<?, ?it/s]Singer LJSpeech_LJSpeech: 1363.67 mins for training
---------- 

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  4.77it/s]
Extracting acoustic features for LJSpeech using 4 workers ...
types:  ['train', 'valid', 'test']
Traceback (most recent call last):
  File "/zhangpai21/workspace/cgy/1_projects/7_Amphion/bins/tts/preprocess.py", line 250, in <module>
    main()
  File "/zhangpai21/workspace/cgy/1_projects/7_Amphion/bins/tts/preprocess.py", line 246, in main
    preprocess(cfg, args)
  File "/zhangpai21/workspace/cgy/1_projects/7_Amphion/bins/tts/preprocess.py", line 178, in preprocess
    extract_acoustic_features(dataset, output_path, cfg, args.num_workers)
  File "/zhangpai21/workspace/cgy/1_projects/7_Amphion/bins/tts/preprocess.py", line 44, in extract_acoustic_features
    with open(dataset_file, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'data/LJSpeech/valid.json'

[BUG]: Parameter "--fs" does not work properly

Describe the bug

The behavior of the --fs parameter does not match what is described in the Evaluation README. The README states that it is optional, but if I do not provide it, it raises an error. Also, no matter whether I pass it an int value (i.e. 24000) or a string value (i.e. '24000'), it raises an error.

How To Reproduce

Config/File changes: No changes.

BUG 1: --fs is not actually optional.
Run

$ bash egs/metrics/run.sh --reference_folder compare/ref_dir --generated_folder compare/gen_dir --dump_folder compare/dump_dir --metrics "fpc"

Get error

usage: calc_metrics.py [-h] [--ref_dir REF_DIR] [--deg_dir DEG_DIR] [--dump_dir DUMP_DIR]
                       [--metrics METRICS [METRICS ...]] [--fs FS]
calc_metrics.py: error: argument --fs: expected one argument

BUG 2: When we fill in --fs, no matter whether we pass 24000 (int) or "24000" (string), it still does not work properly.
If we use "24000"

$ bash egs/metrics/run.sh --reference_folder compare/ref_dir --generated_folder compare/gen_dir --dump_folder compare/dump_dir --metrics "fpc" --fs "24000"

Get error

  0%|                                                                     | 0/3 [00:00<?, ?it/s]
  0%|                                                                     | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/mnt/c/users/lukec/pycharmprojects/Amphion/bins/calc_metrics.py", line 155, in <module>
    calc_metric(args.ref_dir, args.deg_dir, args.dump_dir, args.metrics, args.fs)
  File "/mnt/c/users/lukec/pycharmprojects/Amphion/bins/calc_metrics.py", line 110, in calc_metric
    score = METRIC_FUNC[metric](
  File "/mnt/c/users/lukec/pycharmprojects/Amphion/evaluation/metrics/f0/f0_pearson_coefficients.py", line 49, in extract_fpc
    audio_ref, _ = librosa.load(audio_ref, sr=fs)
  File "/home/luke/.conda/envs/Amphion/lib/python3.10/site-packages/librosa/core/audio.py", line 192, in load
    y = resample(y, orig_sr=sr_native, target_sr=sr, res_type=res_type)
  File "/home/luke/.conda/envs/Amphion/lib/python3.10/site-packages/librosa/core/audio.py", line 668, in resample
    y_hat = np.apply_along_axis(
  File "<__array_function__ internals>", line 180, in apply_along_axis
  File "/home/luke/.conda/envs/Amphion/lib/python3.10/site-packages/numpy/lib/shape_base.py", line 379, in apply_along_axis
    res = asanyarray(func1d(inarr_view[ind0], *args, **kwargs))
  File "/home/luke/.conda/envs/Amphion/lib/python3.10/site-packages/soxr/__init__.py", line 145, in resample
    if in_rate <= 0 or out_rate <= 0:
TypeError: '<=' not supported between instances of 'str' and 'int'

If we use 24000

$ bash egs/metrics/run.sh --reference_folder compare/ref_dir --generated_folder compare/gen_dir --dump_folder compare/dump_dir --metrics "fpc" --fs 24000
  0%|                                                                     | 0/3 [00:00<?, ?it/s]
  0%|                                                                     | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/mnt/c/users/lukec/pycharmprojects/Amphion/bins/calc_metrics.py", line 155, in <module>
    calc_metric(args.ref_dir, args.deg_dir, args.dump_dir, args.metrics, args.fs)
  File "/mnt/c/users/lukec/pycharmprojects/Amphion/bins/calc_metrics.py", line 110, in calc_metric
    score = METRIC_FUNC[metric](
  File "/mnt/c/users/lukec/pycharmprojects/Amphion/evaluation/metrics/f0/f0_pearson_coefficients.py", line 49, in extract_fpc
    audio_ref, _ = librosa.load(audio_ref, sr=fs)
  File "/home/luke/.conda/envs/Amphion/lib/python3.10/site-packages/librosa/core/audio.py", line 192, in load
    y = resample(y, orig_sr=sr_native, target_sr=sr, res_type=res_type)
  File "/home/luke/.conda/envs/Amphion/lib/python3.10/site-packages/librosa/core/audio.py", line 668, in resample
    y_hat = np.apply_along_axis(
  File "<__array_function__ internals>", line 180, in apply_along_axis
  File "/home/luke/.conda/envs/Amphion/lib/python3.10/site-packages/numpy/lib/shape_base.py", line 379, in apply_along_axis
    res = asanyarray(func1d(inarr_view[ind0], *args, **kwargs))
  File "/home/luke/.conda/envs/Amphion/lib/python3.10/site-packages/soxr/__init__.py", line 145, in resample
    if in_rate <= 0 or out_rate <= 0:
TypeError: '<=' not supported between instances of 'str' and 'int'

Expected behavior

Execute without error and dump results.

Screenshots

See Reproduce part.

Environment Information

  • Operating System: Debian 12
  • Python Version: Python 3.10.13
  • Driver & CUDA Version: NVIDIA-SMI 545.36, Driver Version: 546.33, CUDA Version: 12.3
  • Error Messages and Logs: See Reproduce part.
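
Editor's note: the tracebacks above show the sample rate reaching librosa.load as a string, which suggests run.sh forwards --fs without converting it to an integer. Below is a minimal sketch of the kind of fix, assuming bins/calc_metrics.py parses its flags with argparse; the repository's exact argument handling is an assumption here.

import argparse

parser = argparse.ArgumentParser()
# Coerce --fs to int and give it a real default, so the flag is truly optional
# and "--fs 24000" no longer reaches librosa as the string "24000".
parser.add_argument("--fs", type=int, default=None)
args = parser.parse_args()

# Passing sr=None lets librosa.load keep each file's native sample rate.
fs = args.fs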

An issue with the preprocessing part of LibriTTS.

Hello!🤗 When I'm working with the LibriTTS dataset, I noticed that the generated test.json and train.json files do not include the corresponding text for each audio sample. This absence of text information causes an error when trying to extract the phonemes later on. Additionally, I couldn't find any code related to text processing in the preprocessors/libritts.py file.


Thanks for the team's hard work and contributions!🎉

Monotonic align not found. Please make sure you have compiled it.

Hello, I would like to run fine-tuning training for the model, but there was an error. I have already completed the preprocessing with this command.
egs/svc/VitsSVC/run.sh --stage 1

Afterward, I intend to run fine-tuning with the following command.
According to the README, I have also uploaded the pretrained 400000.pt to the server and specified the absolute path of the model in my command as follows:

sh egs/svc/VitsSVC/run.sh --stage 2 --name tingting \
    --resume true \
    --resume_from_ckpt_path "/root/Amphion/pretrained/bigvgan/400000.pt" \
    --resume_type "finetune"

But in the command line output, the following error appeared first.
Monotonic align not found. Please make sure you have compiled it.
After that, a series of error messages occurred. How should I resolve this? Thank you.

TTA dataset and update readme

The AudioCaps dataset is missing. A public download link is needed, and the corresponding README needs to be updated.

How to retrain a model?

I have trained a VITS model, but whenever I run the training process again, it starts from epoch 0 instead of continuing from the last epoch. Do you have any solutions for this issue?

[BUG]: NaturalSpeech2 training issue

Describe the bug

Thank you so much for sharing this wonderful project. However, I have some problems with NaturalSpeech2 (NS2) TTS training.
./egs/tts/NaturalSpeech2/README.md suggests following other Amphion TTS recipes for the data processing. But after I extracted the features needed by NS2 using the FastSpeech2 and VALL-E data preprocessing scripts, I found that I could not run the NS2 training script successfully. In ./models/tts/naturalspeech2/ns2_dataset.py, some of the features seem to be obtained by referring to "phones" and "num_frames" in the metadata, which are NOT included in the train.txt file.
Is there anything else I can do to run NS2 training successfully? Or should I just wait for the official update of the NS2 preprocessing, as mentioned in another issue?
Can any of the authors tell me when the preprocessing script will be ready? Looking forward to your reply.

VALL-E (model_train_stage 2) error output "No such file pytorch_model.bin"

Hi,

When I train VALL-E, I successfully trained the AR model. However, when I tried to train the NAR model with "sh egs/tts/VALLE/run.sh --stage 2 --model_train_stage 2 --ar_model_ckpt_dir [ARModelPath] --name [YourExptName]", it output "No such file pytorch_model.bin" in ARModelPath. In ARModelPath, I find that only "ckpts.json model.safetensors optimizer.bin random_state_0.pkl scheduler.bin" exist; there is no "pytorch_model.bin".
I tried to debug the code but cannot find the code used to save "pytorch_model.bin".

Please help check it. Thanks in advance.

Does amphion support multi-GPU training now?

Thank you for the great work!
I want to train a model with more data, but I don't know whether Amphion supports multi-GPU training now.
If not, will it be supported in the future?

Where is the hifigan vocoder of TTA released?

Here is the bug report:

  File "D:\github\Amphion\models\tta\ldm\audioldm_inference.py", line 42, in __init__
    self.build_vocoder()
  File "D:\github\Amphion\models\tta\ldm\audioldm_inference.py", line 68, in build_vocoder
    with open(config_file) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'D:/github/Amphion/ckpts/tta/hifigan_checkpoints/config.json'

Phoneme extraction very slow

Hi I am new to Amphion and am just starting to try the VITS recipe.

I am doing the preprocessing stage by simply running:

bash egs/tts/VITS/run.sh --stage 1

However, the operation seems very slow, considering that LJSpeech is not a very big dataset, with only 13k audio clips. The weirdest part is the phoneme extraction; on my server it is as slow as this:

Extracting content features for LJSpeech...
Extracting phoneme sequence for LJSpeech...
 30%|█████████████████▋                                        | 3987/13100 [49:53<2:03:53,  1.23it/s]

I checked the code and it does not seem to be doing anything complicated: it just gets the phonemes from either the lexicon or g2p_en and then saves them as an independent phoneme file. So I wonder whether this speed is normal; how fast does it run in your local environment? If it is also very slow at your end, would you consider making it run concurrently?

phone_extractor.py

    for utt in tqdm(metadata):
        uid = utt["Uid"]
        text = utt["Text"]

        # One g2p call per utterance; this is the loop that runs at ~1.2 it/s above.
        phone_seq = phone_extractor.extract_phone(text)

        # Each utterance's phoneme sequence is written to its own .phone file.
        phone_path = os.path.join(out_path, uid + ".phone")
        with open(phone_path, "w") as fin:
            fin.write(" ".join(phone_seq))

g2p_module.py

    def preprocess_english(self, text):
        text = text.rstrip(punctuation)

        # A new G2p() model is constructed on every call, i.e. once per utterance;
        # this instantiation is far more expensive than the lookups below and is
        # the most likely cause of the slowness.
        g2p = G2p()
        phones = []
        words = re.split(r"([,;.\-\?\!\s+])", text)
        for w in words:
            if w.lower() in self.lexicon:
                phones += self.lexicon[w.lower()]
            else:
                phones += list(filter(lambda p: p != " ", g2p(w)))
        phones = "}{".join(phones)
        phones = re.sub(r"\{[^\w\s]?\}", "{sp}", phones)
        phones = phones.replace("}{", " ")

        return phones
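
Editor's sketch of two easy speed-ups, not Amphion's code: construct G2p() once instead of once per utterance, and spread the remaining per-utterance work over a process pool. The metadata fields are assumed to match the snippet above, and the lexicon lookup is omitted for brevity.

from multiprocessing import Pool

from g2p_en import G2p
from tqdm import tqdm

_g2p = G2p()  # loaded once per process instead of once per utterance

def text_to_phones(utt):
    # Keep every non-space phone that g2p_en produces for this utterance's text.
    phones = [p for p in _g2p(utt["Text"]) if p != " "]
    return utt["Uid"], " ".join(phones)

if __name__ == "__main__":
    # Stand-in for the list of {"Uid", "Text"} dicts the recipe iterates over.
    metadata = [{"Uid": "LJ001-0001", "Text": "in being comparatively modern."}]
    with Pool(processes=4) as pool:
        for uid, phone_seq in tqdm(pool.imap_unordered(text_to_phones, metadata)):
            with open(uid + ".phone", "w") as fout:
                fout.write(phone_seq)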

[BUG]: Unable to run stage 1 with FastSpeech2

Describe the bug

I followed the tutorial for the FastSpeech2 example recipe and could not get past the first stage.
This problem also occurs on my Windows laptop.

How To Reproduce

Steps to reproduce the behavior:

  1. Config/File changes: Only the local path of the dataset
  2. Run command: sh egs/tts/FastSpeech2/run.sh --stage 1

Expected behavior

Data Preparation failed and was interrupted.

Screenshots

See error:
(Amphion) harrywang@Harrys-MacBook-Air Amphion % sh egs/tts/FastSpeech2/run.sh --stage 1
/Users/harrywang/Amphion/mfa
Exprimental Configuration File: /Users/harrywang/Amphion/egs/tts/FastSpeech2/exp_config.json
Preprocess LJSpeech...
Prepare alignment LJSpeech...
0it [00:00, ?it/s]
Traceback (most recent call last):
File "/Users/harrywang/Amphion/bins/tts/preprocess.py", line 244, in
main()
File "/Users/harrywang/Amphion/bins/tts/preprocess.py", line 240, in main
preprocess(cfg, args)
File "/Users/harrywang/Amphion/bins/tts/preprocess.py", line 112, in preprocess
prepare_align(
File "/Users/harrywang/Amphion/preprocessors/processor.py", line 104, in prepare_align
ljspeech.prepare_align(dataset, dataset_path, cfg, output_path)
File "/Users/harrywang/Amphion/preprocessors/ljspeech.py", line 139, in prepare_align
wav, _ = librosa.load(wav_path, sampling_rate)
TypeError: load() takes 1 positional argument but 2 were given

Environment Information

  • Operating System: MacOS 14.2.1 (problem also occur on Windows 11)
  • Python Version: Python 3.9.15
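
Editor's note, not an official patch: in recent librosa releases, load() takes only the path positionally, so the call at preprocessors/ljspeech.py line 139 needs the sample rate as a keyword argument. A small sketch of the likely fix, with a hypothetical file path:

import librosa

sampling_rate = 22050  # whatever the recipe's exp_config.json specifies
# librosa >= 0.10 makes every argument after `path` keyword-only.
wav, _ = librosa.load("LJ001-0001.wav", sr=sampling_rate)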

Feature Alignment in SVC dataset

I am trying to use the latent feature from Encodec as the condition for the SVC diffusion network. However, I encountered some problems when aligning the length of the Encodec feature sequence to the length of the Mel spectrogram. Specifically, I tried to call the offline_align() function in __getitem__() of SVCDataset, but I am not sure how to calculate source_hop:

source_hop = (
    self.cfg.preprocess.whisper_frameshift
    * self.cfg.preprocess.whisper_downsample_rate
    * self.cfg.preprocess.sample_rate
)

So my questions are:

  1. Where do source_hop and target_hop come from? I am not sure whether neural codecs like Encodec or SpeechTokenizer have a "frameshift". How should I calculate source_hop in this case? (A small sketch follows this list.)
  2. It is said that the frameshifts of the content features and the Mel spectrogram should not differ much. Considering this, is it still reasonable to use Encodec features as the condition? (The strides in the Encodec encoder are [2, 4, 5, 8], so I suppose the downsample rate is 320.)
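
Editor's sketch of one way to reason about this (an assumption, not the maintainers' answer): a frame-based codec does have an effective frameshift, namely the product of its encoder strides divided by its sample rate, so source_hop can be derived the same way as for the Whisper features above. The Mel hop size and the alignment helper below are illustrative placeholders.

import numpy as np

# EnCodec (24 kHz) encoder strides and the resulting hop in samples.
encodec_strides = [2, 4, 5, 8]
encodec_hop = int(np.prod(encodec_strides))   # 320 samples per codec frame
encodec_frameshift = encodec_hop / 24000      # ~13.3 ms, i.e. 75 frames per second

# By analogy with the Whisper branch of offline_align(), both hops are in samples.
source_hop = encodec_hop                      # codec feature hop
target_hop = 256                              # hypothetical Mel hop_size from the config

def nearest_frame_align(feature, source_hop, target_hop, target_len):
    """Map a [T_src, D] codec feature onto target_len Mel frames by nearest frame."""
    idx = (np.arange(target_len) * target_hop) // source_hop
    return feature[np.minimum(idx, len(feature) - 1)]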

ImportError: cannot import name 'VariableSampler' from 'models.base.base_sampler'

When I run "sh egs/tts/VITS/run.sh --stage 2 --name ljs_base"
I had an issue with the following:

"""
Traceback (most recent call last):
File "/home/zhangy33/data1/lz/tts/Amphion/bins/tts/train.py", line 13, in
from models.tts.naturalspeech2.ns2_trainer import NS2Trainer
File "/home/zhangy33/data1/lz/tts/Amphion/models/tts/naturalspeech2/ns2_trainer.py", line 16, in
from models.base.base_sampler import VariableSampler
ImportError: cannot import name 'VariableSampler' from 'models.base.base_sampler'
"""

I couldn't find "VariableSampler" when I checked models/base/base_sampler.py.

Can you help me with this problem?

[BUG]: Fix for AssertionError When Running FastSpeech2 Preprocessing (run.sh --stage 1)

Describe the bug

When running the preprocessing stage of FastSpeech2 within the Amphion project, an AssertionError was encountered, stating that the Montreal Forced Aligner (MFA) tools were not found at the expected path.

How To Reproduce

Steps to reproduce the behavior:

  1. Config/File changes: cd Amphion , I run this command in the Amphion root path.
  2. Run command: sh egs/tts/FastSpeech2/run.sh --stage 1 to initiate the preprocessing stage.
  3. See error: 'AssertionError: Please download the MFA tools to Amphion/mfa/montreal-forced-aligner/bin/mfa_align firstly.'

Expected behavior

The expected behavior was that the preprocessing stage would complete without errors, provided that all the necessary tools and dependencies were correctly installed and configured.

Additional context

Upon investigating the issue, it was discovered that the AssertionError was due to a file path issue in Amphion/preprocessors/ljspeech.py at line 41, which in turn was caused by the result of os.path.exists(lexicon) returning False at line 39. The problem was traced back to line 28, where the lexicon path should be modified to
lexicon=os.path.join("text", "lexicon", "librispeech-lexicon.txt").
After making this change, the preprocessing stage ran successfully.

Data preparation for TTA example

Hi, thanks for your nice work.

Could you provide the script for extracting the acoustic feature for the TTA task?

The beginner recipe said there are four stages for this task.

  • Data preparation
  • Train VAE
  • Train LDM model
  • Inference

I can find scripts for three of these stages, but not the one for data preparation.

OSError: Model file not found: pretrained/contentvec/checkpoint_best_legacy_500.pt

Hello, I've encountered another strange issue, again while using VITS SVC.
I ran the following command:
./run.sh --stage 1
I've already placed the required model file at this path:
"/root/Amphion/pretrained/contentvec/checkpoint_best_legacy_500.pt"

Here is the error report:
Traceback (most recent call last):
File "/root/Amphion/bins/svc/preprocess.py", line 183, in <module>
main()
File "/root/Amphion/bins/svc/preprocess.py", line 179, in main
preprocess(cfg, args)
File "/root/Amphion/bins/svc/preprocess.py", line 165, in preprocess
extract_content_features(dataset, output_path, cfg, args.num_workers)
File "/root/Amphion/bins/svc/preprocess.py", line 64, in extract_content_features
content_extractor.extract_utt_content_features_dataloader(
File "/root/Amphion/processors/content_extractor.py", line 488, in extract_utt_content_features_dataloader
extractor.load_model()
File "/root/Amphion/processors/content_extractor.py", line 247, in load_model
models, saved_cfg, task = checkpoint_utils.load_model_ensemble_and_task(
File "/root/miniconda3/envs/amphion/lib/python3.9/site-packages/fairseq/checkpoint_utils.py", line 423, in load_model_ensemble_and_task
raise IOError("Model file not found: {}".format(filename))
OSError: Model file not found: pretrained/contentvec/checkpoint_best_legacy_500.pt

[Help]: about VALLE training

After training VALL-E on LibriTTS, we noticed a drop in the training loss and a slight increase in the validation loss. Also, when we trained the AR and NAR models for 20 epochs each, the synthesized speech quality wasn't great, and unfortunately, even after 100 epochs of training, the results were still disappointing. Did the team encounter an increase in validation loss during training? I only made use of the lexicon, without making any other changes.

Screenshots of the TensorBoard curves, the training INFO log, and ckpts.json were attached (images not shown here).

[Help]: Format of CustomSVCDataset

Problem Overview

I adjusted my dataset according to the CustomSVCDataset format requirements, but when I executed sh egs/svc/MultipleContentsSVC/run.sh --stage 1, I got an error. The error comes from this line in the metadata.py file: utterances = sorted(utterances, key=lambda x: x["Duration"]). So the dataset must have a "Duration" field; is there anything else needed to customize the dataset? (See the sketch below.)

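Editor's sketch, not the official custom-dataset specification: each utterance entry in the generated metadata needs at least a "Duration" field, which can be filled with librosa. The field names other than "Duration" are assumptions modeled on the other preprocessors.

import json

import librosa

utt = {
    "Dataset": "tingting",  # assumed field names and values
    "Singer": "tingting",
    "Uid": "298",
    "Path": "/root/Amphion/egs/datasets/tingting/tingting/298.wav",
}
# librosa >= 0.10 uses path=; older releases use filename= instead.
utt["Duration"] = librosa.get_duration(path=utt["Path"])
print(json.dumps(utt, indent=2))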

RuntimeError: Placeholder storage has not been allocated on MPS device!

I tried to run the script on my macOS (M2) as follows:
CUDA_VISIBLE_DEVICES='mps:0' accelerate launch bins/tts/inference.py \
--config "ckpts/tts/valle_libritts/args.json" \
--log_level debug \
--acoustics_dir ckpts/tts/valle_libritts \
--output_dir ckpts/tts/valle_libritts/result \
--mode "single" \
--text "his is a clip of generated speech with the given text from Amphion Vall-E mode" \
--text_prompt "many animals of even complex structure which live parasitically within others are wholly devoid of an alimentary cavity" \
--audio_prompt ckpts/tts/valle_libritts/prompt/LJ025-0076.wav \
--test_list_file None

but I got error messages (the attached screenshot did not finish uploading).

CosineAnnealingLR

The params of the CosineAnnealingLR scheduler in valle_trainer.py seem different from the PyTorch docs.

code:

            scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
                self.cfg.train.warmup_steps,
                self.optimizer,
                eta_min=self.cfg.train.base_lr,
            )

pytorch 2.0 Docs:
torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max, eta_min=0, last_epoch=- 1, verbose=False)

Should T_max be set to warmup_steps, or does it need another special setting?
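
For reference, the documented signature quoted above puts the optimizer first. Below is an editor's standalone sketch of the call with the arguments in that order; whether T_max should equal warmup_steps or the total number of training steps is a question for the authors, and the values here are placeholders.

import torch

model = torch.nn.Linear(10, 10)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer,        # the optimizer is the first positional argument
    T_max=10000,      # annealing horizon in steps (placeholder value)
    eta_min=1e-5,     # floor learning rate (valle_trainer.py uses cfg.train.base_lr here)
)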

Does it support multiple languages?

Hi, thanks for your nice work.

I have a 20k-hour Turkish dataset; can I train a non-English model with this repository?
I see that you use a g2p module, and it is only for English.

Thanks,

Are there plans to create scripts or .py files that can run on Windows in the future?

Hello, I was wondering if there are any plans to create .py files that can run on Windows?
Currently, all the scripts in the project are .sh files, which cannot be executed on Windows.
I'm also considering whether to spend time converting the .sh scripts into .py files. If someone is already working on this, I'll just wait for it.
Thank you!

ImportError: cannot import name 'quote' from 'urllib' (/root/miniconda3/lib/python3.8/urllib/__init__.py)

Hello, I encountered an error while using run.sh in VITS SVC. How can I resolve this?
Thanks!

root@autodl-container-06e841a5b8-eff3dbef:~/Amphion/egs/svc/VitsSVC# ./run.sh --stage 1
Exprimental Configuration File: /root/Amphion/egs/svc/VitsSVC/exp_config.json
2023-12-31 10:45:06 | INFO | fairseq.tasks.text_to_speech | Please install tensorboardX: pip install tensorboardX
Traceback (most recent call last):
File "/root/Amphion/bins/svc/preprocess.py", line 19, in <module>
from processors import acoustic_extractor, content_extractor, data_augment
File "/root/Amphion/processors/data_augment.py", line 12, in <module>
import parselmouth
File "/root/miniconda3/lib/python3.8/site-packages/parselmouth/__init__.py", line 22, in <module>
from parselmouth.base import Parselmouth
File "/root/miniconda3/lib/python3.8/site-packages/parselmouth/base.py", line 30, in <module>
from parselmouth.adapters.dfp.interface import DFPInterface
File "/root/miniconda3/lib/python3.8/site-packages/parselmouth/adapters/dfp/interface.py", line 17, in <module>
from urllib import quote
ImportError: cannot import name 'quote' from 'urllib' (/root/miniconda3/lib/python3.8/urllib/__init__.py)
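
Editor's note, inferred from the traceback rather than an official answer: the adapters/dfp path above belongs to the unrelated "parselmouth" ad-operations package on PyPI, while Amphion's data_augment.py expects praat-parselmouth, which installs under the same import name. A quick sanity check, assuming the wav path is replaced with a real file:

# pip uninstall parselmouth && pip install praat-parselmouth
import parselmouth

snd = parselmouth.Sound("example.wav")  # hypothetical path to any wav file
pitch = snd.to_pitch()                  # Praat functionality that only praat-parselmouth provides
print(pitch)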

When I run "sh egs/svc/MultipleContentsSVC/run.sh --stage 1", it doesn't work! Where can I get '[Opencpop dataset path]/segments/train.txt'?

Exprimental Configuration File: /home/ayit/Downloads/Amphion/egs/svc/MultipleContentsSVC/exp_config.json
Preprocess m4singer...

Preparing test samples for m4singer...

M4Singer: 20 singers, 20896 utterances (419 unique songs)
Singers:
Alto-1 Alto-2 Alto-3 Alto-4 Alto-5 Alto-6 Alto-7 Bass-1 Bass-2 Bass-3 Soprano-1 Soprano-2 Soprano-3 Tenor-1 Tenor-2 Tenor-3 Tenor-4 Tenor-5 Tenor-6 Tenor-7
0%| | 0/20 [00:00<?, ?it/s]/home/ayit/Downloads/Amphion/preprocessors/m4singer.py:106: FutureWarning: get_duration() keyword argument 'filename' has been renamed to 'path' in version 0.10.0.
This alias will be removed in version 1.0.
duration = librosa.get_duration(filename=res["Path"])
100%|█████████████████████████████████████████| 20/20 [00:05<00:00, 3.93it/s]
#Train = 20739, #Test = 122
#Train hours= 29.48246345867975, #Test hours= 0.16981873925264554
Preprocess opencpop...

Dataset splits for opencpop...

Traceback (most recent call last):
File "/home/ayit/Downloads/Amphion/bins/svc/preprocess.py", line 182, in
main()
File "/home/ayit/Downloads/Amphion/bins/svc/preprocess.py", line 178, in main
preprocess(cfg, args)
File "/home/ayit/Downloads/Amphion/bins/svc/preprocess.py", line 83, in preprocess
preprocess_dataset(
File "/home/ayit/Downloads/Amphion/preprocessors/processor.py", line 48, in preprocess_dataset
opencpop.main(dataset, output_path, dataset_path)
File "/home/ayit/Downloads/Amphion/preprocessors/opencpop.py", line 66, in main
res, hours = get_uid2utt(opencpop_path, dataset, dataset_type)
File "/home/ayit/Downloads/Amphion/preprocessors/opencpop.py", line 26, in get_uid2utt
lines = get_lines(file)
File "/home/ayit/Downloads/Amphion/preprocessors/opencpop.py", line 15, in get_lines
with open(file, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: '[Opencpop dataset path]/segments/train.txt'

During preprocessing, my custom dataset could not be found.

Hello, I've encountered a problem again. It might be a simple mistake, but I can't find a way to solve it.
When running ./run.sh --stage 1, it couldn't find the files in my dataset.
I've already checked the README, but I still can't resolve this error. I need more guidance, thank you!

My data is placed here →
/root/Amphion/egs/datasets/
My wav file is here →
/root/Amphion/egs/datasets/tingting/tingting/298.wav
The beginning part of my exp_config.json file→
{
"base_config": "config/vitssvc.json",
"model_type": "VitsSVC",
"dataset": [
"tingting"
],
"dataset_path": {
// TODO: Fill in your dataset path
"tingting": "/root/Amphion/egs/datasets/"
},

The following is the content of the error report→
(amphion) root@autodl-container-25b911bc3c-59272fc1:~/Amphion/egs/svc/VitsSVC# ./run.sh --stage 1
Exprimental Configuration File: /root/Amphion/egs/svc/VitsSVC/exp_config.json
Preprocess tingting...
No Data Augmentation.

Preparing metadata...
Including:
tingting
0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/root/Amphion/bins/svc/preprocess.py", line 183, in
main()
File "/root/Amphion/bins/svc/preprocess.py", line 179, in main
preprocess(cfg, args)
File "/root/Amphion/bins/svc/preprocess.py", line 108, in preprocess
cal_metadata(cfg)
File "/root/Amphion/preprocessors/metadata.py", line 27, in cal_metadata
assert os.path.exists(save_dir)
AssertionError

FileNotFoundError: [Errno 2] No such file or directory: '[M4Singer dataset path]/meta.json' when running sh egs/svc/MultipleContentsSVC/run.sh --stage 1

This is my terminal:
(amphion) ayit@aiexplorer:~/Downloads/Amphion$ sh egs/svc/MultipleContentsSVC/run.sh --stage 1
Exprimental Configuration File: /home/ayit/Downloads/Amphion/egs/svc/MultipleContentsSVC/exp_config.json
Preprocess m4singer...

Preparing test samples for m4singer...

Traceback (most recent call last):
File "/home/ayit/Downloads/Amphion/bins/svc/preprocess.py", line 182, in
main()
File "/home/ayit/Downloads/Amphion/bins/svc/preprocess.py", line 178, in main
preprocess(cfg, args)
File "/home/ayit/Downloads/Amphion/bins/svc/preprocess.py", line 83, in preprocess
preprocess_dataset(
File "/home/ayit/Downloads/Amphion/preprocessors/processor.py", line 50, in preprocess_dataset
m4singer.main(output_path, dataset_path)
File "/home/ayit/Downloads/Amphion/preprocessors/m4singer.py", line 71, in main
with open(meta_file, "r", encoding="utf-8") as f:
FileNotFoundError: [Errno 2] No such file or directory: '[M4Singer dataset path]/meta.json'

[Error] Please specify the running stage

When I run the command below (following the guide: https://huggingface.co/amphion/valle_libritts):
'''
sh egs/tts/VALLE/run.sh --stage 3 --gpu "0"
--config "ckpts/tts/valle_libritts/args.json"
--infer_expt_dir Amphion/ckpts/tts/valle_libritts
--infer_output_dir Amphion/ckpts/tts/valle_libritts/result
--infer_mode "single"
--infer_text "This is a clip of generated speech with the given text from Amphion Vall-E model."
--infer_text_prompt "But even the unsuccessful dramatist has his moments."
'''
I got the following error message:
[Error] Please specify the running stage

Unable to run training script of Natural Speech 2

Hi,

I ran into multiple issues trying to run the training script:
In ns2_dataset.py:

  • self.utt2phone[utt] = utt_info["phones"]: where do "phones" come from? I suspect we need to run the phonemizer first, but I don't see extract_phone=True in the config file.
  • utt_info["num_frames"] is utt_info["Duration"], right?

In exp_config_base.json:

  • Given use_code=true, use_pitch=true, and use_phone, should extract_acoustic_token=true, extract_pitch=true, and extract_phone=true also be set?
  • There seems to be some mismatch between tts/preprocessing.py and the config file. For example: code_dir should be acoustic_token_dir?
