
encodec-pytorch

This is an unofficial implementation of the paper High Fidelity Neural Audio Compression in PyTorch.

The LibriTTS 960h 24 kHz EnCodec checkpoint and the discriminator checkpoint are released at https://huggingface.co/zkniu/encodec-pytorch/tree/main

I hope we can get together to do something meaningful and rebuild encodec in this repo.

Introduction

This repository is based on encodec and EnCodec_Trainer.

Based on the EnCodec_Trainer, I have made the following changes:

  • support multi-gpu training.
  • support AMP training (you need to reduce the learning rate and scale the VQ epsilon from 1e-5 to 1e-3; see issue 8 for the reason)
  • support hydra configuration management.
  • align the loss functions and hyperparameters.
  • support warmup scheduler in training.
  • provide a test script to evaluate the model.
  • support tensorboard to monitor the training process.

TODO:

  • support the 48khz model.

Environments

The code is tested on the following environment:

  • Python 3.9
  • PyTorch 2.0.0 / PyTorch 1.13
  • GeForce RTX 3090 x 4 / V100-16G x 8 / A40 x 3

To run the code, install the dependencies listed in requirements.txt.
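For example, with pip:

pip install -r requirements.txt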

Usage

Training

1. Prepare dataset

I use LibriSpeech as the training dataset and use datasets/generate_train_file.py to generate the train CSV used in the training process. You can check datasets/generate_train_file.py and customAudioDataset.py to understand how to prepare your own dataset. You can also use ln -s to link your dataset into the datasets folder, as shown below.
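For example (the corpus path below is hypothetical, and the script's arguments may differ; check generate_train_file.py itself):

# link your corpus into the datasets folder
ln -s /path/to/LibriSpeech datasets/LibriSpeech
# generate the train csv consumed by the training script
python datasets/generate_train_file.py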

[Optional] Docker image

I provide a Dockerfile to build a Docker image with all the necessary dependencies.

  1. Build the image
docker build -t encodec:v1 .
  2. Use the image
# CPU running
docker run encodec:v1 <command> # you can add extra flags, such as -tid
# GPU running
docker run --gpus=all encodec:v1 <command>

2. Train

You can use the following command to train the model on multiple GPUs:

CUDA_VISIBLE_DEVICES=0,1,2,3 python train_multi_gpu.py \
                        distributed.torch_distributed_debug=False \
                        distributed.find_unused_parameters=True \
                        distributed.world_size=4 \
                        common.save_interval=2 \
                        common.test_interval=2 \
                        common.max_epoch=100 \
                        datasets.tensor_cut=100000 \
                        datasets.batch_size=8 \
                        datasets.train_csv_path=YOUR TRAIN DATA.csv \
                        lr_scheduler.warmup_epoch=20 \
                        optimization.lr=5e-5 \
                        optimization.disc_lr=5e-5

Note:

  1. if you set a small datasets.tensor_cut, you can use a larger datasets.batch_size to speed up training.
  2. when training on your own dataset, I suggest choosing audio of moderate length: if you train EnCodec with a 1-second tensor_cut on a small dataset, the model does not perform well.
  3. if you encounter RuntimeError(f"Mismatch in number of params: ours is {len(params)}, at least one worker has a different one."), you can use a smaller datasets.tensor_cut to solve the problem.
  4. if your torch version is lower than 1.8, you need to check the default value of return_complex in the torch.stft call in audio_to_mel.py (a minimal illustration follows this list).
  5. if you encounter a bug in multi-GPU training, you can set distributed.torch_distributed_debug=True to get more information about the problem.
  6. the single-GPU training method is similar to the multi-GPU one; you only need to add distributed.data_parallel=False to the command, like this:
        python train_multi_gpu.py distributed.data_parallel=False \
                            common.save_interval=5 \
                            common.max_epoch=100 \
                            datasets.tensor_cut=72000 \
                            datasets.batch_size=4 \
                            datasets.train_csv_path=YOUR TRAIN DATA.csv \
                            lr_scheduler.warmup_epoch=10 \
                            optimization.lr=5e-5 \
                            optimization.disc_lr=5e-5
  7. the loss does not converge to zero, but the model can still be used to compress and decompress audio. You can use compression.sh to test your model every log_interval epochs.
  8. the original paper's dataset is larger than 17,000 hours, but I only use LibriTTS 960h to train the model, so the model is not good enough. If you want to train a better model, use a larger dataset.
  9. The code is not well tested, so there may be some bugs. If you encounter any problems, you can open an issue or contact me by email.
  10. When I added AMP training, I found the RVQ commit loss always became NaN, so I applied L2 normalization to quantize and x, as in the code below -> actually, it's unstable.
        quantize = F.normalize(quantize)               # L2-normalize the quantized latents
        x = F.normalize(x)                             # L2-normalize the encoder output as well
        commit_loss = F.mse_loss(quantize.detach(), x)
  11. When you use AMP training, you need to reduce the learning rate and scale the VQ epsilon from 1e-5 to 1e-3; see issue 8 for the reason.
  12. I suggest focusing on the generator loss; the commit loss may not converge. You can also check objective metrics such as PESQ and STOI (a sketch follows this list).
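Regarding note 4, here is a minimal illustration of the return_complex behavior (not taken from audio_to_mel.py; the shapes and parameters are only examples):

import torch

x = torch.randn(1, 24000)                     # one second of fake 24 kHz audio
window = torch.hann_window(1024)
# PyTorch < 1.8 defaulted to return_complex=False and returned a real tensor
# with a trailing real/imag dimension; newer versions warn or error unless
# return_complex is set explicitly.
spec = torch.stft(x, n_fft=1024, hop_length=256, window=window,
                  return_complex=True)        # complex tensor of shape [1, 513, frames]
magnitude = spec.abs()                        # same as sqrt(real**2 + imag**2) in the old layout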
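Regarding note 12, here is a minimal sketch of computing PESQ and STOI with the pesq and pystoi packages (these packages are my suggestion, not dependencies of this repo; the file names are hypothetical, and pesq only supports 8 kHz and 16 kHz, so resample 24 kHz audio first):

import soundfile as sf
from pesq import pesq    # pip install pesq
from pystoi import stoi  # pip install pystoi

ref, sr = sf.read("ref_16k.wav")      # reference audio, resampled to 16 kHz
deg, _ = sf.read("decoded_16k.wav")   # EnCodec reconstruction at the same rate
n = min(len(ref), len(deg))           # align lengths before scoring

print("PESQ:", pesq(sr, ref[:n], deg[:n], "wb"))           # wideband mode for 16 kHz
print("STOI:", stoi(ref[:n], deg[:n], sr, extended=False))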

Test

I have added a shell script to compress and decompress audio at different bandwidths; you can use compression.sh to test your model.

The script can be used as follows:

sh compression.sh INPUT_WAV_FILE [MODEL_NAME] [CHECKPOINT]
  • INPUT_WAV_FILE is the wav file you want to test
  • MODEL_NAME is the model name; the default is encodec_24khz, and encodec_48khz, my_encodec, and encodec_bw are also supported
  • CHECKPOINT is the checkpoint path; when MODEL_NAME is my_encodec, you can point it at your own checkpoint
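For example, to test a trained checkpoint (the wav file and checkpoint path below are hypothetical):

sh compression.sh test.wav my_encodec checkpoints/epoch100.pt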

If you want to test the model at a specific bandwidth, you can use the following command:

python main.py -r -b [bandwidth] -f [INPUT_FILE] [OUTPUT_WAV_FILE] -m [MODEL_NAME] -c [CHECKPOINT]

main.py comes from the original encodec repository; you can use -h to check the help information.
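For instance, to reconstruct a file at 6 kbps with your own checkpoint (the file and checkpoint paths are hypothetical):

python main.py -r -b 6 -f input.wav output.wav -m my_encodec -c checkpoints/epoch100.pt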

Acknowledgement

Thanks to the following repositories:

  • encodec
  • EnCodec_Trainer

LICENSE

The license of this code is the same as the original encodec LICENSE (MIT License).
