
Speaker Separation

A project on speaker separation, done within the DL in Audio project. This repo contains my implementation of the VoiceFilter model and all the steps needed to reproduce the pipeline.

Model choice

For my implementation of VoiceFilter I used the model architecture presented in the paper VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking.

Results

  • Training took about 25 hours on a single NVIDIA P100 GPU and did not reach the desired SI-SDR. However, the metrics were still trending upward, so the model most likely just needs more training time to reach good quality.

| Metric | Ours |
| --- | --- |
| Median SI-SDR on LibriSpeech dev-clean | 1.138 |
| Median PESQ on LibriSpeech dev-clean | 1.22 |
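
For reference, the SI-SDR metric above (scale-invariant signal-to-distortion ratio) follows the standard definition: the estimate $\hat{s}$ is compared to the target $s$ after projecting out scale,

$$\text{SI-SDR} = 10 \log_{10} \frac{\lVert s_{\text{target}} \rVert^2}{\lVert e_{\text{noise}} \rVert^2}, \qquad s_{\text{target}} = \frac{\langle \hat{s}, s \rangle}{\lVert s \rVert^2}\, s, \qquad e_{\text{noise}} = \hat{s} - s_{\text{target}},$$

so higher is better, and the metric is invariant to rescaling of $\hat{s}$.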

Dependencies

pip install -r requirements.txt

Dataset Generation

As there are no ready-made datasets for this kind of task, I had to create the dataset myself. This dataset solution also includes WHAM noises, which are later used to generate mixtures of two speakers plus additional noise.

Here is the pipeline for how one could reproduce it:

conda install -c conda-forge sox
git clone https://github.com/JorisCos/LibriMix
cd LibriMix 
./generate_librimix.sh storage_dir
mv storage_dir ./your_main_repo

In generate_librimix.sh you should choose only 2 speakers for this exact task.

After generating the dataset, place utils/normalize-resample.sh in the top-level directory containing all of your data to convert it from .flac to .wav:

vim normalize-resample.sh # set "N" as your CPU core number.
chmod a+x normalize-resample.sh
./normalize-resample.sh # this may take long

Then, from the Speaker_Separationg_VoiceFilter repo, run the following:

python3 generator.py -c config/data_convertion.yaml -d storage_dir/LibriSpeech -o wav_data -p 40 -n wav_data

This will output triplets of target.wav, ref.wav, and mixed.wav, which you will use for training.
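
As a quick illustration, here is a minimal sketch of iterating over such triplets in Python (the file-naming pattern below is an assumption; adapt the globs to the layout generator.py actually produces):

```python
import glob
import os

import torchaudio  # assumed available; any wav loader works here


def iter_triplets(root="wav_data"):
    """Yield (target, ref, mixed) waveforms for each generated triplet.

    Assumes files are named <id>-target.wav / <id>-ref.wav / <id>-mixed.wav;
    adjust the patterns to match the actual generator output.
    """
    for target_path in sorted(glob.glob(os.path.join(root, "*-target.wav"))):
        ref_path = target_path.replace("-target.wav", "-ref.wav")
        mixed_path = target_path.replace("-target.wav", "-mixed.wav")
        target, sr = torchaudio.load(target_path)
        ref, _ = torchaudio.load(ref_path)
        mixed, _ = torchaudio.load(mixed_path)
        yield target, ref, mixed, sr
```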

Train VoiceFilter

  1. Get the pretrained model for the speaker recognition system

    VoiceFilter utilizes a speaker recognition system (d-vector embeddings).

    This model was trained on the VoxCeleb2 dataset, with utterances randomly cropped to lengths of [70, 90] frames. Tests were done with window 80 / hop 40 and showed an equal error rate of about 1%. The test data were taken from the first 8 speakers of the VoxCeleb1 test set, with 10 utterances randomly selected per speaker.

    The model can be downloaded from GDrive (see the command below).

Very important, please run:

wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1YFmhmUok-W76JkrfA0fzQt3c-ZsfiwfL' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1YFmhmUok-W76JkrfA0fzQt3c-ZsfiwfL" -O embedder.pt && rm -rf /tmp/cookies.txt
mv ~/embedder.pt Speaker_Separationg_VoiceFilter/
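
For orientation, here is a hypothetical sketch of how a d-vector embedder like this one is typically loaded and applied to a reference utterance (the architecture below is a stand-in; the real network and feature shapes are defined in this repo's model code):

```python
import torch
import torch.nn as nn


class DummySpeechEmbedder(nn.Module):
    """Stand-in for the d-vector network: an LSTM stack that maps a mel
    spectrogram of the reference utterance to one fixed-size embedding."""

    def __init__(self, n_mels=40, emb_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, 768, num_layers=3, batch_first=True)
        self.proj = nn.Linear(768, emb_dim)

    def forward(self, mel):                      # mel: (time, n_mels)
        out, _ = self.lstm(mel.unsqueeze(0))
        return self.proj(out[:, -1]).squeeze(0)  # last frame -> d-vector


embedder = DummySpeechEmbedder()
# In practice: instantiate the real class and load the downloaded weights,
# e.g. embedder.load_state_dict(torch.load("embedder.pt")).
embedder.eval()
with torch.no_grad():
    dvec = embedder(torch.randn(200, 40))        # placeholder mel features
print(dvec.shape)                                # torch.Size([256])
```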
  2. Training process

    Specify your train_dir and test_dir in config.yaml, then run:

    python trainer.py -c [config yaml] -e [path of embedder pt file] -m [name]
    

    In my case it was

    python3 trainer.py -c config/convert.yaml -e embedder.pt -m vf_exp3
    

    This will create chkpt/[name] and logs/[name] in the base directory.

  3. View tensorboardX logs

    tensorboard --logdir ./logs
    

    My training loss curve for the final experiment is in the assets directory.

  4. Resuming from a checkpoint

    python trainer.py -c [config yaml] --checkpoint_path [chkpt/name/chkpt_{step}.pt] -e [path of embedder pt file] -m name
    

To evaluate my results, run the following:

python3 inference.py -c config/convert.yaml -e embedder.pt --checkpoint_path model_best.pt -o results

Evaluation on custom test data

In the -t parameter, set the path to the root directory containing all three data subdirectories:

python3 generate_for_test.py -t data -o res2 -c config/default_test.yaml

python3 inference.py -c config/inference_test.yaml -e embedder.pt --checkpoint_path model_best.pt -o results

Bonus tasks

Added SI-SDR Loss to VoiceFilter

See the loss implementation in the repo; this metric was also added to the evaluation stage.
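
For reference, a minimal generic SI-SDR loss in PyTorch looks like the sketch below (the standard formulation, negated so that minimizing the loss maximizes SI-SDR; not necessarily line-for-line what this repo uses):

```python
import torch


def si_sdr_loss(estimate, target, eps=1e-8):
    """Negative SI-SDR, averaged over the batch. Shapes: (batch, time)."""
    # zero-mean both signals so the measure is offset-invariant
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # project the estimate onto the target (scale-invariant reference)
    dot = torch.sum(estimate * target, dim=-1, keepdim=True)
    energy = torch.sum(target ** 2, dim=-1, keepdim=True) + eps
    s_target = dot / energy * target
    e_noise = estimate - s_target
    si_sdr = 10 * torch.log10(
        torch.sum(s_target ** 2, dim=-1) / (torch.sum(e_noise ** 2, dim=-1) + eps)
    )
    return -si_sdr.mean()
```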

Validation results on WHAM tracks

| Metric | Ours |
| --- | --- |
| Median SI-SDR on LibriSpeech dev-clean | 0.648 |
| Median PESQ on LibriSpeech dev-clean | 1.149 |

I generated a dataset of 1000 triplets from LibriSpeech dev-clean, comparable to the one before, but this time also adding noise to the mixed data. In the updated generator code you can see where the noise is added to the mix. Noise augmentations are taken from the WHAM dataset and randomly sampled into different mixes. One of the output samples is available in the repo.

python3 generator_noisy.py -c config/noise_generation.yaml -d storage_dir/LibriSpeech -o noised_data -p 40 -n noised_data

#evaluation
python3 inference.py -c config/test_noise.yaml -e embedder.pt --checkpoint_path best_model.pt -o results
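
For intuition, the noise mixing boils down to something like the sketch below: take a WHAM noise clip, fit it to the mixture length, and add it at a randomly sampled SNR (the function and the SNR range are illustrative, not the generator's exact code):

```python
import random

import numpy as np


def mix_with_noise(mix, noise, snr_db_range=(0.0, 10.0), eps=1e-8):
    """Add a noise clip to a two-speaker mixture at a random SNR (illustrative)."""
    snr_db = random.uniform(*snr_db_range)
    # tile or crop the noise to the mixture length
    if len(noise) < len(mix):
        noise = np.tile(noise, int(np.ceil(len(mix) / len(noise))))
    noise = noise[: len(mix)]
    # scale the noise so that 10*log10(P_mix / P_noise) equals snr_db
    mix_power = np.mean(mix ** 2)
    noise_power = np.mean(noise ** 2) + eps
    scale = np.sqrt(mix_power / (noise_power * 10 ** (snr_db / 10)))
    return mix + scale * noise
```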

Tried to use different encoders

I looked into different encoders for the reference audio that could replace the one used in this model. I tried to implement a simple speaker embedder based on a ConvGRU model described in another repo; it didn't quite work, but my implementation is available in this repository.
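
For reference, a generic sketch of the kind of Conv + GRU embedder I experimented with is below. Layer sizes are illustrative, and note that a true ConvGRU replaces the GRU's matrix multiplications with convolutions; this simpler stack only approximates the idea:

```python
import torch
import torch.nn as nn


class ConvGRUEmbedder(nn.Module):
    """Illustrative speaker embedder: conv layers extract local spectral
    features, a GRU summarizes them over time into a single embedding."""

    def __init__(self, n_mels=40, emb_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.gru = nn.GRU(128, 256, batch_first=True)
        self.proj = nn.Linear(256, emb_dim)

    def forward(self, mel):                       # mel: (batch, n_mels, time)
        feats = self.conv(mel)                    # (batch, 128, time)
        out, _ = self.gru(feats.transpose(1, 2))  # (batch, time, 256)
        emb = self.proj(out[:, -1])               # last step -> embedding
        return nn.functional.normalize(emb, dim=-1)


dvec = ConvGRUEmbedder()(torch.randn(1, 40, 200))
print(dvec.shape)  # torch.Size([1, 256])
```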

Logs

To keep the README from getting overloaded, all the logs were added to the assets directory.

