
Real-Time Target Sound Extraction

This repository provides code for the Waveformer architecture proposed in the paper. Waveformer is a low-latency target sound extraction model implementing streaming inference -- the model processes a ~10 ms input audio chunk at each time step, while only looking at past chunks and no future chunks. On a Core i5 CPU using a single thread, real-time factors (RTFs) of different model configurations range from 0.66 to 0.94, with an end-to-end latency of less than 20 ms.
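
The streaming pattern can be pictured with the minimal sketch below. This is not the repository's actual inference API: StreamingModel is a hypothetical causal stand-in that caches past samples between ~10 ms chunks, so each step uses only past context.

import torch

class StreamingModel(torch.nn.Module):
    """Hypothetical causal model: one conv layer with cached left context."""
    def __init__(self, kernel=9):
        super().__init__()
        self.kernel = kernel
        self.conv = torch.nn.Conv1d(1, 1, kernel)

    def forward(self, chunk, state=None):
        if state is None:
            state = torch.zeros(chunk.shape[0], 1, self.kernel - 1)
        x = torch.cat([state, chunk], dim=-1)           # prepend cached past samples
        return self.conv(x), x[..., -(self.kernel - 1):]  # output, updated cache

sr, chunk_len = 44100, 441                 # ~10 ms chunks at 44.1 kHz
mixture = torch.randn(1, 1, sr)            # 1 s of dummy audio
model, state = StreamingModel(), None
for t in range(0, mixture.shape[-1], chunk_len):
    with torch.no_grad():
        out, state = model(mixture[..., t:t + chunk_len], state)  # past-only context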

Gradio demo video: Gradio-Demo.mp4

Setup

# Commands in all sections except the Dataset section are run from the repo's top-level directory
conda create --name waveformer python=3.8
conda activate waveformer
pip install -r requirements.txt

Bring Your Own Audio

You can run the model on your own audio files using the Waveformer.py script. The example commands below use the sample audio mixture provided at data/Sample.wav. If running for the first time, the script downloads the default configuration file and checkpoint to the current directory.

# Usage: python Waveformer.py [-h] [--targets TARGETS [TARGETS ...]] input output

# Single-target extraction
python Waveformer.py data/Sample.wav output_typing.wav --targets Computer_keyboard

# Multi-target extraction
python Waveformer.py data/Sample.wav output_bark_cough.wav --targets Bark Cough

A list of all possible targets can be found using:

python Waveformer.py -h
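
For batch processing, the CLI above can be driven from Python. A minimal sketch, where the clip list is illustrative:

import subprocess

# Extract keyboard sounds from a list of mixtures using the documented CLI.
clips = ["data/Sample.wav"]  # replace with your own files
for clip in clips:
    output = clip.replace(".wav", "_typing.wav")
    subprocess.run(
        ["python", "Waveformer.py", clip, output, "--targets", "Computer_keyboard"],
        check=True,
    )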

Training and Evaluation

Dataset

We use the Scaper toolkit to synthetically generate audio mixtures. Each audio mixture is generated on the fly, during training or evaluation, using Scaper's generate_from_jams function on a .jams specification file (a sketch of this call appears after the steps below). We provide (in step 3 below) .jams specification files for all training, validation, and evaluation samples used in our experiments. The .jams specifications are generated using the FSDKaggle2018 and TAU Urban Acoustic Scenes 2019 datasets as sources for foreground and background sounds, respectively. Steps to create the dataset:

  1. Go to the data directory:

     cd data
    
  2. Download the FSDKaggle2018, TAU Urban Acoustic Scenes 2019 Development, and TAU Urban Acoustic Scenes 2019 Evaluation datasets using the data/download.py script:

     python download.py
    
  3. Download and uncompress FSDSoundScapes dataset:

     wget https://targetsound.cs.washington.edu/files/FSDSoundScapes.zip
     unzip FSDSoundScapes.zip
    

    This step creates the data/FSDSoundScapes directory. FSDSoundScapes contains the .jams specifications for the training, validation, and test samples used in the paper. The training and evaluation pipelines expect source samples (samples in the FSDKaggle2018 and TAU Urban Acoustic Scenes 2019 datasets) at specific locations relative to FSDSoundScapes. The following steps move the source samples to the appropriate locations.

  4. Uncompress the FSDKaggle2018 dataset and create the Scaper source:

     unzip FSDKaggle2018/\*.zip -d FSDKaggle2018
     python fsd_scaper_source_gen.py FSDKaggle2018 ./FSDSoundScapes/FSDKaggle2018 ./FSDSoundScapes/FSDKaggle2018
    
  5. Uncompress TAU Urban Acoustic Scenes 2019 dataset to FSDSoundScapes directory:

     unzip TAU-acoustic-sounds/\*.zip -d FSDSoundScapes/TAU-acoustic-sounds/
    
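
For reference, a single mixture can be rendered from its .jams spec with Scaper's generate_from_jams. A minimal sketch with illustrative paths (the actual layout under FSDSoundScapes may differ):

import scaper

# Render one mixture to disk from a .jams specification. The spec path is
# hypothetical; point fg_path/bg_path at the directories created above.
scaper.generate_from_jams(
    "FSDSoundScapes/train/00000.jams",             # illustrative spec path
    "mixture.wav",                                 # rendered mixture output
    fg_path="FSDSoundScapes/FSDKaggle2018",
    bg_path="FSDSoundScapes/TAU-acoustic-sounds",
)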

Training

python -W ignore -m src.training.train experiments/<Experiment dir with config.json> --use_cuda

Evaluation

Pretrained checkpoints are available in experiments.zip. These can be downloaded and uncompressed to the appropriate locations using:

wget https://targetsound.cs.washington.edu/files/experiments.zip
unzip -o experiments.zip -d experiments

Run evaluation script:

python -W ignore -m src.training.eval experiments/<Experiment dir with config.json and checkpoints> --use_cuda

Note

During sample generation, when the amplitudes of the sources sum to greater than 1, peak normalization is used to renormalize the mixture. This produces a number of Scaper warnings during training and evaluation; the -W ignore flag is used for cleaner console output.
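
Below is a minimal sketch of peak normalization, under the assumption (ours, not the paper's exact formula) that the mixture is a floating-point array with nominal range [-1, 1]:

import numpy as np

def peak_normalize(mixture: np.ndarray) -> np.ndarray:
    # Rescale only when the summed sources exceed the [-1, 1] range.
    peak = np.abs(mixture).max()
    return mixture / peak if peak > 1.0 else mixture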
