Coder Social home page Coder Social logo

iva4cocktail's Introduction

Speech density estimation for multichannel convolutive speech separation. I use independent vector analysis (IVA) as the separation framework. The code can also be used for music or other source separations. Please check the report for details.

Unlike the currently popular end-to-end black box speech separation methods, here, how the mixtures are separated is a well formulated optimization problem. Our focus is to learn a fine neural network density model for speech by optimization certain proxy objectives. Our resultant density model can be used in different scenarios, i.e., online or batch separation, different number of sources, artificial or realistic mixtures, etc., while an end-to-end model is typically trained for one very specific scenario.

Speech Density Estimation

You need to correctly set the path parameter for WavLoader in 'main.py' to train the density model.

Set dim_h = 0 to get a feedforward neural network (FNN) density model, and 1 <= dim_h <= dim_f to get a recurrent neural network (RNN) density model. Redefine grad_nll to train a different density model.

Even with a memoryless density model, the whole signal process graph still defines a deep recurrent network due to the way to update the separation matrices. Momentum and first order optimization methods seem perform not very well (might not with the best hyperparameters in my trials). A second order method saves the hyperparameter tuning effort.

Separation Performance Comparisons

I use this code to generate room impulse response (RIR). It is light and fast. But, as the original image source method, it cannot simulate fractional delays. So, I generate the RIR with higher sampling rate, and then decimate to the correct sampling rate.

You need to correctly set the path in 'generate_mixtures.m' to run the performance comparison simulations. Sample FNN and RNN density models prepared as in the report are included here. The neural network density models consistently outperform simple ones like multivariate Laplace and a non-spherical distributions (from Te-Won Lee's group), generalized Gaussian and Student's t- distribution (from Jonathon Chambers' group).

Convergence speed comparison, two sources

alt text

Efficiency comparison 1, 10 s speech length

alt text

Efficiency comparison 2, three sources

Need to halve the step size in iva_batch, otherwise, the t-distribution prior may cause divergence. alt text

Sample separation results for subjective comparison

These are some typical mixtures and separation results of 10 sources with length 10 second. The neural network density models always have better subjective separation performance, even for the first set of sample results, where the signal to interference ratio (SIR) of multivariate Laplace model is 0.3 dB higher than that of neural network one. One reason is that SIR is not very sensitive to errors like the low pass and high pass bands permutations since most speech energy locates in low frequency band, while human ears are picky.

iva4cocktail's People

Contributors

lixilinx avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.