💥 💫 2023-09-10 update: Wiener-EM is now implemented for maximum performance!
C++17 implementation of Open-Unmix (UMX), a PyTorch neural network for music demixing.
It uses libnyquist to load audio files, the ggml file format to serialize the PyTorch weights of umxhq and umxl to a binary file format, and Eigen (+ OpenMP) to implement the inference of Open-Unmix.
The float32 weights of UMX are quantized to uint16 during the conversion to the binary ggml format. The on-disk size of umx.cpp's weights files is therefore ~50% of the original weights (216MB vs. 432MB for umxl, 68MB vs. 136MB for umxhq), with near-identical BSS metrics.
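The halving follows from storing 2 bytes per weight instead of 4. The exact float32-to-uint16 mapping is not spelled out here, so the following is only a sketch of one common scheme (linear min/max quantization); the names `QuantizedTensor`, `quantize_u16`, and `dequantize_u16` are hypothetical and not taken from umx.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container: the quantized values plus the float range
// needed to map them back. Not the actual umx.cpp data structure.
struct QuantizedTensor {
    std::vector<std::uint16_t> values;
    float min_val = 0.0f;
    float max_val = 0.0f;
};

// Assumed scheme: map each weight linearly from [min, max] onto [0, 65535].
QuantizedTensor quantize_u16(const std::vector<float>& weights) {
    assert(!weights.empty());
    const auto mm = std::minmax_element(weights.begin(), weights.end());
    QuantizedTensor q;
    q.min_val = *mm.first;
    q.max_val = *mm.second;
    q.values.reserve(weights.size());
    const float range = q.max_val - q.min_val;
    for (float w : weights) {
        // A constant tensor has zero range; store 0 and recover min_val later.
        float unit = range > 0.0f ? (w - q.min_val) / range : 0.0f;
        unit = std::min(1.0f, std::max(0.0f, unit));
        q.values.push_back(
            static_cast<std::uint16_t>(std::lround(unit * 65535.0f)));
    }
    return q;
}

// Inverse mapping; the result differs from the input by at most half a
// quantization step, i.e. (max - min) / 65535 / 2, plus float rounding.
std::vector<float> dequantize_u16(const QuantizedTensor& q) {
    const float range = q.max_val - q.min_val;
    std::vector<float> out;
    out.reserve(q.values.size());
    for (std::uint16_t v : q.values) {
        out.push_back(q.min_val + (static_cast<float>(v) / 65535.0f) * range);
    }
    return out;
}
```

With 65536 levels per tensor, the round-trip error is small relative to the weight range, which is consistent with the separation scores staying close to the float32 model.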
The demixed output wav files (and their SDR scores) of the main program umx.cpp are nearly identical to those of the PyTorch models:
# first, standard pytorch inference
$ python ./scripts/umx_pytorch_inference.py \
--model=umxl \
--dest-dir=./umx-py-xl-out \
"/MUSDB18-HQ/test/Punkdisco - Oral Hygiene"
# then, inference with umx.cpp
$ umx.cpp.main ./ggml-umxl \
"/MUSDB18-HQ/test/Punkdisco - Oral Hygiene" \
./umx-cpp-xl-out
# evaluate both: nearly the same SDR scores
$ python ./scripts/evaluate-demixed-output.py \
--musdb-root="/MUSDB18-HQ" \
./umx-py-xl-out \
'Punkdisco - Oral Hygiene'
vocals ==> SDR: 7.695 SIR: 17.312 ISR: 16.426 SAR: 8.322
drums ==> SDR: 8.899 SIR: 14.054 ISR: 14.941 SAR: 9.428
bass ==> SDR: 8.338 SIR: 14.352 ISR: 14.171 SAR: 10.971
other ==> SDR: 2.017 SIR: 6.266 ISR: 6.821 SAR: 2.410
$ python ./scripts/evaluate-demixed-output.py \
--musdb-root="/MUSDB18-HQ" \
./umx-cpp-xl-out \
'Punkdisco - Oral Hygiene'
vocals ==> SDR: 7.750 SIR: 17.510 ISR: 16.195 SAR: 8.321
drums ==> SDR: 9.010 SIR: 14.149 ISR: 14.900 SAR: 9.416
bass ==> SDR: 8.349 SIR: 14.348 ISR: 14.160 SAR: 10.990
other ==> SDR: 1.987 SIR: 6.282 ISR: 6.674 SAR: 2.461
At runtime, umx.cpp is actually slower than the PyTorch inference (and probably much slower than a possible Torch C++ inference implementation).
During the recent LLM hype (GPT-4, ChatGPT, LLaMA, etc.), ggerganov's projects llama.cpp and the earlier whisper.cpp gained popularity. These projects load the pretrained weights of the underlying neural network (trained using PyTorch) and reimplement the computations needed for inference in C/C++ without using Torch.
In the past, I've worked on a few derivatives of Open-Unmix, an open-source standard model for music demixing. The most recent model, UMX-L, achieved a higher score than UMX-HQ.
I wanted to imitate llama.cpp and whisper.cpp on a neural network that I was familiar with, so I chose UMX. UMX is a small model (136 MB for UMX-HQ, 432 MB for UMX-L), so this is more of a technical curiosity than a necessity.
- Clone the repo
Make sure you clone with submodules:
$ git clone --recurse-submodules https://github.com/sevagh/umx.cpp
- Set up Python
The first step is to create a Python environment (however you like; I'm a fan of mamba) and install the requirements.txt file:
$ mamba create --name umxcpp python=3.10
$ mamba activate umxcpp
$ python -m pip install -r ./scripts/requirements.txt
- Dump Open-Unmix weights to ggml files (use the argument --model=umxl or --model=umxhq to switch between the two best pretrained models):
$ python ./scripts/convert-pth-to-ggml.py --model=umxl ./ggml-umxl
...
Skipping layer bn2.num_batches_tracked
Processing variable: fc3.weight with shape: (4098, 1024)
Processing variable: bn3.weight with shape: (4098,)
Processing variable: bn3.bias with shape: (4098,)
Processing variable: bn3.running_mean with shape: (4098,)
Processing variable: bn3.running_var with shape: (4098,)
Skipping layer bn3.num_batches_tracked
Done. Output file: ggml-umxl/ggml-model-umxl-other-u16.bin
This loads the model using PyTorch Hub (which implicitly downloads the weights files to its hidden cache folder), locates the weights files, and dumps them in the ggml file format:
$ ls -latrh ggml-umxl/
total 216M
drwxrwxr-x 2 sevagh sevagh 4.0K Jun 28 10:14 .
drwxrwxr-x 13 sevagh sevagh 4.0K Jun 30 10:57 ..
-rw-rw-r-- 1 sevagh sevagh 54M Jun 30 11:06 ggml-model-umxl-vocals-u16.bin
-rw-rw-r-- 1 sevagh sevagh 54M Jun 30 11:06 ggml-model-umxl-drums-u16.bin
-rw-rw-r-- 1 sevagh sevagh 54M Jun 30 11:06 ggml-model-umxl-bass-u16.bin
-rw-rw-r-- 1 sevagh sevagh 54M Jun 30 11:06 ggml-model-umxl-other-u16.bin
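To make "dump them using the ggml file format" more concrete, here is a rough sketch of a magic-prefixed binary tensor file being written and read back. The magic value 0x67676d6c ("ggml" in ASCII hex) is the one used by whisper.cpp's converter; whether umx.cpp keeps it, and the per-tensor layout shown (dimension count, name length, shape, name, uint16 data), are assumptions for illustration. `NamedTensor`, `write_model`, `read_model`, and `round_trip` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Assumed magic: "ggml" in ASCII hex, as in whisper.cpp's converter.
constexpr std::uint32_t kMagic = 0x67676d6c;

// Hypothetical in-memory tensor; the real loader fills Eigen matrices.
struct NamedTensor {
    std::string name;
    std::vector<std::int32_t> shape;
    std::vector<std::uint16_t> data;
};

std::size_t element_count(const std::vector<std::int32_t>& shape) {
    std::size_t n = 1;
    for (std::int32_t d : shape) n *= static_cast<std::size_t>(d);
    return n;
}

template <typename T>
void write_pod(std::ostream& out, const T& v) {
    out.write(reinterpret_cast<const char*>(&v), sizeof(T));
}

template <typename T>
T read_pod(std::istream& in) {
    T v{};
    in.read(reinterpret_cast<char*>(&v), sizeof(T));
    if (!in) throw std::runtime_error("unexpected end of file");
    return v;
}

// Layout (assumed): magic, tensor count, then per tensor:
// n_dims, name length, dims, name bytes, raw uint16 data.
void write_model(std::ostream& out, const std::vector<NamedTensor>& tensors) {
    write_pod(out, kMagic);
    write_pod(out, static_cast<std::int32_t>(tensors.size()));
    for (const NamedTensor& t : tensors) {
        assert(t.data.size() == element_count(t.shape));
        write_pod(out, static_cast<std::int32_t>(t.shape.size()));
        write_pod(out, static_cast<std::int32_t>(t.name.size()));
        for (std::int32_t d : t.shape) write_pod(out, d);
        out.write(t.name.data(), static_cast<std::streamsize>(t.name.size()));
        out.write(reinterpret_cast<const char*>(t.data.data()),
                  static_cast<std::streamsize>(t.data.size() * sizeof(std::uint16_t)));
    }
}

std::vector<NamedTensor> read_model(std::istream& in) {
    // "Checking the magic": reject files that do not start with kMagic.
    if (read_pod<std::uint32_t>(in) != kMagic) throw std::runtime_error("bad magic");
    const std::int32_t count = read_pod<std::int32_t>(in);
    std::vector<NamedTensor> tensors;
    for (std::int32_t i = 0; i < count; ++i) {
        NamedTensor t;
        const std::int32_t n_dims = read_pod<std::int32_t>(in);
        const std::int32_t name_len = read_pod<std::int32_t>(in);
        if (n_dims < 0 || name_len < 0) throw std::runtime_error("corrupt header");
        for (std::int32_t d = 0; d < n_dims; ++d) {
            const std::int32_t dim = read_pod<std::int32_t>(in);
            if (dim < 0) throw std::runtime_error("corrupt shape");
            t.shape.push_back(dim);
        }
        t.name.resize(static_cast<std::size_t>(name_len));
        in.read(&t.name[0], name_len);
        t.data.resize(element_count(t.shape));
        in.read(reinterpret_cast<char*>(t.data.data()),
                static_cast<std::streamsize>(t.data.size() * sizeof(std::uint16_t)));
        if (!in) throw std::runtime_error("unexpected end of file");
        tensors.push_back(std::move(t));
    }
    return tensors;
}

// In-memory round trip, handy for checking writer and reader against each other.
std::vector<NamedTensor> round_trip(const std::vector<NamedTensor>& tensors) {
    std::stringstream buffer;
    write_model(buffer, tensors);
    return read_model(buffer);
}
```

Writing the name and shape ahead of the data lets the loader allocate each matrix before reading its values, which matches the "Loading tensor ... with shape ..." lines printed by the main program.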
- Install the C++ dependencies (CMake, gcc/g++, Eigen, OpenMP) for your OS. My instructions are for Pop!_OS 22.04:
$ sudo apt-get install gcc g++ cmake clang-tools libeigen3-dev
- Compile umx.cpp.main with CMake:
$ mkdir -p build && cd build && cmake .. && \
make umx.cpp.main
Note: I have only tested this on my Linux-based computer (Pop!_OS 22.04), and you may need to figure out how to get the dependencies on your own.
- Run umx.cpp.main:
$ ./umx.cpp.main
Usage: ./umx.cpp.main <model dir> <wav file> <out dir>
$ ./umx.cpp.main ./ggml-umxl ./test.wav ./demix-out-umxl
umx.cpp Main driver program
Number of physical cores: 32
Input Samples: 23222488
Length in seconds: 263.294
Number of channels: 2
load_umx_model: loading model
Discovered model file "../ggml-umxl/ggml-model-umxl-other-u16.bin" in model dir../ggml-umxl/
Discovered model file "../ggml-umxl/ggml-model-umxl-drums-u16.bin" in model dir../ggml-umxl/
Discovered model file "../ggml-umxl/ggml-model-umxl-vocals-u16.bin" in model dir../ggml-umxl/
Discovered model file "../ggml-umxl/ggml-model-umxl-bass-u16.bin" in model dir../ggml-umxl/
Checking the magic of model_file ../ggml-umxl/ggml-model-umxl-bass-u16.bin
Checking the magic of model_file ../ggml-umxl/ggml-model-umxl-drums-u16.bin
Checking the magic of model_file ../ggml-umxl/ggml-model-umxl-other-u16.bin
Checking the magic of model_file ../ggml-umxl/ggml-model-umxl-vocals-u16.bin
Loaded umx model with hidden size 1024
Loading weights from model_file ../ggml-umxl/ggml-model-umxl-bass-u16.bin into target 0
Loading tensor input_mean with shape [1487, 1]
input_mean: [ 1487, 1], type = float, 0.01 MB
Loading tensor input_scale with shape [1487, 1]
input_scale: [ 1487, 1], type = float, 0.01 MB
Loading tensor output_scale with shape [2049, 1]
output_scale: [ 2049, 1], type = float, 0.01 MB
Loading tensor output_mean with shape [2049, 1]
output_mean: [ 2049, 1], type = float, 0.01 MB
Loading tensor fc1.weight with shape [2974, 1024]
fc1.weight: [ 2974, 1024], type = float, 11.62 MB
Loading tensor bn1.weight with shape [1024, 1]
bn1.weight: [ 1024, 1], type = float, 0.00 MB
Loading tensor bn1.bias with shape [1024, 1]
bn1.bias: [ 1024, 1], type = float, 0.00 MB
Loading tensor bn1.running_mean with shape [1024, 1]
bn1.running_mean: [ 1024, 1], type = float, 0.00 MB
Loading tensor bn1.running_var with shape [1024, 1]
... <truncated>
Loaded model (172 tensors, 215.68 MB) in 0.609294 s
umx_model_load returned true
Computing STFT
spec shape: (incl 2 chan) 11340 x 2049
Computing STFT magnitude
Computing STFT phase
Running inference with Eigen matrices
Writing wav file "./demix-out-umxl/target_0.wav" to ./demix-out-umxl
Encoder Status: 0
Writing wav file "./demix-out-umxl/target_2.wav" to ./demix-out-umxl
Encoder Status: 0
Writing wav file "./demix-out-umxl/target_1.wav" to ./demix-out-umxl
Encoder Status: 0
Writing wav file "./demix-out-umxl/target_3.wav" to ./demix-out-umxl
Encoder Status: 0
I took the following steps to write umx.cpp:
- Create STFT/iSTFT functions equivalent to the PyTorch stft used by Open-Unmix
The source file dsp.cpp contains STFT/iSTFT functions with center padding and window scaling using Eigen's unsupported/Eigen/FFT (which is more or less the same as KissFFT, adapted to use the C++ standard library types).
The script compare-torch-stft.py uses openunmix.transforms.make_filterbanks to return the same STFT/iSTFT used in UMX inference (and print some values for debugging). In test_dsp.cpp, I printed the same values and iterated until the outputs matched.
- Create supporting functions (load audio waveform, magnitude/phase spectrograms, and getting a complex spectrogram from the polar form)
All can be seen in dsp.cpp.
- Write convert-pth-to-ggml.py, which is borrowed from whisper.cpp.
This rather straightforwardly loads each PyTorch weight tensor and dumps it to a binary file.
- Write model.cpp, which is also borrowed from whisper.cpp, and loads the binary files into Eigen::MatrixXf weight matrices
- Implement the forward inference operations using the weight matrices in umx.cpp, with the more complex LSTM code in lstm.cpp
This was done by reading the PyTorch documentation for each module, writing the equations using Eigen, and printing the outputs of each layer in PyTorch and umx.cpp until bugs were fixed and the outputs were identical.
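The DSP helpers from the first two steps can be sketched with the standard library alone. This is not the code in dsp.cpp: it assumes PyTorch's default STFT settings (center padding in "reflect" mode and a periodic Hann window), leaves out the FFT and the window scaling, and uses hypothetical names (`reflect_pad`, `hann_window`, `to_polar`, `from_polar`).

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Reflect-pad `pad` samples on both sides without repeating the edge
// sample (assumed to match pad_mode="reflect" used with center=True).
std::vector<float> reflect_pad(const std::vector<float>& x, std::size_t pad) {
    assert(x.size() > pad);  // reflection needs pad < length
    const std::size_t n = x.size();
    std::vector<float> out;
    out.reserve(n + 2 * pad);
    for (std::size_t i = pad; i >= 1; --i) out.push_back(x[i]);
    out.insert(out.end(), x.begin(), x.end());
    for (std::size_t i = 1; i <= pad; ++i) out.push_back(x[n - 1 - i]);
    return out;
}

// Periodic Hann window: w[i] = 0.5 - 0.5 * cos(2 * pi * i / n).
std::vector<float> hann_window(std::size_t n) {
    const double pi = std::acos(-1.0);
    std::vector<float> w(n);
    for (std::size_t i = 0; i < n; ++i) {
        w[i] = static_cast<float>(0.5 - 0.5 * std::cos(2.0 * pi * i / n));
    }
    return w;
}

// Magnitude/phase pair for one STFT bin.
struct Polar {
    float magnitude;
    float phase;
};

// Split a complex bin into magnitude and phase ...
Polar to_polar(std::complex<float> z) {
    return {std::abs(z), std::arg(z)};
}

// ... and rebuild the complex bin from the polar form, as needed after the
// network has produced a new magnitude while the mix phase is reused.
std::complex<float> from_polar(Polar p) {
    return std::polar(p.magnitude, p.phase);
}
```

For example, `reflect_pad({1, 2, 3, 4}, 2)` yields `{3, 2, 1, 2, 3, 4, 3, 2}`, and `hann_window(4)` is approximately `{0, 0.5, 1, 0.5}`.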
Input/output scale + mean
PyTorch:
x = x*self.input_scale + self.input_mean
C++ with Eigen:
// clone input mix mag x
Eigen::MatrixXf x_input = x;
// apply formula x = x*input_scale + input_mean
#pragma omp parallel for
for (int i = 0; i < x_input.rows(); i++)
{
    x_input.row(i) = x_input.row(i).array() *
                         model.input_scale.array() +
                     model.input_mean.array();
}
Fully-connected/linear layers (with no bias)
PyTorch:
x = self.fc1(x)
C++ with Eigen:
// y = x A^T (no bias term)
// x = (nb_frames, in_features)
// A = weights = (out_features, in_features)
// A^T = A transpose = (in_features, out_features)
// model.fc1_w is already stored as A^T, so no transpose is needed here
x_input = x_input * model.fc1_w;
Batchnorm1d
PyTorch:
x = self.bn1(x)
C++ with Eigen:
// batchnorm1d calculation, followed by the tanh activation
// y = (x - E[x]) / sqrt(Var[x] + ϵ) * gamma + beta
#pragma omp parallel for
for (int i = 0; i < x_input.rows(); i++)
{
    x_input.row(i) =
        (((x_input.row(i).array() -
           model.bn1_rm.array()) /
          (model.bn1_rv.array() + 1e-5).sqrt()) *
             model.bn1_w.array() +
         model.bn1_b.array())
            .tanh();
}
LSTM (multilayer bidirectional)
PyTorch:
lstm_out = self.lstm(x)
C++ with Eigen:
// hidden (h) and cell (c) states start as zero matrices (setup not shown)
Eigen::MatrixXf loop_input = input;
for (int lstm_layer = 0; lstm_layer < 3; ++lstm_layer)
{
    // parallelize the directions which don't depend on each other
#pragma omp parallel for
    for (int direction = 0; direction < 2; ++direction)
    {
        // forward direction = 0: for t = 0 to seq_len - 1
        // backward direction = 1: for t = seq_len - 1 to 0
        for (int t = (direction == 0 ? 0 : seq_len - 1);
             (direction == 0 ? t < seq_len : t > -1);
             t += (direction == 0 ? 1 : -1))
        {
            // apply the inner input/hidden gate calculation for all gates:
            // W_ih * x_t + b_ih + W_hh * h_{t-1} + b_hh
            //
            // at the end of the loop iteration, h[lstm_layer][direction]
            // stores h_t of this iteration; at the beginning of the next
            // iteration it therefore holds h_{t-1}, which is what we want.
            // the same applies to c[lstm_layer][direction] and c_{t-1}
            //
            // the initial values for h and c are 0
            Eigen::MatrixXf gates =
                model.lstm_ih_w[lstm_layer][direction].transpose() *
                    loop_input.row(t).transpose() +
                model.lstm_ih_b[lstm_layer][direction] +
                model.lstm_hh_w[lstm_layer][direction].transpose() *
                    h[lstm_layer][direction] +
                model.lstm_hh_b[lstm_layer][direction];
            // slice up the gates into i|f|g|o-sized chunks
            Eigen::MatrixXf i_t =
                sigmoid(gates.block(0, 0, hidden_state_size, 1));
            Eigen::MatrixXf f_t = sigmoid(
                gates.block(hidden_state_size, 0, hidden_state_size, 1));
            Eigen::MatrixXf g_t = (gates.block(2 * hidden_state_size, 0,
                                               hidden_state_size, 1))
                                      .array()
                                      .tanh();
            Eigen::MatrixXf o_t = sigmoid(gates.block(
                3 * hidden_state_size, 0, hidden_state_size, 1));
            Eigen::MatrixXf c_t =
                f_t.array() * c[lstm_layer][direction].array() +
                i_t.array() * g_t.array();
            Eigen::MatrixXf h_t = o_t.array() * (c_t.array().tanh());
            // store the hidden and cell states for later use
            h[lstm_layer][direction] = h_t;
            c[lstm_layer][direction] = c_t;
            output_per_direction[lstm_layer][direction].row(t)
                << h_t.transpose();
        }
    }
    // after both directions are done per LSTM layer, concatenate the
    // outputs
    output[lstm_layer] << output_per_direction[lstm_layer][0],
        output_per_direction[lstm_layer][1];
    loop_input = output[lstm_layer];
}
// return the concatenated forward and backward hidden state as the final
// output
return output[2];
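The excerpt above calls a `sigmoid` helper that is not shown. As a rough, Eigen-free illustration of the per-element gate math for a single time step, the following sketch takes the already-summed gate pre-activations (in PyTorch's i|f|g|o order) and the previous cell state. `Vec`, `LstmState`, `lstm_step`, and this scalar `sigmoid` are hypothetical names, not the helpers in lstm.cpp.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;

// Logistic function applied to one gate pre-activation.
inline float sigmoid(float v) {
    return 1.0f / (1.0f + std::exp(-v));
}

// New hidden and cell state after one time step.
struct LstmState {
    Vec h;
    Vec c;
};

// One LSTM time step for a single layer and direction.
// `gates` holds 4*H values: W_ih * x_t + b_ih + W_hh * h_{t-1} + b_hh,
// laid out as the i, f, g, o chunks of size H each.
LstmState lstm_step(const Vec& gates, const Vec& c_prev) {
    const std::size_t H = c_prev.size();
    assert(gates.size() == 4 * H);
    LstmState next{Vec(H), Vec(H)};
    for (std::size_t k = 0; k < H; ++k) {
        const float i_t = sigmoid(gates[k]);             // input gate
        const float f_t = sigmoid(gates[H + k]);         // forget gate
        const float g_t = std::tanh(gates[2 * H + k]);   // candidate cell
        const float o_t = sigmoid(gates[3 * H + k]);     // output gate
        next.c[k] = f_t * c_prev[k] + i_t * g_t;         // c_t
        next.h[k] = o_t * std::tanh(next.c[k]);          // h_t
    }
    return next;
}
```

With all pre-activations at zero, every sigmoid gate equals 0.5 and the candidate is 0, so the cell state is simply halved; this makes a convenient sanity check when comparing layer outputs against PyTorch.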