umx.cpp

💥 💫 2023-09-10 update: Wiener-EM is now implemented for maximum performance!

C++17 implementation of Open-Unmix (UMX), a PyTorch neural network for music demixing.

It uses libnyquist to load audio files, the ggml file format to serialize the PyTorch weights of umxhq and umxl to a binary file format, and Eigen (+ OpenMP) to implement the inference of Open-Unmix.

The float32 weights of UMX are quantized to uint16 during the conversion to the binary ggml format. The size on disk of umx.cpp's weight files is therefore ~50% of the original weights (216MB vs. 432MB for umxl, 68MB vs. 136MB for umxhq), with nearly identical BSS (blind source separation) results.
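
To illustrate where the ~50% figure comes from, here is a minimal sketch of one common way to store float32 weights as uint16: affine (min/max) quantization. This is an assumption for illustration only. The text above does not say which 16-bit scheme convert-pth-to-ggml.py actually uses, and the names QuantizedTensor, quantize_u16, and dequantize_u16 are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container: 2 bytes per weight plus two floats of metadata.
struct QuantizedTensor {
    float min_val;                    // smallest weight in the tensor
    float scale;                      // width of one quantization step
    std::vector<std::uint16_t> data;  // quantized levels in [0, 65535]
};

// Map each float32 weight onto one of 65536 evenly spaced levels.
QuantizedTensor quantize_u16(const std::vector<float>& w) {
    QuantizedTensor q{0.0f, 1.0f, {}};
    if (w.empty()) return q;
    auto [lo, hi] = std::minmax_element(w.begin(), w.end());
    q.min_val = *lo;
    const float range = *hi - *lo;
    q.scale = range > 0.0f ? range / 65535.0f : 1.0f;
    q.data.reserve(w.size());
    for (float x : w) {
        const long level = std::lround((x - q.min_val) / q.scale);
        q.data.push_back(
            static_cast<std::uint16_t>(std::clamp(level, 0L, 65535L)));
    }
    return q;
}

// Recover approximate float32 weights; the error is at most about half a step.
std::vector<float> dequantize_u16(const QuantizedTensor& q) {
    assert(q.scale > 0.0f);  // always true for tensors from quantize_u16
    std::vector<float> out;
    out.reserve(q.data.size());
    for (std::uint16_t v : q.data) {
        out.push_back(q.min_val + q.scale * static_cast<float>(v));
    }
    return out;
}
```

Each weight takes 2 bytes instead of 4, which is consistent with the halved file sizes quoted above.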

Performance

The demixed output wav files (and their SDR scores) of the main program umx.cpp are nearly identical to those of the PyTorch models:

# first, standard pytorch inference
$ python ./scripts/umx_pytorch_inference.py \
    --model=umxl \
    --dest-dir=./umx-py-xl-out \
    "/MUSDB18-HQ/test/Punkdisco - Oral Hygiene"

# then, inference with umx.cpp
$ umx.cpp.main ./ggml-umxl \
    "/MUSDB18-HQ/test/Punkdisco - Oral Hygiene" \
    ./umx-cpp-xl-out

# evaluate both; the SDR scores are nearly the same

$ python ./scripts/evaluate-demixed-output.py \
    --musdb-root="/MUSDB18-HQ" \
    ./umx-py-xl-out \
    'Punkdisco - Oral Hygiene'

vocals          ==> SDR:   7.695  SIR:  17.312  ISR:  16.426  SAR:   8.322
drums           ==> SDR:   8.899  SIR:  14.054  ISR:  14.941  SAR:   9.428
bass            ==> SDR:   8.338  SIR:  14.352  ISR:  14.171  SAR:  10.971
other           ==> SDR:   2.017  SIR:   6.266  ISR:   6.821  SAR:   2.410

$ python ./scripts/evaluate-demixed-output.py \
    --musdb-root="/MUSDB18-HQ" \
    ./umx-cpp-xl-out \
    'Punkdisco - Oral Hygiene'

vocals          ==> SDR:   7.750  SIR:  17.510  ISR:  16.195  SAR:   8.321
drums           ==> SDR:   9.010  SIR:  14.149  ISR:  14.900  SAR:   9.416
bass            ==> SDR:   8.349  SIR:  14.348  ISR:  14.160  SAR:  10.990
other           ==> SDR:   1.987  SIR:   6.282  ISR:   6.674  SAR:   2.461

At runtime, umx.cpp is actually slower than PyTorch inference (and probably much slower than a possible Torch C++ inference implementation).

Motivation

During the recent LLM hype (GPT-4, ChatGPT, LLaMA, etc.), ggerganov's projects llama.cpp and the earlier whisper.cpp gained popularity. These projects load the pretrained weights of the underlying neural network (trained using PyTorch) and reimplement the computations needed for inference in C/C++ without using Torch.

In the past, I've worked on a few derivatives of Open-Unmix, an open-source standard model for music demixing. The most recent model, UMX-L, achieved a higher score than UMX-HQ.

I wanted to imitate llama.cpp and whisper.cpp on a neural network that I was familiar with, so I chose UMX. UMX is a small model (136 MB for UMX-HQ, 432 MB for UMX-L), so this is more of a technical curiosity than a necessity.

Instructions

Clone the repo

Make sure you clone with submodules:

$ git clone --recurse-submodules https://github.com/sevagh/umx.cpp

Set up Python

The first step is to create a Python environment (however you like; I'm a fan of mamba) and install the requirements.txt file:

$ mamba create --name umxcpp python=3.10
$ mamba activate umxcpp
$ python -m pip install -r ./scripts/requirements.txt

Dump the Open-Unmix weights to ggml files (use the argument --model=umxl or --model=umxhq to switch between the two best pretrained models):

$ python ./scripts/convert-pth-to-ggml.py --model=umxl ./ggml-umxl
...
Skipping layer bn2.num_batches_tracked
Processing variable:  fc3.weight  with shape:  (4098, 1024)
Processing variable:  bn3.weight  with shape:  (4098,)
Processing variable:  bn3.bias  with shape:  (4098,)
Processing variable:  bn3.running_mean  with shape:  (4098,)
Processing variable:  bn3.running_var  with shape:  (4098,)
Skipping layer bn3.num_batches_tracked
Done. Output file:  ggml-umxl/ggml-model-umxl-other-u16.bin

This will load the model using PyTorch Torchhub (which implicitly downloads the weights files to the hidden torchhub folder), locate the weights files, and dump them using the ggml file format:

$ ls -latrh ggml-umxl/
total 216M
drwxrwxr-x  2 sevagh sevagh 4.0K Jun 28 10:14 .
drwxrwxr-x 13 sevagh sevagh 4.0K Jun 30 10:57 ..
-rw-rw-r--  1 sevagh sevagh 54M Jun 30 11:06 ggml-model-umxl-vocals-u16.bin
-rw-rw-r--  1 sevagh sevagh 54M Jun 30 11:06 ggml-model-umxl-drums-u16.bin
-rw-rw-r--  1 sevagh sevagh 54M Jun 30 11:06 ggml-model-umxl-bass-u16.bin
-rw-rw-r--  1 sevagh sevagh 54M Jun 30 11:06 ggml-model-umxl-other-u16.bin

Install the C++ dependencies (CMake, gcc/g++, Eigen, OpenMP) for your OS. My instructions are for Pop!_OS 22.04:

$ sudo apt-get install gcc g++ cmake clang-tools libeigen3-dev

Compile umx.cpp.main with CMake:

$ mkdir -p build && cd build && cmake .. && \
    make umx.cpp.main

Note: I have only tested this on my Linux-based computer (Pop!_OS 22.04), and you may need to figure out how to get the dependencies on your own.

Run umx.cpp.main:

$ ./umx.cpp.main
Usage: ./umx.cpp.main <model dir> <wav file> <out dir>

$ ./umx.cpp.main ./ggml-umxl ./test.wav ./demix-out-umxl
umx.cpp Main driver program
Number of physical cores: 32
Input Samples: 23222488
Length in seconds: 263.294
Number of channels: 2
load_umx_model: loading model
Discovered model file "../ggml-umxl/ggml-model-umxl-other-u16.bin" in model dir../ggml-umxl/
Discovered model file "../ggml-umxl/ggml-model-umxl-drums-u16.bin" in model dir../ggml-umxl/
Discovered model file "../ggml-umxl/ggml-model-umxl-vocals-u16.bin" in model dir../ggml-umxl/
Discovered model file "../ggml-umxl/ggml-model-umxl-bass-u16.bin" in model dir../ggml-umxl/
Checking the magic of model_file ../ggml-umxl/ggml-model-umxl-bass-u16.bin
Checking the magic of model_file ../ggml-umxl/ggml-model-umxl-drums-u16.bin
Checking the magic of model_file ../ggml-umxl/ggml-model-umxl-other-u16.bin
Checking the magic of model_file ../ggml-umxl/ggml-model-umxl-vocals-u16.bin
Loaded umx model with hidden size 1024
Loading weights from model_file ../ggml-umxl/ggml-model-umxl-bass-u16.bin into target 0
Loading tensor input_mean with shape [1487, 1]
      input_mean: [ 1487,     1], type = float,   0.01 MB
Loading tensor input_scale with shape [1487, 1]
     input_scale: [ 1487,     1], type = float,   0.01 MB
Loading tensor output_scale with shape [2049, 1]
    output_scale: [ 2049,     1], type = float,   0.01 MB
Loading tensor output_mean with shape [2049, 1]
     output_mean: [ 2049,     1], type = float,   0.01 MB
Loading tensor fc1.weight with shape [2974, 1024]
      fc1.weight: [ 2974,  1024], type = float,  11.62 MB
Loading tensor bn1.weight with shape [1024, 1]
      bn1.weight: [ 1024,     1], type = float,   0.00 MB
Loading tensor bn1.bias with shape [1024, 1]
        bn1.bias: [ 1024,     1], type = float,   0.00 MB
Loading tensor bn1.running_mean with shape [1024, 1]
bn1.running_mean: [ 1024,     1], type = float,   0.00 MB
Loading tensor bn1.running_var with shape [1024, 1]

... <truncated>

Loaded model (172 tensors, 215.68 MB) in 0.609294 s
umx_model_load returned true
Computing STFT
spec shape: (incl 2 chan) 11340 x 2049
Computing STFT magnitude
Computing STFT phase
Running inference with Eigen matrices

Writing wav file "./demix-out-umxl/target_0.wav" to ./demix-out-umxl
Encoder Status: 0
Writing wav file "./demix-out-umxl/target_2.wav" to ./demix-out-umxl
Encoder Status: 0
Writing wav file "./demix-out-umxl/target_1.wav" to ./demix-out-umxl
Encoder Status: 0
Writing wav file "./demix-out-umxl/target_3.wav" to ./demix-out-umxl
Encoder Status: 0

Design

I took the following steps to write umx.cpp:

Create STFT/iSTFT functions equivalent to the PyTorch stft used by Open-Unmix

The source file dsp.cpp contains STFT/iSTFT functions with center padding and window scaling using Eigen's unsupported/Eigen/FFT (which is more or less the same as KissFFT, adapted to use the C++ standard library types).

The script compare-torch-stft.py uses openunmix.transforms.make_filterbanks to return the same STFT/iSTFT used in UMX inference (and prints some values for debugging). In test_dsp.cpp, I printed the same values and iterated until the outputs matched.
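
As a rough illustration of the center padding mentioned above, the sketch below reflect-pads a signal the way torch.stft does with center=True and its default pad_mode="reflect", and computes the resulting number of frames. It uses std::vector rather than Eigen to stay dependency-free. reflect_pad and num_frames_centered are hypothetical names, not functions from dsp.cpp, and the hop size of 1024 used in the note below is assumed from Open-Unmix's defaults.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reflect-pad `x` by `pad` samples on each side without repeating the
// edge sample, e.g. [1 2 3 4] with pad=2 becomes [3 2 1 2 3 4 3 2].
std::vector<float> reflect_pad(const std::vector<float>& x, std::size_t pad) {
    assert(x.size() > pad);  // reflection needs at least pad+1 samples
    std::vector<float> out;
    out.reserve(x.size() + 2 * pad);
    for (std::size_t i = pad; i >= 1; --i) {
        out.push_back(x[i]);  // mirrored left edge
    }
    out.insert(out.end(), x.begin(), x.end());
    for (std::size_t i = 1; i <= pad; ++i) {
        out.push_back(x[x.size() - 1 - i]);  // mirrored right edge
    }
    return out;
}

// With center padding of n_fft/2 on each side, the frame count depends
// only on the signal length and the hop size.
std::size_t num_frames_centered(std::size_t n_samples, std::size_t hop) {
    assert(hop > 0);
    return 1 + n_samples / hop;
}
```

For the run shown earlier (23222488 samples over 2 channels, i.e. 11611244 per channel), num_frames_centered(11611244, 1024) gives 11340, which matches the logged spec shape of 11340 x 2049.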

Create supporting functions (load audio waveform, magnitude/phase spectrograms, and getting a complex spectrogram from the polar form)

All can be seen in dsp.cpp.
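
The polar-form helpers can be sketched roughly as follows. This is a minimal standard-library version for illustration, assuming the spectrogram is a flat array of std::complex<float>. PolarSpec, to_polar, and from_polar are hypothetical names and are not the functions in dsp.cpp.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Magnitude and phase of a complex spectrogram, stored side by side.
struct PolarSpec {
    std::vector<float> magnitude;
    std::vector<float> phase;  // radians, in (-pi, pi]
};

// Split each complex bin into |z| and arg(z).
PolarSpec to_polar(const std::vector<std::complex<float>>& spec) {
    PolarSpec p;
    p.magnitude.reserve(spec.size());
    p.phase.reserve(spec.size());
    for (const auto& z : spec) {
        p.magnitude.push_back(std::abs(z));
        p.phase.push_back(std::arg(z));
    }
    return p;
}

// Rebuild complex bins as magnitude * exp(i * phase).
std::vector<std::complex<float>> from_polar(const std::vector<float>& magnitude,
                                            const std::vector<float>& phase) {
    assert(magnitude.size() == phase.size());
    std::vector<std::complex<float>> spec;
    spec.reserve(magnitude.size());
    for (std::size_t i = 0; i < magnitude.size(); ++i) {
        spec.push_back(std::polar(magnitude[i], phase[i]));
    }
    return spec;
}
```

Splitting the spectrogram this way lets a magnitude-domain estimate be recombined with a phase to get back a complex spectrogram for the iSTFT.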

Write convert-pth-to-ggml.py, which is borrowed from whisper.cpp.

This rather straightforwardly loads each PyTorch weight tensor and dumps it to a binary file.
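
The general idea of a magic-number-prefixed binary weights file can be sketched as below. The record layout here (magic, rows, cols, raw float32 data, all in host byte order) is a simplified, hypothetical one chosen for illustration. It is not the layout written by convert-pth-to-ggml.py, and Tensor, write_tensor, and read_tensor are made-up names. The magic value 0x67676d6c spells "ggml" in ASCII.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>  // only needed for an in-memory round trip
#include <vector>

constexpr std::uint32_t kMagic = 0x67676d6c;  // "ggml"

// Hypothetical 2-D float tensor record.
struct Tensor {
    std::int32_t rows;
    std::int32_t cols;
    std::vector<float> data;  // rows * cols values
};

// Write magic, shape, then the raw float data.
void write_tensor(std::ostream& out, const Tensor& t) {
    assert(t.data.size() ==
           static_cast<std::size_t>(t.rows) * static_cast<std::size_t>(t.cols));
    out.write(reinterpret_cast<const char*>(&kMagic), sizeof(kMagic));
    out.write(reinterpret_cast<const char*>(&t.rows), sizeof(t.rows));
    out.write(reinterpret_cast<const char*>(&t.cols), sizeof(t.cols));
    out.write(reinterpret_cast<const char*>(t.data.data()),
              static_cast<std::streamsize>(t.data.size() * sizeof(float)));
}

// Check the magic first, then read the shape and data.
// Returns false on a bad magic, bad shape, or short read.
bool read_tensor(std::istream& in, Tensor& t) {
    std::uint32_t magic = 0;
    in.read(reinterpret_cast<char*>(&magic), sizeof(magic));
    if (!in || magic != kMagic) return false;
    in.read(reinterpret_cast<char*>(&t.rows), sizeof(t.rows));
    in.read(reinterpret_cast<char*>(&t.cols), sizeof(t.cols));
    if (!in || t.rows <= 0 || t.cols <= 0) return false;
    t.data.resize(static_cast<std::size_t>(t.rows) *
                  static_cast<std::size_t>(t.cols));
    in.read(reinterpret_cast<char*>(t.data.data()),
            static_cast<std::streamsize>(t.data.size() * sizeof(float)));
    return static_cast<bool>(in);
}
```

Checking the magic before anything else lets the loader reject an unrelated file early, which is what the "Checking the magic of model_file" log lines earlier suggest the real loader does.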

Write model.cpp, which is also borrowed from whisper.cpp, and loads the binary files into Eigen::MatrixXf weight matrices
Implement the forward inference operations using the weight matrices in umx.cpp, with the more complex LSTM code in lstm.cpp

This was done by reading the PyTorch documentation for each module, writing the equations using Eigen, and printing the outputs of each layer in PyTorch and umx.cpp until bugs were fixed and the outputs were identical.

Layer implementations

Input/output scale + mean

PyTorch:

x = x*self.input_scale + self.input_mean

C++ with Eigen:

// clone input mix mag x
Eigen::MatrixXf x_input = x;

// apply formula x = x*input_scale + input_mean
#pragma omp parallel for
for (int i = 0; i < x_input.rows(); i++)
{
    x_input.row(i) = x_input.row(i).array() *
                              model.input_scale.array() +
                          model.input_mean.array();
}

Fully-connected/linear layers (with no bias)

PyTorch:

x = self.fc1(x)

C++ with Eigen:

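// y = x A^T (fc1 has no bias term)
// x = (nb_frames, in_features)
// A = PyTorch weights = (out_features, in_features)
// A^T = A transpose = (in_features, out_features)
// model.fc1_w is already stored as (in_features, out_features),
// so no explicit transpose is needed here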
x_input = x_input * model.fc1_w;

Batchnorm1d

PyTorch:

x = self.bn1(x)

C++ with Eigen:

// batchnorm1d calculation, followed by the tanh activation
// y = (x - E[x]) / sqrt(Var[x] + ϵ) * gamma + beta
#pragma omp parallel for
for (int i = 0; i < x_input.rows(); i++)
{
    x_input.row(i) =
        (((x_input.row(i).array() -
           model.bn1_rm.array()) /
          (model.bn1_rv.array() + 1e-5).sqrt()) *
             model.bn1_w.array() +
         model.bn1_b.array())
            .tanh();
}
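
For checking the arithmetic by hand, the same inference-time batchnorm followed by tanh can be written without Eigen. This is a minimal sketch for one frame (one row of features). batchnorm_tanh is a hypothetical name, and eps = 1e-5 is taken from the snippet above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// y[j] = tanh((x[j] - mean[j]) / sqrt(var[j] + eps) * gamma[j] + beta[j])
// mean/var are the stored running statistics, gamma/beta the learned
// weight and bias of the batchnorm layer.
std::vector<float> batchnorm_tanh(const std::vector<float>& x,
                                  const std::vector<float>& mean,
                                  const std::vector<float>& var,
                                  const std::vector<float>& gamma,
                                  const std::vector<float>& beta,
                                  float eps = 1e-5f) {
    const std::size_t n = x.size();
    assert(mean.size() == n && var.size() == n);
    assert(gamma.size() == n && beta.size() == n);
    std::vector<float> y(n);
    for (std::size_t j = 0; j < n; ++j) {
        const float normalized = (x[j] - mean[j]) / std::sqrt(var[j] + eps);
        y[j] = std::tanh(normalized * gamma[j] + beta[j]);
    }
    return y;
}
```

Because the running mean and variance are fixed at inference time, each feature only needs a subtraction, a division, a multiply-add, and the activation.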

LSTM (multilayer bidirectional)

PyTorch:

lstm_out = self.lstm(x)

C++ with Eigen:

// create Zero matrices for hidden, cell states

Eigen::MatrixXf loop_input = input;

for (int lstm_layer = 0; lstm_layer < 3; ++lstm_layer)
{
// parallelize the directions which don't depend on each other
#pragma omp parallel for
    for (int direction = 0; direction < 2; ++direction)
    {
        // forward direction = 0: for t = 0 to seq_len - 1
        // backward direction = 1: for t = seq_len - 1 to 0
        for (int t = (direction == 0 ? 0 : seq_len - 1);
             (direction == 0 ? t < seq_len : t > -1);
             t += (direction == 0 ? 1 : -1))
        {
            // apply the inner input/hidden gate calculation for all gates
            // W_ih * x_t + b_ih + W_hh * h_{t-1} + b_hh
            //
            // at the end of the loop iteration, h[lstm_layer][direction]
            // will store h_t of this iteration; at the beginning of the
            // next loop iteration, h[lstm_layer][direction] will be
            // h_{t-1}, which is what we want. The same holds for
            // c[lstm_layer][direction] and c_{t-1}.
            //
            // the initial values for h and c are 0
            Eigen::MatrixXf gates =
                model.lstm_ih_w[lstm_layer][direction].transpose() *
                    loop_input.row(t).transpose() +
                model.lstm_ih_b[lstm_layer][direction] +
                model.lstm_hh_w[lstm_layer][direction].transpose() *
                    h[lstm_layer][direction] +
                model.lstm_hh_b[lstm_layer][direction];

            // slice up the gates into i|f|g|o-sized chunks
            Eigen::MatrixXf i_t =
                sigmoid(gates.block(0, 0, hidden_state_size, 1));
            Eigen::MatrixXf f_t = sigmoid(
                gates.block(hidden_state_size, 0, hidden_state_size, 1));
            Eigen::MatrixXf g_t = (gates.block(2 * hidden_state_size, 0,
                                               hidden_state_size, 1))
                                      .array()
                                      .tanh();
            Eigen::MatrixXf o_t = sigmoid(gates.block(
                3 * hidden_state_size, 0, hidden_state_size, 1));

            Eigen::MatrixXf c_t =
                f_t.array() * c[lstm_layer][direction].array() +
                i_t.array() * g_t.array();
            Eigen::MatrixXf h_t = o_t.array() * (c_t.array().tanh());

            // store the hidden and cell states for later use
            h[lstm_layer][direction] = h_t;
            c[lstm_layer][direction] = c_t;

            output_per_direction[lstm_layer][direction].row(t)
                << h_t.transpose();
        }
    }

    // after both directions are done per LSTM layer, concatenate the
    // outputs
    output[lstm_layer] << output_per_direction[lstm_layer][0],
        output_per_direction[lstm_layer][1];

    loop_input = output[lstm_layer];
}

// return the concatenated forward and backward hidden state as the final
// output
return output[2];
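
The listing above calls a sigmoid helper that is not shown. The sketch below is one possible dependency-free version of the per-timestep cell update, using std::vector instead of Eigen. LstmState and lstm_cell_step are hypothetical names, the input is assumed to be the already-summed gate pre-activations, and the i|f|g|o gate order follows the slicing in the listing.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Logistic sigmoid for a single value.
inline float sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// Hidden and cell state for one layer and one direction.
struct LstmState {
    std::vector<float> h;
    std::vector<float> c;
};

// One LSTM time step. `gates` holds the 4*H pre-activations
// W_ih * x_t + b_ih + W_hh * h_{t-1} + b_hh, laid out as i|f|g|o.
LstmState lstm_cell_step(const std::vector<float>& gates,
                         const std::vector<float>& c_prev) {
    const std::size_t H = c_prev.size();
    assert(gates.size() == 4 * H);
    LstmState next{std::vector<float>(H), std::vector<float>(H)};
    for (std::size_t j = 0; j < H; ++j) {
        const float i_t = sigmoid(gates[j]);            // input gate
        const float f_t = sigmoid(gates[H + j]);        // forget gate
        const float g_t = std::tanh(gates[2 * H + j]);  // candidate cell
        const float o_t = sigmoid(gates[3 * H + j]);    // output gate
        next.c[j] = f_t * c_prev[j] + i_t * g_t;
        next.h[j] = o_t * std::tanh(next.c[j]);
    }
    return next;
}
```

With all pre-activations at zero, every sigmoid gate is 0.5 and the candidate is 0, so the cell state is simply halved. That makes for a quick sanity check when comparing against PyTorch outputs.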
