kuielab / mdx-net

KUIELAB-MDX-Net took 2nd place on Leaderboard A and 3rd place on Leaderboard B in the MDX-Challenge ISMIR 2021

Home Page: https://www.aicrowd.com/challenges/music-demixing-challenge-ismir-2021/

License: MIT License

Shell 0.74% Python 99.26%

mdx-challenge music-source-separation source-separation pytorch pytorch-lightning hydra wandb

mdx-net's People

chenchy kimberleyjensen spinachr miblue119 cnh2769 xj-martin diggerdu ma5onic straystray elv-zhounan ws-choi liujingxiu23 render-ai ultraderek

mdx-net's Issues

Error on building pesq "pesq/cypesq.c:6:10: fatal error: Python.h: No such file or directory"

Error encountered while running "pip install -r requirements.txt"

Solution:
We need to install libpythonX.X-dev to build pesq.
The version should match your SYSTEM PYTHON VERSION, not the conda Python version, because pesq will be built by the gcc on your system.

Example:
(mdx-net) cnh2769@SPV02:~/_Project/mdx-net$ which python3
/home/cnh2769/anaconda3/envs/mdx-net/bin/python3
(mdx-net) cnh2769@SPV02:~/_Project/mdx-net$ python3 --version
Python 3.8.11
(mdx-net) cnh2769@SPV02:~/Project/mdx-net$ /usr/bin/python3 --version
Python 3.7.11
(mdx-net) cnh2769@SPV02:~/Project/mdx-net$ sudo apt install libpython3.7-dev
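To check which header directory a given interpreter expects before installing a -dev package, a small script like the following may help. It is a minimal sketch using only the standard library; the function name python_header_status is made up for illustration. Run it with the interpreter whose headers you want to inspect.

```python
import os
import sysconfig


def python_header_status():
    """Return (include_dir, has_header) for the interpreter running this script."""
    include_dir = sysconfig.get_paths()["include"]
    header = os.path.join(include_dir, "Python.h")
    return include_dir, os.path.isfile(header)


if __name__ == "__main__":
    include_dir, has_header = python_header_status()
    print("include dir:", include_dir)
    print("Python.h present:", has_header)
```

If Python.h is reported as missing, the headers for that interpreter are not installed in the directory it expects.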

Data Parallel

Can you please explain the meaning of these model settings?

l: 3
g: 32
k: 3
bn: 8

Could you please explain what l, g, k, and bn mean and how they are used? Thank you.

yaml for 12, 13 servers

YAML

The .yaml files for servers 12 and 13 will be located in:
- configs/experiment/12_${target_name}.yaml
- configs/experiment/13_${target_name}.yaml

parameter checklist

.ENV File

Your wandb API key (send it to me via KakaoTalk)

Final model configs

dim_t: 256
model:
  num_blocks: 11
  l: 3 => 4
  g: 32 => 36
  k: 3
  bn: 8
  bias: False

augmentation: pitch=[-2,+2], tempo=[-20,+20]
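
The augmentation ranges above can be read as a grid of (pitch, tempo) settings. The sketch below enumerates such a grid and builds one soundstretch command per setting. The step sizes (1 semitone, 10 percent) and the file names are assumptions for illustration, not values stated in this issue; -pitch and -tempo are the SoundStretch switches for semitone and percent changes.

```python
from itertools import product


def augmentation_grid(pitch_range=(-2, 2), tempo_range=(-20, 20),
                      pitch_step=1, tempo_step=10):
    """List (pitch, tempo) pairs; the step sizes are assumed, not from the issue."""
    pitches = range(pitch_range[0], pitch_range[1] + 1, pitch_step)
    tempos = range(tempo_range[0], tempo_range[1] + 1, tempo_step)
    return list(product(pitches, tempos))


def soundstretch_cmd(src, dst, pitch, tempo):
    """Build the argument list for one soundstretch call (not executed here)."""
    return ["soundstretch", src, dst, f"-pitch={pitch}", f"-tempo={tempo}"]


if __name__ == "__main__":
    grid = augmentation_grid()
    print(len(grid), "settings")  # 5 pitches x 5 tempos = 25
    print(soundstretch_cmd("vocals.wav", "vocals_p2_t-20.wav", 2, -20))
```

The argument list can be passed to subprocess.run once the paths point at real files.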

Data Augmentation

implement test code for below
data.zip

Originally posted by @rlaalstjr47 in #3 (comment)

TODO

gen script with hydra
auto metadata caching for data augmentation
parameterized data aug with hydra
debugging to check if it is well-designed

duplicated wandb session created

Is it possible to separate a music track with this model on a computer with an Intel CPU and no Nvidia GPU, with or without OpenVINO?

How to train a model to separate an instrument

Hi all,

This might be a silly question.
When I used a tool named Ultimate Vocal Remover, I noticed the excellent performance of the mdx-net models. So I plan to use this repository to train a model that separates specific instruments from an ensemble, but I have no idea how to get started.

Thanks a lot!

How do you save the mixer model?

Hi !

I am trying to train the mixer model, but it only saves a .ckpt file that is around 58 MB. When I run predict_blend with my mixer checkpoint, I get this error:

RuntimeError: Error(s) in loading state_dict for Mixer: Missing key(s) in state_dict: "linear.weight". Unexpected key(s) in state_dict: "epoch", "global_step", "pytorch-lightning_version", "state_dict", "hparams_name", "hyper_parameters".

In your repo the mixer model is very small, and it says to "save .pt, the only learnable parameters in Mixer". Could you tell me how to do this, please?

Thanks !
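
The error message suggests that predict_blend received the whole Lightning checkpoint rather than a plain state dict: the unexpected keys are the top-level checkpoint entries, and the missing linear.weight would sit inside state_dict. One possible way to export only the weights is sketched below. It assumes the keys inside state_dict already match what Mixer expects (for example linear.weight); the function names are illustrative and not part of the repository.

```python
def unwrap_state_dict(checkpoint):
    """Return the weights stored under "state_dict" if this is a Lightning checkpoint."""
    if isinstance(checkpoint, dict) and "state_dict" in checkpoint:
        return checkpoint["state_dict"]
    return checkpoint


def export_weights(ckpt_path, out_path):
    """Load a .ckpt file and save only its weights as a .pt file."""
    import torch  # imported here so unwrap_state_dict works without PyTorch

    # Recent PyTorch releases may need weights_only=False to read a full
    # Lightning checkpoint; older releases do not accept that argument.
    checkpoint = torch.load(ckpt_path, map_location="cpu")
    torch.save(unwrap_state_dict(checkpoint), out_path)
```

A call such as export_weights("mixer.ckpt", "mixer.pt") would then write the small file; both paths here are hypothetical.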

EarlyStop

Encountered an error while executing the training process

Hi, I followed all the steps you mentioned, then ran python run.py experiment=multigpu_vocals model=ConvTDFNet_vocals to start training and encountered the following problem. 💔

RuntimeError: Early stopping conditioned on metric `val/sdr` which is not available. Pass in or modify your `EarlyStopping` callback to use any of the following: `train/loss`, `train/loss_step`, `train/loss_epoch`

I would appreciate your reply. ✨✨

[Torch 1.13.1 > 2.0.0] RuntimeError: istft requires a complex-valued input tensor matching the output from stft with return_complex=True

I'm using this Colab. It is roughly modified to work with the current Python 3.10 environment, which was introduced in all Colabs at the end of April.
It uses onnxruntime 1.14 to fix the ModuleNotFoundError: No module named 'onnxruntime' error, and the following code modification at line 57 of this main.py

self.onnx_models[c] = ort.InferenceSession(os.path.join(args.onnx,model.target_name+'.onnx'), providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])

to fix the following error:
TypeError: view_as_complex(): argument 'input' (position 1) must be Tensor, not builtin_function_or_method

Besides torch, it also uses soundfile 0.12.1 (for random soundfile errors) and librosa 0.9.1 (for demucs to work again) to fix further issues that appeared after the changes to Colab at the beginning of the year.

The only remaining problem is Torch 2.0.0, which we will be forced to use in the future when Colab changes its runtime to Python 3.11, probably next year. For now Torch 1.13.1 works, but I thought it wouldn't hurt to ask about a solution here.

The error with torch 2.0.0:

Loading checkpoint... done
2023-05-01 18:23:22.498005355 [W:onnxruntime:Default, onnxruntime_pybind_state.cc:515 CreateExecutionProviderInstance] Failed to create TensorrtExecutionProvider. Please reference https://onnxruntime.ai/docs/execution-providers/TensorRT-ExecutionProvider.html#requirements to ensure all dependencies are met.
Loading onnx model... done
Processing base:   0% 0/1 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/content/drive/MyDrive/MDX_Colab/main.py", line 371, in <module>
    main()
  File "/content/drive/MyDrive/MDX_Colab/main.py", line 353, in main
    pred.prediction(
  File "/content/drive/MyDrive/MDX_Colab/main.py", line 73, in prediction
    sources = self.demix(mix.T)
  File "/content/drive/MyDrive/MDX_Colab/main.py", line 149, in demix
    base_out = self.demix_base(segmented_mix, margin_size=margin)
  File "/content/drive/MyDrive/MDX_Colab/main.py", line 188, in demix_base
    tar_waves = model.istft(torch.tensor(_ort.run(None, {'input': spek.cpu().numpy()})[0]))#.cpu()
  File "/content/drive/MyDrive/MDX_Colab/models.py", line 158, in istft
    x = torch.istft(x, n_fft=self.n_fft, hop_length=self.hop, window=self.window, center=True)
RuntimeError: istft requires a complex-valued input tensor matching the output from stft with return_complex=True.
Processing base:   0% 0/1 [00:07<?, ?it/s]

If it is actually a dependency issue, we could use tensorrt 8.6.0 (if it is not already in use), but PyPI doesn't have the required CUDA 11.6 (11.7 is currently used).
__
tensorrt 8.6.0 didn't help, and CUDA 11.6 is not recommended for torch 2.0.0
https://pytorch.org/blog/deprecation-cuda-python-support/
Still stuck.

Edit.
The issue is fixed. I'll write later what has been changed. The changes are based on Vocal Remover 5 Colab by Audio Hacker which is Torch 2.0 compatible.
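
Since the actual changes were not posted, the following is only a sketch of one common way to satisfy the Torch 2.0 requirement: reinterpret a real tensor whose last dimension holds (real, imaginary) pairs as a complex tensor with torch.view_as_complex before calling torch.istft. The tensor layout, the helper names, and the n_fft and hop values are assumptions; the model in the Colab may arrange its output differently.

```python
try:
    import torch
except ImportError:  # allows importing this file where PyTorch is not installed
    torch = None


def istft_from_real(spec_ri, n_fft, hop, window, length=None):
    """Inverse STFT for a real tensor shaped [..., freq, frames, 2].

    torch.istft in Torch 2.0 only accepts complex input, so the trailing
    (real, imag) axis is reinterpreted as a complex tensor first.
    """
    spec = torch.view_as_complex(spec_ri.contiguous())
    return torch.istft(spec, n_fft=n_fft, hop_length=hop, window=window,
                       center=True, length=length)


def roundtrip_error(n_fft=1024, hop=256, samples=8192):
    """Max absolute error of stft -> real view -> istft on random audio."""
    window = torch.hann_window(n_fft)
    wave = torch.randn(2, samples)
    spec = torch.stft(wave, n_fft=n_fft, hop_length=hop, window=window,
                      center=True, return_complex=True)
    restored = istft_from_real(torch.view_as_real(spec), n_fft, hop, window,
                               length=samples)
    return (wave - restored).abs().max().item()
```

Calling roundtrip_error() is a quick way to confirm that the conversion reconstructs the input waveform on the installed Torch version.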

Presentation slide from readme = 404 error not found

The link to the presentation slide is dead

Separation without training

Hello, is it possible to run separation using KUIELab-MDX-Net without training the model from scratch? Do you share any pretrained models? It would ease the evaluation of your solution and its usage in a real-world scenario.

How to train a model that can fully extract the 44100hz frequency

I want to train a 2-stem model.

I noticed that some parameters in each model's yaml configuration affect the final frequency cutoff. It seems that multigpu_drums.yaml can handle the full 44100 Hz frequency range, but with num_blocks reduced (11 => 9), the model size also decreases (29 MB => 21 MB).

So although something like multigpu_drums.yaml can handle 44100 Hz in full, the model shrinks. Does this affect the final accuracy?

The parameters dim_t, hop_length, overlap, and num_blocks seem to complement each other in a way that I cannot understand. Maybe this complementarity was designed for the competition (mixing with demucs), but I want to apply this to the real world without demucs (mdx-net only; after some testing, I think mdx-net has more potential than demucs).

When I change num_blocks from 9 to 11, the inference results have overlapping and broken voices. Do you have any parameter recommendations for training a full 44100 Hz model without loss of accuracy (i.e. without the model shrinking)?
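
For reference, the frequency range covered by the first dim_f STFT bins follows from the STFT itself: bin k sits at k * sample_rate / n_fft. The sketch below applies that relation. The names n_fft and dim_f and the example values are assumptions chosen for illustration, not settings quoted in this issue.

```python
def cutoff_hz(dim_f, n_fft, sample_rate=44100):
    """Approximate upper frequency (Hz) when only the first dim_f STFT bins are kept."""
    nyquist = sample_rate / 2
    return min(dim_f * sample_rate / n_fft, nyquist)


if __name__ == "__main__":
    # Illustrative values only.
    print(cutoff_hz(2048, 4096))  # 22050.0, the full band at 44.1 kHz
    print(cutoff_hz(2048, 6144))  # 14700.0, content above this is discarded
```

Under this relation, a larger n_fft with the same number of kept bins lowers the cutoff, independently of num_blocks.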

how to use auto_lr_find (NameError: name 'trainer' is not defined)

Hi, I am trying to use Lightning's auto_lr_find. I set it to true and ran the command "trainer.tune(model)", but I get the error

NameError: name 'trainer' is not defined

Do you have any advice for this, please?

initialization

current: default is xavier

minseok recommendation: remove xavier

Encountered errors while executing training process #2

(Using Leaderboard_B)
First I was stuck solving the environment and I let it sit for 30 min, but conda never finished creating the env from the yml.
Because I was using a cloud instance, I didn't have time to wait and I did this instead:

conda create -n mdx-net
conda update conda
conda config --add channels conda-forge
conda activate mdx-net
sudo apt-get install soundstretch
python -m pip install -r requirements.txt
python src/utils/data_augmentation.py --data_dir /real/path/to/musdbhq/ --train True --test True

It seems that the model doesn't allow me to train it with songs that don't contain vocals.

python src/utils/data_augmentation.py --data_dir /home/ubuntu/mdx-files/musdb/ --train True --test True
 10%|███████████████▉                                                                                                                                                     | 11/114 [01:13<11:25,  6.65s/it]
Traceback (most recent call last):
  File "src/utils/data_augmentation.py", line 111, in <module>
    main(parser.parse_args())
  File "src/utils/data_augmentation.py", line 30, in main
    save_shifted_dataset(p, t, data_dir, 'train')
  File "src/utils/data_augmentation.py", line 92, in save_shifted_dataset
    source = load_wav(in_path.joinpath(s_name+'.wav'))
  File "src/utils/data_augmentation.py", line 102, in load_wav
    return sf.read(path, samplerate=sr, dtype='float32')[0].T
  File "/home/ubuntu/.local/lib/python3.8/site-packages/soundfile.py", line 256, in read
    with SoundFile(file, 'r', samplerate, channels,
  File "/home/ubuntu/.local/lib/python3.8/site-packages/soundfile.py", line 629, in __init__
    self._file = self._open(file, mode_int, closefd)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/soundfile.py", line 1183, in _open
    _error_check(_snd.sf_error(file_ptr),
  File "/home/ubuntu/.local/lib/python3.8/site-packages/soundfile.py", line 1357, in _error_check
    raise RuntimeError(prefix + _ffi.string(err_str).decode('utf-8', 'replace'))
RuntimeError: Error opening '/home/ubuntu/mdx-files/musdb/train/Artificial Intelligence - Native Instruments/vocals.wav': System error.

I deleted the songs that didn't contain vocals, then the data augmentation succeeded, but all attempts to train failed and I didn't have time to do debugging in the cloud GPU instance.
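
The "System error" above is raised while opening vocals.wav, which is what one would expect if that file is absent from the track folder. A pre-check like the one below lists tracks with missing stem files before running the augmentation. It is a sketch: the stem names follow the MUSDB18-HQ layout, and the helper name is hypothetical.

```python
from pathlib import Path

STEMS = ("mixture", "vocals", "drums", "bass", "other")  # MUSDB18-HQ file names


def find_incomplete_tracks(split_dir, stems=STEMS):
    """Map each track folder name to the list of stem .wav files it lacks."""
    missing = {}
    for track in sorted(Path(split_dir).iterdir()):
        if not track.is_dir():
            continue
        absent = [s for s in stems if not (track / f"{s}.wav").is_file()]
        if absent:
            missing[track.name] = absent
    return missing
```

For example, find_incomplete_tracks("/real/path/to/musdbhq/train") returns an empty dict when every track folder has all five files.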

Here is the output from: python run.py experiment=multigpu_other model=ConvTDFNet_other

/usr/lib/python3/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: /usr/lib/python3/dist-packages/torchvision/image.so: undefined symbol: _ZNK3c106IValue23reportToTensorTypeErrorEv
  warn(f"Failed to load image Python extension: {e}")
Traceback (most recent call last):
  File "run.py", line 7, in <module>
    from pytorch_lightning.utilities import rank_zero_info
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pytorch_lightning/__init__.py", line 20, in <module>
    from pytorch_lightning import metrics  # noqa: E402
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pytorch_lightning/metrics/__init__.py", line 15, in <module>
    from pytorch_lightning.metrics.classification import (  # noqa: F401
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pytorch_lightning/metrics/classification/__init__.py", line 14, in <module>
    from pytorch_lightning.metrics.classification.accuracy import Accuracy  # noqa: F401
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pytorch_lightning/metrics/classification/accuracy.py", line 16, in <module>
    from torchmetrics import Accuracy as _Accuracy
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torchmetrics/__init__.py", line 14, in <module>
    from torchmetrics import functional  # noqa: E402
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torchmetrics/functional/__init__.py", line 14, in <module>
    from torchmetrics.functional.audio.pit import permutation_invariant_training, pit, pit_permutate
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torchmetrics/functional/audio/__init__.py", line 26, in <module>
    from torchmetrics.functional.audio.pesq import perceptual_evaluation_speech_quality  # noqa: F401
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torchmetrics/functional/audio/pesq.py", line 20, in <module>
    import pesq as pesq_backend
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pesq/__init__.py", line 5, in <module>
    from ._pesq import pesq, pesq_batch
  File "/home/ubuntu/.local/lib/python3.8/site-packages/pesq/_pesq.py", line 8, in <module>
    from .cypesq import cypesq, cypesq_retvals, cypesq_error_message as pesq_error_message
  File "__init__.pxd", line 238, in init cypesq
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 80 from PyObject

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   35C    P0    36W / 250W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCI...  On   | 00000000:08:00.0 Off |                    0 |
| N/A   34C    P0    33W / 250W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

Yaml Integration

Validation for DDP

validation logic for DDP is needed

wandb only logs top k val loss

A Baseline for SDX 2023

We are implementing a baseline model for SDX2023.

Currently, a script to train vocals has been pushed to sdx_label.

target
- vocals: python run.py experiment=multigpu_vocals datamodule=moisesdb23_labelnoise_v1
- drums
- bass
- other
datamodule
- tested on moisesdb23_labelnoise_v1
- tested on moisesdb23 bleeding
- on-the-fly data augmentation support
bug
- multi gpus with ddp not working

Please feel free to make a PR for any items on this todo list.

wandb:

vocals on labelnoise: https://wandb.ai/wschoi/mdx_vocals/runs/hlssqa8a?workspace=user-wschoi

Sync Batchnorm test

quick fixes

related #16

auto lr find

cmd, wandb train loss is different

mskim revision

Please check whether the class below is designed correctly.

mdx-net/src/models/mdxnet.py

Line 128 in b9eb84d

class ConvTDFNet(AbstractMDXNet):

TFC-TDF-U-Net's performance on Musdb18

Our main approach is based on TFC-TDF-U-Net [3].

This model was originally proposed for Singing Voice Separation, but it turns out that it also performs well for other musical instruments (drums, bass, other).

For more information, please see page 53 of the following dissertation:
- Choi, Woosung, Deep Learning-based Latent Source Analysis for Source-aware Audio Manipulation. PhD Dissertation. Korea University, 2021.
- TFC-TDF-U-Net Repository

[3] Choi, Woosung, et al. "Investigating U-Nets with various intermediate blocks for spectrogram-based singing voice separation." 21st International Society for Music Information Retrieval Conference, ISMIR. 2020.