sigsep / open-unmix-pytorch
Open-Unmix - Music Source Separation for PyTorch
Home Page: https://sigsep.github.io/open-unmix/
License: MIT License
Hi,
Sorry for the bother. I have a basic question about the input-stage standardization.
After the STFT transform the model does:
x += self.input_mean
x *= self.input_scale
and the same, in the opposite order, at the output stage:
x *= self.output_scale
x += self.output_mean
I'm wondering about the input-stage part: if we want to normalize the spectrogram to zero mean and unit standard deviation, don't we need to subtract the mean from the samples and divide by the std, like this:
x -= self.input_mean
x /= self.input_scale
Any help will be very appreciated.
Thank you!
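For reference, a minimal sketch of why the two formulations can agree, under the assumption that the model registers the negated mean and the reciprocal standard deviation in those buffers (all names below are illustrative, not taken from model.py):

import torch

# hypothetical per-bin statistics computed on the training set
mean = torch.randn(2049)
std = torch.rand(2049) + 0.1

# if the buffers are registered as the negated mean and the reciprocal std ...
input_mean = -mean
input_scale = 1.0 / std

x = torch.randn(2049)  # stand-in for one spectrogram frame
# ... then the add/multiply used at the input stage
y_forward = (x + input_mean) * input_scale
# ... matches the usual standardization
y_standard = (x - mean) / std
assert torch.allclose(y_forward, y_standard)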
Hi,
I am training Open-Unmix on my dataset (MUSDB stem version + other data) on Nvidia RTX2080 cards, without an SSD and without nb-workers.
My GPU utilization, as reported by "nvidia-smi", is 2%-11% (CUDA is enabled and the script prints GPU usage True, though Torchaudio usage is False).
However, in your description of the training process, you mentioned that your GPU utilization got to 90%.
What is the reason for my low GPU Utilization? Is it related to the fact that torchaudio is not used?
Can you please give an approximation upon the expected range of GPU utilization?
Thank you very much.
Hello,
I am trying to test out the torchfilters branch of this project. It works fine on shorter audio clips, but when the audio file is around 4 to 5 minutes in length, the program crashes with a CudaOutOfMemoryError.
Steps to reproduce the behavior:
Traceback (most recent call last):
File "/home/user/unmix/test.py", line 74, in separate
estimates, model_rate = separator(audio_torch, rate)
File "/home/user/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/home/user/unmix/unmix/filtering.py", line 833, in forward
for sample in range(nb_samples)], dim=0)
File "/home/user/unmix/filtering.py", line 833, in <listcomp>
for sample in range(nb_samples)], dim=0)
File "/home/user/anaconda3/lib/python3.7/site-packages/torchaudio/functional.py", line 130, in istft
onesided, signal_sizes=(n_fft,)) # size (channel, n_frames, n_fft)
RuntimeError: CUDA out of memory. Tried to allocate 454.00 MiB (GPU 0; 7.43 GiB total capacity; 6.02 GiB already allocated; 218.94 MiB free; 690.49 MiB cached)
The program should finish execution on longer files as well. Is there a way to split the audio every one or two minutes, or use an audio loader such that the entire song isn't loaded into CUDA memory at once, so that it doesn't crash?
Thank you!
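One possible workaround, sketched below (not part of the repository; separator stands for the callable used in test.py, chunk boundaries are handled naively and overlap-add is ignored), is to separate fixed-length chunks and concatenate the estimates:

import torch

def separate_in_chunks(separator, audio, rate, chunk_seconds=60):
    # audio: tensor of shape (nb_channels, nb_samples); time is assumed to be the last axis
    chunk_len = int(chunk_seconds * rate)
    estimates = []
    for start in range(0, audio.shape[-1], chunk_len):
        chunk = audio[..., start:start + chunk_len]
        with torch.no_grad():
            est, _ = separator(chunk, rate)
        estimates.append(est.cpu())      # move results off the GPU right away
        if torch.cuda.is_available():
            torch.cuda.empty_cache()     # release cached blocks between chunks
    return torch.cat(estimates, dim=-1)  # concatenate estimates along the time axis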
I am trying to run umx in Windows 10 64 + Anaconda 3.
The installation ("pip install openunmix") seemed to pass without any problem but "umx anyfile.wav" failed:
(base) C:\Users\Vita\audio-separation\open-unmix-2021>umx nakonci.wav
Traceback (most recent call last):
File "c:\users\vita\anaconda3\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "c:\users\vita\anaconda3\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\Vita\Anaconda3\Scripts\umx.exe\__main__.py", line 9, in <module>
File "c:\users\vita\anaconda3\lib\site-packages\openunmix\cli.py", line 118, in separate
torchaudio.set_audio_backend(args.audio_backend)
File "c:\users\vita\anaconda3\lib\site-packages\torchaudio\backend\utils.py", line 44, in set_audio_backend
f'Backend "{backend}" is not one of '
RuntimeError: Backend "sox_io" is not one of available backends: ['soundfile'].
The strange thing was that I got the same error message with the full path to the file, and even with a non-existing file, so it seemed umx could not even load the input file.
Then I noticed this:
Note that we support all files that can be read by torchaudio, depending on the set backend (either soundfile (libsndfile) or sox).
Adding "--audio-backend sox_io" resulted in the same error message, but "--audio-backend soundfile" finally made it work.
Maybe the default setting should change...?
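For scripted use, the same thing can be checked and set programmatically with plain torchaudio calls (nothing umx-specific; this is what the --audio-backend flag ends up calling, as visible in the traceback above):

import torchaudio

print(torchaudio.list_audio_backends())    # e.g. ['soundfile'] on Windows
torchaudio.set_audio_backend("soundfile")  # the backend that worked here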
Hi, I have been waiting for this for a long time!
I would like to report my experiences training on MUSDB18 (Ubuntu 18.04, GTX1080Ti, CUDA10).
"python train.py --root ./musdb18 --target vocals" yielded the following errors.
1: raise InvalidGitRepositoryError(epath)
git.exc.InvalidGitRepositoryError:
My Response: "git init" created bunch of files under .git
2: Then two different error messages: raise ValueError("Reference at %r does not exist" % ref_path)
ValueError: Reference at 'refs/heads/master' does not exist
My Response: As suggested, created .git/refs/heads/master and wrote "ref: refs/heads/master" in the text there
Result: this stopped the error.
3: python train.py --root ./musdb18 finally runs without an error message, but nvidia-smi shows no GPU usage and returns none.
Any suggestions? Thanks!!
When I run train.py with a custom dataset, the dataset doesn't load and I get the error: "IndexError: Cannot choose from an empty sequence". When I print the length of the dataset, I get a non-zero value.
This is the command I use to run the train.py script:
"! python train.py --dataset sourcefolder --root /content/data --target-dir gt--interferer-dirs interfer --ext .wav --nb-train-samples 1000 --nb-valid-samples 100"
I am running the code in Google Colab.
https://github.com/sigsep/open-unmix-pytorch/blob/master/model.py#L187
I think that, unless provided (i.e. no output scaling), output_mean should be torch.zeros() and not torch.ones().
The main page says STL2 isn't included in the comparison because it used additional training data.
According to this page, STL2 (multi-instrument Wave-U-Net) didn't use additional training data. Which one is right? I think the confusion arose because STL1 does use additional data (CCMixter).
cc: @f90
The get_statistics function was designed to iterate over the complete audio data in a deterministic manner, loading the full audio of each item. This doesn't work with the sourcefolder dataset, since that dataset allows files of different lengths and only yields short chunks of fixed length from each item.
There are two ways to make the sourcefolder dataset work with get_statistics:
1. Replace dataset_scaler.seq_duration = None with dataset_scaler.seq_duration = args.seq_dur. That would solve the issue, but the dataset statistics would then only be computed on the first n seconds of each sample.
2. Use stochastic sampling and a dataloader instead of a dataset, e.g.:
def get_statistics(args, dataloader):
    scaler = sklearn.preprocessing.StandardScaler()
    spec = torch.nn.Sequential(
        model.STFT(n_fft=args.nfft, n_hop=args.nhop),
        model.Spectrogram(mono=True)
    )
    pbar = tqdm.tqdm(dataloader, disable=args.quiet)
    for x, y in pbar:
        pbar.set_description("Compute dataset statistics")
        X = spec(x)
        scaler.partial_fit(np.squeeze(X))
    std = np.maximum(
        scaler.scale_,
        1e-4 * np.max(scaler.scale_)
    )
    return scaler.mean_, std

stats_sampler = torch.utils.data.DataLoader(
    train_dataset, batch_size=1,
    sampler=sampler, **dataloader_kwargs
)
The second option would yield better-distributed samples, and users could additionally specify an argument that selects how many samples are randomly drawn to compute the dataset statistics.
Hi Sirs,
I'm new to UMX. I tried to reproduce the fantastic results you report on the website.
I used only the umx vocals-c8df74a5.pth model to run the evaluation (eval.py) on the MUSDB18 test set (50 songs).
Here is my result:

UMX1  accompaniment  ISR 18.950225  SAR 12.290675  SDR 11.881972  SIR 20.425005
UMX1  vocals         ISR 14.368638  SAR  5.715235  SDR  5.567850  SIR 12.480217
The SDR of vocals is 5.567850, which is much worse than your result of 6.32.
May I know how to reproduce your result?
What are the musdb/museval version you use?
I use musdb 0.3.1, museval 0.3.0.
I also plotted the boxplot, and it is just a little bit better than the Wave-U-Net 44 kHz pre-trained model.
I'm wondering what I did wrong.
Hope to receive your response.
Thanks in advance.
mstfc
I have MP3 files at 128 kb/s like
Metadata:
encoder : Lavf58.20.100
Duration: 00:02:22.11, start: 0.025057, bitrate: 128 kb/s
Stream #0:0: Audio: mp3, 44100 Hz, stereo, fltp, 128 kb/s
Metadata:
encoder : Lavc58.35
and I therefore convert to wav so that I get a 22050 Hz file, as per the dataset specification:
ffmpeg -i file.mp3 -acodec pcm_s16le -ar 22050 file.wav
Metadata:
encoder : Lavf58.12.100
Duration: 00:02:22.06, bitrate: 705 kb/s
Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 22050 Hz, 2 channels, s16, 705 kb/s
The separation works; my question is whether this is the best approach given the mp3 sample rate and bit rate.
Thank you.
First of all, I have to acknowledge the authors of open-unmix for this obviously awesome work ;)
My issue is about the sourcefolder dataset, which cannot handle sources with different numbers of channels. Let's assume that we have two folders of sources: the first one contains mono signals, the second one stereo signals. For training, we also set nb_channels to 1. In __getitem__ of SourceFolderDataset, an error is raised when trying to stack the sources, before summing them to create the mixture (line 358 of data.py).
Steps to reproduce the behavior:
Set nb-channels to 1; below is the command I used:
python train.py --root ./data-sourcefolder --dataset sourcefolder --interferer-dirs noise --target-dir speech --nb-train-samples 20000 --nb-valid-samples 2000 --seq-dur 2.0 --source-augmentations gain --hidden-size 256 --nb-channels 1 --nfft 1024 --nhop 256 --nb-workers 4
Traceback (most recent call last):
File "train.py", line 294, in <module>
main()
File "train.py", line 177, in main
scaler_mean, scaler_std = get_statistics(args, train_dataset)
File "train.py", line 68, in get_statistics
x, y = dataset_scaler[ind]
File "/data/recherche/python/speech_enhancement/open-unmix-pytorch/data.py", line 385, in __getitem__
stems = torch.stack(audio_sources)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 2 and 1 in dimension 1 at /tmp/pip-req-build-58y_cjjl/aten/src/TH/generic/THTensor.cpp:689
We could expect that, because nb_channels is set to 1, stereo source signals would be downmixed so that they could be mixed with the monophonic sources.
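One possible workaround (only a sketch of the idea, not an existing patch) would be to adapt each loaded source to the requested number of channels before stacking, e.g. with a small helper used inside __getitem__:

import torch

def match_channels(audio, nb_channels):
    # audio: tensor of shape (nb_channels_in, nb_samples) as returned by the loader
    # nb_channels: the value requested via the training arguments (hypothetical plumbing)
    if nb_channels == 1 and audio.shape[0] > 1:
        return audio.mean(dim=0, keepdim=True)  # downmix stereo to mono
    if nb_channels == 2 and audio.shape[0] == 1:
        return audio.repeat(2, 1)               # duplicate mono to pseudo-stereo
    return audio

# audio_sources = [match_channels(a, nb_channels) for a in audio_sources]
# stems = torch.stack(audio_sources)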
It seems that because of this line: https://github.com/sigsep/open-unmix-pytorch/blob/master/openunmix/filtering.py#L301, where the dtype is neither provided nor fixed, trying to use Wiener filtering with double precision fails.
File "/Users/defossez/projs/demucs/env/lib/python3.8/site-packages/openunmix/filtering.py", line 472, in wiener
y = expectation_maximization(y, mix_stft, iterations, eps=eps)[0]
File "/Users/defossez/projs/demucs/env/lib/python3.8/site-packages/openunmix/filtering.py", line 301, in expectation_maximization
y[t, ...] = torch.tensor(0.0, device=x.device)
RuntimeError: Index put requires the source and destination dtypes match, got Double for the destination and Float for the source.
To reproduce: call the wiener filtering function with a magnitude that is float64 and a complex mixture spectrogram that is complex128.
The call is expected to succeed with high-precision inputs.
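A sketch of the kind of one-line fix this suggests (the exact line may differ between versions) is to create the zero with the destination's dtype, or to assign a plain scalar:

# in expectation_maximization, when zeroing frames:
y[t, ...] = torch.tensor(0.0, device=x.device, dtype=y.dtype)
# or, avoiding the intermediate tensor altogether:
y[t, ...] = 0.0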
PyTorch version: 1.9.0
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 11.4 (x86_64)
GCC version: Could not collect
Clang version: 11.0.0
CMake version: version 3.19.1
Libc version: N/A
Python version: 3.8.8 (default, Feb 24 2021, 13:46:16) [Clang 10.0.0 ] (64-bit runtime)
Python platform: macOS-10.16-x86_64-i386-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.20.2
[pip3] torch==1.9.0
[pip3] torchaudio==0.9.0
[pip3] torchvision==0.9.1
[conda] blas 1.0 mkl
[conda] mkl 2019.4 233
[conda] mkl-service 2.3.0 py38h9ed2024_0
[conda] mkl_fft 1.2.0 py38hc64f4ea_0
[conda] mkl_random 1.1.1 py38h959d312_0
[conda] numpy 1.20.2 pypi_0 pypi
[conda] torch 1.8.1 pypi_0 pypi
[conda] torchaudio 0.8.1 pypi_0 pypi
[conda] torchvision 0.9.1 pypi_0 pypi
It seems that there is an interest to use just the pre-trained weights from open-unmix. To improve usability we will make open-unmix a pypi (and possibly conda-forge) package.
$ python3 train.py --dataset musdb
Using GPU: True
Using Torchaudio: False
Traceback (most recent call last):
File "train.py", line 294, in
main()
File "train.py", line 158, in main
train_dataset, valid_dataset, args = data.load_datasets(parser, args)
File "/home/scss/DeepLearning/VocalEX/Separation/open-unmix-pytorch/open-unmix-pytorch/data.py", line 226, in load_datasets
**dataset_kwargs
File "/home/scss/DeepLearning/VocalEX/Separation/open-unmix-pytorch/open-unmix-pytorch/data.py", line 751, in init
*args, **kwargs
TypeError: init() got an unexpected keyword argument 'root'
Steps to reproduce the behavior:
1. python3 train.py --root (Datasets)
I'm interested in implementing a real-time, streaming version of the separation method.
Do you have any advice on how to extract the model weights for this?
Would it be best to retrain, and save the weights during training?
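Retraining just to obtain the weights should not be necessary: the published checkpoints can be pulled through torch.hub and their state dict saved locally (a sketch, assuming the umxhq entry point of the sigsep/open-unmix-pytorch hub config):

import torch

# downloads the pretrained separator (vocals, drums, bass, other)
separator = torch.hub.load("sigsep/open-unmix-pytorch", "umxhq")
# raw weights that a streaming re-implementation could load
torch.save(separator.state_dict(), "umxhq_weights.pth")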
I'm actually going to use it in another script, but there's some pre-processing before the separate function gets called in test.py (the part after if __name__ == '__main__').
I wrapped it up in a whole function and was wondering if that's a good approach?
Like,
def main(input_files, samplerate, niter, alpha, softmask, residual_model, model,
targets=('vocals', 'drums', 'bass', 'other'), outdir=None, no_cuda=False):
and then at the end call it by
main(args.input, args.samplerate, args.niter, args.alpha, args.softmask, args.residual_model, args.model, args.targets, args.outdir, args.no_cuda)
This doesn't change the cli functionality but allows me to import the main function for external use.
Currently, using train.py --nb-channels 1 will apply a downmix in the spectral domain inside the model, so that only single-channel audio is fed in.
However, I can think of applications where we do not have access to a Wiener filter and therefore apply the model to each channel individually. In that case, performance might be better if the model were trained on just the left or the right channel. This can be fixed since we use channel-swap augmentation.
What are the random seeds you used for the different targets?
Hello,
is the dataset available as audio tracks for other projects? If so, under what terms?
Best regards
Your training docs mention that an aligned dataset can be used for Bandwidth Extension (Low Bandwidth -> High Bandwidth) as mentioned here.
Previously I have trained models for Source Separation (Mixture -> Target) and Denoising (Noisy -> Clean) and they're working as intended.
But training for Bandwidth Extension doesn't provide any noticeable enhancement at all; this is the output spectrogram and this is how it's supposed to look.
At first I thought I made a mistake converting the source from 22050 Hz to 44100 Hz, so I tried training with 22050 Hz and 48000 Hz files directly, but it throws this error:
Using GPU: True
Using Torchaudio: True
16748it [00:21, 795.79it/s]
15it [00:00, 657.46it/s]
Compute dataset statistics: 100%|███████| 16416/16416 [1:46:10<00:00, 2.58it/s]
Training Epoch: 0%| | 0/1000 [00:00<?, ?it/s]
train.py:31: UserWarning: Using a target size (torch.Size([278, 16, 1, 2049])) that is different to the input size (torch.Size([126, 16, 1, 2049])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
loss = torch.nn.functional.mse_loss(Y_hat, Y)
Training batch: 0%| | 0/1026 [00:01<?, ?it/s]
Training Epoch: 0%| | 0/1000 [00:01<?, ?it/s]
Traceback (most recent call last):
File "train.py", line 295, in <module>
main()
File "train.py", line 244, in main
train_loss = train(args, unmix, device, train_sampler, optimizer)
File "train.py", line 31, in train
loss = torch.nn.functional.mse_loss(Y_hat, Y)
File "/home/mgt/anaconda3/envs/basepy37/lib/python3.7/site-packages/torch/nn/functional.py", line 2203, in mse_loss
expanded_input, expanded_target = torch.broadcast_tensors(input, target)
File "/home/mgt/anaconda3/envs/basepy37/lib/python3.7/site-packages/torch/functional.py", line 52, in broadcast_tensors
return torch._C._VariableFunctions.broadcast_tensors(tensors)
RuntimeError: The size of tensor a (126) must match the size of tensor b (278) at non-singleton dimension 0
As I suspected, it's expecting same-size files, but then what am I supposed to do?
Could you please illustrate how to train this for Bandwidth Extension? I had zero trouble with Source Separation and Denoising tasks, so I expected this to work the same way.
This is genuinely driving me crazy, I REALLY need this, any help is much appreciated, thanks!
We made a mistake, like a lot of other repositories, in the data augmentation engine:
https://tanelp.github.io/posts/a-bug-that-plagues-thousands-of-open-source-ml-projects/
Setting nb_workers to a value higher than 1 results in identical random seeds in each worker.
Fix is given in the blog post. And also here: pytorch/pytorch#5059 (comment)
Possibly also happens in asteroid musdb18 dataset code
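The usual fix from the linked post, sketched here for reference (not a verbatim copy of the patch applied in this repository), is a worker_init_fn that re-seeds numpy in every DataLoader worker:

import numpy as np
import torch

def worker_init_fn(worker_id):
    # torch already seeds each worker differently; reuse that seed for numpy's global RNG
    np.random.seed(torch.initial_seed() % 2**32)

# then pass worker_init_fn=worker_init_fn when constructing the training DataLoader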
Hey OpenUnmixers!
I'm excited about all of the great work you've been doing! Congrats on the latest releases! :D
I just wanted to point out that two links under the News Section of your README are 404'ing:
The 14/02/2021 update goes to https://github.com/sigsep/open-unmix-pytorch/blob/master, which gives me a GitHub 404.
Thanks!
Ethan
Travis doesn't support MKL out of the box, which is why the unit tests currently fail. See the issue here. A workaround seems to be installing the Intel MKL library through apt.
As proposed in #91, Lightning has matured enough to be used to refactor open-unmix.
Hi sirs,
Sorry to bother.
This is not a bug, but I don't know whom I can ask.
I have a question about using istft() in test.py.
def istft(X, rate=44100, n_fft=4096, n_hopsize=1024):
    t, audio = scipy.signal.istft(
        X / (n_fft / 2),
        rate,
        nperseg=n_fft,
        noverlap=n_fft - n_hopsize,
        boundary=True
    )
    return audio
Why does the input data "X" need to be divided by "(n_fft / 2)" ?
What is the purpose of it?
Thanks for your help.
mstfc
It seems to be relatively easy to run out of memory on the GPU with the first example provided in the README. Maybe it would be nice to add some hardware requirements or an estimate of how much memory is needed per second of input signal.
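As a rough illustration of such an estimate (back-of-the-envelope only; it ignores norbert's EM covariance buffers, which are considerably larger than the spectrograms themselves):

# rough memory estimate for the complex STFTs alone, per second of stereo input
sr, n_fft, hop, channels, n_targets = 44100, 4096, 1024, 2, 4
frames_per_second = sr / hop            # ~43 frames per second
bins = n_fft // 2 + 1                   # 2049 frequency bins
bytes_per_value = 8                     # complex64
mb_per_second = frames_per_second * bins * channels * bytes_per_value / 2**20
print(f"{mb_per_second:.1f} MB/s per spectrogram")                             # ~1.3 MB/s
print(f"{mb_per_second * (n_targets + 1):.1f} MB/s for mixture + 4 targets")   # ~6.7 MB/s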
Steps to reproduce the behavior:
>>> python test.py ~/data/musdb18-wav/test/Al\ James\ -\ Schoolboy\ Facination/mixture.wav --model umxhq
Traceback (most recent call last):
File "test.py", line 301, in <module>
device=device
File "test.py", line 166, in separate
use_softmask=softmask)
File "/home/audeering.local/hwierstorf/.anaconda3/envs/open-unmix-pytorch-gpu/lib/python3.7/site-packages/norbert/__init__.py", line 260, in wiener
y = expectation_maximization(y/max_abs, x_scaled, iterations, eps=eps)[0]
File "/home/audeering.local/hwierstorf/.anaconda3/envs/open-unmix-pytorch-gpu/lib/python3.7/site-packages/norbert/__init__.py", line 141, in expectation_maximization
eps)
File "/home/audeering.local/hwierstorf/.anaconda3/envs/open-unmix-pytorch-gpu/lib/python3.7/site-packages/norbert/__init__.py", line 511, in get_local_gaussian_model
C_j = _covariance(y_j)
File "/home/audeering.local/hwierstorf/.anaconda3/envs/open-unmix-pytorch-gpu/lib/python3.7/site-packages/norbert/__init__.py", line 468, in _covariance
y_j.dtype)
MemoryError
PyTorch version: 1.2.0
Is debug build: No
CUDA used to build PyTorch: 10.0.130
OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.10.2
Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: GeForce GTX 1050
Nvidia driver version: 430.40
cuDNN version: Could not collect
Versions of relevant libraries:
[pip3] numpy==1.13.3
[conda] mkl 2019.4 243
[conda] pytorch 1.2.0 py3.7_cuda10.0.130_cudnn7.6.2_0 pytorch
Steps to reproduce the behavior:
> docker run -v ~/Music/:/data -it faroit/open-unmix-pytorch umx "/data/track1.wav" --outdir /data/track1
Using cpu
Downloading: "https://zenodo.org/api/files/1c8f83c5-33a5-4f59-b109-721fdd234875/vocals-b62c91ce.pth" to /root/.cache/torch/hub/checkpoints/vocals-b62c91ce.pth
100%|████████████████████| 34.0M/34.0M [00:04<00:00, 8.69MB/s]
Downloading: "https://zenodo.org/api/files/1c8f83c5-33a5-4f59-b109-721fdd234875/drums-9619578f.pth" to /root/.cache/torch/hub/checkpoints/drums-9619578f.pth
100%|████████████████████| 34.0M/34.0M [00:04<00:00, 8.30MB/s]
Downloading: "https://zenodo.org/api/files/1c8f83c5-33a5-4f59-b109-721fdd234875/bass-8d85a5bd.pth" to /root/.cache/torch/hub/checkpoints/bass-8d85a5bd.pth
100%|████████████████████| 34.0M/34.0M [00:05<00:00, 6.65MB/s]
Downloading: "https://zenodo.org/api/files/1c8f83c5-33a5-4f59-b109-721fdd234875/other-b52fbbf7.pth" to /root/.cache/torch/hub/checkpoints/other-b52fbbf7.pth
100%|████████████████████| 34.0M/34.0M [00:04<00:00, 8.02MB/s]
formats: can't open input file `umx': No such file or directory
Traceback (most recent call last):
File "/opt/conda/bin/umx", line 8, in
sys.exit(separate())
File "/opt/conda/lib/python3.8/site-packages/openunmix/cli.py", line 160, in separate
audio, rate = data.load_audio(input_file, start=args.start, dur=args.duration)
File "/opt/conda/lib/python3.8/site-packages/openunmix/data.py", line 58, in load_audio
sig, rate = torchaudio.load(path)
File "/opt/conda/lib/python3.8/site-packages/torchaudio/backend/sox_io_backend.py", line 152, in load
return torch.ops.torchaudio.sox_io_load_audio_file(
RuntimeError: Error loading audio file: failed to open file umx
I can see it complaining about not loading an audio file (at least while I figure out how the syntax applies to Windows and Docker), but it seems to be complaining about loading umx, so perhaps there is a typo in the docker command?
Docker on Windows 10.
Please add some information about your environment
This stuff shouldn't be relevant for docker, yeah?
When I run "python openunmix/evaluate.py --root /local/musdb18 --targets vocals --model my_model --residual acc", the following appears:
Traceback (most recent call last):
File "openunmix/evaluate.py", line 260, in
results.add_track(scores)
File "/home/bianyuren/anaconda3/envs/umx-gpu-pytorch_1_8/lib/python3.8/site-packages/museval/aggregate.py", line 183, in add_track
self.df = self.df.append(track.df, ignore_index=True)
File "/home/bianyuren/anaconda3/envs/umx-gpu-pytorch_1_8/lib/python3.8/site-packages/museval/aggregate.py", line 113, in df
return json2df(simplejson.loads(self.json), self.track_name)
File "/home/bianyuren/anaconda3/envs/umx-gpu-pytorch_1_8/lib/python3.8/site-packages/museval/aggregate.py", line 413, in json2df
df = pd.melt(
File "/home/bianyuren/anaconda3/envs/umx-gpu-pytorch_1_8/lib/python3.8/site-packages/pandas/core/reshape/melt.py", line 64, in melt
raise KeyError(
KeyError: "The following 'id_vars' are not present in the DataFrame: ['name', 'time']"
Better performance == more fun ;-)
In the vocal/accompaniment scenario, separating with --niter 0 --residual only gets to 3.9 dB SDR for vocals, whereas with --niter 1 the scores go up to 6.0.
The scores without Wiener filtering should only be slightly worse than with it.
Are there any installation instructions yet for the brand-new version with the larger training table?
Thanks, Rog
I'm trying to use my own data for training with the FixedSourcesTrackFolderDataset. Unfortunately, the data doesn't seem to be recognized or found by the dataloader. I am using normal wav files (not stems), organized in the following folder structure:
dataset
    valid
        0
            guitar.wav
            piano.wav
            ...
        1
        2
        ...
    train
        3
            guitar.wav
            piano.wav
            ...
        4
        5
        ...
Next, I issue:
python train.py --root /path/to/dataset --dataset trackfolder_fix --target-file piano.wav --interferer-files cello.wav guitar.wav hi-hat.wav
The following error is thrown:
ValueError: num_samples should be a positive integer value, but got num_samples=0
Is this a problem with the format (folder structure) in which the data is provided? From reading the documentation I can't figure out if a Pytorch dataclass has to be created beforehand or not. If so, how does that fit into the folder structure?
A possible memory leak produces an OOM error on a minimal memory allocation while using the separate function in an iterative manner.
Get a set of 100 normal-sized mp3 files and do this:
Steps to reproduce the behavior:
for filename in files:
    result = separate_music_file(
        filename,
        'cpu',
        ['vocals'],
        # etc
    )
    print(result)
If the mp3 file and the model can fit in memory, I expect it to finish without error.
If they can't fit in memory, I expect it to fail in a consistent way.
The error message says it fails to allocate 6,000,000 bytes (~6 MB), so it doesn't look like the mp3 file size is the problem. Also, I tried a "split & retry" mechanism, and the file size doesn't really matter; the program fails after some iterations with any input size.
I think there could be a memory leak.
I'm still testing some changes, for example adding torch.no_grad and clearing caches between iterations, but with no luck so far.
I'll keep you updated.
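For reference, the pattern being tested looks roughly like this (separate_music_file and files are the caller's own objects from the snippet above, not part of openunmix):

import gc
import torch

for filename in files:
    with torch.no_grad():  # avoid keeping autograd graphs alive across iterations
        result = separate_music_file(filename, 'cpu', ['vocals'])
    print(result)
    del result
    gc.collect()  # drop Python-side references before the next file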
I find the concept here very interesting, because it illustrates quite well the possibilities of today. But the last few days I've been thinking about something that I'd just like to ask because I don't have the background knowledge. I hope that you as experts can give me some information. Here the thoughts are summarized:
Let's assume the following case: I have two signals that I want to separate from each other. In this case it is normal speech and music (no singing, just speech). I could now take a standard model (like this one) and train it on it. So far so good. But unlike the normal "music" separation I have some other problems and challenges. Let's assume that the music is a music bed which serves as a base. This can be talked over by many different people and can be used in many ways. In radio/broadcasting, for example, this is part of everyday life. There is no correlation between voice and music. For these reasons I have more than one version available from the music source (more than 100 or even thousand times), which means that the music bed may have been talked over by many people. But the background music is always exactly the same in all cases.
Now to my question or assumption: Is it possible to teach a neural network to extract only the "similar" or "same" signals that are present in each file? My idea would be that you could simply extract the music bed, because it is present in all recordings. Only the volume is not always the same but the content itself is.
Is this a purely theoretical scenario, or could you build something like this? If so, how much effort do you think experienced people would need to spend on it? How would you teach a network to do that?
Sorry it's a little off topic. But I would simply be interested in the opinion here. Is this just a fantasy, or can something like this actually be implemented with the available resources?
The current training code does provide a way to fine-tune models given a checkpoint file. However, there is currently no way to start training from the umx or umxhq pretrained models.
class Spectrogram(nn.Module):
    def __init__(
        self,
        power=1,
        mono=True
    ):
        super(Spectrogram, self).__init__()
        self.power = power
        self.mono = mono

    def forward(self, stft_f):
        """
        Input: complex STFT
            (nb_samples, nb_bins, nb_frames, 2)
        Output: Power/Mag Spectrogram
            (nb_frames, nb_samples, nb_channels, nb_bins)
        """
        stft_f = stft_f.transpose(2, 3)
        # take the magnitude
        stft_f = stft_f.pow(2).sum(-1).pow(self.power / 2.0)
        # downmix in the mag domain
        if self.mono:
            stft_f = torch.mean(stft_f, 1, keepdim=True)
        # permute output for LSTM convenience
        return stft_f.permute(2, 0, 1, 3)
The input shape should actually be (nb_samples, nb_channels, nb_bins, nb_frames, 2). The docstring as written is confusing.
Hello,
I've been interested in running various oracle benchmark methods to check if different types of spectrogram (CQT, etc.) can be useful for source separation.
Initially, I was working with the IRM1/2 and IBM1/2 from https://github.com/sigsep/sigsep-mus-oracle
However I noticed that Open-Unmix uses the strategy of "estimate of source magnitude + phase of original mix" (but it has an option to use soft masking instead). Is it valuable to create an "oracle phase-inversion" method?
So, the soft mask/IRM1 performance "ceiling" (the known IRM1 oracle mask calculation) looks like this (using the vocals stem as an example):
mix = <load mix> # mixed track
vocals_gt = <load vocals stem> # ground truth
vocals_irm1 = abs(stft(vocals_gt)) / abs(stft(mix))
vocals_est = istft(vocals_irm1 * stft(mix)) # estimate after "round trip" through soft mask
Now, for the phase inversion method, we could do the following:
mix = <load mix> # mixed track
vocals_gt = <load vocals stem> # ground truth
mix_phase = phase(stft(mix))
vocals_gt_magnitude = abs(stft(vocals_gt))
vocals_stft = pol2cart(vocals_gt_magnitude, mix_phase)
vocals_est = istft(vocals_stft) # estimate after "round trip" through phase inversion
Does this make sense to do? Has anybody done this before? What could this method be called?
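For reference, a minimal runnable sketch of such a phase-inversion oracle using scipy.signal (file paths and STFT parameters are placeholders; mixture and stem are assumed to have the same length and sample rate):

import numpy as np
import soundfile as sf
from scipy import signal

def phase_oracle(mix_path, vocals_path, n_fft=4096, hop=1024):
    mix, sr = sf.read(mix_path, always_2d=True)       # (nb_samples, nb_channels)
    vocals, _ = sf.read(vocals_path, always_2d=True)

    # STFTs with channels first, shape (nb_channels, nb_bins, nb_frames)
    _, _, mix_stft = signal.stft(mix.T, sr, nperseg=n_fft, noverlap=n_fft - hop)
    _, _, voc_stft = signal.stft(vocals.T, sr, nperseg=n_fft, noverlap=n_fft - hop)

    # oracle magnitude of the target combined with the phase of the mixture
    est_stft = np.abs(voc_stft) * np.exp(1j * np.angle(mix_stft))

    _, est = signal.istft(est_stft, sr, nperseg=n_fft, noverlap=n_fft - hop)
    return est.T, sr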
Hello, again.
@sigsep:
Sorry to bother you, but I seem to have made another novice mistake training on a "sourcefolder" dataset.
Specifically, I am using DCASE2013_subtask2/singlesounds_stereo, which has 320 wav files containing 16 classes of environmental noises (alert, clearthroat, cough, etc., 20 files each). I separated them into different folders according to the noise labels (./DCASE2013 (as root)/train/alert/alert01.wav, alert02.wav, etc.).
When I tried the following command, the error occurred.
Command: python train.py --dataset sourcefolder --root ./DCASE2013 --target-dir alert --interferer-dirs clearthroat cough
Error message:
Using GPU: True
100%|█████████████████████| 3/3 [00:00<00:00, 36.16it/s]
100%|█████████████████████| 3/3 [00:00<00:00, 10745.44it/s]
0%| | 0/1000 [00:00<?, ?it/s]
Traceback (most recent call last):
File "train.py", line 291, in
main()
File "train.py", line 174, in main
scaler_mean, scaler_std = get_statistics(args, train_dataset)
File "train.py", line 66, in get_statistics
x, y = dataset_scaler[ind]
File "/xxx/open-unmix-pytorch/data.py", line 367, in getitem
source_path = random.choice(self.source_tracks[source])
File "/xxx/anaconda3/envs/open-unmix-pytorch-gpu/lib/python3.7/random.py", line 261, in choice
raise IndexError('Cannot choose from an empty sequence') from None
IndexError: Cannot choose from an empty sequence
Am I missing something? Looks like it does not find the training files.
If the model hasn't been manually downloaded, the default umxhq model gets downloaded from torch hub every time.
If it is downloaded, it should be saved so that it doesn't have to be downloaded every time. torch.hub.set_dir could perhaps be set to model_path to save the model there (I haven't tried yet).
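A minimal sketch of that idea (model_path is whatever persistent directory the weights should live in; untested, as noted above):

import torch

# point the hub cache at a persistent location before the first download
torch.hub.set_dir("/path/to/model_path")

# later loads reuse the cached checkpoint instead of re-downloading
separator = torch.hub.load("sigsep/open-unmix-pytorch", "umxhq")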
Hello,
I believe the true input shape of OpenUnmix (the spectrogram model, not the on-the-fly waveform one) is this, taken from the code:
(nb_samples, nb_channels, nb_bins, nb_frames)
This corresponds to the (I, F, T) layout that I've seen in the oracle code (I = channels, F = frequency bins, T = time frames).
The README describes the shape in a different order:
models.OpenUnmix: The core open-unmix takes magnitude spectrograms directly (e.g. when pre-computed and loaded from disk). In that case, the input is of shape (nb_frames, nb_samples, nb_channels, nb_bins)
Facebook has just announced PyTorch Mobile for both iOS and Android devices in PyTorch 1.3. It ships new quantization back ends (FBGEMM and QNNPACK, state-of-the-art quantized kernels) for this mobile version.
Having a quantized model running on the device would be an interesting challenge.
It would be interesting to try the quantization of the model to make it ready to run on the device.
For more info about PyTorch 1.3 here
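As a first experiment, dynamic quantization of the Linear and LSTM layers can be tried directly in PyTorch (a sketch only, assuming the umxhq torch.hub entry point; this is not an official mobile recipe for open-unmix):

import torch

separator = torch.hub.load("sigsep/open-unmix-pytorch", "umxhq")
quantized = torch.quantization.quantize_dynamic(
    separator,
    {torch.nn.Linear, torch.nn.LSTM},  # the layers dominating model size
    dtype=torch.qint8,
)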
I've tried numerous times to "Request access" from zenodo.org but they ignore my requests. My guess is you have to be an RIAA member to get access. Is there someplace to actually get this data? I have a good amount of separate track music I would like to augment MUSDB18-HQ with.
Hi, I just wanted to know if it's possible to use Open-Unmix on android via Pytorch, I know there is usage of Pytorch on Android for image processing but I haven't found any examples to help me use Open-Unmix on android.
My system: Ubuntu 16.04, one GTX1080Ti, CUDA9, 24 core CPU
When training MUSDB18 using the default unmix model, there are 544 batches; the iteration time of one batch is about 21 seconds, so the training time for all 544 batches is 21*544 s = 11424 s ≈ 3.1 hours, which is very slow.
PS: my training script is:
python train.py --root path/to/musdb18 --target vocals
I suspect that there is something wrong with my training process. What is your training time on MUSDB18? Thanks
Hello,
I want to transfer-learn from the default model on a dataset of my own (or just train on my own dataset from scratch if that is easier). I have mixture.wav files and the wav files for the individual instruments as well. I want to be able to separate everything in the song. I have some questions, though:
What dataset type should I use for this application?
Do each individual song's wav files need to be the same length, or do all songs in the dataset (and their wav files) need to be the same length? Basically, can different songs be different lengths?
I'm wondering this because I was messing around with train.py and got the error NotImplementedError: Non-relative patterns are unsupported.
Would I get better single-instrument performance if I used the aligned ("denoising") dataset since it would be just focusing on the target sound and the noise? For example, if I just wanted to separate the bass from a song.
Also is there a Colab notebook that is set up to train?
Sorry if this doesn't make sense; I tried to make it as clear as possible.
Dear Sir or Madam,
Hello. First of all, thank you for sharing your work.
I ran your code only for separating 'vocals', according to your .md file. I get the aggregated scores
vocals ==> SDR: 5.415 SIR: 10.950 ISR: 14.831 SAR: 5.533, which is quite different from the ideal result.
Could you please tell me what went wrong? What should I do to reproduce your results? By the way, I use the musdb18 dataset with --is-wav.
Thank you very much.
I'm trying to train open-unmix from scratch. The validation losses after early stopping patience are not as good as what's shown in training.md: https://github.com/sigsep/open-unmix-pytorch/blob/master/docs/training.md
I'm using the exact open-unmix-pytorch codebase with no modifications. My training script is:
for target in drums vocals other bass; do
    python scripts/train.py \
        --root=~/MUSDB18-HQ/ --is-wav --nb-workers=4 --batch-size=16 --epochs=1000 \
        --target="$target" \
        --output="umx-baseline"
done
So far, drums and vocals have trained to the following lowest validation loss:
Drums: 0.93 (compared to the 0.7 claimed in training.md)
Vocals: 1.1 (compared to the 0.992 claimed in training.md)
These aren't huge differences, but I'm wondering if there's any explanation. Is it the random seed that allowed your drum model to get as far down as 0.7?