sigsep / open-unmix-pytorch
Open-Unmix - Music Source Separation for PyTorch
Home Page: https://sigsep.github.io/open-unmix/
License: MIT License
Hi,
Sorry for the bother. I have a basic question about the input-stage standardization.
After the STFT transform the model does:
x += self.input_mean
x *= self.input_scale
and the same, in the opposite order, at the output stage:
x *= self.output_scale
x += self.output_mean
I'm wondering about the input-stage part: if we want to normalize the spectrogram to zero mean and unit standard deviation, don't we need to subtract the mean from the samples and divide by the std, like this:
x -= self.input_mean
x /= self.input_scale
Any help will be very appreciated.
Thank you!
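For reference, a minimal sketch of why the two formulations can agree, under the assumption that the model registers the negated mean and the reciprocal standard deviation in those buffers (all names below are illustrative, not taken from model.py):

import torch

# hypothetical per-bin statistics computed on the training set
mean = torch.randn(2049)
std = torch.rand(2049) + 0.1

# if the buffers are registered as the negated mean and the reciprocal std ...
input_mean = -mean
input_scale = 1.0 / std

x = torch.randn(2049)  # stand-in for one spectrogram frame
# ... then the add/multiply used at the input stage
y_forward = (x + input_mean) * input_scale
# ... matches the usual standardization
y_standard = (x - mean) / std
assert torch.allclose(y_forward, y_standard)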
Hi,
I am training Open-Unmix on my dataset (MUSDB stem version + other data) on Nvidia RTX2080 cards, without an SSD and without nb-workers.
My GPU utilization, as reported by "nvidia-smi", is 2%-11% (CUDA is enabled and the script prints GPU usage True, though Torchaudio usage is False).
However, in your description of the training process, you mentioned that your GPU utilization got to 90%.
What is the reason for my low GPU Utilization? Is it related to the fact that torchaudio is not used?
Can you please give an approximation upon the expected range of GPU utilization?
Thank you very much.
Hello,
I am trying to test out the torchfilters branch of this project. It works fine on shorter audio clips, but when the audio file is around 4 to 5 minutes in length, the program crashes with a CudaOutOfMemoryError.
Steps to reproduce the behavior:
Traceback (most recent call last):
File "/home/user/unmix/test.py", line 74, in separate
estimates, model_rate = separator(audio_torch, rate)
File "/home/user/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/home/user/unmix/unmix/filtering.py", line 833, in forward
for sample in range(nb_samples)], dim=0)
File "/home/user/unmix/filtering.py", line 833, in <listcomp>
for sample in range(nb_samples)], dim=0)
File "/home/user/anaconda3/lib/python3.7/site-packages/torchaudio/functional.py", line 130, in istft
onesided, signal_sizes=(n_fft,)) # size (channel, n_frames, n_fft)
RuntimeError: CUDA out of memory. Tried to allocate 454.00 MiB (GPU 0; 7.43 GiB total capacity; 6.02 GiB already allocated; 218.94 MiB free; 690.49 MiB cached)
The program should finish execution on longer files as well. Is there a way to split the audio every one or two minutes, or use an audio loader such that the entire song isn't loaded into CUDA memory at once, so that it doesn't crash?
Thank you!
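One possible workaround, sketched below (not part of the repository; separator stands for the callable used in test.py, chunk boundaries are handled naively and overlap-add is ignored), is to separate fixed-length chunks and concatenate the estimates:

import torch

def separate_in_chunks(separator, audio, rate, chunk_seconds=60):
    # audio: tensor of shape (nb_channels, nb_samples); time is assumed to be the last axis
    chunk_len = int(chunk_seconds * rate)
    estimates = []
    for start in range(0, audio.shape[-1], chunk_len):
        chunk = audio[..., start:start + chunk_len]
        with torch.no_grad():
            est, _ = separator(chunk, rate)
        estimates.append(est.cpu())      # move results off the GPU right away
        if torch.cuda.is_available():
            torch.cuda.empty_cache()     # release cached blocks between chunks
    return torch.cat(estimates, dim=-1)  # concatenate estimates along the time axis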
I am trying to run umx in Windows 10 64 + Anaconda 3.
The installation ("pip install openunmix") seemed to pass without any problem but "umx anyfile.wav" failed:
(base) C:\Users\Vita\audio-separation\open-unmix-2021>umx nakonci.wav
Traceback (most recent call last):
File "c:\users\vita\anaconda3\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "c:\users\vita\anaconda3\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\Vita\Anaconda3\Scripts\umx.exe\__main__.py", line 9, in <module>
File "c:\users\vita\anaconda3\lib\site-packages\openunmix\cli.py", line 118, in separate
torchaudio.set_audio_backend(args.audio_backend)
File "c:\users\vita\anaconda3\lib\site-packages\torchaudio\backend\utils.py", line 44, in set_audio_backend
f'Backend "{backend}" is not one of '
RuntimeError: Backend "sox_io" is not one of available backends: ['soundfile'].
The strange thing was that I got the same error message with the full path to the file, and even with a non-existing file, so it seemed umx could not even load the input file.
Then I noticed this:
Note that we support all files that can be read by torchaudio, depending on the set backend (either soundfile (libsndfile) or sox).
Adding "--audio-backend sox_io" resulted in the same error message, but "--audio-backend soundfile" finally made it work.
Maybe the default setting should change...?
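For scripted use, the same thing can be checked and set programmatically with plain torchaudio calls (nothing umx-specific; this is what the --audio-backend flag ends up calling, as visible in the traceback above):

import torchaudio

print(torchaudio.list_audio_backends())    # e.g. ['soundfile'] on Windows
torchaudio.set_audio_backend("soundfile")  # the backend that worked here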
Hi, I have been waiting for this for a long time!
I would like to report my experiences training on MUSDB18 (Ubuntu 18.04, GTX1080Ti, CUDA10).
"python train.py --root ./musdb18 --target vocals" yielded the following errors.
1: raise InvalidGitRepositoryError(epath)
git.exc.InvalidGitRepositoryError:
My Response: "git init" created bunch of files under .git
2: Then two different error messages: raise ValueError("Reference at %r does not exist" % ref_path)
ValueError: Reference at 'refs/heads/master' does not exist
My Response: As suggested, created .git/refs/heads/master and wrote "ref: refs/heads/master" in the text there
Result: this stopped the error.
3: python train.py --root ./musdb18 finally runs without an error message, but nvidia-smi shows no GPU usage and returns none.
Any suggestions? Thanks!!
When I run train.py with a custom dataset, the dataset doesn't load and I get the error: "IndexError: Cannot choose from an empty sequence". When I print the length of the dataset, I get a non-zero value.
This is the command I use to run the train.py script:
"! python train.py --dataset sourcefolder --root /content/data --target-dir gt--interferer-dirs interfer --ext .wav --nb-train-samples 1000 --nb-valid-samples 100"
I am running the code in Google Colab.
https://github.com/sigsep/open-unmix-pytorch/blob/master/model.py#L187
I think that, unless provided (i.e. no output scaling), output_mean should be torch.zeros() and not torch.ones().
The main page says STL2 isn't included in the comparison because it used additional training data.
According to this page, STL2 (multi-instrument Wave-U-Net) didn't use additional training data. Which one is right? I think the confusion arose because STL1 does use additional data (CCMixter).
cc: @f90
The get_statistics function was designed to iterate over the complete audio data in a deterministic manner, loading the full audio of each item. This doesn't work with the sourcefolder dataset, since that dataset allows files of different lengths and only yields short chunks of fixed length from each item.
There are two ways to make the sourcefolder dataset work with get_statistics:
1. Replace dataset_scaler.seq_duration = None with dataset_scaler.seq_duration = args.seq_dur. That would solve the issue, but the dataset statistics would then only be computed on the first n seconds of each sample.
2. Use stochastic sampling and a dataloader instead of a dataset, e.g.:
def get_statistics(args, dataloader):
    scaler = sklearn.preprocessing.StandardScaler()
    spec = torch.nn.Sequential(
        model.STFT(n_fft=args.nfft, n_hop=args.nhop),
        model.Spectrogram(mono=True)
    )
    pbar = tqdm.tqdm(dataloader, disable=args.quiet)
    for x, y in pbar:
        pbar.set_description("Compute dataset statistics")
        X = spec(x)
        scaler.partial_fit(np.squeeze(X))
    std = np.maximum(
        scaler.scale_,
        1e-4 * np.max(scaler.scale_)
    )
    return scaler.mean_, std

stats_sampler = torch.utils.data.DataLoader(
    train_dataset, batch_size=1,
    sampler=sampler, **dataloader_kwargs
)
The second option would yield better-distributed samples, and users could additionally specify an argument that selects how many samples are randomly drawn to compute the dataset statistics.
Hi Sirs,
I'm new to UMX. I tried to reproduce the fantastic results you report on the website.
I used only the umx vocals-c8df74a5.pth model to run the evaluation (eval.py) on the MUSDB18 test set (50 songs).
Here is my result:

UMX1  accompaniment  ISR 18.950225  SAR 12.290675  SDR 11.881972  SIR 20.425005
UMX1  vocals         ISR 14.368638  SAR  5.715235  SDR  5.567850  SIR 12.480217
The SDR of vocals is 5.567850, which is much worse than your result of 6.32.
May I know how to reproduce your result?
What are the musdb/museval version you use?
I use musdb 0.3.1, museval 0.3.0.
I also plotted the boxplot, and it is just a little bit better than the Wave-U-Net 44 kHz pre-trained model.
I'm wondering what I did wrong.
Hope to receive your response.
Thanks in advance.
mstfc
I have MP3 files at 128 kb/s like
Metadata:
encoder : Lavf58.20.100
Duration: 00:02:22.11, start: 0.025057, bitrate: 128 kb/s
Stream #0:0: Audio: mp3, 44100 Hz, stereo, fltp, 128 kb/s
Metadata:
encoder : Lavc58.35
and I therefore convert to wav so that I get a 22050 Hz file, as per the dataset specification:
ffmpeg -i file.mp3 -acodec pcm_s16le -ar 22050 file.wav
Metadata:
encoder : Lavf58.12.100
Duration: 00:02:22.06, bitrate: 705 kb/s
Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 22050 Hz, 2 channels, s16, 705 kb/s
The separation works; my question is whether this is the best approach given the mp3 sample rate and bit rate.
Thank you.
First of all, I have to acknowledge the authors of open-unmix for this obviously awesome work ;)
My issue is about the sourcefolder dataset, which cannot handle sources with different numbers of channels. Let's assume that we have two folders of sources: the first one contains mono signals, the second one stereo signals. For training, we also set nb_channels to 1. In __getitem__ of SourceFolderDataset, an error is raised when trying to stack the sources, before summing them to create the mixture (line 358 of data.py).
Steps to reproduce the behavior:
Set nb-channels to 1; below is the command I used:
python train.py --root ./data-sourcefolder --dataset sourcefolder --interferer-dirs noise --target-dir speech --nb-train-samples 20000 --nb-valid-samples 2000 --seq-dur 2.0 --source-augmentations gain --hidden-size 256 --nb-channels 1 --nfft 1024 --nhop 256 --nb-workers 4
Traceback (most recent call last):
File "train.py", line 294, in <module>
main()
File "train.py", line 177, in main
scaler_mean, scaler_std = get_statistics(args, train_dataset)
File "train.py", line 68, in get_statistics
x, y = dataset_scaler[ind]
File "/data/recherche/python/speech_enhancement/open-unmix-pytorch/data.py", line 385, in __getitem__
stems = torch.stack(audio_sources)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 2 and 1 in dimension 1 at /tmp/pip-req-build-58y_cjjl/aten/src/TH/generic/THTensor.cpp:689
We could expect that, because nb_channels is set to 1, stereo source signals would be downmixed so that they could be mixed with the monophonic sources.
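One possible workaround (only a sketch of the idea, not an existing patch) would be to adapt each loaded source to the requested number of channels before stacking, e.g. with a small helper used inside __getitem__:

import torch

def match_channels(audio, nb_channels):
    # audio: tensor of shape (nb_channels_in, nb_samples) as returned by the loader
    # nb_channels: the value requested via the training arguments (hypothetical plumbing)
    if nb_channels == 1 and audio.shape[0] > 1:
        return audio.mean(dim=0, keepdim=True)  # downmix stereo to mono
    if nb_channels == 2 and audio.shape[0] == 1:
        return audio.repeat(2, 1)               # duplicate mono to pseudo-stereo
    return audio

# audio_sources = [match_channels(a, nb_channels) for a in audio_sources]
# stems = torch.stack(audio_sources)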
It seems that because of this line: https://github.com/sigsep/open-unmix-pytorch/blob/master/openunmix/filtering.py#L301, where the dtype is neither provided nor fixed, trying to use Wiener filtering with double precision fails.
File "/Users/defossez/projs/demucs/env/lib/python3.8/site-packages/openunmix/filtering.py", line 472, in wiener
y = expectation_maximization(y, mix_stft, iterations, eps=eps)[0]
File "/Users/defossez/projs/demucs/env/lib/python3.8/site-packages/openunmix/filtering.py", line 301, in expectation_maximization
y[t, ...] = torch.tensor(0.0, device=x.device)
RuntimeError: Index put requires the source and destination dtypes match, got Double for the destination and Float for the source.
To reproduce: call the wiener filtering function with a magnitude that is float64 and a complex mixture spectrogram that is complex128.
The call is expected to succeed with high-precision inputs.
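A sketch of the kind of one-line fix this suggests (the exact line may differ between versions) is to create the zero with the destination's dtype, or to assign a plain scalar:

# in expectation_maximization, when zeroing frames:
y[t, ...] = torch.tensor(0.0, device=x.device, dtype=y.dtype)
# or, avoiding the intermediate tensor altogether:
y[t, ...] = 0.0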
PyTorch version: 1.9.0
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 11.4 (x86_64)
GCC version: Could not collect
Clang version: 11.0.0
CMake version: version 3.19.1
Libc version: N/A
Python version: 3.8.8 (default, Feb 24 2021, 13:46:16) [Clang 10.0.0 ] (64-bit runtime)
Python platform: macOS-10.16-x86_64-i386-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.20.2
[pip3] torch==1.9.0
[pip3] torchaudio==0.9.0
[pip3] torchvision==0.9.1
[conda] blas 1.0 mkl
[conda] mkl 2019.4 233
[conda] mkl-service 2.3.0 py38h9ed2024_0
[conda] mkl_fft 1.2.0 py38hc64f4ea_0
[conda] mkl_random 1.1.1 py38h959d312_0
[conda] numpy 1.20.2 pypi_0 pypi
[conda] torch 1.8.1 pypi_0 pypi
[conda] torchaudio 0.8.1 pypi_0 pypi
[conda] torchvision 0.9.1 pypi_0 pypi
It seems that there is an interest to use just the pre-trained weights from open-unmix. To improve usability we will make open-unmix a pypi (and possibly conda-forge) package.
$ python3 train.py --dataset musdb
Using GPU: True
Using Torchaudio: False
Traceback (most recent call last):
File "train.py", line 294, in
main()
File "train.py", line 158, in main
train_dataset, valid_dataset, args = data.load_datasets(parser, args)
File "/home/scss/DeepLearning/VocalEX/Separation/open-unmix-pytorch/open-unmix-pytorch/data.py", line 226, in load_datasets
**dataset_kwargs
File "/home/scss/DeepLearning/VocalEX/Separation/open-unmix-pytorch/open-unmix-pytorch/data.py", line 751, in init
*args, **kwargs
TypeError: init() got an unexpected keyword argument 'root'
Steps to reproduce the behavior:
1. python3 train.py --root (Datasets)
I'm interested in implementing a real-time, streaming version of the separation method.
Do you have any advice on how to extract the model weights for this?
Would it be best to retrain, and save the weights during training?
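Retraining just to obtain the weights should not be necessary: the published checkpoints can be pulled through torch.hub and their state dict saved locally (a sketch, assuming the umxhq entry point of the sigsep/open-unmix-pytorch hub config):

import torch

# downloads the pretrained separator (vocals, drums, bass, other)
separator = torch.hub.load("sigsep/open-unmix-pytorch", "umxhq")
# raw weights that a streaming re-implementation could load
torch.save(separator.state_dict(), "umxhq_weights.pth")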
I'm actually going to use it in another script, but there's some pre-processing before the separate function gets called in test.py (the part after if __name__ == '__main__').
I wrapped it up in a whole function and was wondering if that's a good approach?
Like,
def main(input_files, samplerate, niter, alpha, softmask, residual_model, model,
targets=('vocals', 'drums', 'bass', 'other'), outdir=None, no_cuda=False):
and then at the end call it by
main(args.input, args.samplerate, args.niter, args.alpha, args.softmask, args.residual_model, args.model, args.targets, args.outdir, args.no_cuda)
This doesn't change the cli functionality but allows me to import the main function for external use.
Currently, using train.py --nb-channels 1 will apply a downmix in the spectral domain inside the model, so that only single-channel audio is fed in.
However, I can think of applications where we do not have access to a Wiener filter and therefore apply the model to each channel individually. In that case, performance might be better if the model were trained on just the left or the right channel. This can be fixed since we use channel-swap augmentation.
What are the random seeds you used for the different targets?
Hello,
is the dataset available as audio tracks for other projects? If so, under what terms?
Best regards
Your training docs mention that an aligned dataset can be used for Bandwidth Extension (Low Bandwidth -> High Bandwidth) as mentioned here.
Previously I have trained models for Source Separation (Mixture -> Target) and Denoising (Noisy -> Clean) and they're working as intended.
But training for Bandwidth Extension doesn't provide any noticeable enhancement at all; this is the output spectrogram and this is how it's supposed to look.
At first I thought I made a mistake converting the source from 22050 Hz to 44100 Hz, so I tried training with 22050 Hz and 48000 Hz files directly, but it throws this error:
Using GPU: True
Using Torchaudio: True
16748it [00:21, 795.79it/s]
15it [00:00, 657.46it/s]
Compute dataset statistics: 100%|███████| 16416/16416 [1:46:10<00:00, 2.58it/s]
Training Epoch: 0%| | 0/1000 [00:00<?, ?it/s]
train.py:31: UserWarning: Using a target size (torch.Size([278, 16, 1, 2049])) that is different to the input size (torch.Size([126, 16, 1, 2049])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
loss = torch.nn.functional.mse_loss(Y_hat, Y)
Training batch: 0%| | 0/1026 [00:01<?, ?it/s]
Training Epoch: 0%| | 0/1000 [00:01<?, ?it/s]
Traceback (most recent call last):
File "train.py", line 295, in <module>
main()
File "train.py", line 244, in main
train_loss = train(args, unmix, device, train_sampler, optimizer)
File "train.py", line 31, in train
loss = torch.nn.functional.mse_loss(Y_hat, Y)
File "/home/mgt/anaconda3/envs/basepy37/lib/python3.7/site-packages/torch/nn/functional.py", line 2203, in mse_loss
expanded_input, expanded_target = torch.broadcast_tensors(input, target)
File "/home/mgt/anaconda3/envs/basepy37/lib/python3.7/site-packages/torch/functional.py", line 52, in broadcast_tensors
return torch._C._VariableFunctions.broadcast_tensors(tensors)
RuntimeError: The size of tensor a (126) must match the size of tensor b (278) at non-singleton dimension 0
As I suspected, it's expecting same-size files, but then what am I supposed to do?
Could you please illustrate how to train this for Bandwidth Extension? I had zero trouble with Source Separation and Denoising tasks, so I expected this to work the same way.
This is genuinely driving me crazy, I REALLY need this, any help is much appreciated, thanks!
We made a mistake, like a lot of other repositories, in the data augmentation engine:
https://tanelp.github.io/posts/a-bug-that-plagues-thousands-of-open-source-ml-projects/
Setting nb_workers to a value higher than 1 results in identical random seeds in each worker.
Fix is given in the blog post. And also here: pytorch/pytorch#5059 (comment)
Possibly also happens in asteroid musdb18 dataset code
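The usual fix from the linked post, sketched here for reference (not a verbatim copy of the patch applied in this repository), is a worker_init_fn that re-seeds numpy in every DataLoader worker:

import numpy as np
import torch

def worker_init_fn(worker_id):
    # torch already seeds each worker differently; reuse that seed for numpy's global RNG
    np.random.seed(torch.initial_seed() % 2**32)

# then pass worker_init_fn=worker_init_fn when constructing the training DataLoader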
Hey OpenUnmixers!
I'm excited about all of the great work you've been doing! Congrats on the latest releases! :D
I just wanted to point out that two links under the News Section of your README are 404'ing:
The 14/02/2021 update goes to https://github.com/sigsep/open-unmix-pytorch/blob/master, which gives me a GitHub 404.
Thanks!
Ethan
Travis doesn't support MKL out of the box, which is why the unit tests currently fail. See the issue here. A workaround seems to be installing the Intel MKL library through apt.
As proposed in #91, Lightning has matured enough to be used to refactor open-unmix.
Hi sirs,
Sorry to bother.
This is not a bug, but I don't know whom I can ask.
I have a question about using istft() in test.py.
def istft(X, rate=44100, n_fft=4096, n_hopsize=1024):
    t, audio = scipy.signal.istft(
        X / (n_fft / 2),
        rate,
        nperseg=n_fft,
        noverlap=n_fft - n_hopsize,
        boundary=True
    )
    return audio
Why does the input data "X" need to be divided by "(n_fft / 2)" ?
What is the purpose of it?
Thanks for your help.
mstfc
It seems to be relatively easy to run out of memory on the GPU with the first example provided in the README. Maybe it would be nice to add some hardware requirements or an estimate of how much memory is needed per second of input signal.
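As a rough illustration of such an estimate (back-of-the-envelope only; it ignores norbert's EM covariance buffers, which are considerably larger than the spectrograms themselves):

# rough memory estimate for the complex STFTs alone, per second of stereo input
sr, n_fft, hop, channels, n_targets = 44100, 4096, 1024, 2, 4
frames_per_second = sr / hop            # ~43 frames per second
bins = n_fft // 2 + 1                   # 2049 frequency bins
bytes_per_value = 8                     # complex64
mb_per_second = frames_per_second * bins * channels * bytes_per_value / 2**20
print(f"{mb_per_second:.1f} MB/s per spectrogram")                             # ~1.3 MB/s
print(f"{mb_per_second * (n_targets + 1):.1f} MB/s for mixture + 4 targets")   # ~6.7 MB/s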
Steps to reproduce the behavior:
>>> python test.py ~/data/musdb18-wav/test/Al\ James\ -\ Schoolboy\ Facination/mixture.wav --model umxhq
Traceback (most recent call last):
File "test.py", line 301, in <module>
device=device
File "test.py", line 166, in separate
use_softmask=softmask)
File "/home/audeering.local/hwierstorf/.anaconda3/envs/open-unmix-pytorch-gpu/lib/python3.7/site-packages/norbert/__init__.py", line 260, in wiener
y = expectation_maximization(y/max_abs, x_scaled, iterations, eps=eps)[0]
File "/home/audeering.local/hwierstorf/.anaconda3/envs/open-unmix-pytorch-gpu/lib/python3.7/site-packages/norbert/__init__.py", line 141, in expectation_maximization
eps)
File "/home/audeering.local/hwierstorf/.anaconda3/envs/open-unmix-pytorch-gpu/lib/python3.7/site-packages/norbert/__init__.py", line 511, in get_local_gaussian_model
C_j = _covariance(y_j)
File "/home/audeering.local/hwierstorf/.anaconda3/envs/open-unmix-pytorch-gpu/lib/python3.7/site-packages/norbert/__init__.py", line 468, in _covariance
y_j.dtype)
MemoryError
PyTorch version: 1.2.0
Is debug build: No
CUDA used to build PyTorch: 10.0.130
OS: Ubuntu 18.04.3 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.10.2
Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: GeForce GTX 1050
Nvidia driver version: 430.40
cuDNN version: Could not collect
Versions of relevant libraries:
[pip3] numpy==1.13.3
[conda] mkl 2019.4 243
[conda] pytorch 1.2.0 py3.7_cuda10.0.130_cudnn7.6.2_0 pytorch
Steps to reproduce the behavior:
> docker run -v ~/Music/:/data -it faroit/open-unmix-pytorch umx "/data/track1.wav" --outdir /data/track1
Using cpu
Downloading: "https://zenodo.org/api/files/1c8f83c5-33a5-4f59-b109-721fdd234875/vocals-b62c91ce.pth" to /root/.cache/torch/hub/checkpoints/vocals-b62c91ce.pth
100%|████████████████████| 34.0M/34.0M [00:04<00:00, 8.69MB/s]
Downloading: "https://zenodo.org/api/files/1c8f83c5-33a5-4f59-b109-721fdd234875/drums-9619578f.pth" to /root/.cache/torch/hub/checkpoints/drums-9619578f.pth
100%|████████████████████| 34.0M/34.0M [00:04<00:00, 8.30MB/s]
Downloading: "https://zenodo.org/api/files/1c8f83c5-33a5-4f59-b109-721fdd234875/bass-8d85a5bd.pth" to /root/.cache/torch/hub/checkpoints/bass-8d85a5bd.pth
100%|████████████████████| 34.0M/34.0M [00:05<00:00, 6.65MB/s]
Downloading: "https://zenodo.org/api/files/1c8f83c5-33a5-4f59-b109-721fdd234875/other-b52fbbf7.pth" to /root/.cache/torch/hub/checkpoints/other-b52fbbf7.pth
100%|████████████████████| 34.0M/34.0M [00:04<00:00, 8.02MB/s]
formats: can't open input file `umx': No such file or directory
Traceback (most recent call last):
File "/opt/conda/bin/umx", line 8, in
sys.exit(separate())
File "/opt/conda/lib/python3.8/site-packages/openunmix/cli.py", line 160, in separate
audio, rate = data.load_audio(input_file, start=args.start, dur=args.duration)
File "/opt/conda/lib/python3.8/site-packages/openunmix/data.py", line 58, in load_audio
sig, rate = torchaudio.load(path)
File "/opt/conda/lib/python3.8/site-packages/torchaudio/backend/sox_io_backend.py", line 152, in load
return torch.ops.torchaudio.sox_io_load_audio_file(
RuntimeError: Error loading audio file: failed to open file umx
I can see it complaining about not loading an audio file (at least while I figure out how the syntax applies to Windows and Docker), but it seems to be complaining about loading umx, so perhaps there is a typo in the docker command?
Docker on Windows 10.
Please add some information about your environment
This stuff shouldn't be relevant for docker, yeah?
When I run "python openunmix/evaluate.py --root /local/musdb18 --targets vocals --model my_model --residual acc", the following appears:
Traceback (most recent call last):
File "openunmix/evaluate.py", line 260, in
results.add_track(scores)
File "/home/bianyuren/anaconda3/envs/umx-gpu-pytorch_1_8/lib/python3.8/site-packages/museval/aggregate.py", line 183, in add_track
self.df = self.df.append(track.df, ignore_index=True)
File "/home/bianyuren/anaconda3/envs/umx-gpu-pytorch_1_8/lib/python3.8/site-packages/museval/aggregate.py", line 113, in df
return json2df(simplejson.loads(self.json), self.track_name)
File "/home/bianyuren/anaconda3/envs/umx-gpu-pytorch_1_8/lib/python3.8/site-packages/museval/aggregate.py", line 413, in json2df
df = pd.melt(
File "/home/bianyuren/anaconda3/envs/umx-gpu-pytorch_1_8/lib/python3.8/site-packages/pandas/core/reshape/melt.py", line 64, in melt
raise KeyError(
KeyError: "The following 'id_vars' are not present in the DataFrame: ['name', 'time']"
Better performance == more fun ;-)
In the vocal/accompaniment scenario, separating with --niter 0 --residual only gets to 3.9 dB SDR for vocals, whereas with --niter 1 the scores go up to 6.0.
The scores without Wiener filtering should only be slightly worse than with it.
Are there any installation instructions yet for the brand-new version with the larger training table?
Thanks, Rog
I'm trying to use my own data for training with the FixedSourcesTrackFolderDataset. Unfortunately, the data doesn't seem to be recognized or found by the dataloader. I am using normal wav files (not stems), organized in the following folder structure:
dataset
    valid
        0
            guitar.wav
            piano.wav
            ...
        1
        2
        ...
    train
        3
            guitar.wav
            piano.wav
            ...
        4
        5
        ...
Next, I issue:
python train.py --root /path/to/dataset --dataset trackfolder_fix --target-file piano.wav --interferer-files cello.wav guitar.wav hi-hat.wav
The following error is thrown:
ValueError: num_samples should be a positive integer value, but got num_samples=0
Is this a problem with the format (folder structure) in which the data is provided? From reading the documentation I can't figure out if a Pytorch dataclass has to be created beforehand or not. If so, how does that fit into the folder structure?
A possible memory leak produces an OOM error on a minimal memory allocation while using the separate function in an iterative manner.
Get a set of 100 normal-sized mp3 files and do this:
Steps to reproduce the behavior:
for filename in files:
    result = separate_music_file(
        filename,
        'cpu',
        ['vocals'],
        # etc
    )
    print(result)
If the mp3 file and the model can fit in memory, I expect it to finish without error.
If they can't fit in memory, I expect it to fail in a consistent way.
The error message says it fails to allocate 6,000,000 bytes (~6 MB), so it doesn't look like the mp3 file size is the problem. Also, I tried a "split & retry" mechanism, and the file size doesn't really matter; the program fails after some iterations with any input size.
I think there could be a memory leak.
I'm still testing some changes, for example adding torch.no_grad and clearing caches between iterations, but with no luck so far.
I'll keep you updated.
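For reference, the pattern being tested looks roughly like this (separate_music_file and files are the caller's own objects from the snippet above, not part of openunmix):

import gc
import torch

for filename in files:
    with torch.no_grad():  # avoid keeping autograd graphs alive across iterations
        result = separate_music_file(filename, 'cpu', ['vocals'])
    print(result)
    del result
    gc.collect()  # drop Python-side references before the next file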
I find the concept here very interesting, because it illustrates quite well the possibilities of today. But the last few days I've been thinking about something that I'd just like to ask because I don't have the background knowledge. I hope that you as experts can give me some information. Here the thoughts are summarized:
Let's assume the following case: I have two signals that I want to separate from each other. In this case it is normal speech and music (no singing, just speech). I could now take a standard model (like this one) and train it on it. So far so good. But unlike the normal "music" separation I have some other problems and challenges. Let's assume that the music is a music bed which serves as a base. This can be talked over by many different people and can be used in many ways. In radio/broadcasting, for example, this is part of everyday life. There is no correlation between voice and music. For these reasons I have more than one version available from the music source (more than 100 or even thousand times), which means that the music bed may have been talked over by many people. But the background music is always exactly the same in all cases.
Now to my question or assumption: Is it possible to teach a neural network to extract only the "similar" or "same" signals that are present in each file? My idea would be that you could simply extract the music bed, because it is present in all recordings. Only the volume is not always the same but the content itself is.
Is this a purely theoretical scenario, or could you build something like this? If so, how much effort do you think experienced people would need to spend on it? How would you teach a network to do that?
Sorry it's a little off topic. But I would simply be interested in the opinion here. Is this just a fantasy, or can something like this actually be implemented with the available resources?
The current training code does provide a way to fine-tune models given a checkpoint file. However, there is currently no way to start training from the umx or umxhq pretrained models.
class Spectrogram(nn.Module):
    def __init__(
        self,
        power=1,
        mono=True
    ):
        super(Spectrogram, self).__init__()
        self.power = power
        self.mono = mono

    def forward(self, stft_f):
        """
        Input: complex STFT
            (nb_samples, nb_bins, nb_frames, 2)
        Output: Power/Mag Spectrogram
            (nb_frames, nb_samples, nb_channels, nb_bins)
        """
        stft_f = stft_f.transpose(2, 3)
        # take the magnitude
        stft_f = stft_f.pow(2).sum(-1).pow(self.power / 2.0)
        # downmix in the mag domain
        if self.mono:
            stft_f = torch.mean(stft_f, 1, keepdim=True)
        # permute output for LSTM convenience
        return stft_f.permute(2, 0, 1, 3)
The input shape should actually be (nb_samples, nb_channels, nb_bins, nb_frames, 2). The docstring as written is confusing.
Hello,
I've been interested in running various oracle benchmark methods to check if different types of spectrogram (CQT, etc.) can be useful for source separation.
Initially, I was working with the IRM1/2 and IBM1/2 from https://github.com/sigsep/sigsep-mus-oracle
However I noticed that Open-Unmix uses the strategy of "estimate of source magnitude + phase of original mix" (but it has an option to use soft masking instead). Is it valuable to create an "oracle phase-inversion" method?
So, the soft mask/IRM1 performance "ceiling" (the known IRM1 oracle mask calculation) looks like this (using the vocals stem as an example):
mix = <load mix> # mixed track
vocals_gt = <load vocals stem> # ground truth
vocals_irm1 = abs(stft(vocals_gt)) / abs(stft(mix))
vocals_est = istft(vocals_irm1 * stft(mix)) # estimate after "round trip" through soft mask
Now, for the phase inversion method, we could do the following:
mix = <load mix> # mixed track
vocals_gt = <load vocals stem> # ground truth
mix_phase = phase(stft(mix))
vocals_gt_magnitude = abs(stft(vocals_gt))
vocals_stft = pol2cart(vocals_gt_magnitude, mix_phase)
vocals_est = istft(vocals_stft) # estimate after "round trip" through phase inversion
Does this make sense to do? Has anybody done this before? What could this method be called?
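For reference, a minimal runnable sketch of such a phase-inversion oracle using scipy.signal (file paths and STFT parameters are placeholders; mixture and stem are assumed to have the same length and sample rate):

import numpy as np
import soundfile as sf
from scipy import signal

def phase_oracle(mix_path, vocals_path, n_fft=4096, hop=1024):
    mix, sr = sf.read(mix_path, always_2d=True)       # (nb_samples, nb_channels)
    vocals, _ = sf.read(vocals_path, always_2d=True)

    # STFTs with channels first, shape (nb_channels, nb_bins, nb_frames)
    _, _, mix_stft = signal.stft(mix.T, sr, nperseg=n_fft, noverlap=n_fft - hop)
    _, _, voc_stft = signal.stft(vocals.T, sr, nperseg=n_fft, noverlap=n_fft - hop)

    # oracle magnitude of the target combined with the phase of the mixture
    est_stft = np.abs(voc_stft) * np.exp(1j * np.angle(mix_stft))

    _, est = signal.istft(est_stft, sr, nperseg=n_fft, noverlap=n_fft - hop)
    return est.T, sr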
Hello, again.
@sigsep:
Sorry to bother you, but I seem to have made another novice mistake training on a "sourcefolder" dataset.
Specifically, I am using DCASE2013_subtask2/singlesounds_stereo, which has 320 wav files containing 16 classes of environmental noises (alert, clearthroat, cough, etc., 20 files each). I separated them into different folders according to the noise labels (./DCASE2013 (as root)/train/alert/alert01.wav, alert02.wav, etc.).
When I tried the following command, the error occurred.
Command: python train.py --dataset sourcefolder --root ./DCASE2013 --target-dir alert --interferer-dirs clearthroat cough
Error message:
Using GPU: True
100%|█████████████████████| 3/3 [00:00<00:00, 36.16it/s]
100%|█████████████████████| 3/3 [00:00<00:00, 10745.44it/s]
0%| | 0/1000 [00:00<?, ?it/s]
Traceback (most recent call last):
File "train.py", line 291, in
main()
File "train.py", line 174, in main
scaler_mean, scaler_std = get_statistics(args, train_dataset)
File "train.py", line 66, in get_statistics
x, y = dataset_scaler[ind]
File "/xxx/open-unmix-pytorch/data.py", line 367, in getitem
source_path = random.choice(self.source_tracks[source])
File "/xxx/anaconda3/envs/open-unmix-pytorch-gpu/lib/python3.7/random.py", line 261, in choice
raise IndexError('Cannot choose from an empty sequence') from None
IndexError: Cannot choose from an empty sequence
Am I missing something? Looks like it does not find the training files.
If the model hasn't been manually downloaded, the default umxhq model gets downloaded from torch hub every time.
If it is downloaded, it should be saved so that it doesn't have to be downloaded every time. torch.hub.set_dir could perhaps be set to model_path to save the model there (I haven't tried yet).
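A minimal sketch of that idea (model_path is whatever persistent directory the weights should live in; untested, as noted above):

import torch

# point the hub cache at a persistent location before the first download
torch.hub.set_dir("/path/to/model_path")

# later loads reuse the cached checkpoint instead of re-downloading
separator = torch.hub.load("sigsep/open-unmix-pytorch", "umxhq")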
Hello,
I believe the true input shape of OpenUnmix (the spectrogram model, not the on-the-fly waveform one) is this, taken from the code:
(nb_samples, nb_channels, nb_bins, nb_frames)
This corresponds to the (I, F, T) layout that I've seen in the oracle code (I = channels, F = frequency bins, T = time frames).
The README describes the shape in a different order:
models.OpenUnmix: The core open-unmix takes magnitude spectrograms directly (e.g. when pre-computed and loaded from disk). In that case, the input is of shape (nb_frames, nb_samples, nb_channels, nb_bins)
Facebook has just announced PyTorch Mobile for both iOS and Android devices in PyTorch 1.3. It ships new quantization back ends (FBGEMM and QNNPACK, state-of-the-art quantized kernels) for this mobile version.
Having a quantized model running on the device would be an interesting challenge.
It would be interesting to try the quantization of the model to make it ready to run on the device.
For more info about PyTorch 1.3 here
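As a first experiment, dynamic quantization of the Linear and LSTM layers can be tried directly in PyTorch (a sketch only, assuming the umxhq torch.hub entry point; this is not an official mobile recipe for open-unmix):

import torch

separator = torch.hub.load("sigsep/open-unmix-pytorch", "umxhq")
quantized = torch.quantization.quantize_dynamic(
    separator,
    {torch.nn.Linear, torch.nn.LSTM},  # the layers dominating model size
    dtype=torch.qint8,
)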
I've tried numerous times to "Request access" from zenodo.org but they ignore my requests. My guess is you have to be an RIAA member to get access. Is there someplace to actually get this data? I have a good amount of separate track music I would like to augment MUSDB18-HQ with.
Hi, I just wanted to know if it's possible to use Open-Unmix on android via Pytorch, I know there is usage of Pytorch on Android for image processing but I haven't found any examples to help me use Open-Unmix on android.
My system: Ubuntu 16.04, one GTX1080Ti, CUDA9, 24 core CPU
When training MUSDB18 using the default unmix model, there are 544 batches; the iteration time of one batch is about 21 seconds, so the training time for all 544 batches is 21*544 s = 11424 s ≈ 3.1 hours, which is very slow.
PS: my training script is:
python train.py --root path/to/musdb18 --target vocals
I suspect that there is something wrong with my training process. What is your training time on MUSDB18? Thanks
Hello,
I want to transfer-learn from the default model on a dataset of my own (or just train on my own dataset from scratch if that is easier). I have mixture.wav files and the wav files for the individual instruments as well. I want to be able to separate everything in the song. I have some questions, though:
What dataset type should I use for this application?
Do each individual song's wav files need to be the same length, or do all songs in the dataset (and their wav files) need to be the same length? Basically, can different songs be different lengths?
I'm wondering this because I was messing around with train.py and got the error NotImplementedError: Non-relative patterns are unsupported.
Would I get better single-instrument performance if I used the aligned ("denoising") dataset since it would be just focusing on the target sound and the noise? For example, if I just wanted to separate the bass from a song.
Also is there a Colab notebook that is set up to train?
Sorry if this doesn't make sense; I tried to make it as clear as possible.
Dear Sir or Madam,
Hello. First of all, thank you for sharing your work.
I ran your code only for separating 'vocals', according to your .md file. I get the aggregated scores
vocals ==> SDR: 5.415 SIR: 10.950 ISR: 14.831 SAR: 5.533, which is quite different from the ideal result.
Could you please tell me what went wrong? What should I do to reproduce your results? By the way, I use the musdb18 dataset with --is-wav.
Thank you very much.
I'm trying to train open-unmix from scratch. The validation losses after early stopping patience are not as good as what's shown in training.md: https://github.com/sigsep/open-unmix-pytorch/blob/master/docs/training.md
I'm using the exact open-unmix-pytorch codebase with no modifications. My training script is:
for target in drums vocals other bass; do
    python scripts/train.py \
        --root=~/MUSDB18-HQ/ --is-wav --nb-workers=4 --batch-size=16 --epochs=1000 \
        --target="$target" \
        --output="umx-baseline"
done
So far, drums and vocals have trained to the following lowest validation loss:
Drums: 0.93 (compared to the 0.7 claimed in training.md)
Vocals: 1.1 (compared to the 0.992 claimed in training.md)
These aren't huge differences, but I'm wondering if there's any explanation. Is it the random seed that allowed your drum model to get as far down as 0.7?