kaituoxu / conv-tasnet Goto Github PK

A PyTorch implementation of Conv-TasNet described in "TasNet: Surpassing Ideal Time-Frequency Masking for Speech Separation" with Permutation Invariant Training (PIT).

License: MIT License

Shell 10.12% Perl 32.59% Python 56.90% Makefile 0.39%

audio-separation conv-tasnet permutation-invariant-training pit pytorch source-separation speech-separation tasnet

conv-tasnet's Issues

Why the result I train can only reach the loss:( -14 valid and -16 train), the evaluate result is not 15.45 too

作者您好，请问一下我训练了100个epoch, 可是trainloss一直在-14，validloss一直在-16，评估的sdr什么的也一直在12或13左右。数据集我都是用生成的min文件夹里的，20000张训练集那个。模型超参我也是用的一样的。我尝试用你训练的最好模型再测试了一遍，结果是Average SDR improvement: 12.93， Average SISNR improvement: 12.50
我现在去很想问一下这种情况该怎么办，我不知道怎么才能提升sisnr，谢谢作者。

Hello, author. May I ask you that I have trained 100 epoch, but train loss is at -14, valid loss is at -16 for so long and never went down, and the assessed SDR has been around 12 . I used the generated min folder as your instruction said , 20,000 training wsj0 sets. I use the same model super-parameters. I tried to test again with the best model you trained and uploaded, and the result was Average SDR improvement: 12.93, Average SISNR improvement: 12.50
Now I would like to ask what to do about this situation. I don't know how to improve sisnr. The result I trained is not as what you said 15. Thank you very much.

The evaluating process was terribly slow

Is there anything wrong in evaluate.py?

different Data in evaluate.py and separate.py

I want to separate a mix sound with separate.py, but the separated sounds are noisy.
The mixture was female-male mix and I also test it in evaluate.py, the result was ~14 dB improvement in SDRi.
I test this mix file in separate.py, but when I listen to separated files, I find that they are separated female and male sound but they both were noisy.
I don't know what is the reason of noise.

Is this because of using different EvalDataLoader and DataLoader in separate.py and evaluate.py?

Thanks for your helpful repo.

Reproduce results

Hi, can you show your reproduce results of SDR on test set? thanks

Zero mean norm for SDR loss

Hi, I'm preparing a lecture on source separation …. Do you know where the zero_mean norm for the sdr losses is coming from and whats the intuition behind it? Was this in the original conv-tasnet paper?

Can't find WSJ0 Dataset

i can't find wsj0 dataset for training network !
I am looking for wsj0 dataset. Please help me.

Error when loading pretrained model

Hi,
when loading the pretrained model downloaded from https://pan.baidu.com/s/1-Rqm7GwpV7Cc1XzHSpHROg#list/path=%2F, some error happened:
Traceback (most recent call last):
File "src/separate.py", line 99, in
separate(args)
File "src/separate.py", line 39, in separate
model = TasNet.load_model(args.model_path)
File "/qgrapework/sspworks/TasNet_kaituoxu_20190624/src/tasnet.py", line 44, in load_model
model = cls.load_model_from_package(package)
File "/qgrapework/sspworks/TasNet_kaituoxu_20190624/src/tasnet.py", line 50, in load_model_from_package
package['hidden_size'], package['num_layers']
KeyError: 'hidden_size

RuntimeError: Expected object of backend CUDA but got backend CPU for argument #3 'index'

like WSJ0-mix, I make some micture from Librispeech in which tr, cv and tt dir contain 20000, 5000,3000 respectively . But I try to run your scripts, something wrong happened. Can you help me to figure it out? Thx.

And the key error of the train log is as follows:
"Training...
Traceback (most recent call last):
File "/home/yjm/Conv-TasNet/egs/LibriSpeech/../../src/train.py", line 145, in
main(args)
File "/home/yjm/Conv-TasNet/egs/LibriSpeech/../../src/train.py", line 139, in main
solver.train()
File "/home/yjm/Conv-TasNet/src/solver.py", line 76, in train
tr_avg_loss = self._run_one_epoch(epoch)
File "/home/yjm/Conv-TasNet/src/solver.py", line 178, in _run_one_epoch
estimate_source = self.model(padded_mixture)
File "/home/yjm/anaconda3/envs/tensorflow/lib/python3.5/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/yjm/anaconda3/envs/tensorflow/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 143, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/yjm/anaconda3/envs/tensorflow/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 153, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/yjm/anaconda3/envs/tensorflow/lib/python3.5/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
raise output
File "/home/yjm/anaconda3/envs/tensorflow/lib/python3.5/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in worker
output = module(*input, **kwargs)
File "/home/yjm/anaconda3/envs/tensorflow/lib/python3.5/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/yjm/Conv-TasNet/src/conv_tasnet.py", line 54, in forward
est_source = self.decoder(mixture_w, est_mask)
File "/home/yjm/anaconda3/envs/tensorflow/lib/python3.5/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/yjm/Conv-TasNet/src/conv_tasnet.py", line 141, in forward
est_source = overlap_and_add(est_source, self.L//2) # M x C x T
File "/home/yjm/Conv-TasNet/src/utils.py", line 45, in overlap_and_add
result.index_add(-2, frame, subframe_signal)
RuntimeError: Expected object of backend CUDA but got backend CPU for argument #3 'index'"

Quations reagrding Decoder

Hi,

Thank you for an awesome, well-organized repository.
I have a question regarding the Decoder block. The paper states, they use a 1-D transposed convolution operation for the generating the decoder basis functions. paper

However, I see you use a linear dense layer in the decoder

Conv-TasNet/src/conv_tasnet.py

Line 126 in 94eac10

self.basis_signals = nn.Linear(N, L, bias=False)

Could you explain the reason for this choice?

GPU RAM Issue

In cv you have set cv_maxlen = 6 or 8. it doesn't take files above 6 to 8 seconds. When I tried to change that, I faced ram issues. Training has been done but when validation started, i faced out of memory issue. so i cropped the wav files to 4secs and didn't face any issue.

Can you explain me why is that we are facing RAM issue when trying to send more data.
For my project, as i am working on music data, need to increase sample rate. 8000 to 22050. 2.5 times more than your implementation, this needs to split my data into less than 2secs to pass through validation.

Is there no other way to solve this ram issue? Why is it occupying too much space in ram?
Thanks in advance.

Questions about the SI-SNR

Thanks for your helpful sharing, but there still some question bothering me.
First, when I run your code, the loss is negative since your loss function is - SNR, but the negative loss seem not common for deep learning.
Then I noticed that, when calculating SNR, s_target is defined by both the clean and estimated source. It confused me much, could you give some explanation?

How to modify pit when pit is prohibited

training: AttributeError: 'ConvTasNet' object has no attribute 'module'

when i try to run train.py, in the beginning, it all goes well, while, when the saving model proceeds, it seems that it failed.
And my environment is: python 3.8, torch 1.10.1+cu113
i will be very grateful if anyone give me a help.
$E(HIDZC_GVJJQ7{QDV_`YL6$

C++ implementation of the Convtasnet model

Has anyone been able to try c++ implementation of ConvTasNet model. Atleast the inference from a pretrained model. I am facing issues wrt real time performance of the model.

What are the actual real time factors we can expect?

Thanks in advance

how to create mixture of my own audio files

can i test the speech separation method using my own audio speech files?
how to create audio mixture from my own speech files?

AssertionError while training separation model for 3 speaker scenario(C=3)

File "/nfs/users/Conv-TasNet/src/pit_criterion.py", line 21, in cal_loss
source_lengths)
File "/nfs/users/Conv-TasNet/src/pit_criterion.py", line 34, in cal_si_snr_with_pit
assert source.size() == estimate_source.size()
AssertionError

C is set to 3 and the training data is formatted accordingly with mix and s1,s2,s3.

Any support is appreciated
Thank you

Question regarding Normalization

How to implement cLN variant used in this paper which is next version of conv-tasnet released by the same authors.
Again, Thanks for this amazing implementation 🥇 .

typo at line 68, 96

line 68, 96 # generate minibach infomations

# generate minibatch informations

Sample of separated files for spkr1 and spkr2.

Hi,
Could you please provide a few samples of separated files.
I trained a model on my own dataset,and I just want to compare the result of separation by hearing the output files of spkr1 and spkr2.

thanks in advance.

Question about the summation of all outputs from 1-D Conv Block.

Hi,

According to the network structure from the paper, all outputs from 1-D Conv Blocks are added together. However, I did not find this operation from the code. I am new to pytorch, so I may miss something. Could you help explain this a bit?

Thanks

Missing a specfic wsj0 folder designated by mix_2_spk_tr.txt for creating 2 speaker mixtures

I tried to set up wsj0 data and to run Conv-TasNet. I have wsj0 corpus downloaded from LDC repository. I followed the instruction in ...../egs/wsj0/run.sh and obtained .wavs successfully converted by sph2pipe. Then I ran create_wav_2speakers.m on MATLAB to get 2 speaker mixture .wavs but failed running in the middle because a specific file, ...../wsj0/si_tr_s/401/401c020s.wav' was missing. That .m program reads out two file names at a time from mix_2_spk_tr.txt to create mixture from them but the whole directory ..../401/... to which that .wav belongs is missing in the original wsj0 file tree. How to fix this problem?

error when trying to reduce batch_size less than 64 on DSD100 dataset

hey,
First of all, I am very thankful for your amazing work.
I am trying to test the flexibility of the model with DSD100 dataset.
Testing to see if the model can separate singer and drums instead of 2-speakers.
I am facing issue while trying to reduce the batch_size to 3.(default = 128)
In solver.py -> _run_one_epoch() -> it is not entering this for loop
{ for i, (data) in enumerate(data_loader): } when batch_size is less than 64
else: ram is exceeding 12GB (Titan V)
Can you please help me understanding this error. What could be the difference in datasets?
Thanks :D

why LOSS is Negative?

Hi,
the loss function is usually positive, why is it negative here?

Bad performance when using for speech enhancement

Hi, very nice work. I noticed that some people are using Conv-TasNet for speech enhancement and get good results. While I encountered some problem while using this code for speech enhancement... I am trying to split clean speech and noise from a noisy speech. I am using VCTK dataset. The waveform of the results seem very weird...

When I changed the activation of mask to sigmoid, the result is still not good.

I wonder anyone has a thought how to solve this problem. Thanks in advance!

batch_size

according to table, the batch_size is 3. But what is the segment of each waveform during training? (Is it 3 sec, in this condition what is batch_size?)
according to default batch_size in train.py code that is 128 , which value is batch_size ?

Thanks a lot.

Changing the architecture to 2D conv layers

What are your predictions about changing the 1D structure with 2D ones (with a 1x1 kernel)?

Script to make same mixture as WSJ0-2mix and WSJ0- 3mix from any open source dataset e.g., Librispeech

Hi all!
we dont have access to WSJ dataset.
Could you please guide us how to make the mixture same as WSJ0-2mix from any open source dataset like Librispeech or any other open source dataset?

why did you use normalization = False in librosa.write in separate.py ?

when I put normalization = True the output is clear without any disturbance. did you get estimate_source values between -1 to 1 using normalization = False?
As this is done in time domain, our output range is -1 to 1 but, when we use Relu the output can be more than 1 right? Please help!

Cannot get the same evaluation SI-SNRi, even if using the pretrained model

Hi, thanks for the code and the pretrained model, they really help me a lot!

When I trying to use your pretrained model provided in the link pan.baidu.com/s/1-Rqm7GwpV7Cc1XzHSpHROg, I found that, when running the evaluate.py, the result is very different from your evaluate.log.
In the evaluate.log, it appears "Average SISNR improvement: 15.45"
However, when I run it, it is around 9.8

I assume that, we should have the same json files in data/tt/. In this case, we have the same code and same weights, we should have the same SISNRi 15.45.
I am wondering what makes the difference. Could I know the commit id of your repo when you run the evaluate.py. And could I have a look at your data/tt/mix.json(maybe just the starting 10 lines)

Below are the first few lines of my data/tt/mix.json
[ [ "datasets/data/wsj0-mix/2speakers/wav8k/min/tt/mix/445c0206_0.60431_22gc0105_-0.60431.wav", 33301 ], [ "datasets/data/wsj0-mix/2speakers/wav8k/min/tt/mix/420c020h_1.1139_442c0203_-1.1139.wav", 51541 ], [ "datasets/data/wsj0-mix/2speakers/wav8k/min/tt/mix/22go0107_0.079969_051c010u_-0.079969.wav", 30391 ], [ "datasets/data/wsj0-mix/2speakers/wav8k/min/tt/mix/444o0314_2.1819_053o020e_-2.1819.wav", 25624 ], [ "datasets/data/wsj0-mix/2speakers/wav8k/min/tt/mix/423o0304_1.419_420c020x_-1.419.wav", 48961 ], [ "datasets/data/wsj0-mix/2speakers/wav8k/min/tt/mix/423o030b_1.4753_053o0209_-1.4753.wav", 44774 ], [ "datasets/data/wsj0-mix/2speakers/wav8k/min/tt/mix/441o030o_1.9903_445c020y_-1.9903.wav", 26795 ], [ "datasets/data/wsj0-mix/2speakers/wav8k/min/tt/mix/22ga010u_0.43921_443o030l_-0.43921.wav", 45120 ],

If this is not our difference, what other possibilities are there? Thanks!

Function not clear

Conv-TasNet/src/pit_criterion.py

Line 101 in 94eac10

def get_mask(source, source_lengths):

Hi, can you explain why this function is used?

Reproduce results - mixing SNR ration

Hi,
Thanks a lot for sharing the code!!
I'm trying to reproduce your results. I'm running everything exactly as specified, however my best model reaches: SDR: 13.4, SNR-SI: 12.8.

I was wondering if you changed the mixing SNR ration between the speakers to be in the range of [-5, 5]? or did you leave it to be in the range of [0, 5]?

RuntimeError: CUDA out of memory.

even set batch_size = 3 as default with 3 GPU , I still get the following error, I wanna know how can I solve the problem now?

RuntimeError: CUDA out of memory. Tried to allocate 18.75 MiB (GPU 0; 11.90 GiB total capacity; 9.54 GiB already allocated; 9.00 MiB free; 159.98 MiB cached)

+-------------------------------+----------------------+----------------------+
| 1 TITAN X (Pascal) Off | 00000000:02:00.0 Off | N/A |
| 25% 44C P8 11W / 250W | 2MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 TITAN X (Pascal) Off | 00000000:03:00.0 Off | N/A |
| 23% 39C P8 9W / 250W | 2MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 TITAN X (Pascal) Off | 00000000:04:00.0 Off | N/A |
| 23% 36C P8 9W / 250W | 2MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

Questions about the training and pre-trained model

@kaituoxu Hi , It's really a nice work . Thanks for sharing the code.
However, I have a problem about the training. How many wave-files you used for training ?
I use 100000 '.wav' files (about for training and batch-size=10 (2GPU ,2080Ti_).It seems each epoch needs about 8 hours. It's too slow.
And could you please share the pre-trained model?
I am looking for your replay

Question about accelerating

It seems that it is low to load dataset, any way to accelate?

您好，感谢您的分享。
但是我发现训练模型的时候，读取数据似乎太慢了，有什么办法能够跑得更快一些吗？
谢谢，祝好！

why are long audio files ignored?

Conv-TasNet/src/data.py

Line 80 in 73d42bf

if num_segments > batch_size:

hi, thanks for sharing this!

If I'm not mistaken, in the above code segment it seems that audio files that are longer than one minibatch are ignored for the training. Why? They could be read in segments, otherwise a lot of audio from the database is not used at all.

How long does training take usually in 1 GPU(Nvidia 1080Ti) use case?

I started run.sh to initiate training session(stage2) and is still running after full 4days. I checked htop and nvidia-smi and both tell the training is running. Does it normally take that long ? Or does anything wrong happen? The run.sh parameter setting is default as listed below.

#!/bin/bash

Created on 2018/12

Author: Kaituo XU

-- START IMPORTANT

* If you have mixture wsj0 audio, modify `data` to your path that including tr, cv and tt.

* If you jsut have origin sphere format wsj0 , modify `wsj0_origin` to your path and

modify `wsj0_wav` to path that put output wav format wsj0, then read and run stage 1 part.

After that, modify `data` and run from stage 2.

wsj0_origin=/home/xxx/xxxx/Speech_Corpus/csr_1
wsj0_wav=/home/xxx/xxxxx/Speech_Corpus/wsj0-wav/wsj0
data=/home/xxx/xxxxx/Speech_Corpus/wsj-mix/2speakers/wav8k/min/
stage=2 # Modify this to control to start from which stage

-- END

dumpdir=data # directory to put generated json file

-- START Conv-TasNet Config

train_dir=$dumpdir/tr
valid_dir=$dumpdir/cv
evaluate_dir=$dumpdir/tt
separate_dir=$dumpdir/tt
sample_rate=8000
segment=4 # seconds
cv_maxlen=6 # seconds

Training config

use_cuda=1
id=0
epochs=100
half_lr=1
early_stop=0
max_norm=5

minibatch

shuffle=1
batch_size=3
num_workers=4

optimizer

optimizer=adam
lr=1e-3
momentum=0
l2=0

save and visualize

checkpoint=0
continue_from=""
print_freq=10
visdom=0
visdom_epoch=0
visdom_id="Conv-TasNet Training"

evaluate

ev_use_cuda=0
cal_sdr=1

-- END Conv-TasNet Config

exp tag

tag="" # tag for managing experiments.

ngpu=1 # always 1

. utils/parse_options.sh || exit 1;
. ./cmd.sh
. ./path.sh

if [ $stage -le 0 ]; then
echo "Stage 0: Convert sphere format to wav format and generate mixture"
local/data_prepare.sh --data ${wsj0_origin} --wav_dir ${wsj0_wav}

echo "NOTE: You should generate mixture by yourself now.
You can use tools/create-speaker-mixtures.zip which is download from
http://www.merl.com/demos/deep-clustering/create-speaker-mixtures.zip
If you don't have Matlab and want to use Octave, I suggest to replace
all mkdir(...) in create_wav_2speakers.m with system(['mkdir -p '...])
due to mkdir in Octave can not work in 'mkdir -p' way.
e.g.:
mkdir([output_dir16k '/' min_max{i_mm} '/' data_type{i_type}]);
->
system(['mkdir -p ' output_dir16k '/' min_max{i_mm} '/' data_type{i_type}]);"
exit 1
fi

if [ $stage -le 1 ]; then
echo "Stage 1: Generating json files including wav path and duration"
[ ! -d $dumpdir ] && mkdir $dumpdir
preprocess.py --in-dir $data --out-dir $dumpdir --sample-rate $sample_rate
fi

if [ -z ${tag} ]; then
expdir=exp/train_r${sample_rate}_N${N}_L${L}_B${B}_H${H}_P${P}_X${X}_R${R}C${C}${norm_type}causal${causal}${mask_nonlinear}_epoch${epochs}_half${half_lr}_norm${max_norm}_bs${batch_size}worker${num_workers}${optimizer}_lr${lr}mmt${momentum}l2${l2}basename $train_dir
else
expdir=exp/train${tag}
fi

if [ $stage -le 2 ]; then
echo "Stage 2: Training"
${cuda_cmd} --gpu ${ngpu} ${expdir}/train.log
CUDA_VISIBLE_DEVICES="$id"
train.py
--train_dir $train_dir
--valid_dir $valid_dir
--sample_rate $sample_rate
--segment $segment
--cv_maxlen $cv_maxlen
--N $N
--L $L
--B $B
--H $H
--P $P
--X $X
--R $R
--C $C
--norm_type $norm_type
--causal $causal
--mask_nonlinear $mask_nonlinear
--use_cuda $use_cuda
--epochs $epochs
--half_lr $half_lr
--early_stop $early_stop
--max_norm $max_norm
--shuffle $shuffle
--batch_size $batch_size
--num_workers $num_workers
--optimizer $optimizer
--lr $lr
--momentum $momentum
--l2 $l2
--save_folder ${expdir}
--checkpoint $checkpoint
--continue_from "$continue_from"
--print_freq ${print_freq}
--visdom $visdom
--visdom_epoch $visdom_epoch
--visdom_id "$visdom_id"
fi
(the rest omitted)

"mask_nonlinear" choose problem

Hi kaituo xu:
when I train your code on a new datasets I meet some problem. when I choose "mask_nonlinear" of relu , I can get similar results with you. However, when I choose ""mask_nonlinear" of softmax, the training loss is always -1. 4 . Did you meet the same problem ,when you choose "mask_nonlinear" of softmax? the choose in the Conv-tasnet is softmax, why do you choose relu?

gLN & causal=True cannot be applied at the same time.

if it is causal, it should not use "gLN". But there‘s no constraint in your code.
And cLN should calculate the cumulative mean & var over time steps.

Minor mistake?

Hi, I notice one small thing when I am running your code. It seems that when you are loading the data, you just sort according to the data length. And it seems that some audios in wsj_0 have the exact same length and it makes the mix/s1/s2 not matched.

However, I think it doesn't hurt the training since this situation is rare. I meet this problem because I want to parallel the inference stage with multiple CPUs and I get different results with the GPU version.

But thanks for your very organized code! It is very helpful to me.

channel mismatch

Why does half of my data have a channel mismatch problem?

请问怎么接着上一次的训练

checkpoint好像不行啊，没有选项啊，而且还报错

Evaluation seems ton only run on sequences with length < cv_max_len

Hi,
Thanks for providing your implementation of Conv-Tasnet. I trained a model and when I evaluated it on the test set (using run.sh) I was surprised to only see 2618 sequences evaluated (while the test set size is 3000). It seems that the AudioDataset created in evaluate.py uses the default cv_maxlen parameter (8 seconds), such that only test sequences shorter than 8 seconds are evaluated (2618 sequences). This would mean that the test SDR is not representative of how well the model performs on longer utterances. I attached the line where the AudioDataset is created.
Best regards,
Neil

Conv-TasNet/src/evaluate.py

Line 48 in 73d42bf

dataset = AudioDataset(args.data_dir, args.batch_size,

Help

Hi, I would like to do speech enhancement with your work, do you have a model available? and how do I use your code to enhance speech?
Thank you!

Original implementation of "Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation"

As far as I can see in the code, I am missing the skip connections.

As it is depicted in the photo, the outputs of all 1-D Conv blocks are summed and passed into the PRelu component.

All the same, what is happening in your code is that you pass the output of the last 1D Conv block into the PRelu component.

kaituoxu / conv-tasnet Goto Github PK

conv-tasnet's Issues

Created on 2018/12

Author: Kaituo XU

-- START IMPORTANT

* If you have mixture wsj0 audio, modify data to your path that including tr, cv and tt.

* If you jsut have origin sphere format wsj0 , modify wsj0_origin to your path and

modify wsj0_wav to path that put output wav format wsj0, then read and run stage 1 part.

After that, modify data and run from stage 2.