kaituoxu / conv-tasnet

A PyTorch implementation of Conv-TasNet described in "TasNet: Surpassing Ideal Time-Frequency Masking for Speech Separation" with Permutation Invariant Training (PIT).

License: MIT License

Shell 10.12% Perl 32.59% Python 56.90% Makefile 0.39%
speech-separation source-separation audio-separation pit pytorch tasnet conv-tasnet permutation-invariant-training

conv-tasnet's Introduction

Conv-TasNet

A PyTorch implementation of Conv-TasNet described in "TasNet: Surpassing Ideal Time-Frequency Masking for Speech Separation".

Results

From  | N   | L  | B   | H   | P | X | R | Norm | Causal | batch size | SI-SNRi (dB) | SDRi (dB)
Paper | 256 | 20 | 256 | 512 | 3 | 8 | 4 | gLN  | X      | -          | 14.6         | 15.0
Here  | 256 | 20 | 256 | 512 | 3 | 8 | 4 | gLN  | X      | 3          | 15.5         | 15.7

Install

  • PyTorch 0.4.1+
  • Python 3 (Anaconda recommended)
  • pip install -r requirements.txt
  • If you need to convert wsj0 to wav format and generate the mixture files: cd tools; make

Usage

If you already have the mixture wsj0 data:

  1. $ cd egs/wsj0, then modify the wsj0 data path data at the beginning of run.sh to your path.
  2. $ bash run.sh, that's all!

If you only have the original wsj0 data (sphere format):

  1. $ cd egs/wsj0, then modify the three wsj0 data paths at the beginning of run.sh to your paths.
  2. Convert the sphere-format wsj0 to wav format and generate the mixtures. The Stage 0 part provides an example.
  3. $ bash run.sh, that's all!

You can change a hyper-parameter with $ bash run.sh --parameter_name parameter_value, e.g., $ bash run.sh --stage 3. See the parameter names in egs/wsj0/run.sh before the line . utils/parse_options.sh.

Workflow

Workflow of egs/wsj0/run.sh:

  • Stage 0: Convert sphere format to wav format and generate mixture (optional)
  • Stage 1: Generating json files including wav path and duration
  • Stage 2: Training
  • Stage 3: Evaluate separation performance
  • Stage 4: Separate speech using Conv-TasNet

More detail

# Set PATH and PYTHONPATH
$ cd egs/wsj0/; . ./path.sh
# Train:
$ train.py -h
# Evaluate performance:
$ evaluate.py -h
# Separate mixture audio:
$ separate.py -h

How to visualize loss?

If you want to visualize your loss, you can use visdom to do that:

  1. Open a new terminal on your remote server (tmux is recommended) and run $ visdom
  2. Open a new terminal and run $ bash run.sh --visdom 1 --visdom_id "<any-string>" or $ train.py ... --visdom 1 --visdom_id "<any-string>"
  3. Open your browser and go to <your-remote-server-ip>:8097, e.g., 127.0.0.1:8097
  4. On the visdom page, choose <any-string> in Environment to see your loss curves
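
For orientation, here is a minimal Python sketch of pushing a loss curve to a running visdom server; the environment name, window name and dummy loss values are placeholders, not what train.py actually uses.

import numpy as np
import visdom

# Connect to the visdom server started with "$ visdom" (default port 8097).
vis = visdom.Visdom(env="conv-tasnet-demo")   # hypothetical environment name

losses = []
for epoch, loss in enumerate([-5.2, -8.1, -10.4], start=1):   # dummy loss values
    losses.append(loss)
    vis.line(
        X=np.arange(1, len(losses) + 1),
        Y=np.array(losses),
        win="train_loss",   # reuse the same window so the curve updates in place
        opts=dict(title="Training loss", xlabel="epoch", ylabel="loss (-SI-SNR)"),
    )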

How to resume training?

$ bash run.sh --continue_from <model-path>

How to use multi-GPU?

Use a comma-separated gpu-id sequence, such as:

$ bash run.sh --id "0,1"

How to solve out of memory?

  • If it happens during training, try reducing batch_size or using more GPUs: $ bash run.sh --batch_size <lower-value>
  • If it happens during cross validation, try reducing cv_maxlen: $ bash run.sh --cv_maxlen <lower-value>

conv-tasnet's People

Contributors

kaituoxu


conv-tasnet's Issues

typo at line 68, 96

line 68, 96 # generate minibach infomations

  • # generate minibatch informations

AssertionError while training a separation model for the 3-speaker scenario (C=3)

File "/nfs/users/Conv-TasNet/src/pit_criterion.py", line 21, in cal_loss
source_lengths)
File "/nfs/users/Conv-TasNet/src/pit_criterion.py", line 34, in cal_si_snr_with_pit
assert source.size() == estimate_source.size()
AssertionError

C is set to 3 and the training data is formatted accordingly with mix and s1,s2,s3.

Any support is appreciated
Thank you

why are long audio files ignored?

if num_segments > batch_size:

hi, thanks for sharing this!

If I'm not mistaken, the code above means that audio files longer than one minibatch are ignored during training. Why? They could be read in segments; otherwise a lot of audio from the database is not used at all (a sketch of this idea follows below).
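
As a rough illustration of the suggestion above, a hedged sketch of chunking a long utterance into fixed-length segments instead of dropping it; the function name and segment length are illustrative, and this is not what the repo's AudioDataset currently does.

import numpy as np

def chunk_utterance(signal, segment_len, drop_last_shorter_than=0):
    # Split a 1-D signal into consecutive fixed-length segments.
    segments = [signal[start:start + segment_len]
                for start in range(0, len(signal), segment_len)]
    # Optionally drop a trailing remainder that is too short to be useful.
    if segments and len(segments[-1]) < drop_last_shorter_than:
        segments.pop()
    return segments

# Example: a 60 s file at 8 kHz becomes 15 segments of 4 s each.
parts = chunk_utterance(np.zeros(60 * 8000), segment_len=4 * 8000)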

Reproduce results - mixing SNR ratio

Hi,
Thanks a lot for sharing the code!!
I'm trying to reproduce your results. I'm running everything exactly as specified, but my best model only reaches SDR 13.4 and SI-SNR 12.8.

I was wondering whether you changed the mixing SNR ratio between the speakers to be in the range [-5, 5], or did you leave it in the range [0, 5]?

Why does my training loss plateau around -14/-16, and why doesn't evaluation reach 15.45?

Hello, author. I have trained for 100 epochs, but the train loss stays around -14 and the valid loss around -16 and never goes down, and the evaluated SDR stays around 12 or 13. I used the generated min folders as your instructions said, with the 20,000-utterance wsj0 training set, and the same model hyper-parameters. I also tested again with the best model you trained and uploaded, and the result was Average SDR improvement: 12.93, Average SISNR improvement: 12.50.
Now I would like to ask what to do in this situation. I don't know how to improve the SI-SNR; the result I trained is not 15 as you reported. Thank you very much.

Can't find WSJ0 Dataset

I can't find the wsj0 dataset for training the network!
I am looking for the wsj0 dataset. Please help me.

Error when trying to reduce batch_size below 64 on the DSD100 dataset

Hey,
First of all, I am very thankful for your amazing work.
I am trying to test the flexibility of the model on the DSD100 dataset,
to see whether it can separate singer and drums instead of two speakers.
I am facing an issue when trying to reduce the batch_size to 3 (default = 128).
In solver.py -> _run_one_epoch(), the loop
{ for i, (data) in enumerate(data_loader): } is never entered when batch_size is less than 64;
otherwise RAM usage exceeds 12 GB (Titan V).
Can you please help me understand this error? What could be the difference between the datasets?
Thanks :D

Why did you use normalization=False in the librosa write call in separate.py?

When I set normalization=True the output is clear without any disturbance. Did you get estimate_source values between -1 and 1 using normalization=False?
As this is done in the time domain, our output range is -1 to 1, but when we use ReLU the output can be more than 1, right? Please help!
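
For reference, a hedged sketch of peak-normalizing an estimated source before writing it to disk; it uses soundfile rather than the librosa write call mentioned above, and the function name is illustrative.

import numpy as np
import soundfile as sf

def write_normalized(path, est_source, sample_rate=8000):
    # Rescale by the peak so the waveform fits in [-1, 1] and does not clip on write.
    peak = np.max(np.abs(est_source))
    if peak > 1.0:
        est_source = est_source / peak
    sf.write(path, est_source, sample_rate)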

Question regarding Normalization

How can I implement the cLN variant used in this paper, which is the next version of Conv-TasNet released by the same authors?
Again, thanks for this amazing implementation 🥇.
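
One possible way to write a cumulative layer norm (cLN) in PyTorch, normalizing each frame by the statistics of all channels and all frames up to and including it, as described in the Conv-TasNet paper. This is a sketch, not code from this repository, and the [M, N, K] (batch, channels, frames) layout is an assumption.

import torch
import torch.nn as nn

class CumulativeLayerNorm(nn.Module):
    # cLN sketch: statistics are taken over channels and past frames only.
    def __init__(self, channels, eps=1e-8):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(1, channels, 1))
        self.bias = nn.Parameter(torch.zeros(1, channels, 1))
        self.eps = eps

    def forward(self, x):                      # x: [M, N, K]
        M, N, K = x.size()
        step_sum = x.sum(dim=1)                # [M, K], sum over channels per frame
        step_pow_sum = x.pow(2).sum(dim=1)     # [M, K]
        cum_sum = torch.cumsum(step_sum, dim=1)          # running sums over frames
        cum_pow_sum = torch.cumsum(step_pow_sum, dim=1)
        entry_cnt = torch.arange(N, N * (K + 1), N,      # N, 2N, ..., K*N entries seen so far
                                 dtype=x.dtype, device=x.device)
        cum_mean = cum_sum / entry_cnt                   # [M, K]
        cum_var = cum_pow_sum / entry_cnt - cum_mean.pow(2)
        cum_mean, cum_var = cum_mean.unsqueeze(1), cum_var.unsqueeze(1)  # [M, 1, K]
        return self.gain * (x - cum_mean) / torch.sqrt(cum_var + self.eps) + self.bias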

"mask_nonlinear" choose problem

Hi Kaituo Xu,
When I train your code on a new dataset I run into a problem. When I choose "mask_nonlinear" = relu, I get results similar to yours. However, when I choose "mask_nonlinear" = softmax, the training loss stays at -1.4. Did you meet the same problem when you chose softmax? The choice in the Conv-TasNet paper is softmax, so why do you choose relu?

Questions about the SI-SNR

Thanks for your helpful sharing, but there are still some questions bothering me.
First, when I run your code the loss is negative, since your loss function is -SI-SNR, but a negative loss does not seem common in deep learning.
Then I noticed that, when calculating the SI-SNR, s_target is defined using both the clean and the estimated source. This confused me a lot; could you give some explanation?
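
To make the two points concrete, here is a minimal numpy sketch of the standard SI-SNR computation (zero-mean both signals, project the estimate onto the clean source, then take the energy ratio). It follows the definition in the Conv-TasNet paper rather than the repo's exact pit_criterion.py code; variable names are illustrative.

import numpy as np

def si_snr(estimate, source, eps=1e-8):
    # 1) Remove the mean so the measure ignores DC offsets.
    estimate = estimate - estimate.mean()
    source = source - source.mean()
    # 2) s_target is the projection of the estimate onto the clean source.
    #    This is why it depends on both signals: it keeps only the component
    #    of the estimate that lies along the true source.
    s_target = np.dot(estimate, source) * source / (np.dot(source, source) + eps)
    # 3) Whatever is left over is treated as noise.
    e_noise = estimate - s_target
    ratio = np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps)
    return 10 * np.log10(ratio + eps)

# Training minimizes -si_snr(...), so a loss of -15 simply means about 15 dB SI-SNR.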

Evaluation seems to only run on sequences with length < cv_maxlen

Hi,
Thanks for providing your implementation of Conv-Tasnet. I trained a model and when I evaluated it on the test set (using run.sh) I was surprised to only see 2618 sequences evaluated (while the test set size is 3000). It seems that the AudioDataset created in evaluate.py uses the default cv_maxlen parameter (8 seconds), such that only test sequences shorter than 8 seconds are evaluated (2618 sequences). This would mean that the test SDR is not representative of how well the model performs on longer utterances. I attached the line where the AudioDataset is created.
Best regards,
Neil

dataset = AudioDataset(args.data_dir, args.batch_size,

Question about accelerating

It seems that loading the dataset is slow; is there any way to accelerate it?

Hello, thank you for sharing.
However, I found that reading the data while training the model seems too slow. Is there any way to make it run faster?
Thanks and best wishes!
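
As a generic hint (not specific to this repo's AudioDataLoader, whose constructor may differ), data loading in PyTorch is usually sped up with more worker processes and pinned memory:

from torch.utils.data import DataLoader

def make_loader(dataset, batch_size):
    # "dataset" stands in for the repo's AudioDataset here.
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=8,      # read and decode audio in parallel worker processes
        pin_memory=True,    # faster host-to-GPU copies
    )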

C++ implementation of the Convtasnet model

Has anyone been able to try a C++ implementation of the ConvTasNet model, at least inference from a pretrained model? I am facing issues with the real-time performance of the model.

What real-time factors can we actually expect?

Thanks in advance

Questions about the training and pre-trained model

@kaituoxu Hi, it's really nice work. Thanks for sharing the code.
However, I have a question about training. How many wave files did you use for training?
I use about 100,000 .wav files for training with batch_size=10 (2 GPUs, 2080 Ti). It seems each epoch needs about 8 hours, which is too slow.
And could you please share the pre-trained model?
I am looking forward to your reply.

batch_size

According to the table, the batch_size is 3. But what is the segment length of each waveform during training? (Is it 3 seconds, and in that case what does batch_size count?)
The default batch_size in the train.py code is 128, so which value is the actual batch_size?

Thanks a lot.

Sample of separated files for spkr1 and spkr2.

Hi,
Could you please provide a few samples of separated files?
I trained a model on my own dataset, and I just want to compare the separation result by listening to the output files for spkr1 and spkr2.

Thanks in advance.

RuntimeError: Expected object of backend CUDA but got backend CPU for argument #3 'index'

Like WSJ0-mix, I made some mixtures from LibriSpeech, in which the tr, cv and tt dirs contain 20000, 5000 and 3000 utterances respectively. But when I try to run your scripts, something goes wrong. Can you help me figure it out? Thanks.

And the key error of the train log is as follows:
"Training...
Traceback (most recent call last):
File "/home/yjm/Conv-TasNet/egs/LibriSpeech/../../src/train.py", line 145, in
main(args)
File "/home/yjm/Conv-TasNet/egs/LibriSpeech/../../src/train.py", line 139, in main
solver.train()
File "/home/yjm/Conv-TasNet/src/solver.py", line 76, in train
tr_avg_loss = self._run_one_epoch(epoch)
File "/home/yjm/Conv-TasNet/src/solver.py", line 178, in _run_one_epoch
estimate_source = self.model(padded_mixture)
File "/home/yjm/anaconda3/envs/tensorflow/lib/python3.5/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/yjm/anaconda3/envs/tensorflow/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 143, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/yjm/anaconda3/envs/tensorflow/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 153, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/yjm/anaconda3/envs/tensorflow/lib/python3.5/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
raise output
File "/home/yjm/anaconda3/envs/tensorflow/lib/python3.5/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in worker
output = module(*input, **kwargs)
File "/home/yjm/anaconda3/envs/tensorflow/lib/python3.5/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/yjm/Conv-TasNet/src/conv_tasnet.py", line 54, in forward
est_source = self.decoder(mixture_w, est_mask)
File "/home/yjm/anaconda3/envs/tensorflow/lib/python3.5/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/yjm/Conv-TasNet/src/conv_tasnet.py", line 141, in forward
est_source = overlap_and_add(est_source, self.L//2) # M x C x T
File "/home/yjm/Conv-TasNet/src/utils.py", line 45, in overlap_and_add
result.index_add_(-2, frame, subframe_signal)
RuntimeError: Expected object of backend CUDA but got backend CPU for argument #3 'index'"
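
The traceback points at the index_add_ call in overlap_and_add, where the frame index tensor appears to live on the CPU while the signal is on the GPU. A hedged sketch of the usual fix is to move the index onto the signal's device before the call; the variable names follow the traceback, the surrounding code is assumed.

# Inside overlap_and_add (src/utils.py), just before the failing call -- sketch only:
frame = frame.to(subframe_signal.device)      # keep the index on the same device as the data
result.index_add_(-2, frame, subframe_signal)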

Cannot get the same evaluation SI-SNRi, even if using the pretrained model

Hi, thanks for the code and the pretrained model, they really help me a lot!

When I try to use your pretrained model provided at the link pan.baidu.com/s/1-Rqm7GwpV7Cc1XzHSpHROg, I find that when running evaluate.py the result is very different from your evaluate.log.
The evaluate.log reports "Average SISNR improvement: 15.45".
However, when I run it, it is around 9.8.

I assume that we should have the same json files in data/tt/. In that case, with the same code and the same weights, we should get the same SISNRi of 15.45.
I am wondering what makes the difference. Could I know the commit id of your repo when you ran evaluate.py? And could I have a look at your data/tt/mix.json (maybe just the first 10 lines)?

Below are the first few lines of my data/tt/mix.json
[ [ "datasets/data/wsj0-mix/2speakers/wav8k/min/tt/mix/445c0206_0.60431_22gc0105_-0.60431.wav", 33301 ], [ "datasets/data/wsj0-mix/2speakers/wav8k/min/tt/mix/420c020h_1.1139_442c0203_-1.1139.wav", 51541 ], [ "datasets/data/wsj0-mix/2speakers/wav8k/min/tt/mix/22go0107_0.079969_051c010u_-0.079969.wav", 30391 ], [ "datasets/data/wsj0-mix/2speakers/wav8k/min/tt/mix/444o0314_2.1819_053o020e_-2.1819.wav", 25624 ], [ "datasets/data/wsj0-mix/2speakers/wav8k/min/tt/mix/423o0304_1.419_420c020x_-1.419.wav", 48961 ], [ "datasets/data/wsj0-mix/2speakers/wav8k/min/tt/mix/423o030b_1.4753_053o0209_-1.4753.wav", 44774 ], [ "datasets/data/wsj0-mix/2speakers/wav8k/min/tt/mix/441o030o_1.9903_445c020y_-1.9903.wav", 26795 ], [ "datasets/data/wsj0-mix/2speakers/wav8k/min/tt/mix/22ga010u_0.43921_443o030l_-0.43921.wav", 45120 ],

If this is not where we differ, what other possibilities are there? Thanks!

Error when loading pretrained model

Hi,
when loading the pretrained model downloaded from https://pan.baidu.com/s/1-Rqm7GwpV7Cc1XzHSpHROg#list/path=%2F, an error occurred:
Traceback (most recent call last):
File "src/separate.py", line 99, in
separate(args)
File "src/separate.py", line 39, in separate
model = TasNet.load_model(args.model_path)
File "/qgrapework/sspworks/TasNet_kaituoxu_20190624/src/tasnet.py", line 44, in load_model
model = cls.load_model_from_package(package)
File "/qgrapework/sspworks/TasNet_kaituoxu_20190624/src/tasnet.py", line 50, in load_model_from_package
package['hidden_size'], package['num_layers']
KeyError: 'hidden_size'
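
The missing 'hidden_size' key suggests a Conv-TasNet checkpoint is being loaded with the TasNet class from src/tasnet.py, which expects LSTM-style fields. If the downloaded file is the Conv-TasNet model, it presumably has to be loaded with the ConvTasNet class instead; a hedged sketch, assuming ConvTasNet exposes a load_model classmethod analogous to TasNet's:

# Sketch only: assumes ConvTasNet provides a load_model classmethod like TasNet does.
from conv_tasnet import ConvTasNet

model = ConvTasNet.load_model("path/to/pretrained_model.pth")   # hypothetical checkpoint path
model.eval()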

RuntimeError: CUDA out of memory.

Even with batch_size = 3 (the default) and 3 GPUs, I still get the following error. How can I solve this problem?

RuntimeError: CUDA out of memory. Tried to allocate 18.75 MiB (GPU 0; 11.90 GiB total capacity; 9.54 GiB already allocated; 9.00 MiB free; 159.98 MiB cached)

+-------------------------------+----------------------+----------------------+
| 1 TITAN X (Pascal) Off | 00000000:02:00.0 Off | N/A |
| 25% 44C P8 11W / 250W | 2MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 TITAN X (Pascal) Off | 00000000:03:00.0 Off | N/A |
| 23% 39C P8 9W / 250W | 2MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 TITAN X (Pascal) Off | 00000000:04:00.0 Off | N/A |
| 23% 36C P8 9W / 250W | 2MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

Reproduce results

Hi, can you show your reproduced SDR results on the test set? Thanks.

How long does training usually take on 1 GPU (Nvidia 1080 Ti)?

I started run.sh to initiate a training session (stage 2) and it is still running after 4 full days. I checked htop and nvidia-smi and both tell me the training is running. Does it normally take that long, or has something gone wrong? The run.sh parameter setting is the default, as listed below.

#!/bin/bash

# Created on 2018/12
# Author: Kaituo XU

# -- START IMPORTANT
# * If you have mixture wsj0 audio, modify data to your path that includes tr, cv and tt.
# * If you just have origin sphere format wsj0, modify wsj0_origin to your path and
#   modify wsj0_wav to the path where the wav format wsj0 should go, then read and run the stage 1 part.
#   After that, modify data and run from stage 2.

wsj0_origin=/home/xxx/xxxx/Speech_Corpus/csr_1
wsj0_wav=/home/xxx/xxxxx/Speech_Corpus/wsj0-wav/wsj0
data=/home/xxx/xxxxx/Speech_Corpus/wsj-mix/2speakers/wav8k/min/
stage=2  # Modify this to control which stage to start from

# -- END

dumpdir=data  # directory to put generated json file

# -- START Conv-TasNet Config

train_dir=$dumpdir/tr
valid_dir=$dumpdir/cv
evaluate_dir=$dumpdir/tt
separate_dir=$dumpdir/tt
sample_rate=8000
segment=4 # seconds
cv_maxlen=6 # seconds

# Network config

N=256
L=20
B=256
H=512
P=3
X=8
R=4
norm_type=gLN
causal=0
mask_nonlinear='relu'
C=2

# Training config

use_cuda=1
id=0
epochs=100
half_lr=1
early_stop=0
max_norm=5

# minibatch

shuffle=1
batch_size=3
num_workers=4

# optimizer

optimizer=adam
lr=1e-3
momentum=0
l2=0

# save and visualize

checkpoint=0
continue_from=""
print_freq=10
visdom=0
visdom_epoch=0
visdom_id="Conv-TasNet Training"

# evaluate

ev_use_cuda=0
cal_sdr=1

# -- END Conv-TasNet Config

# exp tag

tag="" # tag for managing experiments.

ngpu=1 # always 1

. utils/parse_options.sh || exit 1;
. ./cmd.sh
. ./path.sh

if [ $stage -le 0 ]; then
echo "Stage 0: Convert sphere format to wav format and generate mixture"
local/data_prepare.sh --data ${wsj0_origin} --wav_dir ${wsj0_wav}

echo "NOTE: You should generate mixture by yourself now.
You can use tools/create-speaker-mixtures.zip which is downloaded from
http://www.merl.com/demos/deep-clustering/create-speaker-mixtures.zip
If you don't have Matlab and want to use Octave, I suggest replacing
all mkdir(...) in create_wav_2speakers.m with system(['mkdir -p '...])
because mkdir in Octave does not support the 'mkdir -p' behavior.
e.g.:
mkdir([output_dir16k '/' min_max{i_mm} '/' data_type{i_type}]);
->
system(['mkdir -p ' output_dir16k '/' min_max{i_mm} '/' data_type{i_type}]);"
exit 1
fi

if [ $stage -le 1 ]; then
echo "Stage 1: Generating json files including wav path and duration"
[ ! -d $dumpdir ] && mkdir $dumpdir
preprocess.py --in-dir $data --out-dir $dumpdir --sample-rate $sample_rate
fi

if [ -z ${tag} ]; then
  expdir=exp/train_r${sample_rate}_N${N}_L${L}_B${B}_H${H}_P${P}_X${X}_R${R}_C${C}_${norm_type}_causal${causal}_${mask_nonlinear}_epoch${epochs}_half${half_lr}_norm${max_norm}_bs${batch_size}_worker${num_workers}_${optimizer}_lr${lr}_mmt${momentum}_l2${l2}_`basename $train_dir`
else
  expdir=exp/train_${tag}
fi

if [ $stage -le 2 ]; then
  echo "Stage 2: Training"
  ${cuda_cmd} --gpu ${ngpu} ${expdir}/train.log \
    CUDA_VISIBLE_DEVICES="$id" \
    train.py \
    --train_dir $train_dir \
    --valid_dir $valid_dir \
    --sample_rate $sample_rate \
    --segment $segment \
    --cv_maxlen $cv_maxlen \
    --N $N \
    --L $L \
    --B $B \
    --H $H \
    --P $P \
    --X $X \
    --R $R \
    --C $C \
    --norm_type $norm_type \
    --causal $causal \
    --mask_nonlinear $mask_nonlinear \
    --use_cuda $use_cuda \
    --epochs $epochs \
    --half_lr $half_lr \
    --early_stop $early_stop \
    --max_norm $max_norm \
    --shuffle $shuffle \
    --batch_size $batch_size \
    --num_workers $num_workers \
    --optimizer $optimizer \
    --lr $lr \
    --momentum $momentum \
    --l2 $l2 \
    --save_folder ${expdir} \
    --checkpoint $checkpoint \
    --continue_from "$continue_from" \
    --print_freq ${print_freq} \
    --visdom $visdom \
    --visdom_epoch $visdom_epoch \
    --visdom_id "$visdom_id"
fi
(the rest omitted)

Minor mistake?

Hi, I noticed one small thing while running your code. It seems that when you load the data, you sort only by data length, and some audios in wsj0 have exactly the same length, which makes the mix/s1/s2 files not matched.

However, I think it doesn't hurt training since this situation is rare. I ran into the problem because I wanted to parallelize the inference stage over multiple CPUs, and I got different results from the GPU version.

But thanks for your very well-organized code! It is very helpful to me.

Questions regarding the Decoder

Hi,

Thank you for an awesome, well-organized repository.
I have a question regarding the Decoder block. The paper states that they use a 1-D transposed convolution operation for generating the decoder basis functions (see the paper; a screenshot of the relevant passage was attached).

However, I see you use a linear dense layer in the decoder

self.basis_signals = nn.Linear(N, L, bias=False)

Could you explain the reason for this choice?
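
For comparison, a hedged sketch of the transposed-convolution formulation: applying a bias-free Linear(N, L) to each frame and then overlap-adding with a hop of L/2 performs the same operation as a ConvTranspose1d with kernel_size=L and stride=L//2 (the weights are just reshaped/transposed), so the two versions differ mainly in how the overlap-add is written. Shapes below are illustrative.

import torch
import torch.nn as nn

N, L = 256, 20                       # basis count and filter length, as in the results table

# Transposed-convolution decoder: masked encoder output [M, N, K] -> waveform [M, 1, T].
decoder = nn.ConvTranspose1d(N, 1, kernel_size=L, stride=L // 2, bias=False)

masked_repr = torch.randn(4, N, 100)          # dummy [batch, N, frames] input
waveform = decoder(masked_repr)               # shape [4, 1, (100 - 1) * (L // 2) + L]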

Bad performance when using for speech enhancement

Hi, very nice work. I noticed that some people are using Conv-TasNet for speech enhancement and getting good results, but I encountered some problems when using this code for speech enhancement. I am trying to separate clean speech and noise from noisy speech, using the VCTK dataset. The waveforms of the results look very weird (see the attached screenshots).

When I changed the mask activation to sigmoid, the result was still not good.

I wonder if anyone has a thought on how to solve this problem. Thanks in advance!

Different data loading in evaluate.py and separate.py

I want to separate a mixed recording with separate.py, but the separated sounds are noisy.
The mixture was a female-male mix, and I also tested it with evaluate.py; the result was about 14 dB improvement in SDRi.
When I test this mixed file with separate.py and listen to the separated files, I find that they are indeed the separated female and male voices, but both are noisy.
I don't know what the reason for the noise is.

Is this because separate.py and evaluate.py use a different EvalDataLoader and DataLoader?

Thanks for your helpful repo.

Help

Hi, I would like to do speech enhancement with your work. Do you have a model available, and how do I use your code to enhance speech?
Thank you!

Zero mean norm for SDR loss

Hi, I'm preparing a lecture on source separation …. Do you know where the zero_mean norm for the SDR losses comes from and what the intuition behind it is? Was this in the original Conv-TasNet paper?

GPU RAM Issue

In cv you have set cv_maxlen = 6 or 8, so it doesn't take files longer than 6 to 8 seconds. When I tried to change that, I faced RAM issues: training completed, but when validation started I hit an out-of-memory error, so I cropped the wav files to 4 seconds and didn't face any issue.

Can you explain why we run into RAM issues when trying to send more data?
For my project I am working on music data and need to increase the sample rate from 8000 to 22050, i.e. 2.5 times more than your implementation, which forces me to split my data into segments of less than 2 seconds to pass through validation.

Is there no other way to solve this RAM issue? Why is it occupying so much space in RAM?
Thanks in advance.

Missing a specific wsj0 folder designated by mix_2_spk_tr.txt for creating 2-speaker mixtures

I tried to set up the wsj0 data and run Conv-TasNet. I have the wsj0 corpus downloaded from the LDC repository. I followed the instructions in ...../egs/wsj0/run.sh and obtained .wavs successfully converted by sph2pipe. Then I ran create_wav_2speakers.m in MATLAB to get the 2-speaker mixture .wavs, but it failed in the middle because a specific file, ...../wsj0/si_tr_s/401/401c020s.wav, was missing. That .m program reads two file names at a time from mix_2_spk_tr.txt and creates a mixture from them, but the whole directory ..../401/... to which that .wav belongs is missing from the original wsj0 file tree. How can I fix this problem?
