Authors: Yanxin Hu, Yun Liu, Shubo Lv, Mengtao Xing, Shimin Zhang, Yihui Fu, Jian Wu, Bihong Zhang, Lei Xie
huyanxin / deepcomplexcrn
License: Apache License 2.0
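Hi Yanxin, could the model used in the DCCRN paper be released as a pretrained model?
Hello, I am new to speech processing and have recently been trying to reproduce your team's DCCRN model. Using the training parameters given in your paper, I split the audio in the DNS Challenge dataset into 15 s segments for training. When I test the trained model on complete audio files, the results are very good.
To test real-time inference, I then split each audio file into 37.5 ms clips, fed the clips to the network in temporal order, and concatenated the network outputs in the same order to obtain the complete processed audio. For inference I made the following changes and additions to the official model code you provided:
(1) Changed "pad 300 zero samples on both the left and right sides of the STFT" to "pad the left side with the last 300 samples of the previous frame and the right side with 300 zero samples".
(2) Manually update the LSTM states h and c.
(3) Added a buffer to the encoder: the last feature (along the time dimension) of each of the 6 encoder inputs from the previous frame is used as the padding for the 6 encoder layers of the current frame, instead of padding with zeros.
However, inference with these changes produces poor audio, with almost no noise reduction.
Since the network performs well on complete audio files, the training is clearly effective. Because clip-based inference performs poorly, I suspect that some part of inference is still inconsistent with training.
Did I make a mistake while modifying and extending the official model code, or are further changes to the code needed?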
Hi! I think the number of parameters reported in the paper is off.
For example, DCCRN-CL is reported as having 32, 64, 128, 256, 256, 256 filters, each with kernel size 5*2=10, for a total of 3.7M params. But isn't the number of params of the Conv2Ds roughly as follows (ignoring the LSTM)?
   in * ks    * out = total
------+-------+-----+----------
Encoder
    1 * 5 * 2 *  32 =       320
   32 * 5 * 2 *  64 =    20,480
   64 * 5 * 2 * 128 =    81,920
  128 * 5 * 2 * 256 =   327,680
  256 * 5 * 2 * 256 =   655,360
  256 * 5 * 2 * 256 =   655,360
Decoder
  256 * 5 * 2 * 256 =   655,360
  512 * 5 * 2 * 256 = 1,310,720 (has skip conn input)
  512 * 5 * 2 * 256 = 1,310,720 (has skip conn input)
  512 * 5 * 2 * 128 =   655,360 (has skip conn input)
  256 * 5 * 2 *  64 =   163,840 (has skip conn input)
  128 * 5 * 2 *  32 =    40,960 (has skip conn input)
   64 * 5 * 2 *   1 =       640 (has skip conn input)
-------------------------------
                    = 5,878,720 COMPLEX params, i.e. 11,757,440 REAL params
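The arithmetic in the table can be re-checked with a short script. This is only a sketch under the same assumptions as the table: one weight per (input channel, kernel element, output channel), with biases, batch norm and the LSTM ignored. The layer widths are copied from the table rows, not read from the repository code, and the names below are illustrative.

```python
# Re-check the Conv2D weight count from the table (biases, batch norm, LSTM ignored).
KERNEL = (5, 2)  # kernel size quoted in the post

# (in_channels, out_channels) per layer, copied from the table rows.
ENCODER = [(1, 32), (32, 64), (64, 128), (128, 256), (256, 256), (256, 256)]
DECODER = [(256, 256), (512, 256), (512, 256), (512, 128),
           (256, 64), (128, 32), (64, 1)]


def conv_weights(in_ch, out_ch, kernel=KERNEL):
    """Number of weights in one convolution layer, without bias."""
    return in_ch * kernel[0] * kernel[1] * out_ch


encoder_total = sum(conv_weights(i, o) for i, o in ENCODER)
decoder_total = sum(conv_weights(i, o) for i, o in DECODER)
complex_total = encoder_total + decoder_total
real_total = 2 * complex_total  # each complex weight counted as two real numbers

print("encoder:", encoder_total)
print("decoder:", decoder_total)
print("complex params:", complex_total)
print("real params:", real_total)
```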
I can't see it stated in the paper: how many past frames are used? It only says that 6 future frames are used.
Thanks for the great work.
When I tried to convert the model to ONNX to measure execution time, I got an error. It seems ONNX does not support atan2?
How can I solve this problem?
RuntimeError: Exporting the operator atan2 to ONNX opset version 11 is not supported. Please open a bug to request ONNX export support for the missing operator.
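One possible workaround is to express atan2 through operators that are more commonly exportable. The sketch below builds an elementwise atan2 from `torch.atan` and `torch.where`; the helper name `atan2_compat` is hypothetical, and whether every op in it exports cleanly at opset 11 has not been verified in this thread, so treat it as a starting point rather than a confirmed fix.

```python
import math

import torch


def atan2_compat(y: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """Elementwise atan2(y, x) built only from atan, where and comparisons.

    Hypothetical helper: y and x are assumed to be float tensors of equal shape.
    """
    # Avoid dividing by zero; the x == 0 entries are overwritten further down.
    x_safe = torch.where(x == 0, torch.ones_like(x), x)
    angle = torch.atan(y / x_safe)
    # Left half-plane: shift the principal value by +pi or -pi.
    angle = torch.where((x < 0) & (y >= 0), angle + math.pi, angle)
    angle = torch.where((x < 0) & (y < 0), angle - math.pi, angle)
    # On the y axis the angle is +pi/2, -pi/2, or 0 at the origin.
    half_pi = torch.full_like(x, math.pi / 2)
    angle = torch.where((x == 0) & (y > 0), half_pi, angle)
    angle = torch.where((x == 0) & (y < 0), -half_pi, angle)
    angle = torch.where((x == 0) & (y == 0), torch.zeros_like(x), angle)
    return angle


# Compare against the built-in on one point per quadrant and on the axes.
y = torch.tensor([1.0, 1.0, -1.0, -1.0, 1.0, -1.0, 0.0])
x = torch.tensor([1.0, -1.0, -1.0, 1.0, 0.0, 0.0, 0.0])
print(torch.allclose(atan2_compat(y, x), torch.atan2(y, x), atol=1e-6))
```

If this route is taken, the call to `torch.atan2` in the model's masking code would be swapped for the helper before export. The helper treats negative zero like positive zero, so it can differ from `torch.atan2` by a sign at exactly y = -0.0 with x < 0.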
In dc_crn.py, inputs and labels are random tensors.
How can I process sample wav files?
Is it right to change line 305 of dc_crn.py into some kind of wav file read, such as sf.read('./noreverb_fileid_6.wav')[0] in conv_stft.py?
Also, I am unsure which data should be used for 'labels' at line 306 of dc_crn.py.
I'm trying to run the model on the 'ICASSP_blind_test_set' from 'DNS-Challenge':
https://github.com/microsoft/DNS-Challenge/tree/master/datasets/ICASSP_blind_test_set
In addition, is there a pretrained model with which I could reproduce exactly the same results as yours?
I would appreciate your help.
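In complexnn.py:
def forward(self, inputs):
    if isinstance(inputs, list):
        real, imag = inputs
    elif isinstance(inputs, torch.Tensor):
        # Split into real and imaginary halves along the last dimension
        # (the quoted source reads torch.chunk(inputs,-1), which passes -1 as the chunk count).
        real, imag = torch.chunk(inputs, 2, -1)
    # Each real-valued LSTM is called without an explicit initial state.
    r2r_out = self.real_lstm(real)[0]
    r2i_out = self.imag_lstm(real)[0]
    i2r_out = self.real_lstm(imag)[0]
    i2i_out = self.imag_lstm(imag)[0]
    # Combine the four real-valued outputs as a complex multiplication.
    real_out = r2r_out - i2i_out
    imag_out = i2r_out + r2i_out
    if self.projection_dim is not None:
        real_out = self.r_trans(real_out)
        imag_out = self.i_trans(imag_out)
    # print(real_out.shape, imag_out.shape)
    return [real_out, imag_out]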
Why does the LSTM not receive the hidden states as input? Shouldn't it be something like this?
r2r_out, (self.hn, self.cn) = self.real_lstm(real,(self.hn, self.cn))
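For context, `torch.nn.LSTM` initialises its hidden and cell states to zeros when none are passed and carries them across the time steps of a single call, so omitting them is harmless when a whole utterance is processed at once. Explicit states only matter when the sequence is fed in pieces. The sketch below illustrates this with a plain `nn.LSTM`; the layer sizes and variable names are arbitrary assumptions, not the DCCRN configuration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=8, hidden_size=16)  # default layout: (time, batch, feature)
lstm.eval()

x = torch.randn(12, 1, 8)  # 12 frames, batch of 1 (arbitrary sizes)

with torch.no_grad():
    # Whole sequence in one call: states start at zero and are carried internally.
    full_out, _ = lstm(x)

    # Frame-by-frame streaming with the (h, c) state carried between calls.
    state = None
    pieces = []
    for t in range(x.shape[0]):
        out_t, state = lstm(x[t:t + 1], state)
        pieces.append(out_t)
    streamed_out = torch.cat(pieces, dim=0)

    # Frame-by-frame without carrying state: every call restarts from zeros.
    stateless_out = torch.cat(
        [lstm(x[t:t + 1])[0] for t in range(x.shape[0])], dim=0
    )

max_diff_streamed = (full_out - streamed_out).abs().max().item()
max_diff_stateless = (full_out - stateless_out).abs().max().item()
print("carrying state, max difference:", max_diff_streamed)
print("dropping state, max difference:", max_diff_stateless)
```

With the state carried, the streamed output should match the full-sequence output up to floating-point error, while dropping the state generally does not.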
Would you please tell me which kind of PESQ score you are using in the paper? Is it narrowband or wideband PESQ? When I reproduced the code, the results were not what I expected, as I used a different dataset.
Can you share the training log? At epoch 2 the loss value is 12; is that right?
Hello, there are several details in the paper that I do not understand.
I hope to hear back from you.
Hi! Thanks for the paper, it’s a nice read, and congratulations for performing so well in the DNS challenge!
I have a question about the training procedure used for the models that are compared in the paper. Did you train all of the models (except the E-Aug variant) on identical data and with the same number of epochs, augmentation, etc.? In particular, did you use the same training data and process for the CL variant and DCUNet-16, or did you use a version of DCUNet trained on a different dataset? Specifically, did you use RIRs when training DCUNet?
I am asking because I wonder if there is anything inherent to DCUNet that makes it worse for reverberated speech or if the 0.5 PESQ difference is a result of not training on reverberated data.
Did you compare to any other DCUNet variant than DCUNet-16?
Thank you!
Congratulations for performing so well in the DNS challenge!
I have a question about how the STFT is calculated. The STFT by convolution seems to differ from the conventional STFT. I have tested the librosa library and WebRTC, and their results are different.
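In principle, an STFT computed as a strided convolution with fixed, windowed Fourier kernels is the same operation as windowing each frame and taking its DFT. The pure-Python sketch below checks this on a toy signal; the window length, hop and Hann window are arbitrary assumptions and are not the repository's settings. Where libraries disagree, padding or centring, window definition and normalisation are common suspects, though none of that was verified for this issue.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def frames_of(signal, win_len, hop):
    """Slice the signal into overlapping frames; no padding is applied."""
    return [signal[s:s + win_len]
            for s in range(0, len(signal) - win_len + 1, hop)]


def stft_conv(signal, win_len, hop):
    """STFT as strided dot products with fixed kernels (window * Fourier basis)."""
    window = hann(win_len)
    n_bins = win_len // 2 + 1
    kernels = [
        [window[n] * cmath.exp(-2j * math.pi * k * n / win_len)
         for n in range(win_len)]
        for k in range(n_bins)
    ]
    return [[sum(x * w for x, w in zip(frame, kernel)) for kernel in kernels]
            for frame in frames_of(signal, win_len, hop)]


def stft_dft(signal, win_len, hop):
    """Conventional STFT: window each frame, then take its DFT."""
    window = hann(win_len)
    n_bins = win_len // 2 + 1
    spectra = []
    for frame in frames_of(signal, win_len, hop):
        windowed = [x * w for x, w in zip(frame, window)]
        spectra.append([
            sum(windowed[n] * cmath.exp(-2j * math.pi * k * n / win_len)
                for n in range(win_len))
            for k in range(n_bins)
        ])
    return spectra


signal = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(64)]
conv_spec = stft_conv(signal, win_len=16, hop=4)
dft_spec = stft_dft(signal, win_len=16, hop=4)
max_diff = max(abs(p - q)
               for fa, fb in zip(conv_spec, dft_spec)
               for p, q in zip(fa, fb))
print("frames:", len(conv_spec), "bins:", len(conv_spec[0]))
print("max difference:", max_diff)
```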
Congratulations!
I got confused when trying to re-implement your work. Section 3.2 of the paper says 'This eventually leads to 6 frames look-ahead, totally 6 × 6.25 = 37.5 ms'.
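The quoted figure follows from the frame hop. A minimal sketch of the arithmetic, assuming 16 kHz audio and a 100-sample hop (neither value is stated in this post, so both are labelled as assumptions):

```python
SAMPLE_RATE = 16_000   # Hz, assumed
HOP = 100              # samples between consecutive frames, assumed
LOOKAHEAD_FRAMES = 6   # from the sentence quoted from the paper

hop_ms = HOP * 1000 / SAMPLE_RATE          # duration of one hop in milliseconds
lookahead_ms = LOOKAHEAD_FRAMES * hop_ms   # total look-ahead in milliseconds

print("hop:", hop_ms, "ms")
print("look-ahead:", lookahead_ms, "ms")
```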
Which version of Python and which package versions were you using when you worked on this model?
Could you add a requirements.txt file?
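Hello, and thank you for your excellent work!
While reproducing your paper, we trained and tested on 30 s utterances generated from the DNS dataset, but the final results were not very good. What were the length and format of the training and test data used in the paper? Thank you!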
The function name for Hann window has been changed from 'hanning' to 'hann'.
I made a PR for this issue; please check.
The model seems to remove only noise from the samples, not reverberation. Is that right? Can the model remove both noise and reverb?
Hello, I looked at your source code. conv-stft is essentially a conv1d, so are you using a 1-D convolution to emulate the Fourier transform, with all parameters learnable? Does that mean your model is actually still a time-domain model?
The paper claims that one frame takes around 3.12 ms. Where does this number come from? Thanks for sharing your experience.
Which PyTorch and Python versions does the project use?
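Hello, and congratulations to your team on the strong result in the DNS Challenge.
While reproducing your paper, I ran into the following questions and would be grateful for answers:
1. What happens to the dimensions where ComplexEncoder connects to ComplexLSTM?
The final output of ComplexEncoder should be 4-D (batch, channel, H, W), whereas ComplexLSTM expects input of shape (batch, timestep, feature_dim). How are these matched?
2. ComplexLSTM is followed directly by a Dense layer, but the LSTM output is already in complex form. How is the Dense layer matched to it?
3. The output of the Dense layer is also 3-D (batch, timestep, feature_dim). How did your team restore it to 4-D (batch, channel, H, W) to match ComplexDecoder?
I look forward to your answers. Thank you again for your contribution!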
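A common way to bridge a 4-D convolutional feature map and a recurrent layer is to move the time axis to the front and merge channel and frequency into one feature axis, then invert both steps before the decoder. The sketch below shows only this reshaping. The tensor sizes, the (batch, channel, freq, time) layout and the time-first recurrent layout are assumptions and may not match the repository; in a complex network the same reshaping would be applied to the real and imaginary parts separately.

```python
import torch

# Hypothetical sizes: batch, channels, frequency bins, time frames.
batch, channels, n_freq, n_frames = 2, 256, 4, 50
enc_out = torch.randn(batch, channels, n_freq, n_frames)

# (B, C, F, T) -> (T, B, C*F): time first, channel and frequency merged.
lstm_in = enc_out.permute(3, 0, 1, 2).reshape(n_frames, batch, channels * n_freq)

# A recurrent layer and a dense projection back to C*F features would run here;
# an identity placeholder keeps the sketch runnable.
lstm_out = lstm_in

# (T, B, C*F) -> (B, C, F, T): undo the merge, then undo the permute.
dec_in = lstm_out.reshape(n_frames, batch, channels, n_freq).permute(1, 2, 3, 0)

print(tuple(lstm_in.shape), tuple(dec_in.shape))
print(torch.equal(dec_in, enc_out))
```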
It's easy enough to show that the model's output at time t is affected by samples up to t+900 (implying 56.25 ms of anti-causality).
Simply put, if we feed in a signal like [1,1,1,...,1,inf,inf,inf] whose first inf comes at sample N, the output becomes invalid at sample N-900, which implies a 56.25 ms delay.
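import numpy as np
import torch
from dc_crn import DCCRN  # assumption: the model class lives in the repository's dc_crn.py

leading_n = 1000  # hypothetical value; the original post does not define leading_n

net = DCCRN(rnn_units=256,
            masking_mode='E',
            use_clstm=False,
            kernel_num=[32, 64, 128, 256, 256, 256])
# Constant input whose last leading_n samples are set to inf.
canary_input = torch.ones([1, 16000 * 2]).clamp_(-1, 1) * 0.5
canary_input[0, -leading_n:] = np.inf
out_canary = net(canary_input)[1].detach().numpy()
# Index of the first output sample that turned into NaN.
first_invalid = np.where(np.isnan(out_canary))[1][0]
future_effect_size = canary_input[0].shape[0] - leading_n - first_invalid
print('model samples into the future is', future_effect_size)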
I've done some digging and it seems that this has to do with STFT working in windows of 400 and 100 skips.
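The numbers in this report can be related with a little arithmetic. The sketch assumes 16 kHz audio (which is what 900 samples = 56.25 ms implies) and uses the 400-sample window and 100-sample hop mentioned above. Reading 900 as six hops of look-ahead plus a 300-sample window overhang is only one possible accounting that happens to match; it has not been confirmed by the authors.

```python
SAMPLE_RATE = 16_000    # Hz, assumed from 900 samples = 56.25 ms
WIN = 400               # STFT window length in samples, as mentioned in the post
HOP = 100               # STFT hop in samples, as mentioned in the post
LOOKAHEAD_FRAMES = 6    # look-ahead stated in the paper
MEASURED_SAMPLES = 900  # future dependence measured with the inf canary

measured_ms = MEASURED_SAMPLES * 1000 / SAMPLE_RATE

# Hypothetical decomposition: frame look-ahead plus the part of a window
# that extends beyond one hop.
frame_lookahead = LOOKAHEAD_FRAMES * HOP
window_overhang = WIN - HOP
hypothesis_samples = frame_lookahead + window_overhang

print("measured:", measured_ms, "ms")
print("frame look-ahead:", frame_lookahead, "samples")
print("window overhang:", window_overhang, "samples")
print("hypothesis matches measurement:", hypothesis_samples == MEASURED_SAMPLES)
```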
Thank you for posting your model. Could you consider adding a license file? Without it we can only look at the code. See also: https://docs.github.com/en/free-pro-team@latest/github/creating-cloning-and-archiving-repositories/licensing-a-repository
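Hello,
I found that when track_running_stats is set to True in ComplexBatchNorm, memory usage keeps rising during training until memory eventually runs out. Did you not encounter this problem when training DCCRN?
Thank you,
and best wishes!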