huyanxin / deepcomplexcrn

License: Apache License 2.0

HTML 47.28% JavaScript 0.06% CSS 12.58% Python 40.08%

deepcomplexcrn's Introduction

Samples from "DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement"

Authors: Yanxin Hu, Yun Liu, Shubo Lv, Mengtao Xing, Shimin Zhang, Yihui Fu, Jian Wu, Bihong Zhang, Lei Xie

Paper: https://arxiv.org/abs/2008.00264

Sample: https://huyanxin.github.io/DeepComplexCRN/

deepcomplexcrn's Issues

Will the pretrained model be released?

Hi Yanxin, could the model used in the DCCRN paper be released as a pretrained model?

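Hello, I have only recently started working on speech processing and am currently reproducing your team's DCCRN model. Using the training parameters given in your paper, I split the audio in the DNS Challenge dataset into 15 s segments for training. When I test the trained model on complete audio files, the results are very good.
To test real-time inference, I then split each audio file into 37.5 ms clips, fed the clips to the network in temporal order, and concatenated the network outputs in the same order to obtain the complete processed audio. For this inference I modified and extended the official model code you provided as follows:
(1) Changed "zero-pad 300 samples on both the left and the right of the STFT input" to "pad the left with the last 300 samples of the previous frame and the right with 300 zeros".
(2) Manually updated the LSTM states h and c.
(3) Added a buffer to the encoder: the last feature (along the time dimension) of the input to each of the 6 encoder layers for the previous frame is used as the padding for the corresponding encoder layer on the current frame (instead of zero padding).
However, with these changes the resulting audio is poor (there is almost no noise reduction).
Because inference on complete audio files works well, the training itself must be effective. Since clip-based inference performs badly, I suspect that some part of the inference path is still inconsistent with training.
Did I make a mistake while modifying and extending your official model code, or are further changes to the code required?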
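A minimal sketch of point (1) above, assuming a 400-sample window, a 100-sample hop and 600-sample (37.5 ms) clips at 16 kHz. It only checks that carrying the last 300 samples of the previous clip reproduces offline framing; it ignores the zero padding at the signal edges and says nothing about the encoder buffers or the LSTM state. The function names are hypothetical and not taken from the repository.

```python
WIN, HOP = 400, 100   # 25 ms window and 6.25 ms hop at 16 kHz
CHUNK = 600           # 37.5 ms clip, as in the question


def frames(x, win=WIN, hop=HOP):
    """Offline framing: every full window, advancing by one hop."""
    return [x[i:i + win] for i in range(0, len(x) - win + 1, hop)]


def stream_frames(x, chunk=CHUNK, win=WIN, hop=HOP):
    """Clip-by-clip framing that keeps win - hop samples of history."""
    carry = []        # tail of the previous clip (empty for the first clip)
    out = []
    for start in range(0, len(x), chunk):
        buf = carry + x[start:start + chunk]
        out.extend(frames(buf, win, hop))
        carry = buf[-(win - hop):]   # last 300 samples feed the next clip
    return out


signal = list(range(3000))
offline = frames(signal)
streamed = stream_frames(signal)
print(len(offline), len(streamed), offline == streamed)
```

If the two framings agree but the enhanced audio still differs, the mismatch is more likely to sit in the recurrent state or the convolution padding than in the STFT input.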

Number of parameters reported in paper

Hi! I think the number of parameters reported in the paper is off.

For example, DCCRN-CL is reported as 32, 64, 128, 256, 256, 256 filters, each with kernel size 5*2=10, for a total of 3.7M params. But isn't the number of params of the Conv2Ds roughly as follows? (ignoring LSTM)

  in  *   ks  * out =     total
------+-------+-----+----------
Encoder
    1 * 5 * 2 *  32 =       320
   32 * 5 * 2 *  64 =    20,480
   64 * 5 * 2 * 128 =    81,920
  128 * 5 * 2 * 256 =   327,680
  256 * 5 * 2 * 256 =   655,360
  256 * 5 * 2 * 256 =   655,360
Decoder
  256 * 5 * 2 * 256 =   655,360
  512 * 5 * 2 * 256 = 1,310,720  (has skip conn input)
  512 * 5 * 2 * 256 = 1,310,720  (has skip conn input)
  512 * 5 * 2 * 128 =   655,360  (has skip conn input)
  256 * 5 * 2 *  64 =   163,840  (has skip conn input)
  128 * 5 * 2 *  32 =    40,960  (has skip conn input)
   64 * 5 * 2 *   1 =       640  (has skip conn input)
-------------------------------
              total = 5,878,720 COMPLEX params, i.e. 11,757,440 REAL params
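The arithmetic in the table can be re-checked with a short script. This is only a sketch of the poster's own estimate: the layer list is copied from the table as written, biases, batch norm and the LSTM are ignored, and the variable names are hypothetical.

```python
KERNEL = 5 * 2  # kernel size used in the table

# (in_channels, out_channels) per layer, copied from the table above
encoder = [(1, 32), (32, 64), (64, 128), (128, 256), (256, 256), (256, 256)]
decoder = [(256, 256), (512, 256), (512, 256), (512, 128),
           (256, 64), (128, 32), (64, 1)]


def conv_weights(layers, kernel=KERNEL):
    """Sum of in * kernel * out over all layers (weights only, no bias)."""
    return sum(c_in * kernel * c_out for c_in, c_out in layers)


complex_params = conv_weights(encoder) + conv_weights(decoder)
real_params = 2 * complex_params  # one real and one imaginary weight each

print(f"{complex_params:,} complex, {real_params:,} real")
```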

Number of past frames?

I can't see it stated in the paper: how many past frames are used? It only says that 6 future frames are used.

convert to onnx error

Thanks for the great work.
When I tried to convert the model to onnx to measure exec time, there is an error. It seems onnx does not support atan2?
How to solve this problem?

RuntimeError: Exporting the operator atan2 to ONNX opset version 11 is not supported. Please open a bug to request ONNX export support for the missing operator.
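One workaround sometimes used when an exporter lacks atan2 is to rebuild it from atan plus quadrant corrections. The sketch below shows that identity in plain Python with a hypothetical helper name. In the model the same logic would have to be written with tensor operations (for example torch.atan combined with torch.where), and whether that exports cleanly for a given opset is an assumption to verify rather than something confirmed here.

```python
import math


def atan2_via_atan(y, x):
    """atan2 built from atan and quadrant corrections (scalar sketch)."""
    if x > 0:
        return math.atan(y / x)
    if x < 0:
        # Shift by +pi or -pi to land in the second or third quadrant
        return math.atan(y / x) + (math.pi if y >= 0 else -math.pi)
    # x == 0: the angle is +-pi/2, or 0 by convention at the origin
    if y > 0:
        return math.pi / 2
    if y < 0:
        return -math.pi / 2
    return 0.0


for y, x in [(1, 1), (1, -1), (-1, -1), (-1, 1), (1, 0), (-1, 0)]:
    print((y, x), atan2_via_atan(y, x), math.atan2(y, x))
```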

While studying, I'd like to reproduce your code. Please could you help

In dc_crn.py, inputs and labels are random tensors.
How can I process sample wav files instead?
Is it right to change line 305 of dc_crn.py into some kind of wav file read, such as the sf.read('./noreverb_fileid_6.wav')[0] used in conv_stft.py?

I am also unsure which data should be used as 'labels' in line 306 of dc_crn.py.
I'm trying to apply the 'ICASSP_blind_test_set' from 'DNS-Challenge':
https://github.com/microsoft/DNS-Challenge/tree/master/datasets/ICASSP_blind_test_set

In addition, is there a pretrained model with which I could reproduce exactly the same results as yours?

I would be grateful for your help.

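Question about the formula used to compute mask_phase

In the paper's formula, mask_phase is computed by taking the arctan of the network output directly. Why does the code implementation add an extra step that first computes real_phase and imag_phase?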
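A minimal scalar sketch of the two variants, assuming the extra step divides the real and imaginary mask by the mask magnitude before the arctangent. The names mask_real, mask_imag and the constant EPS are hypothetical; only real_phase and imag_phase come from the question. Dividing both components by the same positive number leaves the angle unchanged, so under this assumption both variants should give the same mask_phase up to rounding.

```python
import math

EPS = 1e-8  # assumed small constant guarding against division by zero


def phase_direct(mask_real, mask_imag):
    """Phase taken straight from the network outputs."""
    return math.atan2(mask_imag, mask_real)


def phase_normalised(mask_real, mask_imag):
    """Phase taken after scaling both parts by the mask magnitude."""
    mask_mag = math.sqrt(mask_real ** 2 + mask_imag ** 2)
    real_phase = mask_real / (mask_mag + EPS)
    imag_phase = mask_imag / (mask_mag + EPS)
    return math.atan2(imag_phase, real_phase)


for re, im in [(0.7, 0.2), (-1.5, 0.4), (-0.3, -2.0), (3.0, -0.1)]:
    print(phase_direct(re, im), phase_normalised(re, im))
```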

Why is LSTM not receiving hidden states as input?

in complexnn.py:

    def forward(self, inputs):
        if isinstance(inputs,list):
            real, imag = inputs 
        elif isinstance(inputs, torch.Tensor):
            real, imag = torch.chunk(inputs,-1)
        r2r_out = self.real_lstm(real)[0]
        r2i_out = self.imag_lstm(real)[0]
        i2r_out = self.real_lstm(imag)[0]
        i2i_out = self.imag_lstm(imag)[0]
        real_out = r2r_out - i2i_out
        imag_out = i2r_out + r2i_out 
        if self.projection_dim is not None:
            real_out = self.r_trans(real_out)
            imag_out = self.i_trans(imag_out)
        #print(real_out.shape,imag_out.shape)
        return [real_out, imag_out]

Why is lstm not receiving the hidden states as input? Shouldn't it be something like this?:
r2r_out, (self.hn, self.cn) = self.real_lstm(real,(self.hn, self.cn))
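The effect of passing state can be illustrated without the model. In the sketch below a toy one-unit recurrence stands in for an LSTM (an assumption made to keep the example dependency-free): processing a sequence in chunks matches full-sequence processing only when the state returned by one chunk is fed into the next. When a whole utterance is passed in a single call, as in training on full segments, starting from the default zero state is equivalent, which may be why the forward above does not take a state argument.

```python
def run_cell(xs, h=0.0, a=0.9):
    """Toy recurrence h_t = a * h_{t-1} + (1 - a) * x_t; returns outputs and final state."""
    ys = []
    for x in xs:
        h = a * h + (1 - a) * x
        ys.append(h)
    return ys, h


sequence = [float(i % 7) for i in range(30)]
full, _ = run_cell(sequence)

# Chunked processing with the state carried from chunk to chunk
carried, state = [], 0.0
for start in range(0, len(sequence), 10):
    ys, state = run_cell(sequence[start:start + 10], h=state)
    carried.extend(ys)

# Chunked processing that resets the state for every chunk
reset = []
for start in range(0, len(sequence), 10):
    ys, _ = run_cell(sequence[start:start + 10])
    reset.extend(ys)

print(carried == full, reset == full)
```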

Regarding PESQ score

Could you please tell me which kind of PESQ score is used in the paper: narrowband PESQ or wideband PESQ? I ask because I use a different dataset, and the results I get when reproducing the code are not what I expected.

training loss log

Can you share the training log? At epoch 2 the loss value is 12; is that right?

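Questions about window length and DCCRN-CL

Hello, there are a few details in the paper that I do not understand.

The paper says the window length is 25 ms with fs = 16 kHz, so each window contains 400 samples, and a 512-point FFT is applied.
Are the 400 samples zero-padded to 512 points before the FFT? If so, why not choose a longer window (for example 32 ms, i.e. 512 samples, with no zero padding needed)?
Does "DCCRN-CL (masking like DCCRN-E)" mean that the mask is computed according to Eq. (8), so that the DCCRN-CL mask is restricted to the range 0 to 1 while the DCCRN-C mask is unbounded?
"The direct current component of all these models is removed." What does the direct current component of the model refer to, and what is the purpose of removing it?

I look forward to your reply.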
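The numbers in the first question can be worked through as follows. This is a sketch under the figures quoted above (25 ms window, 6.25 ms hop, 16 kHz, 512-point FFT); reading the "direct current component" as frequency bin 0 is an assumption, not something the paper excerpt confirms.

```python
FS = 16000     # sampling rate in Hz
WIN_MS = 25    # window length in ms
HOP_MS = 6.25  # hop length in ms
N_FFT = 512    # FFT size

win_len = FS * WIN_MS // 1000       # samples per window
hop_len = int(FS * HOP_MS / 1000)   # samples per hop
zero_pad = N_FFT - win_len          # zeros appended before the FFT
n_bins = N_FFT // 2 + 1             # one-sided spectrum, bins 0 .. N_FFT/2
bin_hz = FS / N_FFT                 # spacing between adjacent bins

# Assumption: "direct current component" means bin 0 (0 Hz);
# dropping it leaves n_bins - 1 bins per frame.
bins_without_dc = n_bins - 1

print(win_len, hop_len, zero_pad, n_bins, bin_hz, bins_without_dc)
```

Zero padding from 400 to 512 samples samples the same windowed spectrum on a finer frequency grid; it does not add frequency resolution beyond what the 25 ms window provides.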

Question about benchmarks in paper

Hi! Thanks for the paper, it’s a nice read, and congratulations for performing so well in the DNS challenge!

I have a question about the training procedure used for the models that are being compared in the paper. Did you train all of the models (except the E-Aug variant) on identical data and with same number of epochs, augmentation, etc? In particular, did you use the same training data and process for the CL variant and DCUNet-16 or did you use a version of DCUNet trained on a different dataset? Specifically, did you use RIRs when training DCUNet?

I am asking because I wonder if there is anything inherent to DCUNet that makes it worse for reverberated speech or if the 0.5 PESQ difference is a result of not training on reverberated data.

Did you compare to any other DCUNet variant than DCUNet-16?

Thank you!

train loss decrease. dev loss increase

Are the mathematical theory and the implementation of complex conv2d consistent?

Question about conv_stft

Congratulations for performing so well in the DNS challenge!

I have a question about how the STFT is calculated. The STFT computed by convolution seems to differ from a conventional STFT: I compared it against the librosa library and WebRTC, and the results are different.
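For reference, a strided 1-D convolution whose kernels are windowed cosine and sine functions computes the same values as a frame-by-frame DFT. The pure-Python sketch below checks this on a tiny example; all function names are hypothetical, the Hann window is an arbitrary choice, and it does not reproduce the repository's conv_stft.py. Differences against other libraries may come from padding, window type or normalisation rather than from the transform itself.

```python
import cmath
import math


def hann(n_win):
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_win) for n in range(n_win)]


def fourier_kernels(n_fft, window):
    """Fixed convolution weights: one windowed cos / -sin filter per bin."""
    bins = range(n_fft // 2 + 1)
    idx = range(len(window))
    real = [[window[n] * math.cos(2 * math.pi * k * n / n_fft) for n in idx] for k in bins]
    imag = [[-window[n] * math.sin(2 * math.pi * k * n / n_fft) for n in idx] for k in bins]
    return real, imag


def conv_stft(x, n_fft, hop, window):
    """STFT as a strided correlation of the signal with the fixed kernels."""
    real_k, imag_k = fourier_kernels(n_fft, window)
    win = len(window)
    out = []
    for start in range(0, len(x) - win + 1, hop):  # stride equals the hop
        seg = x[start:start + win]
        out.append([complex(sum(w * s for w, s in zip(rk, seg)),
                            sum(w * s for w, s in zip(ik, seg)))
                    for rk, ik in zip(real_k, imag_k)])
    return out


def dft_stft(x, n_fft, hop, window):
    """Reference: window each frame, then evaluate the DFT sum directly."""
    win = len(window)
    out = []
    for start in range(0, len(x) - win + 1, hop):
        seg = [w * s for w, s in zip(window, x[start:start + win])]
        out.append([sum(seg[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                        for n in range(win))
                    for k in range(n_fft // 2 + 1)])
    return out


signal = [math.sin(0.3 * n) for n in range(20)]
window = hann(6)  # window shorter than n_fft, i.e. implicit zero padding
a = conv_stft(signal, 8, 2, window)
b = dft_stft(signal, 8, 2, window)
max_diff = max(abs(p - q) for fa, fb in zip(a, b) for p, q in zip(fa, fb))
print(len(a), max_diff)
```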

Inquiry about the window size setting of the model

Congratulations!

I got confused when trying to re-implement your work. Section 3.2 of the paper says 'This eventually leads to 6 frames look-ahead, totally 6 × 6.25 = 37.5 ms'.

Does a 'look-ahead of 37.5 ms' mean that one training example has a length of 25 + 6 × 6.25 = 62.5 ms, i.e. an input of shape [1, 1000] at a sampling rate of 16 kHz?
Is it correct to truncate the output to the current frame (without look-ahead) before calculating the SI-SNR loss?

Request for requirements.txt

Which version of Python and which package versions were you using when you worked on this model?

Could you add a requirements.txt file?

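About the length of the training and test data

Hello, and thank you for your excellent work!
While reproducing your paper, we trained and tested on 30 s utterances generated from the DNS dataset, but the final results are not very good. Could you tell us the length and format of the training and test data used in the paper? Thank you!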

Error with function name 'hanning'

The name of the Hann window function has been changed from 'hanning' to 'hann'.

I opened a PR for this issue; please check it.

Samples on reverberation set

On the samples, the model seems to remove only noise but not reverberation. Is that right? Can the model remove both noise and reverb?

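Question about conv-stft

Hello, I read your source code. conv-stft is essentially a conv1d, so are you using a 1-D convolution to emulate the Fourier transform, with all of its parameters learnable? Does that mean your model is in fact still a time-domain model?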

Test time over CPU

The paper claims that processing one frame takes around 3.12 ms. Where does this number come from? Thanks for sharing your experience.

Environment requirements

Which PyTorch and Python versions does the project use?

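Questions about ComplexLSTM and Dense

Hello, and congratulations to your team on the good results in the DNS Challenge.
While reproducing your paper I ran into the following questions and would be very grateful for answers:

1. The dimension mismatch when connecting ComplexEncoder to ComplexLSTM.
The final output of ComplexEncoder should be 4-dimensional (batch, channel, H, W), while ComplexLSTM expects input of shape (batch, timestep, feature_dim). How are the two matched?

2. ComplexLSTM is followed directly by a Dense layer, but the LSTM output is already in complex form. How is the Dense layer matched to it?

3. The output of the Dense layer is also 3-dimensional (batch, timestep, feature_dim). How do you restore it to 4 dimensions (batch, channel, H, W) to match ComplexDecoder?

I look forward to your answers, and thank you again for your team's contribution!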

Model is affected by 56.25 msec into the future, not 37.5 as stated in paper

It's easy enough to show that the model's output at time t is affected by samples up to t+900 (implying 56.25 msec of anti-causality).

Simply put, if we feed in a signal like [1,1,1...,1,inf,inf,inf] and the first inf comes at sample N, the output becomes invalid from sample N-900 onwards, which implies a 56.25 msec look-ahead.

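    import numpy as np
    import torch
    from dc_crn import DCCRN  # assumes the repository's dc_crn.py is importable

    leading_n = 1600  # hypothetical value: count of trailing inf samples (not given in the original post)
    net = DCCRN(rnn_units=256,
                masking_mode='E',
                use_clstm=False,
                kernel_num=[32, 64, 128, 256, 256, 256])
    net.eval()  # assumption: inference mode, so batch norm uses running statistics
    # Constant input whose last leading_n samples are replaced by inf
    canary_input = torch.ones([1, 16000*2]).clamp_(-1, 1) * 0.5
    canary_input[0, -leading_n:] = np.inf
    out_canary = net(canary_input)[1].detach().numpy()
    # Index of the first output sample contaminated by the inf region
    first_invalid = np.where(np.isnan(out_canary))[1][0]
    future_effect_size = canary_input[0].shape[0] - leading_n - first_invalid
    print('model samples into the future is', future_effect_size)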

I've done some digging and it seems that this has to do with STFT working in windows of 400 and 100 skips.
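The two figures can be put side by side with a little arithmetic. The sketch below assumes a 16 kHz sampling rate, the 400-sample window and the 100-sample hop mentioned above; the observation that the gap equals the window length minus the hop is offered as one possible accounting, not as a confirmed explanation.

```python
FS = 16000            # sampling rate in Hz
WIN, HOP = 400, 100   # STFT window and hop in samples


def samples_to_ms(n, fs=FS):
    return 1000.0 * n / fs


paper_lookahead = 6 * HOP          # 6 future frames counted in hops
observed_lookahead = 900           # value reported by the canary test
gap = observed_lookahead - paper_lookahead

print(samples_to_ms(paper_lookahead))     # look-ahead as stated in the paper
print(samples_to_ms(observed_lookahead))  # look-ahead measured in samples
print(gap, WIN - HOP)                     # the gap matches window minus hop
```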

No License file

Thank you for posting your model. Could you consider adding a license file? Without it we can only look at the code. See also: https://docs.github.com/en/free-pro-team@latest/github/creating-cloning-and-archiving-repositories/licensing-a-repository

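Memory usage of ComplexBatchNorm keeps growing during training

Hello,
I found that when track_running_stats is set to True in ComplexBatchNorm, memory usage keeps rising during training until the process eventually runs out of memory. Did you not run into this problem when training DCCRN?
Thank you.
Best regards!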