
cone-of-silence's People

Contributors

mjenrungrot, vivjay30


cone-of-silence's Issues

dataset used for COS

Hey, I loved your work and was trying to replicate it. To do that I was generating a synthetic dataset, but ran into some errors and doubts.
As you mentioned, the dataset used is VCTK, but it is distributed in .flac format, which the program does not recognize, so did you do any preprocessing on the dataset?
Also, there is no data folder in the original VCTK dataset (mentioned in the command to generate the synthetic dataset).
And can you share the dataset you used for training?
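
For anyone hitting the same format issue, one workaround is to convert the VCTK .flac files to .wav before running the generation script. A minimal sketch using the soundfile library (the directory paths are hypothetical, and this is not necessarily the authors' own preprocessing):

import os
import soundfile as sf

SRC_DIR = "VCTK-Corpus/flac"   # hypothetical location of the downloaded .flac files
DST_DIR = "VCTK-Corpus/wav"    # hypothetical output location for the converted .wav files

for root, _, files in os.walk(SRC_DIR):
    for name in files:
        if not name.endswith(".flac"):
            continue
        audio, sr = sf.read(os.path.join(root, name))  # decode the .flac file
        out_path = os.path.join(DST_DIR, os.path.relpath(root, SRC_DIR),
                                name.replace(".flac", ".wav"))
        os.makedirs(os.path.dirname(out_path), exist_ok=True)
        sf.write(out_path, audio, sr)                   # write .wav at the original sample rate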

No directory 'mir_eval'

Hi, I'm trying to recreate your results. When I run the code I get an error message saying there's no such module 'mir_eval', and I wasn't able to find it. Perhaps you moved the files somewhere else or renamed the directory? The reference is from 'cos/helpers/eval_utils.py':

from mir_eval.separation import bss_eval_sources

Thanks
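
For reference, mir_eval is a third-party Python package (pip install mir_eval) rather than a directory in this repository, so the import resolves once it is installed. A minimal usage sketch of the function referenced in eval_utils.py (the array shapes below are illustrative, not the repo's actual evaluation code):

import numpy as np
from mir_eval.separation import bss_eval_sources

# Illustrative data: (n_sources, n_samples) reference signals and noisy estimates.
reference = np.random.randn(2, 16000)
estimated = reference + 0.05 * np.random.randn(2, 16000)

# Returns SDR, SIR, and SAR per source, plus the best source permutation.
sdr, sir, sar, perm = bss_eval_sources(reference, estimated)
print("SDR per source:", sdr)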

[bug] Duplicate un-normalize in train.py

Hi, vivjay

# Un-normalize
output_signal = output_signal * stds.unsqueeze(3) + means.unsqueeze(3)
# Un-normalize
output_signal = unnormalize_input(output_signal, means, stds)
output_voices = output_signal[:, 0]
loss = model.loss(output_voices, label_voice_signals)

Line 117 and line 121 of train.py do the same thing.
Un-normalizing twice means the test (validation) loss does not decrease,
much like the problem @zhangshengoo asked about in #12 (comment).
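
A minimal sketch of the likely fix, assuming unnormalize_input already applies the stds/means scaling, is to keep only one of the two un-normalization steps:

# Un-normalize exactly once: unnormalize_input already rescales by stds and
# re-adds means, so the explicit "* stds + means" line above it can be dropped.
output_signal = unnormalize_input(output_signal, means, stds)
output_voices = output_signal[:, 0]
loss = model.loss(output_voices, label_voice_signals)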

Sample 4-channel audio files?

Hi, in your README, you mention that "We even provide a sample 4 channel file for you to run". Where is this file located?

Real Data

Hi,

Your paper mentions that you recorded 3 hours of data from the VCTK corpus with the Seeed 4-mic hat.

Would you be willing to make that available?
We'd like to duplicate the results, and we don't get the same level of performance from training on just the synthetic data.

Thanks!!
Richard

Real-time

Hi. I realize you have already answered, in #9, whether or not the algorithm is real-time. I just want to know whether there is anything we can do to make it run in real time?

Two-channel audio recordings?

From what I understand, the model will work for 4-channel and 6-channel wav files.

Does this model also work on 2-channel recordings? Is there a pretrained model for that?

Mean and STD of the signal peak

Hi Vivek,

Thanks for your awesome work.
What is the meaning of FG_VOL_MIN, FG_VOL_MAX, BG_VOL_MIN, and BG_VOL_MAX in generate_dataset.py, and how did you calculate these four values?

Best regards,
KenHuang
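
Not an authoritative answer, but constants named like these are usually the bounds of a random per-clip volume scaling applied to the foreground (target) and background (interfering) signals during synthetic mixing. A minimal sketch of that interpretation (the numeric values and the scaling scheme here are assumptions, not the code in generate_dataset.py):

import numpy as np

# Hypothetical bounds on peak amplitude; the repo's actual values may differ.
FG_VOL_MIN, FG_VOL_MAX = 0.15, 0.4
BG_VOL_MIN, BG_VOL_MAX = 0.2, 0.5

def scale_to_random_peak(signal, vol_min, vol_max, rng=np.random):
    # Rescale a waveform so its absolute peak is drawn uniformly from [vol_min, vol_max].
    target_peak = rng.uniform(vol_min, vol_max)
    return signal * (target_peak / (np.abs(signal).max() + 1e-8))

# fg = scale_to_random_peak(fg, FG_VOL_MIN, FG_VOL_MAX)
# bg = scale_to_random_peak(bg, BG_VOL_MIN, BG_VOL_MAX)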

real-time?

Hi,

Thanks for this library.
Is it possible to use this in a real-time scenario such as conferencing?

Best regards,
Dirk

The effect of model size on the overall performance

Hi Vivek,

Thanks for open-sourcing this interesting project - nice work! I have one question about the model size and performance. I checked your Demucs model implementation and counted the parameters: with your default hyperparameter settings there are over 260M. I'm not sure whether this is the actual setting you used for training, but if so, that is a very large number of parameters, since other separation models are typically smaller than 10M nowadays. Have you done any experiments on how the performance changes if you shrink the model, e.g. to the size of the multi-channel Conv-TasNet or TAC models reported in your paper? Thanks!
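
For anyone wanting to reproduce the parameter count, a short sketch of counting trainable parameters in a PyTorch module (the import path for the repo's network is an assumption and may differ):

import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    # Total number of trainable parameters in a PyTorch module.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Rough usage; the module path and constructor arguments below are assumptions:
# from cos.training.network import CoSNetwork
# print(count_parameters(CoSNetwork()) / 1e6, "M trainable parameters")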
