jusperlee / looking-to-listen-at-the-cocktail-party

Runnable code based on Google's paper

License: MIT License

Python 100.00%

cocktail-party audio speech-separation facenet

looking-to-listen-at-the-cocktail-party's Introduction

Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation

This project reproduces the audio-visual model described in the paper Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation.

Ephrat A, Mosseri I, Lang O, et al. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation[J]. arXiv preprint arXiv:1804.03619, 2018.

Requirements

Python 3.7
TensorFlow 2.0.0
Keras 2.3.1
librosa 0.7.0
youtube-dl (https://github.com/ytdl-org/youtube-dl) (any version)
ffmpeg (https://www.ffmpeg.org/) (any version)
sox

To install requirements:

pip install -r requirements.txt

You can install ffmpeg and sox using Homebrew:

brew install ffmpeg
brew install sox

Preprocessing

Video Data

Download the dataset from here and place the files in data/csv.
Then run the following command to download the YouTube videos and use ffmpeg to split each 3-second clip into 75 images.

python3 video_download.py
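
The frame-extraction step can be sketched roughly as follows. Splitting a 3-second clip into 75 images corresponds to 25 frames per second. The helper name, the output file pattern and the start-offset argument are assumptions made for illustration; they are not taken from video_download.py.

```python
def build_frame_command(video_path, out_dir, start_sec=0.0, duration=3.0, fps=25):
    """Build an ffmpeg command that dumps duration*fps frames as JPEGs (hypothetical helper)."""
    return [
        "ffmpeg",
        "-ss", str(start_sec),        # seek to the start of the clip
        "-t", str(duration),          # keep only `duration` seconds
        "-i", video_path,
        "-vf", f"fps={fps}",          # resample to a fixed frame rate
        f"{out_dir}/frame_%02d.jpg",  # numbered output images
    ]

cmd = build_frame_command("clip.mp4", "frames")
print(" ".join(cmd))
```

The list form can be passed directly to subprocess.run, which avoids shell quoting problems with unusual file names.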

Next, use MTCNN to detect the face bounding boxes in each image, then use the x, y values from the CSV to locate the center point of the target face.

pip install mtcnn
python3 face_detected.py
python3 check_vaild_face.py
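
One way to combine the detector output with the CSV annotation is sketched below. It assumes the CSV x, y values are normalized to the range 0 to 1 and that each detected box is given as (x, y, width, height) in pixels; both are assumptions for this illustration, and pick_face_box is a hypothetical helper, not a function from face_detected.py.

```python
def pick_face_box(boxes, x_norm, y_norm, frame_w, frame_h):
    """Return the (x, y, w, h) box containing the annotated face center, or None."""
    cx, cy = x_norm * frame_w, y_norm * frame_h  # normalized -> pixel coordinates
    for (x, y, w, h) in boxes:
        if x <= cx <= x + w and y <= cy <= y + h:
            return (x, y, w, h)
    return None  # no detection covers the annotated point

boxes = [(10, 20, 50, 60), (200, 40, 80, 90)]
print(pick_face_box(boxes, 0.75, 0.25, 320, 240))
```

Frames for which no box contains the annotated point return None, so they can be flagged as invalid rather than silently cropped to the wrong face.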

Audio Data

For the audio, use the YouTube download tool to download the audio track, then resample it to 16,000 Hz with the librosa library. Finally, normalize the audio data.

python3 audio_downloads.py
python3 audio_norm.py # audio_data normalized
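
As a rough illustration of the normalization step, the sketch below applies peak normalization. This is an assumption: audio_norm.py may use a different scheme, and peak_normalize is a hypothetical name.

```python
import numpy as np

def peak_normalize(audio, eps=1e-8):
    """Scale a waveform so its largest absolute sample is 1 (silence is left unchanged)."""
    peak = np.max(np.abs(audio))
    return audio / peak if peak > eps else audio

# In practice the waveform would come from something like
# librosa.load(path, sr=16000), which resamples to 16 kHz on load.
wave = np.array([0.1, -0.5, 0.25])
print(peak_normalize(wave))
```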

Preprocess the audio data: STFT, power-law compression, mixing, generating complex masks, and so on.

python3 audio_data.py
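
The mixing and masking steps can be sketched on toy spectrogram values as below. The function names are hypothetical, the exponent p=0.3 follows the paper, and applying it to the magnitude while keeping the phase is an assumption; audio_data.py may differ in these details.

```python
import numpy as np

def power_law(spec, p=0.3):
    """Compress the magnitude while keeping the phase: |z|**p * exp(j*angle(z))."""
    return np.abs(spec) ** p * np.exp(1j * np.angle(spec))

def complex_ratio_mask(clean, mix, eps=1e-8):
    """Mask M such that M * mix ~= clean (element-wise complex division)."""
    return clean * np.conj(mix) / (np.abs(mix) ** 2 + eps)

clean_a = np.array([1 + 1j, 0.5 - 2j])
clean_b = np.array([0.3 - 0.2j, -1 + 0.5j])
mix = clean_a + clean_b                    # mixing: sum the two sources
mask_a = complex_ratio_mask(clean_a, mix)
print(np.allclose(mask_a * mix, clean_a, atol=1e-6))
```

Multiplying the mixture by the mask recovers the clean source, which is the property the network is trained to approximate.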

Face Embedding Features

Here we use Google's FaceNet method to map face images to a high-dimensional Euclidean space. In this project, we use David Sandberg's open-source pre-trained FaceNet model "20180402-114759", then convert it with the TensorFlow_to_Keras script in this project (Model/face_embedding/).

Schroff F, Kalenichenko D, Philbin J. Facenet: A unified embedding for face recognition and clustering[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2015: 815-823.

Set the tf_model_dir path in Tensorflow_to_Keras.py, then run:

python3 Tensorflow_to_Keras.py
python3 face_emb.py

Create AVdataset_train.txt and AVdataset_val.txt

python3 AV_data_log.py

Training

Supports resuming training after an interruption.
Supports multi-GPU, multi-process training.
Set the following parameters according to the description in the paper:

people_num = 2 # How many people you want to separate?
epochs = 100
initial_epoch = 0
batch_size = 1 # 2 or 4 requires a GPU
gamma_loss = 0.1
beta_loss = gamma_loss * 2

Then run the script train.py to train the model.

Planned work

PyTorch implementation
Provide a trained model
Improve code style
...

Parts of the code are adapted from https://github.com/bill9800/speech_separation


looking-to-listen-at-the-cocktail-party's Issues

The voice generation after STFT in AO_model is not 298*257*2. Why are the numbers in the first column different?

error in test.py file

@JusperLee Hi, while running python3 test.py I get the following error:

python3 test.py
Using TensorFlow backend.
Initialing Parameters......
Loading data ......
2020-03-27 23:11:00.256203: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-03-27 23:11:00.284390: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 1800000000 Hz
2020-03-27 23:11:00.285014: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55b3d04ca750 executing computations on platform Host. Devices:
2020-03-27 23:11:00.285049: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): ,
2020-03-27 23:11:00.302032: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
Traceback (most recent call last):
File "test.py", line 51, in <module>
av_model = load_model(model_path,custom_objects={'tf':tf})
File "/home/lenovo/.local/lib/python3.6/site-packages/keras/engine/saving.py", line 492, in load_wrapper
return load_function(*args, **kwargs)
File "/home/lenovo/.local/lib/python3.6/site-packages/keras/engine/saving.py", line 584, in load_model
model = _deserialize_model(h5dict, custom_objects, compile)
File "/home/lenovo/.local/lib/python3.6/site-packages/keras/engine/saving.py", line 369, in _deserialize_model
sample_weight_mode=sample_weight_mode)
File "/home/lenovo/.local/lib/python3.6/site-packages/keras/engine/training.py", line 119, in compile
self.loss, self.output_names)
File "/home/lenovo/.local/lib/python3.6/site-packages/keras/engine/training_utils.py", line 822, in prepare_loss_functions
loss_functions = [get_loss_function(loss) for _ in output_names]
File "/home/lenovo/.local/lib/python3.6/site-packages/keras/engine/training_utils.py", line 822, in <listcomp>
loss_functions = [get_loss_function(loss) for _ in output_names]
File "/home/lenovo/.local/lib/python3.6/site-packages/keras/engine/training_utils.py", line 705, in get_loss_function
loss_fn = losses.get(loss)
File "/home/lenovo/.local/lib/python3.6/site-packages/keras/losses.py", line 795, in get
return deserialize(identifier)
File "/home/lenovo/.local/lib/python3.6/site-packages/keras/losses.py", line 776, in deserialize
printable_module_name='loss function')
File "/home/lenovo/.local/lib/python3.6/site-packages/keras/utils/generic_utils.py", line 167, in deserialize_keras_object
':' + function_name)
ValueError: Unknown loss function:loss_func

Can you please help me out?
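
The traceback fails inside compile while Keras tries to rebuild the training loss, and loss_func is a custom function that the loader does not know about. A possible workaround for inference, sketched under that assumption, is to load with compile=False, which is a documented argument of Keras load_model. The wrapper name load_for_inference is hypothetical, and the loader is passed in so the sketch runs without Keras installed.

```python
def load_for_inference(load_model, model_path, custom_objects=None):
    """Load a saved Keras model for prediction only (hypothetical helper).

    compile=False asks Keras not to rebuild the optimizer and loss, so the
    custom `loss_func` does not have to be resolvable at load time.
    """
    return load_model(model_path,
                      custom_objects=dict(custom_objects or {}),
                      compile=False)

# Intended use (not run here; needs Keras and a trained checkpoint):
#   import tensorflow as tf
#   from keras.models import load_model
#   av_model = load_for_inference(load_model, "AVmodel-2p-001.h5", {"tf": tf})
```

An alternative is to keep the default compile=True and add the training loss to custom_objects under the key 'loss_func', which requires importing the same function that train.py used.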

How to use test.py

I managed to train the model and get AVmodel-2p-001.h5,
but when I tried to run test.py I hit a strange bug.
The traceback is below:

Traceback (most recent call last):
  File "test.py", line 53, in <module>
    av_model = load_model(model_path,custom_objects={'tf':tf})
  File "/home/gz/anaconda3/envs/tf2/lib/python3.7/site-packages/keras/engine/saving.py", line 492, in load_wrapper
    return load_function(*args, **kwargs)
  File "/home/gz/anaconda3/envs/tf2/lib/python3.7/site-packages/keras/engine/saving.py", line 584, in load_model
    model = _deserialize_model(h5dict, custom_objects, compile)
  File "/home/gz/anaconda3/envs/tf2/lib/python3.7/site-packages/keras/engine/saving.py", line 369, in _deserialize_model
    sample_weight_mode=sample_weight_mode)
  File "/home/gz/anaconda3/envs/tf2/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 75, in symbolic_fn_wrapper
    return func(*args, **kwargs)
  File "/home/gz/anaconda3/envs/tf2/lib/python3.7/site-packages/keras/engine/training.py", line 119, in compile
    self.loss, self.output_names)
  File "/home/gz/anaconda3/envs/tf2/lib/python3.7/site-packages/keras/engine/training_utils.py", line 822, in prepare_loss_functions
    loss_functions = [get_loss_function(loss) for _ in output_names]
  File "/home/gz/anaconda3/envs/tf2/lib/python3.7/site-packages/keras/engine/training_utils.py", line 822, in <listcomp>
    loss_functions = [get_loss_function(loss) for _ in output_names]
  File "/home/gz/anaconda3/envs/tf2/lib/python3.7/site-packages/keras/engine/training_utils.py", line 705, in get_loss_function
    loss_fn = losses.get(loss)
  File "/home/gz/anaconda3/envs/tf2/lib/python3.7/site-packages/keras/losses.py", line 795, in get
    return deserialize(identifier)
  File "/home/gz/anaconda3/envs/tf2/lib/python3.7/site-packages/keras/losses.py", line 776, in deserialize
    printable_module_name='loss function')
  File "/home/gz/anaconda3/envs/tf2/lib/python3.7/site-packages/keras/utils/generic_utils.py", line 167, in deserialize_keras_object
    ':' + function_name)
ValueError: Unknown loss function:loss_func

Could you please give me some help?
Since you are using Baidu Netdisk (百度网盘), I guess we are both Chinese,
so it may be more convenient to describe it in Chinese.
After training for a few epochs I got some h5 weight files. When I tried to run test.py I hit the error above. I used grep to search the code and found a function called loss_func. Is the problem in av_model = load_model(model_path,custom_objects={'tf':tf})?
Why does train.py use AV_model = AV.AV_model(people_num), while test.py uses av_model = load_model(model_path,custom_objects={'tf':tf})?
How did you run test.py in the end?

How to get dataset_train.txt?

No such file or directory: './AV_model_database/dataset_train.txt'.

X, Y coordinates will obviously not be present in test videos

Hi,
I was just wondering why we need the X, Y coordinates as input at all.
They will surely not be present in test files if we are testing in the wild.

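Hello, author

Hello. While implementing the code recently, I ran into problems downloading the dataset. Some of the YouTube videos cannot be downloaded, and for others the video downloads but the corresponding audio does not, so I cannot start training the network. Could you share a trained model? It does not need to be the most accurate one; I just want to run a demo and see how the model performs.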

Result of test.py file

@JusperLee After running test.py, the pred folder is created and .wav files are generated in it, but all of the .wav files are silent. Why is that?

Can face embeddings be provided in this repo?

The current test script requires face embeddings to run, which means downloading the videos and audio and doing heavy preprocessing. Can you provide the face embeddings so that the inference demo can be run directly?

Hi, when I run the test I get an error: Unknown loss function: loss_func. What is going on, and can it be solved?

Typo error

Hi,
you should correct this to fast_istft here:

Looking-to-Listen-at-the-Cocktail-Party/test.py

Line 67 in cb973ae

T = utils.fase_istft(F,power=False)

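Is the cRM not compressed to the 0-1 range with a sigmoid?

Hello, and first of all thank you for reproducing the paper. While reading the paper I noticed that the authors mention:

Real and imaginary parts of the complex mask will typically lie between -1 and 1, however, we use sigmoidal compression to bound these complex mask values between 0 and 1.

However, I do not see this part in your code: there is tanh compression, but no sigmoid that compresses the cRM values to the 0-1 range. Did you find that this worked poorly, or is there another reason? Many thanks.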

Operands error

@JusperLee Could you please share the solution for that last issue? I'm very new to this.

Hello, will there be pre-trained models?

istft error in model/utils/utils.py

While running test.py, an error occurs:

In line 25 of utils.py:
Total = np.zeros((windows * step + fft_size))
with windows of shape (512,), step 160 and fft_size 512,
an error is raised since the maximum number of dimensions for np.zeros is 32.

I would suggest Total = np.zeros((windows.shape[0] * step + fft_size)) to make it one-dimensional, but then another error occurs:
Total[start:end] = Total[start:end] + data[i:] * windows,
ValueError: operands could not be broadcast together with shapes (298,257) (512,)
since data[i:] is clearly two-dimensional.

I don't understand the logic here: is this a bug, or did I just mess up?
If it's a bug, can you please fix it?

Thanks

Size of num_gpu while training AVmodel?

Fit_generator() error in tensorflow 2.4.0 but not in 2.0.0

Thank you for your work. I tried to run the project with TensorFlow 2.4.0, as listed in requirements.txt, but I get the following traceback when fitting:

result=AV_model.fit_generator(generator=train_generator,
                                 validation_data=val_generator,
                                 epochs=epochs,
                                 workers=workers,
                                 use_multiprocessing=MultiProcess,
                                 callbacks=[TensorBoard(log_dir='./log'), checkpoint, rlr],
                                 initial_epoch=0
                                 )

UserWarning: `Model.fit_generator` is deprecated and will be removed in a future version. Please use `Model.fit`, which supports generators.
  warnings.warn('`Model.fit_generator` is deprecated and '
Traceback (most recent call last):
  File "D:\Looking-to-Listen-at-the-Cocktail-Party-master\train.py", line 138, in <module>
    initial_epoch=initial_epoch
  File "D:\Anaconda3\envs\cocktail\lib\site-packages\tensorflow\python\keras\engine\training.py", line 1064, in fit
    steps_per_execution=self._steps_per_execution)
  File "D:\Anaconda3\envs\cocktail\lib\site-packages\tensorflow\python\keras\engine\data_adapter.py", line 1099, in __init__
    adapter_cls = select_data_adapter(x, y)
  File "D:\Anaconda3\envs\cocktail\lib\site-packages\tensorflow\python\keras\engine\data_adapter.py", line 964, in select_data_adapter
    _type_name(x), _type_name(y)))
ValueError: Failed to find data adapter that can handle input <class 'data_load.AVGenerator>, <class 'NoneType>'

I changed AV_model.fit_generator() to AV_model.fit() and the UserWarning disappears, but the error is the same.

If I switch TensorFlow from 2.4.0 back to 2.0.3, the code runs successfully. In that case, how can I use TensorFlow 2.4.0?

Hardware used to train

I've read through the Google research paper and there isn't anything that suggests what kind of hardware they are using.

What hardware are you using to train this repo?

I currently have 2 GTX 1070s and am worried that might not be enough.

Output of the test.py file

@JusperLee The output of test.py in the predict folder is nothing but the mixed audio files. We are supposed to get the isolated files. Please help with this.

Something wrong in the test code?

Hi, @JusperLee
I successfully trained the model, but when I run python test.py
I get an error like this:

 File "test.py", line 84, in <module>
    T = fast_istft(F,power=False)
  File "/Looking-to-Listen-at-the-Cocktail-Party/model/utils/utils.py", line 73, in fast_istft
    data = istft(real_imag_shrink(data))
  File "/Looking-to-Listen-at-the-Cocktail-Party/model/utils/utils.py", line 25, in istft
    Total = np.zeros((windows * step + fft_size))
ValueError: maximum supported dimension for an ndarray is 32, found 512

Then I checked the code:

def istft(M, fft_size=512, step=160, padding=True):
    data = np.fft.ifft(M, axis=-1)
    windows = np.concatenate((np.zeros((56,)), np.hanning(fft_size - 112), np.zeros((56,))), axis=0)
    windows_num = M.shape[0]
    Total = np.zeros((windows_num * step + fft_size))  ##change windows to windows_num
    for i in range(windows_num):
        start = int(i * step)
        end = int(start + fft_size)
        print(Total.shape, data[i:].shape, windows.shape)
        Total[start:end] = Total[start:end] + data[i:] * windows
    if padding == True:
        Total = Total[:48000]

    return Total

The error now looks like this:

(48192,) (298, 257) (512,)
Traceback (most recent call last):
  File "test.py", line 84, in <module>
    T = fast_istft(F,power=False)
  File "/home/feilongchen/GitSpace/speakerindent/Looking-to-Listen-at-the-Cocktail-Party/model/utils/utils.py", line 74, in fast_istft
    data = istft(real_imag_shrink(data))
  File "/home/feilongchen/GitSpace/speakerindent/Looking-to-Listen-at-the-Cocktail-Party/model/utils/utils.py", line 30, in istft
    Total[start:end] = Total[start:end] + data[i:] * windows
ValueError: operands could not be broadcast together with shapes (298,257) (512,)

Can you help me fix it?
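
The shapes printed above point at two likely problems: data[i:] selects all remaining frames (a 2-D slice) instead of the single frame data[i], and each frame has 257 bins, which is fft_size/2 + 1, so a one-sided inverse FFT is needed to get back 512 samples per frame. The sketch below fixes only those shape issues. Whether the window and scaling match the repo's forward STFT is an assumption that has not been verified, and istft_overlap_add is a hypothetical name.

```python
import numpy as np

def istft_overlap_add(M, fft_size=512, step=160, out_len=48000):
    """Overlap-add inverse STFT for a complex array of shape (frames, fft_size//2 + 1)."""
    # One-sided inverse FFT: 257 bins -> 512 real samples per frame.
    frames = np.fft.irfft(M, n=fft_size, axis=-1)
    # Same window construction as in the code quoted above.
    window = np.concatenate((np.zeros(56), np.hanning(fft_size - 112), np.zeros(56)))
    num_frames = M.shape[0]
    total = np.zeros(num_frames * step + fft_size)
    for i in range(num_frames):
        start = i * step
        # frames[i] is 1-D with length fft_size, so it matches the window.
        total[start:start + fft_size] += frames[i] * window
    return total[:out_len]

rng = np.random.default_rng(0)
M = rng.standard_normal((298, 257)) + 1j * rng.standard_normal((298, 257))
print(istft_overlap_add(M).shape)
```

With 298 frames, a step of 160 and fft_size 512, the buffer holds 298 * 160 + 512 = 48192 samples before it is cut to 48000, which is 3 seconds at 16 kHz.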

Colab script error in test.py part

I tried to run the code in Colab, but I am not sure what mistake I am making in the evaluation (testing) part, in the last code cell. Any clue?
Thanks

requirements.txt internal conflicts

Not the end of the world, but when I pip install your requirements.txt, I get version conflicts. I'm updating the specific listed versions myself as best I can, but it might be worth updating the repo's copy.

edit: FWIW, here's my updated requirements.txt, attached.
requirements.txt

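Hello, in your Zhihu answer I noticed you mentioned using nearest-neighbor interpolation when expanding the video frames to match the audio dimensions, but the code still seems to use bilinear interpolation

I also have a question. I am a beginner researching audio-visual speech separation. I have been running bill's code, but model_v2 has never performed well. The only change I made was replacing the bilinear interpolation in bill's code with nearest-neighbor interpolation, which improves the result by a few tenths of a dB. I also tried enlarging the mixed training set, training on 70,000 and 210,000 mixed utterances, but the loss stays around 0.6 and does not decrease in many epochs: out of roughly 50 epochs, only 6 or 7 show a decrease. Testing the trained model on the test set consistently gives about 3 dB, and even testing on the training set only reaches about 6 dB. Did you run into these problems when trying bill's code? I had planned to replace the face feature extraction with lip feature extraction, but the paper explicitly states that this makes almost no difference, and it also mentions that regions other than the mouth contribute to speech separation, so I did not try it. One more question: do you think replacing face features with lip features is necessary?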
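
For reference, nearest-neighbor upsampling of the visual stream along time can be sketched as an index mapping from 75 video frames to 298 spectrogram frames (the frame counts that appear elsewhere on this page). The floor-based mapping below is one common convention and an assumption here; it is not taken from this repo or from bill's code.

```python
def nearest_indices(src_len=75, dst_len=298):
    """For each output time step, the index of the source frame to copy."""
    return [min(src_len - 1, (i * src_len) // dst_len) for i in range(dst_len)]

idx = nearest_indices()
print(len(idx), idx[:8], idx[-1])
```

Each source frame is repeated about four times. Unlike bilinear interpolation, no new embedding values are created by blending neighboring frames.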