
looking-to-listen-at-the-cocktail-party's Introduction

Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation


This project is a reproduction of the audio-visual model described in the paper Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation.

Ephrat A, Mosseri I, Lang O, et al. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation[J]. arXiv preprint arXiv:1804.03619, 2018.


Requirements

To install requirements:

pip install -r requirements.txt

You can install ffmpeg and sox using homebrew:

brew install ffmpeg
brew install sox

Preprocessing

Video Data

  1. Download the dataset from here and place files in data/csv.
  2. Run the following command to download the YouTube videos and use ffmpeg to capture each 3-second clip as 75 images.
python3 video_download.py
  3. Then use mtcnn to get the bounding boxes of the faces in each image, and use the x, y values from the CSV to locate the face center point.
pip install mtcnn
python3 face_detected.py
python3 check_vaild_face.py
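
The two video steps above can be sketched roughly as follows. This is a minimal illustration, not the code in video_download.py or face_detected.py: the function names are hypothetical, 25 fps is inferred from "3 seconds as 75 images", and it assumes the CSV x, y values are normalized to the range 0 to 1 and that each detected box is given as (x, y, width, height) in pixels.

```python
def build_frame_command(video_path, start_sec, out_pattern, fps=25, duration=3.0):
    """Build an ffmpeg argument list that dumps a short clip as still images."""
    return [
        "ffmpeg", "-y",
        "-ss", str(start_sec),   # seek to the start of the annotated segment
        "-t", str(duration),     # read only 3 seconds of input
        "-i", video_path,
        "-vf", "fps=%d" % fps,   # 25 fps * 3 s = 75 frames
        out_pattern,             # e.g. "frames/clip-%02d.jpg"
    ]


def pick_face_box(boxes, x_norm, y_norm, width, height):
    """Return the (x, y, w, h) box that contains the annotated face center, or None."""
    # Assumption: the CSV coordinates are fractions of the frame size.
    cx, cy = x_norm * width, y_norm * height
    containing = [
        b for b in boxes
        if b[0] <= cx <= b[0] + b[2] and b[1] <= cy <= b[1] + b[3]
    ]
    if not containing:
        return None
    # If several boxes contain the point, prefer the one centered closest to it.
    return min(
        containing,
        key=lambda b: (b[0] + b[2] / 2 - cx) ** 2 + (b[1] + b[3] / 2 - cy) ** 2,
    )
```

The command list can be passed to subprocess.run, and frames for which pick_face_box returns None can be treated as invalid.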

Audio Data

  1. For the audio, use the YouTube download tool to download the audio, then resample it to 16000 Hz with the librosa library. Finally, normalize the audio data.
python3 audio_downloads.py
python3 audio_norm.py # normalize the audio data
  2. Preprocess the audio data: STFT, power-law compression, mixing, generating complex masks, etc.
python3 audio_data.py
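
As a rough illustration of two of these preprocessing steps, the sketch below shows a sign-preserving power-law compression and a complex ratio mask (cRM) computed from a clean spectrogram and a mixture spectrogram. It is a sketch under assumptions, not the contents of audio_data.py: the function names are hypothetical, the exponent 0.3 is the value reported in the paper, and the small eps term is added only to avoid division by zero.

```python
import numpy as np


def power_law(x, power=0.3):
    """Sign-preserving power-law compression, applied element-wise to real values."""
    return np.sign(x) * np.abs(x) ** power


def complex_ratio_mask(clean, mix, eps=1e-8):
    """Return the complex mask M with M * mix ~= clean (element-wise complex division)."""
    denom = mix.real ** 2 + mix.imag ** 2 + eps
    m_real = (mix.real * clean.real + mix.imag * clean.imag) / denom
    m_imag = (mix.real * clean.imag - mix.imag * clean.real) / denom
    return m_real + 1j * m_imag
```

Multiplying the mask by the mixture spectrogram recovers the clean spectrogram, which is what the network is trained to approximate for each speaker.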

Face embedding Feature

  • Here we use Google's FaceNet method to map face images to a high-dimensional Euclidean space. In this project, we use David Sandberg's open-source pre-trained FaceNet model "20180402-114759". Then use the TensorFlow_to_Keras script in this project to convert it.(Model/face_embedding/

Schroff F, Kalenichenko D, Philbin J. Facenet: A unified embedding for face recognition and clustering[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2015: 815-823.

Change the path tf_model_dir in Tensorflow_to_Keras.py

python3 Tensorflow_to_Keras.py
python3 face_emb.py

  1. Create AVdataset_train.txt and AVdataset_val.txt
python3 AV_data_log.py

Training

  • Supports resuming training after an interruption.
  • Supports multi-GPU, multi-process training.
  • According to the description in the paper, set the following parameters:
people_num = 2 # How many people you want to separate?
epochs = 100
initial_epoch = 0
batch_size = 1 # batch sizes of 2 or 4 need a GPU
gamma_loss = 0.1
beta_loss = gamma_loss * 2
  • Then use the script train.py to train
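
One way to pick the value of initial_epoch when resuming is to look for the most recent checkpoint on disk. The sketch below is only an illustration: the helper name is hypothetical, and the filename pattern is an assumption based on a checkpoint name reported by a user (AVmodel-2p-001.h5); the resume logic actually used in train.py is not shown here.

```python
import re

# Assumed pattern: AVmodel-<people_num>p-<epoch>...h5
CKPT_RE = re.compile(r"AVmodel-(\d+)p-(\d+)")


def latest_checkpoint(filenames, people_num):
    """Return (filename, epoch) of the newest matching checkpoint, or None."""
    best = None
    for name in filenames:
        match = CKPT_RE.match(name)
        if not match or not name.endswith(".h5"):
            continue
        if int(match.group(1)) != people_num:
            continue
        epoch = int(match.group(2))
        if best is None or epoch > best[1]:
            best = (name, epoch)
    return best
```

For example, latest_checkpoint(os.listdir(checkpoint_dir), people_num) could supply both the weights file to load and the epoch number to pass as initial_epoch, where checkpoint_dir is whatever directory the checkpoints are written to.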

Planned work

  • Implement with PyTorch
  • Provide a trained model
  • Optimize code style
  • ......

Part of the code references this GitHub repository: https://github.com/bill9800/speech_separation

looking-to-listen-at-the-cocktail-party's People

Contributors

ayushtiwari, dependabot[bot], jusperlee


looking-to-listen-at-the-cocktail-party's Issues

error in test.py file

@JusperLee Hi, while running python3 test.py I'm getting the following error:

python3 test.py
Using TensorFlow backend.
Initialing Parameters......
Loading data ......
2020-03-27 23:11:00.256203: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-03-27 23:11:00.284390: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 1800000000 Hz
2020-03-27 23:11:00.285014: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55b3d04ca750 executing computations on platform Host. Devices:
2020-03-27 23:11:00.285049: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): ,
2020-03-27 23:11:00.302032: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
Traceback (most recent call last):
File "test.py", line 51, in <module>
av_model = load_model(model_path,custom_objects={'tf':tf})
File "/home/lenovo/.local/lib/python3.6/site-packages/keras/engine/saving.py", line 492, in load_wrapper
return load_function(*args, **kwargs)
File "/home/lenovo/.local/lib/python3.6/site-packages/keras/engine/saving.py", line 584, in load_model
model = _deserialize_model(h5dict, custom_objects, compile)
File "/home/lenovo/.local/lib/python3.6/site-packages/keras/engine/saving.py", line 369, in _deserialize_model
sample_weight_mode=sample_weight_mode)
File "/home/lenovo/.local/lib/python3.6/site-packages/keras/engine/training.py", line 119, in compile
self.loss, self.output_names)
File "/home/lenovo/.local/lib/python3.6/site-packages/keras/engine/training_utils.py", line 822, in prepare_loss_functions
loss_functions = [get_loss_function(loss) for _ in output_names]
File "/home/lenovo/.local/lib/python3.6/site-packages/keras/engine/training_utils.py", line 822, in <listcomp>
loss_functions = [get_loss_function(loss) for _ in output_names]
File "/home/lenovo/.local/lib/python3.6/site-packages/keras/engine/training_utils.py", line 705, in get_loss_function
loss_fn = losses.get(loss)
File "/home/lenovo/.local/lib/python3.6/site-packages/keras/losses.py", line 795, in get
return deserialize(identifier)
File "/home/lenovo/.local/lib/python3.6/site-packages/keras/losses.py", line 776, in deserialize
printable_module_name='loss function')
File "/home/lenovo/.local/lib/python3.6/site-packages/keras/utils/generic_utils.py", line 167, in deserialize_keras_object
':' + function_name)
ValueError: Unknown loss function:loss_func

Can you please help me out?

How to use test.py

I managed to train and get AVmodel-2p-001.h5.
But when I tried to run test.py, I ran into a strange bug.
The traceback is below:

Traceback (most recent call last):
  File "test.py", line 53, in <module>
    av_model = load_model(model_path,custom_objects={'tf':tf})
  File "/home/gz/anaconda3/envs/tf2/lib/python3.7/site-packages/keras/engine/saving.py", line 492, in load_wrapper
    return load_function(*args, **kwargs)
  File "/home/gz/anaconda3/envs/tf2/lib/python3.7/site-packages/keras/engine/saving.py", line 584, in load_model
    model = _deserialize_model(h5dict, custom_objects, compile)
  File "/home/gz/anaconda3/envs/tf2/lib/python3.7/site-packages/keras/engine/saving.py", line 369, in _deserialize_model
    sample_weight_mode=sample_weight_mode)
  File "/home/gz/anaconda3/envs/tf2/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 75, in symbolic_fn_wrapper
    return func(*args, **kwargs)
  File "/home/gz/anaconda3/envs/tf2/lib/python3.7/site-packages/keras/engine/training.py", line 119, in compile
    self.loss, self.output_names)
  File "/home/gz/anaconda3/envs/tf2/lib/python3.7/site-packages/keras/engine/training_utils.py", line 822, in prepare_loss_functions
    loss_functions = [get_loss_function(loss) for _ in output_names]
  File "/home/gz/anaconda3/envs/tf2/lib/python3.7/site-packages/keras/engine/training_utils.py", line 822, in <listcomp>
    loss_functions = [get_loss_function(loss) for _ in output_names]
  File "/home/gz/anaconda3/envs/tf2/lib/python3.7/site-packages/keras/engine/training_utils.py", line 705, in get_loss_function
    loss_fn = losses.get(loss)
  File "/home/gz/anaconda3/envs/tf2/lib/python3.7/site-packages/keras/losses.py", line 795, in get
    return deserialize(identifier)
  File "/home/gz/anaconda3/envs/tf2/lib/python3.7/site-packages/keras/losses.py", line 776, in deserialize
    printable_module_name='loss function')
  File "/home/gz/anaconda3/envs/tf2/lib/python3.7/site-packages/keras/utils/generic_utils.py", line 167, in deserialize_keras_object
    ':' + function_name)
ValueError: Unknown loss function:loss_func

Could you please give me some help?
Since you are using Baidu Netdisk (百度网盘), I guess we are both Chinese.
So it may be more convenient to describe it in Chinese.
After training for a few epochs I got some .h5 weight files. When I tried to run test.py, I hit the error above. I grepped the code and found a function called loss_func. Is the problem in av_model = load_model(model_path,custom_objects={'tf':tf})?
Why does train.py use AV_model = AV.AV_model(people_num), while test.py uses av_model = load_model(model_path,custom_objects={'tf':tf})?
How did you get test.py to run in the end?
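
For context on the "Unknown loss function:loss_func" error: Keras raises it when a saved model was compiled with a custom loss that is not listed in custom_objects. Two common workarounds are to pass the loss in custom_objects, or to skip compilation with compile=False when the model is only used for inference. The helper below is a hypothetical sketch of both options. It assumes the training code exposes a function named loss_func, as the grep result above suggests, but where that function lives and how it is constructed is not confirmed here.

```python
def build_load_kwargs(tf_module, loss_func=None):
    """Build keyword arguments for keras.models.load_model.

    With a loss function: register it so the saved model can be recompiled.
    Without one: skip compilation, which is enough for inference.
    """
    custom_objects = {"tf": tf_module}
    if loss_func is not None:
        custom_objects["loss_func"] = loss_func
        return {"custom_objects": custom_objects}
    return {"custom_objects": custom_objects, "compile": False}


# Usage sketch (requires Keras, TensorFlow and a saved model file):
#   import tensorflow as tf
#   from keras.models import load_model
#   av_model = load_model(model_path, **build_load_kwargs(tf))
```

Loading with compile=False means the model cannot be trained further without calling compile again, which is usually acceptable in a test script.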

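Hello, author

Hello. While working with the code recently, I ran into problems downloading the dataset: some of the YouTube videos cannot be downloaded, and for some that can, the corresponding audio cannot, so I am unable to start training the network. Could you share a trained model? It does not need to be the most accurate one; I just want to run a demo and see how the model performs.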

Result of test.py file

@JusperLee After running test.py, the pred folder is created and .wav files are generated in it, but every .wav file is silent. Why is that?

Can face embeddings be provided in this repo?

The current test script requires face embeddings to run, which means downloading the videos and audio and doing heavy preprocessing. Can you provide the face embeddings so that the inference demo can be run directly?

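Why is the cRM not compressed to the 0-1 range with a sigmoid?

Hello, and first of all thank you for reproducing the paper. While reading the paper I noticed that the authors mention: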

Real and imaginary parts of the complex mask will typically lie between -1 and 1, however, we use sigmoidal compression to bound these complex mask values between 0 and 1.

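However, I don't see this in your code: there is tanh compression, but no sigmoid that compresses the cRM values into the 0-1 range. Did you find that it gave worse results, or is there another reason? Thanks a lot.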
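
To make the two options in this question concrete, the sketch below contrasts a tanh-style compression, which bounds mask values symmetrically around zero, with a logistic sigmoid, which bounds them between 0 and 1. It is an illustration only: the function names are hypothetical, and the constants K=10 and C=0.1 are assumed values of the kind used in the cRM literature, not values confirmed from this repository.

```python
import numpy as np


def tanh_compress(m, K=10.0, C=0.1):
    """Bound mask values to (-K, K); equals K*(1-exp(-C*m))/(1+exp(-C*m))."""
    return K * np.tanh(C * m / 2.0)


def tanh_expand(o, K=10.0, C=0.1, eps=1e-7):
    """Inverse of tanh_compress, clipped to keep the logarithm finite."""
    o = np.clip(o, -K + eps, K - eps)
    return -(1.0 / C) * np.log((K - o) / (K + o))


def logistic(x):
    """Logistic sigmoid, which bounds values to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))
```

With the tanh form the sign of each mask component is preserved, whereas a plain logistic output would need an extra shift and scale to represent negative mask values.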

istft error in model/utils/utils.py

While running test.py, an error occurs:

In line 25 of utils.py:
Total = np.zeros((windows * step + fft_size))
with windows of shape (512,), step 160 and fft_size 512,
this raises an error since the maximum number of dimensions np.zeros supports is 32.

I would suggest Total = np.zeros((windows.shape[0] * step + fft_size)) to make it one-dimensional, but then another error occurs:
Total[start:end] = Total[start:end] + data[i:] * windows,
ValueError: operands could not be broadcast together with shapes (298,257) (512,)
since data[i:] is clearly two-dimensional.

I don't understand the logic here: is this a bug, or did I just mess up?
If it's a bug, can you please fix this?

Thanks

Fit_generator() error in tensorflow 2.4.0 but not in 2.0.0

Thank you for your work. I tried to run the project with tensorflow 2.4.0, as listed in requirements.txt, but the following traceback appears when fitting:

result=AV_model.fit_generator(generator=train_generator,
                                 validation_data=val_generator,
                                 epochs=epochs,
                                 workers=workers,
                                 use_multiprocessing=MultiProcess,
                                 callbacks=[TensorBoard(log_dir='./log'), checkpoint, rlr],
                                 initial_epoch=0
                                 )
UserWarning: `Model.fit_generator` is deprecated and will be removed in a future version. Please use `Model.fit`, which supports generators.
  warnings.warn('`Model.fit_generator` is deprecated and '
Traceback (most recent call last):
  File "D:\Looking-to-Listen-at-the-Cocktail-Party-master\train.py", line 138, in <module>
    initial_epoch=initial_epoch
  File "D:\Anaconda3\envs\cocktail\lib\site-packages\tensorflow\python\keras\engine\training.py", line 1064, in fit
    steps_per_execution=self._steps_per_execution)
  File "D:\Anaconda3\envs\cocktail\lib\site-packages\tensorflow\python\keras\engine\data_adapter.py", line 1099, in __init__
    adapter_cls = select_data_adapter(x, y)
  File "D:\Anaconda3\envs\cocktail\lib\site-packages\tensorflow\python\keras\engine\data_adapter.py", line 964, in select_data_adapter
    _type_name(x), _type_name(y)))
ValueError: Failed to find data adapter that can handle input <class 'data_load.AVGenerator'>, <class 'NoneType'>

I changed AV_model.fit_generator() to AV_model.fit() and the UserWarning disappears, but the error is the same.

If I change tensorflow from 2.4.0 back to 2.0.3, the code runs successfully. In this case, how can I use tensorflow 2.4.0?

Hardware used to train

I've read through the Google research paper and there isn't anything that suggests what kind of hardware they used.

What hardware are you using to train this repo?

I currently have two GTX 1070s and am worried that might not be enough.

Output of the test.py file

@JusperLee Output of the test.py in the predict folder is nothing but the mixed audio files. We are supposed to get the isolated files. Please help on this.

Something wrong in the test code?

Hi, @JusperLee
I successfully trained the model, but when I run python test.py I get the following error:

 File "test.py", line 84, in <module>
    T = fast_istft(F,power=False)
  File "/Looking-to-Listen-at-the-Cocktail-Party/model/utils/utils.py", line 73, in fast_istft
    data = istft(real_imag_shrink(data))
  File "/Looking-to-Listen-at-the-Cocktail-Party/model/utils/utils.py", line 25, in istft
    Total = np.zeros((windows * step + fft_size))
ValueError: maximum supported dimension for an ndarray is 32, found 512

I checked the code and changed windows to windows_num:

def istft(M, fft_size=512, step=160, padding=True):
    data = np.fft.ifft(M, axis=-1)
    windows = np.concatenate((np.zeros((56,)), np.hanning(fft_size - 112), np.zeros((56,))), axis=0)
    windows_num = M.shape[0]
    Total = np.zeros((windows_num * step + fft_size))  ##change windows to windows_num
    for i in range(windows_num):
        start = int(i * step)
        end = int(start + fft_size)
        print(Total.shape, data[i:].shape, windows.shape)
        Total[start:end] = Total[start:end] + data[i:] * windows
    if padding == True:
        Total = Total[:48000]

    return Total

Then I get this error:

(48192,) (298, 257) (512,)
Traceback (most recent call last):
  File "test.py", line 84, in <module>
    T = fast_istft(F,power=False)
  File "/home/feilongchen/GitSpace/speakerindent/Looking-to-Listen-at-the-Cocktail-Party/model/utils/utils.py", line 74, in fast_istft
    data = istft(real_imag_shrink(data))
  File "/home/feilongchen/GitSpace/speakerindent/Looking-to-Listen-at-the-Cocktail-Party/model/utils/utils.py", line 30, in istft
    Total[start:end] = Total[start:end] + data[i:] * windows
ValueError: operands could not be broadcast together with shapes (298,257) (512,) 

Can you help me fix it?
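
The shapes printed above point at two separate problems: data[i:] selects every frame from i onward rather than the single frame i, and each frame holds 257 frequency bins while the window has 512 samples. A possible overlap-add inverse is sketched below. It is a sketch under assumptions rather than a confirmed fix for utils.py: it assumes the 257 bins are the one-sided spectrum of a 512-point real FFT, so np.fft.irfft is used to get 512 samples per frame, and it does not attempt to undo any window normalization applied by the forward STFT.

```python
import numpy as np


def istft_overlap_add(M, fft_size=512, step=160, out_len=48000):
    """Overlap-add inverse STFT for a one-sided spectrogram of shape (frames, fft_size//2 + 1)."""
    # One real-valued frame of fft_size samples per row.
    frames = np.fft.irfft(M, n=fft_size, axis=-1)
    # Same zero-padded Hann window as in the code quoted above.
    window = np.concatenate(
        (np.zeros(56), np.hanning(fft_size - 112), np.zeros(56))
    )
    num_frames = M.shape[0]
    total = np.zeros(num_frames * step + fft_size)
    for i in range(num_frames):
        start = i * step
        # frames[i] is a single 1-D frame, so the shapes match the window.
        total[start:start + fft_size] += frames[i] * window
    return total[:out_len]
```

With a (298, 257) input this returns 48000 samples, i.e. 3 seconds at 16000 Hz.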

Colab script error in test.py part

I tried to run the code in Colab, and I am not sure what mistake I am making in the evaluation (testing) part, which is the last code cell. Any clue?
thanks

requirements.txt internal conflicts

Not the end of the world, but when I pip install your requirements.txt, I get version conflicts. I'm updating the specific listed versions myself as best I can, but it might be worth updating the repo's copy.

edit: FWIW, here's my updated requirements.txt, attached.
requirements.txt

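Hello. In your Zhihu answer I noticed that, when expanding the video-frame data so that its dimension matches the audio, you mention using nearest-neighbor interpolation, but in the code you still seem to use bilinear interpolation.

I would also like to ask a question. I am a beginner researching audio-visual speech separation, and I have been running bill's code. The results of model_v2 have never been good. The only change I made was replacing the bilinear interpolation in bill's code with nearest-neighbor interpolation, which improves the result by a few tenths of a dB. I also tried enlarging the mixed training set, training with 70,000 and 210,000 mixed-speech samples; the loss stays around 0.6-something, and in many epochs it does not decrease at all (out of roughly 50 epochs, only 6 or 7 show a decrease). Testing the trained model on the test set consistently gives about 3 dB, and even testing on the training set only reaches about 6 dB. Did you run into these problems when you tried bill's code? I had planned to replace the face feature extraction with lip feature extraction, but the paper explicitly states that doing so makes almost no difference, and it also mentions that regions other than the mouth contribute to speech separation, so I did not try it. One more question: do you think replacing face features with lip features is necessary?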
