resemblyzer's People

Contributors

breuerfelix, corentinj, hagenw, mattyb95, mstopa, yesseyev, zohaibahmed


resemblyzer's Issues

trimmed at the end

Hi,
thank you very much, the project is very interesting.
I have a problem: the diarized speech is always trimmed at the end, sometimes by more than 10 seconds.
For example:
total duration of audio file = 27 s, the result ends at 16 s
total duration of audio file = 45 s, the result ends at 30 s
I am using a 16 kHz sample rate.
I appreciate any help.
Rawad
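
A suggestion (not from the thread): preprocess_wav runs a webrtcvad-based trim_long_silences pass, which can remove quiet speech near the end of a file. A minimal sketch of bypassing the trimming, assuming the model's 16 kHz rate and a placeholder file name:

import librosa
from resemblyzer import VoiceEncoder

# Load and resample manually instead of calling preprocess_wav, so the
# VAD-based silence trimming never runs. "my_audio.wav" is a placeholder.
wav, _ = librosa.load("my_audio.wav", sr=16000)
encoder = VoiceEncoder()
embed = encoder.embed_utterance(wav)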

Clarification on embeddings training

Hi @CorentinJ! Great repo! I have one question regarding the embeddings training: are they trained using cosine similarity, Euclidean distance, or some other loss?

I'm trying to use this repo in conjunction with https://github.com/wq2012/SpectralCluster, but the results don't make a lot of sense (see wq2012/SpectralCluster#6). It seems the mismatch might come from an incompatibility between the two libraries if the embeddings are not trained on Euclidean distance.
If so, is there a suggested library for a clustering algorithm where the number of speakers is not known in advance?
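
For reference (a property of the model, hedged): the encoder is trained with the GE2E loss, which scores pairs by cosine similarity, and the returned embeddings are L2-normalized. On unit vectors, Euclidean distance is a monotonic function of cosine similarity, so Euclidean-based clustering should rank pairs consistently with cosine. A quick check:

import numpy as np

# On unit-norm vectors: ||a - b||^2 = 2 - 2 * cos(a, b).
a, b = np.random.rand(256), np.random.rand(256)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)
assert np.isclose(np.linalg.norm(a - b) ** 2, 2 - 2 * (a @ b))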

python3 demo02_diarization.py Error opening 'audio_data/X2zqiX6yL3I.mp3': File contains data in an unknown format

(Resemblizer) (base) marco@pc:~/Resemblyzer$ python3 demo02_diarization.py
Traceback (most recent call last):
  File "/home/marco/Resemblyzer/Resemblizer/lib/python3.7/site-packages/librosa/core/audio.py", line 127, in load
    with sf.SoundFile(path) as sf_desc:
  File "/home/marco/Resemblyzer/Resemblizer/lib/python3.7/site-packages/soundfile.py", line 627, in __init__
    self._file = self._open(file, mode_int, closefd)
  File "/home/marco/Resemblyzer/Resemblizer/lib/python3.7/site-packages/soundfile.py", line 1182, in _open
    "Error opening {0!r}: ".format(self.name))
  File "/home/marco/Resemblyzer/Resemblizer/lib/python3.7/site-packages/soundfile.py", line 1355, in _error_check
    raise RuntimeError(prefix + _ffi.string(err_str).decode('utf-8', 'replace'))
RuntimeError: Error opening 'audio_data/X2zqiX6yL3I.mp3': File contains data in an unknown format.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "demo02_diarization.py", line 14, in <module>
    wav = preprocess_wav(wav_fpath)
  File "/home/marco/Resemblyzer/resemblyzer/audio.py", line 27, in preprocess_wav
    wav, source_sr = librosa.load(fpath_or_wav, sr=None)
  File "/home/marco/Resemblyzer/Resemblizer/lib/python3.7/site-packages/librosa/core/audio.py", line 142, in load
    y, sr_native = __audioread_load(path, offset, duration, dtype)
  File "/home/marco/Resemblyzer/Resemblizer/lib/python3.7/site-packages/librosa/core/audio.py", line 164, in __audioread_load
    with audioread.audio_open(path) as input_file:
  File "/home/marco/Resemblyzer/Resemblizer/lib/python3.7/site-packages/audioread/__init__.py", line 116, in audio_open
    raise NoBackendError()
audioread.exceptions.NoBackendError
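
A likely cause (a suggestion, not from the thread): librosa decodes mp3 through audioread, and NoBackendError means no decoding backend is available. Installing ffmpeg (e.g. sudo apt install ffmpeg) or converting the file to wav usually resolves it.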

Embeddings from different sources of audio are different

Hey Corentin, I noticed that the embeddings generated are different (i.e. the model doesn't recognize the speaker properly) when the recording source differs from the one the sample was taken from, even though the speaker is the same. I wanted to know how we can overcome this limitation: would more training of the encoder work, could we augment the data with some kind of noise, or could we add a white-noise signal from a new audio source to the audio being analyzed?

What are your opinions on why this happens, and what could help here?

Thanks

In executing the demos: cannot import name 'UMAP' from 'umap'

demo 01:

(Resemblizer) (base) marco@pc:~/Resemblyzer$ python3 demo01_similarity.py 
Traceback (most recent call last):
  File "demo01_similarity.py", line 2, in <module>
    from demo_utils import *
  File "/home/marco/Resemblyzer/demo_utils.py", line 6, in <module>
    from umap import UMAP
ImportError: cannot import name 'UMAP' from 'umap' (/home/marco/Resemblyzer/Resemblizer/lib/python3.7/site-packages/umap/__init__.py)

demo02 :

(Resemblizer) (base) marco@pc:~/Resemblyzer$ python3 demo02_diarization.py 
Traceback (most recent call last):
  File "demo02_diarization.py", line 2, in <module>
    from demo_utils import *
  File "/home/marco/Resemblyzer/demo_utils.py", line 6, in <module>
    from umap import UMAP
ImportError: cannot import name 'UMAP' from 'umap' (/home/marco/Resemblyzer/Resemblizer/lib/python3.7/site-packages/umap/__init__.py)

demo03:

(Resemblizer) (base) marco@pc:~/Resemblyzer$ python3 demo03_projection.py 
Traceback (most recent call last):
  File "demo03_projection.py", line 2, in <module>
    from demo_utils import *
  File "/home/marco/Resemblyzer/demo_utils.py", line 6, in <module>
    from umap import UMAP
ImportError: cannot import name 'UMAP' from 'umap' (/home/marco/Resemblyzer/Resemblizer/lib/python3.7/site-packages/umap/__init__.py)

How to solve the problem?
Marco
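
A likely cause (a suggestion, not from the thread): the PyPI package named umap is unrelated to UMAP dimensionality reduction; the demos need umap-learn, which provides the umap module with the UMAP class. Running pip uninstall umap followed by pip install umap-learn typically fixes this import.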

More information on dataset and training pipeline

Hello, thank you for this repository and for Real-Time-Voice-Cloning. I'm curious about how exactly you trained this Resemblyzer model: which datasets did you use? VoxCeleb1, VoxCeleb2, LibriSpeech, or others? If you could share the training scripts, it would be very helpful for retraining the embeddings on our domain, provided it is not under any NDA.

Best regards

Ivan

Getting different clusters for the same speaker

Hey,

I tried the projection demo with my own recorded voice files (SpeakerA), and it seems I'm getting overlap and the clusters are spread far apart. What do you think could be the issue, and how can I go about fixing it?

Screenshot_20191231_130521

Screenshot_20191231_131010

Thanks

Questions about features supported by Resemblyzer

Hi there - I am new to speaker diarization and was exploring the repo, as I have a few questions. I looked at the diarization demo here: demo02_diarization.py

Use a live audio stream instead of static audio files:
I see that the demo uses a static mp3 file, although in my use case I will be working with a real-time audio stream. Does Resemblyzer support streaming input for speaker diarization? If so, is there somewhere I could find a resource or sample code to use as a reference?

Number of speakers unknown at the beginning of the audio stream.
Unlike in the demo code, where the total number of speakers is pre-decided, in my use case I will be streaming audio from a live meeting, which means the total number of users might not be known in advance (yes, we know how many people were invited to the meeting, but not all of them will necessarily join). In that case, how can I enable Resemblyzer to not only detect when a particular speaker is talking, but also detect that a new user is speaking if they have not spoken before? Does Resemblyzer support that feature? Where can I find a reference for it?

Pre-trained English model for diarization.
I want to work with an existing model and am okay using a pre-trained diarization model, as long as it can detect a new speaker in real time. Where can I find pre-trained diarization models that work right out of the box, so I can see how well they perform?

Thank you for your time and have a good one.

Mp3 file is not accepted in demo02_diarization.py

wav_fpath = Path("audio_data", "X2zqiX6yL3I.mp3")

When I try using mp3 files, it throws an error. However, if I use wav format, it works fine. Please correct me if I am doing anything wrong.

Embedding is mostly zero

I plotted the embedding vector and it's mostly zeros. Is this expected?

I want to use the embedding in another project. I also plotted that project's example embeddings, and those seem to be distributed significantly better.

  • Embedding from this project

  • Embeddings from AutoVC

And here's the test code;

from pathlib import Path
from resemblyzer import VoiceEncoder, preprocess_wav
import numpy as np, matplotlib.pyplot as plt

# Load and preprocess a LibriSpeech utterance, then embed it.
wav = preprocess_wav(Path("367-130732-0005.flac"))

encoder = VoiceEncoder()
embed = encoder.embed_utterance(wav)  # 256-dim, L2-normalized

# Plot each embedding component as a dot.
plt.plot(embed, 'bo')
plt.show()

Am I doing something wrong? Thanks.
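
For context (an observation about the architecture, not from the thread): the encoder's final layer applies a ReLU before L2-normalizing the embedding, so a sizable fraction of exactly-zero components is expected; np.mean(embed == 0) will typically be well above zero even on clean speech. Other projects' encoders may not have this ReLU, hence differently distributed embeddings.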

d-vector (256-dim) contains a lot of zeros

Is it normal for the extracted d-vector features to contain zeros? Why do the d-vectors I extract from my Chinese dataset contain so many zeros?

sndfile library not found. But when installing sndfile with pip3: sndfile._sndfile.c:494:10: fatal error: sndfile.h: No such file or directory

(resemblyzerTest) (base) marco@pc:~/resemblyzerTest$ pip3 install resemblyzer
Collecting resemblyzer
  Using cached https://files.pythonhosted.org/packages/e0/21/f0a22ee4afd9e5d9790b04329accdb71d2cf89ffaf5bb0611fb37cd91782/Resemblyzer-0.1.1.dev0-py3-none-any.whl
Collecting typing (from resemblyzer)
  Using cached https://files.pythonhosted.org/packages/fe/2e/b480ee1b75e6d17d2993738670e75c1feeb9ff7f64452153cf018051cc92/typing-3.7.4.1-py3-none-any.whl
Collecting numpy>=1.10.1 (from resemblyzer)
  Using cached https://files.pythonhosted.org/packages/25/eb/4ecf6b13897391cb07a4231e9d9c671b55dfbbf6f4a514a1a0c594f2d8d9/numpy-1.17.1-cp37-cp37m-manylinux1_x86_64.whl
Collecting torch>=1.0.1 (from resemblyzer)
  Using cached https://files.pythonhosted.org/packages/05/65/5248be50c55ab7429dd5c11f5e2f9f5865606b80e854ca63139ad1a584f2/torch-1.2.0-cp37-cp37m-manylinux1_x86_64.whl
Collecting webrtcvad>=2.0.10 (from resemblyzer)
Collecting librosa>=0.6.1 (from resemblyzer)
Collecting scipy>=1.2.1 (from resemblyzer)
  Using cached https://files.pythonhosted.org/packages/94/7f/b535ec711cbcc3246abea4385d17e1b325d4c3404dd86f15fc4f3dba1dbb/scipy-1.3.1-cp37-cp37m-manylinux1_x86_64.whl
Collecting decorator>=3.0.0 (from librosa>=0.6.1->resemblyzer)
  Using cached https://files.pythonhosted.org/packages/5f/88/0075e461560a1e750a0dcbf77f1d9de775028c37a19a346a6c565a257399/decorator-4.4.0-py2.py3-none-any.whl
Collecting resampy>=0.2.0 (from librosa>=0.6.1->resemblyzer)
Collecting joblib>=0.12 (from librosa>=0.6.1->resemblyzer)
  Using cached https://files.pythonhosted.org/packages/cd/c1/50a758e8247561e58cb87305b1e90b171b8c767b15b12a1734001f41d356/joblib-0.13.2-py2.py3-none-any.whl
Collecting soundfile>=0.9.0 (from librosa>=0.6.1->resemblyzer)
  Using cached https://files.pythonhosted.org/packages/68/64/1191352221e2ec90db7492b4bf0c04fd9d2508de67b3f39cbf093cd6bd86/SoundFile-0.10.2-py2.py3-none-any.whl
Collecting audioread>=2.0.0 (from librosa>=0.6.1->resemblyzer)
Collecting six>=1.3 (from librosa>=0.6.1->resemblyzer)
  Using cached https://files.pythonhosted.org/packages/73/fb/00a976f728d0d1fecfe898238ce23f502a721c0ac0ecfedb80e0d88c64e9/six-1.12.0-py2.py3-none-any.whl
Collecting scikit-learn!=0.19.0,>=0.14.0 (from librosa>=0.6.1->resemblyzer)
  Using cached https://files.pythonhosted.org/packages/9f/c5/e5267eb84994e9a92a2c6a6ee768514f255d036f3c8378acfa694e9f2c99/scikit_learn-0.21.3-cp37-cp37m-manylinux1_x86_64.whl
Collecting numba>=0.38.0 (from librosa>=0.6.1->resemblyzer)
  Using cached https://files.pythonhosted.org/packages/b5/9b/7ad0a181b66d58334a2233f18fc8345e3ff17ea6f8db0eb59dc31182b6a9/numba-0.45.1-cp37-cp37m-manylinux1_x86_64.whl
Collecting cffi>=1.0 (from soundfile>=0.9.0->librosa>=0.6.1->resemblyzer)
  Using cached https://files.pythonhosted.org/packages/a0/ea/37fe21475c884f88a2ae496cab10e8f84f0cc11137be860af9eb37a3edb9/cffi-1.12.3-cp37-cp37m-manylinux1_x86_64.whl
Collecting llvmlite>=0.29.0dev0 (from numba>=0.38.0->librosa>=0.6.1->resemblyzer)
  Using cached https://files.pythonhosted.org/packages/30/ae/a33eb9a94734889c189ba4b05170ac0ede05904db5d3dd31158cb33ac16e/llvmlite-0.29.0-1-cp37-cp37m-manylinux1_x86_64.whl
Collecting pycparser (from cffi>=1.0->soundfile>=0.9.0->librosa>=0.6.1->resemblyzer)
  Using cached https://files.pythonhosted.org/packages/68/9e/49196946aee219aead1290e00d1e7fdeab8567783e83e1b9ab5585e6206a/pycparser-2.19.tar.gz
Installing collected packages: typing, numpy, torch, webrtcvad, decorator, llvmlite, numba, scipy, six, resampy, joblib, pycparser, cffi, soundfile, audioread, scikit-learn, librosa, resemblyzer
  Running setup.py install for pycparser ... done
Successfully installed audioread-2.1.8 cffi-1.12.3 decorator-4.4.0 joblib-0.13.2 librosa-0.7.0 llvmlite-0.29.0 numba-0.45.1 numpy-1.17.1 pycparser-2.19 resampy-0.2.2 resemblyzer-0.1.1.dev0 scikit-learn-0.21.3 scipy-1.3.1 six-1.12.0 soundfile-0.10.2 torch-1.2.0 typing-3.7.4.1 webrtcvad-2.0.10
You are using pip version 19.0.3, however version 19.2.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
(resemblyzerTest) (base) marco@pc:~/resemblyzerTest$ pip3 install --upgrade pip
Collecting pip
  Using cached https://files.pythonhosted.org/packages/30/db/9e38760b32e3e7f40cce46dd5fb107b8c73840df38f0046d8e6514e675a1/pip-19.2.3-py2.py3-none-any.whl
Installing collected packages: pip
  Found existing installation: pip 19.0.3
    Uninstalling pip-19.0.3:
      Successfully uninstalled pip-19.0.3
Successfully installed pip-19.2.3

But when trying to install the sndfile package:

(resemblyzerTest) (base) marco@pc:~/resemblyzerTest$ pip3 install sndfile
Collecting sndfile
  Using cached https://files.pythonhosted.org/packages/db/ce/797cacd78490aa9de2e0e119491079d380e2fbbd7a1c5057c9fb2120a643/sndfile-0.2.0.tar.gz
Requirement already satisfied: cffi>=1.0.0 in ./lib/python3.7/site-packages (from sndfile) (1.12.3)
Requirement already satisfied: pycparser in ./lib/python3.7/site-packages (from cffi>=1.0.0->sndfile) (2.19)
Installing collected packages: sndfile
  Running setup.py install for sndfile ... error
    ERROR: Command errored out with exit status 1:
     command: /home/marco/resemblyzerTest/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-qcs9_nwv/sndfile/setup.py'"'"'; __file__='"'"'/tmp/pip-install-qcs9_nwv/sndfile/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-2d8yk4e3/install-record.txt --single-version-externally-managed --compile --install-headers /home/marco/resemblyzerTest/include/site/python3.7/sndfile
         cwd: /tmp/pip-install-qcs9_nwv/sndfile/
    Complete output (23 lines):
    running install
    running build
    running build_py
    creating build
    creating build/lib.linux-x86_64-3.7
    creating build/lib.linux-x86_64-3.7/sndfile
    copying sndfile/__init__.py -> build/lib.linux-x86_64-3.7/sndfile
    copying sndfile/vio.py -> build/lib.linux-x86_64-3.7/sndfile
    copying sndfile/build.py -> build/lib.linux-x86_64-3.7/sndfile
    copying sndfile/formats.py -> build/lib.linux-x86_64-3.7/sndfile
    copying sndfile/io.py -> build/lib.linux-x86_64-3.7/sndfile
    running build_ext
    generating cffi module 'build/temp.linux-x86_64-3.7/sndfile._sndfile.c'
    creating build/temp.linux-x86_64-3.7
    building 'sndfile._sndfile' extension
    creating build/temp.linux-x86_64-3.7/build
    creating build/temp.linux-x86_64-3.7/build/temp.linux-x86_64-3.7
    gcc -pthread -B /home/marco/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/marco/resemblyzerTest/include -I/home/marco/anaconda3/include/python3.7m -c build/temp.linux-x86_64-3.7/sndfile._sndfile.c -o build/temp.linux-x86_64-3.7/build/temp.linux-x86_64-3.7/sndfile._sndfile.o
    build/temp.linux-x86_64-3.7/sndfile._sndfile.c:494:10: fatal error: sndfile.h: No such file or directory
     #include <sndfile.h>
              ^~~~~~~~~~~
    compilation terminated.
    error: command 'gcc' failed with exit status 1
    ----------------------------------------
ERROR: Command errored out with exit status 1: /home/marco/resemblyzerTest/bin/python3 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-qcs9_nwv/sndfile/setup.py'"'"'; __file__='"'"'/tmp/pip-install-qcs9_nwv/sndfile/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-2d8yk4e3/install-record.txt --single-version-externally-managed --compile --install-headers /home/marco/resemblyzerTest/include/site/python3.7/sndfile
Check the logs for full command output.


Operating System: Ubuntu 18.04.02 Server Edition
gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)
pip3 -V
pip 19.2.3 from /home/marco/anaconda3/lib/python3.7/site-packages/pip (python 3.7)

How to solve the problem?
Marco
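
A likely fix (a suggestion, not from the thread): the PyPI sndfile package is not what soundfile needs; soundfile loads the system libsndfile C library at runtime. On Ubuntu, sudo apt-get install libsndfile1 (plus libsndfile1-dev if compiling against it) normally resolves both the 'sndfile library not found' error and the missing sndfile.h header.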

Changing partials_n_frames to reduce partial utterance length and increase resolution (diarization with spectral clustering)

Hi!

I am trying to implement the paper: https://arxiv.org/pdf/1710.10468.pdf to create an unsupervised diarization algorithm using the d-vectors provided by the pre-trained model in Resemblyzer.

I found that the length of the partial utterances (1.6 s), determined by the hyperparameter partials_n_frames (default value 160), may be too long. In the paper, the authors recommend a window size of 240 ms and a step of 120 ms for this kind of diarization.

Is this parameter something that can be changed easily? Since it is implemented as a setting in the source code (hyperparams.py) rather than as an argument of a function or method, it looks like it is not a good idea to modify it.

Thanks in advance.

David.
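
One partial workaround worth noting (hedged, based on the public API rather than the thread): the window length is indeed fixed by partials_n_frames unless hyperparams.py is edited (and the model re-validated at the new length), but the step between windows is controlled by the rate argument of embed_utterance, so a ~120 ms step is available without touching the source:

from resemblyzer import VoiceEncoder, preprocess_wav

wav = preprocess_wav("meeting.wav")  # placeholder path
encoder = VoiceEncoder()

# rate = partial utterances per second; 1 / 0.120 s ~= 8.33 gives ~120 ms steps.
# The window itself stays 1.6 s (160 frames at 10 ms per mel frame).
_, partial_embeds, wav_splits = encoder.embed_utterance(wav, return_partials=True, rate=8.33)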

Problems with sndfile

I installed the required packages with pip3:

(base) marco@pc:~/Resemblyzer$ pip3 install -r requirements_package.txt
Requirement already satisfied: librosa>=0.6.1 in /home/marco/anaconda3/lib/python3.7/site-packages (from -r requirements_package.txt (line 1)) (0.7.0)
Requirement already satisfied: numpy>=1.10.1 in /home/marco/anaconda3/lib/python3.7/site-packages (from -r requirements_package.txt (line 2)) (1.17.1)
Requirement already satisfied: webrtcvad>=2.0.10 in /home/marco/anaconda3/lib/python3.7/site-packages (from -r requirements_package.txt (line 3)) (2.0.10)
Requirement already satisfied: torch>=1.0.1 in /home/marco/anaconda3/lib/python3.7/site-packages (from -r requirements_package.txt (line 4)) (1.3.0a0+a671609)
Requirement already satisfied: scipy>=1.2.1 in /home/marco/anaconda3/lib/python3.7/site-packages (from -r requirements_package.txt (line 5)) (1.3.1)
Requirement already satisfied: typing in /home/marco/anaconda3/lib/python3.7/site-packages (from -r requirements_package.txt (line 6)) (3.6.4)
Requirement already satisfied: joblib>=0.12 in /home/marco/anaconda3/lib/python3.7/site-packages (from librosa>=0.6.1->-r requirements_package.txt (line 1)) (0.13.2)
Requirement already satisfied: decorator>=3.0.0 in /home/marco/anaconda3/lib/python3.7/site-packages (from librosa>=0.6.1->-r requirements_package.txt (line 1)) (4.4.0)
Requirement already satisfied: numba>=0.38.0 in /home/marco/anaconda3/lib/python3.7/site-packages (from librosa>=0.6.1->-r requirements_package.txt (line 1)) (0.45.1)
Requirement already satisfied: soundfile>=0.9.0 in /home/marco/anaconda3/lib/python3.7/site-packages (from librosa>=0.6.1->-r requirements_package.txt (line 1)) (0.10.2)
Requirement already satisfied: resampy>=0.2.0 in /home/marco/anaconda3/lib/python3.7/site-packages (from librosa>=0.6.1->-r requirements_package.txt (line 1)) (0.2.2)
Requirement already satisfied: six>=1.3 in /home/marco/anaconda3/lib/python3.7/site-packages (from librosa>=0.6.1->-r requirements_package.txt (line 1)) (1.12.0)
Requirement already satisfied: scikit-learn!=0.19.0,>=0.14.0 in /home/marco/anaconda3/lib/python3.7/site-packages (from librosa>=0.6.1->-r requirements_package.txt (line 1)) (0.21.2)
Requirement already satisfied: audioread>=2.0.0 in /home/marco/anaconda3/lib/python3.7/site-packages (from librosa>=0.6.1->-r requirements_package.txt (line 1)) (2.1.8)
Requirement already satisfied: llvmlite>=0.29.0dev0 in /home/marco/anaconda3/lib/python3.7/site-packages (from numba>=0.38.0->librosa>=0.6.1->-r requirements_package.txt (line 1)) (0.29.0)
Requirement already satisfied: cffi>=1.0 in /home/marco/anaconda3/lib/python3.7/site-packages (from soundfile>=0.9.0->librosa>=0.6.1->-r requirements_package.txt (line 1)) (1.12.3)
Requirement already satisfied: pycparser in /home/marco/anaconda3/lib/python3.7/site-packages (from cffi>=1.0->soundfile>=0.9.0->librosa>=0.6.1->-r requirements_package.txt (line 1)) (2.19)

But when trying to execute demo01_similarity.py:

Traceback (most recent call last):
  File "demo01_similarity.py", line 1, in <module>
    from demo_utils import *
  File "/home/marco/Resemblyzer/demo_utils.py", line 3, in <module>
    from resemblyzer import sampling_rate
  File "/home/marco/Resemblyzer/resemblyzer/__init__.py", line 3, in <module>
    from resemblyzer.audio import preprocess_wav, wav_to_mel_spectrogram, trim_long_silences, \
  File "/home/marco/Resemblyzer/resemblyzer/audio.py", line 7, in <module>
    import librosa
  File "/home/marco/anaconda3/lib/python3.7/site-packages/librosa/__init__.py", line 13, in <module>
    from . import core
  File "/home/marco/anaconda3/lib/python3.7/site-packages/librosa/core/__init__.py", line 115, in <module>
    from .audio import *  # pylint: disable=wildcard-import
  File "/home/marco/anaconda3/lib/python3.7/site-packages/librosa/core/audio.py", line 8, in <module>
    import soundfile as sf
  File "/home/marco/anaconda3/lib/python3.7/site-packages/soundfile.py", line 142, in <module>
    raise OSError('sndfile library not found')
OSError: sndfile library not found

And I have problems with the installation of the sndfile library with pip3:

(base) marco@pc:~/Resemblyzer$ pip3 install sndfile
Collecting sndfile
  Using cached https://files.pythonhosted.org/packages/db/ce/797cacd78490aa9de2e0e119491079d380e2fbbd7a1c5057c9fb2120a643/sndfile-0.2.0.tar.gz
Requirement already satisfied: cffi>=1.0.0 in /home/marco/anaconda3/lib/python3.7/site-packages (from sndfile) (1.12.3)
Requirement already satisfied: pycparser in /home/marco/anaconda3/lib/python3.7/site-packages (from cffi>=1.0.0->sndfile) (2.19)
Building wheels for collected packages: sndfile
  Building wheel for sndfile (setup.py) ... error

I reported the issue also here: https://github.com/sangoma/sndfile/issues

Operating System: Ubuntu 18.04.02 Server Edition
gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)
pip3 -V
pip 19.2.3 from /home/marco/anaconda3/lib/python3.7/site-packages/pip (python 3.7)

Looking forward to your kind help.
Marco

typo in demo2

speaker_wavs = [wav[int(s[0] * sampling_rate):int(s[1]) * sampling_rate] for s in segments]

should be revised to

speaker_wavs = [wav[int(s[0] * sampling_rate):int(s[1]* sampling_rate)] for s in segments]

Demo 01: Similarity

I've run the similarity demo against my model trained on English speakers, using a hidden size and embedding size of 768.

I found it very interesting to see the cross-similarity matrix at different checkpoints. Your model appears to do a better job at utterance similarity. Do you think this is because I just need to train the model for longer? I'm going to continue training it for at least another week.

Note: I've changed the histogram and matrix range to [0, 1]

Model included in this repository

Utterance Median: 0.52 / 0.89
similarity_default

300k steps

Utterance Median: 0.21 / 0.82
similarity_768_300k

500k steps

Utterance Median: 0.29 / 0.82
similarity_768_500k

1M steps

Utterance Median: 0.35 / 0.83
similarity_768_1M

1.3M steps

Utterance Median: 0.33 / 0.81
similarity_768_1.3M

1.725M steps

Utterance Median: 0.33 / 0.80
similarity_768_1.725M

Multiple languages 525k steps

Utterance Median: 0.31 / 0.85
This model has a hidden and embedding size of 1024.
similarity_1024

lots of values of d-vector are zero

I successfully computed the d-vectors and used them with UIS-RNN, but the training loss was NaN. I checked and found that some d-vector components have a value of 0. Is this normal?

demo02 has a lot of noise, and "Animation is delayed further than 200ms!"

Hi, I am trying to run demo 02, but the audio has a lot of echo and the sound is distorted, and the animation is not working (it only shows an empty window). When it runs, it prints:
warnings.warn("PySoundFile failed. Trying audioread instead.")
Loaded the voice encoder model on cpu in 0.01 seconds.
Animation is delayed further than 200ms!

can anyone please help?

Re-train on my language domain

Thank you very much for the amazing repo!

I am working on my thesis; can you recommend a way to train this for my specific language? I have tried, and the result is 72% accuracy, but I cannot improve it any further.

I mean: how can I train to get the pretrained.pt for my domain? Thank you!

Cosine similarity is inconsistent with the cluster

Hi, when I visualize the voices, one sample (a female voice) is shown far away from the male speaker's utterances (which is expected).

However, when I compute the cosine similarity between the female utterance and the male ones, the value is quite high (0.88). I don't know if I am computing the cosine similarity correctly here.

embed_1 = encoder.embed_utterance(y1)
embed_2 = encoder.embed_utterance(y2)
# The embeddings are L2-normalized, so this dot product is the cosine similarity.
cosine_sim = embed_1 @ embed_2

Any help is very much appreciated!
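
One note (a general observation, hedged): since the embeddings are unit-norm, embed_1 @ embed_2 is indeed the cosine similarity, so the computation itself is correct. Similarities from this model tend to sit in a compressed, high range, so an absolute 0.88 is hard to interpret on its own; comparing it against the values that same-speaker pairs reach in the same setup is more informative than the raw number.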

Cannot import "sampling_rate"

Whenever the line "from resemblyzer import sampling_rate" is run, I get an ImportError saying "sampling_rate" cannot be imported. This also occurs when I import demo_utils, because it has this line in its imports. Is there any reason this could be the case?

I am using Google Colab and installed the resemblyzer package today (Nov 17, 2020).

Any help would be appreciated, thanks!
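
A likely cause (a suggestion, not from the thread): sampling_rate is exported from resemblyzer/__init__.py in the current repository, but demo files taken from master combined with an older PyPI release can miss that export. Installing from source (pip install -e . inside the cloned repo) or upgrading the package so the demos and the library match usually resolves it.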

How to get embeddings of audio data streaming from microphone.

I am using Resemblyzer to create embeddings for speaker diarization.
It works fine when a whole wave file is loaded into Resemblyzer.
Now I want to try real-time speaker diarization using data streamed from a microphone with pyaudio (in the form of chunks).
A chunk is essentially a frame of fixed size (100 ms in my case).
How do I get a separate embedding for each chunk using Resemblyzer?
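
Resemblyzer has no streaming API; the encoder is stateless and is designed around ~1.6 s windows, so a single 100 ms chunk is too short to embed on its own. A minimal sketch (under those assumptions, not a supported feature) that buffers chunks into a rolling window and re-embeds:

import numpy as np
import pyaudio
from resemblyzer import VoiceEncoder

SR = 16000              # model's expected sampling rate
CHUNK = SR // 10        # 100 ms chunks, as in the question
WINDOW = int(1.6 * SR)  # the encoder is built around 1.6 s windows

encoder = VoiceEncoder()
pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=SR, input=True,
                 frames_per_buffer=CHUNK)

buffer = np.zeros(0, dtype=np.float32)
while True:
    chunk = np.frombuffer(stream.read(CHUNK), dtype=np.int16)
    buffer = np.concatenate([buffer, chunk.astype(np.float32) / 32768.0])
    if len(buffer) >= WINDOW:
        embed = encoder.embed_utterance(buffer[-WINDOW:])  # latest 1.6 s window
        buffer = buffer[-WINDOW:]  # keep only the most recent window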

demo04 issue

I could not reproduce the result of demo04.
The figure for demo04 in the introduction shows a pretty good result, with few errors in distinguishing male and female voices; however, when I run the program, I cannot get even a close result.

Figure_1

Compute embeddings from stream & unsupervised diarization

Hi, great work and a great repo, really. Your code and examples helped me understand the flow very easily.
I am currently working on a speaker identification task in which I want to detect "who spoke when" with low latency. There are two tasks I need to solve, and I was wondering whether you have already worked on them or plan to in the future. If not, I would be glad to contribute to your repo with a PR. The tasks are as follows:

  1. How can I use the partial embeddings to identify speaker changes if I do not have pre-defined speaker embeddings (unlike in the speaker diarization example you gave)?
  2. Can the embeddings be computed from a streaming input, e.g. by reading wav bytes directly from the microphone?

I know they can be done with a few tweaks, but I would like your insight if you have already worked on them or have ideas about them.
Thanks!
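
On task 1, a rough baseline (a sketch, not the repo's method): compare adjacent partial embeddings and flag a speaker change when their similarity dips below a threshold. Real unsupervised systems cluster the partials instead (e.g. spectral clustering), but this shows the mechanics:

import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

wav = preprocess_wav("meeting.wav")  # placeholder path
encoder = VoiceEncoder()
_, partials, splits = encoder.embed_utterance(wav, return_partials=True, rate=4)

# Cosine similarity between consecutive windows (embeddings are unit-norm).
sims = np.einsum("ij,ij->i", partials[:-1], partials[1:])
THRESHOLD = 0.75  # assumed value; tune on your own data
changes = [splits[i + 1].start for i in np.where(sims < THRESHOLD)[0]]
print("possible speaker changes at samples:", changes)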

EER of pre-trained model?

Hi, thanks for this repository.
I understand that the speaker embedding model is based on "Generalized End-To-End Loss for Speaker Verification" and was trained on VoxCeleb2.
Could you please state the EER of your pre-trained model?

Thank you

d-Vectors for UIS-RNN

I'm working on a project in which I want to use d-vector embeddings to train a model.
Can someone please explain how to compute d-vectors for different utterances from different speakers, to pass into the UIS-RNN model?
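
A rough sketch (hedged; UIS-RNN's own training arguments are omitted, and the paths are placeholders): UIS-RNN consumes, per conversation, a sequence of observations plus a matching array of speaker labels, and the observations can be Resemblyzer's windowed partial embeddings:

import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

def dvector_sequence(fpath):
    # Windowed d-vectors for one utterance, shape (n_windows, 256).
    wav = preprocess_wav(fpath)
    _, partials, _ = encoder.embed_utterance(wav, return_partials=True, rate=2)
    return partials

seq1 = dvector_sequence("speaker1/utt1.flac")
seq2 = dvector_sequence("speaker2/utt1.flac")
train_sequence = np.concatenate([seq1, seq2])  # observations, in temporal order
train_cluster_id = np.array(["spk1"] * len(seq1) + ["spk2"] * len(seq2))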

Simple Examples

Do you have any examples of creating a voice profile and checking that profile against an audio track to determine whether the speaker is in it? demo02 is close, but quite complex to get my head around.
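
A minimal sketch of that flow (assuming placeholder file names and an arbitrary threshold): build a profile by averaging embeddings of enrollment recordings with embed_speaker, then score each window of the track against it:

import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

# 1) Build a voice profile from a few enrollment recordings.
enroll_wavs = [preprocess_wav(p) for p in ["me_1.wav", "me_2.wav", "me_3.wav"]]
profile = encoder.embed_speaker(enroll_wavs)  # averaged, unit-norm embedding

# 2) Scan the track and score each ~1.6 s window against the profile.
track = preprocess_wav("some_track.wav")
_, partials, splits = encoder.embed_utterance(track, return_partials=True, rate=4)
scores = partials @ profile  # cosine similarity per window
print("speaker present:", scores.max() > 0.8)  # threshold is an assumption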

decode back?

Hi!
Thanks for this wonderful project.
I have a question about decoding back from embedding to waveform. How can I do it?

How many audio utterances per speaker to get a good recognition?

Hello, and first things first: thank you so much for such a great work.

I have been testing your tool with a bunch of voices (28) to consider using it in a voice recognition system. By the way, it works great.
However, after a few tests it is still not quite clear to me how many audio files per speaker, and of what length, would be needed to assure a good result in identifying the speaker.

Since the production system will have to deal with quite short commands (2-3 words), I tried demo1 with 3 short audios. The results aren't very good:

Comparison of 3 short audios

Then, wondering whether longer audios would be better, I used 3 long ones (20-25 words), but the improvement, if I read it right, shows up in speaker identification and not so much per utterance.

Comparison of 3 long audios

Another thing I've tried is using some more audios (8 short ones plus the 3 long ones), which is already better:

Voice comparison

The question here is: how many audios, and of what kind, would you recommend to get good per-utterance results (since false positives are to be avoided)?

Bonus question: since my company works with .NET Core, I have exported your pretrained model to ONNX, and now face the preprocessing of the audio to feed it. Could you recommend any code for the preprocessing part?
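
On the bonus question, a hedged reference for the preprocessing (based on resemblyzer/audio.py; verify against your version): the model consumes 40-band mel spectrograms of 16 kHz audio with 25 ms windows and 10 ms hops, and preprocess_wav additionally normalizes the volume and trims long silences. The .NET side must reproduce this numerically; an equivalent Python reference:

import librosa
import numpy as np

def reference_mel(wav: np.ndarray, sr: int = 16000) -> np.ndarray:
    # Mirrors wav_to_mel_spectrogram (assumed parameters: 16 kHz input,
    # 25 ms windows, 10 ms hop, 40 mel bands); output shape (n_frames, 40).
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr,
        n_fft=int(sr * 0.025),
        hop_length=int(sr * 0.010),
        n_mels=40)
    return mel.astype(np.float32).T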

No figure and no sound when I run demo 02.

[screenshot]

Below is my environment.
[screenshots of the environment]

All other demos work well; just demo 2 is not working. This may be a silly question, but as a beginner, after several hours of trying, I still cannot fix it. Can you help me? Thank you.

Model path and hparams

What are your thoughts on allowing the model path to be passed in (defaulting to None) and on overriding the hyperparameters? Or do you think the best route is sub-classing and overriding the init function?

def __init__(self, device: Union[str, torch.device]=None, verbose=True):
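
For what it's worth, a subclassing sketch under assumptions (the checkpoint layout is guessed from the repo's pretrained file; adjust the key to your checkpoint). Overriding hyperparameters is riskier than overriding the weights path, since the pretrained weights only make sense with the hyperparameters they were trained with:

import torch
from resemblyzer import VoiceEncoder

class CustomEncoder(VoiceEncoder):
    # Same architecture, custom checkpoint; weights_fpath is user-supplied.
    def __init__(self, weights_fpath, device=None):
        super().__init__(device)  # builds the layers and loads the defaults
        checkpoint = torch.load(weights_fpath, map_location=self.device)
        self.load_state_dict(checkpoint["model_state"])  # key assumed
        self.eval()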

Installation Failed

$ pipenv install resemblyzer
Installing resemblyzer…
Error:  An error occurred while installing resemblyzer!
Error text: Collecting resemblyzer
  Using cached Resemblyzer-0.1.1.dev0-py3-none-any.whl (15.7 MB)
Collecting librosa>=0.6.1
  Using cached librosa-0.7.2.tar.gz (1.6 MB)
Collecting typing
  Using cached typing-3.7.4.1-py3-none-any.whl (25 kB)
Collecting webrtcvad>=2.0.10
  Using cached webrtcvad-2.0.10.tar.gz (66 kB)
Collecting scipy>=1.2.1
  Using cached scipy-1.5.0-cp36-cp36m-manylinux1_x86_64.whl (25.9 MB)
Collecting torch>=1.0.1

Killed

✘ Installation Failed 
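
A likely cause (a suggestion, not from the log): a bare "Killed" while collecting torch usually means the kernel's out-of-memory killer stopped pip while it was unpacking the large torch wheel. Retrying with pip install --no-cache-dir torch, or adding swap space, typically gets past it.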

High GPU memory requirement

I tried to load the model on a GTX 1080 GPU and run it, but it asks for a huge amount of memory.
This is the error it throws:

Traceback (most recent call last):
  File "/home/server/anaconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3326, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 1, in <module>
    runfile('/media/server/b92be869-bd56-4ed4-9306-12a754f7065f/diarization-package/Resemblyzer/demo02_diarization.py', wdir='/media/server/b92be869-bd56-4ed4-9306-12a754f7065f/diarization-package/Resemblyzer')
  File "/home/server/pycharm-community-2019.2.4/helpers/pydev/_pydev_bundle/pydev_umd.py", line 197, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "/home/server/pycharm-community-2019.2.4/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents + "\n", file, 'exec'), glob, loc)
  File "/media/server/b92be869-bd56-4ed4-9306-12a754f7065f/diarization-package/Resemblyzer/demo02_diarization.py", line 64, in <module>
    run()
  File "/media/server/b92be869-bd56-4ed4-9306-12a754f7065f/diarization-package/Resemblyzer/demo02_diarization.py", line 46, in run
    _, cont_embeds, wav_splits = encoder.embed_utterance(wav, return_partials=True, rate=16)
  File "/media/server/b92be869-bd56-4ed4-9306-12a754f7065f/diarization-package/Resemblyzer/resemblyzer/voice_encoder.py", line 152, in embed_utterance
    partial_embeds = self(mels).cpu().numpy()
  File "/home/server/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/server/b92be869-bd56-4ed4-9306-12a754f7065f/diarization-package/Resemblyzer/resemblyzer/voice_encoder.py", line 57, in forward
    _, (hidden, _) = self.lstm(mels)
  File "/home/server/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/server/anaconda3/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 564, in forward
    return self.forward_tensor(input, hx)
  File "/home/server/anaconda3/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 543, in forward_tensor
    output, hidden = self.forward_impl(input, hx, batch_sizes, max_batch_size, sorted_indices)
  File "/home/server/anaconda3/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 526, in forward_impl
    self.dropout, self.training, self.bidirectional, self.batch_first)
RuntimeError: CUDA out of memory. Tried to allocate 27.50 GiB (GPU 0; 7.93 GiB total capacity; 4.17 GiB already allocated; 3.24 GiB free; 22.08 MiB cached)
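
For context (hedged): with return_partials=True and rate=16, every 1.6 s window of the file is batched through the LSTM at once, so long inputs produce enormous activations. Two mitigations worth trying (suggestions, not from the thread):

from resemblyzer import VoiceEncoder

# 1) Run on CPU; the model is small and often fast enough there.
encoder = VoiceEncoder("cpu")

# 2) Or stay on GPU but lower the rate, so far fewer partials are batched:
# _, cont_embeds, wav_splits = encoder.embed_utterance(wav, return_partials=True, rate=4)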

Library not loaded: @rpath/libc++.1.dylib when run sample in macOS Catalina

when I run the test.py file:

from resemblyzer import VoiceEncoder, preprocess_wav
from pathlib import Path
import numpy as np
fpath = Path('audio_data')
wav = preprocess_wav(fpath)
encoder = VoiceEncoder()
embed = encoder.embed_utterance(wav)
np.set_printoptions(precision=3, suppress=True)
print(embed)

this exception occurred:

Traceback (most recent call last):
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 783, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/resemblyzer/__init__.py", line 5, in <module>
    from resemblyzer.voice_encoder import VoiceEncoder
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/resemblyzer/voice_encoder.py", line 5, in <module>
    from torch import nn
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/torch/__init__.py", line 136, in <module>
    from torch._C import *
ImportError: dlopen(/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/torch/_C.cpython-38-darwin.so, 2): Library not loaded: @rpath/libc++.1.dylib
  Referenced from: /Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/torch/_C.cpython-38-darwin.so
  Reason: image not found

Disable randomness in demos?

I'm not sure whether this is a feature or a bug, or what causes the randomness in some demos. For example, here are two results from running:

python demo03_projection.py

demo03_projection1
demo03_projection2

And here from running:

python demo04_clustering.py

demo04_clustering1
demo04_clustering2

It would be nice if the demos always returned the same result.
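
A sketch of pinning the stochastic parts (an assumption about the cause: the demos' 2D projections come from UMAP in demo_utils.py, and UMAP is randomized unless seeded). Fixing random_state, plus NumPy's global seed for any sampling, should make runs repeatable:

import numpy as np
from umap import UMAP

np.random.seed(0)                # pins any NumPy-based sampling
reducer = UMAP(random_state=0)   # deterministic projection across runs
# projs = reducer.fit_transform(embeds)  # embeds: (n_utterances, 256)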

torch.jit.frontend.UnsupportedNodeError: DictComp aren't supported

I know this is not strictly related to the Resemblyzer library, but I hope that you or others can help me.

My objective is to use the following class, which simply wraps your excellent work:

import torch
from torch.utils.data import DataLoader
import torchaudio
from resemblyzer import preprocess_wav, VoiceEncoder
from demo_utils import *
from itertools import groupby
from pathlib import Path
from tqdm import tqdm
import numpy as np
import sys

np.set_printoptions(threshold=sys.maxsize)

class GetSpeechEmbedding(torch.nn.Module):
    def __init__(self):
        super(GetSpeechEmbedding, self).__init__()

    def forward(self, a, b, extension):
        encoder = VoiceEncoder()
        p = Path(a, b)
        wav_fpaths = list(p.glob("**/*." + extension))
        speaker_wavs = {speaker: list(map(preprocess_wav, wav_fpaths)) for speaker, wav_fpaths in
                        groupby(tqdm(wav_fpaths, "Preprocessing wavs", len(wav_fpaths), unit="wavs"),
                                lambda wav_fpath: wav_fpath.parent.stem)}
        embeds_a = np.array([encoder.embed_utterance(wavs[0]) for wavs in speaker_wavs.values()])
        return embeds_a

With python it works fine:

(venv373) (base) marco@pc01:~/PyTorchMatters/Resemblyzer$ python3 ./getSpeechEmbedding.py
Loaded the voice encoder model on cpu in 0.03 seconds.
Preprocessing wavs: 100%|██████████| 100/100 [00:01<00:00, 78.71wavs/s]
[[0.00000000e+00 4.73524793e-04 1.30652651e-01 0.00000000e+00
  8.16560537e-02 8.52451399e-02 0.00000000e+00 0.00000000e+00
  7.62926340e-02 1.80580884e-01 2.26451114e-01 1.55212656e-01
  1.29384667e-01 0.00000000e+00 1.10790301e-04 0.00000000e+00

Following the indications found here: https://pytorch.org/tutorials/advanced/cpp_export.html

I added these lines to getSpeechEmbedding.py:

my_module = GetSpeechEmbedding()
sm = torch.jit.script(my_module)
sm.save("annotated_get_speech_embedding.pt")

But when trying to serialize the module, I get this error:
"torch.jit.frontend.UnsupportedNodeError: DictComp aren't supported"

(venv373) (base) marco@pc01:~/PyTorchMatters/Resemblyzer$ python3 ./getSpeechEmbedding.py
Traceback (most recent call last):
  File "./getSpeechEmbedding.py", line 114, in <module>
    sm = torch.jit.script(my_module)
  File "/home/marco/anaconda3/lib/python3.7/site-packages/torch/jit/__init__.py", line 1255, in script
    return torch.jit._recursive.recursive_script(obj)
  File "/home/marco/anaconda3/lib/python3.7/site-packages/torch/jit/_recursive.py", line 534, in recursive_script
    return create_script_module(nn_module, infer_methods_to_compile(nn_module))
  File "/home/marco/anaconda3/lib/python3.7/site-packages/torch/jit/_recursive.py", line 493, in infer_methods_to_compile
    stubs.append(make_stub_from_method(nn_module, method))
  File "/home/marco/anaconda3/lib/python3.7/site-packages/torch/jit/_recursive.py", line 40, in make_stub_from_method
    return make_stub(func)
  File "/home/marco/anaconda3/lib/python3.7/site-packages/torch/jit/_recursive.py", line 33, in make_stub
    ast = torch.jit.get_jit_def(func, self_name="RecursiveScriptModule")
  File "/home/marco/anaconda3/lib/python3.7/site-packages/torch/jit/frontend.py", line 171, in get_jit_def
    return build_def(ctx, py_ast.body[0], type_line, self_name)
  File "/home/marco/anaconda3/lib/python3.7/site-packages/torch/jit/frontend.py", line 212, in build_def
    build_stmts(ctx, body))
  File "/home/marco/anaconda3/lib/python3.7/site-packages/torch/jit/frontend.py", line 127, in build_stmts
    stmts = [build_stmt(ctx, s) for s in stmts]
  File "/home/marco/anaconda3/lib/python3.7/site-packages/torch/jit/frontend.py", line 127, in <listcomp>
    stmts = [build_stmt(ctx, s) for s in stmts]
  File "/home/marco/anaconda3/lib/python3.7/site-packages/torch/jit/frontend.py", line 187, in __call__
    return method(ctx, node)
  File "/home/marco/anaconda3/lib/python3.7/site-packages/torch/jit/frontend.py", line 289, in build_Assign
    rhs = build_expr(ctx, stmt.value)
  File "/home/marco/anaconda3/lib/python3.7/site-packages/torch/jit/frontend.py", line 186, in __call__
    raise UnsupportedNodeError(ctx, node)
torch.jit.frontend.UnsupportedNodeError: DictComp aren't supported:
  File "./getSpeechEmbedding.py", line 46
        # It normalizes the volume, trims long silences and resamples the wav to the correct sampling rate.

        speaker_wavs = {speaker: list(map(preprocess_wav, wav_fpaths)) for speaker, wav_fpaths in
                       ~ <--- HERE
                    groupby(tqdm(wav_fpaths, "Preprocessing wavs", len(wav_fpaths), unit="wavs"),
                            lambda wav_fpath: wav_fpath.parent.stem)}

I read here (https://pytorch.org/tutorials/advanced/cpp_export.html) that "If you need to exclude some methods in your nn.Module because they use Python features that TorchScript doesn't support yet, you could annotate those with @torch.jit.ignore".
But I guess that excluding the dictionary comprehension from the serialization would break the method's functionality.
What would you suggest I do?
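
One concrete suggestion (a sketch, not a guaranteed fix): the DictComp error itself disappears if the comprehension is rewritten as a plain loop, as below. However, TorchScript will then likely reject the next non-tensor construct in forward (Path, glob, tqdm, groupby), so in practice scripting only the tensor-facing module and keeping the file handling in ordinary Python is the more realistic route. Note that groupby only groups correctly on sorted input, so a sort is added here:

# Inside forward, replacing the dictionary comprehension:
speaker_wavs = {}
sorted_fpaths = sorted(wav_fpaths, key=lambda p: p.parent.stem)
for speaker, fpaths in groupby(sorted_fpaths, lambda p: p.parent.stem):
    speaker_wavs[speaker] = [preprocess_wav(p) for p in fpaths]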

Error opening 'audio_data/test.mp3': File contains data in an unknown format

Attempting to run demo02, but running into this error:

Error opening 'audio_data/test.mp3': File contains data in an unknown format

raise RuntimeError(prefix + _ffi.string(err_str).decode('utf-8', 'replace'))

I have reinstalled, made sure librosa is up to date, and installed ffmpeg just in case. Still no luck. Is this me, or a bug?

warnings.warn('PySoundFile failed. Trying audioread instead.')
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.6/site-packages/librosa/core/audio.py", line 129, in load
    with sf.SoundFile(path) as sf_desc:
  File "/home/user/.local/lib/python3.6/site-packages/soundfile.py", line 629, in __init__
    self._file = self._open(file, mode_int, closefd)
  File "/home/user/.local/lib/python3.6/site-packages/soundfile.py", line 1184, in _open
    "Error opening {0!r}: ".format(self.name))
  File "/home/user/.local/lib/python3.6/site-packages/soundfile.py", line 1357, in _error_check
    raise RuntimeError(prefix + _ffi.string(err_str).decode('utf-8', 'replace'))
RuntimeError: Error opening 'audio_data/testaudio.mp3': File contains data in an unknown format.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./demo02_diarization.py", line 17, in <module>
    wav = preprocess_wav("audio_data/testaudio.mp3")
  File "/home/user/voice-recog/Resemblyzer/resemblyzer/audio.py", line 27, in preprocess_wav
    wav, source_sr = librosa.load(str(fpath_or_wav), sr=None)
  File "/home/user/.local/lib/python3.6/site-packages/librosa/core/audio.py", line 162, in load
    y, sr_native = __audioread_load(path, offset, duration, dtype)
  File "/home/user/.local/lib/python3.6/site-packages/librosa/core/audio.py", line 186, in __audioread_load
    with audioread.audio_open(path) as input_file:
  File "/home/user/.local/lib/python3.6/site-packages/audioread/__init__.py", line 116, in audio_open
    raise NoBackendError()
audioread.exceptions.NoBackendError

Are demos meant to be non-deterministic?

When running demo 04 I get different results each time. I appreciate that demo 05 randomly selects from the 12 real samples, so its outcome would vary, but I would have expected the other demos to be deterministic. What's the expected behaviour in this regard?

Also, in demo 04 I needed to sort the glob results, or it was unable to show any pattern (it appeared to link the gender markers arbitrarily). I'm running Linux, so the glob ordering may differ from Windows. Even after doing that, though, the results are different between runs (although they do then segregate the voice samples by gender).

I can include some screenshots later if it's helpful.

About Pre-trained Model

Hello, I came here from your voice cloning repo.
I noticed that this repo and the encoder of your voice cloning repo are both based on GE2E.
May I ask whether the pre-trained models you provide are the same?
Does it make sense to use this repo to measure the similarity of audio generated by your voice cloning repo?
Thanks.

Pre-trained model accuracy

Hi Corentin,
Thanks for providing this amazing repo. I was just going through the speaker diarization script and tried to create an embedding of my voice. I tested it on multiple audios recorded on different devices and got decent results. Would you suggest retraining/fine-tuning this model for better accuracy?
Or should I experiment with other available pretrained models for better results?

ImportError: cannot import name 'sampling_rate'

Hello! I'm facing an
ImportError: cannot import name 'sampling_rate'
when I try to run the demo02_diarization.py file provided in the examples:
from resemblyzer import sampling_rate

Could you help me to figure out how to fix this error?

Fake Voice Detection

Hey guys,
I am relatively new to this field. Can you please explain how the fake voice detection demo works?
If there is a blog post on this, please mention it as well.
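
Roughly (a summary of the demo's approach, hedged; the demo script itself is the best reference, as there is no dedicated blog post): demo05 embeds a set of utterances known to be genuinely from a speaker, embeds the clips under test, and scores each clip by its similarity to the genuine references; real speech from that speaker tends to score noticeably higher than synthesized or imitated speech.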

demo02_diarization problem

When I start to run demo02_diarization.py, it shows:
ImportError: cannot import name 'sampling_rate' from 'resemblyzer'

Finding out the gender

Hey Corentin, it is mentioned that we can use component analysis for gender identification; could you please guide me on what needs to be done?
Also, I'm getting mixed results for the clustering demo; could you guess what might be wrong here and how to go about fixing it?
Figure_1

Thanks!
