wq2012 / spectralcluster

Python re-implementation of the (constrained) spectral clustering algorithms used in Google's speaker diarization papers.

Home Page: https://google.github.io/speaker-id/publications/LstmDiarization/

License: Apache License 2.0

Languages: Python (99.02%), Shell (0.98%)
Topics: machine-learning, clustering, spectral-clustering, unsupervised-learning, speaker-diarization, unsupervised-clustering, python, constrained-clustering, auto-tune

spectralcluster's Introduction

Spectral Clustering


Overview

This is a Python re-implementation of the spectral clustering algorithms presented in these papers:

Algorithm                       | Paper
--------------------------------|------
Refined Laplacian matrix        | Speaker Diarization with LSTM
Constrained spectral clustering | Turn-to-Diarize: Online Speaker Diarization Constrained by Transformer Transducer Speaker Turn Detection
Multi-stage clustering          | Highly Efficient Real-Time Streaming and Fully On-Device Speaker Diarization with Multi-Stage Clustering

[Figure: the sequence of refinement operations applied to the affinity matrix]

Notice

We recently added new functionality to this library to cover the algorithms from a new paper, and updated the APIs accordingly.

If you depend on our old API, please use an older version of this library:

pip3 install spectralcluster==0.1.0

Disclaimer

This is not a Google product.

This is not the original C++ implementation used by the papers.

Please consider this repo as a "demonstration" of the algorithms, instead of a "reproduction" of what we use at Google. Some features might be missing or incomplete.

Installation

Install the package by:

pip3 install spectralcluster

or

python3 -m pip install spectralcluster

Tutorial

Simply use the predict() method of the SpectralClusterer class to perform spectral clustering. The example below should be closest to the original C++ implementation used by our ICASSP 2018 paper.

from spectralcluster import configs

labels = configs.icassp2018_clusterer.predict(X)

The input X is a numpy array of shape (n_samples, n_features), and the returned labels is a numpy array of shape (n_samples,).
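For example, a minimal end-to-end sketch with synthetic inputs (in practice, X would hold speaker embeddings, such as d-vectors extracted from audio):

import numpy as np

from spectralcluster import configs

# Synthetic stand-in for real speaker embeddings: 200 samples, 256 dims.
X = np.random.rand(200, 256)

labels = configs.icassp2018_clusterer.predict(X)
print(labels.shape)  # (200,)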

You can also create your own clusterer like this:

from spectralcluster import SpectralClusterer

clusterer = SpectralClusterer(
    min_clusters=2,
    max_clusters=7,
    autotune=None,
    laplacian_type=None,
    refinement_options=None,
    custom_dist="cosine")

labels = clusterer.predict(X)

For the complete list of parameters of SpectralClusterer, see spectralcluster/spectral_clusterer.py.

[Video thumbnails: our ICASSP 2018 and ICASSP 2022 presentations on YouTube]

Advanced features

Refinement operations

In our ICASSP 2018 paper, we apply a sequence of refinement operations to the affinity matrix, which is critical to the performance of speaker diarization.

You can specify your refinement operations like this:

from spectralcluster import RefinementOptions
from spectralcluster import ThresholdType
from spectralcluster import ICASSP2018_REFINEMENT_SEQUENCE

refinement_options = RefinementOptions(
    gaussian_blur_sigma=1,
    p_percentile=0.95,
    thresholding_soft_multiplier=0.01,
    thresholding_type=ThresholdType.RowMax,
    refinement_sequence=ICASSP2018_REFINEMENT_SEQUENCE)

Then you can pass the refinement_options as an argument when initializing your SpectralClusterer object.
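For example (reusing the refinement_options defined above; the cluster bounds here are illustrative):

from spectralcluster import SpectralClusterer

clusterer = SpectralClusterer(
    min_clusters=2,
    max_clusters=7,
    refinement_options=refinement_options)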

For the complete list of RefinementOptions, see spectralcluster/refinement.py.

Laplacian matrix

In our ICASSP 2018 paper, we apply a refinement operation CropDiagonal on the affinity matrix, which replaces each diagonal element of the affinity matrix by the max non-diagonal value of the row. After this operation, the matrix has similar properties to a standard Laplacian matrix, and it is also less sensitive (thus more robust) to the Gaussian blur operation than a standard Laplacian matrix.

The new version of this library supports several types of Laplacian matrices, including:

  • None Laplacian (affinity matrix): W
  • Unnormalized Laplacian: L = D - W
  • Graph cut Laplacian: L' = D^{-1/2} * L * D^{-1/2}
  • Random walk Laplacian: L' = D^{-1} * L

You can specify the Laplacian matrix type with the laplacian_type argument of the SpectralClusterer class.
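For example, a minimal sketch selecting the graph cut Laplacian (assuming LaplacianType is exported at the package top level):

from spectralcluster import LaplacianType
from spectralcluster import SpectralClusterer

clusterer = SpectralClusterer(
    min_clusters=2,
    max_clusters=7,
    laplacian_type=LaplacianType.GraphCut)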

Note: Refinement operations are applied to the affinity matrix before computing the Laplacian matrix.

Distance for K-Means

In our ICASSP 2018 paper, the K-Means is based on Cosine distance.

You can set custom_dist="cosine" when initializing your SpectralClusterer object.

You can also use other distances supported by scipy.spatial.distance, such as "euclidean" or "mahalanobis".
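For example, to switch the K-Means step to Euclidean distance:

clusterer = SpectralClusterer(
    min_clusters=2,
    max_clusters=7,
    custom_dist="euclidean")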

Affinity matrix

In our ICASSP 2018 paper, the affinity between two embeddings is defined as (cos(x,y)+1)/2.

You can also use other affinity functions by setting affinity_function when initializing your SpectralClusterer object.
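As a sketch, here is a custom affinity function equivalent to the default (cos(x,y)+1)/2; we assume affinity_function receives the (n_samples, n_features) embedding array and returns an (n_samples, n_samples) matrix (check spectralcluster/spectral_clusterer.py for the exact contract):

import numpy as np

def cosine_affinity(embeddings):
  # Map cosine similarity from [-1, 1] to [0, 1], as in the ICASSP 2018 paper.
  l2_norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
  normalized = embeddings / l2_norms
  return (np.matmul(normalized, normalized.T) + 1.0) / 2.0

clusterer = SpectralClusterer(
    min_clusters=2,
    max_clusters=7,
    affinity_function=cosine_affinity)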

Auto-tune

We also support auto-tuning the p_percentile parameter of the RowWiseThreshold refinement operation, which was originally proposed in this paper.

You can enable this by passing in an AutoTune object to the autotune argument when initializing your SpectralClusterer object.

Example:

from spectralcluster import AutoTune, AutoTuneProxy

autotune = AutoTune(
    p_percentile_min=0.60,
    p_percentile_max=0.95,
    init_search_step=0.01,
    search_level=3,
    proxy=AutoTuneProxy.PercentileSqrtOverNME)
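Then pass it to the autotune argument; a sketch (auto-tune targets the p_percentile of the RowWiseThreshold refinement, so that operation should be in your refinement sequence):

clusterer = SpectralClusterer(
    min_clusters=2,
    max_clusters=7,
    autotune=autotune,
    refinement_options=refinement_options)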

For the complete list of parameters of AutoTune, see spectralcluster/autotune.py.

Fallback clusterer

Spectral clustering exploits the global structure of the data. But there are cases where spectral clustering does not work as well as some other simpler clustering methods, such as when the number of embeddings is too small.

When initializing the SpectralClusterer object, you can pass in a FallbackOptions object to the fallback_options argument, to use a fallback clusterer under certain conditions.

Also, spectral clustering and eigen-gap may not work well at making single-vs-multi cluster decisions. When min_clusters=1, you can also specify FallbackOptions.single_cluster_condition and FallbackOptions.single_cluster_affinity_threshold to help detect single-cluster cases by thresholding the affinity matrix.
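A hedged sketch (we assume FallbackOptions is exported at the package top level and accepts this keyword; the threshold value is illustrative, and the remaining parameters live in spectralcluster/fallback_clusterer.py):

from spectralcluster import FallbackOptions
from spectralcluster import SpectralClusterer

fallback_options = FallbackOptions(
    single_cluster_affinity_threshold=0.75)  # illustrative value

clusterer = SpectralClusterer(
    min_clusters=1,
    max_clusters=7,
    fallback_options=fallback_options)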

For the complete list of parameters of FallbackOptions, see spectralcluster/fallback_clusterer.py.

Speed up the clustering

Spectral clustering can become slow when the number of input embeddings is large. This is due to the high costs of steps such as computing the Laplacian matrix, and eigen decomposition of the Laplacian matrix. One trick to speed up the spectral clustering when the input size is large is to use hierarchical clustering as a pre-clustering step.

To use this feature, you can specify the max_spectral_size argument when constructing the SpectralClusterer object. For example, if you set max_spectral_size=200, then the Laplacian matrix can be at most 200 * 200.
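For example:

clusterer = SpectralClusterer(
    min_clusters=2,
    max_clusters=7,
    max_spectral_size=200)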

But please note that setting max_spectral_size may cause degradation of the final clustering quality. So please use this feature wisely.

Constrained spectral clustering

[Figure: Turn-to-Diarize system diagram]

In the Turn-to-Diarize paper, the spectral clustering is constrained by speaker turns. We implemented two constrained spectral clustering methods:

  • Affinity integration.
  • Constraint propagation (see papers [1] and [2]).

If you pass in a ConstraintOptions object when initializing your SpectralClusterer object, you can call the predict function with a constraint_matrix.

Example usage:

from spectralcluster import constraint
from spectralcluster import LaplacianType
from spectralcluster import SpectralClusterer

ConstraintName = constraint.ConstraintName

constraint_options = constraint.ConstraintOptions(
    constraint_name=ConstraintName.ConstraintPropagation,
    apply_before_refinement=True,
    constraint_propagation_alpha=0.6)

clusterer = SpectralClusterer(
    max_clusters=2,
    refinement_options=refinement_options,  # e.g. the RefinementOptions from above
    constraint_options=constraint_options,
    laplacian_type=LaplacianType.GraphCut,
    row_wise_renorm=True)

labels = clusterer.predict(matrix, constraint_matrix)

The constraint matrix can be constructed from a speaker_turn_scores list:

from spectralcluster import constraint

constraint_matrix = constraint.ConstraintMatrix(
    speaker_turn_scores, threshold=1).compute_diagonals()
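For illustration, a hedged sketch of the input: we assume speaker_turn_scores[i] is the speaker turn score between segment i-1 and segment i (so the first entry is a dummy 0), with higher scores meaning a more confident speaker change; check spectralcluster/constraint.py for the exact convention.

from spectralcluster import constraint

# Illustrative values: large scores indicate a likely speaker change.
speaker_turn_scores = [0, 0.1, 4.5, 0.2, 3.8]

constraint_matrix = constraint.ConstraintMatrix(
    speaker_turn_scores, threshold=1).compute_diagonals()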

Multi-stage clustering

[Figure: multi-stage clustering diagram]

In the multi-stage clustering paper, we introduced a highly efficient streaming clustering approach. This is implemented as the MultiStageClusterer class in spectralcluster/multi_stage_clusterer.py.

Note: We did NOT implement speaker turn detection in this open source library. We only implemented the fallback clusterer, main clusterer, pre-clusterer, and dynamic compression here.

The MultiStageClusterer class has a method named streaming_predict. In streaming clustering, every time we feed a single new embedding to the streaming_predict function, it will return the sequence of cluster labels for all inputs, including corrections for the predictions on previous embeddings.

Example usage:

from spectralcluster import Deflicker
from spectralcluster import MultiStageClusterer
from spectralcluster import SpectralClusterer

main_clusterer = SpectralClusterer()

multi_stage = MultiStageClusterer(
    main_clusterer=main_clusterer,
    fallback_threshold=0.5,
    L=50,
    U1=200,
    U2=400,
    deflicker=Deflicker.Hungarian)

for embedding in embeddings:
    labels = multi_stage.streaming_predict(embedding)

Citations

Our papers are cited as:

@inproceedings{wang2018speaker,
  title={{Speaker Diarization with LSTM}},
  author={Wang, Quan and Downey, Carlton and Wan, Li and Mansfield, Philip Andrew and Moreno, Ignacio Lopez},
  booktitle={2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={5239--5243},
  year={2018},
  organization={IEEE}
}

@inproceedings{xia2022turn,
  title={{Turn-to-Diarize: Online Speaker Diarization Constrained by Transformer Transducer Speaker Turn Detection}},
  author={Wei Xia and Han Lu and Quan Wang and Anshuman Tripathi and Yiling Huang and Ignacio Lopez Moreno and Hasim Sak},
  booktitle={2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={8077--8081},
  year={2022},
  organization={IEEE}
}

@article{wang2022highly,
  title={Highly Efficient Real-Time Streaming and Fully On-Device Speaker Diarization with Multi-Stage Clustering},
  author={Quan Wang and Yiling Huang and Han Lu and Guanlong Zhao and Ignacio Lopez Moreno},
  journal={arXiv:2210.13690},
  year={2022}
}


Misc

We also have fully supervised speaker diarization systems, powered by uis-rnn. Check this Google AI Blog.

Also check out our recent work on DiarizationLM.

To learn more about speaker diarization, you can check out the resources linked from our home page.

spectralcluster's People

Contributors

ericwxia, gogyzzz, wq2012


spectralcluster's Issues

Using "custom_dist" for affinity computation

While custom_dist is used in the final K-means clustering step,

# Run K-means on spectral embeddings.
labels = custom_distance_kmeans.run_kmeans(
    spectral_embeddings,
    n_clusters=n_clusters,
    custom_dist=self.custom_dist,
    max_iter=self.max_iter)

it looks like the initial affinity matrix computation does not take this option into account, and always relies on cosine similarity:

# Normalize the data.
l2_norms = np.linalg.norm(embeddings, axis=1)
embeddings_normalized = embeddings / l2_norms[:, None]
# Compute cosine similarities. Range is [-1,1].
cosine_similarities = np.matmul(embeddings_normalized,
                                np.transpose(embeddings_normalized))
# Compute the affinity. Range is [0,1].
# Note that this step is not mentioned in the paper!
affinity = (cosine_similarities + 1.0) / 2.0

The docstring does mention its use for K-means, but does not say anything about the affinity matrix:

custom_dist: str or callable. custom distance measure for k-means. If a
string, "cosine", "euclidean", "mahalanobis", or any other distance
functions defined in scipy.spatial.distance can be used

My intuition tells me that custom_dist would be more useful for the initial affinity computation as some embeddings could have been optimized for a specific metric.

On the other hand, since K-means is applied in the spectral embedding space, it does make sense to stick to cosine similarity.

Would you accept a PR that relies on custom_dist (or better, a new affinity_custom_dist option) to compute the initial affinity matrix?

That being said, my embeddings are trained for cosine similarity so this is not such an issue for me right now.

[Announcement] Upcoming API changes in version 0.2.0

Hi all,

We will be updating the APIs of this library shortly to include new algorithms to appear in an upcoming paper.

If your code relies on the old API, please either:

  1. Update to the new API.
  2. Make sure you install the old version (0.1.0) of this library.

Otherwise the update may break your code.

Thanks!

Quality of Streaming clustering

Is there any paper or evaluation table showing how streaming clustering performs for speaker diarization compared to non-streaming clustering?

Config for Auto-Tune

Hi.
I'm trying to reproduce the results of the auto-tune paper, but all my results are worse than plain AHC. Can anybody please provide configs for speaker diarization?
I use this one:

autotune = AutoTune(
    p_percentile_min=0.60,
    p_percentile_max=0.95,
    init_search_step=0.01,
    search_level=3)

refinement_options = RefinementOptions(
    gaussian_blur_sigma=1,
    p_percentile=0.95,
    thresholding_soft_multiplier=0.01,
    thresholding_type=ThresholdType.RowMax,
    refinement_sequence=ICASSP2018_REFINEMENT_SEQUENCE)

clusterer = SpectralClusterer(
    min_clusters=1,
    max_clusters=5,
    autotune=autotune,
    laplacian_type=LaplacianType.GraphCut,
    refinement_options=refinement_options,
    custom_dist="cosine")

P.S. I'm using the VoxConverse dataset.

How to use

Can anyone give an example of how to use the algorithm? What is X? How do I read a wav file into the X array?

How to set number of components for the eigenvectors

Hi. I am trying to use SpectralCluster in one of my speaker identification projects. I can see that when instantiating the SpectralClusterer object we can input the minimum and maximum number of clusters. But how do we set the number of eigenvectors to be used for the embedding?

Creating an Initial Cluster and predicting using that cluster

Hi. Thank you so much for the repo. I would like to use spectral clustering to accomplish the following:

  1. Take a 1 hour audio
  2. Take the first N min of audio and create an initial cluster
  3. I would then like to take the rest of the audio, segment it in some way, and push those audio segments through the initial cluster to see which speaker created in the initial cluster it is most similar to (or just return the speaker label it is closest to)

(Please let me know if this isn't clear)

Is there a way to accomplish this using SpectralCluster? Create an initial cluster based on some fixed length audio and use that cluster model to predict other audio segments?

Is there some parameter here:

clusterer = SpectralClusterer(
    min_clusters=2,
    max_clusters=7,
    autotune=None,
    laplacian_type=None,
    refinement_options=None,
    custom_dist="cosine")

that can be set to accomplish this before running cluster.predict()?

Thank you in advance for the reply.

Is the RowWiseThreshold doing what I think?

The paper states:

Row-wise Thresholding: For each row, set elements smaller than this row’s p-percentile to 0

So, if p_percentile = 0.95, then we retain the top 5% of affinities and soft-threshold the rest (even if they are high), which seems intuitive to me.

In the code:

row_max = Y.max(axis=1)
row_max = np.expand_dims(row_max, axis=1)
is_smaller = Y < (row_max * self.p_percentile)

Y = (Y * np.invert(is_smaller)) + (Y * self.multiplier * is_smaller)

This appears to be retaining anything larger than 0.95 * row_max. I tested a thresholding like:

row_percentile = np.percentile(Y, self.p_percentile * 100, axis=1)
row_percentile = np.expand_dims(row_percentile, axis=1)
is_smaller = Y < row_percentile

And it seems to alleviate some issues I was having with returning very low numbers of speakers and extreme sensitivity to the p_percentile parameter.

Let me know if I am confused.

Run spectral clustering on GPUs

Hi,

I am wondering if it is possible to run spectral clustering on GPUs to speed it up. Is there any suggestion on it?

Best,
Xin

Embedding aggregation by segment

The authors of the paper state that after extracting all embeddings they aggregate them by segment, which has a maximum size of 400 ms after VAD processing. Also, a single embedding represents 240 ms of the original signal, with windows overlapping by 120 ms. So two full embeddings would represent 360 ms of the original signal; what about the remaining 40 ms? In this issue one of the authors states that a segment has about 4 windows, but I couldn't understand how that is achieved.

Is forced alignment sufficient for detecting speech segments?

In issue #13 the author explains the construction of the VAD model with the use of forced alignment and GMM. Seeing as forced alignment, given audio and its corresponding text, outputs the time intervals of speech in the stream, is that not sufficient? Or is it not, due to the fact that it must be applied to each audio file prior to training and inference, whereas the GMM construction is only performed once?

UnboundLocalError: local variable 'best_p_percentile_index' referenced before assignment

FYI, while running your pyannote.audio PR, I got the following error that I have yet to narrow down.

  File "../site-packages/spectralcluster/spectral_clusterer.py", line 181, in predict
    eigenvectors, n_clusters, _ = self.autotune.tune(p_percentile_to_ratio)
  File "../site-packages/spectralcluster/autotune.py", line 96, in tune
    start_index = max(0, best_p_percentile_index - local_search_dist)
UnboundLocalError: local variable 'best_p_percentile_index' referenced before assignment

start_index = max(0, best_p_percentile_index - local_search_dist)

It probably is because the pipeline tries to cluster a small (like less than 2) number of embeddings but I will confirm this once I know more.

In the meantime, maybe it is obvious to you why this might happen?

Does SpectralCluster work with XVector?

Firstly, I want to say thank you for your great work.

As the question, I just see the experiment of Spectral Cluster with i-Vector and d-Vector at this. However, i've never seen any experiment with x-Vector.
Have you ever done it? If yes, is x-Vector + Spectral Cluster better than i-Vector + Spectral Cluster?

Thank you!

Input X

How do we provide the input X if our input is an audio file in .wav format? Can you please share an example related to this?

Thank You

Autotune proxy condition in spectral_cluster.py

Hi!
I have a question about the conditions for the autotune proxy, which make no sense to me. Is this the intended use? The same condition appears in both the if and elif statements in the computation of percentile-to-ratio. As I understand it, there are two options for the autotuner: one with sqrt and one without.

At the line 281:

        if self.autotune.proxy == AutoTuneProxy.PercentileSqrtOverNME:
          ratio = np.sqrt(1 - p_percentile) / max_delta_norm
        elif self.autotune.proxy == AutoTuneProxy.PercentileSqrtOverNME:
          ratio = (1 - p_percentile) / max_delta_norm
        else:
          raise ValueError("Unsupported value of AutoTuneProxy")
        return ratio, eigenvectors, n_clusters

Constraint Matrix Question

Hi,

From the code below what are the "spk_turn_entries" and what format are they in? Thank you

from spectralcluster import constraint

constraint_matrix = constraint.ConstraintMatrix(
    spk_turn_entries, threshold=1).compute_diagonals()

Cosine distance for clustering

Hi,
Thanks for the great code. It's mentioned in the file that you have used Euclidean distance as the similarity measure instead of cosine similarity, as was originally used in the paper. I want to use cosine similarity as the measure for clustering. Can you tell me how I can do that?
Also, I read here that cosine similarity and Euclidean distance are linearly related. If so, would changing the similarity measure make any difference to the output?
Thanks for the time.

Behavior of `autotune` when `n_clusters == 1`

I am currently playing with SpectralCluster inside pyannote.audio.

It looks like SpectralClusterer._compute_eigenvectors_ncluster is having trouble when there is just one speaker (n_clusters = 1).

Despite setting min_clusters to 1, it (always?) tends to return at least 2 clusters.

I understand that applying clustering on cases where n_clusters = 1 is not really useful but I think that is definitely a good use case for auto-tune.

Any suggestion on how to solve this issue?
I could add a postprocessing step for when the estimated n_clusters is 2 and simply compare both centroids with a simple threshold, but maybe there is something obvious I missed in SpectralCluster.

Is autotune useful when the number of clusters is known in advance?

There exists a scenario where the expected number of clusters is known in advance.

Under this assumption, my first intuition was that I should not use autotune and simply set min_clusters = max_clusters = expected_num_clusters.

However, after looking a bit closer at the code, I am wondering whether it might still be a good idea to use it when the RowWiseThreshold refinement is active.

Even for a fixed number of clusters, the value of p_percentile might have an influence on the resulting clusters, right?

However, it looks like (correct me if I am wrong; I might have overlooked something) there is no way to autotune the value of p_percentile under the constraint that the number of clusters is fixed.

Online clustering

Hi,

Thanks for your great work and sharing your code with us! I've been working in face recognition and person re-identification domain for a while and I think your clustering method is useful for these domain for sure.

I've also read your recent papers, especially "LINKS: A High-Dimensional Online Clustering Method". However, I got a bit confused when looking through this repo. I'm just wondering how we can benefit from your algorithm (this repo) to do online clustering? I didn't see such a process in the demo code. Or is LINKS a totally different algorithm?

Anyway, could you please give me some more explanation about your online clustering method mentioned in your paper? Thanks!

Cluster Centers?

Thanks for sharing your great code.
I would like to know if I can get the centers of the clusters after prediction.

Tracking speakers in time

Thanks for your contribution and for recent updates with MultiStageClusterer and streaming_predict method!

I have a question about tracking speakers over time. In my case, during online inference I pay the most attention to the newest added embedding. However, I want these newest predictions to match the ones that were made in the past (I want the samples to hold the same speaker ID over time). I've noticed that with this clustering approach, IDs can easily change over time, where the enforce_ordered_labels method obviously won't help.

Have you tried to introduce some method for tracking the speaker clusters? I think it could be solved as maximum bipartite matching: using the clusters' sample indices over time, compute the similarity between clusters from the previous step and the current step as a Jaccard index.

For example, after 5 steps, we would get a list like [0, 0, 0, 1, 1]. In the next iteration, we get the 6th embedding and the cluster IDs come back swapped: [1, 1, 1, 0, 0, 1]. This would be especially damaging if we took only the last result from every prediction.

With the proposed method, we could represent the reference clusters as {0: {0, 1, 2}, 1: {3, 4}} and the new clusters as {0: {3, 4}, 1: {0, 1, 2, 5}}. In this example enforcing ordered IDs would help, but in general it might not be that useful, especially when we make some adjustments to previous clusters (e.g. we would get [0, 0, 1, 2, 2] if sample number 3 was assigned to its own cluster after an update). A minimal sketch of this matching idea follows below.
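A minimal sketch of the proposed matching, assuming scipy is available (relabel_with_history below is hypothetical, not part of this library):

import numpy as np
from scipy.optimize import linear_sum_assignment

def relabel_with_history(prev_labels, new_labels):
  # Map new cluster IDs onto previous ones via maximum-Jaccard matching.
  prev_labels = np.asarray(prev_labels)
  new_labels = np.asarray(new_labels)
  n = len(prev_labels)  # in streaming mode, new_labels has one extra entry
  prev_ids = np.unique(prev_labels)
  new_ids = np.unique(new_labels)
  # Negative Jaccard similarity as cost, so the assignment maximizes overlap.
  cost = np.zeros((len(new_ids), len(prev_ids)))
  for i, a in enumerate(new_ids):
    a_set = set(np.where(new_labels[:n] == a)[0])
    for j, b in enumerate(prev_ids):
      b_set = set(np.where(prev_labels == b)[0])
      union = len(a_set | b_set)
      cost[i, j] = -len(a_set & b_set) / union if union else 0.0
  rows, cols = linear_sum_assignment(cost)
  mapping = {new_ids[r]: prev_ids[c] for r, c in zip(rows, cols)}
  # Unmatched new clusters get fresh IDs beyond the previous range.
  next_id = prev_ids.max() + 1 if len(prev_ids) else 0
  relabeled = []
  for label in new_labels:
    if label not in mapping:
      mapping[label] = next_id
      next_id += 1
    relabeled.append(mapping[label])
  return np.array(relabeled)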

Is there a way to monitor the progress of the predict() method?

The predict() method's code seems pretty straightforward. I'm curious though, has anyone else found a way to monitor the progress of the method when processing larger amounts of data? (I'm running SpectralCluster as a background queue/job and wondering how to differentiate a stalled process from a long-running process.)

Outputs just one cluster.

I'm using this on the outputs from the encoder provided in the VoiceFilter repository.
The outputs are 256-dimensional, and I'm using a segment length of 1.97 s.
The output of the SpectralClusterer is always cluster 0 for every sample.

Here's how I define the SpectralClusterer object:
SpectralClusterer(min_clusters=1, max_clusters=10, stop_eigenvalue=1e-4)

My input dimension is (36, 256)

Here's a t-SNE visualization of the encodings:
[t-SNE visualization image]

Here's the output from the SpectralClusterer:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

The encodings are in the range (-1, 1); am I doing something wrong?

Here's another example for a different audio clip:
[t-SNE visualization image]

Here's the output
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

Other clustering approaches in the spectral embedding space?

As defined in canonical spectral clustering papers, spectralcluster uses K-means to perform the final clustering step in the spectral embedding space.

Yet, K-means is known for not being very good at handling clusters with uneven sizes (which is often the case for speaker diarization, for instance).

What are your thoughts on this?
Would it make sense to allow custom clustering approaches in place of K-means?
Or do you think the spectral embedding space is already good enough that it would not make any difference?

Feel free to close this issue if this is not the place for discussing this...

Spectral clustering is too slow & expensive when sequence is long

Spectral clustering relies on eigen-decomposition, which is ~O(N^2.7). This is too expensive for long-form conversations.

In order to accelerate spectral clustering, we can make use of pre-clustering results and constrain the max size of the input to spectral clustering.

Another question about the Laplacian used

Hi!

You addressed this question in the Q&A, but could you please expand on how exactly replacing the diagonal by the row-wise max makes the similarity matrix have the same properties as standard Laplacian matrices? (Or maybe it holds approximately, and you consider the graph approximately regular after applying thresholding, so that row-wise sums are equal between vertices? Although it seems that would not hold if speaking time vastly differs between speakers.)

In the unnormalized Laplacian definition, we subtract the row-wise sums from the diagonal (given a non-negative affinity) and then negate the whole thing.

Maybe a related reference: https://arxiv.org/abs/1712.03769

Thank you!

Segment size and window step matching problem

According to the relevant paper, for speaker diarization the maximum segment size is 400 ms, the window size is 240 ms, and the window step is 120 ms. In this case we have two windows in one segment and an uncovered part of 40 ms (400 ms - (240 ms + 120 ms) = 40 ms). If we add the next window, we get a segment size of 480 ms. My question is: how do we include this uncovered 40 ms? Or should we take a maximum segment size of 360 ms?

predicted cluster number always be 2

Hello, I was using spectral clustering to solve the inter-block label permutation problem of the EEND speaker diarization method.
I initialize the clusterer with the default settings below, and found it works well for two-speaker conversations. However, when the real speaker number is greater than 2, the clusterer still predicts 2 speakers, even though I set max_clusters to 7. I also tried some different settings, but it failed to predict the speaker number most of the time, while AHC is too sensitive to the distance threshold and predicted more speakers than the real condition. I was wondering which parameter I should tune to make the clusterer more sensitive to the speaker number? Thanks a lot.

scp_clusterer = SpectralClusterer(
    min_clusters=2,
    max_clusters=7,
    autotune=None,
    laplacian_type=None,
    refinement_options=None,
    custom_dist="cosine")

Question about clustering: will every turn detection change previous predictions?

Hi @wq2012 Dr. Wang,
Really impressed by your series of work, and thankful for the updated online spectral clustering algorithm.

I have 2 questions:

  1. "turn to dia" paper describes that clustering happens every time get the turn with history all segments. I am wondering whether it would change previous step prediction?
    e.g. we have (segment_id,prediction) (1,A),(2,B),...,(N,M), then new turn segment N+1 comes, and make clustering & prediction, will it change previous segments's prediction, such as -> (1,B),(2,A),...,(N,C),(N+1,M).

To have both great performance and low latency at the same time, we use spectral clustering in an online fashion: every time when we have a new speaker embedding, we run spectral clustering on the entire sequence of all existing embeddings.

  2. How about the performance and robustness of this method in a real call center scenario or on-device? Also, I am confused about how the inbound call center dataset with 2-10 speakers was generated. I cannot imagine why calls initiated by customers would have more than 2-3 speakers, up to 10? Maybe the conference scenario?

The “Inbound” subset includes 250 conversations initiated by customers. This dataset has approximately 22 hours of speech in total. Each utterance has 2 to 10 speakers.

Configs from your ICASSP 2018 paper

I want to know when I use the following code

from spectralcluster import configs
labels = configs.icassp2018_clusterer.predict(X)

Will all the settings, such as refinement operations, affinity matrix, Laplacian matrix and so on, be configured just as you did in your ICASSP 2018 paper, or not?
Or do I need to configure the refinement operations and other settings in code after using the above snippet, if I want the configs from your ICASSP 2018 paper?

Normalization does not preserve Symmetry

Hi,
The last step of the Affinity Matrix Refinement does not preserve symmetry.
Yet in your paper you state:

Symmetrization restores matrix symmetry which is crucial to the spectral clustering algorithm.

This sounds counterintuitive.

Constrained MultiStageClusterer

Is there a chance one could use speaker turn constraints with MultiStageClusterer, or is the lack of support for this intended? I see that the streaming_predict method lacks a constraint_matrix argument. Adding it and calling self.main.predict with the passed constraint_matrix seems to be the solution for this issue.

Log-mel-filterbank feature extraction

In the paper the authors state that they extract log-mel-filterbank energies of dimension 40 from each frame of the raw signal for embedding extraction. Is the implementation in the python_speech_features library adequate to be used directly?

TypeError: constraint matrix must be a numpy array

configs.turntodiarize_clusterer gives a TypeError: constraint matrix must be a numpy array error for data on which icassp2018_clusterer works fine.

Error log:

File /..../spectralcluster/spectral_clusterer.py:250, in SpectralClusterer.predict(self, embeddings, constraint_matrix)
    246 # Apply constraint.
    247 if (self.constraint_options and
    248     self.constraint_options.apply_before_refinement):
    249   # Perform the constraint operation before refinement
--> 250   affinity = self.constraint_options.constraint_operator.adjust_affinity(
    251       affinity, constraint_matrix)
    253 if self.autotune:
    254   # Use Auto-tuning method to find a good p_percentile.
    255   if (RefinementName.RowWiseThreshold
    256       not in self.refinement_options.refinement_sequence):
...
---> 77   raise TypeError("constraint matrix must be a numpy array")
     78 if len(affinity.shape) != 2:
     79   raise ValueError("affinity must be 2-dimensional")

TypeError: constraint matrix must be a numpy array

Is there some different format or conditions on the input for turntodiarize_clusterer?

How to deal with the speaker_turn_scores of the constraint_matrix ?

Hello, could you give some suggestions on how to get the speaker turn scores for this constraint matrix? Or could a template be provided?

from spectralcluster import constraint

constraint_matrix = constraint.ConstraintMatrix(
    speaker_turn_scores, threshold=1).compute_diagonals()
Thank you.

Question about a formula mentioned in the paper

[Image: the formula from the paper for determining the number of clusters]
Can you please explain this formula in your paper (SPEAKER DIARIZATION WITH LSTM).

To simplify things further, assume I have a matrix A (m x n), where m is the number of segments and each row is an n-dimensional embedding of a segment.

Now, I want to know how I can determine k (the number of clusters, or the number of speakers if you will).

Any ideas?

Thanks in advance.
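For intuition, a hedged sketch of the eigen-gap idea: given the (m, m) affinity matrix built from your m segment embeddings, pick the k where the ratio between consecutive sorted eigenvalues peaks. This mirrors the spirit of utils.compute_number_of_clusters in this repo, not its exact implementation, and it omits the refinement steps from the paper.

import numpy as np

def estimate_num_clusters(affinity, max_clusters=10):
  # Eigenvalues of the symmetric affinity matrix, in descending order.
  # Assumes the leading eigenvalues are positive after refinement.
  eigenvalues = np.sort(np.linalg.eigvalsh(affinity))[::-1]
  # Pick k maximizing the eigen-gap ratio lambda_k / lambda_{k+1}.
  ratios = eigenvalues[:max_clusters] / np.maximum(
      eigenvalues[1:max_clusters + 1], 1e-10)
  return int(np.argmax(ratios)) + 1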

LinAlgError: Array must not contain infs or NaNs

Hello!

I am trying to use the spectral clustering algorithm that this repository implements.

I have performed the following steps:

  1. Load audio file
  2. Obtain windows of 25 ms, with the time difference between the starts of successive windows being 10 ms.
  3. I then obtained the log-mel-filterbank-energies of dimension 40.
  4. At this stage, I have a (n_samples, 40) dimensional numpy array.
  5. I then run this through a 3-layer LSTM (as described in the paper) and finally have an array of dimensions (n_samples, 256).
  6. I then L2-normalized each sample.

However, once I try to run the spectral clusterer on the final L2-normalized numpy array (dimensions: (n_samples, 256)), I get the following error:

LinAlgError                               Traceback (most recent call last)
<ipython-input> in <module>()
     10     gaussian_blur_sigma=1)
     11
---> 12 labels = clusterer.predict(X_l2)

/usr/local/lib/python3.7/dist-packages/spectralcluster/spectral_clusterer.py in predict(self, X)
    117   # Perform eigen decomposition.
    118   (eigenvalues, eigenvectors) = utils.compute_sorted_eigenvectors(
--> 119       affinity)
    120   # Get number of clusters.
    121   k = utils.compute_number_of_clusters(

/usr/local/lib/python3.7/dist-packages/spectralcluster/utils.py in compute_sorted_eigenvectors(A)
     41   # Eigen decomposition.
---> 42   eigenvalues, eigenvectors = np.linalg.eig(A)
     43   eigenvalues = eigenvalues.real
     44   eigenvectors = eigenvectors.real

<__array_function__ internals> in eig(*args, **kwargs)

/usr/local/lib/python3.7/dist-packages/numpy/linalg/linalg.py in _assert_finite(*arrays)

LinAlgError: Array must not contain infs or NaNs

Please note that my input array X_l2 of dimensions (n_samples, 256) DOES NOT contain any nan or inf values. My input does, however, contain samples where all values are 0. For example, a sample might have all 256 entries as 0. Is that the cause of my problem?

Any help would be greatly appreciated.

Thank you! :)
