ssarfraz / finch-clustering

Source Code for FINCH Clustering Algorithm

License: Other

Python 10.30% MATLAB 16.63% Shell 0.12% Jupyter Notebook 72.94%

finch-clustering's Introduction

First Integer Neighbor Clustering Hierarchy (FINCH) Algorithm

FINCH is a parameter-free, fast, and scalable clustering algorithm. It stands out for its speed and clustering quality. The algorithm is described in our paper Efficient Parameter-free Clustering Using First Neighbor Relations, published in CVPR 2019.

Installation

The project is available on PyPI. To install, run:

pip install finch-clust

Optional: install PyNNDescent to get first neighbours for large data.

To install finch with pynndescent run:

pip install "finch-clust[ann]"

Usage:

Typically you would run:

from finch import FINCH
c, num_clust, req_c = FINCH(data)

You can set options, e.g. the required number of clusters or the distance metric:

c, num_clust, req_c = FINCH(data, initial_rank=None, req_clust=None, distance='cosine', verbose=True)

For more details on the meaning of the input arguments, check the README in the finch directory.

Matlab usage

The corresponding MATLAB implementation is provided in the matlab directory.

Demos

The following demo notebooks are available to see the usage in clustering a dataset.

Relevant tools built on FINCH

h-nne: See also our h-nne method, which uses FINCH for fast dimensionality reduction and visualization applications.
TW-FINCH: Also see our TW-FINCH variant which is useful for video segmentation.

Citation

@inproceedings{finch,
    author    = {M. Saquib Sarfraz and Vivek Sharma and Rainer Stiefelhagen}, 
    title     = {Efficient Parameter-free Clustering Using First Neighbor Relations}, 
    booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
    pages = {8934--8943},
    year  = {2019}
}

The code and the FINCH algorithm are not meant for commercial use. Please contact the author for licensing information.

finch-clustering's Issues

how to handle dataset that is too large to be loaded in the memory?

Suppose I have a dataset of 30 million (3000w) items, where each item is a 2048-d vector.
Thanks

Different Clustering results when using python and matlab implementation

I realized that the Python implementation yields different results than the MATLAB version.

I found this out by first comparing a Python evaluation of the TW-FINCH clustering results against the provided MATLAB evaluation, using one of the provided datasets and the features from the TW-FINCH paper.
After looking a little more into the issue, I found that already in the first steps of the clustering process both versions assign the same features/frames to different clusters, and the number of clusters is drastically different too, which explains the performance differences.

Have you encountered this issue and if yes are there solutions?

Unable to replicate numbers

Hey,

I was trying to replicate the numbers presented in the paper with the features provided, and my numbers seem to be a bit on the lower side. Without changing anything, I ran the Python version of the code, and what I noticed was that on Breakfast I am getting an MoF of 60.1, whereas the reported number is 62.7. Similarly, for MPII I am getting 41.51, but the reported number is 42.0 (though this difference is very minor). Is there a reason for this discrepancy?

The code for TW-FINCH is not available


Any random factors in this algorithm?

Got different results for different trials in my experiments ...

TWFinch code missing FS "Eval" option; unable to reproduce accuracy

Hello, thank you for posting the code and data for the TWFinch paper.

The code seems to be missing an option to run the FS "Eval" dataset.
I've made a logical change to your code (below) to load this dataset, but am unable to reproduce the accuracy in the paper, which was reported as MoF= 71.1%.

The following change to TW-FINCH/util_fns/read_video.m produces an accuracy of MoF:= 66.7%:

 elseif strcmp(Dataset, 'FS')
    map=readtable(fullfile(mapping_path, 'mappingeval.txt'));
    map2=table([1:numel(map.Var2)]', 'RowNames', map.Var2);
    gt_label_str=table2cell(readtable(fullfile(gt_path, vid_name), 'Delimiter', '#', 'ReadVariableNames',false));
    gt_label_frame=table2array(map2(gt_label_str,1));

I would appreciate any guidance on what might be wrong. Thank you.

how to estimate number of clusters

I was wondering how to estimate the number of clusters using FINCH. After reading your paper, your method always seems to get the correct number of clusters, e.g., in Table 2.

array is empty with s1 dataset

Why does the algorithm trigger an error when working on the s1 dataset from http://cs.joensuu.fi/sipu/datasets/ ?

~/finchcls.py in update_adj(self, adj, d)
94 v = np.argsort(d[idx])
95 v = v[:2]
---> 96 x = [idx[0][v[0]], idx[0][v[1]]]
97 y = [idx[1][v[0]], idx[1][v[1]]]
98 a = sp.lil_matrix(adj.get_shape())

IndexError: index 0 is out of bounds for axis 0 with size 0

The same error occurs with the a1 dataset and the "unbalance" dataset.

On any other dataset it works fine.

Replace `sklearn` with `scikit-learn` in `setup.py`

The former is deprecated and pip throws a hissy fit

About Output

Could you please provide the tool for visualizing the Figure 2?

Dear @ssarfraz ,
I am sorry for disturbing you, but could you please describe in more detail the tool or the source code you used to visualize Figure 2 in your paper? Thank you so much!

input precomputed distance matrix instead of data

Hi, thanks for your great work. How can I input a precomputed distance matrix instead of data? Could you please release a version that supports this?
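
One possible route, sketched here under assumptions: the usage line in the README lists an `initial_rank` argument, and if it takes one first-neighbour index per sample, those indices can be derived from a precomputed distance matrix. The helper name `first_neighbours` and the toy matrix are hypothetical, and whether FINCH still needs `data` for later partitions is not covered here, so the README in the finch directory should be checked before relying on this.

```python
def first_neighbours(dist):
    """Return, for each row of a square distance matrix, the index of the
    closest other sample (the diagonal entry is skipped)."""
    n = len(dist)
    ranks = []
    for i in range(n):
        candidates = [j for j in range(n) if j != i]
        ranks.append(min(candidates, key=lambda j: dist[i][j]))
    return ranks


# Hypothetical 4-sample distance matrix (symmetric, zero diagonal).
dist = [
    [0.0, 0.2, 0.9, 0.8],
    [0.2, 0.0, 0.7, 0.6],
    [0.9, 0.7, 0.0, 0.1],
    [0.8, 0.6, 0.1, 0.0],
]
initial_rank = first_neighbours(dist)
print(initial_rank)  # [1, 0, 3, 2]
```

For large matrices the same idea would normally be written with an array library rather than Python lists; the loop form is used only to keep the sketch dependency-free.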

There is a bug when using pynndescent.NNDescent

Nice work!

When my data volume is very large, I will use the "NNDescent" in the "pynndescent" library according to the "Python" code, and then an error will occur.

My pynndescent version is '0.5.5'. How can I fix it?

Looking forward to your reply, thanks.

about output

Thank you for the open code. I want to know what the output 'c' means. It is an N*2 array; which column is the cluster label? I get bad results on my data and want to find the reason.
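
As far as the package README describes the outputs, each column of `c` holds the labels of one partition in the hierarchy, and `num_clust` gives the cluster count of each column. The sketch below illustrates that reading with hypothetical values (plain lists stand in for the returned array; with NumPy the column would be `c[:, p]`).

```python
# Hypothetical FINCH-style output for 6 samples and 2 partitions.
c = [
    [0, 0],
    [0, 0],
    [1, 0],
    [1, 0],
    [2, 1],
    [2, 1],
]
num_clust = [3, 2]  # number of clusters in column 0 and column 1


def labels_for_partition(c, p):
    """Extract the label vector of partition p (one column of c)."""
    return [row[p] for row in c]


for p, k in enumerate(num_clust):
    labels = labels_for_partition(c, p)
    # The number of distinct labels in a column matches its num_clust entry.
    assert len(set(labels)) == k
    print(f"partition {p}: {k} clusters, labels = {labels}")
```

Under this reading, an N*2 result means two partitions were found, and either column is a valid labelling at a different granularity.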

errors when run the run_on_dataset.m

Hello, thank you for posting the code for TWFinch, and great work! I have tried to reproduce the results, but I ran into some problems when running run_on_dataset.m.

I downloaded the data and put it under E:\FINCH-Clustering-master\TW-FINCH,
then I ran the script:
tw_finch = true
Result = run_on_dataset('50Salads', tw_finch, 'E:\FINCH-Clustering-master\TW-FINCH\Action_Segmentation_Datasets');
The error is as follows:

The performance of FINCH on Aggregation

The code is working fine. But the performance I have got is always 0.96536 in terms of NMI (implemented in sklearn.metrics).
The code I run is as follows:

import numpy as np
import scipy.io as sio
from sklearn.metrics import normalized_mutual_info_score as nmi
from finch import FINCH  # absolute import so the script runs standalone

data = sio.loadmat("Agg.mat")
X = data["X"]
y_true = data["Y"].ravel()  # loadmat returns 2-D arrays; the metric expects 1-D labels
c_true = len(np.unique(y_true))

Y, num_clu, req_y = FINCH(X, req_clust=c_true, distance='euclidean') # or cosine
acc = nmi(y_true, req_y, average_method="max")
print(acc)

Looking forward to your reply

Code for TW-FINCH

It would be great if you could publish your code for TW-FINCH, since it is a bit hard to replicate the results from the paper.

Is there any randomness in the clustering results?

Hi, I fixed the random seed and input data and then applied FINCH for clustering. But I found that the results obtained by each clustering are different, what should I do to ensure that I can get a fixed result every time?

P.S. I have a large amount of data (hundreds of thousands) and use the NNDescent method in 'pynndescent', is it possible that this is the cause？ What can I do?

Looking forward to your reply, thank you very much

Segmentation fault for large dataset of 5M datapoints of 1024 dimensions

The segmentation fault occurs at the call to the NNDescent function, where RP-trees are being built and the descent steps are about to start. I am using the H-NNE (koulakis/h-nne#17) algorithm, which uses FINCH under the hood.

great work, waiting for the python code

How to get the midpoint hit criterion for the MPII?

All I know is how to get the precision and recall. But I don't know how to get the midpoint hit, and they are different.

ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 4 dimension(s) and the array at index 1 has 2 dimension(s).

Hi, I tried to convert my video into a numpy array using the method shown here (https://stackoverflow.com/questions/67644826/how-to-convert-a-video-to-a-numpy-array). Now, when I pass it as input to the function as FINCH(data, req_clust=K, tw_finch=True), I get:
ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 4 dimension(s) and the array at index 1 has 2 dimension(s). The shape of my data right now is (928, 108, 108, 3).

How do I fix this? Is there any other method to get feature vector of a video ? I really appreciate the response !
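
The message suggests that a 4-D array is being combined with a 2-D one, which would be consistent with FINCH expecting one feature vector per row. A minimal sketch, assuming that is the cause: flatten each frame so the array becomes 2-D with shape (n_frames, n_features). The zero-filled array is only a stand-in for the decoded video.

```python
import numpy as np

# Stand-in for the decoded video: 928 frames of 108 x 108 RGB pixels.
video = np.zeros((928, 108, 108, 3), dtype=np.uint8)

# Flatten every frame into one row: (n_frames, height * width * channels).
data = video.reshape(video.shape[0], -1)
print(data.shape)  # (928, 34992)
```

This only fixes the shape. Raw pixels are a crude descriptor, so frame-wise features from a dedicated extractor would likely cluster better than flattened frames.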

Elements of the adjacency matrix may be greater than 1

Thanks for this amazing and practical algorithm.
When I browsed the Python version of the code, I found that elements of the adjacency matrix may be greater than 1, as shown below.

csr_matrix in python.finch.py line45
0, 0, 0, 0, 1
0, 0, 0, 0, 1
0, 1, 0, 0, 0
0, 0, 0, 0, 1
0, 1, 0, 0, 0
adjacency matrix at line 50
0, 1, 0, 1, 1
1, 0, 1, 1, 2
0, 1, 0, 0, 1
1, 1, 0, 0, 1
1, 2, 1, 1, 0

Maybe this will affect the value of min_sim in the hierarchy clustering at line 155.
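
The two matrices quoted above are consistent with building the adjacency as (A + I)(A + I)^T and then zeroing the diagonal, where A has a single 1 per row at the first neighbour. That construction is an assumption inferred from the numbers, not read from the source file; the function name is hypothetical. The sketch reproduces the second matrix and shows that the entry 2 appears exactly where two samples are each other's first neighbour (samples 1 and 4).

```python
def adjacency_from_first_neighbours(kappa):
    """Compute (A + I)(A + I)^T with a zeroed diagonal,
    where A[i][kappa[i]] = 1 and all other entries are 0."""
    n = len(kappa)
    # b = A + I as a dense list of lists.
    b = [[1 if j in (i, kappa[i]) else 0 for j in range(n)] for i in range(n)]
    adj = [[sum(b[i][k] * b[j][k] for k in range(n)) for j in range(n)]
           for i in range(n)]
    for i in range(n):
        adj[i][i] = 0
    return adj


# First neighbours read off the csr_matrix rows quoted in the issue.
kappa = [4, 4, 1, 4, 1]
adj = adjacency_from_first_neighbours(kappa)
for row in adj:
    print(row)

# Optional: collapse counts to a 0/1 link matrix.
binary = [[1 if v else 0 for v in row] for row in adj]
```

If only connectivity matters, the binarised form is enough; whether the count of 2 is intentionally used downstream is the question the issue raises.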

Finch Algo 2

Thank you for the great method and code.
As far as I understand, Algorithm 2 is needed for evaluation, but I don't think there is corresponding Python code.

IndexError when req_clust > num_clust

I call finch using
cluster_partition, n_part_clust, part_labels = FINCH(data, req_clust=2)

and receive this error

line 185, in FINCH
    req_c = req_numclust(c[:, ind[-1]], data, req_clust, distance, use_ann_above_samples, verbose)
IndexError: list index out of range

My best guess is that there is only one cluster, so the condition v >= req_clust is never fulfilled in ind = [i for i, v in enumerate(num_clust) if v >= req_clust]. The index list is therefore empty, and ind[-1] is out of range.

What is the implication and how to best deal with this?
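
The failure mode described above can be reproduced without FINCH. The sketch below reuses the list comprehension quoted in the issue and wraps it in a hypothetical helper, `pick_partition`, that returns `None` instead of raising when no partition has at least `req_clust` clusters; the `num_clust` values are made up.

```python
def pick_partition(num_clust, req_clust):
    """Return the index of the last partition with at least req_clust
    clusters, or None when no partition is that fine-grained."""
    ind = [i for i, v in enumerate(num_clust) if v >= req_clust]
    return ind[-1] if ind else None


print(pick_partition([40, 9, 3], 5))  # 1
print(pick_partition([1], 2))         # None
```

A caller could treat `None` as "the requested number of clusters exceeds what the hierarchy offers" and fall back to the finest available partition.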

TWFinch code missing YTI without 75% background option; unable to reproduce accuracy

For the YTI dataset, I have read CTE, but I can't reproduce the code to remove background frames. The replicated F1 is much lower, while the MoF is close.
I would appreciate any guidance on how to do this for the YTI dataset. Thank you.

TW-FINCH feature extraction method

Hello, thanks for your work! I'm sorry, but this problem has been bothering me for a long time. For TW-FINCH, can the frame-wise features only be extracted with iDT (as your paper mentions), or can they also be extracted with CNN methods such as I3D? Will the choice of method affect the clustering results?

`sklearn` is still a dependency in `setup.py`

Commit b508b1a intended to remove sklearn dependency, but actually removed scipy. You can check the commit's diff here.

scipy is still installed since it's a dependency of scikit-learn, but we also get the deprecated sklearn package.

This means that the problem from #29 still affects finch-clust==0.1.8. We can check it by doing the following (based on How to test whether a package will be affected by the sklearn deprecation):

SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=False \
    pip install finch-clust==0.1.8

Is the F1-score calculated differently in MATLAB and Python?

When I was doing the experiment, the F1-scores in Python were always lower than in MATLAB, but the other metrics were always similar. For example, for Breakfast:

Matching problem of TW-FINCH

It is amazing that this unsupervised clustering method outperforms other paradigms on five challenging action segmentation datasets. However, some details puzzle me a lot, in particular how the obtained segments are mapped to the different action labels (including background) using the Hungarian algorithm. I would really appreciate it if this could be explained.
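
As a generic illustration of one-to-one matching (not necessarily the protocol used in the paper, whose per-video or global matching and background handling are not stated here), the sketch below maps cluster ids to labels so that frame agreement is maximised and then reports the resulting mean over frames. A brute-force search over permutations stands in for the Hungarian algorithm and is only feasible for a handful of clusters; in practice `scipy.optimize.linear_sum_assignment` is a common choice. The function name and the toy sequences are hypothetical, and the sketch assumes there are at least as many labels as clusters.

```python
from itertools import permutations


def best_one_to_one_mapping(pred, gt):
    """Find the cluster -> label assignment that maximises frame agreement.
    Returns the mapping and the fraction of frames that agree under it."""
    clusters = sorted(set(pred))
    labels = sorted(set(gt))
    # overlap[(c, l)] = number of frames in cluster c that carry label l.
    overlap = {(c, l): 0 for c in clusters for l in labels}
    for p, g in zip(pred, gt):
        overlap[(p, g)] += 1
    best_score, best_map = -1, {}
    for perm in permutations(labels, len(clusters)):
        score = sum(overlap[(c, l)] for c, l in zip(clusters, perm))
        if score > best_score:
            best_score, best_map = score, dict(zip(clusters, perm))
    return best_map, best_score / len(gt)


# Hypothetical frame-wise cluster ids and ground-truth labels for one video.
pred = [0, 0, 0, 1, 1, 2, 2, 2]
gt = ["bg", "bg", "pour", "pour", "pour", "stir", "stir", "bg"]

mapping, mof = best_one_to_one_mapping(pred, gt)
print(mapping)  # {0: 'bg', 1: 'pour', 2: 'stir'}
print(mof)      # 0.75
```

Background is treated here as just another label; excluding or down-weighting background frames would change both the mapping and the score.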

features for Hollywood and MPII Cooking 2

Hi~ Thanks for releasing your code and great work! I would appreciate if you can help me with the features of these two datasets.

Error when running TW_FINCH and specifying the number of clusters.

Hello,
Thank you for publishing your excellent work.

I was testing the TW_FINCH for clustering and it has been working well, but when I tried to specify the exact number of clusters I wanted, I got the following error:

    [186]  ind = [i for i, v in enumerate(num_clust) if v >= req_clust]
--> [187]  req_c = req_numclust(c[:, ind[-1]], data, req_clust, distance, use_tw_finch=tw_finch)
    [188]else:
    [189]  req_c = c[:, num_clust.index(req_clust)]

IndexError: list index out of range

It seems to be in the c[:,ind[-1]] call.

What could be the reason behind this error?

Thank you.

Why does an edge between mutual first neighbours have a greater weight when compared with min_sim?

Thanks for your work! I love it very much!
I want to know why an edge whose two vertices are each other's most similar node has a greater weight when compared with min_sim.
In the source code, the weight of such an edge is 2, while the weight of the others is 1 when compared with min_sim.