subhadarship / kmeans_pytorch

kmeans using PyTorch

Home Page: https://subhadarship.github.io/kmeans_pytorch

License: MIT License

Jupyter Notebook 91.65% Python 8.27% Makefile 0.08%

kmeans-clustering pytorch docs jekyll jekylbook github-pages gpu

kmeans_pytorch's Introduction

K Means using PyTorch

PyTorch implementation of k-means that can run on the GPU

Getting Started


import torch
import numpy as np
from kmeans_pytorch import kmeans

# data
data_size, dims, num_clusters = 1000, 2, 3
x = np.random.randn(data_size, dims) / 6
x = torch.from_numpy(x)

# kmeans
cluster_ids_x, cluster_centers = kmeans(
    X=x, num_clusters=num_clusters, distance='euclidean', device=torch.device('cuda:0')
)

See example.ipynb for a more elaborate example.

Requirements

PyTorch version >= 1.0.0
Python version >= 3.6

Installation

install with pip:

pip install kmeans-pytorch

Installing from source

To install from source and develop locally:

git clone https://github.com/subhadarship/kmeans_pytorch
cd kmeans_pytorch
pip install --editable .

CPU vs GPU

See cpu_vs_gpu.ipynb for a comparison between CPU and GPU.

Notes

useful when clustering a large number of samples
utilizes the GPU for faster matrix computations
supports euclidean and cosine distances (for now)

Credits

This implementation closely follows the style of this
Documentation is done using the awesome theme jekyllbook

License

MIT


kmeans_pytorch's Issues

seed

The example.ipynb notebook produces a different center_embedding on every run. How can I make the result reproducible?
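A minimal sketch of the usual remedy, assuming the run-to-run variation comes from random selection of the initial centers. The helpers `seed_everything` and `pick_initial_centers` are hypothetical stand-ins written for this illustration, not functions of kmeans_pytorch. The idea is to seed the NumPy and PyTorch random number generators immediately before each call.

```python
import numpy as np
import torch


def seed_everything(seed: int) -> None:
    """Seed the NumPy and PyTorch global RNGs (hypothetical helper)."""
    np.random.seed(seed)
    torch.manual_seed(seed)


def pick_initial_centers(x: torch.Tensor, num_clusters: int) -> torch.Tensor:
    """Stand-in for a random k-means initialisation: sample rows without replacement."""
    indices = np.random.choice(len(x), num_clusters, replace=False)
    return x[torch.from_numpy(indices)]


x = torch.arange(20, dtype=torch.float32).reshape(10, 2)

seed_everything(0)
first = pick_initial_centers(x, 3)
seed_everything(0)
second = pick_initial_centers(x, 3)

# identical seeds give identical initial centers
print(torch.equal(first, second))
```

If the results still differ after seeding both generators, the variation has another source (for example non-deterministic GPU reductions), which this sketch does not address.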

Support for "kl_divergence"

I have implemented support for Kullback-Leibler divergence as follows. Shall I make a pull request of it?

def pairwise_kl_divergence(data1, data2, device=torch.device('cpu')):
    # transfer to device
    data1, data2 = data1.to(device), data2.to(device)

    # N*1*M
    A = data1.unsqueeze(dim=1)

    # 1*N*M
    B = data2.unsqueeze(dim=0)

    # normalize the points 
    A_normalized = torch.nn.functional.log_softmax(A, dim=-1)
    B_normalized = torch.nn.functional.log_softmax(B, dim=-1)

    kl_div = torch.nn.functional.kl_div(A_normalized, B_normalized, reduction='none', log_target=True)

    # return N*N matrix for pairwise distance
    kl_div = kl_div.mean(-1)
    return kl_div

Unit test with sklearn?

Hi, thanks for the implementation!

Are you considering writing a unit test against sklearn's KMeans?

Can't choose a cluster if two points are too close to each other, that's where the nan come from

Can you elaborate more on what this means? Any idea how I can fix it?

#ToDo: Can't choose a cluster if two points are too close to each other, that's where the nan come from

https://github.com/subhadarship/kmeans_pytorch/blob/master/kmeans_pytorch/__init__.py#L5

Error trying to cluster from numpy

Hi, I'm not really using pytorch, but I want to use balanced kmeans. My code is as follows:

from torch import from_numpy
from balanced_kmeans import kmeans_equal
...
  # load X, a 23000x59 ndarray
  n_cluster = 50
  X_tensor = from_numpy(X)
  choices, centers = kmeans_equal(X_tensor,
                                  num_clusters=n_cluster,
                                  cluster_size=X.shape[0] // n_cluster)

I get the following error:
RuntimeError: expand(torch.LongTensor{[59]}, size=[]): the number of sizes provided (0) must be greater or equal to the number of dimensions in the tensor (1)

Am I doing something wrong creating my tensor from numpy? I apologize because I am asking more of like a general pytorch question and not really specific to kmeans_pytorch (and tbh I'm a total pytorch newb!) Is there an example anywhere of using kmeans_equal on numpy data? I bet other people would find that useful. Thanks in advance for any tips you can provide!

Overload of nonzero is deprecated

Hi, I was executing the example and got the following warning:

[running kmeans]: 0it [00:00, ?it/s]

running k-means on cpu..

/pytorch/torch/csrc/utils/python_arg_parser.cpp:756: UserWarning: This overload of nonzero is deprecated:
	nonzero(Tensor input, *, Tensor out)
Consider using one of the following signatures instead:
	nonzero(Tensor input, *, bool as_tuple)
[running kmeans]: 7it [00:02,  3.24it/s, center_shift=0.000091, iteration=7, tol=0.000100]

Here is the code I ran to get this message.

import torch
import numpy as np
import matplotlib.pyplot as plt
from kmeans_pytorch import kmeans, kmeans_predict

# set random seed
np.random.seed(123)

# data
data_size, dims, num_clusters = 500000, 100, 3
x = np.random.randn(data_size, dims) / 6
x = torch.from_numpy(x)

# set device
if torch.cuda.is_available():
    device = torch.device('cuda:0')
else:
    device = torch.device('cpu')

# k-means
cluster_ids_x, cluster_centers = kmeans(
    X=x, num_clusters=num_clusters, distance='euclidean', device=device
)

Thanks in advance.
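The warning text itself names the replacement: pass `as_tuple` explicitly when calling `nonzero`. Where the library makes this call is not shown here, so the sketch below only illustrates the two explicit forms on a small tensor.

```python
import torch

cluster_ids = torch.tensor([0, 2, 1, 2, 0])
mask = cluster_ids == 2

# as_tuple=False returns a 2-D tensor with one row per matching index
idx_matrix = torch.nonzero(mask, as_tuple=False)

# as_tuple=True returns one 1-D index tensor per input dimension
(idx_vector,) = torch.nonzero(mask, as_tuple=True)

print(idx_matrix.shape)
print(idx_vector.tolist())
```

Either explicit form avoids the deprecated overload; `as_tuple=False` keeps the shape that the old call returned.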

GPU vs CPU

I did not see any significant difference between GPU and CPU speed. I noticed that the test varies only the data size, while the number of clusters stays low (2, 3, ...).
I hope I am not wrong, but I think what really matters for speed here is the number of clusters. Re-running the experiment with a higher number of clusters would therefore show whether there is a significant speed difference. I would go up to 64 clusters.

Got an Unexpected keyword argument "iter_limit"

I looked into your source code and want to limit the number of iterations used to estimate the centroids. I tried this on my virtual GPU machine and got the error below:

cluster_ids_x, cluster_centers = kmeans(
    X=x, num_clusters=num_clusters, distance='cosine', iter_limit=10, device=torch.device('cuda:0')
)

  classes, centroids = self.TrainData()
  File "/home/ubuntu/new_sonalysis_deployment/Sonalysis-Front-End-listner_profile/PlayerTracker_upgrade.py", line 122, in TrainData
    X=x, num_clusters=num_clusters, distance='cosine',iter_limit = 10, device=torch.device('cuda:0')
TypeError: kmeans() got an unexpected keyword argument 'iter_limit'
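The traceback suggests that the installed release does not accept `iter_limit`, although that is an inference from the error message alone. Independent of the library, the stopping rule being asked for can be sketched as follows: stop when the centers barely move, or after a fixed number of iterations, whichever comes first. `kmeans_capped` and its parameters are hypothetical names for this illustration.

```python
import torch


def kmeans_capped(x, num_clusters, tol=1e-4, iter_limit=100, seed=0):
    """Plain Lloyd iterations that stop on convergence or after iter_limit steps.

    iter_limit must be at least 1.
    """
    gen = torch.Generator().manual_seed(seed)
    centers = x[torch.randperm(len(x), generator=gen)[:num_clusters]].clone()

    for iteration in range(1, iter_limit + 1):
        labels = torch.cdist(x, centers).argmin(dim=1)
        new_centers = centers.clone()
        for k in range(num_clusters):
            members = x[labels == k]
            if len(members) > 0:          # leave empty clusters where they are
                new_centers[k] = members.mean(dim=0)
        center_shift = (new_centers - centers).norm(dim=1).sum()
        centers = new_centers
        if center_shift < tol:            # converged before hitting the cap
            break

    # final assignment against the final centers
    labels = torch.cdist(x, centers).argmin(dim=1)
    return labels, centers, iteration


x = torch.randn(500, 2)
labels, centers, iterations = kmeans_capped(x, num_clusters=3, iter_limit=10)
print(iterations <= 10)
```

With a hard cap the loop always terminates, even when the tolerance is never reached.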

Clustering of batch data

Does this method support clustering batched data, i.e. input of shape (batch_size, n, dim), where the n sample points of dimension dim in each batch element are clustered independently?
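Whether the library handles this directly is not stated here. As a rough sketch of what batched clustering could look like in plain PyTorch, the function below runs an independent k-means on every batch element using batched `torch.cdist` and a one-hot assignment matrix. `batched_kmeans` is a hypothetical name and uses a fixed number of iterations instead of a convergence test.

```python
import torch


def batched_kmeans(x: torch.Tensor, num_clusters: int, num_iters: int = 20):
    """Run an independent k-means on every batch element.

    x: (batch, n, dim). Returns labels (batch, n) and centers (batch, k, dim).
    """
    batch, n, dim = x.shape
    # initialise each batch element with its own random sample of points
    init_idx = torch.stack(
        [torch.randperm(n, device=x.device)[:num_clusters] for _ in range(batch)]
    )
    centers = torch.gather(x, 1, init_idx.unsqueeze(-1).expand(-1, -1, dim))

    for _ in range(num_iters):
        dist = torch.cdist(x, centers)                  # (batch, n, k)
        labels = dist.argmin(dim=-1)                    # (batch, n)
        onehot = torch.nn.functional.one_hot(labels, num_clusters).to(x.dtype)
        counts = onehot.sum(dim=1)                      # (batch, k)
        sums = onehot.transpose(1, 2) @ x               # (batch, k, dim)
        new_centers = sums / counts.clamp(min=1).unsqueeze(-1)
        # keep the old center when a cluster received no points
        centers = torch.where(counts.unsqueeze(-1) > 0, new_centers, centers)

    labels = torch.cdist(x, centers).argmin(dim=-1)
    return labels, centers


x = torch.randn(4, 100, 8)
labels, centers = batched_kmeans(x, num_clusters=3)
print(labels.shape, centers.shape)
```

All batch elements advance in lockstep, so the work is a handful of batched tensor operations per iteration with no Python loop over the batch.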

Quick and dirty fix for "Can't choose a cluster if two points are too close to each other, that's where the nan come"

Hey, thanks for the PyTorch k-means implementation. I also encountered this problem and have a solution for it:
Add the following lines at line 84 of kmeans_pytorch/__init__.py

            if selected.shape[0] == 0:
                selected = X[torch.randint(len(X), (1,))]

The problem is that if two centroids are too close together, one centroid gets all the nearby points and leaves the other with zero assigned points. When that happens, the second centroid's set of selected points is empty. The code above detects this and replaces the centroid with a randomly chosen data point. I guess there are more elegant ways to choose the replacement point, but this works and converges, so it is a quick and dirty fix.
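To make the failure mode concrete, here is a self-contained sketch of a center update step that applies the same fallback. `update_centers` is a hypothetical helper written for this illustration, not the library's function. The mean of an empty selection is nan, so an empty cluster is re-seeded from a random data point.

```python
import torch


def update_centers(x, labels, centers):
    """Recompute each center as the mean of its points, re-seeding empty clusters."""
    new_centers = centers.clone()
    for k in range(centers.shape[0]):
        selected = x[labels == k]
        if selected.shape[0] == 0:
            # empty cluster: the mean would be nan, so use a random data point instead
            selected = x[torch.randint(len(x), (1,))]
        new_centers[k] = selected.mean(dim=0)
    return new_centers


x = torch.tensor([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
centers = torch.tensor([[0.0, 0.0], [0.0, 0.0], [5.0, 5.0]])  # two identical centers
labels = torch.tensor([0, 0, 2])                              # cluster 1 received no points
new_centers = update_centers(x, labels, centers)
print(torch.isnan(new_centers).any().item())
```

Without the `if` branch, `new_centers[1]` would be nan and would spread to every later distance computation.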

Sorry for writing this here; it should be a pull request, but I am new to GitHub.

Does not work with 2D, 3D features

How to set the num_features?

kornia collaboration

Hi @subhadarship! Good repo. Would you like to integrate this algorithm into kornia?

An error occurs between the kmeans function runs.

I'm trying to run k-means with 64 clusters on 50,000 data points with 512 dimensions, and the following error occurs.
running k-means on cuda.. [running kmeans]: 0it [00:00, ?it/s]tcmalloc: large alloc 6553600000 bytes == 0x7f7bf5600000 @ 0x7f82122b0b6b 0x7f82122d0379 0x7f81c2f8b74e 0x7f81c2f8d7b6 0x7f81fd3e1fa2 0x7f81fd6ccbd3 0x7f81fd6a4207 0x7f81fd6bf2dc 0x7f81fd69b78a 0x7f81fd6a4207 0x7f81fd6bf2dc 0x7f81fd78b0dd 0x7f81fd3f309f 0x7f81fd3f56b6 0x7f81fd3f5bad 0x7f81fd3f5d28 0x7f81fd103ae5 0x7f81fd6cdae9 0x7f81fcf4d124 0x7f81fd85ea02 0x7f81fd75cc4e 0x7f81fecc8321 0x7f81fcf4d124 0x7f81fd85ea02 0x7f81fd9a369e 0x7f820d350fa9 0x7f820d3519b6 0x566f73 0x59fd0e 0x4b1eea 0x619d0c tcmalloc: large alloc 6553600000 bytes == 0x7f7a6ec00000 @ 0x7f82122b0b6b 0x7f82122d0379 0x7f81c2f8b74e 0x7f81c2f8d7b6 0x7f81fd9f7d53 0x7f81fd3e28cf 0x7f81fd6f9cac 0x7f81fd6a531b 0x7f81fd6c4135 0x7f81fd69fb4b 0x7f81fd6a531b 0x7f81fd6c4135 0x7f81fd78e2be 0x7f81fd3e1145 0x7f81fd9491ff 0x7f81fcfaec1b 0x7f81fd87b056 0x7f81fd78dba2 0x7f81fd2dbe43 0x7f81fd6cea59 0x7f81fcf4d1b1 0x7f81fd869183 0x7f81fd778d9e 0x7f81fed60021 0x7f81fcf4d1b1 0x7f81fd869183 0x7f81fd9ae69e 0x7f820d37c5f3 0x566f73 0x59fd0e 0x4b1eea ^C

I don't know why '^C' is printed on its own.
I tried to use this code as part of the loss class of pytorch. Is there any other solution?

This is the code:
_, centroids = kmeans(descriptor, num_clusters=64, distance='euclidean', device=torch.device('cuda'))
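The allocation size in the log is what a float32 tensor of shape (50000, 64, 512) needs: 50000 × 64 × 512 × 4 bytes = 6,553,600,000 bytes. That is consistent with the pairwise distances being computed by broadcasting an N×1×M tensor against a 1×K×M tensor, although the library source is not shown here, so treat that as a likely explanation. The sketch below checks the arithmetic and shows one way to get the same N×K matrix without the N×K×M intermediate, using `torch.cdist`. `pairwise_sq_euclidean` is a hypothetical name.

```python
import torch

n, k, m = 50_000, 64, 512
bytes_needed = n * k * m * 4          # float32 uses 4 bytes per element
print(bytes_needed)                   # 6553600000


def pairwise_sq_euclidean(data1, data2):
    """Squared Euclidean distances, shape (len(data1), len(data2))."""
    return torch.cdist(data1, data2) ** 2


# small demo shapes so the comparison with the broadcast version stays cheap
a = torch.randn(100, 16)
b = torch.randn(5, 16)
broadcast = ((a.unsqueeze(1) - b.unsqueeze(0)) ** 2).sum(dim=-1)
matches = torch.allclose(pairwise_sq_euclidean(a, b), broadcast, rtol=1e-4, atol=1e-3)
print(matches)
```

For the shapes in this issue, the N×K result itself is only 50000 × 64 × 4 bytes = 12.8 MB.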

bad default parameters!

Why on earth does the stable release default to an unlimited number of k-means iterations and assume that the loop's break condition will be met? This caused my GPU to run k-means indefinitely on a single sample for the whole weekend and wasted 2 days of experiments!

Either publish reliable code or don't touch your keyboard

Initialize with setting Seed

Hi, as far as I understand, it is not possible to set a seed to reproduce the results of the algorithm. Am I correct here?
I looked at the code and I believe it would be quite simple to change. Are you interested in adding this feature, and can I contribute?

If selected is empty all data will become Nan

You don't handle the case where no data points are assigned to a cluster. If the number of clusters is large enough, all data will become nan.

Feature request: Switch off output to stdout

I suggest adding a verbose=False option to kmeans so that we can switch off the verbose output.

tqdm_flag missing

I had to change the code to remove verbosity.
Why is tqdm_flag missing?

Please Share the requirement file

Please share the requirements.txt file.

Try to solve the OOM for large scale dataset

Hi, this is an amazing module. But if I set a large number of clusters, or the dataset is too large, I run into OOM issues.
I have refactored the code to compute distances in batches. Please feel free to use it if it is useful to you.

Best regards.
Alisca

Appendix: the refactored code for Euclidean distance calculation in batches.

import numpy as np
import torch

def pairwise_distance(data1, data2, device=torch.device('cpu'), batch_size=100000):
    # transfer to device
    data1, data2 = data1.to(device), data2.to(device)

    # N*1*M
    A = data1.unsqueeze(dim=1)

    # 1*N*M
    B = data2.unsqueeze(dim=0)

    # allocate the result with the same dtype and device as the inputs
    dis_reduce = torch.zeros([data1.shape[0], data2.shape[0]], dtype=data1.dtype, device=device)
    for batch_idx in range(int(np.ceil(data1.shape[0] / batch_size))):
        # only batch_size rows of A are expanded against B at a time
        dis = (A[batch_idx * batch_size: (batch_idx + 1) * batch_size] - B) ** 2.0
        # no squeeze(): it would drop a dimension when a batch or data2 has a single row
        dis = dis.sum(dim=-1)
        dis_reduce[batch_idx * batch_size: (batch_idx + 1) * batch_size] = dis
    return dis_reduce

IndexError: index_select(): Index is supposed to be a vector

This error occurs when I try to cluster with a set of tensors with n_clusters = 2.

IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

Hi,
if num_clusters=1 I get this error:
IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)
How do I solve this problem?
Thanks
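One plausible cause, offered as a guess and not as a confirmed diagnosis of the library, is a `.squeeze()` call without a `dim` argument. It removes every dimension of size 1, so an (N, 1) distance matrix for a single cluster becomes a 1-D tensor and `argmin(dim=1)` no longer has a dimension 1 to reduce. The same trap would also fit the "Index is supposed to be a vector" error in the previous issue. A small demonstration:

```python
import torch

dis = torch.rand(6, 1)            # distances from 6 points to a single center

squeezed = dis.squeeze()          # shape (6,): the cluster dimension is gone
try:
    torch.argmin(squeezed, dim=1)
    failed = False
except IndexError as err:
    failed = True
    print(err)

# keeping the explicit (N, K) shape works for any number of clusters
labels = torch.argmin(dis, dim=1)
print(labels.tolist())

# the same trap for index tensors: a single match squeezes to a 0-d tensor
idx = torch.nonzero(torch.tensor([False, True, False]), as_tuple=False)
print(idx.squeeze().dim(), idx.squeeze(dim=1).dim())
```

Squeezing a named dimension (`squeeze(dim=1)`) keeps the result a vector even when it holds a single element.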

no tqdm in DEPENDENCIES in setup.py

Does it work on DDP mode of PyTorch?

Hi! I wonder whether I can use the function on multiple GPUs and whether it works in PyTorch's DDP mode. Thanks!

kmeans-find optimal k

Hi,
Can you explain how to find the optimal k for unsupervised learning, e.g. with the elbow method?
Thanks
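As a rough sketch of the elbow method: run k-means for a range of k, record the within-cluster sum of squared distances (often called inertia), and look for the k after which the curve stops dropping sharply. The helper `kmeans_inertia` below is a small stand-alone k-means written for this illustration. It does not call kmeans_pytorch, and the synthetic three-blob data is an assumption chosen so that the elbow is easy to see.

```python
import torch


def kmeans_inertia(x, k, num_iters=25, seed=0):
    """Run a small k-means and return the within-cluster sum of squared distances."""
    gen = torch.Generator().manual_seed(seed)
    centers = x[torch.randperm(len(x), generator=gen)[:k]].clone()
    for _ in range(num_iters):
        labels = torch.cdist(x, centers).argmin(dim=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(dim=0)
    sq_dist = torch.cdist(x, centers) ** 2
    return sq_dist.min(dim=1).values.sum().item()


# three well-separated blobs, so the curve should flatten around k = 3
torch.manual_seed(0)
blob_means = ([0.0, 0.0], [5.0, 5.0], [0.0, 5.0])
x = torch.cat([torch.randn(100, 2) * 0.3 + torch.tensor(c) for c in blob_means])

inertias = {k: kmeans_inertia(x, k) for k in range(1, 7)}
for k, value in inertias.items():
    print(k, round(value, 1))
```

Plotting `inertias` against k and picking the bend is the usual visual check. Because the initialisation is random, averaging over several seeds gives a smoother curve.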

center_shift=nan

Hello,

Sometimes I get center_shift=nan, and I don't understand why or how to fix it.

How do I define the initial cluster center?

Your method helps me a lot, but how do I define the initial cluster centers myself, similar to the "init" parameter of the KMeans method in sklearn?

Input of the same data leads to different results.

Hi,

I tried to cluster the same data with Kmeans_pytorch, but got different clustering results. What is the cause of this?

Looking forward to your reply.

Return values are forced to be on CPU

Hi!
Thanks for the great tool!

I noticed that the output of kmeans() is forced to be on the CPU regardless of what device was requested. Line 121 here

return choice_cluster.cpu(), initial_state.cpu()

I'd expect the return values to be on the same device as the device requested in the function call.

Is there a particular reason for forcing the cpu in returned values?

Way to suppress tqdm printout?

Hi,

I'm trying to use your package as a part of larger algorithm and I'd really like to suppress the printout as it is not relevant for my purposes. The solution here (diverting the standard output of the system) does not seem to work.
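One likely reason redirecting standard output does not help is that tqdm writes its progress bar to standard error by default, so both streams need to be redirected. The sketch below uses only the standard library. `noisy_step` is a hypothetical stand-in for a library call that prints to both streams, and `silenced` is a helper name invented for this example.

```python
import contextlib
import io
import sys


def noisy_step():
    """Stand-in for a library call that prints to stdout and draws a bar on stderr."""
    print("running k-means on cpu..")
    sys.stderr.write("[running kmeans]: 7it\n")
    return 42


@contextlib.contextmanager
def silenced():
    """Temporarily send both stdout and stderr to an in-memory buffer."""
    sink = io.StringIO()
    with contextlib.redirect_stdout(sink), contextlib.redirect_stderr(sink):
        yield sink


with silenced() as captured:
    result = noisy_step()

print(result)
print(len(captured.getvalue()) > 0)
```

This relies on the progress bar being created inside the `with` block, so that it picks up the redirected stream. Output written by native code directly to the file descriptors is not affected.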

SET NUMBER OF ITERATIONS

I want to choose the number of iterations, but when I pass iter_limit = xxx I get an error saying that the parameter doesn't exist.

kmeans_predict super slow with print

On line 103 of __init__.py, in the method kmeans_predict, can you make the print an option for a verbose mode (or just delete it)? If the method gets called a lot, the print really slows everything down.

Can the input have shape (batch, node, feature)?

Hello, can the GPU-accelerated version of k-means take a 3-dimensional tensor as input? For example, a tensor of shape (3, 758, 32) interpreted as (batch, node, feature), with the batch elements processed in parallel.

pytorch warning

Warning ( torch 1.5.0+cpu )

[running kmeans]: 0it [00:00, ?it/s]..\torch\csrc\utils\python_arg_parser.cpp:756: UserWarning: This overload of nonzero is deprecated:
	nonzero(Tensor input, *, Tensor out)

Example code

import torch
import numpy as np
from kmeans_pytorch import kmeans, kmeans_predict

# data
data_size, dims, num_clusters = 1000, 2, 3
x = np.random.randn(data_size, dims) / 6
x = torch.from_numpy(x)

# kmeans
cluster_ids_x, cluster_centers = kmeans(
    X=x, num_clusters=num_clusters, distance='euclidean', device=torch.device('cpu')
)

print(cluster_ids_x,cluster_centers)

Add soft dynamic time warping metric

I wrote a pairwise distance metric that computes the soft dynamic time warping distance (#19).
The actual computation of the distance is done by Maghoumi/pytorch-softdtw-cuda.

Just to say a few words about the code:
SoftDTW takes inputs with shape (batch_size, seq_len, feature_dims).
As the kmeans in this repo deals only with univariate inputs, feature_dims is set to 1.
The tensors are broadcast accordingly.
The code seems to work, but please correct me if I'm wrong!

def pairwise_soft_dtw(data1, data2, sdtw=None, device=torch.device('cpu')):
    if sdtw is None:
        raise ValueError('sdtw is None - initialize it with SoftDTW')

    # transfer to device
    data1, data2 = data1.to(device), data2.to(device)

    # (batch_size, seq_len, feature_dim=1)
    A = data1.unsqueeze(dim=2)

    # (cluster_size, seq_len, feature_dim=1)
    B = data2.unsqueeze(dim=2)

    distances = []
    for b in B:
        # (1, seq_len, 1)
        b = b.unsqueeze(dim=0)
        A, b = torch.broadcast_tensors(A, b)
        # (batch_size, 1)
        sdtw_distance = sdtw(b, A).view(-1, 1)
        distances.append(sdtw_distance)

    # (batch_size, cluster_size)
    dis = torch.cat(distances, dim=1)
    return dis

Does not converge on GPU if dims becomes very large

A simple example to reproduce this issue:

import torch
import numpy as np
import matplotlib.pyplot as plt
from kmeans_pytorch import kmeans, kmeans_predict

np.random.seed(123)

data_size, dims, num_clusters = 1000, 200, 3
x = np.random.randn(data_size, dims) / 6
x = torch.from_numpy(x)

if torch.cuda.is_available():
    device = torch.device('cuda:0')
else:
    device = torch.device('cpu')

cluster_ids_x, cluster_centers = kmeans(
    X=x, num_clusters=num_clusters, distance='soft_dtw', device=device
)

discussion

It seems the current implementation of k-means may not be suitable for soft-dtw. A simple solution is to mimic the implementation of tslearn https://github.com/tslearn-team/tslearn/blob/main/tslearn/clustering/kmeans.py .

TypeError: kmeans() got an unexpected keyword argument 'cluster_centers'

x_batch has size [100, 256, 3, 3] and len(x_batch) is 400. I want to run k-means with 100 clusters on this data. On every run after the first, I want to pass in the previous cluster_centers to retrain (following the example), but I get the error in the title. What should I do? Thanks!