
subhadarship / kmeans_pytorch


kmeans using PyTorch

Home Page: https://subhadarship.github.io/kmeans_pytorch

License: MIT License

Jupyter Notebook 91.65% Python 8.27% Makefile 0.08%
kmeans-clustering pytorch docs jekyll jekylbook github-pages gpu

kmeans_pytorch's Introduction

K Means using PyTorch

PyTorch implementation of k-means that can run on the GPU


Getting Started


import torch
import numpy as np
from kmeans_pytorch import kmeans

# data
data_size, dims, num_clusters = 1000, 2, 3
x = np.random.randn(data_size, dims) / 6
x = torch.from_numpy(x)

# kmeans
cluster_ids_x, cluster_centers = kmeans(
    X=x, num_clusters=num_clusters, distance='euclidean', device=torch.device('cuda:0')
)

See example.ipynb for a more elaborate example.

Requirements

  • PyTorch version >= 1.0.0
  • Python version >= 3.6

Installation

install with pip:

pip install kmeans-pytorch

Installing from source

To install from source and develop locally:

git clone https://github.com/subhadarship/kmeans_pytorch
cd kmeans_pytorch
pip install --editable .

CPU vs GPU

See cpu_vs_gpu.ipynb for a comparison between CPU and GPU.

Notes

  • useful when clustering a large number of samples
  • uses the GPU for faster matrix computations
  • supports Euclidean and cosine distances (for now)

Credits

  • This implementation closely follows the style of this
  • Documentation is done using the awesome theme jekyllbook

License

MIT

kmeans_pytorch's People

Contributors

sprenkamp, subhadarship, wooohoooo


kmeans_pytorch's Issues

seed

The example.ipynb finds a different center_embedding on every run. How can this be resolved?
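
One possible approach is to seed the random number generators before each call. The sketch below assumes the initial centers are drawn from NumPy's global generator, which has not been checked against the package source here. `set_seed` and `pick_initial_indices` are hypothetical names; the latter only stands in for a random initialization step.

```python
import numpy as np
import torch


def set_seed(seed: int) -> None:
    """Seed both NumPy and PyTorch so random draws repeat across runs."""
    np.random.seed(seed)
    torch.manual_seed(seed)


def pick_initial_indices(num_samples: int, num_clusters: int) -> np.ndarray:
    # Hypothetical stand-in for a random center initialization
    # (assumption: the real initialization uses NumPy's global RNG).
    return np.random.choice(num_samples, num_clusters, replace=False)


set_seed(0)
first = pick_initial_indices(1000, 3)
set_seed(0)
second = pick_initial_indices(1000, 3)
print((first == second).all())
```

If the assumption holds, calling `set_seed(...)` immediately before `kmeans(...)` should make the chosen initial centers, and therefore the result, repeatable on the same machine.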

Support for "kl_divergence"

I have implemented support for Kullback-Leibler divergence as follows. Shall I make a pull request of it?

def pairwise_kl_divergence(data1, data2, device=torch.device('cpu')):
    # transfer to device
    data1, data2 = data1.to(device), data2.to(device)

    # N*1*M
    A = data1.unsqueeze(dim=1)

    # 1*N*M
    B = data2.unsqueeze(dim=0)

    # normalize the points 
    A_normalized = torch.nn.functional.log_softmax(A, dim=-1)
    B_normalized = torch.nn.functional.log_softmax(B, dim=-1)

    kl_div = torch.nn.functional.kl_div(A_normalized, B_normalized, reduction='none', log_target=True)

    # return N*N matrix for pairwise distance
    kl_div = kl_div.mean(-1)
    return kl_div

Unit test with sklearn?

Hi, thanks for the implementation !

Are you considering writing a unit test against sklearn's KMeans?

Error trying to cluster from numpy

Hi, I'm not really using pytorch, but I want to use balanced kmeans. My code is as follows:

from torch import from_numpy
from balanced_kmeans import kmeans_equal
...
  # load X, a 23000x59 ndarray
  n_cluster = 50
  X_tensor = from_numpy(X)
  choices, centers = kmeans_equal(X_tensor,
                                  num_clusters=n_cluster,
                                  cluster_size=X.shape[0] // n_cluster)

I get the following error:
RuntimeError: expand(torch.LongTensor{[59]}, size=[]): the number of sizes provided (0) must be greater or equal to the number of dimensions in the tensor (1)

Am I doing something wrong creating my tensor from numpy? I apologize because I am asking more of like a general pytorch question and not really specific to kmeans_pytorch (and tbh I'm a total pytorch newb!) Is there an example anywhere of using kmeans_equal on numpy data? I bet other people would find that useful. Thanks in advance for any tips you can provide!

Overload of nonzero is deprecated

Hi, I was executing the example and got the following warning:

[running kmeans]: 0it [00:00, ?it/s]

running k-means on cpu..

/pytorch/torch/csrc/utils/python_arg_parser.cpp:756: UserWarning: This overload of nonzero is deprecated:
	nonzero(Tensor input, *, Tensor out)
Consider using one of the following signatures instead:
	nonzero(Tensor input, *, bool as_tuple)
[running kmeans]: 7it [00:02,  3.24it/s, center_shift=0.000091, iteration=7, tol=0.000100] 

Here is the code I ran to get this message.

import torch
import numpy as np
import matplotlib.pyplot as plt
from kmeans_pytorch import kmeans, kmeans_predict

# set random seed
np.random.seed(123)

# data
data_size, dims, num_clusters = 500000, 100, 3
x = np.random.randn(data_size, dims) / 6
x = torch.from_numpy(x)

# set device
if torch.cuda.is_available():
    device = torch.device('cuda:0')
else:
    device = torch.device('cpu')

# k-means
cluster_ids_x, cluster_centers = kmeans(
    X=x, num_clusters=num_clusters, distance='euclidean', device=device
)

Thanks in advance.
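
The warning is raised by PyTorch when `nonzero` is called without the `as_tuple` argument. A minimal sketch of the two explicit forms follows, using made-up cluster ids rather than the package's internals:

```python
import torch

# Made-up cluster assignments for five samples
cluster_ids = torch.tensor([0, 2, 1, 2, 0])
mask = cluster_ids == 2

# as_tuple=False returns an (n, 1) index matrix for a 1-D input
selected = torch.nonzero(mask, as_tuple=False).squeeze(1)

# as_tuple=True returns one index tensor per dimension
(selected_alt,) = torch.nonzero(mask, as_tuple=True)

print(selected.tolist(), selected_alt.tolist())
```

Both forms select the same indices; passing `as_tuple` explicitly is what avoids the deprecated overload.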

GPU vs CPU

I did not see any significant difference between GPU and CPU speed. In the comparison, only the data size is varied widely, while the number of clusters stays low (2, 3, ...).
I may be wrong, but I think what really matters for speed is the number of clusters. Re-running the experiment with more clusters, say up to 64, would show whether there is a significant speedup.
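
A rough way to test this hypothesis is to time only the assignment step for several cluster counts. The sketch below is not the notebook's benchmark; `time_assignment` is a hypothetical helper, and the sizes are kept small so it runs quickly on a CPU.

```python
import time

import torch


def time_assignment(num_samples, dims, num_clusters, device, repeats=3):
    """Average seconds for one distance-plus-argmin pass on `device`."""
    X = torch.randn(num_samples, dims, device=device)
    centers = torch.randn(num_clusters, dims, device=device)
    if device.type == "cuda":
        torch.cuda.synchronize(device)  # finish setup before timing
    start = time.perf_counter()
    for _ in range(repeats):
        torch.cdist(X, centers).argmin(dim=1)
    if device.type == "cuda":
        torch.cuda.synchronize(device)  # wait for queued GPU work
    return (time.perf_counter() - start) / repeats


device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
for k in (2, 16, 64):
    print(k, time_assignment(2000, 16, k, device))
```

Running the same loop once with the CPU device and once with a CUDA device, and with larger sizes, gives a like-for-like comparison across cluster counts.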

Got an Unexpected keyword argument "iter_limit"

Looked into your source code, and I want to limit the number of iterations in estimating the centroids. I tried to do this on my virtual GPU machine and I got the error below:

cluster_ids_x, cluster_centers = kmeans(
            X=x, num_clusters=num_clusters, distance='cosine',iter_limit = 10, device=torch.device('cuda:0')
        )
  classes, centroids = self.TrainData()
  File "/home/ubuntu/new_sonalysis_deployment/Sonalysis-Front-End-listner_profile/PlayerTracker_upgrade.py", line 122, in TrainData
    X=x, num_clusters=num_clusters, distance='cosine',iter_limit = 10, device=torch.device('cuda:0')
TypeError: kmeans() got an unexpected keyword argument 'iter_limit'
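
The error suggests that the installed version of `kmeans` does not accept `iter_limit`, even if the source that was read does. A small sketch for checking this before calling follows; `supports_kwarg` and the two toy functions are hypothetical and do not reproduce the package's signature.

```python
import inspect


def supports_kwarg(func, name: str) -> bool:
    """Return True if `func` accepts a keyword argument called `name`."""
    params = inspect.signature(func).parameters
    if name in params:
        return True
    # A **kwargs parameter accepts any keyword
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Toy stand-ins (hypothetical signatures, for illustration only)
def kmeans_without_limit(X, num_clusters):
    return None


def kmeans_with_limit(X, num_clusters, iter_limit=0):
    return None


print(supports_kwarg(kmeans_without_limit, "iter_limit"))
print(supports_kwarg(kmeans_with_limit, "iter_limit"))
```

Applied to the imported `kmeans`, a `False` result would indicate that the installed release is older than the source being read.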

Clustering of batch data

Does this method support clustering batched data, that is, input of shape (batch_size, n, dim), clustering the n sample points of dimension dim within each batch?
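
Whether the package accepts batched input is not stated here. As an illustration of what a batched version would involve, the sketch below performs only the assignment step on a 3-D tensor with `torch.cdist`, which accepts batched inputs. `batched_assign` is a hypothetical helper, not part of the package.

```python
import torch


def batched_assign(X: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
    """Assign each point to its nearest center, independently per batch.

    X:       (batch, n, dim)
    centers: (batch, k, dim)
    returns: (batch, n) tensor of cluster ids
    """
    distances = torch.cdist(X, centers)  # (batch, n, k)
    return distances.argmin(dim=-1)


X = torch.randn(3, 758, 32)
centers = X[:, :4, :].clone()  # first four points of each batch as centers
ids = batched_assign(X, centers)
print(ids.shape)
```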

Quick and dirty fix for "Can't choose a cluster if two points are too close to each other, that's where the nan come"

Hey, thanks for the pytorch kmeans implementation. I also encountered the following problem and have a solution for it:
Add the following lines at line 84 of kmeans_pytorch/__init__.py:

            if selected.shape[0] == 0:
                selected = X[torch.randint(len(X), (1,))]

The problem is that if two centroids are too close together, one centroid gets all the nearby points and the other is left with zero assigned points. When that happens, the selected points for that centroid are empty. The code above detects this and picks a random point to replace that centroid. There are probably more elegant ways to choose the replacement point, but this works and converges, so it is a quick and dirty fix.

Sorry for writing it here; it should be a pull request, but I am new to GitHub.
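
To show the fix in context, here is a self-contained sketch of a center update step with the empty-cluster check. `update_centers` is a hypothetical function written for illustration; it is not the package's code.

```python
import torch


def update_centers(X, assignments, centers):
    """Recompute each center as the mean of its assigned points."""
    new_centers = centers.clone()
    for k in range(centers.shape[0]):
        selected = X[assignments == k]
        # Empty cluster: the mean of zero points would be NaN,
        # so fall back to one randomly chosen sample.
        if selected.shape[0] == 0:
            selected = X[torch.randint(len(X), (1,))]
        new_centers[k] = selected.mean(dim=0)
    return new_centers


X = torch.tensor([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
assignments = torch.zeros(4, dtype=torch.long)  # cluster 1 receives no points
centers = torch.zeros(2, 2)
updated = update_centers(X, assignments, centers)
print(torch.isnan(updated).any())
```

Without the check, the second center would become NaN, which then propagates into the center shift.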

An error occurs between the kmeans function runs.

I'm trying to run k-means with 64 clusters on a dataset of 50,000 samples with 512 dimensions, and the following error occurs.
running k-means on cuda.. [running kmeans]: 0it [00:00, ?it/s]tcmalloc: large alloc 6553600000 bytes == 0x7f7bf5600000 @ 0x7f82122b0b6b 0x7f82122d0379 0x7f81c2f8b74e 0x7f81c2f8d7b6 0x7f81fd3e1fa2 0x7f81fd6ccbd3 0x7f81fd6a4207 0x7f81fd6bf2dc 0x7f81fd69b78a 0x7f81fd6a4207 0x7f81fd6bf2dc 0x7f81fd78b0dd 0x7f81fd3f309f 0x7f81fd3f56b6 0x7f81fd3f5bad 0x7f81fd3f5d28 0x7f81fd103ae5 0x7f81fd6cdae9 0x7f81fcf4d124 0x7f81fd85ea02 0x7f81fd75cc4e 0x7f81fecc8321 0x7f81fcf4d124 0x7f81fd85ea02 0x7f81fd9a369e 0x7f820d350fa9 0x7f820d3519b6 0x566f73 0x59fd0e 0x4b1eea 0x619d0c tcmalloc: large alloc 6553600000 bytes == 0x7f7a6ec00000 @ 0x7f82122b0b6b 0x7f82122d0379 0x7f81c2f8b74e 0x7f81c2f8d7b6 0x7f81fd9f7d53 0x7f81fd3e28cf 0x7f81fd6f9cac 0x7f81fd6a531b 0x7f81fd6c4135 0x7f81fd69fb4b 0x7f81fd6a531b 0x7f81fd6c4135 0x7f81fd78e2be 0x7f81fd3e1145 0x7f81fd9491ff 0x7f81fcfaec1b 0x7f81fd87b056 0x7f81fd78dba2 0x7f81fd2dbe43 0x7f81fd6cea59 0x7f81fcf4d1b1 0x7f81fd869183 0x7f81fd778d9e 0x7f81fed60021 0x7f81fcf4d1b1 0x7f81fd869183 0x7f81fd9ae69e 0x7f820d37c5f3 0x566f73 0x59fd0e 0x4b1eea ^C

I don't know why '^C' is printed on its own.
I tried to use this code as part of the loss class of pytorch. Is there any other solution?

this is the code
_, centroids = kmeans(descriptor, num_clusters=64, distance='euclidean', device=torch.device('cuda'))
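
The allocation size in the log, 6553600000 bytes, matches a float32 tensor of shape (50000, 64, 512), which is what broadcasting every sample against every center would produce. This reading is inferred from the numbers and is not stated in the log. The arithmetic:

```python
num_samples, num_clusters, dims = 50_000, 64, 512
bytes_per_float32 = 4

# Intermediate tensor from broadcasting samples against centers: (N, K, D)
broadcast_bytes = num_samples * num_clusters * dims * bytes_per_float32

# Final distance matrix alone: (N, K)
distance_matrix_bytes = num_samples * num_clusters * bytes_per_float32

print(broadcast_bytes)        # 6553600000
print(distance_matrix_bytes)  # 12800000
```

Computing distances in chunks of samples, or with a routine that does not materialize the (N, K, D) intermediate, keeps the peak closer to the second figure.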

bad default parameters!

Why on earth does the stable release default to an unlimited number of k-means iterations and assume that the loop's break condition will always be met? This caused my GPU to run k-means indefinitely on a single sample for the whole weekend and wasted 2 days of experiments!

Either publish reliable code or don't touch your keyboard
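
For reference, a loop that stops on either a tolerance or an iteration cap can be sketched as follows. This is a minimal standalone Lloyd iteration, not the package's implementation; `lloyd`, `max_iter`, and `seed` are names chosen for this example.

```python
import torch


def lloyd(X, num_clusters, tol=1e-4, max_iter=100, seed=0):
    """Minimal k-means loop with two stopping conditions."""
    gen = torch.Generator().manual_seed(seed)
    init = torch.randperm(X.shape[0], generator=gen)[:num_clusters]
    centers = X[init.to(X.device)].clone()
    for iteration in range(1, max_iter + 1):
        # Assignment step: nearest center for every point
        assignments = torch.cdist(X, centers).argmin(dim=1)
        # Update step: empty clusters keep their previous center
        new_centers = centers.clone()
        for k in range(num_clusters):
            members = X[assignments == k]
            if members.shape[0] > 0:
                new_centers[k] = members.mean(dim=0)
        shift = (new_centers - centers).norm(dim=1).sum()
        centers = new_centers
        # A NaN shift never compares below tol, so the range() cap
        # above is what guarantees termination.
        if shift < tol:
            break
    return assignments, centers, iteration


torch.manual_seed(0)
X = torch.randn(200, 2)
ids, centers, n_iter = lloyd(X, 3, max_iter=5)
print(n_iter)
```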

Initialize with setting Seed

Hi, as far as I understand, it is not possible to set a seed to reproduce the results of the algorithm. Am I correct?
I looked at the code and I believe it would be quite simple to add. Are you interested in adding this feature, and can I contribute?

tqdm_flag missing

I had to change the code to remove the verbose output.
Why is tqdm_flag missing?

Try to solve the OOM for large scale dataset

Hi, this is an amazing module. But if I set a large number of clusters, or the dataset is too large, I run into OOM issues.
I have refactored the code to compute distances in batches. Please feel free to use it if it is useful to you.

Best regards.
Alisca

Appendix: the refactored code for the Euclidean distance calculation with batching.

import numpy as np
import torch

def pairwise_distance(data1, data2, device=torch.device('cpu'), batch_size=100000):
    # transfer to device
    data1, data2 = data1.to(device), data2.to(device)

    # N*1*M
    A = data1.unsqueeze(dim=1)

    # 1*N*M
    B = data2.unsqueeze(dim=0)

    # accumulate the result one batch of rows at a time
    dis_reduce = torch.zeros([data1.shape[0], data2.shape[0]])
    for batch_idx in range(int(np.ceil(data1.shape[0]/batch_size))):
        dis = (A[batch_idx * batch_size: (batch_idx+1) * batch_size] - B) ** 2.0
        dis = dis.sum(dim=-1).squeeze()
        dis_reduce[batch_idx * batch_size: (batch_idx+1) * batch_size] = dis
    return dis_reduce

kmeans-find optimal k

Hi,
Can you explain how to find the optimal k for unsupervised learning, for example with the elbow method?
Thanks
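
As a general illustration of the elbow method (not a feature of the package): run k-means for a range of k, compute the within-cluster sum of squared distances for each result, and look for the k after which the curve flattens. The helper below, `inertia`, is a hypothetical name for that quantity.

```python
import torch


def inertia(X, cluster_ids, centers):
    """Sum of squared distances from each point to its assigned center."""
    return ((X - centers[cluster_ids]) ** 2).sum().item()


# Tiny check: two points, one center halfway between them
X = torch.tensor([[0.0, 0.0], [2.0, 0.0]])
cluster_ids = torch.tensor([0, 0])
centers = torch.tensor([[1.0, 0.0]])
value = inertia(X, cluster_ids, centers)
print(value)  # 2.0

# Usage with the package (signature as in Getting Started; needs kmeans_pytorch):
# for k in range(2, 11):
#     ids, ctrs = kmeans(X=x, num_clusters=k, distance='euclidean', device=device)
#     print(k, inertia(x.cpu(), ids, ctrs.to(x.dtype)))
```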

center_shift=nan

Hello,

Sometimes I get center_shift=nan. I don't understand why, or how I can fix it.

Return values are forced to be on CPU

Hi!
Thanks for the great tool!

I noticed that the output of kmeans() is forced to be on the CPU regardless of what device was requested. Line 121 here

return choice_cluster.cpu(), initial_state.cpu()

I'd expect the return values to be on the same device as the device requested in the function call.

Is there a particular reason for forcing the cpu in returned values?

Way to suppress tqdm printout?

Hi,

I'm trying to use your package as a part of larger algorithm and I'd really like to suppress the printout as it is not relevant for my purposes. The solution here (diverting the standard output of the system) does not seem to work.
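
One likely reason redirecting standard output alone is not enough is that tqdm writes to standard error by default. A sketch that redirects both streams for the duration of a call follows; `silenced` is a hypothetical helper, and it only affects tqdm bars created inside the block.

```python
import contextlib
import io
import sys


@contextlib.contextmanager
def silenced():
    """Temporarily discard everything written to stdout and stderr."""
    sink = io.StringIO()
    with contextlib.redirect_stdout(sink), contextlib.redirect_stderr(sink):
        yield


def noisy():
    # Stand-in for a function that prints and shows a progress bar
    print("running k-means on cpu..")
    print("progress", file=sys.stderr)
    return 42


with silenced():
    result = noisy()
print(result)
```

The same `with silenced():` block can wrap a call to `kmeans(...)` when no verbosity option is available in the installed version.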

SET NUMBER OF ITERATIONS

I want to choose the number of iterations, but when I pass iter_limit = xxx I get an error saying the parameter doesn't exist.

kmeans_predict super slow with print

On line 103 of __init__.py, in the method kmeans_predict, could you make the print optional behind a verbose mode (or just delete it)? If the method is called a lot, the print really slows everything down.

Can the input be (batch, node, feature)?

Hello, can the GPU-accelerated version of k-means take a 3-dimensional tensor as input? For example, a tensor of shape (3, 758, 32) as (batch, node, feature), with the batches expected to be processed in parallel.

pytorch warning

Warning ( torch 1.5.0+cpu )

[running kmeans]: 0it [00:00, ?it/s]..\torch\csrc\utils\python_arg_parser.cpp:756: UserWarning: This overload of nonzero is deprecated:
	nonzero(Tensor input, *, Tensor out)

Example code

import torch
import numpy as np
from kmeans_pytorch import kmeans, kmeans_predict

# data
data_size, dims, num_clusters = 1000, 2, 3
x = np.random.randn(data_size, dims) / 6
x = torch.from_numpy(x)

# kmeans
cluster_ids_x, cluster_centers = kmeans(
    X=x, num_clusters=num_clusters, distance='euclidean', device=torch.device('cpu')
)

print(cluster_ids_x,cluster_centers)

Add soft dynamic time warping metric

I wrote a pairwise distance metric that computes the soft dynamic time warping distance (#19).
The actual computation of the distance is done by Maghoumi/pytorch-softdtw-cuda.

A few words about the code:
SoftDTW takes inputs with shape (batch_size, seq_len, feature_dims).
As the kmeans in this repo deals only with univariate inputs, feature_dims is set to 1.
The tensors are broadcast accordingly.
The code seems to work, but please correct it if I'm wrong!

def pairwise_soft_dtw(data1, data2, sdtw=None, device=torch.device('cpu')):
    if sdtw is None:
        raise ValueError('sdtw is None - initialize it with SoftDTW')

    # transfer to device
    data1, data2 = data1.to(device), data2.to(device)

    # (batch_size, seq_len, feature_dim=1)
    A = data1.unsqueeze(dim=2)

    # (cluster_size, seq_len, feature_dim=1)
    B = data2.unsqueeze(dim=2)

    distances = []
    for b in B:
        # (1, seq_len, 1)
        b = b.unsqueeze(dim=0)
        A, b = torch.broadcast_tensors(A, b)
        # (batch_size, 1)
        sdtw_distance = sdtw(b, A).view(-1, 1)
        distances.append(sdtw_distance)

    # (batch_size, cluster_size)
    dis = torch.cat(distances, dim=1)
    return dis

Does not converge on GPU if dims becomes very large

A simple example to reproduce this issue:

import torch
import numpy as np
import matplotlib.pyplot as plt
from kmeans_pytorch import kmeans, kmeans_predict

np.random.seed(123)

data_size, dims, num_clusters = 1000, 200, 3
x = np.random.randn(data_size, dims) / 6
x = torch.from_numpy(x)

if torch.cuda.is_available():
    device = torch.device('cuda:0')
else:
    device = torch.device('cpu')

cluster_ids_x, cluster_centers = kmeans(
    X=x, num_clusters=num_clusters, distance='soft_dtw', device=device
)

discussion

It seems the current implementation of k-means may not be suitable for soft-dtw. A simple solution is to mimic the implementation in tslearn: https://github.com/tslearn-team/tslearn/blob/main/tslearn/clustering/kmeans.py
