kmodes

Description

Python implementations of the k-modes and k-prototypes clustering algorithms. Relies on numpy for a lot of the heavy lifting.

k-modes is used for clustering categorical variables. It defines clusters based on the number of matching categories between data points. (This is in contrast to the more well-known k-means algorithm, which clusters numerical data based on Euclidean distance.) The k-prototypes algorithm combines k-modes and k-means and is able to cluster mixed numerical / categorical data.
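
To make the idea of "matching categories" concrete, here is a minimal sketch in plain Python. It is not the library's implementation; the function names `matching_dissimilarity` and `column_modes` are made up for this illustration. It counts mismatching positions between two records and computes the per-column mode that serves as a cluster centroid.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Count the positions at which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(records):
    """Return the most frequent category of each column (the cluster 'mode')."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*records)]


records = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "small", "round"),
]

mode = column_modes(records)
distances = [matching_dissimilarity(r, mode) for r in records]

print(mode)       # → ['red', 'small', 'round']
print(distances)  # → [0, 1, 1]
```

A lower count means a closer match, so each record is assigned to the cluster whose mode it mismatches least.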

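Implemented are:

  • k-modes [HUANG97] [HUANG98]
  • k-modes with initialization based on density [CAO09]
  • k-prototypes [HUANG97]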

The code is modeled after the clustering algorithms in scikit-learn and has the same familiar interface.

I would love to have more people play around with this and give me feedback on my implementation. If you come across any issues in running or installing kmodes, please submit a bug report.

Enjoy!

Installation

kmodes can be installed using pip:

pip install kmodes

To upgrade to the latest version (recommended), run it like this:

pip install --upgrade kmodes

kmodes can also conveniently be installed with conda from the conda-forge channel:

conda install -c conda-forge kmodes

Alternatively, you can build the latest development version from source:

git clone https://github.com/nicodv/kmodes.git
cd kmodes
python setup.py install

Usage

import numpy as np
from kmodes.kmodes import KModes

# random categorical data
data = np.random.choice(20, (100, 10))

km = KModes(n_clusters=4, init='Huang', n_init=5, verbose=1)

clusters = km.fit_predict(data)

# Print the cluster centroids
print(km.cluster_centroids_)

The examples directory showcases simple use cases of both k-modes ('soybean.py') and k-prototypes ('stocks.py').

Parallel execution

The k-modes and k-prototypes implementations both offer support for multiprocessing via the joblib library, similar to e.g. scikit-learn's implementation of k-means, using the n_jobs parameter. It generally does not make sense to set more jobs than there are processor cores available on your system.

This potentially speeds up any execution with more than one initialization try, n_init > 1, which may be helpful to reduce the execution time for larger problems. Note that it depends on your problem whether multiprocessing actually helps, so be sure to try that out first. You can check out the examples for some benchmarks.
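
As a rough sketch of the advice above, the helper below caps the number of jobs at both the number of initialization runs and the number of cores. `pick_n_jobs` and `cluster_in_parallel` are hypothetical names for this example; only `KModes` and its `n_clusters`, `init`, `n_init` and `n_jobs` parameters come from the library.

```python
import os


def pick_n_jobs(n_init, available_cores=None):
    """Cap the number of jobs at both n_init and the available core count."""
    cores = available_cores or os.cpu_count() or 1
    return max(1, min(n_init, cores))


def cluster_in_parallel(data, n_clusters=4, n_init=5):
    """Fit k-modes with at most one worker per initialization run."""
    # Imported lazily so that pick_n_jobs can be used without kmodes installed.
    from kmodes.kmodes import KModes

    km = KModes(n_clusters=n_clusters, init='Huang', n_init=n_init,
                n_jobs=pick_n_jobs(n_init))
    return km.fit_predict(data)


print(pick_n_jobs(n_init=5, available_cores=8))  # → 5
print(pick_n_jobs(n_init=5, available_cores=2))  # → 2
```

More jobs than initialization runs leaves workers idle, which is why the helper also caps at `n_init`.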

FAQ

Q: I'm seeing errors such as "TypeError: '<' not supported between instances of 'str' and 'float'" when using the k-prototypes algorithm.

A: One or more of your numerical feature columns have string values in them. Make sure that all columns have consistent data types.
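
One way to track down the offending column is sketched below in plain Python. `mixed_type_columns` is a hypothetical helper, not part of kmodes. It reports every column that holds values of more than one Python type.

```python
def mixed_type_columns(rows):
    """Return {column index: sorted type names} for columns with more than one type."""
    mixed = {}
    for idx, column in enumerate(zip(*rows)):
        type_names = sorted({type(value).__name__ for value in column})
        if len(type_names) > 1:
            mixed[idx] = type_names
    return mixed


rows = [
    (1.5, "a", 10.0),
    (2.0, "b", "n/a"),  # a stray string in a numerical column
    (0.7, "a", 12.5),
]

print(mixed_type_columns(rows))  # → {2: ['float', 'str']}
```

Note that this strict check also flags a column mixing int and float values, which is usually harmless.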


Q: How does k-prototypes know which of my features are numerical and which are categorical?

A: You tell it which column indices are categorical using the categorical argument. All others are assumed numerical. E.g., clusters = KPrototypes().fit_predict(X, categorical=[1, 2])
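
If the categorical columns are the ones holding strings, the index list can be derived rather than typed by hand. The sketch below assumes exactly that; `categorical_indices` is a hypothetical helper, not a kmodes function.

```python
def categorical_indices(rows):
    """Return the indices of columns in which any value is a string."""
    return [
        idx
        for idx, column in enumerate(zip(*rows))
        if any(isinstance(value, str) for value in column)
    ]


X = [
    (5.1, "red", "yes", 3),
    (4.9, "blue", "no", 7),
]

print(categorical_indices(X))  # → [1, 2]
```

The result can be passed as the categorical argument. Categories encoded as integers are not detected this way and still have to be listed manually.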


Q: I'm getting the following error, what gives? "ModuleNotFoundError: No module named 'kmodes.kmodes'; 'kmodes' is not a package".

A: Make sure your working file is not called 'kmodes.py', because it would shadow the kmodes package on import.


Q: I'm getting the following error: "ValueError: Clustering algorithm could not initialize. Consider assigning the initial clusters manually."

A: This is a feature, not a bug. kmodes is telling you that it can't make sense of the data you are presenting it. At least, not with the parameters you are setting the algorithm with. It is up to you, the data scientist, to figure out why. Some hints to possible solutions:

  • Run with fewer clusters as the data might not support a large number of clusters
  • Explore and visualize your data, checking for weird distributions, outliers, etc.
  • Clean and normalize the data
  • Increase the ratio of rows to columns
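
As a quick first check for the "fewer clusters" hint, you can count the distinct rows in your data. This is only a sketch of one possible diagnostic, not something kmodes does for you, and `count_distinct_rows` is a name made up for this example.

```python
def count_distinct_rows(rows):
    """Count unique records; a rough upper bound on a sensible n_clusters."""
    return len({tuple(row) for row in rows})


rows = [
    ("a", "x"),
    ("a", "x"),
    ("b", "y"),
    ("a", "x"),
]

n_clusters = 4
distinct = count_distinct_rows(rows)
print(distinct)  # → 2

if n_clusters > distinct:
    print(f"n_clusters={n_clusters} exceeds the {distinct} distinct rows; try fewer clusters")
```

Having enough distinct rows does not guarantee that initialization succeeds, but having too few is a likely cause of trouble.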

Q: I'm getting the following error: "ValueError: Input contains NaN, infinity, or a value too large for dtype('float64')."

A: Following scikit-learn, the k-modes algorithm does not accept np.nan values in the X matrix. Fill in the missing data in a way that makes sense for the problem at hand.
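
Two common ways to fill a categorical column are sketched below in plain Python: a dedicated "missing" category, or the most frequent observed value. Both are assumptions about what makes sense for your data, and `is_missing` and `fill_missing` are hypothetical helpers, not part of kmodes.

```python
import math
from collections import Counter


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fill_missing(column, strategy="sentinel", sentinel="missing"):
    """Fill missing entries with a sentinel category or the most frequent value."""
    if strategy == "mode":
        present = [v for v in column if not is_missing(v)]
        if not present:
            raise ValueError("column has no observed values")
        fill = Counter(present).most_common(1)[0][0]
    else:
        fill = sentinel
    return [fill if is_missing(v) else v for v in column]


colors = ["red", None, "blue", "red", float("nan")]

print(fill_missing(colors))                   # → ['red', 'missing', 'blue', 'red', 'missing']
print(fill_missing(colors, strategy="mode"))  # → ['red', 'red', 'blue', 'red', 'red']
```

A sentinel keeps "unknown" visible as its own category, while the mode strategy pulls incomplete rows toward the majority.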


Q: How would you like your library to be cited?

A: Something along these lines would do nicely:

@Misc{devos2015,
  author = {Nelis J. de Vos},
  title = {kmodes categorical clustering library},
  howpublished = {\url{https://github.com/nicodv/kmodes}},
  year = {2015--2021}
}

References

[HUANG97] Huang, Z.: Clustering large data sets with mixed numeric and categorical values, Proceedings of the First Pacific Asia Knowledge Discovery and Data Mining Conference, Singapore, pp. 21-34, 1997.
[HUANG98] Huang, Z.: Extensions to the k-modes algorithm for clustering large data sets with categorical values, Data Mining and Knowledge Discovery 2(3), pp. 283-304, 1998.
[CAO09] Cao, F., Liang, J., Bai, L.: A new initialization method for categorical data clustering, Expert Systems with Applications 36(7), pp. 10223-10228, 2009.
