
ndgigliotti / cluster-optimizer


A GridSearchCV-like hyperparameter optimizer for clustering (no cross-validation).

License: BSD 3-Clause "New" or "Revised" License

Languages: Python 47.38%, Jupyter Notebook 52.62%

Topics: scikit-learn, clustering, clustering-evaluation, unsupervised-clustering, hyperparameter-optimization, hyperparameter-tuning, hyperparameter-search, transductive-learning, supervised-clustering


cluster-optimizer

Installation

You can install this package with pip using the following command:

pip install git+https://github.com/ndgigliotti/cluster-optimizer.git@main

Purpose

This project provides a simple, Scikit-Learn-compatible hyperparameter optimization tool for clustering. It's intended for situations where predicting clusters for new data points is a low priority. Many clustering algorithms in Scikit-Learn are transductive, meaning that they label only the data they are fit on and are not designed to be applied to new observations. Even with an inductive clustering algorithm like K-Means, you might have no need to predict clusters for new observations, or prediction might matter less than finding the best clusters in the data at hand.

Scikit-Learn's GridSearchCV relies on cross-validation and is designed to optimize inductive machine learning models, so an alternative tool is necessary for this use case.
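
One quick way to see the transductive/inductive distinction is to check which Scikit-Learn clusterers expose a predict method. This is a minimal sketch that uses only Scikit-Learn; the `kinds` dictionary is just an illustrative name and is not part of this package.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans

# Inductive clusterers expose predict() for unseen data;
# transductive ones only label the data they were fit on.
kinds = {}
for estimator in (KMeans(n_clusters=3), DBSCAN(), AgglomerativeClustering()):
    name = type(estimator).__name__
    kinds[name] = "inductive" if hasattr(estimator, "predict") else "transductive"
    print(f"{name}: {kinds[name]}")
```

Estimators in the transductive group cannot be scored on held-out folds, which is why a cross-validated search does not fit them well.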

ClusterOptimizer

The ClusterOptimizer class is a hyperparameter search tool for optimizing clustering algorithms. It simply fits one model per hyperparameter combination and selects the best-scoring one. It's a spin-off of GridSearchCV, and the implementation is derived from Scikit-Learn. The main differences are that it doesn't use cross-validation and that it is designed to work with special clustering scorers. It's not always necessary to provide a target variable, since clustering metrics such as silhouette, Calinski-Harabasz, and Davies-Bouldin are designed for unsupervised clustering.

The interface is largely the same as GridSearchCV. One minor difference is that the search results are stored in the results_ attribute, rather than cv_results_.
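
To make the "one model per hyperparameter combination, no cross-validation" idea concrete, here is a rough sketch of that loop written with plain Scikit-Learn building blocks. It is not ClusterOptimizer's implementation and does not use its API; the toy data, the grid, and the names `results` and `best` are assumptions made for illustration.

```python
from sklearn.base import clone
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import ParameterGrid

# Toy data: three well-separated blobs (illustrative only).
X, _ = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [0, 0], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)

base_estimator = KMeans(n_init=10, random_state=0)
param_grid = {"n_clusters": [2, 3, 4, 5]}

results = []
for params in ParameterGrid(param_grid):
    # Fresh copy of the estimator for each hyperparameter combination.
    model = clone(base_estimator).set_params(**params)
    # Fit once on the full dataset: no train/test splits, no folds.
    labels = model.fit_predict(X)
    results.append({"params": params, "score": silhouette_score(X, labels)})

# Keep the combination with the highest score.
best = max(results, key=lambda r: r["score"])
print(best["params"])
```

Because every candidate is fit and scored on the same full dataset, the cost is one fit per combination rather than one fit per combination per fold.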

Transductive Clustering Scorers

You can use ClusterOptimizer by passing the string name of a Scikit-Learn clustering metric, e.g. 'silhouette', 'calinski_harabasz', or 'rand_score' (the '_score' suffix is optional). You can also create a special scorer for transductive clustering using scorer.make_scorer on any score function with the signature score_func(labels_true, labels_fit) or score_func(X, labels_fit).
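
The two score function signatures can be illustrated with small wrappers around Scikit-Learn metrics. This is a sketch under assumptions: the function names are hypothetical, negating Davies-Bouldin so that higher is better is a choice made here rather than something this package is known to require, and the scorer.make_scorer call itself is left out because its import path and options are not described here.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, davies_bouldin_score


def neg_davies_bouldin(X, labels_fit):
    """Unsupervised form, score_func(X, labels_fit).

    Davies-Bouldin is lower-is-better, so it is negated here (an assumption)
    to make larger values mean better clusterings.
    """
    return -davies_bouldin_score(X, labels_fit)


def ari(labels_true, labels_fit):
    """Supervised form, score_func(labels_true, labels_fit)."""
    return adjusted_rand_score(labels_true, labels_fit)


# Toy data with known labels (illustrative only).
X, y = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [0, 0], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Both forms score the labels assigned during fitting, not predictions on new data.
labels_fit = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X).labels_
unsupervised_score = neg_davies_bouldin(X, labels_fit)
supervised_score = ari(y, labels_fit)
print(unsupervised_score, supervised_score)
```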

Recognized Scorer Names

Note that the '_score' suffix is always optional.

  • 'silhouette_score'
  • 'silhouette_score_euclidean'
  • 'silhouette_score_cosine'
  • 'davies_bouldin_score'
  • 'calinski_harabasz_score'
  • 'mutual_info_score'
  • 'normalized_mutual_info_score'
  • 'adjusted_mutual_info_score'
  • 'rand_score'
  • 'adjusted_rand_score'
  • 'completeness_score'
  • 'fowlkes_mallows_score'
  • 'homogeneity_score'
  • 'v_measure_score'

Caveats

Comparing Clustering Algorithms

It's important to consider your dataset and goals before comparing clustering algorithms in a grid search. Just because one algorithm gets a higher score than another does not necessarily make it a better choice. Different clustering algorithms have different benefits, drawbacks, and use cases.
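
A small experiment shows why a higher score alone can mislead. The sketch below uses only Scikit-Learn on a synthetic two-moons dataset; the dataset, the eps value, and the `comparison` dictionary are assumptions chosen for illustration. On data like this, K-Means typically earns the higher silhouette score, while DBSCAN agrees more closely with the true moon labels.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Two interleaved half-circles: non-convex clusters with known labels.
X, y = make_moons(n_samples=500, noise=0.05, random_state=0)

candidates = {
    "KMeans": KMeans(n_clusters=2, n_init=10, random_state=0),
    "DBSCAN": DBSCAN(eps=0.2, min_samples=5),
}

comparison = {}
for name, estimator in candidates.items():
    labels = estimator.fit_predict(X)
    comparison[name] = {
        # Internal metric: rewards compact, well-separated groups.
        "silhouette": silhouette_score(X, labels),
        # External metric: agreement with the known moon membership.
        "ari_vs_truth": adjusted_rand_score(y, labels),
    }
    print(name, comparison[name])
```

Silhouette favors compact, convex groups, so it is biased toward the kind of partition K-Means produces regardless of whether that partition matches the structure you care about.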

Future Work

  • Write automated tests.
  • Develop alternative to BaseSearchCV.
  • Add multi-metric compatibility.
  • Remove noise "cluster" and impose noise limit.
  • Update docstrings taken from Scikit-Learn.
  • Add more search types (e.g. randomized).
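
The noise item above can be pictured with DBSCAN, which labels noise points as -1. Scikit-Learn's metrics treat -1 as just another cluster label unless it is filtered out. The following is only a guess at what such a feature might look like, written with plain Scikit-Learn; MAX_NOISE_FRACTION and the scoring rule are hypothetical and not part of this package.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Two tight blobs plus three isolated outliers (illustrative data).
blobs, _ = make_blobs(
    n_samples=100, centers=[[0, 0], [10, 10]], cluster_std=0.3, random_state=0
)
outliers = np.array([[5.0, 5.0], [-5.0, 8.0], [15.0, -3.0]])
X = np.vstack([blobs, outliers])

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
is_noise = labels == -1  # DBSCAN marks noise with the label -1
noise_fraction = is_noise.mean()

# Scoring with noise left in treats the -1 points as one more "cluster".
raw_score = silhouette_score(X, labels)
# Scoring after masking out noise evaluates only the real clusters.
clean_score = silhouette_score(X[~is_noise], labels[~is_noise])

# Hypothetical noise limit: reject candidates that discard too much data.
MAX_NOISE_FRACTION = 0.10
final_score = clean_score if noise_fraction <= MAX_NOISE_FRACTION else float("-inf")
print(raw_score, clean_score, noise_fraction, final_score)
```

Without some limit, masking noise would reward hyperparameters that label most of the data as noise and keep only a few very tight clusters.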

Credits

Most of the credit goes to the developers of Scikit-Learn for the engineering behind the search estimators. It's not very hard to spam a bunch of models with different hyperparameters, but it's hard to do it in a robust way with a friendly interface and wide compatibility.

