nredell / rari

A Python package that implements a distance-based extension of the adjusted Rand index for the supervised validation of two cluster analysis solutions

License: MIT License

rari's Introduction

rari is a Python implementation of Pinto et al.'s ranked adjusted Rand index (RARI) from "Ranked Adjusted Rand: integrating distance and partition information in a measure of clustering agreement". RARI is an extension of the adjusted Rand index (ARI) that measures the agreement between two independent clustering solutions while incorporating the distances between instances/clusters from each solution.

  • RARI = 1: Perfect agreement between cluster solutions 'A' and 'B'. Identical cluster partitions and equally ranked relative distances between clusters in cluster solutions 'A' and 'B'.

  • RARI = 0: No agreement between cluster solutions 'A' and 'B'. Only occurs when, in cluster solution 'A', all instances are in the same cluster and, in cluster solution 'B', all instances are in their own cluster and all clusters are equidistant from each other.

Roughly speaking, the benefit of RARI is in penalizing the ARI when a given pair of instances is close together in cluster solution 'A' and far apart in cluster solution 'B'.
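
For reference, the plain ARI that RARI extends depends only on the two label vectors. The block below is a minimal sketch of the standard pair-counting ARI formula using only the standard library; the function name adjusted_rand is illustrative and this is not code from rari. It shows that the ARI takes no distance information as input, which is the gap RARI is meant to fill.

```python
from collections import Counter
from math import comb


def adjusted_rand(labels_a, labels_b):
    """Pair-counting adjusted Rand index for two label sequences of equal length."""
    n = len(labels_a)
    if n < 2:
        return 1.0  # assumption: treat a trivial input as perfect agreement

    # Pairs of instances placed together in both solutions, in A only, and in B only.
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = pairs_a * pairs_b / comb(n, 2)  # agreement expected by chance
    maximum = (pairs_a + pairs_b) / 2          # upper bound on the agreement
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both solutions have a single cluster
    return (pairs_joint - expected) / (maximum - expected)


# Identical partitions with swapped label names agree perfectly.
same_partition = adjusted_rand([0, 0, 1, 1], [1, 1, 0, 0])        # 1.0
# One big cluster vs. all singletons: the label pattern described for RARI = 0.
one_vs_singletons = adjusted_rand([0] * 5, [0, 1, 2, 3, 4])       # 0.0
```

Because no distance matrix appears anywhere in this calculation, two pairs of solutions with the same labels but very different cluster spacing receive the same ARI.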

Lightning Example

  • Below is a comparison of the agreement between hierarchical and k-means clustering solutions on the iris data set. Here the same Euclidean distance matrix is supplied for both solutions, but this is not a requirement.
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import pairwise_distances
from rari import rari

X = load_iris().data

model_1 = AgglomerativeClustering(n_clusters=3, linkage='ward')
x = model_1.fit_predict(X)

model_2 = KMeans(n_clusters=3)
y = model_2.fit_predict(X)

dist_x = pairwise_distances(X, metric='euclidean')
dist_y = pairwise_distances(X, metric='euclidean')

rari(x, y, dist_x, dist_y)

Out[1]: .975

Install

  • Development
pip install git+https://github.com/nredell/rari

Intuition

Below is Figure 1 from Pinto et al.'s article, which demonstrates the impact of inter-cluster distances on the RARI metric as compared to, say, the ARI.

Examples

Example 1: ARI vs. RARI, Few Clusters, High Agreement

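import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import adjusted_rand_score, pairwise_distances
from rari import rari

# Simulate three 2D blobs of 50 points each; labels_true holds the generating blob of each point.
X, labels_true = make_blobs(n_samples=[50, 50, 50], n_features=2, cluster_std=1.0, center_box=(-5.0, 5.0), shuffle=True, random_state=224)
data = pd.DataFrame(np.hstack([X, labels_true[:, np.newaxis]]), columns=["X1", "X2", "Cluster"])

# Cluster solution 'A': hierarchical clustering with Ward linkage.
model_1 = AgglomerativeClustering(n_clusters=3, linkage='ward')
x = model_1.fit_predict(X)

# Cluster solution 'B': k-means.
model_2 = KMeans(n_clusters=3)
y = model_2.fit_predict(X)

# Pairwise distances between instances for each solution (the same matrix here).
dist_x = pairwise_distances(X, metric='euclidean')
dist_y = pairwise_distances(X, metric='euclidean')

# Compare the label-only ARI with the distance-aware RARI.
print("ARI:", adjusted_rand_score(x, y))
print("RARI:", rari(x, y, dist_x, dist_y))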

ARI: .83 RARI: .89

Example 2: ARI vs. RARI, A New Data Point

The toy 1D example below illustrates how the RARI changes as the distance between clusters changes while the ARI, which depends only on the labels, stays the same.

Imagine that the moving data point is a new observation added to the data set, after which each of the two models is re-run and the clusters are re-labeled. For the sake of illustration, the labels that each model assigns to this new point are held constant across all 11 analyses to isolate the impact of cluster spacing. In a real problem, the moving data point would likely be classified as a '2' as it approaches the yellow '2' on the right-hand side of each plot. However, this change of labels may not occur even in a simple 2D example with a method like spectral clustering. Our intuitions tend to fail in higher dimensions, but RARI can account for these changes in cluster orientation if so desired.
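
The setup described above can be sketched without any plotting. The block below is an illustration under stated assumptions: the coordinates, the labels, and the helper pair_agreements are all hypothetical and are not taken from the original figure or from rari. It counts the instance pairs on which the two label vectors agree (the quantity that the Rand index and ARI are built from) and tracks how far the moving point is from the nearest member of cluster '2' at each of 11 positions.

```python
from itertools import combinations


def pair_agreements(labels_a, labels_b):
    """Count instance pairs that both solutions treat the same way
    (together in both, or apart in both)."""
    agree = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += int(same_a == same_b)
    return agree


# Hypothetical 1D layout: three points on the left, three on the right.
fixed_points = [0, 1, 2, 14, 15, 16]

# Hypothetical labels from two models; the last entry is the moving point,
# whose label is held constant at '1' in both solutions.
labels_a = [1, 1, 1, 2, 2, 2, 1]
labels_b = [1, 1, 2, 2, 2, 2, 1]

agreement_counts = []
gaps = []
for position in range(3, 14):  # 11 positions, moving left to right
    # The labels never change, so the pair agreement is the same at every step.
    agreement_counts.append(pair_agreements(labels_a, labels_b))
    # Distance from the moving point to the nearest cluster-'2' point in solution 'A'.
    gaps.append(min(abs(position - p)
                    for p, lab in zip(fixed_points, labels_a) if lab == 2))
```

The agreement count is identical at every step while the gap shrinks from 11 to 1, which is the kind of change a distance-aware index can register and a label-only index cannot.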

Implementation Details

At present, inter-cluster distances are based on the Euclidean distance between pairs of instances in dist_x and dist_y. That is to say, even if the input pairwise distance matrices are computed with, for example, cosine and Manhattan metrics, the inter-cluster distance ranks are still based on a Euclidean, complete-linkage measure of these pairwise distances. This will be relaxed in the future with support for additional input arguments.
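
To make "complete linkage" and "distance ranks" concrete, here is a minimal sketch that derives inter-cluster distances from a pairwise distance matrix and then ranks them. It is not rari's internal code: the function names complete_linkage and rank_distances, the toy data, and the tie handling (none) are assumptions made for illustration.

```python
def complete_linkage(dist, labels):
    """Map each pair of clusters to the largest pairwise distance
    between their members (complete linkage)."""
    clusters = sorted(set(labels))
    linkage = {}
    for idx, c1 in enumerate(clusters):
        for c2 in clusters[idx + 1:]:
            linkage[(c1, c2)] = max(
                dist[i][j]
                for i, label_i in enumerate(labels) if label_i == c1
                for j, label_j in enumerate(labels) if label_j == c2
            )
    return linkage


def rank_distances(linkage):
    """Rank cluster pairs from closest (1) to farthest; ties are not handled."""
    ordered = sorted(linkage, key=linkage.get)
    return {pair: rank for rank, pair in enumerate(ordered, start=1)}


# Toy 1D data: the pairwise distance matrix is the absolute difference of positions.
points = [0, 1, 5, 6, 20]
labels = [0, 0, 1, 1, 2]
dist = [[abs(p - q) for q in points] for p in points]

linkage = complete_linkage(dist, labels)  # {(0, 1): 6, (0, 2): 20, (1, 2): 15}
ranks = rank_distances(linkage)           # {(0, 1): 1, (1, 2): 2, (0, 2): 3}
```

Working with ranks rather than raw distances means only the ordering of the inter-cluster distances matters in this sketch, not their scale.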
