pajaskowiak / dbcv


Density-Based Clustering Validation


dbcv's Introduction

DBCV

Density-Based Clustering Validation

This is the source code used to compute DBCV (Density-Based Clustering Validation) in the following paper:

Density-Based Clustering Validation. Davoud Moulavi, Pablo A. Jaskowiak, Ricardo J. G. B. Campello, Arthur Zimek, and Jörg Sander. Proceedings of the 2014 SIAM International Conference on Data Mining (SDM). 2014, 839-847

You can read the paper here: https://epubs.siam.org/doi/10.1137/1.9781611973440.96

The MATLAB source code in this repository is by Davoud Moulavi and Pablo A. Jaskowiak.

Usage

To compute the DBCV index of a partition, run dbcv.m. The function returns the DBCV index for that partition, a value between -1 and +1. DBCV is a maximization index: higher values correspond to better partitions according to the index. The required inputs are the dataset (without labels) and the partition, that is, the cluster labels of a clustering solution. The dataset should be formatted with one object per row and one feature per column. The partition is a 1-dimensional array of integer cluster assignments. The number of labels must match the number of objects in the dataset.
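
The input requirements above can be checked before calling dbcv.m. The following is only a minimal sketch: the toy values in data and partition are made up for illustration, and the checks are not part of the repository code.

```matlab
% Hypothetical toy inputs: 6 objects (rows) with 2 features (columns).
data = [0.0 0.1; 0.2 0.0; 0.1 0.2; 5.0 5.1; 5.2 5.0; 5.1 5.2];
% One integer cluster label per object; 0 is reserved for noise.
partition = [1; 1; 1; 2; 2; 2];

% The partition must be a 1-dimensional array.
assert(isvector(partition), 'partition must be a 1-dimensional array');
% There must be exactly one label per object (row) of the dataset.
assert(numel(partition) == size(data, 1), ...
    'number of labels must match number of objects');
% Labels must be integer-valued.
assert(all(partition == round(partition)), 'labels must be integers');

n_objects = size(data, 1);
n_features = size(data, 2);
fprintf('%d objects, %d features, %d labels\n', ...
    n_objects, n_features, numel(partition));

% With src as the working directory, the index would then be computed as:
% val = dbcv(data, partition);
```

The call to dbcv is left commented out so that the checks run on their own, without the repository on the path.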

Important note: noise is represented by label 0 (zero). Any object with a label of zero in the partition is therefore considered noise. By definition, singletons (clusters with a single object) are also treated as noise by the measure (please see the paper for a discussion of this topic).
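
To see in advance which objects the measure will treat as noise, singletons can be relabeled to 0 explicitly. This is a small sketch on a made-up partition. It mirrors the convention described above and is not taken from dbcv.m, which applies the rule internally.

```matlab
% Hypothetical partition: cluster 1 has three objects, cluster 2 is a
% singleton, and two objects are already labeled as noise (0).
partition = [1; 1; 1; 2; 0; 0];

% Cluster labels present in the partition, excluding the noise label.
labels = unique(partition(partition ~= 0));
% Number of objects assigned to each cluster.
counts = arrayfun(@(c) sum(partition == c), labels);
% Clusters with exactly one object are singletons.
singleton_labels = labels(counts == 1);

% Relabel singletons as noise, leaving all other labels unchanged.
cleaned = partition;
cleaned(ismember(partition, singleton_labels)) = 0;

n_noise = sum(cleaned == 0);
disp(cleaned');
```

Counting the zeros in cleaned shows how many objects end up outside every cluster under this convention.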

Example: Computing DBCV on a Synthetic Dataset

This example assumes that you are running MATLAB and that your current working directory is src.

%first we load the dataset
load ../data/dataset_1.txt

%we run dbcv on the dataset (first two columns),
%passing its ground truth as partition (column 3).
val = dbcv(dataset_1(:,1:2),dataset_1(:,3))

%val should be 0.6149. Note that this is low given
%that some noise points actually overlap the clusters.

%we can also plot the dataset with the following
plot_clusters(dataset_1(:,1:2),dataset_1(:,3))

%that's all :)

Datasets

The folder data contains the four synthetic datasets employed in our paper. These are shown below. Plots in PDF and PNG format can be found in the folder plots. If you use any of these datasets in a publication, please cite both our paper and this repository.

Other resources

An implementation of DBCV is also available in the R package clusterConfusion.

Disclaimer

This prototype was developed for research purposes only and was not implemented with computational performance in mind. Its running time should therefore not be evaluated against, or taken as a fair comparison with, that of other measures.

Contact

If you find any bugs or problems, please get in touch. You can reach me here.


dbcv's Issues

dataset_1.txt noise examples appear to be labeled as '-1'

Hello,

I know the example provided in the package README is a synthetic example intended to showcase a basic program execution. However, I believe the score presented may be a bit misleading because in the dataset used, "dataset_1.txt", the noise instances appear to be labeled as "-1", not "0" as assumed by the package implementation itself. As far as I understand, this means they are considered a real cluster during the DBCV computation, thus substantially modifying the estimated metric score (reported estimation=0.6149, estimation w/ labels fixed=0.8576).

The same issue presumably applies to all other example datasets.
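
If the noise labels are indeed stored as -1, as reported above, they can be remapped to 0 before scoring. The sketch below uses made-up labels. The commented lines assume the column layout of the README example and have not been run against the actual files.

```matlab
% Hypothetical labels in the style the issue describes: -1 marks noise.
labels = [1; 1; 2; 2; -1; -1];

% Check which noise convention the labels use before scoring.
has_negative_noise = any(labels == -1);

% Remap -1 to 0 so that noise matches the convention stated in the README.
fixed = labels;
fixed(fixed == -1) = 0;

% For the example dataset, the same remapping would apply to column 3
% (assumption: columns 1-2 hold the features, column 3 the labels):
% fixed = dataset_1(:,3);
% fixed(fixed == -1) = 0;
% val = dbcv(dataset_1(:,1:2), fixed);
```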

Also, am I correct in assuming that the distance metric used during the DBCV computation is the squared Euclidean distance? I understand this is a legitimate choice; I just want to confirm that my understanding is correct.
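
For reference, the difference between the two metrics named in the question can be sketched as follows. This illustrates squared versus plain Euclidean pairwise distances on made-up points. It does not confirm which metric dbcv.m uses.

```matlab
% Hypothetical points: three objects with two features each.
X = [0 0; 3 4; 6 8];

% Squared Euclidean distances via ||a||^2 + ||b||^2 - 2*a'*b.
sq_norms = sum(X.^2, 2);
D2 = bsxfun(@plus, sq_norms, sq_norms') - 2 * (X * X');
% Clamp small negative values that rounding can produce on real data.
D2 = max(D2, 0);

% Plain Euclidean distances are the element-wise square roots.
D = sqrt(D2);

disp(D2);
disp(D);
```

Squaring preserves the ordering of the distances but changes their relative magnitudes, which is why the choice can affect a density-based score.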
