
This project forked from dataombudsman/incdbscan


Implementation of IncrementalDBSCAN clustering.

License: BSD 3-Clause "New" or "Revised" License



IncrementalDBSCAN

incdbscan is an implementation of IncrementalDBSCAN, the incremental version of the DBSCAN clustering algorithm.

IncrementalDBSCAN lets the user update the clustering by inserting or deleting data points. The algorithm yields the same result as applying DBSCAN to the modified data set, but without actually reapplying it.

Thus, IncrementalDBSCAN is ideal when the data set to cluster is so large that applying DBSCAN to the whole of it would be costly, but for the purposes of the application it is enough to update an already existing clustering by inserting or deleting some data points.
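As a quick, hedged sanity check of the equivalence claim above, the following sketch builds a clustering incrementally in two batches and compares it with a one-shot scikit-learn DBSCAN run. The data set and parameter values here are illustrative, not taken from the package's documentation.

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

from incdbscan import IncrementalDBSCAN

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# One-shot DBSCAN over the full data set
batch_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Incremental build: insert the data in two halves
inc = IncrementalDBSCAN(eps=0.5, min_pts=5)
inc.insert(X[:150])
inc.insert(X[150:])
inc_labels = inc.get_cluster_labels(X)

# Adjusted Rand index is 1.0 for identical clusterings (up to label
# permutation), so a value of ~1.0 supports the equivalence claim
print(adjusted_rand_score(batch_labels, inc_labels))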

The implementation is based on the following paper. To see what's new compared to the paper, jump to Notes on the IncrementalDBSCAN paper.

Ester, Martin; Kriegel, Hans-Peter; Sander, Jörg; Wimmer, Michael; Xu, Xiaowei (1998). Incremental Clustering for Mining in a Data Warehousing Environment. In: Proceedings of the 24th International Conference on Very Large Data Bases (VLDB 1998).

(Figure: incdbscan illustration)

Table of Contents

  • Highlights
  • Installation
  • Usage
  • Performance
  • Known shortcomings

Highlights

The incdbscan package is an implementation of the IncrementalDBSCAN algorithm by Ester et al. It comes with about 40 unit tests covering diverse cases and with corrections to the original paper.

Installation

incdbscan is on PyPI, and can be installed with pip:

pip install incdbscan

The package requires at least Python 3.7.1.

Usage

The algorithm is implemented in the IncrementalDBSCAN class.

There are 3 methods to use:

  • insert for inserting data points into the clustering
  • delete for deleting data points from the clustering
  • get_cluster_labels for obtaining cluster labels

All methods take a batch of data points in the form of an array of shape (n_samples, n_features) (similar to the scikit-learn API).

from sklearn.datasets import load_iris
X = load_iris()['data']
X_1, X_2 = X[:100], X[100:]

from incdbscan import IncrementalDBSCAN
clusterer = IncrementalDBSCAN(eps=0.5, min_pts=5)

# Insert 1st batch of data points and get their labels
clusterer.insert(X_1)
labels_part1 = clusterer.get_cluster_labels(X_1)

# Insert 2nd batch and get labels of all points in a one-liner
labels_all = clusterer.insert(X_2).get_cluster_labels(X)

# Delete 1st batch and get labels for 2nd batch
clusterer.delete(X_1)
labels_part2 = clusterer.get_cluster_labels(X_2)

For a longer description of usage, check out the notebook developed just for that!
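Since the point of the algorithm is keeping a clustering up to date as data comes and goes, a common pattern is a sliding window over a stream. Below is a hedged sketch of that pattern; the stream, window size, and parameter values are illustrative and not part of the package's API.

from collections import deque

import numpy as np

from incdbscan import IncrementalDBSCAN

rng = np.random.default_rng(0)
stream = rng.normal(size=(500, 2))   # stand-in for a real data stream

window = 200                         # keep only the latest 200 points
buffer = deque()
clusterer = IncrementalDBSCAN(eps=0.5, min_pts=5)

for point in stream:
    point = point.reshape(1, -1)     # insert expects shape (n_samples, n_features)
    clusterer.insert(point)
    buffer.append(point)
    if len(buffer) > window:
        clusterer.delete(buffer.popleft())   # evict the oldest point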

Performance

Performance has two components: insertion and deletion cost.

(Figure: incdbscan performance comparison)

The cost of inserting a new data point into IncrementalDBSCAN is quite small and grows more slowly than the cost of applying (scikit-learn's) DBSCAN to a whole data set. In other words, given a data set D clustered with IncrementalDBSCAN, if we want to see which cluster a new object P would belong to, it is faster to insert P into the current IncrementalDBSCAN clustering than to apply DBSCAN to the union of D and P.
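The following rough timing sketch (not a rigorous benchmark) illustrates the trade-off just described: inserting one new point P into an existing clustering of D versus re-running DBSCAN on the union of D and P from scratch. Data and parameters are illustrative.

import time

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

from incdbscan import IncrementalDBSCAN

X, _ = make_blobs(n_samples=2000, centers=3, random_state=0)
D, P = X[:-1], X[-1:]       # treat the last row as the new object P

inc = IncrementalDBSCAN(eps=0.5, min_pts=5)
inc.insert(D)               # existing clustering of D

start = time.perf_counter()
inc.insert(P)               # incremental update with P
t_insert = time.perf_counter() - start

start = time.perf_counter()
DBSCAN(eps=0.5, min_samples=5).fit(X)   # DBSCAN on D plus P, from scratch
t_dbscan = time.perf_counter() - start

print(f"insert: {t_insert:.4f}s, full DBSCAN: {t_dbscan:.4f}s")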

The cost of deleting a data point from IncrementalDBSCAN grows faster than the cost of applying (scikit-learn's) DBSCAN to the data set minus that data point. Thus, deletion in IncrementalDBSCAN is quite cheap below a certain data set size, but becomes more expensive as the data set grows.

These results do not imply that it is efficient to cluster a whole data set through a series of IncrementalDBSCAN insertions. If we compare the time to cluster a data set with DBSCAN against the time to cluster it by adding the points one by one to IncrementalDBSCAN, IncrementalDBSCAN will be slower. As a typical figure, clustering 8,000 data points takes about 10-20 seconds with this implementation.

See this notebook about performance for more details.

Known shortcomings

  • Batch insertion: In the current implementation, batch insertion of data points is not efficient, since pairwise distances between new points and existing points are not calculated with matrix multiplication (see the sketch after this list).
  • Deletion: Deleting a data point can take long in big data sets (big clusters) because of a graph traversal step whose cost grows with cluster size. Whether this step can be sped up is not clear.
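For reference, this is the kind of vectorized computation the first bullet refers to: squared Euclidean distances between a batch of new points and the existing points via a single matrix product, using ||a - b||^2 = ||a||^2 + ||b||^2 - 2(a · b). This is a sketch of the general technique, not code from the package.

import numpy as np

def pairwise_sq_dists(new_points, existing):
    # (n_new, 1) and (1, n_existing) squared norms broadcast against
    # the (n_new, n_existing) cross term computed by a single matmul
    sq_new = (new_points ** 2).sum(axis=1)[:, None]
    sq_old = (existing ** 2).sum(axis=1)[None, :]
    cross = new_points @ existing.T
    # clip tiny negatives caused by floating-point error
    return np.maximum(sq_new + sq_old - 2.0 * cross, 0.0)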
