
HDBSCAN (dbscan issue, 17 comments, closed)

mhahsler commented on July 19, 2024
HDBSCAN

from dbscan.

Comments (17)

mhahsler commented on July 19, 2024

This is on our list of things to add, but I cannot give you a timeline at this point.


zachmayer commented on July 19, 2024

There's an implementation in R here you can look at: https://github.com/elbamos/largeVis


mhahsler commented on July 19, 2024

Thanks! I will have a look.


elbamos commented on July 19, 2024

Hi :) I was actually checking in here because I'm implementing dbscan and optics; I'm also the author of that hdbscan implementation. Let me know if you want to chat.

@zachmayer is there a reason you don't want to use my implementation?


zachmayer commented on July 19, 2024

@elbamos I am using your implementation now, but the interface is a bit different and it's taking me a minute to figure it out.


elbamos commented on July 19, 2024

Let me know if you need help - probably better not to cluster this package's issues though.


mhahsler commented on July 19, 2024

@elbamos We are looking into HDBSCAN for our package. As far as I understand, your package uses a sparse matrix representation for the distance matrix. This is very different from what we do here, where we do all the calculations on the original data (using a kd-tree).


elbamos commented on July 19, 2024

One of the features of the package is very efficient generation of the distance matrix, which is, yes, then stored as a sparse matrix. The reasoning: if someone is clustering/visualizing one way (hdbscan, largevis, dbscan, etc.), they probably want to try out alternatives as well, so there's no need to force them to calculate nearest neighbors more than once. (This is the logic of adding clustering to the package in the first place.) And for large datasets, managing a dense distance matrix isn't feasible.



peekxc commented on July 19, 2024

Forgive my ignorance, but I believe that HDBSCAN requires/uses a dense matrix representation though, does it not? Perhaps I'm not following here.


elbamos commented on July 19, 2024

It doesn't need to be completely dense. The likelihood that the edge between a point and its Kth nearest neighbor would affect the running of Prim's algorithm declines rapidly as K increases.

This is actually a key insight behind the LargeVis algorithm. In these neighbor-distance-based models, the role of distant neighbors is so minimal to the outcome that you can drop them (or, in LargeVis, use negative sampling) and still get a good approximation.

What largeVis gives you is a distance matrix restricted to each point's k nearest neighbors. In typical usage, it's expected that K will be 100-150, because that's the usual number for the LargeVis visualization algorithm.

Using a sparse matrix can affect things in a few circumstances. Generally if there's any effect at all, it's that the graph is not fully connected, and you get a minimum spanning forest rather than a single minimum spanning tree. The connections that are missing are ones that you don't want within a single cluster anyway; this is very similar to how OPTICS works. As long as K is reasonable for the size of the dataset, I think it would take a pretty pathological dataset for the effect to be negative.

The other situation where it could matter is if a point is very dense but its neighbors are not dense. I guess that's possible.

You're correct that my implementation is an approximation of hdbscan, since the distance matrix is sparse. That's true of any implementation where the neighbor search is approximate.

The largeVis package in general is aimed at very large high-dimensional datasets, in the millions of rows and hundreds of columns. Dense distance matrices become infeasible in terms of both storage size and computation time.

By the way, I made a "clusteringdatasets" package on my repo; if we want to trade results, it might be useful to have common datasets that account for a variety of situations.


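The spanning-forest behavior described above is easy to see with a toy sketch (illustrative Python, not the largeVis code; all names are mine): build an edge list from each point's k nearest neighbors, then run Kruskal's algorithm. When no kNN edge crosses a gap in the data, the result is a forest with one tree per connected component.

```python
import math
import random

def knn_edges(points, k):
    """Undirected edges (dist, i, j) linking each point to its k nearest neighbors."""
    edges = set()
    for i, p in enumerate(points):
        nearest = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for d, j in nearest[:k]:
            edges.add((d, min(i, j), max(i, j)))
    return sorted(edges)

def spanning_forest(n, edges):
    """Kruskal's algorithm with union-find; returns the forest's edge list."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    forest = []
    for d, i, j in edges:  # edges are already sorted by distance
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            forest.append((d, i, j))
    return forest

# Two well-separated blobs: every point's k nearest neighbors lie in its own
# blob, so no edge crosses the gap and the "MST" is really a spanning forest.
random.seed(1)
pts = [(random.random(), random.random()) for _ in range(20)] + \
      [(random.random() + 100.0, random.random()) for _ in range(20)]
forest = spanning_forest(len(pts), knn_edges(pts, k=5))
n_components = len(pts) - len(forest)  # a forest with E edges has n - E trees
print("trees in the forest:", n_components)  # at least 2, one per blob
```

As the comment argues, the missing connections are exactly the long cross-gap edges you would not want inside a single cluster anyway.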

elbamos commented on July 19, 2024

@mhahsler @zachmayer Take a look at http://htmlpreview.github.io/?https://github.com/elbamos/largeVis/blob/release/0.1.10/inst/doc/momentumandusedata.html

This (draft) vignette, if you scroll down a little, visualizes the results of dbscan, optics, and hdbscan with a variety of hyperparameters. I think it illustrates the advantage of hdbscan nicely.


zachmayer commented on July 19, 2024

Very cool! Thank you.


peekxc commented on July 19, 2024

For reference, the newest version of the dbscan package, just released on CRAN, includes an HDBSCAN implementation.

There's also a small vignette available that illustrates some of the "features" of HDBSCAN that Campello et al. have proposed in a few papers. You can view it with:

vignette("hdbscan", package="dbscan")

If you run into any issues, please don't hesitate to post them!


lmcinnes commented on July 19, 2024

You may want to check out https://arxiv.org/abs/1705.07321 which describes how to compute HDBSCAN using spatial indexing (rather than dense distance matrices) for significant performance improvements.


peekxc commented on July 19, 2024

@lmcinnes Thanks for putting the info up on arXiv. Using something like dual-tree Boruvka, or some other means of reducing the complexity below O(n^2) by avoiding the full distance computation, is definitely of interest for the future; I'm just not sure if/when I'll have the time to do something similar. The kd-tree used by this package is built with the ANN library, which internally supports Minkowski metrics and uses some tricks (such as incremental distance calculations). How to adapt that structure to allow pruning of the neighbor queries isn't immediately clear to me, although the end solution may turn out to be simple.


elbamos commented on July 19, 2024

I'm pretty sure my implementation of hdbscan in largeVis runs in O(n log n)...


peekxc commented on July 19, 2024

Ok. I don't know the runtime complexity, nor the source code, of the largeVis implementation either.

However, this seems off topic for this issue. @zachmayer brought up that HDBSCAN* would be a good fit for the scope of the dbscan package, and in fact, at the time he did, such an implementation was in development. Whatever exists in the largeVis package just doesn't really seem relevant to the original issue.

The HDBSCAN* implementation in the dbscan package currently runs in O(n^2) time, due to the prerequisite calculation of all pairwise distances with the dist function, although a precomputed distance matrix can be passed in to avoid this. Using some form of spatial indexing to avoid the unnecessary overhead of the (n choose 2) distance calculation, as @lmcinnes brought up, is definitely a useful idea and something planned for the future.

However, the goal of the dbscan package is to include simple-to-use, efficient, and correct implementations of some of the more prominent algorithms in the DBSCAN family for the R platform.

Asymptotic complexity is important, but it's not the only component that matters. I regularly use models built from upwards of 40,000 points, computed in a relatively short amount of time. This seems sufficient for many use cases.

For much larger use cases, there seems to be a significantly larger effort put into the scalability of the SciKit implementation, due to the use of various spatial indexing structures, many of which may or may not be in scope for this package.

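The O(n^2) prerequisite step discussed above can be sketched in a few lines (illustrative Python; the function and variable names are mine, not the dbscan package's internals): compute all pairwise distances, take each point's core distance as the distance to its minPts-th nearest neighbor, and form the mutual reachability distance max(core_i, core_j, d(i, j)) that HDBSCAN* builds its minimum spanning tree on.

```python
import math

def mutual_reachability(points, min_pts):
    """Dense mutual reachability matrix; the all-pairs distance step is O(n^2)."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]  # all (n choose 2) pairs
    # Core distance of i: distance to its min_pts-th nearest neighbor. Here the
    # sorted row includes d(i, i) = 0; conventions vary on whether a point
    # counts as its own neighbor.
    core = [sorted(row)[min_pts] for row in d]
    # Mutual reachability distance: max(core_i, core_j, d(i, j)).
    return [[max(core[i], core[j], d[i][j]) for j in range(n)] for i in range(n)]

pts = [(0, 0), (0, 1), (1, 0), (10, 10)]
mr = mutual_reachability(pts, min_pts=2)
# HDBSCAN* would now build a minimum spanning tree over mr and extract the
# cluster hierarchy from it; spatial indexing attacks the d matrix above.
```

Passing a precomputed distance matrix, as the comment notes, simply skips the quadratic `d` computation inside this step.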
