Comments (17)
This is on our list of things to add, but I cannot give you a timeline at this point.
from dbscan.
There's an implementation in R here you can look at: https://github.com/elbamos/largeVis
from dbscan.
Thanks! I will have a look.
from dbscan.
Hi :) I was actually checking in here because I'm implementing dbscan and optics, but I'm the author of that hdbscan implementation. Let me know if you want to chat.
@zachmayer is there a reason you don't want to use my implementation?
from dbscan.
@elbamos I am using your implementation now, but the interface is a bit different and it's taking me a minute to figure it out.
from dbscan.
Let me know if you need help - probably better not to cluster this package's issues though.
from dbscan.
@elbamos We are looking into HDBSCAN for our package. As far as I understand, your package uses sparse matrix representation for the distance matrix. This is very different from what we do here, where we do all the calculations on the original data (using a kd-tree).
from dbscan.
One of the features of the package is very efficient generation of the distance matrix which is, yes, then stored as a sparse matrix. The reason being, if someone is clustering/visualizing in one way (hdbscan, largevis, dbscan, etc), they probably want to try out alternatives as well. So no need to force them to calculate nearest neighbors more than once. (This is the logic of adding clustering to the package in the first place.) And for large datasets managing a dense distance matrix isn't feasible.
On Sep 21, 2016, at 8:47 PM, Michael Hahsler [email protected] wrote:
@elbamos We are looking into HDBSCAN for our package. As far as I understand, your package uses sparse matrix representation for the distance matrix. This is very different from what we do here, where we do all the calculations on the original data (using a kd-tree).
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
from dbscan.
Forgive my ignorance, but I believe that HDBSCAN requires/uses a dense matrix representation though, does it not? Perhaps I'm not following here.
from dbscan.
It doesn't need to be completely dense. The likelihood that the edge between a point and it's Kth nearest neighbor would affect the running of Prim's algorithm declines rapidly as K increases.
This is actually a key insight behind the LargeVis algorithm. In these neighbor-distance-based models, the role of distant neighbors is so minimal to the outcome that you can drop them (or, in LargeVis, use negative sampling) and still get a good approximation.
largeVis gives you is a density matrix for each point's k nearest neighbors. In typical usage, it's expected that K will be 100-150, because that's the usual number for the LargeVis visualization algorithm.
Using a sparse matrix can affect things in a few circumstances. Generally if there's any effect at all, its that the graph is not fully connected, and you get a minimum spanning forest rather than a single minimum spanning tree. The connections that are missing are ones that you don't want within a single cluster anyway; this is very similar to how OPTICS works. As long as K is reasonable for the size of the dataset, I think it would take a pretty pathological dataset for the effect to be negative.
The other situation where it could matter is if a point is very dense but its neighbors are not dense. I guess that's possible.
You're correct that my implementation is an approximation of hsbscan since the density matrix is sparse. That's true of any implementation where the neighbor search is approximate.
The largeVis package in general is aimed at very large high-dimensional datasets, in the millions of rows and hundreds of columns. Dense distance matrices become infeasible in terms of both storage size and computation time.
Btw - i made a "clusteringdatasets" package on my repo, if we want to trade results, it might be useful to have common datasets that account for a variety of situations.
On Sep 21, 2016, at 9:22 PM, Matt Piekenbrock [email protected] wrote:
Forgive my ignorance, but I believe that HDBSCAN requires/uses a dense matrix representation though, does it not? Perhaps I'm not following here.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
from dbscan.
@mhahsler @zachmayer Take a look at http://htmlpreview.github.io/?https://github.com/elbamos/largeVis/blob/release/0.1.10/inst/doc/momentumandusedata.html
This (draft) vignette, if you scroll down a little, visualizes the results of dbscan, optics, and hdbscan with a variety of hyperparameters. I think it illustrates the advantage of hdbscan nicely.
from dbscan.
Very cool! Thank you.
from dbscan.
For reference, the newest version of the dbscan package now on CRAN was just released with an HDBSCAN implementation.
There's also a small vignette available that illustrates some of the "features" of HDBSCAN Campello et al have proposed in a few papers. You can view it with:
vignette("hdbscan", package="dbscan")
If you run into any issues, please don't hesitate to post them!
from dbscan.
You may want to check out https://arxiv.org/abs/1705.07321 which describes how to compute HDBSCAN using spatial indexing (rather than dense distance matrices) for significant performance improvements.
from dbscan.
@lmcinnes Thanks for putting the info up on arxiv. Using something like DTB or some other means of reducing the complexity below O(n^2), avoiding the full distance computation, is definitely something of interest in the future, I'm just not sure if/when I'll have the time to do something similar. The KD tree used by this package is built using the ANN library, which internally supports Minkowski metrics and uses some tricks (such as incremental distance calculations) to operate--translating the structure to allow the pruning of the neighbor queries isn't immediately transparent to me how to do, although the end solution may actually be simple.
from dbscan.
I'm pretty sure my implementation of hdbscan in largeVis
runs in O(n log (n))
...
from dbscan.
Ok. I don't know the runtime complexity, nor the source code, of the largeVis
implementation either.
However, this seems off topic for this issue. @zachmayer brought up that HDBSCAN* would be a good fit for the scope of the dbscan
package, and in fact, at the time he did, such an implementation was in development. Whatever exists in the largeVis
package just doesn't really seem relevant to the original issue.
The HDBSCAN* implementation in the dbscan
package runs in O(n^2) time as of right now, due to the required prerequisite calculation of the distances using the dist
function, although a precomputed one can be passed in if available to avoid this. Using some form of spatial indexing to avoid the unnecessary overhead of the (n choose 2) distance calculation, as @lmcinnes brought up, is definitely a useful idea and something planned for the future.
However, the goal of the dbscan
package is to include simple to use, efficient, and correct implementations of some of the more prominent dbscan-family of algorithms for the R platform.
Asymptotic complexity is important, but it's not the only component that matters. I regularly use models built from upwards of 40,000 points, computed in a relatively short amount of time. This seems sufficient for many use cases.
For much larger use cases, there seems to a significantly larger effort put into the scalability of the SciKit implementation due to the use of various spatial indexing structures, many of which may or may not be out of the scope of this package.
from dbscan.
Related Issues (20)
- some strange results of sNN function HOT 7
- Discrepancies in outlier score between HDBSCAN R and python HOT 7
- Implement Density-Based Clustering Validation (DBCV) HOT 3
- BD-trees
- DBSCAN with categorica/factor/dummy variables HOT 1
- hdbscan HOT 2
- LOF edge case HOT 2
- LOF fails after upgrading to dbscan 1.1-6 HOT 2
- Possible Memory Leak HOT 2
- kNN crashing (segfault) when matrix has Inf values HOT 1
- mrdist error in large datasets HOT 3
- frNN object created from scratch couldn't be used in dbscan HOT 6
- Error in mrd(x_dist, coredist) : number of mutual reachability distance values and size of the distances do not agree. HOT 6
- DBSCAN for trajectories HOT 4
- Getting an error when using predict: x has to be a numeric matrix. HOT 2
- may you clarify is multi-density clustering is implemented, since it is mentioned on references ? HOT 1
- R session aborted in pointdensity() HOT 4
- Add broom tidier methods HOT 9
- Allow setting cluster_selection_epsilon in hdbscan() HOT 4
- `hdbscan` documenting params that are not used HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from dbscan.