mhahsler / dbscan Goto Github PK

View Code? Open in Web Editor NEW

291.0 14.0 62.0 8.94 MB

Density Based Clustering of Applications with Noise (DBSCAN) and Related Algorithms - R package

License: GNU General Public License v3.0

R 35.19% C++ 56.50% C 2.54% TeX 5.77%

dbscan clustering lof optics density-based-clustering cran r hdbscan

dbscan's Introduction

R package dbscan - Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and Related Algorithms

Introduction

This R package (Hahsler, Piekenbrock, and Doran 2019) provides a fast C++ (re)implementation of several density-based algorithms with a focus on the DBSCAN family for clustering spatial data. The package includes:

Clustering

DBSCAN: Density-based spatial clustering of applications with noise (Ester et al. 1996).
HDBSCAN: Hierarchical DBSCAN with simplified hierarchy extraction (Campello et al. 2015).
OPTICS/OPTICSXi: Ordering points to identify the clustering structure clustering algorithms (Ankerst et al. 1999).
FOSC: Framework for Optimal Selection of Clusters for unsupervised and semisupervised clustering of hierarchical cluster tree (Campello, Moulavi, and Sander 2013).
Jarvis-Patrick clustering: Shared Nearest Neighbor Graph partitioning (Jarvis and Patrick 1973).
SNN Clustering: Shared Nearest Neighbor Clustering (Ertöz, Steinbach, and Kumar 2003).

Outlier Detection

LOF: Local outlier factor algorithm (Breunig et al. 2000).
GLOSH: Global-Local Outlier Score from Hierarchies algorithm (Campello et al. 2015).

Fast Nearest-Neighbor Search (using kd-trees)

kNN search
Fixed-radius NN search

The implementations use the kd-tree data structure (from library ANN) for faster k-nearest neighbor search, and are typically faster than the native R implementations (e.g., dbscan in package fpc), or the implementations in WEKA, ELKI and Python’s scikit-learn.

The following R packages use dbscan: AFM, bioregion, CLONETv2, ClustAssess, cordillera, CPC, crosshap, daltoolbox, DDoutlier, diceR, dobin, doc2vec, dPCP, EHRtemporalVariability, eventstream, evprof, FCPS, fdacluster, FORTLS, funtimes, FuzzyDBScan, karyotapR, ksharp, LOMAR, maotai, metaCluster, mlr3cluster, MOSS, oclust, openSkies, opticskxi, OTclust, pagoda2, parameters, ParBayesianOptimization, performance, rMultiNet, seriation, sfdep, sfnetworks, sharp, shipunov, smotefamily, snap, spdep, spNetwork, squat, ssMRCD, stream, supc, synr, tidySEM, weird

To cite package ‘dbscan’ in publications use:

Hahsler M, Piekenbrock M, Doran D (2019). “dbscan: Fast Density-Based Clustering with R.” Journal of Statistical Software, 91(1), 1-30. doi:10.18637/jss.v091.i01 https://doi.org/10.18637/jss.v091.i01.

@Article{,
  title = {{dbscan}: Fast Density-Based Clustering with {R}},
  author = {Michael Hahsler and Matthew Piekenbrock and Derek Doran},
  journal = {Journal of Statistical Software},
  year = {2019},
  volume = {91},
  number = {1},
  pages = {1--30},
  doi = {10.18637/jss.v091.i01},
}

Installation

Stable CRAN version: Install from within R with

install.packages("dbscan")

Current development version: Install from r-universe.

install.packages("dbscan",
    repos = c("https://mhahsler.r-universe.dev",
              "https://cloud.r-project.org/"))

Usage

Load the package and use the numeric variables in the iris dataset

library("dbscan")

data("iris")
x <- as.matrix(iris[, 1:4])

DBSCAN

db <- dbscan(x, eps = 0.42, minPts = 5)
db

## DBSCAN clustering for 150 objects.
## Parameters: eps = 0.42, minPts = 5
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 3 cluster(s) and 29 noise points.
## 
##  0  1  2  3 
## 29 48 37 36 
## 
## Available fields: cluster, eps, minPts, metric, borderPoints

Visualize the resulting clustering (noise points are shown in black).

pairs(x, col = db$cluster + 1L)

OPTICS

opt <- optics(x, eps = 1, minPts = 4)
opt

## OPTICS ordering/clustering for 150 objects.
## Parameters: minPts = 4, eps = 1, eps_cl = NA, xi = NA
## Available fields: order, reachdist, coredist, predecessor, minPts, eps,
##                   eps_cl, xi

Extract DBSCAN-like clustering from OPTICS and create a reachability plot (extracted DBSCAN clusters at eps_cl=.4 are colored)

opt <- extractDBSCAN(opt, eps_cl = 0.4)
plot(opt)

HDBSCAN

hdb <- hdbscan(x, minPts = 4)
hdb

## HDBSCAN clustering for 150 objects.
## Parameters: minPts = 4
## The clustering contains 2 cluster(s) and 0 noise points.
## 
##   1   2 
## 100  50 
## 
## Available fields: cluster, minPts, coredist, cluster_scores,
##                   membership_prob, outlier_scores, hc

Visualize the hierarchical clustering as a simplified tree. HDBSCAN finds 2 stable clusters.

plot(hdb, show_flat = TRUE)

Using dbscan with tidyverse

dbscan provides for all clustering algorithms tidy(), augment(), and glance() so they can be easily used with tidyverse, ggplot2 and tidymodels.

library(tidyverse)
db <- x %>%
    dbscan(eps = 0.42, minPts = 5)

Get cluster statistics as a tibble

tidy(db)

## # A tibble: 4 × 3
##   cluster  size noise
##   <fct>   <int> <lgl>
## 1 0          29 TRUE 
## 2 1          48 FALSE
## 3 2          37 FALSE
## 4 3          36 FALSE

Visualize the clustering with ggplot2 (use an x for noise points)

augment(db, x) %>%
    ggplot(aes(x = Petal.Length, y = Petal.Width)) + geom_point(aes(color = .cluster,
    shape = noise)) + scale_shape_manual(values = c(19, 4))

Using dbscan from Python

R, the R package dbscan, and the Python package rpy2 need to be installed.

import pandas as pd
import numpy as np

### prepare data
iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', 
                   header = None, 
                   names = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species'])
iris_numeric = iris[['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']]

# get R dbscan package
from rpy2.robjects import packages
dbscan = packages.importr('dbscan')

# enable automatic conversion of pandas dataframes to R dataframes
from rpy2.robjects import pandas2ri
pandas2ri.activate()

db = dbscan.dbscan(iris_numeric, eps = 0.5, MinPts = 5)
print(db)

## DBSCAN clustering for 150 objects.
## Parameters: eps = 0.5, minPts = 5
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 2 cluster(s) and 17 noise points.
## 
##  0  1  2 
## 17 49 84 
## 
## Available fields: cluster, eps, minPts, dist, borderPoints

# get the cluster assignment vector
labels = np.array(db.rx('cluster'))
labels

## array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
##         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
##         1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2,
##         2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0,
##         2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0, 2, 0, 0,
##         2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0,
##         2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]],
##       dtype=int32)

License

The dbscan package is licensed under the GNU General Public License (GPL) Version 3. The OPTICSXi R implementation was directly ported from the ELKI framework’s Java implementation (GNU AGPLv3), with permission by the original author, Erich Schubert.

Changes

List of changes from NEWS.md

References

Ankerst, Mihael, Markus M Breunig, Hans-Peter Kriegel, and Jörg Sander. 1999. “OPTICS: Ordering Points to Identify the Clustering Structure.” In ACM Sigmod Record, 28:49–60. 2. ACM. https://doi.org/10.1145/304181.304187.

Breunig, Markus M, Hans-Peter Kriegel, Raymond T Ng, and Jörg Sander. 2000. “LOF: Identifying Density-Based Local Outliers.” In ACM Int. Conf. On Management of Data, 29:93–104. 2. ACM. https://doi.org/10.1145/335191.335388.

Campello, Ricardo JGB, Davoud Moulavi, and Jörg Sander. 2013. “Density-Based Clustering Based on Hierarchical Density Estimates.” In Pacific-Asia Conference on Knowledge Discovery and Data Mining, 160–72. Springer. https://doi.org/10.1007/978-3-642-37456-2_14.

Campello, Ricardo JGB, Davoud Moulavi, Arthur Zimek, and Joerg Sander. 2015. “Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection.” ACM Transactions on Knowledge Discovery from Data (TKDD) 10 (1): 5. https://doi.org/10.1145/2733381.

Ertöz, Levent, Michael Steinbach, and Vipin Kumar. 2003. “Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data.” In Proceedings of the 2003 SIAM International Conference on Data Mining (SDM), 47–58. https://doi.org/10.1137/1.9781611972733.5.

Ester, Martin, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. 1996. “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.” In Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), 226–31. https://dl.acm.org/doi/10.5555/3001460.3001507.

Hahsler, Michael, Matthew Piekenbrock, and Derek Doran. 2019. “dbscan: Fast Density-Based Clustering with R.” Journal of Statistical Software 91 (1): 1–30. https://doi.org/10.18637/jss.v091.i01.

Jarvis, R. A., and E. A. Patrick. 1973. “Clustering Using a Similarity Measure Based on Shared Near Neighbors.” IEEE Transactions on Computers C-22 (11): 1025–34. https://doi.org/10.1109/T-C.1973.223640.

dbscan's People

Contributors

Stargazers

Watchers

Forkers

gabrielaolinto linearregression psy2013github asgr lelayf lyimage mitchsanders timkiely paulponcet taekyunk hellodhr yayahjb yxiaowhut learn24x7 joezzr cao-dut viv-r wush978 kno10 zschuster salrue nnu-gisa stjordanis lakehui moredatapls sadanandann zorrodong constant1355 phymucs mza0150 rjlallana uchiha-938 michael-chen-24 damirpolat eduardokapp minghao2016 microly limemeba taggarrykopr 1ndevelopment haozhou789 dingluchuan aminzayer dfqytcom navrobot liudxhit cooparation chengwei920412 lvulis lukasgd ecstasygu sandy4321 xzwang12345 henrifnk yyl-20020115 hanyghazal79 beamiter ustlzh gerhobbelt joeroe straycat2486 m-muecke

dbscan's Issues

Add broom tidier methods

Hello, thank you for this very helpful package.

Would you be open to a pull request adding broom tidier methods for dbscan and hdbscan objects?

DBSCAN for trajectories

I would like to analyze trajectories (x, y, t) with multiple values of x and y. First I tried scikit-learn on Python (like this QA: https://stackoverflow.com/questions/52926477/run-dbscan-on-trajectories) but it turned out that we cannot do DBSCAN of multiple x and y values in scikit-learn (only x or y, or single x and single y would work).
With R, I attempted to do DBSCAN with below data, but it required a matrix.

data <- list(list(c(0, 1), c(0, 1)), list(c(1, 1), c(1, 1)), list(c(1, 3), c(1, 3)))

Is it possible to do DBSCAN for multiple x and y values with your package? If possible, how can I input the data?
Like
x1, x2, x3, ..., y1, y2, y3, ... without correspondence of x and y? (Hopefully there should be a correspondence of x and y...)

R session aborted in pointdensity()

Hey!

I'm using tidySEM to visualise structural equation models. The package calls dbscan::pointdensity() at some point.

For a specific case, this led to an "R session aborted" when calling pointdensity().

I can consistently reproduce this on my machine with the following MWE:

library(dbscan)

tmp <- structure(list(x = c(5, 6, 7, 9, 10, 11.25, 5.05, 6.3, 7, 9, 
9.7, 10.95, 3, 3, 3.3, 3.3, 3, 3, 13, 13, 12.7, 12.7, 13, 13, 
8, 8, 5.95, 10.05, 8, 6, 9.75, 5.75, 10, 8, NaN), y = c(16.05, 
16.05, 16.05, 16.05, 16.05, 15.8, 4, 3.75, 3.95, 3.95, 3.75, 
4, 13.05, 12.05, 11.25, 8.75, 7.95, 6.95, 13.05, 12.05, 11.25, 
8.75, 7.95, 6.95, 11.95, 8.05, 10, 10, 10, 12, 11.75, 7.75, 8, 
10, NaN)), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 
10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 
23L, 24L, 30L, 31L, 32L, 33L, 58L, 59L, 60L, 61L, 62L, 63L, 35L
), class = "data.frame")

pointdensity(x = tmp, eps = 5)

I hope this is reproducible.

Best,
Paul

LOF fails after upgrading to dbscan 1.1-6

This command previously would work now it fails:

dbscan::lof(x = dt, k = 3, sort = F, approx = 1.1)
Error in kNN(x, k, sort = TRUE, ...) : formal argument "sort" matched by multiple actual arguments

Problems compiling with intel compilers (icpc)

Kind of duplicate to #20

We get an error compiling buildHDBSCAN.cpp with intels icpc on our HPC:

buildHDBSCAN.cpp(46): error: more than one operator "==" matches these operands:
            built-in operator "pointer == pointer"
            function "Rcpp::operator==(Rcpp::Na_Proxy, SEXP)"
            operand types are: Rcpp::internal::generic_name_proxy<19, Rcpp::PreserveStorage> == SEXP
    if (!hcl.containsElementNamed("labels") || hcl["labels"] == R_NilValue){

from the command

icpc -std=c++11 -I/cm/shared/uniol/software/R/3.3.1-intel-2016b/lib64/R/include -DNDEBUG  -I/cm/shared/uniol/software/imkl/11.3.3.210-iimpi-2016b/mkl/include -I/cm/shared/uniol/software/libreadline/6.3-intel-2016b/include -I/cm/shared/uniol/software/ncurses/6.0-intel-2016b/include -I/cm/shared/uniol/software/bzip2/1.0.6-intel-2016b/include -I/cm/shared/uniol/software/XZ/5.2.2-intel-2016b/include -I/cm/shared/uniol/software/zlib/1.2.8-intel-2016b/include -I/cm/shared/uniol/software/SQLite/3.13.0-intel-2016b/include -I/cm/shared/uniol/software/PCRE/8.38-intel-2016b/include -I/cm/shared/uniol/software/libpng/1.6.23-intel-2016b/include -I/cm/shared/uniol/software/libjpeg-turbo/1.5.0-intel-2016b/include -I/cm/shared/uniol/software/LibTIFF/4.0.6-intel-2016b/include -I/cm/shared/uniol/software/Java/1.8.0_112/include -I/cm/shared/uniol/software/Tcl/8.6.5-intel-2016b/include -I/cm/shared/uniol/software/Tk/8.6.5-intel-2016b/include -I/cm/shared/uniol/software/cURL/7.49.1-intel-2016b/include -I/cm/shared/uniol/software/libxml2/2.9.4-intel-2016b/include -I/cm/shared/uniol/software/GDAL/2.1.0-intel-2016b/include -I/cm/shared/uniol/software/PROJ/4.9.2-intel-2016b/include -I/cm/shared/uniol/software/GMP/6.1.1-intel-2016b/include -I"/user/sebo8575/R/x86_64-pc-linux-gnu-library/3.3/Rcpp/include"   -fpic  -O2 -xHost -ftz -fp-speculation=safe -fp-model source -c buildHDBSCAN.cpp -o buildHDBSCAN.o

which could be solved by just casting hcl["labels"] - seems Intel is more picky than gcc. The first patch is:

diff --git a/src/buildHDBSCAN.cpp b/src/buildHDBSCAN.cpp
index 89e8e71..d8dee7b 100644
--- a/src/buildHDBSCAN.cpp
+++ b/src/buildHDBSCAN.cpp
@@ -43,7 +43,7 @@ List buildDendrogram(List hcl) {
   NumericVector height = hcl["height"];
   IntegerVector order = hcl["order"];
   List labels = List(); // allows to avoid type inference 
-  if (!hcl.containsElementNamed("labels") || hcl["labels"] == R_NilValue){
+  if (!hcl.containsElementNamed("labels") || (SEXP)hcl["labels"] == R_NilValue){
     labels = seq_along(order); 
   } else { 
     labels = as<StringVector>(hcl["labels"]);

maybe you can integrate a similar patch in the next release.

Counting one too many clusters?

Running DBScan with MinPts=1 and eps = 50 on the attached data returns 11 clusters, i am expecting 10 clusters ... am i missing something?

Column A is X Data, Column B is Y Data. Please note this data is always organized smallest to largest and the Y Data is always 1.

Please look at tab 42127348110000, i have added tab 42127348110000-manual clusters to show my assumptions for clusters.

I am running R as follows:

data <- read.csv(file.names[i])
res <- dbscan(data, eps=50, MinPts = 1

42041320940000.xlsx
)

thanks.

Had an issue with the object '.ANNsplitRule'

Trying to learn and use dbscan in R recently. When I try the simple example found in the document,

data(iris)
iris <- as.matrix(iris[,1:4])
kNNdistplot(iris, k = 5)
abline(h=.5, col = "red", lty=2)
res <- dbscan(iris, eps = .5, minPts = 5)

goes through everything until it hits the dbscan command (last line). The error thrown is as following

Error in pmatch(toupper(splitRule), .ANNsplitRule) : object '.ANNsplitRule' not found**

Any insights on what I've missed?

Implement Density-Based Clustering Validation (DBCV)

http://epubs.siam.org/doi/pdf/10.1137/1.9781611973440.96

Document parameters better when benchmarking

You do not give details on the indexes chosen in the comparison methods. From the runtime numbers, it appears that you did not add enable an index in ELKI? Do the runtimes include JVM startup cost, R startup cost, or did you use internal runtime measurements (e.g. -time).
https://rdrr.io/cran/dbscan/f/inst/doc/dbscan.pdf

I suggest to make separate benchmarks with and without indexing, and experiment with different indexes such as the cover tree which performed quite well in other studies. Benchmarking an implementation without index vs. one with index obviously is unfair. For OPTICS, I suggest you use a small epsilon whenever using an index, for obvious reasons.

Furthermore, benchmarks with such tiny data sets (<=8000 instances) are likely not very informative. Startup costs such as the JVM hotspot compilation will contribute substantially.
Here is one slightly larger (but still small!), but real data and 13k instances: https://cs.joensuu.fi/sipu/datasets/MopsiLocationsUntil2012-Finland.txt and with appropriate parameters this should work with both DBSCAN and OPTICS. For more meaningful results, I suggest to use at minimum 30.000-50.000 instances of real data, e.g. coordinates from Tweets (supposedly not shareable) or Flickr images (could be shareable?). With indexes, the data size should probably go up to a million points.

Also the R dbscan package documentation should be more upfront about the performance strongly depending on the use of Euclidean distance because of the kd-tree. The ANN library is great, but this is a substantial limitation. For a tweet locations example, Haversine distance is much more appropriate.

Sorry that I have not yet prepared a technical report on the filtering step. It has some reasonable theory behind it, but it is too incremental as that a major conference would publish this, so there only exists a draft report.

Both will be very valuable for users, and therefore belong into the manual

runtime of OPTICS does depend on keeping epsilon small
other distances than Euclidean can be used, but will be a lot slower (in this package). Demonstrate how to use Haversine distance for geo coordinates, because this is a very important use case.

Cutting with HDBSCAN, get membership probability matrix for each observation?

I noticed that it is possible to extract arbitrary number of clusters with hdbscan by using the cutree function on the hc component of the hdbscan output. But is there any simple ways to get the membership probabilities for each element given the fixed number of cluster? I.e. a matrix which gives the cluster probabilities for each element and cluster (such as fanny? in cluster`)?

Issue with predicting cluster labeling, using DBSCAN object and Gower distance matrix for new data in R

Hi there, I'm having issue with predicting cluster labeling for a test data, based on a dbscan clustering model on the training data. I used gower distance matrix when creating the model:

> gowerdist_train <- daisy(analdata_train,
                   metric = "gower",
                   stand = FALSE,
                   type = list(asymm = c(5,6)))

Using this gowerdist matrix, the dbscan clustering model created was:
> sb <- dbscan(gowerdist_train, eps = .23, minPts = 50)

Then I try to use predict to label a test dataset using the above dbscan object:
> predict(sb, newdata = analdata_test, data = analdata_train)

But I receive the following error:

Error in frNN(rbind(data, newdata), eps = object$eps, sort = TRUE,
...) : x has to be a numeric matrix

I can take a guess on where this error might be coming from, which is probably due to the absence of the gower distance matrix that hasn't been created for the test data. My question is, should I create a gower distance matrix for all data (datanal_train + datanal_test) separately and feed it into predict? how else would the algorithm know what the distance of test data from the train data is, in order to label?
In that case, would the newdata parameter be the new gower distance matrix that contains ALL (train + test) data? and the data parameter in predict would be the training distance matrix, gowerdist_train?
What I am not quite sure about is how would the predict algorithm distinguish between the test and train data set in the newly created gowerdist_all matrix?
The two matrices (new gowerdist for all data and the gowerdist_train) would obviously not have the same dimensions. Also, it doesn't make sense to me to create a gower distance matrix only for the test data because distances must be relative to the test data, not the test data itself?
It hasn’t been very clear in the dbscan documentation how to use the predict function for distance matrix data (as opposed to raw data)

tried using gower distance matrix for all data (train + test) as my new data and received an error when fed to predict:

> gowerdist_all <- daisy(rbind(analdata_train, analdata_test),
                         metric = "gower",
                         stand = FALSE,
                         type = list(asymm = c(5,6)))

> test_sb_label <- predict(sb, newdata = gowerdist_all, data = gowerdist_train)

ERROR: Error in 1:nrow(data) : argument of length 0 In addition: Warning message: In rbind(data, newdata) : number of columns of result is not a multiple of vector length (arg 1)

HDBScan

Or Hierarchical Density-Based Spatial Clustering of Applications with Noise. Python implementation here: https://github.com/lmcinnes/hdbscan

It's nice, because you don't have to tune the eps parameter, which makes it a little easier to use. It'd be awesome to have an implementation of hdbscan in the R package dbscan.

Thank you!

Weights in dbscan()

Hi,
Changing back from my comment earlier. Some of the unexpected behavior that I saw was due to bugs in my own code -- a good thing!
Still one thing remains that I find odd: Using uniform random weights does improve clustering... at least in my data set.
But the issue can be closed!

Discrepancies in outlier score between HDBSCAN R and python

Dear author,

I have started a detailed examination of the hdbscan function contained in the dbscan package because I observed an anomalous distribution of the outlier scores for my dataset. In fact, the scores tended to group all around two values: 0 and 1.

In order to grasp the reasons behind this problem I have tried to apply the hdbscan function both in R and in python to the same dataset.

The result that I got is that using the python the distribution of the outlier scores makes a lot more sense than the one outputted by R on the same dataset.

Is there a known issue on this?

Would you be able to fix this?

Thanks in advance for your feedback

Best regards

Matteo

Not able to install

The link to download and install is not working for me. I used -
install.packages("dbscan")

http://cran.rstudio.com/bin/windows/contrib/3.2/dbscan_0.9-6.zip

Export internal functions of hdbscan

It would be of great use (in other packages and experimentation) to export internal functions (of HDBSCAN) like mrd, computeStability which are valuable as standalone functions.

mrdist error in large datasets

Thank you for the fantastic dbscan package.

I believe I am finding an error when trying to use dbscan::hdbscan in large datasets.

Error in mrd(x_dist, coredist) : 
  number of mutual reachability distance values and size of the distances do not agree.

Here is a reproducible example that generates a large dataset.
If n <- 20000 the code runs without errors. It seems that values for 30000 and above trigger the error.

library(dbscan)
library(uwot)
n <- 30000
ngenes <- 3000
n.means <- 2^runif(ngenes, 2, 10)
n.disp <- 100/n.means + 0.5
set.seed(1000)
m1 <- matrix(rnbinom(ngenes*n, mu=n.means, size=1/n.disp), ncol=n)
set.seed(1000)
m2 <- matrix(rnbinom(ngenes*n, mu=5*n.means, size=1/n.disp), ncol=n)
m <- cbind(m1, m2)
um <- uwot::umap(t(m))
hd <- dbscan::hdbscan(um, 10)

I couldn't exactly pinpoint as to why it happens. But it seems to happen when calling dbscan::mrdist()
Possibly related to #46 ? Apologies if it was better to continue the conversation there. However, I do not get a memory error in this issue and I confirmed that there was enough memory to allocate the object.

Session Info:

R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C              LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] uwot_0.1.10   Matrix_1.3-4  dbscan_1.1-10

loaded via a namespace (and not attached):
  [1] backports_1.3.0             circlize_0.4.13             igraph_1.2.8               
  [4] lazyeval_0.2.2              splines_4.1.1               flowCore_2.6.0             
  [7] BiocParallel_1.28.0         usethis_2.1.3               GenomeInfoDb_1.30.0        
 [10] ggplot2_3.3.5               amap_0.8-18                 digest_0.6.28              
 [13] foreach_1.5.1               yulab.utils_0.0.4           htmltools_0.5.2            
 [16] viridis_0.6.2               ggalluvial_0.12.3           fansi_0.5.0                
 [19] magrittr_2.0.1              ScaledMatrix_1.2.0          cluster_2.1.2              
 [22] doParallel_1.0.16           mixtools_1.2.0              limma_3.50.0               
 [25] fastcluster_1.2.3           ComplexHeatmap_2.10.0       RcppParallel_5.1.4         
 [28] matrixStats_0.61.0          cytolib_2.6.0               colorspace_2.0-2           
 [31] xfun_0.28                   dplyr_1.0.7                 crayon_1.4.2               
 [34] RCurl_1.98-1.5              jsonlite_1.7.2              survival_3.2-13            
 [37] iterators_1.0.13            ape_5.5                     glue_1.4.2                 
 [40] gtable_0.3.0                zlibbioc_1.40.0             XVector_0.34.0             
 [43] Rsubread_2.8.1              GetoptLong_1.0.5            DelayedArray_0.20.0        
 [46] car_3.0-12                  BiocSingular_1.10.0         kernlab_0.9-29             
 [49] shape_1.4.6                 SingleCellExperiment_1.16.0 prabclus_2.3-2             
 [52] BiocGenerics_0.40.0         DEoptimR_1.0-9              abind_1.4-5                
 [55] scales_1.1.1                DBI_1.1.1                   edgeR_3.36.0               
 [58] rstatix_0.7.0               miniUI_0.1.1.1              Rcpp_1.0.7                 
 [61] viridisLite_0.4.0           xtable_1.8-4                clue_0.3-60                
 [64] gridGraphics_0.5-1          tidytree_0.3.5              dqrng_0.3.0                
 [67] rsvd_1.0.5                  mclust_5.4.7                stats4_4.1.1               
 [70] metapod_1.2.0               gplots_3.1.1                RColorBrewer_1.1-2         
 [73] fpc_2.2-9                   modeltools_0.2-23           ellipsis_0.3.2             
 [76] pkgconfig_2.0.3             flexmix_2.3-17              scuttle_1.4.0              
 [79] nnet_7.3-16                 copykit_0.0.0.9039          janitor_2.1.0              
 [82] locfit_1.5-9.4              utf8_1.2.2                  DNAcopy_1.68.0             
 [85] ggplotify_0.1.0             tidyselect_1.1.1            rlang_0.4.12               
 [88] later_1.3.0                 munsell_0.5.0               tools_4.1.1                
 [91] generics_0.1.1              broom_0.7.10                evaluate_0.14              
 [94] stringr_1.4.0               fastmap_1.1.0               yaml_2.2.1                 
 [97] ggtree_3.2.0                fs_1.5.0                    knitr_1.36                 
[100] robustbase_0.93-9           caTools_1.18.2              purrr_0.3.4                
[103] nlme_3.1-153                sparseMatrixStats_1.6.0     mime_0.12                  
[106] scran_1.22.1                aplot_0.1.1                 pracma_2.3.3               
[109] compiler_4.1.1              beeswarm_0.4.0              png_0.1-7                  
[112] treeio_1.18.0               tibble_3.1.5                statmod_1.4.36             
[115] stringi_1.7.5               forcats_0.5.1               lattice_0.20-45            
[118] bluster_1.4.0               vctrs_0.3.8                 copynumber_1.34.0          
[121] pillar_1.6.4                lifecycle_1.0.1             GlobalOptions_0.1.2        
[124] RcppAnnoy_0.0.19            BiocNeighbors_1.12.0        data.table_1.14.2          
[127] cowplot_1.1.1               bitops_1.0-7                irlba_2.3.3                
[130] httpuv_1.6.3                patchwork_1.1.1             GenomicRanges_1.46.0       
[133] R6_2.5.1                    promises_1.2.0.1            RProtoBufLib_2.6.0         
[136] KernSmooth_2.23-20          gridExtra_2.3               vipor_0.4.5                
[139] IRanges_2.28.0              codetools_0.2-18            boot_1.3-28                
[142] MASS_7.3-54                 gtools_3.9.2                assertthat_0.2.1           
[145] SummarizedExperiment_1.24.0 rjson_0.2.20                withr_2.4.3                
[148] S4Vectors_0.32.2            GenomeInfoDbData_1.2.7      diptest_0.76-0             
[151] parallel_4.1.1              grid_4.1.1                  ggfun_0.0.4                
[154] beachmat_2.10.0             tidyr_1.1.4                 class_7.3-19               
[157] snakecase_0.11.0            rmarkdown_2.11              DelayedMatrixStats_1.16.0  
[160] carData_3.0-4               segmented_1.3-4             MatrixGenerics_1.6.0       
[163] ggnewscale_0.4.5            lubridate_1.8.0             Biobase_2.54.0             
[166] shiny_1.7.1                 ggbeeswarm_0.6.0

Thank you, let me know if I can help more.

LOF return Nan values

I have been using LOF for analysing more than 10,000 data point. However, the lof function in dbscan package return nan value.

Allow setting cluster_selection_epsilon in hdbscan()

Malzer & Baum describe how adding a minimum threshold value of eps to HDBSCAN can help with 'micro-clusters' in high density regions. This is implemented in the hdbscan Python package as a cluster_selection_epsilon parameter. Perhaps it could be added to this package too?

DBSCAN with categorica/factor/dummy variables

Can DBSCAN be used in data with non-continuous variables? For example, a dummy for married (0 or 1) or a multi-level categorical variable for marital status (married, single, divorced, widowed etc.). If so, how? I get an error that "x has to be a numeric matrix".

`hdbscan` documenting params that are not used

Currently hdbscan is documenting the following, but only x, minPts, gen_hdbscan_tree and gen_simplified_tree are used, is that intended?

#' @param x a data matrix (Euclidean distances are used) or a [dist] object
#' calculated with an arbitrary distance metric.
#' @param minPts integer; Minimum size of clusters. See details.
#' @param gen_hdbscan_tree logical; should the robust single linkage tree be
#' explicitly computed (see cluster tree in Chaudhuri et al, 2010).
#' @param gen_simplified_tree logical; should the simplified hierarchy be
#' explicitly computed (see Campello et al, 2013).
#' @param verbose report progress.
#' @param ...  additional arguments are passed on.
#' @param scale integer; used to scale condensed tree based on the graphics
#' device. Lower scale results in wider trees.
#' @param gradient character vector; the colors to build the condensed tree
#' coloring with.
#' @param show_flat logical; whether to draw boxes indicating the most stable
#' clusters.
#' @param coredist numeric vector with precomputed core distances (optional).

may you clarify is multi-density clustering is implemented, since it is mentioned on references ?

@inproceedings{ghanbarpour2014exdbscan,
title={EXDBSCAN: An extension of DBSCAN to detect clusters in multi-density datasets},
author={Ghanbarpour, Asieh and Minaei, Behrooz},
booktitle={Intelligent Systems (ICIS), 2014 Iranian Conference on},
pages={1--5},
year={2014},
organization={IEEE}
}

LOF edge case

I noticed an edge case in LOF:

> dbscan::lof(dist(c(1,2,3,4,5,6,7)), k=3)
[1] 1.0555556 1.0555556 1.0555556 0.9047619 0.9047619 1.1111111 1.1111111

By symmetry, points 1, 2, 3 should respectively have the same LOFs as points 7, 6, 5. According to my own calculation the answer should be:

[1] 1.0679012 1.0679012 1.0133929 0.8730159 1.0133929 1.0679012 1.0679012

I think this is happening because when determining the nearest neighbors, the code uses kNN which selects exactly k points, but in the LOF calculation the neighborhood is supposed to include all points that are as close as the k-th nearest neighbor, which in the case of ties (like here) can include more than k points.

Weights in OPTICS

This is a feature request rather than issue strictly- it seems there should be an option to have weights in OPTICS as there is for DBSCAN. Please correct me if this thinking is faulty.

frNN object created from scratch couldn't be used in dbscan

I have a large distance matrix and want to first build frNN object from scratch to reduce memory burden. I first initialize a frNN object with one node and then add my distance and node id to this object.

frNN_dis <- function(dis,eps = 0,if_direct = T){
    if(if_direct){
        dis <- rbind(dis,dis[,c(2,1,3)])
    }
    dis <- as.data.frame(dis)
    colnames(dis) <- c("umi1","umi2","dis")
    dis <- dis[order(dis$umi1),]
    out_frNN <- frNN(as.dist(1), eps = 5)
    out_frNN$dist <- split(dis$dis,dis$umi1)
    out_frNN$id <- split(dis$umi2,dis$umi1)
    out_frNN$sort <- F
    out_frNN <- frNN(out_frNN,eps = eps)
    return(out_frNN)
}

But when I used frNN object built in this way in dbscan, it caused a segfault error

*** caught segfault ***
address 0x51, cause 'memory not mapped'

Traceback:
 1: dbscan_int(x, as.double(eps), as.integer(minPts), as.double(weights),     as.integer(borderPoints), as.integer(search), as.integer(bucketSize),     as.integer(splitRule), as.double(approx), frNN)
 2: dbscan(cpp_test_nn, eps = 0, minPts = 1)
 3: eval(expr, envir, enclos)
 4: eval(expr, envir, enclos)
 5: withVisible(eval(expr, envir, enclos))
 6: withCallingHandlers(withVisible(eval(expr, envir, enclos)), warning = wHandler,     error = eHandler, message = mHandler)
 7: doTryCatch(return(expr), name, parentenv, handler)
 8: tryCatchOne(expr, names, parentenv, handlers[[1L]])
 9: tryCatchList(expr, classes, parentenv, handlers)
10: tryCatch(expr, error = function(e) {    call <- conditionCall(e)    if (!is.null(call)) {        if (identical(call[[1L]], quote(doTryCatch)))             call <- sys.call(-4L)        dcall <- deparse(call)[1L]        prefix <- paste("Error in", dcall, ": ")        LONG <- 75L        sm <- strsplit(conditionMessage(e), "\n")[[1L]]        w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w")        if (is.na(w))             w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L],                 type = "b")        if (w > LONG)             prefix <- paste0(prefix, "\n  ")    }    else prefix <- "Error : "    msg <- paste0(prefix, conditionMessage(e), "\n")    .Internal(seterrmessage(msg[1L]))    if (!silent && isTRUE(getOption("show.error.messages"))) {        cat(msg, file = outFile)        .Internal(printDeferredWarnings())    }    invisible(structure(msg, class = "try-error", condition = e))})
11: try(f, silent = TRUE)
12: handle(ev <- withCallingHandlers(withVisible(eval(expr, envir,     enclos)), warning = wHandler, error = eHandler, message = mHandler))
13: timing_fn(handle(ev <- withCallingHandlers(withVisible(eval(expr,     envir, enclos)), warning = wHandler, error = eHandler, message = mHandler)))
14: evaluate_call(expr, parsed$src[[i]], envir = envir, enclos = enclos,     debug = debug, last = i == length(out), use_try = stop_on_error !=         2L, keep_warning = keep_warning, keep_message = keep_message,     output_handler = output_handler, include_timing = include_timing)
15: evaluate(request$content$code, envir = .GlobalEnv, output_handler = oh,     stop_on_error = 1L)
16: doTryCatch(return(expr), name, parentenv, handler)
17: tryCatchOne(expr, names, parentenv, handlers[[1L]])
18: tryCatchList(expr, names[-nh], parentenv, handlers[-nh])
19: doTryCatch(return(expr), name, parentenv, handler)
20: tryCatchOne(tryCatchList(expr, names[-nh], parentenv, handlers[-nh]),     names[nh], parentenv, handlers[[nh]])
21: tryCatchList(expr, classes, parentenv, handlers)
22: tryCatch(evaluate(request$content$code, envir = .GlobalEnv, output_handler = oh,     stop_on_error = 1L), interrupt = function(cond) {    log_debug("Interrupt during execution")    interrupted <<- TRUE}, error = .self$handle_error)
23: executor$execute(msg)
24: handle_shell()
25: kernel$run()
26: IRkernel::main()
An irrecoverable exception occurred. R is aborting now ...

Compiler Compatibility V1.1-1

I'm not able to load dbscan starting with V1.1-0 in what appears to be a compiler issue. Any thoughts?

Details:

No trouble getting v 1.0 --
install.packages("dbscan", repos = "https://mran.revolutionanalytics.com/snapshot/2017-03-19")

Installing package into ‘/work/library/3.3.1’
(as ‘lib’ is unspecified)
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0100 765k 100 765k 0 0 2974k 0 --:--:-- --:--:-- --:--:-- 2978k

installing source package ‘dbscan’ ...
** package ‘dbscan’ successfully unpacked and MD5 sums checked
** libs
g++ -I/usr/lib64/microsoft-open-r/3.3.1/microsoft-r/3.3/lib64/R/include -DNDEBUG -DU_STATIC_IMPLEMENTATION -I"/work/library/3.3.1/Rcpp/include" -fpic -DU_STATIC_IMPLEMENTATIN -O2 -g -c ANN.cpp -o ANN.o
g++ -I/usr/lib64/microsoft-open-r/3.3.1/microsoft-r/3.3/lib64/R/include -DNDEBUG -DU_STATIC_IMPLEMENTATION -I"/work/library/3.3.1/Rcpp/include" -fpic -DU_STATIC_IMPLEMENTATIN -O2 -g -c R_JP.cpp -o R_JP.o
g++ -I/usr/lib64/microsoft-open-r/3.3.1/microsoft-r/3.3/lib64/R/include -DNDEBUG -DU_STATIC_IMPLEMENTATION -I"/work/library/3.3.1/Rcpp/include" -fpic -DU_STATIC_IMPLEMENTATIN -O2 -g -c R_dbscan.cpp -o R_dbscan.o

Anything after is missing the g++ bash and returns non-zero exit status:
install.packages("dbscan", repos = "https://mran.revolutionanalytics.com/snapshot/2017-03-20")

Installing package into ‘/work/library/3.3.1’
(as ‘lib’ is unspecified)
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0100 2031k 100 2031k 0 0 7531k 0 --:--:-- --:--:-- --:--:-- 7550k

installing source package ‘dbscan’ ...
** package ‘dbscan’ successfully unpacked and MD5 sums checked
** libs
I/usr/lib64/microsoft-open-r/3.3.1/microsoft-r/3.3/lib64/R/include -DNDEBUG -DU_STATIC_IMPLEMENTATION -I"/work/library/3.3.1/Rcpp/include" -c ANN.cpp -o ANN.o
sh: I/usr/lib64/microsoft-open-r/3.3.1/microsoft-r/3.3/lib64/R/include: No such file or directory
make: [ANN.o] Error 127 (ignored)
I/usr/lib64/microsoft-open-r/3.3.1/microsoft-r/3.3/lib64/R/include -DNDEBUG -DU_STATIC_IMPLEMENTATION -I"/work/library/3.3.1/Rcpp/include" -c R_JP.cpp -o R_JP.o
sh: I/usr/lib64/microsoft-open-r/3.3.1/microsoft-r/3.3/lib64/R/include: No such file or directory
make: [R_JP.o] Error 127 (ignored)

We've lost the g++ call to bash.

Details on my setup:
platform x86_64-pc-linux-gnu
version.string R version 3.3.1 (2016-06-21)
gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) (should handle CC++ 11 just fine)

Add minimum R version to DESCRIPTION

Possible optimizations

Hello,

I run DBSCAN on my machine,
OS: RedHat_7.2_x86_64
Memory: 504GB
CPUCount: 2 CoreCount: 48 HT: yes
CPU Model: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

I have dataset with approx. 290k instances and 17 features - taken from https://www.kaggle.com/dalpozz/creditcardfraud. Some features are dropped from dataset, I left (V1, V2, V3, V4, V5, V6, V7, V9, V10, V11, V12,V14, V16, V17, V18, V19, V21) features. Your implementation is much faster than scikit-learn's one, but still it takes several minutes to cluster this dataset. I'd like to run DBSCAN with different hyperparameters: eps and minPts to choose optimal ones, but it takes hours to do it. Could you add some optimizations, i.e. build kd-tree in parallel? It would make possible to use this algorithm for larger datasets.

Thanks in advance!

Error in mrd(x_dist, coredist) : number of mutual reachability distance values and size of the distances do not agree.

Hi,

I am using this package doing some research on a data set. It's a matrix which has 50970 rows and 18 columns. When I was running the function hdbscan with default settings, an error was raised which said "Error in mrd(x_dist, coredist) :
number of mutual reachability distance values and size of the distances do not agree."

Would you please suggesting how to fix this error? Thanks in advance.

c++ implementation

Will it be possible for you to create a c++ commit for dbscan implementation using your existing code? i am looking for a c++ implementation but am unable to find a standard implementation anywhere.

unrecognized parameters in OPTICS function

I was using package version 0.9.8 without issue. After upgrading, it appears that with both 1.0 and 1.1 I am confronted with the same error when using the OPTICS function:
simpleError in optics(layer[, 1:2], eps = 10, eps_cl = epc, minPts = mp, xi = 0.05): Unknown parameter: eps_cl, xi
Furthermore, when I leave out those parameters I get an optics object, but the optics_cut function is not recognized. I have replicated this on a windows and a linux machine. Has anyone else encountered this?

hdbscan

I'm trying out an algorithm for clustering texts called top2vec implemented by @michalovadek
This algorihm first applies doc2vec on texts to get document embeddings, next reduces the dimensionality of these embeddings to a lower dimensional space using uwot::umap after which dbscan::hdbscan is applied to find clusters.
When trying this out on a corpus with approximately 50000 documents, this fails in the call of dist in the call to hdbscan when passing a 2D matrix. A reproducible example is shown below with some fake data. Is there a way that hdbscan can handle more rows to cluster upon (possibly related to issue #35)

> library(dbscan)
> docs_umap <- matrix(rnorm(50000*2), ncol = 2)
> cl <- dbscan::hdbscan(docs_umap, minPts = 15L)
Error in dist(x, method = "euclidean") : 
  negative length vectors are not allowed
> cl <- dbscan::hdbscan(head(docs_umap, 10000), minPts = 15L)
> str(cl$cluster)
 num [1:10000] 0 4 4 4 0 4 4 0 4 4 ...

pkgdown website for package

It would be great to generate package website hosted via GitHub pages so that references are easier
to search etc. This GitHub Action would suffice: https://github.com/r-lib/actions/blob/v2-branch/examples/pkgdown.yaml if interested I can also assist with creating a PR

hdbscan, distance matrix

Currently the complete distance matrix is computed in the hdbscan function. Is it possible that parts of it are computed and used sequentially for the mutual reachability distance such that it could be stored in smaller objects? I currently get an error message about too large vector size when using the function on a large dataset.

dbscan to cull one spatial dataset based on another?

Hi all,
Thank you for your work on dbscan. It is a great resource. This is not an issue, but a request for advice.

Here's my situation:
I have two sets of xy coordinates, each on the same scale. One is very noisy, the other contains known very high-confidence data. I want to cull the first (noisy) dataset to only points within an epsilon radius of any points in the second (high-confidence) dataset.

For example, see the graphic at the bottom of this post. Here, I have applied a buffer (blue polygon) around the high-confidence points (the high-confidence points are not shown). All black points come from the low-confidence dataset. In this case, I would want to retain any points within the blue buffer.

I have an R script that does this, but it is SLOW. There are also some GIS packages that can do something similar (e.g. rgeos::gBuffer), but these require a bunch of dependencies that I would prefer to avoid. I was thinking that frNN and dbscan could be coerced to accomplish this task, but I wasn't sure.

Any advice is much appreciated.
Thanks,
John

predict for HDBSCAN

For a trained HDBSCAN object, I would like to predict the cluster for new data points similar to what is described here. I see that such a functionality exists for DBSCAN in the function predict.dbscan_fast(), but is missing for hdbscan.

Would it be possible to implement a predict.hdbscan() function similar to the one for dbscan_fast? Is there any technical reason why this function doesn't exist? Otherwise, I'd be happy to try to create a PR for that.

NA values on parameters in dbscan

Hello,

I don't really know if it's a bug or not, but when I write this in R :

dbscan::dbscan(cbind(
     x = runif(10, 0, 10) + rnorm(100, sd = 0.2),
     y = runif(10, 0, 10) + rnorm(100, sd = 0.2)
 ), eps = NA, minPts = NA)

R return this :

DBSCAN clustering for 100 objects.
Parameters: eps = NA, minPts = NA
The clustering contains 18 cluster(s) and 0 noise points.

 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 
10  3 10  9  3 10  8  6 10  8  8  8  1  1  1  1  1  2 

Available fields: cluster, eps, minPts

But what does R or Rcpp do with the NA value, it is not infinite nor null, if it's calculate the value eps and minPts how do it calculated it and how can I extract it if it nor a bug.

Thank you,
Arthur

HDBSCAN parameters

Hi,

I would like to have access to min_samples and cluster_selection_method tunable parameters of the hdbscan function.

In the SciKit-learn docs (https://hdbscan.readthedocs.io/en/latest/parameter_selection.html) that the HDBSCAN vignette refers to, there is a chapter on parameter selection for HDBSCAN. While the current implementation of HDBSCAN in the dbscan package for R has only one tunable parameter, minPts, more parameters (including min_samples and cluster_selection_method) are described by the chapter. One scenario that the chapter describes in relation to the cluster_selection_method is:

If you are more interested in having small homogeneous clusters then you may find Excess of Mass has a tendency to pick one or two large clusters and then a number of small extra clusters. In this situation you may be tempted to recluster just the data in the single large cluster. Instead, a better option is to select 'leaf' as a cluster selection method.

This is very similar to what I get with my data (the dimensionality is roughly 4000-by-40): I obtain several smaller clusters (which are better separated) and one "mega-cluster".

I am quite certain that the "mega-cluster" has some meaningful structure within it that I would like to have resolved. From what I read in the SciKit-learn docs chapter, it seems possible to achieve this by tuning those other parameters, particularly, the cluster_selection_method. Is there a way to control and input explicit values to min_samples and cluster_selection_method parameters in the current hdbscan function from the dbscan package for R, or would it be possible to add this feature? Thank you.

My R session info:

> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.3

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] pRolocGUI_1.11.2     RColorBrewer_1.1-2   ggplot2_2.2.1        dbscan_1.1-1         pRoloc_1.19.1       
 [6] MLInterfaces_1.56.0  cluster_2.0.6        annotate_1.54.0      XML_3.98-1.10        AnnotationDbi_1.38.2
[11] IRanges_2.10.5       S4Vectors_0.14.7     MSnbase_2.2.0        ProtGenerics_1.8.0   BiocParallel_1.10.1 
[16] mzR_2.10.0           Rcpp_0.12.15         Biobase_2.36.2       BiocGenerics_0.22.1 

loaded via a namespace (and not attached):
  [1] plyr_1.8.4            igraph_1.1.2          lazyeval_0.2.1        splines_3.4.2        
  [5] ggvis_0.4.3           crosstalk_1.0.0       digest_0.6.15         foreach_1.4.4        
  [9] BiocInstaller_1.26.1  htmltools_0.3.6       viridis_0.5.0         gdata_2.18.0         
 [13] magrittr_1.5          memoise_1.1.0         doParallel_1.0.11     sfsmisc_1.1-1        
 [17] limma_3.32.10         recipes_0.1.2         gower_0.1.2           rda_1.0.2-2          
 [21] dimRed_0.1.0          lpSolve_5.6.13        colorspace_1.3-2      blob_1.1.0           
 [25] dplyr_0.7.4           RCurl_1.95-4.10       hexbin_1.27.2         genefilter_1.58.1    
 [29] bindr_0.1             impute_1.50.1         survival_2.41-3       iterators_1.0.9      
 [33] glue_1.2.0            DRR_0.0.3             gtable_0.2.0          ipred_0.9-6          
 [37] zlibbioc_1.22.0       kernlab_0.9-25        ddalpha_1.3.1.1       prabclus_2.2-6       
 [41] DEoptimR_1.0-8        scales_0.5.0          vsn_3.44.0            mvtnorm_1.0-7        
 [45] DBI_0.7               viridisLite_0.3.0     xtable_1.8-2          foreign_0.8-69       
 [49] bit_1.1-12            proxy_0.4-21          mclust_5.4            preprocessCore_1.38.1
 [53] DT_0.4                lava_1.6              prodlim_1.6.1         htmlwidgets_1.0      
 [57] sampling_2.8          threejs_0.3.1         FNN_1.1               fpc_2.1-11           
 [61] modeltools_0.2-21     pkgconfig_2.0.1       flexmix_2.3-14        nnet_7.3-12          
 [65] caret_6.0-78          labeling_0.3          tidyselect_0.2.3      rlang_0.2.0          
 [69] reshape2_1.4.3        munsell_0.4.3         mlbench_2.1-1         tools_3.4.2          
 [73] RSQLite_2.0           pls_2.6-0             broom_0.4.3           stringr_1.3.0        
 [77] yaml_2.1.16           mzID_1.14.0           ModelMetrics_1.1.0    knitr_1.20           
 [81] bit64_0.9-7           robustbase_0.92-8     randomForest_4.6-12   purrr_0.2.4          
 [85] dendextend_1.7.0      bindrcpp_0.2          nlme_3.1-131.1        whisker_0.3-2        
 [89] mime_0.5              RcppRoll_0.2.2        biomaRt_2.32.1        compiler_3.4.2       
 [93] e1071_1.6-8           affyio_1.46.0         tibble_1.4.2          stringi_1.1.6        
 [97] lattice_0.20-35       trimcluster_0.1-2     Matrix_1.2-12         psych_1.7.8          
[101] gbm_2.1.3             pillar_1.1.0          MALDIquant_1.17       bitops_1.0-6         
[105] httpuv_1.3.5          R6_2.2.2              pcaMethods_1.68.0     affy_1.54.0          
[109] hwriter_1.3.2         gridExtra_2.3         codetools_0.2-15      MASS_7.3-48          
[113] gtools_3.5.0          assertthat_0.2.0      CVST_0.2-1            withr_2.1.1          
[117] mnormt_1.5-5          diptest_0.75-7        grid_3.4.2            rpart_4.1-12         
[121] timeDate_3043.102     tidyr_0.8.0           class_7.3-14          Rtsne_0.13           
[125] shiny_1.0.5           lubridate_1.7.2       base64enc_0.1-3

Segmentation fault in HDBSCAN when clustering a large(?) dataset

Hi there,

I was clustering a large-ish dataset with 60k data points using HDBSCAN when the program crashed. I then discovered that I can reliably crash the hdbscan function with a segmentation fault as follows (to reproduce it you need quite a bit of RAM, it crashes once around 60GB are allocated):

library(dbscan)

minpts <- 100
data <- data.frame(feature = 1:60000)

hdbscan(data, minpts)
#> Error: callr failed, could not start R, exited with non-zero status, has crashed or was killed 
#>  *** caught segfault ***
#>  address 0x7fda3398a2c8, cause 'memory not mapped'
#> 
#> Traceback:
#>  1: prims(mrd, n)
#>  2: hdbscan(data, minpts)
#>  3: tryCatchList(expr, classes, parentenv, handlers)
#>  4: tryCatch(hdbscan(data, minpts))
#>  5: eval(expr, envir, enclos)
#>  6: eval(expr, envir, enclos)
#>  7: withVisible(eval(expr, envir, enclos))
#>  8: withCallingHandlers(withVisible(eval(expr, envir, enclos)), warning = wHandler,     error = eHandler, message = mHandler)
#>  9: doTryCatch(return(expr), name, parentenv, handler)
#> 10: tryCatchOne(expr, names, parentenv, handlers[[1L]])
#> 11: tryCatchList(expr, classes, parentenv, handlers)
#> 12: tryCatch(expr, error = function(e) {    call <- conditionCall(e)    if (!is.null(call)) {        if (identical(call[[1L]], quote(doTryCatch)))             call <- sys.call(-4L)        dcall <- deparse(call)[1L]        prefix <- paste("Error in", dcall, ": ")        LONG <- 75L        s

A quick debugging session with gdb shows that the problem occurs here:

> source("crash-hdbscan.R")

Program received signal SIGSEGV, Segmentation fault.
prims (x_dist=..., n=n@entry=60000) at prims_mst.cpp:77
77      prims_mst.cpp: No such file or directory.

I'm running a 64-bit version of R and the latest version of the dbscan package:

R.version
#>                _                           
#> platform       x86_64-pc-linux-gnu         
#> arch           x86_64                      
#> os             linux-gnu                   
#> system         x86_64, linux-gnu           
#> status                                     
#> major          3                           
#> minor          6.1                         
#> year           2019                        
#> month          07                          
#> day            05                          
#> svn rev        76782                       
#> language       R                           
#> version.string R version 3.6.1 (2019-07-05)
#> nickname       Action of the Toes

packageVersion("dbscan")
#> [1] '1.1.4'

Possible fixes I can think of:

(Using size_t for the index variable might fix the issue) EDIT: won't help I don't think
Splitting up the computation somehow as suggested in #35 could potentially help by reducing the memory consumption

At the very least there should be an error() that causes the clustering to fail in a controlled manner without causing a segmentation fault (which may also cause the R Session to crash).

Getting an error when using predict: x has to be a numeric matrix.

originally i was using the following code to create a model and make prediction (for simplicity i'm making predictions on the learning data itself):

model=dbscan(learnData,0.5,minPts =7);
predict(model,testData,learnData)

in R 3.6.0 the code used to work fine. we upgraded to R 4.0.2 (and therefore newer dbscan) and it stopped working with the following error:

Error in frNN(data, query = newdata, eps = eps, sort = TRUE, ...) :
x has to be a numeric matrix.

I found this issue that looks the same, but was closed:
#14

I even tried the code that was given in the closed issue as an example but it also fails with the same error:

library('dbscan')
data(iris)
d <- cluster::daisy(iris, metric = "gower", stand = TRUE)
model <- dbscan(d, eps = .23, minPts = 50)
predict(model, newdata = iris[1:5,], data = iris)

I also tried upgrading to R 4.2.1, and made sure i'm with dbscan_1.1-10. but still i'm getting the same error.

"Pointes [...]" -> "Points [...]"

Currently, $kNNdist$ has

xlab = "Pointes (sample) sorted by distance"

which I believe should be changed to

xlab = "Points (sample) sorted by distance"

Thanks a lot for the great package.

install.packages -> buildHDBSCAN.cpp(45): error

I can install dbscan on my laptop, no problem.
But it is not installing on a linux cluster I use for big data.
I have tried with R 3.3.3 and 3.4.0. I get the same error after invoking

install.packages("dbscan")

Error Message:

buildHDBSCAN.cpp(45): error: more than one operator "==" matches these operands:
        built-in operator "pointer == pointer"
        function "Rcpp::operator==(Rcpp::Na_Proxy, SEXP)"
        operand types are: Rcpp::internal::generic_name_proxy<19> == SEXP
if (!hcl.containsElementNamed("labels") || hcl["labels"] == R_NilValue){
                                                         ^

compilation aborted for buildHDBSCAN.cpp (code 2)
make: *** 


[/cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2 /Compiler/intel2016.4/r/3.3.3/lib64/R/etc/Makeconf:141: buildHDBSCAN.o] Error 2
ERROR: compilation failed for package ‘dbscan’
 removing ‘/home/xxxxxx/R/x86_64-pc-linux-gnu-library/3.3/dbscan’

The downloaded source packages are in
    ‘/tmp/RtmpI1SJp8/downloaded_packages’
Warning message:
In install.packages("dbscan") :
  installation of package ‘dbscan’ had non-zero exit status

install error within ios system

install_github("mhahsler/dbscan")
Downloading github repo mhahsler/dbscan@master
Installing dbscan
'/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file --no-environ
--no-save --no-restore CMD INSTALL
'/private/var/folders/n9/hh04k0tn79q9q501k2b4rbbw0000gn/T/RtmpLkatHj/devtools3b955cbbe805/mhahsler-dbscan-1d0a3ac'
--library='/Library/Frameworks/R.framework/Versions/3.1/Resources/library'
--install-tests

installing source package ‘dbscan’ ...
** libs
I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp/include" -c ANN.cpp -o ANN.o
/bin/sh: I/Library/Frameworks/R.framework/Resources/include: No such file or directory
make: [ANN.o] Error 127 (ignored)
I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp/include" -c R_JP.cpp -o R_JP.o
/bin/sh: I/Library/Frameworks/R.framework/Resources/include: No such file or directory
make: [R_JP.o] Error 127 (ignored)
I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp/include" -c R_dbscan.cpp -o R_dbscan.o
/bin/sh: I/Library/Frameworks/R.framework/Resources/include: No such file or directory
make: [R_dbscan.o] Error 127 (ignored)
I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp/include" -c R_density.cpp -o R_density.o
/bin/sh: I/Library/Frameworks/R.framework/Resources/include: No such file or directory
make: [R_density.o] Error 127 (ignored)
I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp/include" -c R_frNN.cpp -o R_frNN.o
/bin/sh: I/Library/Frameworks/R.framework/Resources/include: No such file or directory
make: [R_frNN.o] Error 127 (ignored)
I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp/include" -c R_kNN.cpp -o R_kNN.o
/bin/sh: I/Library/Frameworks/R.framework/Resources/include: No such file or directory
make: [R_kNN.o] Error 127 (ignored)
I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp/include" -c R_optics.cpp -o R_optics.o
/bin/sh: I/Library/Frameworks/R.framework/Resources/include: No such file or directory
make: [R_optics.o] Error 127 (ignored)
I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp/include" -c R_regionQuery.cpp -o R_regionQuery.o
/bin/sh: I/Library/Frameworks/R.framework/Resources/include: No such file or directory
make: [R_regionQuery.o] Error 127 (ignored)
I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp/include" -c RcppExports.cpp -o RcppExports.o
/bin/sh: I/Library/Frameworks/R.framework/Resources/include: No such file or directory
make: [RcppExports.o] Error 127 (ignored)
I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp/include" -c bd_fix_rad_search.cpp -o bd_fix_rad_search.o
/bin/sh: I/Library/Frameworks/R.framework/Resources/include: No such file or directory
make: [bd_fix_rad_search.o] Error 127 (ignored)
I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp/include" -c bd_pr_search.cpp -o bd_pr_search.o
/bin/sh: I/Library/Frameworks/R.framework/Resources/include: No such file or directory
make: [bd_pr_search.o] Error 127 (ignored)
I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp/include" -c bd_search.cpp -o bd_search.o
/bin/sh: I/Library/Frameworks/R.framework/Resources/include: No such file or directory
make: [bd_search.o] Error 127 (ignored)
I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp/include" -c bd_tree.cpp -o bd_tree.o
/bin/sh: I/Library/Frameworks/R.framework/Resources/include: No such file or directory
make: [bd_tree.o] Error 127 (ignored)
I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp/include" -c brute.cpp -o brute.o
/bin/sh: I/Library/Frameworks/R.framework/Resources/include: No such file or directory
make: [brute.o] Error 127 (ignored)
I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp/include" -c buildHDBSCAN.cpp -o buildHDBSCAN.o
/bin/sh: I/Library/Frameworks/R.framework/Resources/include: No such file or directory
make: [buildHDBSCAN.o] Error 127 (ignored)
I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp/include" -c dendrogram.cpp -o dendrogram.o
/bin/sh: I/Library/Frameworks/R.framework/Resources/include: No such file or directory
make: [dendrogram.o] Error 127 (ignored)
gcc -arch x86_64 -std=gnu99 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp/include" -fPIC -mtune=core2 -g -O2 -c init.c -o init.o
I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp/include" -c kd_dump.cpp -o kd_dump.o
/bin/sh: I/Library/Frameworks/R.framework/Resources/include: No such file or directory
make: [kd_dump.o] Error 127 (ignored)
I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp/include" -c kd_fix_rad_search.cpp -o kd_fix_rad_search.o
/bin/sh: I/Library/Frameworks/R.framework/Resources/include: No such file or directory
make: [kd_fix_rad_search.o] Error 127 (ignored)
I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp/include" -c kd_pr_search.cpp -o kd_pr_search.o
/bin/sh: I/Library/Frameworks/R.framework/Resources/include: No such file or directory
make: [kd_pr_search.o] Error 127 (ignored)
I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp/include" -c kd_search.cpp -o kd_search.o
/bin/sh: I/Library/Frameworks/R.framework/Resources/include: No such file or directory
make: [kd_search.o] Error 127 (ignored)
I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp/include" -c kd_split.cpp -o kd_split.o
/bin/sh: I/Library/Frameworks/R.framework/Resources/include: No such file or directory
make: [kd_split.o] Error 127 (ignored)
I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp/include" -c kd_tree.cpp -o kd_tree.o
/bin/sh: I/Library/Frameworks/R.framework/Resources/include: No such file or directory
make: [kd_tree.o] Error 127 (ignored)
I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp/include" -c kd_util.cpp -o kd_util.o
/bin/sh: I/Library/Frameworks/R.framework/Resources/include: No such file or directory
make: [kd_util.o] Error 127 (ignored)
I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp/include" -c mrd.cpp -o mrd.o
/bin/sh: I/Library/Frameworks/R.framework/Resources/include: No such file or directory
make: [mrd.o] Error 127 (ignored)
I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp/include" -c perf.cpp -o perf.o
/bin/sh: I/Library/Frameworks/R.framework/Resources/include: No such file or directory
make: [perf.o] Error 127 (ignored)
I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp/include" -c prims_mst.cpp -o prims_mst.o
/bin/sh: I/Library/Frameworks/R.framework/Resources/include: No such file or directory
make: [prims_mst.o] Error 127 (ignored)
I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I"/Library/Frameworks/R.framework/Versions/3.1/Resources/library/Rcpp/include" -c union_find.cpp -o union_find.o
/bin/sh: I/Library/Frameworks/R.framework/Resources/include: No such file or directory
make: [union_find.o] Error 127 (ignored)
-dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -single_module -multiply_defined suppress -L/Library/Frameworks/R.framework/Resources/lib -L/usr/local/lib -L/usr/local/lib -o dbscan.so ANN.o R_JP.o R_dbscan.o R_density.o R_frNN.o R_kNN.o R_optics.o R_regionQuery.o RcppExports.o bd_fix_rad_search.o bd_pr_search.o bd_search.o bd_tree.o brute.o buildHDBSCAN.o dendrogram.o init.o kd_dump.o kd_fix_rad_search.o kd_pr_search.o kd_search.o kd_split.o kd_tree.o kd_util.o mrd.o perf.o prims_mst.o union_find.o -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation
/bin/sh: -dynamiclib: command not found
make: *** [dbscan.so] Error 127
ERROR: compilation failed for package ‘dbscan’
removing ‘/Library/Frameworks/R.framework/Versions/3.1/Resources/library/dbscan’

Possible Memory Leak

I have been running into a segfault error when running hdbscan. I initially ran into the error when using the doc2vec library which calls hdbscan. I only run into the error when running on my full set of data (137649 rows, ~300mb), but not for a subset. The error still happens even if I try increasing minPts, or increasing the size of the server I am using (I have tried up to 600GB RAM).

Is there any way around this error? Please let me know if there's anything I can do to help debug!

library(doc2vec)
# download sample file - note: file is ~300mb
utils::download.file("https://www.dropbox.com/s/geer73bjp936gaw/gdelt_seg_d2v.bin?dl=1", "temp.bin")
d2v <- read.paragraph2vec(file = "temp.bin")
emb <- as.matrix(d2v)
embedding_umap   <- uwot::tumap(emb , n_neighbors = 100L, n_components = 2, metric = "cosine")
thisfails <- dbscan::hdbscan(embedding_umap , minPts = 25)

Here is the output of sessionInfo():

Matrix products: default
BLAS: /software/free/R/R-4.0.0/lib/R/lib/libRblas.so
LAPACK: /software/free/R/R-4.0.0/lib/R/lib/libRlapack.so

Random number generation:
RNG: L'Ecuyer-CMRG
Normal: Inversion
Sample: Rejection

locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] ranger_0.12.1 vctrs_0.3.7 rlang_0.4.10
[4] mosaicCore_0.9.0 yardstick_0.0.8 workflowsets_0.0.2
[7] workflows_0.2.2 tune_0.1.5 tidyr_1.1.3
[10] tibble_3.1.1 rsample_0.0.9 recipes_0.1.16
[13] purrr_0.3.4 parsnip_0.1.5 modeldata_0.1.0
[16] infer_0.5.4 ggplot2_3.3.3 dplyr_1.0.5
[19] dials_0.0.9 scales_1.1.1 broom_0.7.6
[22] tidymodels_0.1.3 lubridate_1.7.10 gsubfn_0.7
[25] proto_1.0.0 data.table_1.13.6 dbscan_1.1-8
[28] uwot_0.1.10 Matrix_1.3-2 stringr_1.4.0
[31] doc2vec_0.2.0 futile.logger_1.4.3

loaded via a namespace (and not attached):
[1] splines_4.0.0 foreach_1.5.1 here_0.1
[4] prodlim_2019.11.13 assertthat_0.2.1 conflicted_1.0.4
[7] GPfit_1.0-8 globals_0.14.0 ipred_0.9-11
[10] pillar_1.6.0 backports_1.2.0 lattice_0.20-41
[13] glue_1.4.2 pROC_1.17.0.1 digest_0.6.27
[16] pryr_0.1.4 hardhat_0.1.5 colorspace_2.0-0
[19] plyr_1.8.6 timeDate_3043.102 pkgconfig_2.0.3
[22] lhs_1.1.1 DiceDesign_1.9 listenv_0.8.0
[25] RSpectra_0.16-0 gower_0.2.2 lava_1.6.9
[28] generics_0.1.0 ellipsis_0.3.1 withr_2.3.0
[31] furrr_0.2.2 nnet_7.3-14 cli_2.4.0
[34] survival_3.2-7 magrittr_1.5 crayon_1.3.4
[37] memoise_1.1.0 ps_1.4.0 fansi_0.4.1
[40] future_1.21.0 parallelly_1.24.0 MASS_7.3-53
[43] class_7.3-17 tools_4.0.0 formatR_1.7
[46] lifecycle_1.0.0 munsell_0.5.0 lambda.r_1.2.4
[49] compiler_4.0.0 grid_4.0.0 rstudioapi_0.13
[52] iterators_1.0.13 RcppAnnoy_0.0.18 gtable_0.3.0
[55] codetools_0.2-18 DBI_1.1.0 R6_2.5.0
[58] utf8_1.1.4 rprojroot_1.3-2 futile.options_1.0.1
[61] stringi_1.5.3 parallel_4.0.0 Rcpp_1.0.6
[64] rpart_4.1-15 tidyselect_1.1.0

matching implementations & algorithm correctness

@mhahsler I'm trying to add dbscan and optics to my largeVis package. I've been using your package to generate testing data to make sure that mine is producing the same results. I've come across a couple of thingies that I'm not understand and I hope you can help me clear it up.

In particular, on optics, my implementation and yours are producing similar but different results. The source of the discrepancy seems to be that they get different results in their calculation of core distance.

I've isolated an example. Take the dataset produced by:

data(iris)
dat <- as.matrix(iris[, 1:4])
dupes <- which(duplicated(dat))
dat <- dat[-dupes, ]

With eps = 1 and minPts = 10, my implementation calculates a core distance for point 1 of 0.3. dbscan::optics with those settings, and all other parameters at their defaults, seems to calculate 0.316.

With search = 'linear', dbscan::optics gives:

dbscan::optics(dat, eps = 1, minPts = 10, search = "linear")$coredist[1]
[1] 0.244949

Checking manually, it seems to me that 0.3 is the right answer:

distances <- dist(dat)
neighbors[1:12, 1] # an adjacency matrix generated by largeVis; 0-indexed
[1] 17  4 39 28 27 40  7 49 37 21 48 26 
as.matrix(distances)[1, neighbors[1:12] + 1]

18	5	40	29	28	41	8	50	38	22	49	27
0.1000000	0.1414214	0.1414214	0.1414214	0.1414214	0.1732051	0.1732051	0.2236068	0.2449490	0.3000000	0.3000000	0.3162278

Any ideas? Could this be an issue in the approximate nearest neighbor search?

some strange results of sNN function

I use the function sNN() to find the shared neighbors of points, but I notice a strange result when I test this function on an easy example. The code and results are shown below:

testdata <- c(-2,-1,0,1,2,2.4,2.5,3,3.5,4)
distancematrix <- dist(testdata,method = "minkowski",p=2)
test_res <- sNN(x=distancematrix,k=5,sort = FALSE)
test_res$id[3,]
1 2 3 4 5
2 4 1 5 6
test_res$shared[3,]
[1] 5 4 5 0 0

The shared k=5 neighbors of '0' and '2' are '1' and '2', but sNN() says they have no shared neighbors. Did I do anything wrong? Any suggestions will be appreciated!

BD-trees

Add box-based trees to the NN interface.

kNN crashing (segfault) when matrix has Inf values

The kNN function gave me a segmentation fault when some of the values were infinite. This was fixed by capping the numbers calculated. Apologies for not reproducing the data/code...

Best,
Yonatan