mhahsler / dbscan
Density Based Clustering of Applications with Noise (DBSCAN) and Related Algorithms - R package
License: GNU General Public License v3.0
Hi there,
I was clustering a large-ish dataset with 60k data points using HDBSCAN when the program crashed. I then discovered that I can reliably crash the hdbscan function with a segmentation fault as follows (to reproduce it you need quite a bit of RAM; it crashes once around 60 GB are allocated):
library(dbscan)
minpts <- 100
data <- data.frame(feature = 1:60000)
hdbscan(data, minpts)
#> Error: callr failed, could not start R, exited with non-zero status, has crashed or was killed
#> *** caught segfault ***
#> address 0x7fda3398a2c8, cause 'memory not mapped'
#>
#> Traceback:
#> 1: prims(mrd, n)
#> 2: hdbscan(data, minpts)
#> 3: tryCatchList(expr, classes, parentenv, handlers)
#> 4: tryCatch(hdbscan(data, minpts))
#> 5: eval(expr, envir, enclos)
#> 6: eval(expr, envir, enclos)
#> 7: withVisible(eval(expr, envir, enclos))
#> 8: withCallingHandlers(withVisible(eval(expr, envir, enclos)), warning = wHandler, error = eHandler, message = mHandler)
#> 9: doTryCatch(return(expr), name, parentenv, handler)
#> 10: tryCatchOne(expr, names, parentenv, handlers[[1L]])
#> 11: tryCatchList(expr, classes, parentenv, handlers)
#> 12: tryCatch(expr, error = function(e) { call <- conditionCall(e) if (!is.null(call)) { if (identical(call[[1L]], quote(doTryCatch))) call <- sys.call(-4L) dcall <- deparse(call)[1L] prefix <- paste("Error in", dcall, ": ") LONG <- 75L s
A quick debugging session with gdb shows that the problem occurs here:
> source("crash-hdbscan.R")
Program received signal SIGSEGV, Segmentation fault.
prims (x_dist=..., n=n@entry=60000) at prims_mst.cpp:77
77 prims_mst.cpp: No such file or directory.
I'm running a 64-bit version of R and the latest version of the dbscan package:
R.version
#> _
#> platform x86_64-pc-linux-gnu
#> arch x86_64
#> os linux-gnu
#> system x86_64, linux-gnu
#> status
#> major 3
#> minor 6.1
#> year 2019
#> month 07
#> day 05
#> svn rev 76782
#> language R
#> version.string R version 3.6.1 (2019-07-05)
#> nickname Action of the Toes
packageVersion("dbscan")
#> [1] '1.1.4'
Possible fixes I can think of:
- Using size_t for the index variable might fix the issue. EDIT: won't help, I don't think.
- At the very least there should be an error() that causes the clustering to fail in a controlled manner without causing a segmentation fault (which may also crash the R session).
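For what it's worth, a quick back-of-the-envelope check in plain R (my own arithmetic, not taken from the package source) suggests how index arithmetic could go wrong at this size: the pair count itself still fits in a signed 32-bit integer, but an intermediate product like n * (n - 1) no longer does.

```r
n <- 60000
n_pairs <- n * (n - 1) / 2          # 1,799,970,000 pairwise distances
n_pairs < .Machine$integer.max      # TRUE: the count itself still fits in 32 bits
n * (n - 1) > .Machine$integer.max  # TRUE: but this intermediate already overflows
n_pairs * 8 / 1024^3                # ~13.4 GiB just for the distance vector
```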
I'm not able to install dbscan starting with v1.1-0, in what appears to be a compiler issue. Any thoughts?
Details:
No trouble getting v1.0:
install.packages("dbscan", repos = "https://mran.revolutionanalytics.com/snapshot/2017-03-19")
Installing package into ‘/work/library/3.3.1’
(as ‘lib’ is unspecified)
Anything after that snapshot fails: the g++ call never reaches bash and installation returns a non-zero exit status:
install.packages("dbscan", repos = "https://mran.revolutionanalytics.com/snapshot/2017-03-20")
Installing package into ‘/work/library/3.3.1’
(as ‘lib’ is unspecified)
We've lost the g++ call to bash.
Details on my setup:
platform x86_64-pc-linux-gnu
version.string R version 3.3.1 (2016-06-21)
gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) (should handle C++11 just fine)
Will it be possible for you to create a C++ implementation of DBSCAN using your existing code? I am looking for a C++ implementation but am unable to find a standard implementation anywhere.
I have been running into a segfault error when running hdbscan. I initially ran into the error when using the doc2vec library, which calls hdbscan. I only run into the error when running on my full set of data (137649 rows, ~300 MB), but not for a subset. The error still happens even if I try increasing minPts, or increasing the size of the server I am using (I have tried up to 600 GB RAM).
Is there any way around this error? Please let me know if there's anything I can do to help debug!
library(doc2vec)
# download sample file - note: file is ~300mb
utils::download.file("https://www.dropbox.com/s/geer73bjp936gaw/gdelt_seg_d2v.bin?dl=1", "temp.bin")
d2v <- read.paragraph2vec(file = "temp.bin")
emb <- as.matrix(d2v)
embedding_umap <- uwot::tumap(emb , n_neighbors = 100L, n_components = 2, metric = "cosine")
thisfails <- dbscan::hdbscan(embedding_umap , minPts = 25)
Here is the output of sessionInfo():
Matrix products: default
BLAS: /software/free/R/R-4.0.0/lib/R/lib/libRblas.so
LAPACK: /software/free/R/R-4.0.0/lib/R/lib/libRlapack.so
Random number generation:
RNG: L'Ecuyer-CMRG
Normal: Inversion
Sample: Rejection
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ranger_0.12.1 vctrs_0.3.7 rlang_0.4.10
[4] mosaicCore_0.9.0 yardstick_0.0.8 workflowsets_0.0.2
[7] workflows_0.2.2 tune_0.1.5 tidyr_1.1.3
[10] tibble_3.1.1 rsample_0.0.9 recipes_0.1.16
[13] purrr_0.3.4 parsnip_0.1.5 modeldata_0.1.0
[16] infer_0.5.4 ggplot2_3.3.3 dplyr_1.0.5
[19] dials_0.0.9 scales_1.1.1 broom_0.7.6
[22] tidymodels_0.1.3 lubridate_1.7.10 gsubfn_0.7
[25] proto_1.0.0 data.table_1.13.6 dbscan_1.1-8
[28] uwot_0.1.10 Matrix_1.3-2 stringr_1.4.0
[31] doc2vec_0.2.0 futile.logger_1.4.3
loaded via a namespace (and not attached):
[1] splines_4.0.0 foreach_1.5.1 here_0.1
[4] prodlim_2019.11.13 assertthat_0.2.1 conflicted_1.0.4
[7] GPfit_1.0-8 globals_0.14.0 ipred_0.9-11
[10] pillar_1.6.0 backports_1.2.0 lattice_0.20-41
[13] glue_1.4.2 pROC_1.17.0.1 digest_0.6.27
[16] pryr_0.1.4 hardhat_0.1.5 colorspace_2.0-0
[19] plyr_1.8.6 timeDate_3043.102 pkgconfig_2.0.3
[22] lhs_1.1.1 DiceDesign_1.9 listenv_0.8.0
[25] RSpectra_0.16-0 gower_0.2.2 lava_1.6.9
[28] generics_0.1.0 ellipsis_0.3.1 withr_2.3.0
[31] furrr_0.2.2 nnet_7.3-14 cli_2.4.0
[34] survival_3.2-7 magrittr_1.5 crayon_1.3.4
[37] memoise_1.1.0 ps_1.4.0 fansi_0.4.1
[40] future_1.21.0 parallelly_1.24.0 MASS_7.3-53
[43] class_7.3-17 tools_4.0.0 formatR_1.7
[46] lifecycle_1.0.0 munsell_0.5.0 lambda.r_1.2.4
[49] compiler_4.0.0 grid_4.0.0 rstudioapi_0.13
[52] iterators_1.0.13 RcppAnnoy_0.0.18 gtable_0.3.0
[55] codetools_0.2-18 DBI_1.1.0 R6_2.5.0
[58] utf8_1.1.4 rprojroot_1.3-2 futile.options_1.0.1
[61] stringi_1.5.3 parallel_4.0.0 Rcpp_1.0.6
[64] rpart_4.1-15 tidyselect_1.1.0
I'm trying out an algorithm for clustering texts called top2vec, implemented by @michalovadek.
This algorithm first applies doc2vec on texts to get document embeddings, then reduces the dimensionality of these embeddings to a lower-dimensional space using uwot::umap, after which dbscan::hdbscan is applied to find clusters.
When trying this out on a corpus with approximately 50000 documents, this fails in the call of dist inside hdbscan when passing a 2D matrix. A reproducible example is shown below with some fake data. Is there a way that hdbscan can handle more rows to cluster upon (possibly related to issue #35)?
> library(dbscan)
> docs_umap <- matrix(rnorm(50000*2), ncol = 2)
> cl <- dbscan::hdbscan(docs_umap, minPts = 15L)
Error in dist(x, method = "euclidean") :
negative length vectors are not allowed
> cl <- dbscan::hdbscan(head(docs_umap, 10000), minPts = 15L)
> str(cl$cluster)
num [1:10000] 0 4 4 4 0 4 4 0 4 4 ...
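A possible explanation (my guess; I have not checked the stats sources): the length of the dist result is computed with 32-bit integer arithmetic, and an intermediate product like n * (n - 1) overflows to a negative value at this n, which would match the 'negative length vectors' message.

```r
n <- 50000
n * (n - 1)                   # 2,499,950,000: larger than .Machine$integer.max
.Machine$integer.max          # 2,147,483,647, so a 32-bit intermediate wraps negative
n * (n - 1) / 2 * 8 / 1024^3  # the full dist object would need ~9.3 GiB anyway
```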
@mhahsler I'm trying to add dbscan and optics to my largeVis package. I've been using your package to generate testing data to make sure that mine is producing the same results. I've come across a couple of things that I'm not understanding, and I hope you can help me clear them up.
In particular, on optics, my implementation and yours are producing similar but different results. The source of the discrepancy seems to be that they get different results in their calculation of core distance.
I've isolated an example. Take the dataset produced by:
data(iris)
dat <- as.matrix(iris[, 1:4])
dupes <- which(duplicated(dat))
dat <- dat[-dupes, ]
With eps = 1 and minPts = 10, my implementation calculates a core distance for point 1 of 0.3. dbscan::optics with those settings, and all other parameters at their defaults, seems to calculate 0.316.
With search = 'linear', dbscan::optics gives:
dbscan::optics(dat, eps = 1, minPts = 10, search = "linear")$coredist[1]
[1] 0.244949
Checking manually, it seems to me that 0.3 is the right answer:
distances <- dist(dat)
neighbors[1:12, 1] # an adjacency matrix generated by largeVis; 0-indexed
[1] 17 4 39 28 27 40 7 49 37 21 48 26
as.matrix(distances)[1, neighbors[1:12] + 1]
       18         5        40        29        28        41         8        50        38        22        49        27
0.1000000 0.1414214 0.1414214 0.1414214 0.1414214 0.1732051 0.1732051 0.2236068 0.2449490 0.3000000 0.3000000 0.3162278
Any ideas? Could this be an issue in the approximate nearest neighbor search?
I was using package version 0.9.8 without issue. After upgrading, it appears that with both 1.0 and 1.1 I am confronted with the same error when using the OPTICS function:
simpleError in optics(layer[, 1:2], eps = 10, eps_cl = epc, minPts = mp, xi = 0.05): Unknown parameter: eps_cl, xi
Furthermore, when I leave out those parameters I get an optics object, but the optics_cut function is not recognized. I have replicated this on Windows and Linux machines. Has anyone else encountered this?
Hi all,
Thank you for your work on dbscan. It is a great resource. This is not an issue, but a request for advice.
Here's my situation:
I have two sets of xy coordinates, each on the same scale. One is very noisy, the other contains known very high-confidence data. I want to cull the first (noisy) dataset to only points within an epsilon radius of any points in the second (high-confidence) dataset.
For example, see the graphic at the bottom of this post. Here, I have applied a buffer (blue polygon) around the high-confidence points (the high-confidence points are not shown). All black points come from the low-confidence dataset. In this case, I would want to retain any points within the blue buffer.
I have an R script that does this, but it is SLOW. There are also some GIS packages that can do something similar (e.g. rgeos::gBuffer), but these require a bunch of dependencies that I would prefer to avoid. I was thinking that frNN and dbscan could be coerced to accomplish this task, but I wasn't sure.
Any advice is much appreciated.
Thanks,
John
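In case it is useful, here is a sketch of how frNN could do this culling directly (the coordinates and eps below are made up; frNN's query argument finds, for each query point, all data points within eps):

```r
library(dbscan)

# made-up stand-ins for the two point sets, on the same scale
set.seed(42)
noisy     <- matrix(runif(2000), ncol = 2)  # low-confidence points
confident <- matrix(runif(40), ncol = 2)    # high-confidence points
eps <- 0.05                                 # buffer radius

# for each noisy point, find all high-confidence points within eps
nn <- frNN(confident, eps = eps, query = noisy)

# keep only the noisy points that fall inside the buffer
keep   <- lengths(nn$id) > 0
culled <- noisy[keep, , drop = FALSE]
```

This avoids building any polygons or a full distance matrix, so it should be much faster than a GIS-style buffer for plain point-in-radius culling.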
I noticed an edge case in LOF:
> dbscan::lof(dist(c(1,2,3,4,5,6,7)), k=3)
[1] 1.0555556 1.0555556 1.0555556 0.9047619 0.9047619 1.1111111 1.1111111
By symmetry, points 1, 2, 3 should respectively have the same LOFs as points 7, 6, 5. According to my own calculation the answer should be:
[1] 1.0679012 1.0679012 1.0133929 0.8730159 1.0133929 1.0679012 1.0679012
I think this is happening because when determining the nearest neighbors, the code uses kNN which selects exactly k points, but in the LOF calculation the neighborhood is supposed to include all points that are as close as the k-th nearest neighbor, which in the case of ties (like here) can include more than k points.
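The tie is easy to see in base R; for point 3 the k = 3 distance is 2, but two points sit exactly at that distance:

```r
x <- c(1, 2, 3, 4, 5, 6, 7)
d <- as.matrix(dist(x))
sort(d[3, -3])   # 1 1 2 2 3 4: points 1 and 5 tie at the k = 3 distance,
                 # so the LOF neighborhood of point 3 should hold 4 points
```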
Can DBSCAN be used on data with non-continuous variables? For example, a dummy for married (0 or 1) or a multi-level categorical variable for marital status (married, single, divorced, widowed, etc.). If so, how? I get an error that "x has to be a numeric matrix".
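dbscan needs numeric input, but it also accepts a precomputed dist object, so one common route is Gower dissimilarities from cluster::daisy (a sketch with made-up data; whether Gower distances are meaningful for your variables is a modelling judgment):

```r
library(cluster)   # for daisy()
library(dbscan)

# toy mixed-type data: one numeric and one categorical variable
df <- data.frame(income  = c(10, 12, 50, 52, 11),
                 married = factor(c("yes", "no", "yes", "no", "yes")))

# Gower dissimilarities handle mixed variable types
d <- daisy(df, metric = "gower")

# cluster on the precomputed distances instead of a numeric matrix
cl <- dbscan(d, eps = 0.3, minPts = 2)
```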
I noticed that it is possible to extract an arbitrary number of clusters with hdbscan by using the cutree function on the hc component of the hdbscan output. But is there any simple way to get the membership probabilities for each element given a fixed number of clusters? I.e. a matrix which gives the cluster probabilities for each element and cluster (such as fanny in the cluster package)?
It would be great to generate a package website hosted via GitHub Pages so that the reference documentation is easier to search, etc. This GitHub Action would suffice: https://github.com/r-lib/actions/blob/v2-branch/examples/pkgdown.yaml If interested, I can also assist with creating a PR.
This is a feature request rather than an issue, strictly speaking: it seems there should be an option to have weights in OPTICS, as there is for DBSCAN. Please correct me if this thinking is faulty.
I have been using LOF for analysing more than 10,000 data points. However, the lof function in the dbscan package returns NaN values.
I would like to analyze trajectories (x, y, t) with multiple values of x and y. First I tried scikit-learn in Python (like this Q&A: https://stackoverflow.com/questions/52926477/run-dbscan-on-trajectories), but it turned out that we cannot do DBSCAN on multiple x and y values in scikit-learn (only x or y, or a single x and a single y, would work).
With R, I attempted to do DBSCAN with the data below, but it required a matrix.
data <- list(list(c(0, 1), c(0, 1)), list(c(1, 1), c(1, 1)), list(c(1, 3), c(1, 3)))
Is it possible to do DBSCAN for multiple x and y values with your package? If possible, how can I input the data?
Should the input be like x1, x2, x3, ..., y1, y2, y3, ..., without correspondence between x and y? (Hopefully there is a way to keep the correspondence of x and y...)
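If every trajectory has the same number of samples, one option (a sketch using your toy data; the flattening scheme is my own suggestion, not a package feature) is to give dbscan one row per trajectory, with the x and y samples as columns, so the x/y correspondence is kept by column position:

```r
library(dbscan)

# three trajectories, each with two (x, y) samples
traj <- list(list(x = c(0, 1), y = c(0, 1)),
             list(x = c(1, 1), y = c(1, 1)),
             list(x = c(1, 3), y = c(1, 3)))

# flatten to a matrix: one row per trajectory, columns x1, x2, y1, y2
m <- t(sapply(traj, function(tr) c(tr$x, tr$y)))

cl <- dbscan(m, eps = 1.5, minPts = 2)
```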
Originally I was using the following code to create a model and make predictions (for simplicity I'm making predictions on the learning data itself):
model=dbscan(learnData,0.5,minPts =7);
predict(model,testData,learnData)
In R 3.6.0 the code used to work fine. We upgraded to R 4.0.2 (and therefore a newer dbscan) and it stopped working with the following error:
Error in frNN(data, query = newdata, eps = eps, sort = TRUE, ...) :
x has to be a numeric matrix.
I found this issue that looks the same, but was closed:
#14
I even tried the code that was given in the closed issue as an example but it also fails with the same error:
library('dbscan')
data(iris)
d <- cluster::daisy(iris, metric = "gower", stand = TRUE)
model <- dbscan(d, eps = .23, minPts = 50)
predict(model, newdata = iris[1:5,], data = iris)
I also tried upgrading to R 4.2.1 and made sure I'm on dbscan_1.1-10, but I still get the same error.
Hi there, I'm having an issue with predicting cluster labels for test data, based on a dbscan clustering model fit on the training data. I used a Gower distance matrix when creating the model:
> gowerdist_train <- daisy(analdata_train,
metric = "gower",
stand = FALSE,
type = list(asymm = c(5,6)))
Using this gowerdist matrix, the dbscan clustering model created was:
> sb <- dbscan(gowerdist_train, eps = .23, minPts = 50)
Then I try to use predict to label a test dataset using the above dbscan object:
> predict(sb, newdata = analdata_test, data = analdata_train)
But I receive the following error:
Error in frNN(rbind(data, newdata), eps = object$eps, sort = TRUE,
...) : x has to be a numeric matrix
I can take a guess at where this error might be coming from: probably the absence of a Gower distance matrix for the test data. My question is, should I create a Gower distance matrix for all data (analdata_train + analdata_test) separately and feed it into predict? How else would the algorithm know the distance of the test data from the training data in order to assign labels?
In that case, would the newdata parameter be the new gower distance matrix that contains ALL (train + test) data? and the data parameter in predict would be the training distance matrix, gowerdist_train?
What I am not quite sure about is how would the predict algorithm distinguish between the test and train data set in the newly created gowerdist_all matrix?
The two matrices (the new gowerdist for all data and gowerdist_train) would obviously not have the same dimensions. Also, it doesn't make sense to me to create a Gower distance matrix only for the test data, because the distances must be relative to the training data, not the test data itself.
It hasn't been very clear in the dbscan documentation how to use the predict function for distance-matrix data (as opposed to raw data).
I tried using a Gower distance matrix for all data (train + test) as my new data and received an error when it was fed to predict:
> gowerdist_all <- daisy(rbind(analdata_train, analdata_test),
metric = "gower",
stand = FALSE,
type = list(asymm = c(5,6)))
> test_sb_label <- predict(sb, newdata = gowerdist_all, data = gowerdist_train)
ERROR: Error in 1:nrow(data) : argument of length 0 In addition: Warning message: In rbind(data, newdata) : number of columns of result is not a multiple of vector length (arg 1)
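Since predict() insists on a numeric matrix, a workaround sketch (my own, not a documented interface; toy data stand in for analdata_train/analdata_test) is to compute the test-to-train block of Gower distances yourself and let each test point inherit the cluster of its nearest train point within eps:

```r
library(cluster)
library(dbscan)

# toy mixed-type stand-ins for the train and test sets
set.seed(1)
train <- data.frame(a = rnorm(20), b = factor(sample(c("x", "y"), 20, TRUE)))
test  <- data.frame(a = rnorm(5),  b = factor(sample(c("x", "y"), 5, TRUE)))

sb <- dbscan(daisy(train, metric = "gower"), eps = 0.25, minPts = 3)

# Gower distances over both sets, then keep the test-rows x train-columns block
all_d <- as.matrix(daisy(rbind(train, test), metric = "gower"))
cross <- all_d[-(1:nrow(train)), 1:nrow(train), drop = FALSE]

# nearest train point per test point; inherit its cluster if within eps,
# otherwise label the test point as noise (0)
nearest    <- apply(cross, 1, which.min)
min_dist   <- cross[cbind(seq_len(nrow(cross)), nearest)]
test_label <- ifelse(min_dist <= sb$eps, sb$cluster[nearest], 0L)
```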
The kNN function gave me a segmentation fault when some of the values were infinite. This was fixed by capping the numbers calculated. Apologies for not reproducing the data/code...
Best,
Yonatan
Hello,
I don't really know if it's a bug or not, but when I write this in R:
dbscan::dbscan(cbind(
x = runif(10, 0, 10) + rnorm(100, sd = 0.2),
y = runif(10, 0, 10) + rnorm(100, sd = 0.2)
), eps = NA, minPts = NA)
R returns this:
DBSCAN clustering for 100 objects.
Parameters: eps = NA, minPts = NA
The clustering contains 18 cluster(s) and 0 noise points.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
10 3 10 9 3 10 8 6 10 8 8 8 1 1 1 1 1 2
Available fields: cluster, eps, minPts
But what do R or Rcpp do with the NA values? NA is neither infinite nor null. If values for eps and minPts are somehow calculated, how are they calculated, and how can I extract them, if this is not a bug?
Thank you,
Arthur
For a trained HDBSCAN object, I would like to predict the cluster for new data points, similar to what is described here. I see that such functionality exists for DBSCAN in the function predict.dbscan_fast(), but it is missing for hdbscan.
Would it be possible to implement a predict.hdbscan() function similar to the one for dbscan_fast? Is there any technical reason why this function doesn't exist? Otherwise, I'd be happy to try to create a PR for that.
Hello, thank you for this very helpful package.
Would you be open to a pull request adding broom tidier methods for dbscan and hdbscan objects?
Dear author,
I have started a detailed examination of the hdbscan function contained in the dbscan package because I observed an anomalous distribution of the outlier scores for my dataset. In fact, the scores tended to group all around two values: 0 and 1.
In order to grasp the reasons behind this problem I have tried to apply the hdbscan function both in R and in Python to the same dataset.
The result is that the Python distribution of the outlier scores makes a lot more sense than the one output by R on the same dataset.
Is there a known issue on this?
Would you be able to fix this?
Thanks in advance for your feedback
Best regards
Matteo
Add box-based trees to the NN interface.
Currently the complete distance matrix is computed in the hdbscan function. Is it possible that parts of it are computed and used sequentially for the mutual reachability distance such that it could be stored in smaller objects? I currently get an error message about too large vector size when using the function on a large dataset.
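For reference, the mutual reachability distance of a single pair follows directly from its definition, which is what would make a block-wise scheme conceivable (this is just the definition, not how the package currently computes it):

```r
# mrd(i, j) = max(coredist(i), coredist(j), d(i, j))
mrd_pair <- function(i, j, X, coredist) {
  d_ij <- sqrt(sum((X[i, ] - X[j, ])^2))   # Euclidean distance of the pair
  max(coredist[i], coredist[j], d_ij)
}
```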
Currently hdbscan documents the following parameters, but only x, minPts, gen_hdbscan_tree and gen_simplified_tree are used. Is that intended?
#' @param x a data matrix (Euclidean distances are used) or a [dist] object
#' calculated with an arbitrary distance metric.
#' @param minPts integer; Minimum size of clusters. See details.
#' @param gen_hdbscan_tree logical; should the robust single linkage tree be
#' explicitly computed (see cluster tree in Chaudhuri et al, 2010).
#' @param gen_simplified_tree logical; should the simplified hierarchy be
#' explicitly computed (see Campello et al, 2013).
#' @param verbose report progress.
#' @param ... additional arguments are passed on.
#' @param scale integer; used to scale condensed tree based on the graphics
#' device. Lower scale results in wider trees.
#' @param gradient character vector; the colors to build the condensed tree
#' coloring with.
#' @param show_flat logical; whether to draw boxes indicating the most stable
#' clusters.
#' @param coredist numeric vector with precomputed core distances (optional).
I have a large distance matrix and want to first build an frNN object from scratch to reduce the memory burden. I first initialize an frNN object with one node and then add my distances and node ids to this object.
frNN_dis <- function(dis, eps = 0, if_direct = TRUE) {
  if (if_direct) {
    # make the edge list symmetric by adding the reversed pairs
    dis <- rbind(dis, dis[, c(2, 1, 3)])
  }
  dis <- as.data.frame(dis)
  colnames(dis) <- c("umi1", "umi2", "dis")
  dis <- dis[order(dis$umi1), ]
  out_frNN <- frNN(as.dist(1), eps = 5)
  out_frNN$dist <- split(dis$dis, dis$umi1)
  out_frNN$id <- split(dis$umi2, dis$umi1)
  out_frNN$sort <- FALSE
  out_frNN <- frNN(out_frNN, eps = eps)
  return(out_frNN)
}
But when I used an frNN object built in this way in dbscan, it caused a segfault:
*** caught segfault ***
address 0x51, cause 'memory not mapped'
Traceback:
1: dbscan_int(x, as.double(eps), as.integer(minPts), as.double(weights), as.integer(borderPoints), as.integer(search), as.integer(bucketSize), as.integer(splitRule), as.double(approx), frNN)
2: dbscan(cpp_test_nn, eps = 0, minPts = 1)
3: eval(expr, envir, enclos)
4: eval(expr, envir, enclos)
5: withVisible(eval(expr, envir, enclos))
6: withCallingHandlers(withVisible(eval(expr, envir, enclos)), warning = wHandler, error = eHandler, message = mHandler)
7: doTryCatch(return(expr), name, parentenv, handler)
8: tryCatchOne(expr, names, parentenv, handlers[[1L]])
9: tryCatchList(expr, classes, parentenv, handlers)
10: tryCatch(expr, error = function(e) { call <- conditionCall(e) if (!is.null(call)) { if (identical(call[[1L]], quote(doTryCatch))) call <- sys.call(-4L) dcall <- deparse(call)[1L] prefix <- paste("Error in", dcall, ": ") LONG <- 75L sm <- strsplit(conditionMessage(e), "\n")[[1L]] w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w") if (is.na(w)) w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L], type = "b") if (w > LONG) prefix <- paste0(prefix, "\n ") } else prefix <- "Error : " msg <- paste0(prefix, conditionMessage(e), "\n") .Internal(seterrmessage(msg[1L])) if (!silent && isTRUE(getOption("show.error.messages"))) { cat(msg, file = outFile) .Internal(printDeferredWarnings()) } invisible(structure(msg, class = "try-error", condition = e))})
11: try(f, silent = TRUE)
12: handle(ev <- withCallingHandlers(withVisible(eval(expr, envir, enclos)), warning = wHandler, error = eHandler, message = mHandler))
13: timing_fn(handle(ev <- withCallingHandlers(withVisible(eval(expr, envir, enclos)), warning = wHandler, error = eHandler, message = mHandler)))
14: evaluate_call(expr, parsed$src[[i]], envir = envir, enclos = enclos, debug = debug, last = i == length(out), use_try = stop_on_error != 2L, keep_warning = keep_warning, keep_message = keep_message, output_handler = output_handler, include_timing = include_timing)
15: evaluate(request$content$code, envir = .GlobalEnv, output_handler = oh, stop_on_error = 1L)
16: doTryCatch(return(expr), name, parentenv, handler)
17: tryCatchOne(expr, names, parentenv, handlers[[1L]])
18: tryCatchList(expr, names[-nh], parentenv, handlers[-nh])
19: doTryCatch(return(expr), name, parentenv, handler)
20: tryCatchOne(tryCatchList(expr, names[-nh], parentenv, handlers[-nh]), names[nh], parentenv, handlers[[nh]])
21: tryCatchList(expr, classes, parentenv, handlers)
22: tryCatch(evaluate(request$content$code, envir = .GlobalEnv, output_handler = oh, stop_on_error = 1L), interrupt = function(cond) { log_debug("Interrupt during execution") interrupted <<- TRUE}, error = .self$handle_error)
23: executor$execute(msg)
24: handle_shell()
25: kernel$run()
26: IRkernel::main()
An irrecoverable exception occurred. R is aborting now ...
The link to download and install is not working for me. I used:
install.packages("dbscan")
http://cran.rstudio.com/bin/windows/contrib/3.2/dbscan_0.9-6.zip
Could you clarify whether multi-density clustering is implemented, since it is mentioned in the references?
@inproceedings{ghanbarpour2014exdbscan,
title={EXDBSCAN: An extension of DBSCAN to detect clusters in multi-density datasets},
author={Ghanbarpour, Asieh and Minaei, Behrooz},
booktitle={Intelligent Systems (ICIS), 2014 Iranian Conference on},
pages={1--5},
year={2014},
organization={IEEE}
}
Malzer & Baum describe how adding a minimum threshold value of eps to HDBSCAN can help with 'micro-clusters' in high-density regions. This is implemented in the hdbscan Python package as a cluster_selection_epsilon parameter. Perhaps it could be added to this package too?
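Until something like this exists, a rough workaround (my sketch; it is not equivalent to the Python parameter, just a way to merge clusters that join below a minimum eps) is to cut the single-linkage hierarchy that hdbscan returns:

```r
library(dbscan)

X  <- as.matrix(iris[, 1:4])
cl <- hdbscan(X, minPts = 10)

# merge everything that is linked below a hypothetical minimum eps of 0.5
flat <- cutree(cl$hc, h = 0.5)
```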
install_github("mhahsler/dbscan")
Downloading github repo mhahsler/dbscan@master
Installing dbscan
'/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file --no-environ
--no-save --no-restore CMD INSTALL
'/private/var/folders/n9/hh04k0tn79q9q501k2b4rbbw0000gn/T/RtmpLkatHj/devtools3b955cbbe805/mhahsler-dbscan-1d0a3ac'
--library='/Library/Frameworks/R.framework/Versions/3.1/Resources/library'
--install-tests
This command previously worked; now it fails:
dbscan::lof(x = dt, k = 3, sort = F, approx = 1.1)
Error in kNN(x, k, sort = TRUE, ...) : formal argument "sort" matched by multiple actual arguments
I used the function sNN() to find the shared neighbors of points, but I noticed a strange result when I tested this function on an easy example. The code and results are shown below:
testdata <- c(-2,-1,0,1,2,2.4,2.5,3,3.5,4)
distancematrix <- dist(testdata,method = "minkowski",p=2)
test_res <- sNN(x=distancematrix,k=5,sort = FALSE)
test_res$id[3,]
1 2 3 4 5
2 4 1 5 6
test_res$shared[3,]
[1] 5 4 5 0 0
The shared k=5 neighbors of '0' and '2' are '1' and '2', but sNN() says they have no shared neighbors. Did I do anything wrong? Any suggestions will be appreciated!
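A manual recount of the k = 5 neighborhoods in base R (my own check) seems to support this:

```r
x <- c(-2, -1, 0, 1, 2, 2.4, 2.5, 3, 3.5, 4)
d <- as.matrix(dist(x))

# the ids of the 5 nearest neighbors of every point (self excluded)
nn <- apply(d, 2, function(r) order(r)[2:6])

nn[, 3]                      # neighborhood of point 3 (value 0)
nn[, 5]                      # neighborhood of point 5 (value 2)
intersect(nn[, 3], nn[, 5])  # two shared neighbors (points 4 and 6), not zero
```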
Currently,
xlab = "Pointes (sample) sorted by distance"
which I believe should be changed to
xlab = "Points (sample) sorted by distance"
Thanks a lot for the great package.
Running dbscan with minPts = 1 and eps = 50 on the attached data returns 11 clusters; I am expecting 10 clusters. Am I missing something?
Column A is the X data, column B is the Y data. Please note this data is always sorted smallest to largest, and the Y data is always 1.
Please look at tab 42127348110000; I have added tab 42127348110000-manual clusters to show my assumptions for the clusters.
I am running R as follows:
data <- read.csv(file.names[i])
res <- dbscan(data, eps = 50, minPts = 1)
thanks.
I can install dbscan on my laptop, no problem.
But it is not installing on a Linux cluster I use for big data.
I have tried with R 3.3.3 and 3.4.0. I get the same error after invoking
install.packages("dbscan")
Error Message:
buildHDBSCAN.cpp(45): error: more than one operator "==" matches these operands:
built-in operator "pointer == pointer"
function "Rcpp::operator==(Rcpp::Na_Proxy, SEXP)"
operand types are: Rcpp::internal::generic_name_proxy<19> == SEXP
if (!hcl.containsElementNamed("labels") || hcl["labels"] == R_NilValue){
^
compilation aborted for buildHDBSCAN.cpp (code 2)
make: ***
[/cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2 /Compiler/intel2016.4/r/3.3.3/lib64/R/etc/Makeconf:141: buildHDBSCAN.o] Error 2
ERROR: compilation failed for package ‘dbscan’
removing ‘/home/xxxxxx/R/x86_64-pc-linux-gnu-library/3.3/dbscan’
The downloaded source packages are in
‘/tmp/RtmpI1SJp8/downloaded_packages’
Warning message:
In install.packages("dbscan") :
installation of package ‘dbscan’ had non-zero exit status
Hi,
I would like to have access to the min_samples and cluster_selection_method tunable parameters of the hdbscan function.
In the SciKit-learn docs (https://hdbscan.readthedocs.io/en/latest/parameter_selection.html) that the HDBSCAN vignette refers to, there is a chapter on parameter selection for HDBSCAN. While the current implementation of HDBSCAN in the dbscan package for R has only one tunable parameter, minPts, more parameters (including min_samples and cluster_selection_method) are described by the chapter. One scenario that the chapter describes in relation to the cluster_selection_method is:
If you are more interested in having small homogeneous clusters then you may find Excess of Mass has a tendency to pick one or two large clusters and then a number of small extra clusters. In this situation you may be tempted to recluster just the data in the single large cluster. Instead, a better option is to select 'leaf' as a cluster selection method.
This is very similar to what I get with my data (the dimensionality is roughly 4000-by-40): I obtain several smaller clusters (which are better separated) and one "mega-cluster".
I am quite certain that the "mega-cluster" has some meaningful structure within it that I would like to have resolved. From what I read in the SciKit-learn docs chapter, it seems possible to achieve this by tuning those other parameters, particularly the cluster_selection_method. Is there a way to control and input explicit values to the min_samples and cluster_selection_method parameters in the current hdbscan function from the dbscan package for R, or would it be possible to add this feature? Thank you.
My R session info:
> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.3
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] pRolocGUI_1.11.2 RColorBrewer_1.1-2 ggplot2_2.2.1 dbscan_1.1-1 pRoloc_1.19.1
[6] MLInterfaces_1.56.0 cluster_2.0.6 annotate_1.54.0 XML_3.98-1.10 AnnotationDbi_1.38.2
[11] IRanges_2.10.5 S4Vectors_0.14.7 MSnbase_2.2.0 ProtGenerics_1.8.0 BiocParallel_1.10.1
[16] mzR_2.10.0 Rcpp_0.12.15 Biobase_2.36.2 BiocGenerics_0.22.1
loaded via a namespace (and not attached):
[1] plyr_1.8.4 igraph_1.1.2 lazyeval_0.2.1 splines_3.4.2
[5] ggvis_0.4.3 crosstalk_1.0.0 digest_0.6.15 foreach_1.4.4
[9] BiocInstaller_1.26.1 htmltools_0.3.6 viridis_0.5.0 gdata_2.18.0
[13] magrittr_1.5 memoise_1.1.0 doParallel_1.0.11 sfsmisc_1.1-1
[17] limma_3.32.10 recipes_0.1.2 gower_0.1.2 rda_1.0.2-2
[21] dimRed_0.1.0 lpSolve_5.6.13 colorspace_1.3-2 blob_1.1.0
[25] dplyr_0.7.4 RCurl_1.95-4.10 hexbin_1.27.2 genefilter_1.58.1
[29] bindr_0.1 impute_1.50.1 survival_2.41-3 iterators_1.0.9
[33] glue_1.2.0 DRR_0.0.3 gtable_0.2.0 ipred_0.9-6
[37] zlibbioc_1.22.0 kernlab_0.9-25 ddalpha_1.3.1.1 prabclus_2.2-6
[41] DEoptimR_1.0-8 scales_0.5.0 vsn_3.44.0 mvtnorm_1.0-7
[45] DBI_0.7 viridisLite_0.3.0 xtable_1.8-2 foreign_0.8-69
[49] bit_1.1-12 proxy_0.4-21 mclust_5.4 preprocessCore_1.38.1
[53] DT_0.4 lava_1.6 prodlim_1.6.1 htmlwidgets_1.0
[57] sampling_2.8 threejs_0.3.1 FNN_1.1 fpc_2.1-11
[61] modeltools_0.2-21 pkgconfig_2.0.1 flexmix_2.3-14 nnet_7.3-12
[65] caret_6.0-78 labeling_0.3 tidyselect_0.2.3 rlang_0.2.0
[69] reshape2_1.4.3 munsell_0.4.3 mlbench_2.1-1 tools_3.4.2
[73] RSQLite_2.0 pls_2.6-0 broom_0.4.3 stringr_1.3.0
[77] yaml_2.1.16 mzID_1.14.0 ModelMetrics_1.1.0 knitr_1.20
[81] bit64_0.9-7 robustbase_0.92-8 randomForest_4.6-12 purrr_0.2.4
[85] dendextend_1.7.0 bindrcpp_0.2 nlme_3.1-131.1 whisker_0.3-2
[89] mime_0.5 RcppRoll_0.2.2 biomaRt_2.32.1 compiler_3.4.2
[93] e1071_1.6-8 affyio_1.46.0 tibble_1.4.2 stringi_1.1.6
[97] lattice_0.20-35 trimcluster_0.1-2 Matrix_1.2-12 psych_1.7.8
[101] gbm_2.1.3 pillar_1.1.0 MALDIquant_1.17 bitops_1.0-6
[105] httpuv_1.3.5 R6_2.2.2 pcaMethods_1.68.0 affy_1.54.0
[109] hwriter_1.3.2 gridExtra_2.3 codetools_0.2-15 MASS_7.3-48
[113] gtools_3.5.0 assertthat_0.2.0 CVST_0.2-1 withr_2.1.1
[117] mnormt_1.5-5 diptest_0.75-7 grid_3.4.2 rpart_4.1-12
[121] timeDate_3043.102 tidyr_0.8.0 class_7.3-14 Rtsne_0.13
[125] shiny_1.0.5 lubridate_1.7.2 base64enc_0.1-3
Thank you for the fantastic dbscan package.
I believe I am running into an error when using dbscan::hdbscan on large datasets:
Error in mrd(x_dist, coredist) :
number of mutual reachability distance values and size of the distances do not agree.
Here is a reproducible example that generates a large dataset. With n <- 20000 the code runs without errors; values of 30000 and above seem to trigger the error.
library(dbscan)
library(uwot)
n <- 30000
ngenes <- 3000
n.means <- 2^runif(ngenes, 2, 10)
n.disp <- 100/n.means + 0.5
set.seed(1000)
m1 <- matrix(rnbinom(ngenes*n, mu=n.means, size=1/n.disp), ncol=n)
set.seed(1000)
m2 <- matrix(rnbinom(ngenes*n, mu=5*n.means, size=1/n.disp), ncol=n)
m <- cbind(m1, m2)
um <- uwot::umap(t(m))
hd <- dbscan::hdbscan(um, 10)
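For context, one guess (an assumption on my part, not a confirmed diagnosis): hdbscan() works on the full set of pairwise distances, a quantity that grows quadratically in n, so the expected allocation can be estimated up front:

```r
# Size of the lower-triangle distance vector that dist() would materialize
# for n points (doubles, 8 bytes each)
dist_size_gb <- function(n) {
  n_pairs <- n * (n - 1) / 2  # number of pairwise distances
  n_pairs * 8 / 1024^3        # gigabytes
}
dist_size_gb(20000)  # ~1.49 GB: runs without errors in the example above
dist_size_gb(30000)  # ~3.35 GB: triggers the error in the example above
```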
I couldn't pinpoint exactly why it happens, but it seems to occur when dbscan::mrdist() is called.
Possibly related to #46? Apologies if it would have been better to continue the conversation there; however, I do not get a memory error in this case, and I confirmed that there was enough memory to allocate the object.
Session Info:
R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS
Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] uwot_0.1.10 Matrix_1.3-4 dbscan_1.1-10
loaded via a namespace (and not attached):
[1] backports_1.3.0 circlize_0.4.13 igraph_1.2.8
[4] lazyeval_0.2.2 splines_4.1.1 flowCore_2.6.0
[7] BiocParallel_1.28.0 usethis_2.1.3 GenomeInfoDb_1.30.0
[10] ggplot2_3.3.5 amap_0.8-18 digest_0.6.28
[13] foreach_1.5.1 yulab.utils_0.0.4 htmltools_0.5.2
[16] viridis_0.6.2 ggalluvial_0.12.3 fansi_0.5.0
[19] magrittr_2.0.1 ScaledMatrix_1.2.0 cluster_2.1.2
[22] doParallel_1.0.16 mixtools_1.2.0 limma_3.50.0
[25] fastcluster_1.2.3 ComplexHeatmap_2.10.0 RcppParallel_5.1.4
[28] matrixStats_0.61.0 cytolib_2.6.0 colorspace_2.0-2
[31] xfun_0.28 dplyr_1.0.7 crayon_1.4.2
[34] RCurl_1.98-1.5 jsonlite_1.7.2 survival_3.2-13
[37] iterators_1.0.13 ape_5.5 glue_1.4.2
[40] gtable_0.3.0 zlibbioc_1.40.0 XVector_0.34.0
[43] Rsubread_2.8.1 GetoptLong_1.0.5 DelayedArray_0.20.0
[46] car_3.0-12 BiocSingular_1.10.0 kernlab_0.9-29
[49] shape_1.4.6 SingleCellExperiment_1.16.0 prabclus_2.3-2
[52] BiocGenerics_0.40.0 DEoptimR_1.0-9 abind_1.4-5
[55] scales_1.1.1 DBI_1.1.1 edgeR_3.36.0
[58] rstatix_0.7.0 miniUI_0.1.1.1 Rcpp_1.0.7
[61] viridisLite_0.4.0 xtable_1.8-4 clue_0.3-60
[64] gridGraphics_0.5-1 tidytree_0.3.5 dqrng_0.3.0
[67] rsvd_1.0.5 mclust_5.4.7 stats4_4.1.1
[70] metapod_1.2.0 gplots_3.1.1 RColorBrewer_1.1-2
[73] fpc_2.2-9 modeltools_0.2-23 ellipsis_0.3.2
[76] pkgconfig_2.0.3 flexmix_2.3-17 scuttle_1.4.0
[79] nnet_7.3-16 copykit_0.0.0.9039 janitor_2.1.0
[82] locfit_1.5-9.4 utf8_1.2.2 DNAcopy_1.68.0
[85] ggplotify_0.1.0 tidyselect_1.1.1 rlang_0.4.12
[88] later_1.3.0 munsell_0.5.0 tools_4.1.1
[91] generics_0.1.1 broom_0.7.10 evaluate_0.14
[94] stringr_1.4.0 fastmap_1.1.0 yaml_2.2.1
[97] ggtree_3.2.0 fs_1.5.0 knitr_1.36
[100] robustbase_0.93-9 caTools_1.18.2 purrr_0.3.4
[103] nlme_3.1-153 sparseMatrixStats_1.6.0 mime_0.12
[106] scran_1.22.1 aplot_0.1.1 pracma_2.3.3
[109] compiler_4.1.1 beeswarm_0.4.0 png_0.1-7
[112] treeio_1.18.0 tibble_3.1.5 statmod_1.4.36
[115] stringi_1.7.5 forcats_0.5.1 lattice_0.20-45
[118] bluster_1.4.0 vctrs_0.3.8 copynumber_1.34.0
[121] pillar_1.6.4 lifecycle_1.0.1 GlobalOptions_0.1.2
[124] RcppAnnoy_0.0.19 BiocNeighbors_1.12.0 data.table_1.14.2
[127] cowplot_1.1.1 bitops_1.0-7 irlba_2.3.3
[130] httpuv_1.6.3 patchwork_1.1.1 GenomicRanges_1.46.0
[133] R6_2.5.1 promises_1.2.0.1 RProtoBufLib_2.6.0
[136] KernSmooth_2.23-20 gridExtra_2.3 vipor_0.4.5
[139] IRanges_2.28.0 codetools_0.2-18 boot_1.3-28
[142] MASS_7.3-54 gtools_3.9.2 assertthat_0.2.1
[145] SummarizedExperiment_1.24.0 rjson_0.2.20 withr_2.4.3
[148] S4Vectors_0.32.2 GenomeInfoDbData_1.2.7 diptest_0.76-0
[151] parallel_4.1.1 grid_4.1.1 ggfun_0.0.4
[154] beachmat_2.10.0 tidyr_1.1.4 class_7.3-19
[157] snakecase_0.11.0 rmarkdown_2.11 DelayedMatrixStats_1.16.0
[160] carData_3.0-4 segmented_1.3-4 MatrixGenerics_1.6.0
[163] ggnewscale_0.4.5 lubridate_1.8.0 Biobase_2.54.0
[166] shiny_1.7.1 ggbeeswarm_0.6.0
Thank you, let me know if I can help more.
That is, Hierarchical Density-Based Spatial Clustering of Applications with Noise. Python implementation here: https://github.com/lmcinnes/hdbscan
It's nice because you don't have to tune the eps parameter, which makes it a little easier to use. It would be awesome to have an implementation of hdbscan in the R package dbscan.
Thank you!
Kind of a duplicate of #20
We get an error compiling buildHDBSCAN.cpp with Intel's icpc on our HPC:
buildHDBSCAN.cpp(46): error: more than one operator "==" matches these operands:
built-in operator "pointer == pointer"
function "Rcpp::operator==(Rcpp::Na_Proxy, SEXP)"
operand types are: Rcpp::internal::generic_name_proxy<19, Rcpp::PreserveStorage> == SEXP
if (!hcl.containsElementNamed("labels") || hcl["labels"] == R_NilValue){
from the command
icpc -std=c++11 -I/cm/shared/uniol/software/R/3.3.1-intel-2016b/lib64/R/include -DNDEBUG -I/cm/shared/uniol/software/imkl/11.3.3.210-iimpi-2016b/mkl/include -I/cm/shared/uniol/software/libreadline/6.3-intel-2016b/include -I/cm/shared/uniol/software/ncurses/6.0-intel-2016b/include -I/cm/shared/uniol/software/bzip2/1.0.6-intel-2016b/include -I/cm/shared/uniol/software/XZ/5.2.2-intel-2016b/include -I/cm/shared/uniol/software/zlib/1.2.8-intel-2016b/include -I/cm/shared/uniol/software/SQLite/3.13.0-intel-2016b/include -I/cm/shared/uniol/software/PCRE/8.38-intel-2016b/include -I/cm/shared/uniol/software/libpng/1.6.23-intel-2016b/include -I/cm/shared/uniol/software/libjpeg-turbo/1.5.0-intel-2016b/include -I/cm/shared/uniol/software/LibTIFF/4.0.6-intel-2016b/include -I/cm/shared/uniol/software/Java/1.8.0_112/include -I/cm/shared/uniol/software/Tcl/8.6.5-intel-2016b/include -I/cm/shared/uniol/software/Tk/8.6.5-intel-2016b/include -I/cm/shared/uniol/software/cURL/7.49.1-intel-2016b/include -I/cm/shared/uniol/software/libxml2/2.9.4-intel-2016b/include -I/cm/shared/uniol/software/GDAL/2.1.0-intel-2016b/include -I/cm/shared/uniol/software/PROJ/4.9.2-intel-2016b/include -I/cm/shared/uniol/software/GMP/6.1.1-intel-2016b/include -I"/user/sebo8575/R/x86_64-pc-linux-gnu-library/3.3/Rcpp/include" -fpic -O2 -xHost -ftz -fp-speculation=safe -fp-model source -c buildHDBSCAN.cpp -o buildHDBSCAN.o
which can be solved by casting hcl["labels"] to SEXP; Intel's compiler seems to be pickier than gcc. The first patch is:
diff --git a/src/buildHDBSCAN.cpp b/src/buildHDBSCAN.cpp
index 89e8e71..d8dee7b 100644
--- a/src/buildHDBSCAN.cpp
+++ b/src/buildHDBSCAN.cpp
@@ -43,7 +43,7 @@ List buildDendrogram(List hcl) {
NumericVector height = hcl["height"];
IntegerVector order = hcl["order"];
List labels = List(); // allows to avoid type inference
- if (!hcl.containsElementNamed("labels") || hcl["labels"] == R_NilValue){
+ if (!hcl.containsElementNamed("labels") || (SEXP)hcl["labels"] == R_NilValue){
labels = seq_along(order);
} else {
labels = as<StringVector>(hcl["labels"]);
Maybe you can integrate a similar patch into the next release.
You do not give details on the indexes chosen in the comparison methods. From the runtime numbers, it appears that you did not enable an index in ELKI? Do the runtimes include JVM startup cost and R startup cost, or did you use internal runtime measurements (e.g. -time)?
https://rdrr.io/cran/dbscan/f/inst/doc/dbscan.pdf
I suggest making separate benchmarks with and without indexing, and experimenting with different indexes such as the cover tree, which performed quite well in other studies. Benchmarking an implementation without an index against one with an index is obviously unfair. For OPTICS, I suggest using a small epsilon whenever an index is used, for obvious reasons.
Furthermore, benchmarks on such tiny data sets (<= 8000 instances) are likely not very informative; startup costs such as JVM hotspot compilation will contribute substantially.
Here is a slightly larger (but still small!) real data set with 13k instances: https://cs.joensuu.fi/sipu/datasets/MopsiLocationsUntil2012-Finland.txt; with appropriate parameters it should work with both DBSCAN and OPTICS. For more meaningful results, I suggest using at least 30,000-50,000 instances of real data, e.g. coordinates from tweets (supposedly not shareable) or Flickr images (could be shareable?). With indexes, the data size should probably go up to a million points.
Also, the R dbscan package documentation should be more upfront about performance depending strongly on the use of Euclidean distance because of the kd-tree. The ANN library is great, but this is a substantial limitation. For a tweet-locations example, Haversine distance is much more appropriate.
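Since dbscan() also accepts a precomputed dist object, one way around the Euclidean-only kd-tree (a sketch with made-up coordinates; the hav() helper below is my own, not part of the package) is to supply Haversine distances directly:

```r
library(dbscan)

# Toy (lon, lat) coordinates in degrees; hypothetical stand-in data
set.seed(1)
pts <- cbind(lon = runif(500, 24, 31), lat = runif(500, 60, 70))

# Pairwise Haversine distances in km (simple O(n^2) helper, small data only)
hav <- function(p) {
  rad <- pi / 180; r <- 6371
  la <- p[, "lat"] * rad; lo <- p[, "lon"] * rad
  d <- outer(seq_along(la), seq_along(la), function(i, j) {
    a <- sin((la[j] - la[i]) / 2)^2 +
      cos(la[i]) * cos(la[j]) * sin((lo[j] - lo[i]) / 2)^2
    2 * r * asin(pmin(1, sqrt(a)))
  })
  as.dist(d)
}

cl <- dbscan(hav(pts), eps = 50, minPts = 10)  # eps is now in km
```

The precomputed-distance route sidesteps the index entirely, so it addresses correctness of the metric, not speed.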
Sorry that I have not yet prepared a technical report on the filtering step. It has some reasonable theory behind it, but it is too incremental for a major conference to publish, so only a draft report exists.
Both will be very valuable for users, and therefore belong in the manual.
It would be of great use (in other packages and for experimentation) to export internal functions of HDBSCAN such as mrd and computeStability, which are valuable as standalone functions.
I have been trying to learn and use dbscan in R recently. When I try the simple example found in the documentation,
data(iris)
iris <- as.matrix(iris[,1:4])
kNNdistplot(iris, k = 5)
abline(h=.5, col = "red", lty=2)
res <- dbscan(iris, eps = .5, minPts = 5)
everything goes through until it hits the dbscan command (last line). The error thrown is as follows:
Error in pmatch(toupper(splitRule), .ANNsplitRule) : object '.ANNsplitRule' not found
Any insights into what I've missed?
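Not a fix, just a guess: .ANNsplitRule is an internal object of the dbscan package, so this error usually points at a stale or broken installation, or at dbscan being shadowed by another attached package (fpc also exports a dbscan function). Two things that may be worth trying:

```r
# 1. Call with an explicit namespace to rule out masking by another package
res <- dbscan::dbscan(iris, eps = .5, minPts = 5)

# 2. Reinstall the package in a fresh R session in case the install is stale
install.packages("dbscan")
```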
Hello,
I run DBSCAN on my machine,
OS: RedHat_7.2_x86_64
Memory: 504GB
CPUCount: 2 CoreCount: 48 HT: yes
CPU Model: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
I have a dataset with approx. 290k instances and 17 features, taken from https://www.kaggle.com/dalpozz/creditcardfraud. Some features are dropped from the dataset; I kept (V1, V2, V3, V4, V5, V6, V7, V9, V10, V11, V12, V14, V16, V17, V18, V19, V21). Your implementation is much faster than scikit-learn's, but it still takes several minutes to cluster this dataset. I'd like to run DBSCAN with different hyperparameters (eps and minPts) to choose optimal ones, but that takes hours. Could you add some optimizations, e.g. building the kd-tree in parallel? That would make it possible to use this algorithm on larger datasets.
Thanks in advance!
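One possible speed-up for the hyperparameter sweep (a sketch; it assumes that frNN objects can be recycled for smaller radii and passed to dbscan(), as described in recent versions of the package): query the kd-tree once at the largest eps of interest and reuse the stored neighbor lists:

```r
library(dbscan)

set.seed(42)
# Hypothetical stand-in for the 290k x 17 data, kept small for illustration
X <- matrix(rnorm(5000 * 17), ncol = 17)

# Query the kd-tree once at the largest eps of interest...
nn_max <- frNN(X, eps = 6)

# ...then reuse the stored neighbor lists for every smaller eps
for (eps in c(3, 4, 5, 6)) {
  cl <- dbscan(frNN(nn_max, eps = eps), minPts = 10)
  cat("eps =", eps, "->", max(cl$cluster), "clusters\n")
}
```

This avoids re-querying the tree for every parameter combination; sweeping minPts still requires separate runs, but those are cheap on precomputed neighbor lists.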
Hi,
I am using this package for some research on a data set: a matrix with 50970 rows and 18 columns. When I run the function hdbscan with default settings, an error is raised: "Error in mrd(x_dist, coredist) :
number of mutual reachability distance values and size of the distances do not agree."
Could you please suggest how to fix this error? Thanks in advance.
Li
Hi,
Changing back from my earlier comment: some of the unexpected behavior I saw was due to bugs in my own code, which is a good thing!
Still, one odd thing remains: using uniform random weights does improve clustering, at least on my data set.
But the issue can be closed!
Hey!
I'm using tidySEM to visualise structural equation models. The package calls dbscan::pointdensity() at some point.
For a specific case, this led to an "R session aborted" when calling pointdensity().
I can consistently reproduce this on my machine with the following MWE:
library(dbscan)
tmp <- structure(list(x = c(5, 6, 7, 9, 10, 11.25, 5.05, 6.3, 7, 9,
9.7, 10.95, 3, 3, 3.3, 3.3, 3, 3, 13, 13, 12.7, 12.7, 13, 13,
8, 8, 5.95, 10.05, 8, 6, 9.75, 5.75, 10, 8, NaN), y = c(16.05,
16.05, 16.05, 16.05, 16.05, 15.8, 4, 3.75, 3.95, 3.95, 3.75,
4, 13.05, 12.05, 11.25, 8.75, 7.95, 6.95, 13.05, 12.05, 11.25,
8.75, 7.95, 6.95, 11.95, 8.05, 10, 10, 10, 12, 11.75, 7.75, 8,
10, NaN)), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L,
10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L,
23L, 24L, 30L, 31L, 32L, 33L, 58L, 59L, 60L, 61L, 62L, 63L, 35L
), class = "data.frame")
pointdensity(x = tmp, eps = 5)
I hope this is reproducible.
Best,
Paul
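If it helps triage: the last row of tmp contains NaN coordinates, so a workaround sketch (not a fix for the underlying crash) is to drop non-finite rows before the call:

```r
# Drop rows with non-finite coordinates before calling pointdensity()
ok <- is.finite(tmp$x) & is.finite(tmp$y)
pointdensity(x = tmp[ok, ], eps = 5)
```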