Comments (7)

peekxc avatar peekxc commented on May 23, 2024

Do you have a reproducible example by chance? A reprex or something would help a lot

from dbscan.

matteo-1980 avatar matteo-1980 commented on May 23, 2024

Okay. I will produce it. Where shall I then store code and data?

peekxc avatar peekxc commented on May 23, 2024

Reprex should do most things for you actually. Example:

data("iris")
model <- dbscan::hdbscan(iris[,-5], 5)
print(model)
#> HDBSCAN clustering for 150 objects.
#> Parameters: minPts = 5
#> The clustering contains 2 cluster(s) and 0 noise points.
#> 
#>   1   2 
#> 100  50 
#> 
#> Available fields: cluster, minPts, cluster_scores,
#>                   membership_prob, outlier_scores, hc
head(model$outlier_scores)
#> [1] 0.0000000 0.0000000 0.4654775 0.4226497 0.1835034 0.6220355

Created on 2019-11-01 by the reprex package (v0.3.0)

matteo-1980 avatar matteo-1980 commented on May 23, 2024

Here it is :-)
Only the R part for the moment (I have issues with Python, which I hope to solve on Monday).
As you can see, most of the observations get NA as their outlier score, and the remaining observations score either 0 or 1, with very few in between.
I guess this is related to overlapping points. I noticed this when replicating the data: with artificial data I did not get the problem, and I was able to reproduce it only by reducing the variety among the points.
Thanks for the help

# Libraries
library(dbscan)
#> Warning: package 'dbscan' was built under R version 3.5.3
library(DescTools)
#> Warning: package 'DescTools' was built under R version 3.5.3
library(reprex)
#> Warning: package 'reprex' was built under R version 3.5.3

# Generate data
dset1=data.frame("id1"= c(rep(1,500),rep(0,500)),
                 "id2"= c(rep(0,500),rep(1,500)),
                 "val"= c(round(rnorm(250,1,0.5),1),round(rnorm(250,2,0.5),1),round(rnorm(250,2,0.25),2),round(rnorm(250,3,0.25),1)) )

# Check repeated observations
print(paste0('Repeated observations over total observations: ',round(1-dim(unique(dset1))[1]/dim(dset1)[1],4)*100,'%'))
#> [1] "Repeated observations over total observations: 85.8%"

# Generate outlying observation
dset1=rbind(dset1,data.frame("id1"=0, "id2"=1, "val"=-3))

# Launch hdbscan
hdb_res1=hdbscan(dset1, minPts = 5)

# Check na scores
print(paste0('Scores equal to NA over total observations: ',round(sum(is.na(hdb_res1$outlier_scores))/dim(dset1)[1],4)*100,'%'))
#> [1] "Scores equal to NA over total observations: 82.02%"

# Plot histogram of outlier scores
Freq(hdb_res1$outlier_scores)
#>         level  freq   perc  cumfreq  cumperc
#> 1     [0,0.1]    71  39.4%       71    39.4%
#> 2   (0.1,0.2]     0   0.0%       71    39.4%
#> 3   (0.2,0.3]     0   0.0%       71    39.4%
#> 4   (0.3,0.4]     0   0.0%       71    39.4%
#> 5   (0.4,0.5]     3   1.7%       74    41.1%
#> 6   (0.5,0.6]     0   0.0%       74    41.1%
#> 7   (0.6,0.7]     5   2.8%       79    43.9%
#> 8   (0.7,0.8]     0   0.0%       79    43.9%
#> 9   (0.8,0.9]     0   0.0%       79    43.9%
#> 10    (0.9,1]   101  56.1%      180   100.0%

Created on 2019-11-02 by the reprex package (v0.3.0)

peekxc avatar peekxc commented on May 23, 2024

Just some initial thoughts, without going too far into the code.

NaN's are very often caused by divisions like 0/0. Here, the NA's are likely caused by the core distance being 0.0 for most of the observations, which happens because most of the data is redundant.

dset1 <- data.frame(
    "id1"= c(rep(1,500),rep(0,500)),
    "id2"= c(rep(0,500),rep(1,500)),
    "val"= c(round(rnorm(250,1,0.5),1),round(rnorm(250,2,0.5),1),round(rnorm(250,2,0.25),2),round(rnorm(250,3,0.25),1)) 
)
nrow(dset1)
#> [1] 1000
nrow(unique(dset1))
#> [1] 140

Created on 2019-11-02 by the reprex package (v0.3.0)

This becomes worse in HDBSCAN's specific internals. Redundant data is already a pathological way to mess with outlier detection (the distance between identical points is 0, so notions of outlierness based on distance distributions get skewed), but recall that in HDBSCAN the principal distance is the mutual reachability distance:

d_mrd(x_i, x_j) = max( d_core(x_i), d_core(x_j), d(x_i, x_j) )
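To see how duplicates collapse this metric, here is a minimal numpy sketch for 1-D data (not the package's internals; `core_dist` and `mutual_reachability` are illustrative helpers following the formula above):

```python
import numpy as np

# Toy 1-D data: three identical points and one distant point.
X = np.array([0.0, 0.0, 0.0, 5.0])

def core_dist(X, k):
    # Distance to the k-th nearest neighbor: sort each row of the
    # pairwise distance matrix (column 0 is the point itself).
    D = np.abs(X[:, None] - X[None, :])
    return np.sort(D, axis=1)[:, k]

def mutual_reachability(X, k):
    # d_mrd(x_i, x_j) = max( d_core(x_i), d_core(x_j), d(x_i, x_j) )
    D = np.abs(X[:, None] - X[None, :])
    cd = core_dist(X, k)
    return np.maximum(np.maximum.outer(cd, cd), D)

print(core_dist(X, k=2))            # the duplicates all have core distance 0
print(mutual_reachability(X, k=2))  # mostly repeated entries -> few unique distances
```

With enough duplicates, whole blocks of the mutual reachability matrix become identical, which is exactly the low uniqueness ratio measured below.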

Indeed, in the above example, less than 0.05% of the mutual reachability distances are actually unique:

library(dbscan)

# Generate data
n <- 500
dset1=data.frame(
    "id1"= c(rep(1,n),rep(0,n)),
    "id2"= c(rep(0,n),rep(1,n)),
    "val"= round(c(rnorm(n/4,1,0.5),rnorm(n/4,2,0.5),rnorm(n/4,2,0.25), rnorm(n/4,3,0.25)), 1)
)
dset1=rbind(dset1,data.frame("id1"=0, "id2"=1, "val"=-3))

x <- dset1 
minPts <- 5
core_dist <- kNNdist(x, k = minPts - 1)[, minPts -1]
xdist <- dist(x, method = "euclidean")
mrd <- dbscan:::mrd(xdist, core_dist)

length(unique(mrd))/length(mrd)
#> [1] 0.0004295704

Created on 2019-11-02 by the reprex package (v0.3.0)

The GLOSH score, per Eq. 8 of Campello et al.'s 2015 paper, is defined as 1 - e_max(x_i)/e(x_i). So the way to get NaN's is when e_max(x_i)/e(x_i) evaluates to 0/0.
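As a quick sanity check on that formula (just the equation itself, not the package's code; `eps_max` and `eps` follow the paper's notation):

```python
import numpy as np

def glosh(eps_max, eps):
    # GLOSH (Campello et al. 2015, Eq. 8): 1 - eps_max(x_i) / eps(x_i)
    eps_max = np.asarray(eps_max, dtype=float)
    eps = np.asarray(eps, dtype=float)
    with np.errstate(invalid="ignore", divide="ignore"):
        return 1.0 - eps_max / eps

print(glosh(1.0, 2.0))  # an ordinary point: 0.5
print(glosh(0.0, 0.0))  # a fully duplicated point: 1 - 0/0 = nan
```

A point whose whole minPts-neighborhood is duplicates dies at density level 0, so both terms of the ratio are 0.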

Some preliminary experiments seem to point toward this as the cause as well. You can generate an arbitrary number of NaN's by lowering minPts so that more neighborhoods consist entirely of duplicates:

library(dbscan)

# Generate data
n <- 24
dset1=data.frame(
    "id1"= c(rep(1,n),rep(0,n)),
  "id2"= c(rep(0,n),rep(1,n)),
    "val"= round(c(rnorm(n/4,1,0.5),rnorm(n/4,2,0.5),rnorm(n/4,2,0.25), rnorm(n/4,3,0.25)), 1)
)
dset1=rbind(dset1,data.frame("id1"=0, "id2"=1, "val"=-3))

## outlier scores fine here
hdb <- hdbscan(dset1, minPts = 5)
hdb$outlier_scores
#>  [1] 6.250000e-01 7.771561e-16 6.666667e-01 2.500000e-01 4.000000e-01
#>  [6] 4.000000e-01 0.000000e+00 0.000000e+00 0.000000e+00 7.771561e-16
#> [11] 7.771561e-16 2.500000e-01 2.500000e-01 0.000000e+00 2.500000e-01
#> [16] 0.000000e+00 0.000000e+00 0.000000e+00 4.000000e-01 0.000000e+00
#> [21] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 6.250000e-01
#> [26] 7.771561e-16 6.666667e-01 2.500000e-01 4.000000e-01 4.000000e-01
#> [31] 0.000000e+00 0.000000e+00 0.000000e+00 7.771561e-16 7.771561e-16
#> [36] 2.500000e-01 2.500000e-01 0.000000e+00 2.500000e-01 0.000000e+00
#> [41] 0.000000e+00 0.000000e+00 4.000000e-01 0.000000e+00 0.000000e+00
#> [46] 0.000000e+00 0.000000e+00 0.000000e+00 9.164407e-01

## Many neighborhoods get score 1 - 0/0
hdb <- hdbscan(dset1, minPts = 2)
hdb$outlier_scores
#>  [1]   0   0   0   0   0   0 NaN   0 NaN   0   0   1 NaN NaN NaN NaN NaN
#> [18]   0   1   0   0 NaN NaN NaN   0   0   0   0   0   0 NaN   0 NaN   0
#> [35]   0   1 NaN NaN NaN NaN NaN   0   1   0   0 NaN NaN NaN   1

Created on 2019-11-02 by the reprex package (v0.3.0)

I could add a check that reassigns NaN scores to 0, but that may not be helpful. Rather, I recommend removing redundant data to begin with: keep track of the duplicate indices and, in post-processing, assign each duplicate the same outlier score as the point it is identical to.
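A sketch of that de-duplicate-and-broadcast workflow in Python/numpy (the `unique_scores` here are made-up placeholders; in practice they would come from running the detector on the unique rows):

```python
import numpy as np

# Toy dataset with duplicated rows (a stand-in for dset1 above).
X = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [9.0, 9.0]])

# Collapse to unique rows; `inverse` maps each original row to its unique row.
uniq, inverse = np.unique(X, axis=0, return_inverse=True)

# Run the detector on `uniq` only. Placeholder scores here -- in practice
# this would be the outlier scores computed on the de-duplicated data,
# one score per unique row.
unique_scores = np.array([0.1, 0.0, 0.9])

# Broadcast each unique row's score back to all of its duplicates.
full_scores = unique_scores[inverse].ravel()
print(full_scores)
```

All duplicates of a row end up with the same score, and the detector never sees a zero core distance.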

I don't know what the Python implementation does. Maybe they have additional checks that assign NaN's to 0, or maybe they remove redundant data before actually processing it and assign the scores in post-processing.

matteo-1980 avatar matteo-1980 commented on May 23, 2024

Dear author,
I have reproduced the case in Python.
I get what follows.
Actually Python handles repeated observations well and identifies the synthetic outlier properly.
I think the R package should be fixed: this is a bug.
Can we expect a fix?
Thanks

# Libraries
import hdbscan as hd
import pandas as pd
import numpy as np

# Generate data
id1=[1] * 500 + [0]*500
id2=[0] * 500 + [1]*500
va1=list(np.random.normal(1,0.5,250)) + list(np.random.normal(2,0.5,250)) + list(np.random.normal(2,0.25,250)) + list(np.random.normal(3,0.25,250))
val=[ round(elem, 1) for elem in va1 ]
dset1=pd.DataFrame({'id1':id1,'id2':id2,'val':val})

# Add synthetic outlier
out1=pd.DataFrame({'id1':[0],'id2':[1],'val':[-100]})
dset1=dset1.append(out1, ignore_index=True)

# Create clustering object
clusterer = hd.HDBSCAN()

# Fit the clustering
clusterer.fit(dset1)

# Get scores in a variable
scores=clusterer.outlier_scores_

# Print frequencies
print ( 'Frequencies of scores:')
print ( pd.crosstab(index=pd.cut(scores, 20), columns="count"))

Frequencies of scores:
col_0                count
row_0                     
(-0.000986, 0.0493]    983
(0.0493, 0.0986]         0
(0.0986, 0.148]          0
(0.148, 0.197]           0
(0.197, 0.246]           0
(0.246, 0.296]           0
(0.296, 0.345]           0
(0.345, 0.394]           0
(0.394, 0.444]           0
(0.444, 0.493]           0
(0.493, 0.542]           4
(0.542, 0.592]           0
(0.592, 0.641]           0
(0.641, 0.69]            3
(0.69, 0.739]            0
(0.739, 0.789]           1
(0.789, 0.838]           0
(0.838, 0.887]           0
(0.887, 0.937]           1
(0.937, 0.986]           1

# Get highest value in clusterer.outlier_scores_
print('Maximum score of the non-outlying observations: ', round(max(scores[0:999]),2))
print('Score of the outlying observation: ', round(scores[1000],2))

Maximum score of the non-outlying observations:  0.89
Score of the outlying observation:  0.99

mhahsler avatar mhahsler commented on May 23, 2024

GLOSH (in the development version, now on CRAN) returns 0 instead of NaN if we have k duplicate points in the data. This is consistent: if we have, for example, 3 observations at exactly the same location, then for k <= 3 this point is not an outlier.
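For anyone stuck on an older release, the same convention can be applied as a one-line post-processing step (a numpy sketch with made-up scores, not the package's internal fix):

```python
import numpy as np

scores = np.array([0.3, np.nan, 0.0, np.nan, 0.9])  # hypothetical GLOSH output

# Points with >= k exact duplicates have core distance 0 and come out NaN;
# per the convention above, treat them as non-outliers (score 0).
fixed = np.where(np.isnan(scores), 0.0, scores)
print(fixed)
```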
