Comments (7)

peekxc avatar peekxc commented on May 23, 2024

Do you have a reproducible example by chance? A reprex or something would help a lot

from dbscan.

matteo-1980 avatar matteo-1980 commented on May 23, 2024

Okay. I will produce it. Where shall I then store code and data?

peekxc avatar peekxc commented on May 23, 2024

Reprex should do most things for you actually. Example:

data("iris")
model <- dbscan::hdbscan(iris[,-5], 5)
print(model)
#> HDBSCAN clustering for 150 objects.
#> Parameters: minPts = 5
#> The clustering contains 2 cluster(s) and 0 noise points.
#> 
#>   1   2 
#> 100  50 
#> 
#> Available fields: cluster, minPts, cluster_scores,
#>                   membership_prob, outlier_scores, hc
head(model$outlier_scores)
#> [1] 0.0000000 0.0000000 0.4654775 0.4226497 0.1835034 0.6220355

Created on 2019-11-01 by the reprex package (v0.3.0)

matteo-1980 avatar matteo-1980 commented on May 23, 2024

Here it is :-)
Only the R part for the moment (I have issues with Python, which I hope to solve on Monday).
As you can see, most of the observations get NA as their outlier score, and the remaining observations score either 0 or 1, with very few in between.
I guess this is related to overlapping points. I noticed this when replicating the data: with artificial data I did not get the problem, and I was able to reproduce it only by reducing the variety among the points.
Thanks for the help

# Libraries
library(dbscan)
#> Warning: package 'dbscan' was built under R version 3.5.3
library(DescTools)
#> Warning: package 'DescTools' was built under R version 3.5.3
library(reprex)
#> Warning: package 'reprex' was built under R version 3.5.3

# Generate data
dset1=data.frame("id1"= c(rep(1,500),rep(0,500)),
                 "id2"= c(rep(0,500),rep(1,500)),
                 "val"= c(round(rnorm(250,1,0.5),1),round(rnorm(250,2,0.5),1),round(rnorm(250,2,0.25),2),round(rnorm(250,3,0.25),1)) )

# Check repeated observations
print(paste0('Repeated observations over total observations: ',round(1-dim(unique(dset1))[1]/dim(dset1)[1],4)*100,'%'))
#> [1] "Repeated observations over total observations: 85.8%"

# Generate outlying observation
dset1=rbind(dset1,data.frame("id1"=0, "id2"=1, "val"=-3))

# Launch hdbscan
hdb_res1=hdbscan(dset1, minPts = 5)

# Check na scores
print(paste0('Scores equal to NA over total observations: ',round(sum(is.na(hdb_res1$outlier_scores))/dim(dset1)[1],4)*100,'%'))
#> [1] "Scores equal to NA over total observations: 82.02%"

# Plot histogram of outlier scores
Freq(hdb_res1$outlier_scores)
#>         level  freq   perc  cumfreq  cumperc
#> 1     [0,0.1]    71  39.4%       71    39.4%
#> 2   (0.1,0.2]     0   0.0%       71    39.4%
#> 3   (0.2,0.3]     0   0.0%       71    39.4%
#> 4   (0.3,0.4]     0   0.0%       71    39.4%
#> 5   (0.4,0.5]     3   1.7%       74    41.1%
#> 6   (0.5,0.6]     0   0.0%       74    41.1%
#> 7   (0.6,0.7]     5   2.8%       79    43.9%
#> 8   (0.7,0.8]     0   0.0%       79    43.9%
#> 9   (0.8,0.9]     0   0.0%       79    43.9%
#> 10    (0.9,1]   101  56.1%      180   100.0%

Created on 2019-11-02 by the reprex package (v0.3.0)

peekxc avatar peekxc commented on May 23, 2024

Just some initial thoughts, without going too far into the code.

NaN's are very often caused by divisions like 0/0. Here, the NA's are likely caused by the core distance being 0.0 for most of the observations, which happens because most of the data is redundant.

dset1 <- data.frame(
    "id1"= c(rep(1,500),rep(0,500)),
    "id2"= c(rep(0,500),rep(1,500)),
    "val"= c(round(rnorm(250,1,0.5),1),round(rnorm(250,2,0.5),1),round(rnorm(250,2,0.25),2),round(rnorm(250,3,0.25),1)) 
)
nrow(dset1)
#> [1] 1000
nrow(unique(dset1))
#> [1] 140

Created on 2019-11-02 by the reprex package (v0.3.0)

This becomes worse in HDBSCAN's specific internals. Redundant data is already a pathological way to mess with outlier detection (the distance between identical points is 0, so notions of outlierness based on distance distributions get skewed), but recall that in HDBSCAN the principal distance is the mutual reachability distance:

d_mrd(x_i, x_j) = max( d_core(x_i), d_core(x_j), d(x_i, x_j) )
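To see how duplicates collapse this metric, here is a minimal numpy sketch for 1-D data (not the package's internals; `core_dist` and `mutual_reachability` are illustrative helpers following the formula above):

```python
import numpy as np

# Toy 1-D data: three identical points and one distant point.
X = np.array([0.0, 0.0, 0.0, 5.0])

def core_dist(X, k):
    # Distance to the k-th nearest neighbor: sort each row of the
    # pairwise distance matrix (column 0 is the point itself).
    D = np.abs(X[:, None] - X[None, :])
    return np.sort(D, axis=1)[:, k]

def mutual_reachability(X, k):
    # d_mrd(x_i, x_j) = max( d_core(x_i), d_core(x_j), d(x_i, x_j) )
    D = np.abs(X[:, None] - X[None, :])
    cd = core_dist(X, k)
    return np.maximum(np.maximum.outer(cd, cd), D)

print(core_dist(X, k=2))            # the duplicates all have core distance 0
print(mutual_reachability(X, k=2))  # mostly repeated entries -> few unique distances
```

With enough duplicates, whole blocks of the mutual reachability matrix become identical, which is exactly the low uniqueness ratio measured below.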

Indeed, in the above example, less than 0.05% of the mutual reachability distances are actually unique:

library(dbscan)

# Generate data
n <- 500
dset1=data.frame(
    "id1"= c(rep(1,n),rep(0,n)),
    "id2"= c(rep(0,n),rep(1,n)),
    "val"= round(c(rnorm(n/4,1,0.5),rnorm(n/4,2,0.5),rnorm(n/4,2,0.25), rnorm(n/4,3,0.25)), 1)
)
dset1=rbind(dset1,data.frame("id1"=0, "id2"=1, "val"=-3))

x <- dset1 
minPts <- 5
core_dist <- kNNdist(x, k = minPts - 1)[, minPts -1]
xdist <- dist(x, method = "euclidean")
mrd <- dbscan:::mrd(xdist, core_dist)

length(unique(mrd))/length(mrd)
#> [1] 0.0004295704

Created on 2019-11-02 by the reprex package (v0.3.0)

The GLOSH score, per Eq. 8 of Campello et al.'s 2015 paper, is defined as 1 - e_max(x_i)/e(x_i). So the way to get NaN's is when e_max(x_i)/e(x_i) evaluates to 0/0.
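As a quick sanity check on that formula (just the equation itself, not the package's code; `eps_max` and `eps` follow the paper's notation):

```python
import numpy as np

def glosh(eps_max, eps):
    # GLOSH (Campello et al. 2015, Eq. 8): 1 - eps_max(x_i) / eps(x_i)
    eps_max = np.asarray(eps_max, dtype=float)
    eps = np.asarray(eps, dtype=float)
    with np.errstate(invalid="ignore", divide="ignore"):
        return 1.0 - eps_max / eps

print(glosh(1.0, 2.0))  # an ordinary point: 0.5
print(glosh(0.0, 0.0))  # a fully duplicated point: 1 - 0/0 = nan
```

A point whose whole minPts-neighborhood is duplicates dies at density level 0, so both terms of the ratio are 0.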

Some preliminary experiments seem to point toward this as the cause as well. You can generate an arbitrary number of NaN's by lowering minPts so that more neighborhoods consist entirely of duplicates:

library(dbscan)

# Generate data
n <- 24
dset1=data.frame(
    "id1"= c(rep(1,n),rep(0,n)),
  "id2"= c(rep(0,n),rep(1,n)),
    "val"= round(c(rnorm(n/4,1,0.5),rnorm(n/4,2,0.5),rnorm(n/4,2,0.25), rnorm(n/4,3,0.25)), 1)
)
dset1=rbind(dset1,data.frame("id1"=0, "id2"=1, "val"=-3))

## outlier scores fine here
hdb <- hdbscan(dset1, minPts = 5)
hdb$outlier_scores
#>  [1] 6.250000e-01 7.771561e-16 6.666667e-01 2.500000e-01 4.000000e-01
#>  [6] 4.000000e-01 0.000000e+00 0.000000e+00 0.000000e+00 7.771561e-16
#> [11] 7.771561e-16 2.500000e-01 2.500000e-01 0.000000e+00 2.500000e-01
#> [16] 0.000000e+00 0.000000e+00 0.000000e+00 4.000000e-01 0.000000e+00
#> [21] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 6.250000e-01
#> [26] 7.771561e-16 6.666667e-01 2.500000e-01 4.000000e-01 4.000000e-01
#> [31] 0.000000e+00 0.000000e+00 0.000000e+00 7.771561e-16 7.771561e-16
#> [36] 2.500000e-01 2.500000e-01 0.000000e+00 2.500000e-01 0.000000e+00
#> [41] 0.000000e+00 0.000000e+00 4.000000e-01 0.000000e+00 0.000000e+00
#> [46] 0.000000e+00 0.000000e+00 0.000000e+00 9.164407e-01

## Many neighborhoods get score 1 - 0/0
hdb <- hdbscan(dset1, minPts = 2)
hdb$outlier_scores
#>  [1]   0   0   0   0   0   0 NaN   0 NaN   0   0   1 NaN NaN NaN NaN NaN
#> [18]   0   1   0   0 NaN NaN NaN   0   0   0   0   0   0 NaN   0 NaN   0
#> [35]   0   1 NaN NaN NaN NaN NaN   0   1   0   0 NaN NaN NaN   1

Created on 2019-11-02 by the reprex package (v0.3.0)

I could add a check that reassigns NaN scores to 0, but that may not be helpful. Rather, I recommend removing redundant data to begin with: keep track of the duplicate indices and, in post-processing, assign each duplicate the same outlier score as the point it is identical to.
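A sketch of that de-duplicate-and-broadcast workflow in Python/numpy (the `unique_scores` here are made-up placeholders; in practice they would come from running the detector on the unique rows):

```python
import numpy as np

# Toy dataset with duplicated rows (a stand-in for dset1 above).
X = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [9.0, 9.0]])

# Collapse to unique rows; `inverse` maps each original row to its unique row.
uniq, inverse = np.unique(X, axis=0, return_inverse=True)

# Run the detector on `uniq` only. Placeholder scores here -- in practice
# this would be the outlier scores computed on the de-duplicated data,
# one score per unique row.
unique_scores = np.array([0.1, 0.0, 0.9])

# Broadcast each unique row's score back to all of its duplicates.
full_scores = unique_scores[inverse].ravel()
print(full_scores)
```

All duplicates of a row end up with the same score, and the detector never sees a zero core distance.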

I don't know what the Python implementation does. Maybe they have additional checks that assign NaN's to 0, or maybe they remove redundant data before actually processing it and assign the scores in post-processing.

matteo-1980 avatar matteo-1980 commented on May 23, 2024

Dear author,
I have reproduced the case in Python.
I get what follows.
Actually Python handles repeated observations well and identifies the synthetic outlier properly.
I think the R package should be fixed: this is a bug.
Can we expect a fix?
Thanks

# Libraries
import hdbscan as hd
import pandas as pd
import numpy as np

# Generate data
id1=[1] * 500 + [0]*500
id2=[0] * 500 + [1]*500
va1=list(np.random.normal(1,0.5,250)) + list(np.random.normal(2,0.5,250)) + list(np.random.normal(2,0.25,250)) + list(np.random.normal(3,0.25,250))
val=[ round(elem, 1) for elem in va1 ]
dset1=pd.DataFrame({'id1':id1,'id2':id2,'val':val})

# Add synthetic outlier
out1=pd.DataFrame({'id1':[0],'id2':[1],'val':[-100]})
dset1=dset1.append(out1, ignore_index=True)

# Create clustering object
clusterer = hd.HDBSCAN()

# Fit the clustering
clusterer.fit(dset1)

# Get scores in a variable
scores=clusterer.outlier_scores_

# Print frequencies
print ( 'Frequencies of scores:')
print ( pd.crosstab(index=pd.cut(scores, 20), columns="count"))

Frequencies of scores:
col_0                count
row_0                     
(-0.000986, 0.0493]    983
(0.0493, 0.0986]         0
(0.0986, 0.148]          0
(0.148, 0.197]           0
(0.197, 0.246]           0
(0.246, 0.296]           0
(0.296, 0.345]           0
(0.345, 0.394]           0
(0.394, 0.444]           0
(0.444, 0.493]           0
(0.493, 0.542]           4
(0.542, 0.592]           0
(0.592, 0.641]           0
(0.641, 0.69]            3
(0.69, 0.739]            0
(0.739, 0.789]           1
(0.789, 0.838]           0
(0.838, 0.887]           0
(0.887, 0.937]           1
(0.937, 0.986]           1

# Get highest value in clusterer.outlier_scores_
print('Maximum score of the non-outlying observations: ', round(max(scores[0:999]),2))
print('Score of the outlying observation: ', round(scores[1000],2))

Maximum score of the non-outlying observations:  0.89
Score of the outlying observation:  0.99

mhahsler avatar mhahsler commented on May 23, 2024

GLOSH (in the development version, now on CRAN) returns 0 instead of NaN if we have k duplicate points in the data. This is consistent: if we have, for example, 3 observations at exactly the same location, then for k <= 3 this point is not an outlier.
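For anyone stuck on an older release, the same convention can be applied as a one-line post-processing step (a numpy sketch with made-up scores, not the package's internal fix):

```python
import numpy as np

scores = np.array([0.3, np.nan, 0.0, np.nan, 0.9])  # hypothetical GLOSH output

# Points with >= k exact duplicates have core distance 0 and come out NaN;
# per the convention above, treat them as non-outliers (score 0).
fixed = np.where(np.isnan(scores), 0.0, scores)
print(fixed)
```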
