hbpmedical / ccc

3C-strategy implementation in R from TAU team

R 97.18% Shell 2.82%

algorithm-library

ccc's Issues

Wrong estimation of optimum number of clusters

When the number of clusters is not provided, the pipeline tries to estimate the optimal number by calculating the "gap" statistic for up to 10 clusters. This is done inside the k_euclidean(), k_manhattan(), ... functions, via the cluster::clusGap() function. However, the arguments are passed to it incorrectly.

cluster::clusGap() is defined as:
clusGap(x, FUNcluster, K.max, B = 100, d.power = 1, spaceH0 = c("scaledPCA", "original"), verbose = interactive(), ...)

The way it is called inside the pipeline is:
clusGap_best <- cluster::clusGap(x, FUN = pam, K.max = K.max, B, verbose)
where B=100 and verbose=FALSE are the default values in the parent function.

However, it should have been:
clusGap_best <- cluster::clusGap(x, FUN = pam, K.max = K.max, B = B, verbose = verbose)

So, while the position of B is fortuitously correct, that of verbose is not: the fifth formal argument of clusGap() is d.power, so, as it stands, the pipeline passes the value of verbose (FALSE, which is coerced to 0) as d.power. See plots below.
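The argument binding described above can be checked without running the pipeline. The following is a minimal sketch: mock_clusGap is a hypothetical stand-in that copies only the formal arguments of clusGap() quoted above and reports what each one received; it does no clustering, and identity is used in place of pam purely as a placeholder.

```r
# Hypothetical stand-in with the same formals as cluster::clusGap().
# It only reports which values were bound to which formal arguments.
mock_clusGap <- function(x, FUNcluster, K.max, B = 100, d.power = 1,
                         spaceH0 = c("scaledPCA", "original"),
                         verbose = interactive(), ...) {
  list(B = B, d.power = d.power, verbose = verbose)
}

B <- 100
verbose <- FALSE

# Call shaped like the one in the pipeline: B and verbose are unnamed,
# so they fill the next free formals in order (B, then d.power).
positional <- mock_clusGap(1:10, FUN = identity, K.max = 10, B, verbose)

# Call with the optional arguments named explicitly.
named <- mock_clusGap(1:10, FUN = identity, K.max = 10, B = B, verbose = verbose)

print(positional$d.power)  # FALSE: the unnamed `verbose` landed in d.power
print(named$d.power)       # 1: the default is kept
print(named$verbose)       # FALSE: the value reached the intended argument
```

Naming every optional argument at the call site avoids this class of error, because the binding no longer depends on the order of the formals.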

Issue in get_xy_from_DATA_C2 function

The get_xy_from_DATA_C2 function calculates the features x and target y components as:

x <- DATA[, META_DATA$varName[META_DATA$varCategory == "CM"]]
y <- DATA[, META_DATA$varName[META_DATA$varCategory == "DX"]]

where META_DATA$varName[META_DATA$varCategory == "CM"] is supposed to list all the columns in DATA corresponding to the META_DATA category "CM", and the same for "DX".

However, this is not correct: META_DATA$varName[META_DATA$varCategory == "CM"] returns a factor, and when a factor is used as an index R uses its integer codes rather than its labels. The columns are therefore selected by position instead of by name.

This can be tested using the example provided on the page, e.g.

y <- get_xy_from_DATA_C2(c3_sample1, c3_sample1_categories)$y

is supposed to return the column of c3_sample1 whose name is assigned the category "DX" in c3_sample1_categories. By opening c3_sample1_categories, we see that this is the column real_DX_f, i.e. the 1st column in c3_sample1.

However, the get_xy_from_DATA_C2 function does not return the 1st column, but the 21st, because real_DX_f is the 21st entry in the list of factor levels of META_DATA$varName[META_DATA$varCategory == "DX"].

Sample code:

DATA = c3_sample1
META_DATA = c3_sample1_categories
str(META_DATA$varName[META_DATA$varCategory == "DX"])

Output:

Factor w/ 21 levels "CM.1","CM.10",..: 21

As a result, the target y gets assigned the column of c3_sample1 named PB.9, and not real_DX_f. The same happens for x.
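The same behaviour can be reproduced on a small toy data set, independent of c3_sample1. This is a sketch only: the data frame, the column names and the variable dx_name below are made up for illustration, and varName is built as a factor explicitly to mirror the str() output above.

```r
# Toy data: the target column comes first, as in the report.
DATA <- data.frame(real_DX = c("a", "b"), CM.1 = c(1, 2), CM.2 = c(3, 4))

# Toy metadata; varName is a factor, so its levels are sorted alphabetically.
META_DATA <- data.frame(
  varName = factor(c("real_DX", "CM.1", "CM.2")),
  varCategory = c("DX", "CM", "CM")
)

dx_name <- META_DATA$varName[META_DATA$varCategory == "DX"]

print(levels(dx_name))      # "CM.1" "CM.2" "real_DX"
print(as.integer(dx_name))  # 3: the integer code of "real_DX"

# Indexing with the factor uses the code 3, so the third column is selected.
by_factor <- DATA[, dx_name]

# Indexing with the character value selects the column by name.
by_name <- DATA[, as.character(dx_name)]

print(identical(by_factor, DATA$CM.2))   # TRUE: wrong column
print(identical(by_name, DATA$real_DX))  # TRUE: intended column
```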

I have fixed this locally by defining a new get_xy_from_DATA_C2_ function as:

x <- DATA[, as.character(META_DATA$varName[META_DATA$varCategory == "CM"])]
y <- DATA[, as.character(META_DATA$varName[META_DATA$varCategory == "DX"])]
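A slightly more defensive variant of this local fix could also check that every listed name exists in DATA before subsetting. The sketch below is an assumption, not the package's code: get_xy_by_name is a hypothetical name, the list(x = ..., y = ...) return shape is inferred from the $y access in the example above, and drop = FALSE is an added design choice so that single-column results stay data frames.

```r
# Hypothetical helper: select the "CM" and "DX" columns by name, never by
# factor code, and fail early if a listed name is absent from DATA.
get_xy_by_name <- function(DATA, META_DATA) {
  var_names <- as.character(META_DATA$varName)
  cm_cols <- var_names[META_DATA$varCategory == "CM"]
  dx_cols <- var_names[META_DATA$varCategory == "DX"]

  missing_cols <- setdiff(c(cm_cols, dx_cols), names(DATA))
  if (length(missing_cols) > 0) {
    stop("Columns not found in DATA: ", paste(missing_cols, collapse = ", "))
  }

  # drop = FALSE keeps a data frame even when only one column is selected.
  list(x = DATA[, cm_cols, drop = FALSE],
       y = DATA[, dx_cols, drop = FALSE])
}

# Toy input (made up for illustration), with varName stored as a factor.
toy_data <- data.frame(real_DX = c("a", "b"), CM.1 = c(1, 2), CM.2 = c(3, 4))
toy_meta <- data.frame(
  varName = factor(c("real_DX", "CM.1", "CM.2")),
  varCategory = c("DX", "CM", "CM")
)

xy <- get_xy_by_name(toy_data, toy_meta)
print(names(xy$x))  # "CM.1" "CM.2"
print(names(xy$y))  # "real_DX"
```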

Underdefined arguments for clustering function within C2 function

The clustering function takes (x, k.gap = 2, method = "Euclidean", plot.clustering = FALSE) as input arguments.
When it is called within the C2 function, only the first two are provided:

final_cluster <- clustering(subx, k.gap = num_clust)

which means that the clustering method will always be Euclidean, and no plot will be produced, regardless of the user's input.
I have tested this by running the example code provided: no plot was produced despite plot.clustering = TRUE, and the results did not change when switching between clustering_method="Manhattan" and clustering_method="Euclidean".

I have resolved this locally by defining a new C2_ function with:

final_cluster <- clustering(subx, k.gap = num_clust, method = clustering_method, plot.clustering = plot.clustering)
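The effect of the missing arguments can be illustrated in isolation. In the sketch below, clustering is a stand-in that has the formals quoted above but only echoes the settings it receives, and the two wrappers (wrapper_dropping and wrapper_forwarding) are hypothetical simplifications of the call inside C2, not the actual function.

```r
# Stand-in for clustering(): same formals as quoted in the report, but it
# only reports the settings it received instead of clustering anything.
clustering <- function(x, k.gap = 2, method = "Euclidean",
                       plot.clustering = FALSE) {
  list(k.gap = k.gap, method = method, plot.clustering = plot.clustering)
}

# Shaped like the current call: the user's options are accepted but never
# forwarded, so clustering() falls back to its defaults.
wrapper_dropping <- function(subx, num_clust, clustering_method,
                             plot.clustering) {
  clustering(subx, k.gap = num_clust)
}

# Shaped like the proposed fix: every option is forwarded by name.
wrapper_forwarding <- function(subx, num_clust, clustering_method,
                               plot.clustering) {
  clustering(subx, k.gap = num_clust, method = clustering_method,
             plot.clustering = plot.clustering)
}

dropped <- wrapper_dropping(1:10, 3, "Manhattan", TRUE)
forwarded <- wrapper_forwarding(1:10, 3, "Manhattan", TRUE)

print(dropped$method)             # "Euclidean": the user's choice is ignored
print(forwarded$method)           # "Manhattan"
print(forwarded$plot.clustering)  # TRUE
```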
