hbpmedical / ccc
3C-strategy implementation in R from TAU team
When the number of clusters is not provided, the pipeline tries to estimate the optimal number by calculating the "gap" statistic for up to 10 clusters. This is done inside the k_euclidean(), k_manhattan(), ... functions, via the cluster::clusGap() function. However, the arguments passed to it are wrong.
cluster::clusGap() is defined as:
clusGap(x, FUNcluster, K.max, B = 100, d.power = 1, spaceH0 = c("scaledPCA", "original"), verbose = interactive(), ...)
The way it is called inside the pipeline is:
clusGap_best <- cluster::clusGap(x, FUN = pam, K.max = K.max, B, verbose)
where B = 100 and verbose = FALSE are the default values in the parent function.
However, it should have been:
clusGap_best <- cluster::clusGap(x, FUN = pam, K.max = K.max, B = B, verbose = verbose)
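So, while the position of B is fortuitously correct, that of verbose is not: R binds unnamed arguments to the remaining formals in order, so verbose lands on d.power, the next unmatched formal after B. As it stands, the pipeline is passing d.power = FALSE, which is coerced to 0. See plots below.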
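The mismatch can be checked without running the pipeline by asking R how it would match the call. The following is a minimal sketch: `clusGap_formals` is a hypothetical stand-in that only copies the argument list of cluster::clusGap() quoted above (so the cluster package is not needed), and `pipeline_call` is just the pipeline's call kept unevaluated. match.call() is base R.

```r
# Hypothetical stand-in: same formals as cluster::clusGap() (as quoted above), empty body.
clusGap_formals <- function(x, FUNcluster, K.max, B = 100, d.power = 1,
                            spaceH0 = c("scaledPCA", "original"),
                            verbose = interactive(), ...) NULL

# The call as written in the pipeline, kept unevaluated.
pipeline_call <- quote(clusGap(x, FUN = pam, K.max = K.max, B, verbose))

# match.call() applies R's usual rules: exact names, then partial names, then position.
matched <- as.list(match.call(clusGap_formals, pipeline_call))

# Show which symbol each formal receives.
str(matched[-1])

# The symbol `verbose` is bound to d.power; the verbose formal itself is left unmatched.
identical(matched[["d.power"]], quote(verbose))
"verbose" %in% names(matched)
```

Naming the arguments (B = B, verbose = verbose) removes the dependence on position, which is why the corrected call above is robust even if the formals of clusGap() were reordered.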
The get_xy_from_DATA_C2 function calculates the features x and target y components as:
x <- DATA[, META_DATA$varName[META_DATA$varCategory == "CM"]]
y <- DATA[, META_DATA$varName[META_DATA$varCategory == "DX"]]
where META_DATA$varName[META_DATA$varCategory == "CM"] is supposed to list all the columns in DATA corresponding to the META_DATA category "CM", and the same for "DX".
However, this is not correct: META_DATA$varName[META_DATA$varCategory == "CM"] returns a factor, and a factor used as an index is treated as its integer codes rather than as column names.
This can be tested using the example provided in the page, e.g.
y <- get_xy_from_DATA_C2(c3_sample1, c3_sample1_categories)$y
is supposed to return the column of c3_sample1 whose name is associated with "DX" in c3_sample1_categories. Opening c3_sample1_categories shows that this is the column real_DX_f, i.e. the 1st column in c3_sample1.
However, the get_xy_from_DATA_C2 function does not return the 1st column but the 21st, because real_DX_f is the 21st element in the level list of the factor returned by META_DATA$varName[META_DATA$varCategory == "DX"].
Sample code:
DATA <- c3_sample1
META_DATA <- c3_sample1_categories
str(META_DATA$varName[META_DATA$varCategory == "DX"])
Output:
Factor w/ 21 levels "CM.1","CM.10",..: 21
As a result, the target y gets assigned the column of c3_sample1 named PB.9, and not real_DX_f. The same applies to x.
I have fixed this locally by defining a new get_xy_from_DATA_C2_ function as:
x <- DATA[, as.character(META_DATA$varName[META_DATA$varCategory == "CM"])]
y <- DATA[, as.character(META_DATA$varName[META_DATA$varCategory == "DX"])]
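The behaviour can be reproduced on a toy example without the package data. This is a sketch under assumptions: the three-column DATA and META_DATA below are made up for illustration and only mimic the layout of c3_sample1 and c3_sample1_categories (target column first, varName stored as a factor).

```r
# Toy data: the target column comes first, as in c3_sample1.
DATA <- data.frame(real_DX_f = c("a", "b"), CM.1 = c(1, 2), PB.9 = c(3, 4),
                   stringsAsFactors = FALSE)

# Toy metadata with varName stored as a factor. Levels are listed explicitly,
# in the alphabetical order factor() would produce by default.
META_DATA <- data.frame(
  varName = factor(c("real_DX_f", "CM.1", "PB.9"),
                   levels = c("CM.1", "PB.9", "real_DX_f")),
  varCategory = c("DX", "CM", "PB"),
  stringsAsFactors = FALSE
)

dx <- META_DATA$varName[META_DATA$varCategory == "DX"]
as.integer(dx)                       # integer code of "real_DX_f" among the levels

y_wrong <- DATA[, dx]                # factor index: selects by integer code
y_right <- DATA[, as.character(dx)]  # character index: selects by column name

identical(y_wrong, DATA$PB.9)        # the factor index picked the last column
identical(y_right, DATA$real_DX_f)   # the character index picked the target
```

Converting with as.character() before indexing keeps the selection tied to column names, so it does not depend on how the factor levels happen to be ordered.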
The clustering function takes (x, k.gap = 2, method = "Euclidean", plot.clustering = FALSE) as input arguments.
When called within the C2 function, only two arguments are provided:
final_cluster <- clustering(subx, k.gap = num_clust)
which means that the clustering method will always be Euclidean and no plot will be produced, regardless of the user's input.
I have tested this by running the example code provided: no plot was produced despite plot.clustering = TRUE, and the results did not change when switching clustering_method="Manhattan" to clustering_method="Euclidean".
I have resolved this locally by defining a new C2_ function with:
final_cluster <- clustering(subx, k.gap = num_clust, method = clustering_method, plot.clustering = plot.clustering)
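The effect of the missing arguments can be sketched with stand-ins. Everything below is hypothetical: `clustering` here only mirrors the signature quoted above and reports what it received, and `call_original` / `call_fixed` are illustrative wrappers around the two call forms, not the package's actual C2 code.

```r
# Hypothetical stand-in for the package's clustering(): same signature as quoted
# above, but it only reports the settings it received.
clustering <- function(x, k.gap = 2, method = "Euclidean", plot.clustering = FALSE) {
  list(k.gap = k.gap, method = method, plot.clustering = plot.clustering)
}

# Call form currently used inside C2: the user's choices never reach clustering().
call_original <- function(subx, num_clust, clustering_method, plot.clustering) {
  clustering(subx, k.gap = num_clust)
}

# Call form with both options forwarded, as in the local C2_ fix.
call_fixed <- function(subx, num_clust, clustering_method, plot.clustering) {
  clustering(subx, k.gap = num_clust, method = clustering_method,
             plot.clustering = plot.clustering)
}

subx <- matrix(0, nrow = 4, ncol = 2)  # placeholder data, not used by the stand-in

res_original <- call_original(subx, 3, "Manhattan", TRUE)
res_fixed    <- call_fixed(subx, 3, "Manhattan", TRUE)

# Compare what clustering() saw in each case.
str(res_original)
str(res_fixed)
```

With the original call form the defaults of clustering() silently win, which matches the observation that changing clustering_method or plot.clustering had no visible effect.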