ramhiser / clusteval
Clustering Evaluation in R
License: Other
Got this in an email from @khughitt:
Passing invalid inputs (e.g. character vectors instead of numeric vectors) to the `cluster_similarity` function leads to a core dump and kills the R session, e.g.:
> cluster_similarity(c('a', 'b', 'c'), c('a', 'a', 'c'))
terminate called after throwing an instance of 'Rcpp::not_compatible'
what(): Not compatible with requested type: [type=character; target=double].
[1] 27028 abort (core dumped) R
A quick type check in the function(s) before calling the Rcpp functions should be enough to prevent this.
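A minimal sketch of such a type check, assuming it runs before any Rcpp code is called. The helper name `check_cluster_labels` is hypothetical and not part of clusteval; whether to reject or coerce non-numeric labels is a design choice left open here.

```r
# Hypothetical helper (not part of clusteval): validate label vectors
# in R so that bad input raises an ordinary R error instead of an
# uncaught C++ exception.
check_cluster_labels <- function(labels1, labels2) {
  for (labels in list(labels1, labels2)) {
    if (!is.numeric(labels) || anyNA(labels)) {
      stop("cluster labels must be numeric vectors without missing values")
    }
  }
  if (length(labels1) != length(labels2)) {
    stop("both label vectors must have the same length")
  }
  invisible(TRUE)
}

# Character input now fails with a regular, catchable R error.
result <- tryCatch(
  check_cluster_labels(c("a", "b", "c"), c("a", "a", "c")),
  error = function(e) conditionMessage(e)
)
```

An alternative would be to coerce labels with `as.integer(factor(labels))` rather than rejecting them, which would also accept character labels.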
Currently, the Jaccard function takes only cluster labels as arguments, but for estimation purposes it would be useful to have an option to pass the comemberships instead.
Do one of the following:
This vignette should list all of the statistics implemented and include a brief description of clustering comembership and the calculation of the 2x2 contingency tables.
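As a starting point for the vignette, the 2x2 comembership table could be illustrated with a small pure-R sketch. The function name `comembership_counts` is hypothetical, and the convention that `n_10` counts pairs that are comembers in the first clustering only is an assumption about how the package labels the cells.

```r
# Hypothetical pure-R illustration of the 2x2 comembership table.
# Two observations are "comembers" if a clustering puts them in the
# same cluster. Each pair of observations falls into one of four cells.
comembership_counts <- function(labels1, labels2) {
  stopifnot(length(labels1) == length(labels2))
  n <- length(labels1)
  tab <- table(labels1, labels2)
  # Number of pairs within groups of the given sizes (computed as doubles).
  pairs <- function(m) sum(m * (m - 1) / 2)
  n_11 <- pairs(as.numeric(tab))                  # comembers in both
  n_10 <- pairs(as.numeric(rowSums(tab))) - n_11  # comembers in first only
  n_01 <- pairs(as.numeric(colSums(tab))) - n_11  # comembers in second only
  n_00 <- n * (n - 1) / 2 - n_11 - n_10 - n_01    # comembers in neither
  list(n_11 = n_11, n_10 = n_10, n_01 = n_01, n_00 = n_00)
}

counts <- comembership_counts(c(1, 1, 2, 2), c(1, 1, 1, 2))
# counts: n_11 = 1, n_10 = 1, n_01 = 2, n_00 = 2
```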
Repeatable example:
library(clusteval)
n <- 10
labels1 <- rep(1, 10)
labels2 <- rep(2, 10)
adjusted_rand(labels1, labels2)
[1] NaN
This case should return 1. I confirmed that mclust::adjustedRandIndex has this behavior. I also checked on some examples that we get the same similarity value as mclust::adjustedRandIndex.
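The NaN arises because both the numerator and the denominator of the adjusted Rand index are zero when each clustering has a single cluster. A minimal pure-R sketch with an explicit guard for that case follows. The name `adjusted_rand_sketch` is hypothetical, and this is not the package's Rcpp implementation.

```r
# Hypothetical pure-R adjusted Rand index with a guard for the
# degenerate case where both partitions consist of a single cluster.
adjusted_rand_sketch <- function(labels1, labels2) {
  stopifnot(length(labels1) == length(labels2))
  tab <- table(labels1, labels2)
  # Both partitions are trivial, hence identical: define the index as 1.
  if (all(dim(tab) == c(1, 1))) return(1)
  n <- length(labels1)
  sum_cells <- sum(choose(tab, 2))
  sum_rows <- sum(choose(rowSums(tab), 2))
  sum_cols <- sum(choose(colSums(tab), 2))
  expected <- sum_rows * sum_cols / choose(n, 2)
  (sum_cells - expected) / (0.5 * (sum_rows + sum_cols) - expected)
}

adjusted_rand_sketch(rep(1, 10), rep(2, 10))  # 1 instead of NaN
```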
Determine a list of 5-10 similarity functions to add that utilize the comembership_summary function. Implement these.
See this blog post for starters.
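Several well-known similarity statistics are simple functions of the four comembership counts. The sketch below assumes the counts arrive as a list with elements `n_11`, `n_10`, `n_01` and `n_00` (the element names seen in the `comembership_table()` output elsewhere in these issues). The function names are hypothetical.

```r
# Hypothetical helpers: similarity statistics computed from the
# 2x2 comembership counts (n_11, n_10, n_01, n_00).
rand_from_counts <- function(tab) {
  with(tab, (n_11 + n_00) / (n_11 + n_10 + n_01 + n_00))
}
jaccard_from_counts <- function(tab) {
  with(tab, n_11 / (n_11 + n_10 + n_01))
}
fowlkes_mallows_from_counts <- function(tab) {
  with(tab, n_11 / sqrt((n_11 + n_10) * (n_11 + n_01)))
}

tab <- list(n_11 = 1, n_10 = 1, n_01 = 2, n_00 = 2)
rand_from_counts(tab)     # 0.5
jaccard_from_counts(tab)  # 0.25
```

Keeping the statistics as functions of the counts means the (expensive) pair counting only has to happen once per pair of clusterings.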
Demonstrate clustering evaluation with examples.
From email I received from Kurt Hornik of CRAN:
Package: clusteval Version: 0.1
Check: DESCRIPTION meta-information ... NOTE
License components which are templates and need ‘+ file LICENSE’:
MIT
It seems I need to add a LICENSE file.
Functions:
rand
rand_glmm
rand_standard
Along with ramhiser/itertools2#38, got Ripley'd over last night's CRAN submission. Thing to fix:
We see
- checking top-level files ... NOTE
Non-standard file/directory found at top level:
‘cran-comments.md’
which should not be in the tarball. Please scrupulously follow the policies and check before submission.
For larger sample sizes combined with large numbers of cluster labels, comembership_table() can return negative numbers for the number of discordant pairs.
set.seed(1)
a <- sample(1:20, 70000, replace = TRUE)
b <- sample(1:20, 70000, replace = TRUE)
clusteval::comembership_table(a, b)
output:
$n_11
6125067
$n_10
116356347
$n_01
116372976
$n_00
-2083856686
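The reported numbers are consistent with a 32-bit signed integer overflow in the pair counts. The following arithmetic check uses only the values printed above and R doubles, which represent these integers exactly. The diagnosis is an inference from the numbers, not something confirmed against the C++ source.

```r
# Check the reported output against the total number of pairs.
n <- 70000
total_pairs <- n * (n - 1) / 2   # 2449965000 pairs in total
n_11 <- 6125067
n_10 <- 116356347
n_01 <- 116372976
n_00_reported <- -2083856686

# The fourth cell must make the counts sum to the total number of pairs.
n_00_expected <- total_pairs - (n_11 + n_10 + n_01)  # 2211110610

# The total exceeds the largest 32-bit signed integer ...
exceeds_int <- total_pairs > .Machine$integer.max
# ... and the reported value is exactly the expected one wrapped by 2^32.
wrapped <- (n_00_reported + 2^32) == n_00_expected
```

If this diagnosis holds, accumulating the counts in a wider type (for example `double` or a 64-bit integer on the C++ side) should avoid the problem.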
The main idea is to bootstrap B times and use the width of the confidence intervals as a measure of stability to determine the true number of clusters. The optimal value of K is the one that minimizes the width of the confidence interval of a specified criterion.
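A rough sketch of this idea, under several assumptions not stated above: k-means is used as the clustering method, the criterion is the ratio of between-cluster to total sum of squares, and the confidence interval is a simple percentile interval. The function name `ci_width_by_k` is hypothetical.

```r
# Hypothetical sketch: for each candidate K, bootstrap the data B times,
# compute a criterion on each resample, and record the width of the
# percentile confidence interval of that criterion.
ci_width_by_k <- function(x, ks = 2:5, B = 50, level = 0.95) {
  alpha <- (1 - level) / 2
  widths <- sapply(ks, function(k) {
    crit <- replicate(B, {
      idx <- sample(nrow(x), replace = TRUE)
      fit <- kmeans(x[idx, , drop = FALSE], centers = k, nstart = 5)
      fit$betweenss / fit$totss  # assumed criterion, for illustration only
    })
    unname(diff(quantile(crit, c(alpha, 1 - alpha))))
  })
  names(widths) <- ks
  widths
}

set.seed(42)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))
widths <- ci_width_by_k(x, ks = 2:3, B = 20)
best_k <- as.integer(names(widths)[which.min(widths)])
```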
Because the parameter configurations have changed significantly and because I never finished the documentation, finish the documentation for the following functions:
sim_unif
sim_normal
sim_student
sim_data
The Dunn index is an internal evaluation technique. Cluster k results in an intracluster distance \Delta_k, which is computed as one of:
An intercluster distance is then calculated as a comparison of the clusters.
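A minimal sketch of one common variant, assuming the intracluster distance is the cluster diameter (largest pairwise distance within a cluster) and the intercluster distance is the smallest distance between points in different clusters. Other choices are possible, and the name `dunn_sketch` is hypothetical.

```r
# Hypothetical sketch of the Dunn index:
# (smallest between-cluster distance) / (largest within-cluster diameter).
# Assumes at least two clusters and Euclidean distance.
dunn_sketch <- function(x, labels) {
  d <- as.matrix(dist(x))
  same <- outer(labels, labels, "==")
  max_intra <- max(d[same])   # largest diameter over all clusters
  min_inter <- min(d[!same])  # closest pair across different clusters
  min_inter / max_intra
}

x <- matrix(c(0, 1, 10, 11), ncol = 1)
dunn_sketch(x, c(1, 1, 2, 2))  # 9: clusters are 9 apart, diameters are 1
```

Larger values indicate compact, well-separated clusters. The full distance matrix makes this sketch quadratic in memory, so it is only suitable for small data sets.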
Suleman (2017) shows that hard clustering similarities like Rand and Jaccard can be easily extended to fuzzy clusterings by replacing the 0/1 comembership indicator with a normalized Manhattan distance between cluster weights. I have implemented a prototype, but it would be nice to have it in C++ and available to other users through the existing clusteval package.
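To make the idea concrete, here is a small pure-R sketch. It is an interpretation of the description above, not necessarily the exact definition in the paper or in the prototype: the fuzzy comembership of two observations is taken to be one minus half the Manhattan distance between their membership weight vectors, and the fuzzy Rand index is the average agreement of those comemberships over all pairs. Both function names are hypothetical.

```r
# Hypothetical sketch. w is an n x k matrix of membership weights whose
# rows sum to 1. For hard (one-hot) memberships the comembership is
# exactly the usual 0/1 indicator.
fuzzy_comembership <- function(w) {
  1 - as.matrix(dist(w, method = "manhattan")) / 2
}

# Assumed fuzzy Rand: 1 minus the mean absolute difference between the
# two comembership matrices over all pairs of observations.
fuzzy_rand_sketch <- function(w1, w2) {
  e1 <- fuzzy_comembership(w1)
  e2 <- fuzzy_comembership(w2)
  pairs <- upper.tri(e1)
  1 - mean(abs(e1[pairs] - e2[pairs]))
}

# With hard clusterings this reduces to the ordinary Rand index.
w1 <- diag(2)[c(1, 1, 2, 2), ]
w2 <- diag(2)[c(1, 1, 1, 2), ]
fuzzy_rand_sketch(w1, w2)  # 0.5
```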
clusteval documentation on CRAN.
Functions:
jaccard
jaccard_glmm
jaccard_standard
This is too slow with the clValid package. Let's make it faster for future usage.
The package documentation is in need of much TLC. We need to write a package description and add it in three places:
README.md
R/help.r
DESCRIPTION
The documentation for these functions is dismal. Let's fix that.
There were two NOTES in my package submission. Ripley emailed me and told me to fix them:
We see
- checking top-level files ... NOTE
Non-standard file/directory found at top level:
‘NEWS.md’
- checking R code for possible problems ... NOTE
plot.clustomit: no visible binding for global variable ‘method’
plot.clustomit: no visible binding for global variable ‘ClustOmit’
plot.clustomit: no visible binding for global variable ‘Cluster’
Please fix
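One common way to silence the "no visible binding" NOTE is to declare the names with `utils::globalVariables()` somewhere in the package's R code. This is a sketch of that approach, assuming the three names are column names used in non-standard evaluation inside `plot.clustomit`; the file location is illustrative.

```r
# Placed in a package source file, e.g. R/globals.R (file name is
# illustrative). Declares names that R CMD check cannot resolve because
# they are evaluated inside a data frame rather than as variables.
if (getRversion() >= "2.15.1") {
  utils::globalVariables(c("method", "ClustOmit", "Cluster"))
}
```

The top-level file NOTE is a separate matter and is usually handled by excluding the file from the build.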
The plot should look similar to the density plots in the ClustOmit paper.
The Davies-Bouldin index is an internal clustering evaluation method. The formula is straightforward to implement and requires a distance metric to be given.
From Wikipedia:
Due to the way it is defined, as a function of the ratio of the within cluster scatter, to the between cluster separation, a lower value will mean that the clustering is better. It happens to be the average similarity between each cluster and its most similar one, averaged over all the clusters, where the similarity is defined as Si above. This affirms the idea that no cluster has to be similar to another, and hence the best clustering scheme essentially minimizes the Davies Bouldin Index. This index thus defined is an average over all the i clusters, and hence a good measure of deciding how many clusters actually exists in the data is to plot it against the number of clusters it is calculated over. The number i for which this value is the lowest is a good measure of the number of clusters the data could be ideally classified into. This has applications in deciding the value of k in the kmeans algorithm, where the value of k is not known apriori.
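A minimal sketch of the index, assuming Euclidean distance, with the scatter of a cluster taken as the mean distance of its points to the centroid and the separation taken as the distance between centroids. The name `davies_bouldin_sketch` is hypothetical, and a real implementation would accept the distance metric as an argument.

```r
# Hypothetical sketch of the Davies-Bouldin index (Euclidean distance).
davies_bouldin_sketch <- function(x, labels) {
  x <- as.matrix(x)
  ks <- sort(unique(labels))
  stopifnot(length(ks) >= 2)
  # One centroid per row.
  centroids <- do.call(rbind, lapply(ks, function(k) {
    colMeans(x[labels == k, , drop = FALSE])
  }))
  # Scatter: mean distance from each point to its cluster centroid.
  scatter <- sapply(seq_along(ks), function(i) {
    pts <- x[labels == ks[i], , drop = FALSE]
    mean(sqrt(rowSums(sweep(pts, 2, centroids[i, ])^2)))
  })
  separation <- as.matrix(dist(centroids))
  ratio <- outer(scatter, scatter, "+") / separation
  diag(ratio) <- -Inf  # a cluster is never compared with itself
  # Average, over clusters, of the similarity to the most similar cluster.
  mean(apply(ratio, 1, max))
}

x <- matrix(c(0, 2, 10, 12), ncol = 1)
davies_bouldin_sketch(x, c(1, 1, 2, 2))  # (1 + 1) / 10 = 0.2
```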
cluster_wrapper function
After I ran the simulations for the first 5 simulation configurations (i.e. the first 5 rows of simgrid) using a small value of B = 6ish and D = 5, things worked fine.
Now, I have increased B to 100 and D to 100.
After that change, I get the following 2 errors and warning multiple times:
Error : number of cluster centres must lie between 1 and nrow(x)
Error in kmeans(x = x, centers = num_clusters, nstart = num_starts, ...) :
more cluster centers than distinct data points.
In addition: Warning message:
In FUN(c(2L, 1L, 3L)[[3L]], ...) : Returning NA
When mclapply exits, I receive:
Warning message:
In mclapply(seq_len(nrow(simgrid)), function(i) { :
all scheduled cores encountered errors in user code
Need to resolve this so that we can move on with simulation.
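One possible mitigation is to guard the k-means call so that a resample with too few distinct points yields NA labels instead of an error that takes down the worker. This is a sketch only: `safe_kmeans` is a hypothetical helper, and it does not address why the resamples contain so few distinct points in the first place.

```r
# Hypothetical guard around kmeans(): return NA labels when the data
# cannot support the requested number of clusters, or when kmeans()
# fails for any other reason.
safe_kmeans <- function(x, num_clusters, num_starts = 10, ...) {
  x <- as.matrix(x)
  n_distinct <- nrow(unique(x))
  if (num_clusters < 1 || num_clusters > n_distinct) {
    return(rep(NA_integer_, nrow(x)))
  }
  fit <- tryCatch(
    kmeans(x = x, centers = num_clusters, nstart = num_starts, ...),
    error = function(e) NULL
  )
  if (is.null(fit)) rep(NA_integer_, nrow(x)) else fit$cluster
}

# Only 2 distinct points, so 3 clusters are impossible: all labels are NA.
too_few <- safe_kmeans(matrix(c(1, 1, 2, 2), ncol = 1), num_clusters = 3)
ok <- safe_kmeans(matrix(c(0, 0.1, 10, 10.1), ncol = 1), num_clusters = 2)
```

Downstream code then needs to handle the NA labels explicitly, for example by skipping that bootstrap replicate.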
Currently, the clustering similarity functions implemented utilize helper functions from various packages. Some of them are much slower than my Rcpp implementation. To streamline the clustering similarity calculations, do the following:
comembership_summary function that uses Rcpp to compute the 2x2 similarity table
jaccard_naive and rand_naive functions
statistic and method = c('naive') arguments: See the entropy package for examples.
Information-theoretic criterion for comparing two partitions.
My initial exploratory research into this topic yielded a lot of bloat and creep into the package. Here, we enumerate all of the changes that must be made to purge the package of this bloat.
TODO
foreach package
mclust package
consensus.r to a consensus branch for later research
sim_gamma
plot.r
NOTES to journal and then remove the file
clustomit paper to that project folder
Email address: Bettina.Gruen at jku.at
Related packages are listed under Additional Functionality.
Provide summary to Bettina to add to the task view.