ltla / bluster
Clone of the Bioconductor repository for the bluster package.
Home Page: https://bioconductor.org/packages/devel/bioc/html/bluster.html
Kind of makes more sense.
Thanks for the package!
Hope I'm not misinterpreting, but in the function clusterRMSD, it seems you're not in fact computing the root mean square, you're just summing square deviations.
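To make the distinction concrete, here is a minimal base R sketch of the two quantities for a single cluster. The helper names sumSquares and rmsd, and the choice to average over cells, are assumptions for illustration; this is not bluster's implementation of clusterRMSD.

```r
# Hypothetical helpers, not taken from bluster.
# x: numeric matrix whose rows are the cells of ONE cluster.
sumSquares <- function(x) {
    centered <- sweep(x, 2, colMeans(x))  # subtract the cluster centroid
    sum(centered^2)                       # sum of squared deviations
}

rmsd <- function(x) {
    # Assumption: the mean is taken over cells (rows).
    sqrt(sumSquares(x) / nrow(x))
}

# Four cells at the corners of a square; the centroid is (1, 1).
x <- rbind(c(0, 0), c(2, 0), c(0, 2), c(2, 2))
sumSquares(x)  # 8
rmsd(x)        # sqrt(2), about 1.414
```

The sum of squares grows with the number of cells in the cluster, whereas the root mean square does not, which is why the two are not interchangeable when comparing clusters of different sizes.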
igraph 1.3.0 is about to be released on CRAN, and it will change the implementation of the Louvain clustering to be randomized, as it should have been from the very beginning. As a result, it seems that unit tests in bluster that rely on the Louvain clustering will break. See this PR in the underlying C core of the igraph library that introduced randomization. I believe that setting a fixed seed before calling the Louvain clustering will fix things here, but I am not 100% sure. Is there anything we can do to help fix the unit tests for the upcoming release?
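A minimal sketch of the fixed-seed idea, assuming the igraph R package is installed and that its randomized algorithms draw from R's random number generator (so that set.seed applies). The graph and seed value are arbitrary choices for illustration, not taken from the bluster test suite.

```r
library(igraph)

# Small built-in example graph (Zachary's karate club).
g <- make_graph("Zachary")

# Reset the seed immediately before each call so that both runs
# start from the same random number stream.
set.seed(100)
first <- as.integer(membership(cluster_louvain(g)))

set.seed(100)
second <- as.integer(membership(cluster_louvain(g)))

identical(first, second)
```

In a unit test, the same pattern would apply to both the reference call and the call under test, so that any comparison between them does not depend on the random node ordering.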
The idea is to compute the proportion of neighbors not in the cluster, divide by the proportion of cells not in the cluster, and cap that at unity. This value should be close to zero if the clusters are pure, and large if not.
Currently, the cluster purity just decreases monotonically when the number of clusters increases, simply because there's higher chance of getting a cell from another cluster in your set of neighbors. By dividing, we favor clusterings that retain purity in the presence of many other clusters, hopefully creating some local minima of impurity where the denominator increases faster than the numerator. This would allow the impurity to be used for auto-choosing the cluster resolution a la the silhouette width.
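The ratio described above can be sketched in base R as follows. This is only an illustration under stated assumptions: impurityRatio is a hypothetical helper, not a bluster function; neighbors are found by brute force from a full distance matrix; and at least two clusters are assumed so that the denominator is non-zero.

```r
# Hypothetical helper, not part of bluster.
# x: numeric matrix (rows are cells); clusters: one label per row; k: neighbors per cell.
impurityRatio <- function(x, clusters, k = 10) {
    d <- as.matrix(dist(x))
    diag(d) <- Inf  # a cell is not its own neighbor

    # Proportion of all cells that are NOT in each cell's own cluster.
    sizes <- table(clusters)
    prop.other.cells <- 1 - as.numeric(sizes[as.character(clusters)]) / length(clusters)

    vapply(seq_len(nrow(x)), function(i) {
        nn <- order(d[i, ])[seq_len(k)]                # k nearest neighbors of cell i
        prop.other.nn <- mean(clusters[nn] != clusters[i])
        min(1, prop.other.nn / prop.other.cells[i])    # cap at unity
    }, numeric(1))
}

# Two well-separated groups: every neighbor shares the cell's own label.
set.seed(1)
x <- rbind(matrix(rnorm(100), ncol = 2), matrix(rnorm(100, mean = 20), ncol = 2))
labels <- rep(c("a", "b"), each = 50)
pure <- impurityRatio(x, labels, k = 5)

# Randomly shuffled labels on the same data, for contrast.
shuffled <- impurityRatio(x, sample(labels), k = 5)
```

Averaging the per-cell values within each cluster, or over all cells, would give a summary that could be tracked across clustering resolutions.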
I have been in need of a way of using bluster to abstract over a two-step approach: hierarchical clustering over a subset of my observations, followed by classification of the remaining observations using knn1.
I modelled the following after your twoStepParam.
Do you think it is a reasonable abstraction to include in future versions of bluster, and if so, how would you like a submission?
#' A two-step approach of clustering a sample subset of observations
#' followed by classification of remaining observations (e.g. using knn1).
#'
#' For large datasets, we can cluster on a subset using an initial
#' technique (e.g., using HclustParam) and then classify the remaining
#' observations based on this clustering using a classifier (e.g., and
#' by default, knn1).
#'
#' @param initialTechnique A \linkS4class{BlusterParam} object specifying an initial clustering.
#' @param initialSample An integer vector specifying the sample on which to conduct initial clustering. A single value is taken as the size of a random sample. 0L, the default, takes the root of n.
#' @param classifier A classifier function such as the default, knn1 (required to have same signature as knn1).
#' @inheritParams clusterRows
#' @param BLUSPARAM A \linkS4class{postClassifyParam} object.
#' @param full Logical scalar indicating whether the initial clustering results should be returned.
#'
#' @details
#'
#' Here, the idea is to allow using an expensive algorithm to
#' establish a training set of clusters on a subset of the data,
#' followed by a less expensive algorithm to classify the remaining
#' unclustered observations.
#'
#' The default choice is to use default hierarchical clustering for
#' the initialTechnique, with initialSample being the root of the
#' number of observations to be used and knn1 as the classifier.
#'
#' @return
#' The \code{postClassifyParam} constructor will return a \linkS4class{postClassifyParam} object with the specified parameters.
#'
#' The \code{clusterRows} method will return a factor of length equal to \code{nrow(x)} containing the cluster assignments.
#' If \code{full=TRUE}, a list is returned with a \code{clusters} factor and an \code{objects} list containing:
#' \itemize{
#' \item \code{initialCluster}, the result of initial clustering.
#' \item \code{initialSample}, the integer vector indexing rows of x which were initially clustered.
#' }
#'
#' @author Malcolm Cook
#' @examples
#' m <- matrix(runif(100000), ncol=10)
#' stuff <- clusterRows(m, postClassifyParam())
#' table(stuff)
#'
#' @name postClassifyParam-class
#' @aliases
#' show,postClassifyParam-method
NULL
#' @export
#' @rdname postClassifyParam-class
setClass("postClassifyParam", contains="BlusterParam", slots=c(initialTechnique="BlusterParam", initialSample="integer", classifier="function"))
#' @export
#' @rdname postClassifyParam-class
postClassifyParam <- function(initialTechnique=HclustParam(), initialSample=0L, classifier=class::knn1) {
    new("postClassifyParam", initialTechnique=initialTechnique, initialSample=initialSample, classifier=classifier)
}
#' @export
#' @importFrom utils capture.output
setMethod("show", "postClassifyParam", function(object) {
    callNextMethod()
    cat("initialTechnique:\n")
    fout <- capture.output(show(object@initialTechnique))
    cat(paste0(" ", paste(fout, collapse="\n "), "\n"))
    cat("initialSample:\n")
    sout <- capture.output(show(object@initialSample))
    # Print the captured initialSample, which was previously overwritten before being shown.
    cat(paste0(" ", paste(sout, collapse="\n "), "\n"))
    cat("classifier:\n")
    cout <- capture.output(show(object@classifier))
    cat(paste0(" ", paste(cout, collapse="\n "), "\n"))
})
#' @export
#' @rdname postClassifyParam-class
setMethod("clusterRows", c("ANY", "postClassifyParam"), function(x, BLUSPARAM, full=FALSE) {
    n <- nrow(x)
    clusters <- rep(NA, n) # allocate results vector.
    initialSample <- BLUSPARAM@initialSample
    # 0L (the default) means: use floor(sqrt(n)) as the sample size.
    if (identical(0L, initialSample)) initialSample <- as.integer(floor(sqrt(n)))
    stopifnot(is.integer(initialSample))
    # A single value is taken as the size of a random sample of rows.
    if (1L == length(initialSample)) initialSample <- sample(n, initialSample)
    # drop=FALSE keeps the subsets as matrices even when only one row is selected.
    initialCluster <- clusterRows(x[initialSample,,drop=FALSE], BLUSPARAM@initialTechnique, full=TRUE)
    clusters[initialSample] <- initialCluster$clusters
    # Classify the remaining rows using the initially clustered rows as the training set.
    clusters[-initialSample] <- BLUSPARAM@classifier(x[initialSample,,drop=FALSE], x[-initialSample,,drop=FALSE], clusters[initialSample])
    clusters <- factor(clusters)
    if (!full) {
        clusters
    } else {
        list(clusters=clusters, objects=list(initialCluster=initialCluster, initialSample=initialSample))
    }
})
The bluster package currently relies on stats::dist for distance calculations in the clustering process.
A limitation of this is that stats::dist covers only a relatively small set of dissimilarity indices. For instance, it is missing many dissimilarity indices that are commonly used in ecological analyses and are available through vegan::vegdist. Extending the set of available dissimilarity indices would help the bluster package support other applications of the SummarizedExperiment family, for instance in the microbiome research that we are working on.
Suggested solution: make vegan::vegdist available in the bluster package. This would concern multiple functions.
The process would then look, for instance in the context of clusterRows and hierarchical clustering, something like:
clusterRows(sce, distfun=stats::dist, HclustParam(metric="euclidean"))
clusterRows(sce, distfun=vegan::vegdist, HclustParam(metric="bray"))
etc.
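A rough sketch of how such a pluggable distance function might behave, written outside of bluster. The helper hclustWithDist is hypothetical, and the distfun and metric names simply mirror the proposal above; the only assumption about the distance function is that it accepts a method argument, as stats::dist and vegan::vegdist both do.

```r
# Hypothetical helper; this is not an existing bluster interface.
hclustWithDist <- function(x, distfun = stats::dist, metric = "euclidean", k = 2) {
    # Assumption: distfun takes the data and a `method` argument.
    d <- stats::as.dist(distfun(x, method = metric))
    stats::cutree(stats::hclust(d), k = k)
}

set.seed(1)
counts <- matrix(rpois(60, lambda = 5), ncol = 6)  # toy count matrix with 10 rows

euclid.clusters <- hclustWithDist(counts, distfun = stats::dist, metric = "euclidean")

# Bray-Curtis dissimilarity via vegan, only if the package is installed.
if (requireNamespace("vegan", quietly = TRUE)) {
    bray.clusters <- hclustWithDist(counts, distfun = vegan::vegdist, metric = "bray")
}
```

Passing the function itself, rather than a string naming it, keeps the helper independent of any particular distance package.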
The HclustParam is currently using stats::hclust as its clustering function. But other functions exist to do hierarchical clustering, for example fastcluster::hclust.
In some situations, being able to choose the function would be handy, as the computation time can be quite high.
A solution would be to add a parameter to HclustParam to let the user choose the function, with the default still being stats::hclust.
The new use would be: clusterRows(sce, HclustParam(clust.func = fastcluster::hclust))
Do you think such addition to bluster would be ok?
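For illustration, the proposed switch could look roughly like this outside of bluster. The helper hclustRows is hypothetical and clust.func mirrors the parameter name suggested above; nothing here reflects an existing HclustParam argument.

```r
# Hypothetical helper; `clust.func` is the proposed parameter, not an existing one.
hclustRows <- function(x, clust.func = stats::hclust, method = "ward.D2", k = 3) {
    tree <- clust.func(stats::dist(x), method = method)
    stats::cutree(tree, k = k)
}

set.seed(42)
x <- matrix(rnorm(200), ncol = 4)  # 50 observations, 4 dimensions

# Default behaviour: stats::hclust.
default.clusters <- hclustRows(x)

# fastcluster::hclust is documented as a drop-in replacement for stats::hclust,
# so it can be passed directly when the package is installed.
if (requireNamespace("fastcluster", quietly = TRUE)) {
    fast.clusters <- hclustRows(x, clust.func = fastcluster::hclust)
}
```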
Might as well, for the purists.
I am using clusterRows and HclustParam from bluster v1.0.0 (R 4.0.3 and Bioconductor 3.12) to perform cell clustering. I tried it in a Jupyter notebook and in R in the terminal; in both cases the process was terminated.
I've used the same code on other datasets (< 10K cells) without issue.
# sce
# class: SingleCellExperiment
# dim: 32285 26875
# This works
my.dist <- dist(reducedDim(sce, "PCA"), method = "euclidean")
hclust.clusters <- cutreeDynamic(hclust(my.dist, method = "ward.D2"), distM = as.matrix(my.dist))
# Kernel killed
hclust.clusters <- clusterRows(reducedDim(sce, "PCA"), HclustParam(method = "ward.D2", cut.dynamic = TRUE))
KmeansParam, arguably. Use full if the current objects is already a list and you can't think of any other name.
I want to compute the RMSD of each cluster but I'm not sure what to actually use it for. It's not a very good per-cluster metric and it's monotonically decreasing with increasing number of clusters so it's not very good for choosing the number of clusters either.
Guess I could just combine it with PCAtools's elbow point implementation and hope for the best.
Hi, it seems that the Annoy algorithm in bluster cannot accelerate the clustering process. uncorrected_sce is a SingleCellExperiment object with a PCA entry in reducedDimNames, which has dimensions 111455 x 50.
seurat_obj is converted from uncorrected_sce.
microbenchmark::microbenchmark(
scran::clusterCells(
uncorrected_sce,
use.dimred = "PCA",
BLUSPARAM = bluster::SNNGraphParam(
k = 25L, type = "rank"
)
),
scran::clusterCells(
uncorrected_sce,
use.dimred = "PCA",
BLUSPARAM = bluster::SNNGraphParam(
k = 25L,
BNPARAM = BiocNeighbors::AnnoyParam()
)
),
scran::clusterCells(
uncorrected_sce,
use.dimred = "PCA",
BLUSPARAM = bluster::SNNGraphParam(
k = 25L, type = "jaccard", cluster.fun = "louvain"
)
),
scran::clusterCells(
uncorrected_sce,
use.dimred = "PCA",
BLUSPARAM = bluster::SNNGraphParam(
k = 25L, type = "jaccard", cluster.fun = "louvain",
BNPARAM = BiocNeighbors::AnnoyParam()
)
),
{
seurat_nei <- FindNeighbors(
seurat_obj,
dims = 1:50,
k.param = 25L
)
seurat_nei <- FindClusters(seurat_nei, resolution = 1)
},
times = 1L
)
expr
[1] scran::clusterCells(uncorrected_sce, use.dimred = "PCA", BLUSPARAM = bluster::SNNGraphParam(k = 25L, type = "rank"))
[2] scran::clusterCells(uncorrected_sce, use.dimred = "PCA", BLUSPARAM = bluster::SNNGraphParam(k = 25L, BNPARAM = BiocNeighbors::AnnoyParam()))
[3] scran::clusterCells(uncorrected_sce, use.dimred = "PCA", BLUSPARAM = bluster::SNNGraphParam(k = 25L, type = "jaccard", cluster.fun = "louvain"))
[4] scran::clusterCells(uncorrected_sce, use.dimred = "PCA", BLUSPARAM = bluster::SNNGraphParam(k = 25L, type = "jaccard", cluster.fun = "louvain", BNPARAM = BiocNeighbors::AnnoyParam()))
[5] { seurat_nei <- FindNeighbors(seurat_obj, dims = 1:50, k.param = 25L); seurat_nei <- FindClusters(seurat_nei, resolution = 1) }

expr         min          lq        mean      median          uq         max  neval
[1]   2519.89881  2519.89881  2519.89881  2519.89881  2519.89881  2519.89881      1
[2]   2402.44808  2402.44808  2402.44808  2402.44808  2402.44808  2402.44808      1
[3]     59.89610    59.89610    59.89610    59.89610    59.89610    59.89610      1
[4]     66.01533    66.01533    66.01533    66.01533    66.01533    66.01533      1
[5]     39.35732    39.35732    39.35732    39.35732    39.35732    39.35732      1
Hi, would it be possible to add the DMM algorithm to bluster? It's an algorithm that's commonly used in microbial ecology along with metagenomic & 16S rRNA count data.
We had functions related to it over at microbiome/mia, but figured it could have its place in bluster as well. We've started discussing it in a PR in microbiome/bluster.
Do you think such a param could have its place in bluster?