ltla / bluster

Clone of the Bioconductor repository for the bluster package.

Home Page: https://bioconductor.org/packages/devel/bioc/html/bluster.html

R 95.43% C++ 4.57%

bioconductor

bluster's Issues

`linkedClusters` should use Jaccard by default

Kind of makes more sense.

RMSD not taking root mean square but sum of squares

Thanks for the package!

Hope I'm not misinterpreting, but the function clusterRMSD does not seem to compute the root mean square; it just sums the squared deviations.
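
To make the distinction concrete, here is a minimal base-R sketch contrasting a per-cluster sum of squared deviations with a root-mean-square deviation. The helper names `cluster_ss` and `cluster_rmsd` are hypothetical and not part of bluster, and normalizing by the number of cells in each cluster is an assumption; other conventions (for example dividing by n - 1) are equally plausible.

```r
# Hypothetical helpers for illustration; not part of bluster.
cluster_ss <- function(x, clusters) {
    # Sum of squared deviations of each cell from its cluster centroid.
    vapply(split(seq_len(nrow(x)), clusters), function(i) {
        sub <- x[i, , drop = FALSE]
        sum(sweep(sub, 2, colMeans(sub))^2)
    }, numeric(1))
}

cluster_rmsd <- function(x, clusters) {
    # Assumption: average the squared deviations over the cells in the
    # cluster, then take the square root.
    n <- lengths(split(seq_len(nrow(x)), clusters))
    sqrt(cluster_ss(x, clusters) / n)
}

x <- matrix(c(0, 2, 10, 14), ncol = 1)
clusters <- c("a", "a", "b", "b")
cluster_ss(x, clusters)    # sums of squares: 2 for "a", 8 for "b"
cluster_rmsd(x, clusters)  # root mean squares: 1 for "a", 2 for "b"
```

The sum of squares grows with cluster size, whereas the root-mean-square version stays on the scale of the data.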

Unit tests will break in igraph 1.3.0

igraph 1.3.0 is about to be released on CRAN, and it will change the implementation of Louvain clustering to be randomized, as it should have been from the very beginning. As a result, it seems that unit tests in bluster that rely on Louvain clustering will break. See this PR in the underlying C core of the igraph library that introduced randomization. I believe that setting a fixed seed before calling the Louvain clustering will fix things here, but I am not 100% sure. Is there anything we can do to help fix the unit tests for the upcoming release?
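
A minimal sketch of the fixed-seed idea, assuming the igraph R package draws from R's random number generator so that `set.seed()` makes a randomized Louvain run repeatable. The Zachary karate club graph is only a convenient built-in example and is not taken from bluster's tests.

```r
library(igraph)

# Small built-in example graph.
g <- make_graph("Zachary")

# Fix the seed immediately before each call to the randomized algorithm.
set.seed(42)
first <- membership(cluster_louvain(g))

set.seed(42)
second <- membership(cluster_louvain(g))

# With the same seed, the two runs are expected to agree.
identical(as.vector(first), as.vector(second))
```

In a test suite, the same pattern would mean calling `set.seed()` before both the reference computation and the call under test.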

Cluster purity should be cluster impurity

The idea is to compute the proportion of neighbors not in the cluster, divide by the proportion of cells not in the cluster, and cap that at unity. This value should be close to zero if the clusters are pure, and large if not.

Currently, the cluster purity just decreases monotonically when the number of clusters increases, simply because there's higher chance of getting a cell from another cluster in your set of neighbors. By dividing, we favor clusterings that retain purity in the presence of many other clusters, hopefully creating some local minima of impurity where the denominator increases faster than the numerator. This would allow the impurity to be used for auto-choosing the cluster resolution a la the silhouette width.
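
The proposal can be sketched in base R as follows. This is a brute-force illustration under stated assumptions, not bluster's implementation: the helper name `neighbor_impurity` is hypothetical, neighbors are found from a full Euclidean distance matrix, and the denominator is computed over all cells other than the cell itself.

```r
# Hypothetical helper for illustration; not part of bluster.
neighbor_impurity <- function(x, clusters, k = 5) {
    d <- as.matrix(dist(x))
    diag(d) <- Inf  # a cell is not its own neighbor
    clusters <- as.character(clusters)
    vapply(seq_len(nrow(x)), function(i) {
        nn <- order(d[i, ])[seq_len(k)]
        # Numerator: proportion of neighbors assigned to another cluster.
        prop_nn_other <- mean(clusters[nn] != clusters[i])
        # Denominator: proportion of the other cells assigned to another cluster.
        prop_all_other <- mean(clusters[-i] != clusters[i])
        if (prop_all_other == 0) return(0)  # a single cluster cannot be impure
        min(1, prop_nn_other / prop_all_other)  # cap at unity
    }, numeric(1))
}

# Two well-separated groups along one axis.
x <- cbind(c(1:6, 101:106), 0)
pure <- rep(c("a", "b"), each = 6)
mixed <- rep(c("a", "b"), times = 6)
mean(neighbor_impurity(x, pure, k = 3))   # neighbors always share the label
mean(neighbor_impurity(x, mixed, k = 3))  # labels alternate, so impurity is high
```

Averaging the per-cell values gives a single number per clustering, which is what would be compared across resolutions.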

clustering on a sample/subset followed by classification

I have been in need of a way of using bluster to abstract over a two-step approach of hierarchical clustering over a subset of my observations followed by classification of the remaining observations using knn1.

I modelled the following after your TwoStepParam.

Do you think it is a reasonable abstraction to include in future versions of bluster, and if so, how would you like a submission?

#' A two-step approach of clustering a sample subset of observations
#' followed by classification of remaining observations (e.g. using knn1).
#'
#' For large datasets, we can cluster on a subset using an initial
#' technique (e.g., using HclustParam) and then classify the remaining
#' observations based on this clustering using a classifier (e.g., and
#' by default, knn1).
#'
#' @param initialTechnique A \linkS4class{BlusterParam} object specifying an initial clustering.
#' @param initialSample An integer vector specifying the rows on which to conduct the initial clustering. A single value is taken as the size of a random sample. The default, 0L, takes the square root of n.
#' @param classifier A classifier function such as the default, knn1 (required to have the same signature as knn1).
#' @inheritParams clusterRows
#' @param BLUSPARAM A \linkS4class{postClassifyParam} object.
#' @param full Logical scalar indicating whether the initial clustering results should be returned.
#' 
#' @details
#'
#' Here, the idea is to allow using an expensive algorithm to
#' establish a training set of clusters on a subset of the data,
#' followed by a less expensive algorithm to classify the remaining
#' unclustered observations.
#' 
#' The default choice is to use default hierarchical clustering for
#' the initialTechnique, with the size of initialSample being the
#' square root of the number of observations and knn1 as the classifier.
#'
#' @return 
#' The \code{postClassifyParam} constructor will return a \linkS4class{postClassifyParam} object with the specified parameters.
#'
#' The \code{clusterRows} method will return a factor of length equal to \code{nrow(x)} containing the cluster assignments.
#' If \code{full=TRUE}, a list is returned with a \code{clusters} factor and an \code{objects} list containing:
#' \itemize{
#' \item \code{initialCluster}, the result of initial clustering.
#' \item \code{initialSample}, the integer vector indexing rows of x which were initially clustered.
#' }
#'
#' @author Malcolm Cook
#' @examples
#' m <- matrix(runif(100000), ncol=10)
#' stuff <- clusterRows(m, postClassifyParam())
#' table(stuff)
#'
#' @name postClassifyParam-class
#' @aliases
#' show,postClassifyParam-method
NULL

#' @export
#' @rdname postClassifyParam-class
setClass("postClassifyParam", contains="BlusterParam", slots=c(initialTechnique="BlusterParam", initialSample="integer",classifier="function"))

#' @export
#' @rdname postClassifyParam-class
postClassifyParam <- function(initialTechnique=HclustParam(), initialSample=0L,classifier=class::knn1) {
    new("postClassifyParam", initialTechnique=initialTechnique, initialSample=initialSample,classifier=classifier)
}

#' @export
#' @importFrom utils capture.output
setMethod("show", "postClassifyParam", function(object) {
    callNextMethod()
    cat("initialTechnique:\n")
    fout <- capture.output(show(object@initialTechnique))
    cat(paste0("  ", paste(fout, collapse="\n  "), "\n"))
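    cat("initialSample:\n")
    # Print the captured sample specification (previously captured but never shown).
    sout <- capture.output(show(object@initialSample))
    cat(paste0("  ", paste(sout, collapse="\n  "), "\n"))
    cat("classifier:\n")
    cout <- capture.output(show(object@classifier))
    cat(paste0("  ", paste(cout, collapse="\n  "), "\n"))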
})

#' @export
#' @rdname postClassifyParam-class
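setMethod("clusterRows", c("ANY", "postClassifyParam"), function(x, BLUSPARAM, full=FALSE) {
    n <- nrow(x)
    initialSample <- BLUSPARAM@initialSample
    # The default of 0L means "sample floor(sqrt(n)) rows at random".
    if (identical(0L, initialSample)) initialSample <- as.integer(floor(sqrt(n)))
    stopifnot(is.integer(initialSample))
    # A single value is taken as the size of a random sample of rows.
    if (1L == length(initialSample)) initialSample <- sample(n, initialSample)
    # Cluster the sampled rows; drop=FALSE keeps a matrix even for a single row.
    initialCluster <- clusterRows(x[initialSample,,drop=FALSE], BLUSPARAM@initialTechnique, full=TRUE)
    # Store character labels so that assignment does not silently replace
    # the factor levels with their integer codes.
    clusters <- rep(NA_character_, n)
    clusters[initialSample] <- as.character(initialCluster$clusters)
    # Classify the remaining rows against the initial clustering.
    clusters[-initialSample] <- as.character(BLUSPARAM@classifier(
        x[initialSample,,drop=FALSE], x[-initialSample,,drop=FALSE], clusters[initialSample]))
    clusters <- factor(clusters)
    if (!full) {
        clusters
    } else {
        list(clusters=clusters, objects=list(initialCluster=initialCluster, initialSample=initialSample))
    }
})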

Distance/dissimilarity measures: extension

The bluster package currently relies on stats::dist for distance calculations in the clustering process.

A limitation of this is that stats::dist covers only a relatively small set of dissimilarity indices. For instance, it is missing many dissimilarity indices that are commonly used in ecological analyses and are available through vegan::vegdist. Extending the set of available dissimilarity indices would help the bluster package support other applications of the SummarizedExperiment family, for instance the microbiome research that we are working on.
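
As a rough illustration of what a pluggable dissimilarity would enable, the sketch below builds a Bray-Curtis dissimilarity by hand and passes it to stats::hclust. The helper `bray_curtis` is hypothetical and written out only to keep the example self-contained (in practice vegan::vegdist would be the natural source), and it assumes non-negative counts with no pair of all-zero rows.

```r
# Hypothetical helper: Bray-Curtis dissimilarity between rows, as a "dist" object.
bray_curtis <- function(x) {
    n <- nrow(x)
    d <- matrix(0, n, n)
    for (i in seq_len(n)) {
        for (j in seq_len(n)) {
            d[i, j] <- sum(abs(x[i, ] - x[j, ])) / sum(x[i, ] + x[j, ])
        }
    }
    as.dist(d)
}

# Toy count table: rows are samples, columns are taxa.
counts <- rbind(c(10, 0, 0), c(8, 1, 0), c(0, 9, 4), c(1, 10, 5))

# Any function returning a "dist" object can feed hierarchical clustering.
tree <- hclust(bray_curtis(counts), method = "average")
groups <- cutree(tree, k = 2)
groups
```

The only requirement on the distance function in this sketch is that it returns something `hclust` accepts, which is why a function-valued parameter would be enough to support it.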

Extension of HclustParam with hclust function choice

HclustParam currently uses stats::hclust as its clustering function, but other functions exist for hierarchical clustering, for example fastcluster::hclust.
Being able to choose the function would be handy in some situations, as the computation time can be quite high.

A solution would be to add a parameter to HclustParam that lets the user choose the function, with stats::hclust remaining the default.

The new use would be: clusterRows(sce, HclustParam(clust.func = fastcluster::hclust))

Do you think such an addition to bluster would be ok?
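
A minimal sketch of the idea in base R, independent of bluster's internals. The wrapper `hclust_rows` is hypothetical; its `clust.func` argument mirrors the name proposed above and only assumes that the supplied function accepts a "dist" object and a `method` argument and returns an "hclust" object.

```r
# Hypothetical wrapper for illustration; not part of bluster.
hclust_rows <- function(x, clust.func = stats::hclust, method = "ward.D2", k = 2) {
    tree <- clust.func(stats::dist(x), method = method)
    # Guard against replacements that do not return a compatible tree.
    stopifnot(inherits(tree, "hclust"))
    stats::cutree(tree, k = k)
}

x <- cbind(c(0, 1, 10, 11), 0)
hclust_rows(x)  # default: stats::hclust

# With the fastcluster package installed, the same call could use its
# drop-in replacement:
# hclust_rows(x, clust.func = fastcluster::hclust)
```

Because the replacement is a drop-in, nothing downstream of the tree construction needs to change.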

`approxSilhouette` should have an `exact=TRUE` mode

Might as well, for the purists.
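
For reference, exact silhouette widths can already be computed from a full distance matrix with the cluster package, which is one way an exact mode could be checked. This is a sketch of the exact computation only, not of how `approxSilhouette` would implement it, and the toy data are made up for illustration.

```r
library(cluster)

# Two well-separated groups on a line.
x <- cbind(c(0, 1, 2, 10, 11, 12), 0)
clusters <- c(1L, 1L, 1L, 2L, 2L, 2L)

# silhouette() takes integer cluster labels and a dissimilarity object.
sil <- silhouette(clusters, dist(x))
widths <- sil[, "sil_width"]
widths
```

The exact computation needs all pairwise distances, so its cost grows quadratically with the number of cells.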

R kernel killed

I am using clusterRows and HclustParam from bluster v1.0.0 (R 4.0.3 and Bioconductor 3.12) to perform cell clustering. I tried it in a Jupyter notebook and in R in a terminal; in both cases the process was killed.

I've used the same code on other datasets (< 10K cells) without issue.

# sce
# class: SingleCellExperiment 
# dim: 32285 26875 

# This works
my.dist <- dist(reducedDim(sce, "PCA"), method = "euclidean")
hclust.clusters <- cutreeDynamic(hclust(my.dist, method = "ward.D2"), distM = as.matrix(my.dist))

# Kernel killed
hclust.clusters <- clusterRows(reducedDim(sce, "PCA"), HclustParam(method = "ward.D2", cut.dynamic = TRUE))

All `objects` should be a list of objects, rather than the objects themselves.

HclustParam's object should be a list with a distance matrix and the hclust object
Same for KmeansParam, arguably.

Use full if the current objects is already a list and you can't think of any other name.

Need to figure out what to make of the WCSS

I want to compute the RMSD of each cluster but I'm not sure what to actually use it for. It's not a very good per-cluster metric and it's monotonically decreasing with increasing number of clusters so it's not very good for choosing the number of clusters either.

Guess I could just combine it with PCAtools's elbow point implementation and hope for the best.

Annoy algorithm in `bluster` cannot accelerate the clustering process

Hi, it seems that the Annoy algorithm in bluster does not accelerate the clustering process. uncorrected_sce is a SingleCellExperiment object whose PCA reduced dimension has dimensions 111455 x 50.
seurat_obj is converted from uncorrected_sce.

microbenchmark::microbenchmark(
    scran::clusterCells(
        uncorrected_sce,
        use.dimred = "PCA",
        BLUSPARAM = bluster::SNNGraphParam(
            k = 25L, type = "rank"
        )
    ),
    scran::clusterCells(
        uncorrected_sce,
        use.dimred = "PCA",
        BLUSPARAM = bluster::SNNGraphParam(
            k = 25L,
            BNPARAM = BiocNeighbors::AnnoyParam()
        )
    ),
    scran::clusterCells(
        uncorrected_sce,
        use.dimred = "PCA",
        BLUSPARAM = bluster::SNNGraphParam(
            k = 25L, type = "jaccard", cluster.fun = "louvain"
        )
    ),
    scran::clusterCells(
        uncorrected_sce,
        use.dimred = "PCA",
        BLUSPARAM = bluster::SNNGraphParam(
            k = 25L, type = "jaccard", cluster.fun = "louvain",
            BNPARAM = BiocNeighbors::AnnoyParam()
        )
    ),
    {
        seurat_nei <- FindNeighbors(
            seurat_obj,
            dims = 1:50,
            k.param = 25L
        )
        seurat_nei <- FindClusters(seurat_nei, resolution = 1)
    },
    times = 1L
)

Timings for the five expressions, in the order given above (the first four rows show the BLUSPARAM argument passed to scran::clusterCells; the last is the Seurat block):

       min         lq       mean     median         uq        max neval  expr
2519.89881 2519.89881 2519.89881 2519.89881 2519.89881 2519.89881     1  SNNGraphParam(k = 25L, type = "rank")
2402.44808 2402.44808 2402.44808 2402.44808 2402.44808 2402.44808     1  SNNGraphParam(k = 25L, BNPARAM = BiocNeighbors::AnnoyParam())
  59.89610   59.89610   59.89610   59.89610   59.89610   59.89610     1  SNNGraphParam(k = 25L, type = "jaccard", cluster.fun = "louvain")
  66.01533   66.01533   66.01533   66.01533   66.01533   66.01533     1  SNNGraphParam(k = 25L, type = "jaccard", cluster.fun = "louvain", BNPARAM = BiocNeighbors::AnnoyParam())
  39.35732   39.35732   39.35732   39.35732   39.35732   39.35732     1  { FindNeighbors(seurat_obj, dims = 1:50, k.param = 25L); FindClusters(seurat_nei, resolution = 1) }

Addition of the DMM algorithm

Hi, would it be possible to add the DMM algorithm to bluster? It's an algorithm that's commonly used in microbial ecology along with metagenomic & 16S rRNA count data.
We had functions related to it over at microbiome/mia, but figured it could have its place in bluster as well. We've started discussing it in a PR in microbiome/bluster.

Do you think such a param could have its place in bluster?