bluster's People

Contributors

bananacancer, jwokaty, ltla, nturaga, tuomasborman

bluster's Issues

Unit tests will break in igraph 1.3.0

igraph 1.3.0 will soon be released on CRAN, and it changes the implementation of Louvain clustering to be randomized, as it should have been from the very beginning. As a result, unit tests in bluster that rely on Louvain clustering will likely break. See this PR in the underlying C core of the igraph library that introduced the randomization. I believe that setting a fixed seed before calling the Louvain clustering will fix things here, but I am not 100% sure. Is there anything we can do to help fix the unit tests for the upcoming release?
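A minimal sketch of the seeding idea, assuming the igraph R package is installed and that its randomized routines draw from R's own random number generator, so that set.seed() controls them. The Zachary karate club graph is only a stand-in for whatever graph the bluster tests build.

```r
library(igraph)

# Stand-in graph; in a unit test this would be the graph under test.
g <- make_graph("Zachary")

# Fix the seed immediately before each call so that any randomness inside
# the Louvain routine starts from the same RNG state.
set.seed(1000)
first <- membership(cluster_louvain(g))

set.seed(1000)
second <- membership(cluster_louvain(g))

# TRUE when both runs produced the same assignments.
same <- identical(as.integer(first), as.integer(second))
same
```

A test written this way compares two seeded runs against each other instead of against hard-coded cluster labels, which is what tends to break when the algorithm changes.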

Cluster purity should be cluster impurity

The idea is to compute the proportion of neighbors not in the cluster, divide by the proportion of cells not in the cluster, and cap that at unity. This value should be close to zero if the clusters are pure, and large if not.

Currently, the cluster purity just decreases monotonically when the number of clusters increases, simply because there's a higher chance of getting a cell from another cluster in your set of neighbors. By dividing, we favor clusterings that retain purity in the presence of many other clusters, hopefully creating some local minima of impurity where the denominator increases faster than the numerator. This would allow the impurity to be used for auto-choosing the cluster resolution a la the silhouette width.
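The proposed quantity can be sketched in base R as follows. This is an unweighted illustration under stated assumptions, not bluster's implementation: cellImpurity is a hypothetical helper, neighbors are found by brute force with dist(), and every neighbor counts equally.

```r
# Hypothetical helper: per-cell impurity from the k nearest neighbors.
cellImpurity <- function(x, clusters, k = 10) {
    clusters <- as.factor(clusters)
    d <- as.matrix(dist(x))
    diag(d) <- Inf  # a cell is not its own neighbor

    # Denominator: proportion of all cells that lie outside each cell's cluster.
    # (This is 0 when there is only one cluster, giving NaN below.)
    sizes <- table(clusters)
    outside <- 1 - as.numeric(sizes[as.character(clusters)]) / length(clusters)

    vapply(seq_len(nrow(d)), function(i) {
        nn <- order(d[i, ])[seq_len(k)]
        # Numerator: proportion of neighbors assigned to another cluster.
        foreign <- mean(clusters[nn] != clusters[i])
        min(1, foreign / outside[i])  # cap at unity
    }, numeric(1))
}

set.seed(10)
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 50), ncol = 2))
truth <- rep(1:2, each = 100)

pure <- cellImpurity(x, truth)           # labels match the two blobs
mixed <- cellImpurity(x, sample(truth))  # labels shuffled at random
c(pure = mean(pure), mixed = mean(mixed))
```

Averaging the per-cell values over a range of clusterings would give one curve per resolution, which is where a local minimum could be looked for.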

clustering on a sample/subset followed by classification

I have been in need of a way of using bluster to abstract over a two-step approach: hierarchical clustering over a subset of my observations, followed by classification of the remaining observations using knn1.

I modelled the following after your TwoStepParam.

Do you think it is a reasonable abstraction to include in future versions of bluster, and if so, how would you like a submission?

#' A two-step approach of clustering a sample subset of observations
#' followed by classification of remaining observations (e.g. using knn1).
#'
#' For large datasets, we can cluster on a subset using an initial
#' technique (e.g., using HclustParam) and then classify the remaining
#' observations based on this clustering using a classifier (e.g., and
#' by default, knn1).
#'
#' @param initialTechnique A \linkS4class{BlusterParam} object specifying an initial clustering.
#' @param initialSample An integer vector specifying the sample on which to conduct initial clustering. A single value is taken as the size of a random sample. 0L, the default, takes the square root of n.
#' @param classifier A classifier function such as the default, knn1 (required to have the same signature as knn1).
#' @inheritParams clusterRows
#' @param BLUSPARAM A \linkS4class{postClassifyParam} object.
#' @param full Logical scalar indicating whether the initial clustering results should be returned.
#' 
#' @details
#'
#' Here, the idea is to allow using an expensive algorithm to
#' establish a training set of clusters on a subset of the data,
#' followed by a less expensive algorithm to classify the remaining
#' unclustered observations.
#' 
#' The default choice is to use default hierarchical clustering for
#' the initialTechnique, with initialSample being the square root of
#' the number of observations and knn1 as the classifier.
#'
#' @return 
#' The \code{postClassifyParam} constructor will return a \linkS4class{postClassifyParam} object with the specified parameters.
#'
#' The \code{clusterRows} method will return a factor of length equal to \code{nrow(x)} containing the cluster assignments.
#' If \code{full=TRUE}, a list is returned with a \code{clusters} factor and an \code{objects} list containing:
#' \itemize{
#' \item \code{initialCluster}, the result of initial clustering.
#' \item \code{initialSample}, the integer vector indexing rows of x which were initially clustered.
#' }
#'
#' @author Malcolm Cook
#' @examples
#' m <- matrix(runif(100000), ncol=10)
#' stuff <- clusterRows(m, postClassifyParam())
#' table(stuff)
#'
#' @name postClassifyParam-class
#' @aliases
#' show,postClassifyParam-method
NULL

#' @export
#' @rdname postClassifyParam-class
setClass("postClassifyParam", contains="BlusterParam", slots=c(initialTechnique="BlusterParam", initialSample="integer",classifier="function"))

#' @export
#' @rdname postClassifyParam-class
postClassifyParam <- function(initialTechnique=HclustParam(), initialSample=0L,classifier=class::knn1) {
    new("postClassifyParam", initialTechnique=initialTechnique, initialSample=initialSample,classifier=classifier)
}

#' @export
#' @importFrom utils capture.output
setMethod("show", "postClassifyParam", function(object) {
    callNextMethod()
    cat("initialTechnique:\n")
    fout <- capture.output(show(object@initialTechnique))
    cat(paste0("  ", paste(fout, collapse="\n  "), "\n"))
    cat("initialSample:\n")
    sout <- capture.output(show(object@initialSample))
    cat(paste0("  ", paste(sout, collapse="\n  "), "\n"))
    cat("classifier:\n")
    cout <- capture.output(show(object@classifier))
    cat(paste0("  ", paste(cout, collapse="\n  "), "\n"))
})

#' @export
#' @rdname postClassifyParam-class
setMethod("clusterRows", c("ANY", "postClassifyParam"), function(x, BLUSPARAM, full=FALSE) {
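  n <- nrow(x)
  clusters <- rep(NA_integer_, n) # integer cluster codes, filled in below
  initialSample <- BLUSPARAM@initialSample
  # 0L (the default) means: use floor(sqrt(n)) as the sample size.
  if (identical(0L, initialSample)) initialSample <- as.integer(floor(sqrt(n)))
  stopifnot(is.integer(initialSample))
  # A single value is the size of a random sample; a longer vector indexes rows.
  if (1L == length(initialSample)) initialSample <- sample(n, initialSample)
  initialCluster <- clusterRows(x[initialSample, , drop=FALSE], BLUSPARAM@initialTechnique, full=TRUE)
  clusters[initialSample] <- as.integer(initialCluster$clusters)
  # The classifier returns a factor whose labels are the training codes;
  # convert via character so labels, not level positions, are stored.
  predicted <- BLUSPARAM@classifier(x[initialSample, , drop=FALSE], x[-initialSample, , drop=FALSE], clusters[initialSample])
  clusters[-initialSample] <- as.integer(as.character(predicted))
  clusters <- factor(clusters)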
  if (!full) {
    clusters
  } else {
    list(clusters=clusters
         ,objects=list(initialCluster=initialCluster,initialSample=initialSample)
         )
  }
})
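The same two-step idea, stripped of the S4 machinery, can be sketched with base R and the class package alone. This is an illustration on simulated data, not a replacement for the method above; the choice of two clusters and the variable names are assumptions made for the example.

```r
library(class)  # provides knn1()

set.seed(42)
# Two well-separated groups of 200 observations in 10 dimensions.
x <- rbind(matrix(rnorm(2000, mean = 0), ncol = 10),
           matrix(rnorm(2000, mean = 5), ncol = 10))
n <- nrow(x)

# Step 1: hierarchical clustering on a random subset of about sqrt(n) rows.
chosen <- sample(n, floor(sqrt(n)))
tree <- hclust(dist(x[chosen, , drop = FALSE]))
train.labels <- factor(cutree(tree, k = 2))

# Step 2: every remaining row takes the label of its nearest clustered row.
clusters <- factor(rep(NA, n), levels = levels(train.labels))
clusters[chosen] <- train.labels
clusters[-chosen] <- knn1(x[chosen, , drop = FALSE],
                          x[-chosen, , drop = FALSE],
                          train.labels)
table(clusters)
```

Only the subset ever goes through dist() and hclust(), so the quadratic cost is paid on roughly sqrt(n) rows instead of n.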

Distance/dissimilarity measures: extension

The bluster package currently relies on stats::dist for distance calculations in the clustering process.

A limitation of this is that stats::dist covers only a relatively small set of dissimilarity indices. For instance, it is missing many dissimilarity indices that are commonly used in ecological analyses and that are available through vegan::vegdist. Extending the range of available dissimilarity indices would help bluster support other applications of the SummarizedExperiment family, such as the microbiome research that we are working on.

Suggested solution:

  • Add support for vegan::vegdist in the bluster package
  • Implement this so that the user could define the distance function as a function argument. This way one could avoid adding new dependencies in the bluster package.

This would concern multiple functions.

The process would then look, for instance in the context of clusterRows and hierarchical clustering, something like:

clusterRows(sce, distfun=stats::dist, HclustParam(metric="euclidean"))

clusterRows(sce, distfun=vegan::vegdist, HclustParam(metric="bray"))

etc.
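The second suggestion, passing the distance function as an argument, can be sketched in base R. hclustWithDist and its argument names are hypothetical and are not bluster's API; the only requirement assumed here is that the supplied function accepts a `method` argument and returns a `dist` object.

```r
# Hypothetical wrapper: the distance function is supplied by the caller.
hclustWithDist <- function(x, k, dist.fun = stats::dist, metric = "euclidean",
                           linkage = "complete") {
    d <- dist.fun(x, method = metric)  # any function returning a 'dist' object
    tree <- stats::hclust(d, method = linkage)
    factor(stats::cutree(tree, k = k))
}

set.seed(1)
x <- matrix(runif(500), ncol = 5)

euclid <- hclustWithDist(x, k = 3)
manhat <- hclustWithDist(x, k = 3, metric = "manhattan")

# With vegan installed, Bray-Curtis would look like this (not run here):
# bray <- hclustWithDist(x, k = 3, dist.fun = vegan::vegdist, metric = "bray")
table(euclid, manhat)
```

Because the function is passed in by the caller, the wrapper itself needs no dependency on vegan.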

Extension of HclustParam with hclust function choice

HclustParam currently uses stats::hclust as its clustering function, but other functions exist for hierarchical clustering, for example fastcluster::hclust.
Being able to choose the function would be handy in some situations, as the computation time can be quite high.

A solution would be to add a parameter to HclustParam that lets the user choose the function, with stats::hclust remaining the default.

The new use would be: clusterRows(sce, HclustParam(clust.func = fastcluster::hclust))

Do you think such an addition to bluster would be ok?
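A rough sketch of what a pluggable clustering function could look like, written as a plain function so it runs without bluster. cutTree and clust.fun are hypothetical names; the assumption is that the supplied function takes a `dist` object plus a `method` string and returns an `hclust` object.

```r
# Hypothetical wrapper: 'clust.fun' defaults to stats::hclust but can be swapped.
cutTree <- function(x, k, clust.fun = stats::hclust, method = "ward.D2") {
    tree <- clust.fun(stats::dist(x), method = method)
    stopifnot(inherits(tree, "hclust"))  # guard against incompatible functions
    factor(stats::cutree(tree, k = k))
}

set.seed(2)
x <- matrix(rnorm(1000), ncol = 10)
default <- cutTree(x, k = 4)

# Any function with the same interface can be passed in.
myHclust <- function(d, method) stats::hclust(d, method = method)
custom <- cutTree(x, k = 4, clust.fun = myHclust)

# With fastcluster installed, the call would be (not run here):
# fast <- cutTree(x, k = 4, clust.fun = fastcluster::hclust)
identical(default, custom)
```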

R kernel killed

I am using clusterRows and HclustParam from bluster v1.0.0 (R 4.0.3 and Bioconductor 3.12) to perform cell clustering. I tried it both in a Jupyter notebook and in R in a terminal, and in both cases the process was terminated.

I've used the same code on other datasets (< 10K cells) without issue.

# sce
# class: SingleCellExperiment 
# dim: 32285 26875 

# This works
my.dist <- dist(reducedDim(sce, "PCA"), method = "euclidean")
hclust.clusters <- cutreeDynamic(hclust(my.dist, method = "ward.D2"), distM = as.matrix(my.dist))

# Kernel killed
hclust.clusters <- clusterRows(reducedDim(sce, "PCA"), HclustParam(method = "ward.D2", cut.dynamic = TRUE))
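For scale, the sizes of the distance objects involved can be worked out directly from the number of cells. This does not diagnose the crash, since the working snippet builds the same objects; it only assumes 8-byte doubles and shows how much memory a single copy takes.

```r
n <- 26875             # cells, i.e. rows of the PCA matrix
bytes.per.double <- 8

# A 'dist' object stores the lower triangle only.
n.pairs <- n * (n - 1) / 2
dist.gib <- n.pairs * bytes.per.double / 2^30

# as.matrix() on a 'dist' object expands it to a dense n x n matrix.
full.gib <- n^2 * bytes.per.double / 2^30

round(c(dist = dist.gib, full = full.gib), 2)
```

That is roughly 2.7 GiB for the `dist` object and 5.4 GiB for each dense copy, so a few extra intermediate copies could plausibly exhaust memory on a typical workstation.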

Need to figure out what to make of the WCSS

I want to compute the RMSD of each cluster but I'm not sure what to actually use it for. It's not a very good per-cluster metric and it's monotonically decreasing with increasing number of clusters so it's not very good for choosing the number of clusters either.

Guess I could just combine it with PCAtools's elbow point implementation and hope for the best.
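For reference, the per-cluster quantities can be computed in a few lines of base R. The RMSD definition used here, the square root of the WCSS divided by the cluster size, is an assumption for illustration and may differ from what bluster reports.

```r
set.seed(3)
x <- matrix(rnorm(3000), ncol = 3)
km <- kmeans(x, centers = 4, nstart = 5)

# Squared distance from each row to the centroid of its assigned cluster.
sq.dist <- rowSums((x - km$centers[km$cluster, , drop = FALSE])^2)

wcss <- tapply(sq.dist, km$cluster, sum)   # per-cluster WCSS
rmsd <- sqrt(wcss / tabulate(km$cluster))  # assumed definition: sqrt(WCSS / n)

data.frame(cluster = names(wcss),
           wcss = as.numeric(wcss),
           rmsd = as.numeric(rmsd))
```

Repeating this over a range of `centers` values and summing `wcss` gives the curve that an elbow-point method would be applied to.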

Annoy algorithm in `bluster` cannot accelerate the clustering process

Hi, it seems that the Annoy algorithm in bluster does not accelerate the clustering process. uncorrected_sce is a SingleCellExperiment object with a PCA entry in reducedDimNames whose dimensions are 111455 x 50.
seurat_obj is converted from uncorrected_sce.

microbenchmark::microbenchmark(
    scran::clusterCells(
        uncorrected_sce,
        use.dimred = "PCA",
        BLUSPARAM = bluster::SNNGraphParam(
            k = 25L, type = "rank"
        )
    ),
    scran::clusterCells(
        uncorrected_sce,
        use.dimred = "PCA",
        BLUSPARAM = bluster::SNNGraphParam(
            k = 25L,
            BNPARAM = BiocNeighbors::AnnoyParam()
        )
    ),
    scran::clusterCells(
        uncorrected_sce,
        use.dimred = "PCA",
        BLUSPARAM = bluster::SNNGraphParam(
            k = 25L, type = "jaccard", cluster.fun = "louvain"
        )
    ),
    scran::clusterCells(
        uncorrected_sce,
        use.dimred = "PCA",
        BLUSPARAM = bluster::SNNGraphParam(
            k = 25L, type = "jaccard", cluster.fun = "louvain",
            BNPARAM = BiocNeighbors::AnnoyParam()
        )
    ),
    {
        seurat_nei <- FindNeighbors(
            seurat_obj,
            dims = 1:50,
            k.param = 25L
        )
        seurat_nei <- FindClusters(seurat_nei, resolution = 1)
    },
    times = 1L
)
Expressions:
  (1) scran::clusterCells(uncorrected_sce, use.dimred = "PCA", BLUSPARAM = bluster::SNNGraphParam(k = 25L, type = "rank"))
  (2) scran::clusterCells(uncorrected_sce, use.dimred = "PCA", BLUSPARAM = bluster::SNNGraphParam(k = 25L, BNPARAM = BiocNeighbors::AnnoyParam()))
  (3) scran::clusterCells(uncorrected_sce, use.dimred = "PCA", BLUSPARAM = bluster::SNNGraphParam(k = 25L, type = "jaccard", cluster.fun = "louvain"))
  (4) scran::clusterCells(uncorrected_sce, use.dimred = "PCA", BLUSPARAM = bluster::SNNGraphParam(k = 25L, type = "jaccard", cluster.fun = "louvain", BNPARAM = BiocNeighbors::AnnoyParam()))
  (5) { seurat_nei <- FindNeighbors(seurat_obj, dims = 1:50, k.param = 25L); seurat_nei <- FindClusters(seurat_nei, resolution = 1) }

 expr         min          lq        mean      median          uq         max  neval
  (1)  2519.89881  2519.89881  2519.89881  2519.89881  2519.89881  2519.89881      1
  (2)  2402.44808  2402.44808  2402.44808  2402.44808  2402.44808  2402.44808      1
  (3)    59.89610    59.89610    59.89610    59.89610    59.89610    59.89610      1
  (4)    66.01533    66.01533    66.01533    66.01533    66.01533    66.01533      1
  (5)    39.35732    39.35732    39.35732    39.35732    39.35732    39.35732      1
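In the timings above, the two runs that leave cluster.fun at its default take about 2400 to 2500, while the two Louvain runs take about 60 to 66, with or without AnnoyParam. One way to check where the time goes is to time graph construction and community detection separately. The sketch below does this on small simulated data; it assumes that bluster, BiocNeighbors and igraph are installed, that bluster::makeSNNGraph is available in the installed version, and that the default community detection is walktrap.

```r
library(bluster)
library(BiocNeighbors)
library(igraph)

set.seed(4)
x <- matrix(rnorm(20000), ncol = 20)  # 1000 simulated cells, 20 dimensions

# Step 1: neighbor search plus graph construction, exact versus approximate.
t.exact <- system.time(g.exact <- makeSNNGraph(x, k = 25, BNPARAM = KmknnParam()))
t.annoy <- system.time(g.annoy <- makeSNNGraph(x, k = 25, BNPARAM = AnnoyParam()))

# Step 2: community detection on the same graph with two algorithms.
t.walktrap <- system.time(cl.walktrap <- cluster_walktrap(g.exact))
t.louvain <- system.time(cl.louvain <- cluster_louvain(g.exact))

# Elapsed time per step.
rbind(graph.exact = t.exact, graph.annoy = t.annoy,
      walktrap = t.walktrap, louvain = t.louvain)[, "elapsed"]
```

If the community detection step dominates on the real data, a faster neighbor search cannot shorten the total by much, which would be consistent with the numbers reported here.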

Addition of the DMM algorithm

Hi, would it be possible to add the DMM (Dirichlet Multinomial Mixtures) algorithm to bluster? It's an algorithm that's commonly used in microbial ecology with metagenomic and 16S rRNA count data.
We had functions related to it over at microbiome/mia, but figured it could have its place in bluster as well. We've started discussing it in a PR in microbiome/bluster.

Do you think such a param could have its place in bluster?
