Annoy algorithm in `bluster` cannot accelerate the clustering process

Yunuuuu commented on August 31, 2024

Annoy algorithm in `bluster` cannot accelerate the clustering process

Comments (5)

LTLA commented on August 31, 2024

Works fine for me.

library(bluster)
library(BiocNeighbors)

m <- matrix(rnorm(5e6), ncol=50)
system.time(X <- clusterRows(m, BLUSPARAM = SNNGraphParam(cluster.fun="louvain", BNPARAM=AnnoyParam())))
##    user  system elapsed 
## 181.504   1.076 182.632 

system.time(X <- clusterRows(m, BLUSPARAM = SNNGraphParam(cluster.fun="louvain"))) 
# ... DNF after 10 minutes...

For smaller datasets, there may not be any speed-up as the Annoy approach writes its index to file for more general multi-threading via BPPARAM; the file I/O offsets any performance gain from approximation.

Session information

Still on the last devel cycle, but there weren't any major changes in the latest release, so it shouldn't matter.

R version 4.2.1 RC (2022-06-17 r82506)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.6 LTS

Matrix products: default
BLAS:   /home/luna/Software/R/R-4-2-branch-devel/lib/libRblas.so
LAPACK: /home/luna/Software/R/R-4-2-branch-devel/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] BiocNeighbors_1.15.1 bluster_1.7.0       

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9           lattice_0.20-45      codetools_0.2-18    
 [4] grid_4.2.1           stats4_4.2.1         magrittr_2.0.3      
 [7] cli_3.4.1            rlang_1.0.6          S4Vectors_0.35.4    
[10] Matrix_1.5-1         BiocParallel_1.31.12 igraph_1.3.5        
[13] parallel_4.2.1       compiler_4.2.1       pkgconfig_2.0.3     
[16] BiocGenerics_0.43.4  cluster_2.1.4

Yunuuuu commented on August 31, 2024

@LTLA, thanks for your reply and for developing the great single-cell toolkit in Bioconductor. Since I only used the first 50 PCs for clustering, Annoy cannot provide any speed-up. I recently compared the performance of scran and Seurat. Seurat uses the Annoy algorithm by default and uses Louvain and Jaccard to cluster cells, which I found much faster than scran with the same parameters, but I cannot get similar performance with scran. Is it possible to run faster here?

Yunuuuu commented on August 31, 2024

here is my sessionInfo:

[R]> sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.1 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so

locale:
 [1] LC_CTYPE=en_US.UTF-8      
 [2] LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=zh_CN.UTF-8      
 [8] LC_NAME=C                 
 [9] LC_ADDRESS=C              
[10] LC_TELEPHONE=C            
[11] LC_MEASUREMENT=zh_CN.UTF-8
[12] LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices
[5] utils     datasets  methods   base     

other attached packages:
 [1] here_1.0.1                 
 [2] batchelor_1.14.0           
 [3] SingleCellExperiment_1.20.0
 [4] SummarizedExperiment_1.28.0
 [5] Biobase_2.58.0             
 [6] GenomicRanges_1.50.0       
 [7] GenomeInfoDb_1.34.0        
 [8] IRanges_2.32.0             
 [9] S4Vectors_0.36.0           
[10] BiocGenerics_0.44.0        
[11] MatrixGenerics_1.10.0      
[12] matrixStats_0.62.0         

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9               
 [2] bluster_1.8.0            
 [3] compiler_4.2.1           
 [4] progressr_0.11.0         
 [5] XVector_0.38.0           
 [6] bitops_1.0-7             
 [7] BiocNeighbors_1.16.0     
 [8] tools_4.2.1              
 [9] DelayedMatrixStats_1.20.0
[10] zlibbioc_1.44.0          
[11] statmod_1.4.37           
[12] metapod_1.6.0            
[13] digest_0.6.30            
[14] jsonlite_1.8.3           
[15] lattice_0.20-45          
[16] pkgconfig_2.0.3          
[17] rlang_1.0.6              
[18] igraph_1.3.5             
[19] Matrix_1.5-1             
[20] DelayedArray_0.24.0      
[21] cli_3.4.1                
[22] parallel_4.2.1           
[23] GenomeInfoDbData_1.2.9   
[24] cluster_2.1.4            
[25] locfit_1.5-9.6           
[26] rprojroot_2.0.3          
[27] grid_4.2.1               
[28] scuttle_1.8.0            
[29] BiocParallel_1.32.0      
[30] limma_3.54.0             
[31] irlba_2.3.5.1            
[32] edgeR_3.40.0             
[33] magrittr_2.0.3           
[34] BiocSingular_1.14.0      
[35] codetools_0.2-18         
[36] sparseMatrixStats_1.10.0 
[37] beachmat_2.14.0          
[38] rsvd_1.0.5               
[39] dqrng_0.3.0              
[40] ResidualMatrix_1.8.0     
[41] ScaledMatrix_1.6.0       
[42] RCurl_1.98-1.9           
[43] scran_1.26.0

LTLA commented on August 31, 2024

FWIW the vast majority of time is spent inside igraph's cluster_* functions. With 100,000 random cells, the nearest-neighbor search and graph construction take about 30 seconds; the rest of it (> 4 minutes) is spent inside cluster_louvain.
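
One way to check where the time goes on your own data is to time the two stages separately. The following is only a rough sketch, not the code behind the numbers above: it assumes `makeSNNGraph()` from bluster and `cluster_louvain()` from igraph, and it uses a smaller random matrix (10,000 rows, an arbitrary choice) so that it finishes quickly. The variable names are made up for illustration.

```r
library(bluster)
library(BiocNeighbors)
library(igraph)

set.seed(100)
# Hypothetical test data: 10,000 "cells" by 50 "PCs" of random noise.
m <- matrix(rnorm(10000 * 50), ncol = 50)

# Stage 1: approximate nearest-neighbor search plus SNN graph construction.
graph.time <- system.time(
    g <- makeSNNGraph(m, k = 10, BNPARAM = AnnoyParam())
)

# Stage 2: community detection on the finished graph (serial, inside igraph).
cluster.time <- system.time(
    comm <- cluster_louvain(g)
)

# Elapsed seconds for each stage, side by side.
c(graph = graph.time[["elapsed"]], louvain = cluster.time[["elapsed"]])
```

If the second number dominates, switching the neighbor search algorithm is unlikely to change the total much.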

You can speed up the graph construction by parallelizing it via BPPARAM, but it'll probably just save you a few seconds or so. The actual clustering itself is serial so it doesn't benefit from parallelization, at least not in igraph's C implementation.
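
As a rough illustration of the BPPARAM suggestion, something like the following should work, assuming that extra arguments to `SNNGraphParam()` are forwarded to the graph construction step in the same way as `BNPARAM` is in the earlier example. The choice of two workers and `MulticoreParam` is arbitrary (forking is not available on Windows, where `SnowParam` would be the usual substitute), and newer bluster releases may prefer a different argument for threading, so check `?makeSNNGraph` for your installed version.

```r
library(bluster)
library(BiocNeighbors)
library(BiocParallel)

set.seed(100)
# Hypothetical test data: 10,000 "cells" by 50 "PCs" of random noise.
m <- matrix(rnorm(10000 * 50), ncol = 50)

# Assumption: BPPARAM is passed through to the neighbor search and graph
# construction; the Louvain step itself still runs serially inside igraph.
clusters <- clusterRows(m, BLUSPARAM = SNNGraphParam(
    cluster.fun = "louvain",
    BNPARAM = AnnoyParam(),
    BPPARAM = MulticoreParam(2)
))

# One cluster label per row of m.
table(clusters)
```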

IIRC Seurat had their own implementation of the Louvain algorithm. I don't know whether or not this is of the same quality as igraph's implementation, but if it's noticeably faster, they may be taking some short-cuts that igraph does not.

Yunuuuu commented on August 31, 2024

Thanks for your detailed explanation. I'll stick with the Bioconductor single-cell toolkit instead of pursuing performance blindly.

So from my side it would be OK to close the issue now. I also found the same explanation, that we won't get any benefit from Annoy for small datasets, in http://bioconductor.org/books/3.15/OSCA.advanced/dealing-with-big-data.html#fast-approximations. I'll read this book again.

Thanks for your help @LTLA
