lutzhamel / popsom

R package for self-organizing maps

License: GNU General Public License v3.0

R 91.29% Fortran 7.67% C 1.04%

popsom's Issues

Topology of map

Hi, Thanks for the package.

Can you please let me know if the topology of the popsom map is toroidal or planar? Also, is the grid lattice rectangular or hexagonal?

Thanks!

Performance claims cannot be confirmed (using simple experiments)

I have difficulties confirming the performance claims (see the experiments below). However, it may well be that the experiments performed are oversimplified and/or poorly specified (they are mainly based on default settings).
It would be very helpful if you could provide instructions and examples that can be used to test the performance. I also suggest including an example that illustrates the performance improvements in the software paper.

library(popsom)
library(som)
library(kohonen)
library(MASS)
library(microbenchmark)
library(ggplot2)

# method wrappers

pop_wrp <- function(dat, ...) popsom::map(as.data.frame(dat), ...)
som_wrp <- function(dat, ...) som::som(dat, ...)
koh_wrp <- function(dat, ...) kohonen::som(as.matrix(dat), ...)

# data sets

# iris
data(iris)
df_iris <- subset(iris, select = -Species)

# wines from package kohonen
data(wines)
df_wines <- scale(wines)

# synthetic data with three clusters
p <- 10
n <- 500
siglarg <- diag(p)  # p x p identity covariance matrix
means <- c(0, -50, 50)

clusts <- lapply(means, function(mu) mvrnorm(n = n, mu = rep(mu, p), Sigma = siglarg))
df_sim <- do.call(rbind, clusts)


datsets <- list(df_iris, df_wines, df_sim)
bmr <- lapply(datsets, 
              function(dat) microbenchmark(pop_wrp(dat, train = 1000),
                                           som_wrp(dat, xdim = 10, ydim = 5), # no default values for xdim and ydim; set to popsom defaults
                                           koh_wrp(dat)))

ggplot2::autoplot(bmr[[1]]) + ggplot2::ggtitle("Iris data")
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.

ggplot2::autoplot(bmr[[2]]) + ggplot2::ggtitle("Wine data")
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.

ggplot2::autoplot(bmr[[3]]) + ggplot2::ggtitle("Simulated data")
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.

sessionInfo()
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Linux Mint 19.2
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] ggplot2_3.3.5        microbenchmark_1.4-7 MASS_7.3-54         
#> [4] kohonen_3.0.10       som_0.3-5.1          popsom_5.2          
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.7        compiler_4.1.0    pillar_1.6.1      highr_0.9        
#>  [5] viridis_0.6.1     tools_4.1.0       dotCall64_1.0-1   digest_0.6.27    
#>  [9] viridisLite_0.4.0 evaluate_0.14     lifecycle_1.0.0   tibble_3.1.2     
#> [13] gtable_0.3.0      pkgconfig_2.0.3   rlang_0.4.11      reprex_2.0.1     
#> [17] cli_3.0.0         rstudioapi_0.13   yaml_2.2.1        spam_2.7-0       
#> [21] xfun_0.24         gridExtra_2.3     withr_2.4.2       stringr_1.4.0    
#> [25] dplyr_1.0.6       knitr_1.33        maps_3.3.0        fields_12.5      
#> [29] generics_0.1.0    fs_1.5.0          vctrs_0.3.8       tidyselect_1.1.1 
#> [33] grid_4.1.0        glue_1.4.2        R6_2.5.0          hash_2.2.6.1     
#> [37] fansi_0.5.0       rmarkdown_2.8     farver_2.1.0      purrr_0.3.4      
#> [41] magrittr_2.0.1    scales_1.1.1      htmltools_0.5.1.1 ellipsis_0.3.2   
#> [45] colorspace_2.0-2  utf8_1.2.1        stringi_1.6.2     munsell_0.5.0    
#> [49] crayon_1.4.1

Created on 2021-08-09 by the reprex package (v2.0.1)

Provide a high-level description of SOM

The summary does not contain a high-level description of the package functionality other than that it provides an implementation of self-organising maps and that a self-organising map is an artificial neural network designed for unsupervised learning. This is insufficient information to get a high-level idea of what SOMs are and what they are commonly used for.

Statement of need is missing

The authors' statement of need reads more like a results section. The only statement of need in this paragraph is: "Training a self-organizing map is time consuming." The authors should expand on why training SOMs is time consuming: In which application domains? What is the dimensionality of the datasets that result in slow execution times?

Include a brief explanation of the example in the software paper

The documentation and the included examples already make the use of popsom very easy. To make the package even more accessible, I suggest including a brief explanation of the summary output and the starburst plot in section Usage of the software paper.

__gfortran_os_error_at Error installing popsom

I get this error when trying to install the popsom package on macOS Monterey 12.2.1:

** testing if installed package can be loaded from temporary location
Error: package or namespace load failed for ‘popsom’ in dyn.load(file, DLLpath = DLLpath, ...):
 unable to load shared object '/Library/Frameworks/R.framework/Versions/4.1/Resources/library/00LOCK-popsom/00new/popsom/libs/popsom.so':
  dlopen(/Library/Frameworks/R.framework/Versions/4.1/Resources/library/00LOCK-popsom/00new/popsom/libs/popsom.so, 0x0006): symbol not found in flat namespace '__gfortran_os_error_at'
Error: loading failed
Execution halted

I wonder if the correct Fortran library is being used? One search suggested the error can be caused by using the wrong version of the Fortran library. It appears that libgfortran.5.dylib is being used.

Version being installed: popsom_6.0.tar.gz

My sessionInfo():

> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.2.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] conos_1.4.5     leidenAlg_1.0.1 igraph_1.2.11   Matrix_1.4-0   

loaded via a namespace (and not attached):
 [1] circlize_0.4.14       shape_1.4.6           GetoptLong_1.0.5      tidyselect_1.1.2      purrr_0.3.4           lattice_0.20-45      
 [7] colorspace_2.0-3      vctrs_0.3.8           generics_0.1.2        stats4_4.1.2          sccore_1.0.1          utf8_1.2.2           
[13] rlang_1.0.1           pillar_1.7.0          glue_1.6.2            DBI_1.1.2             BiocGenerics_0.40.0   RColorBrewer_1.1-2   
[19] matrixStats_0.61.0    foreach_1.5.2         lifecycle_1.0.1       munsell_0.5.0         Matrix.utils_0.9.8    gtable_0.3.0         
[25] GlobalOptions_0.1.2   codetools_0.2-18      ComplexHeatmap_2.10.0 IRanges_2.28.0        doParallel_1.0.17     parallel_4.1.2       
[31] fansi_1.0.2           Rcpp_1.0.8            BiocManager_1.30.16   scales_1.1.1          grr_0.9.5             S4Vectors_0.32.3     
[37] gridExtra_2.3         rjson_0.2.21          ggplot2_3.3.5         png_0.1-7             digest_0.6.29         Rtsne_0.15           
[43] dplyr_1.0.8           ggrepel_0.9.1         grid_4.1.2            clue_0.3-60           cli_3.2.0             tools_4.1.2          
[49] magrittr_2.0.2        tibble_3.1.6          cluster_2.1.2         crayon_1.5.0          pkgconfig_2.0.3       ellipsis_0.3.2       
[55] assertthat_0.2.1      iterators_1.0.14      R6_2.5.1              compiler_4.1.2

Bug in function `map`

Running map sometimes triggers an error, see the following MREs.

(Some) error conditions this was observed for

smaller numbers of training iterations (see microbenchmarks)
default values for xdim and ydim
depends on random initialization

library(popsom)
#> 
#> Attaching package: 'popsom'
#> The following objects are masked from 'package:stats':
#> 
#>     fitted, predict
#> The following object is masked from 'package:base':
#> 
#>     summary

data(iris)
df <- subset(iris, select = -Species)
labels <- subset(iris, select = Species)

# triggers error
m <- map(df, labels, train = 100, seed = 10) 
#> Error in map$unique.centroids[[cluster.ix]]: subscript out of bounds

m <- map(df, labels, train = 10, seed = 1)
#> Error in map$unique.centroids[[cluster.ix]]: subscript out of bounds

# does not trigger error
m <- map(df, labels, train = 101, seed = 10)
m <- map(df, labels, train = 100, seed = 1) 
m <- map(df, labels, xdim = 15, ydim = 10, train = 100, seed = 10)
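
Because the failure depends on the random initialization, scanning a range of seeds may help to narrow down when it occurs. The following is only a minimal sketch: `find_failing_seeds` is a hypothetical helper (not part of popsom) that accepts any fitting function with a `seed` argument and records which seeds raise an error.

```r
# Hypothetical helper (not part of popsom): call `fit_fun` once per seed and
# record the error message, or NA when the call succeeds.
find_failing_seeds <- function(fit_fun, seeds, ...) {
  msgs <- vapply(seeds, function(s) {
    tryCatch({
      fit_fun(..., seed = s)
      NA_character_
    }, error = function(e) conditionMessage(e))
  }, character(1))
  data.frame(seed = seeds, failed = !is.na(msgs), message = msgs,
             stringsAsFactors = FALSE)
}

# Usage with the data from the MRE (runs only if popsom is installed).
if (requireNamespace("popsom", quietly = TRUE)) {
  df <- subset(iris, select = -Species)
  labels <- subset(iris, select = Species)
  res <- find_failing_seeds(popsom::map, 1:20, df, labels, train = 100)
  print(res[res$failed, ])
}
```

The returned data frame lists every seed, so the failing and passing seeds can be compared directly for a given value of `train`.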

Microbenchmarks

microbenchmark::microbenchmark(map(df, labels, train = 100))
#> Error in map$unique.centroids[[cluster.ix]]: subscript out of bounds
microbenchmark::microbenchmark(map(df, labels, train = 451))
#> Error in map$unique.centroids[[cluster.ix]]: subscript out of bounds
microbenchmark::microbenchmark(map(df, labels, train = 1000))
#> Unit: milliseconds
#>                           expr      min       lq     mean   median      uq
#>  map(df, labels, train = 1000) 299.2242 315.6117 331.4985 323.6654 334.924
#>       max neval
#>  463.2775   100

Session info

sessionInfo()
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Linux Mint 19.2
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] popsom_5.2
#> 
#> loaded via a namespace (and not attached):
#>  [1] compiler_4.1.0       pillar_1.6.1         highr_0.9           
#>  [4] viridis_0.6.1        tools_4.1.0          dotCall64_1.0-1     
#>  [7] digest_0.6.27        viridisLite_0.4.0    evaluate_0.14       
#> [10] lifecycle_1.0.0      tibble_3.1.2         gtable_0.3.0        
#> [13] pkgconfig_2.0.3      rlang_0.4.11         reprex_2.0.1        
#> [16] cli_3.0.0            rstudioapi_0.13      microbenchmark_1.4-7
#> [19] yaml_2.2.1           spam_2.7-0           xfun_0.24           
#> [22] gridExtra_2.3        withr_2.4.2          stringr_1.4.0       
#> [25] dplyr_1.0.6          knitr_1.33           maps_3.3.0          
#> [28] fields_12.5          generics_0.1.0       fs_1.5.0            
#> [31] vctrs_0.3.8          tidyselect_1.1.1     grid_4.1.0          
#> [34] glue_1.4.2           R6_2.5.0             hash_2.2.6.1        
#> [37] fansi_0.5.0          rmarkdown_2.8        purrr_0.3.4         
#> [40] ggplot2_3.3.5        magrittr_2.0.1       scales_1.1.1        
#> [43] htmltools_0.5.1.1    ellipsis_0.3.2       colorspace_2.0-2    
#> [46] utf8_1.2.1           stringi_1.6.2        munsell_0.5.0       
#> [49] crayon_1.4.1

Created on 2021-08-09 by the reprex package (v2.0.1)

Reduce self-citations and refer to relevant work

The references in the manuscript fall into one of the following categories:

author's own publications (Hamel 2018, Hamel 2016, Hamel 2011, Hamel 2021, Tatoian 2018)
publications on which the author's work is directly based (Kohonen 2001, Ultsch 1990)
publications to which the author's work is directly compared (Yan 2016, Wehrens 2018)
references to applications of SOM in the field (Liakos 2018, Mathys 2019, Matić 2018, Miller 1996)

Please update the manuscript to address the following issues:

Reduce the number of self-citations to <25%. Currently 38% of citations are self-citations.
The citations which are meant to give a broad overview of the applications of SOM in the field are very superficial: "Self-organising maps have been applied in virtually every scientific discipline where some sort of data exploration or analysis is necessary." This is a very broad statement which is hard to verify. Please elaborate a bit on common applications of SOM. For example, FlowSOM uses SOMs to analyse Flow Cytometry data to gain insight into the expression profiles of smaller cell populations which might otherwise be missed (Van Gassen et al., Cytometry A, 2015).

Uninformative error message if learning rate is not adequately specified

The documentation of map states that alpha, "the learning rate, should be a positive non-zero real number."
However, a non-zero positive learning rate can result in an error with a rather uninformative error message (see MRE below).
I suggest providing a more meaningful error message, for example (if this is indeed the cause) one stating that the chosen learning rate is too large.

library(popsom)
data(iris)
df <- subset(iris, select = -Species)
labels <- subset(iris, select = Species)

# triggers error
m <- popsom::map(df, labels, xdim = 15, ydim = 10, train = 10000, alpha = 2.3, seed = 42)
#> Error in ks.test(map.df[[i]], data.df[[i]]): not enough 'x' data

# does not trigger error
m <- popsom::map(df, labels, xdim = 15, ydim = 10, train = 10000, alpha = 2.2, seed = 42)

sessionInfo()
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Linux Mint 19.2
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] popsom_5.2
#> 
#> loaded via a namespace (and not attached):
#>  [1] compiler_4.1.0       pillar_1.6.1         highr_0.9           
#>  [4] viridis_0.6.1        tools_4.1.0          dotCall64_1.0-1     
#>  [7] digest_0.6.27        viridisLite_0.4.0    evaluate_0.14       
#> [10] lifecycle_1.0.0      tibble_3.1.2         gtable_0.3.0        
#> [13] pkgconfig_2.0.3      rlang_0.4.11         reprex_2.0.1        
#> [16] cli_3.0.0            rstudioapi_0.13      microbenchmark_1.4-7
#> [19] yaml_2.2.1           spam_2.7-0           xfun_0.24           
#> [22] gridExtra_2.3        withr_2.4.2          stringr_1.4.0       
#> [25] dplyr_1.0.6          knitr_1.33           maps_3.3.0          
#> [28] fields_12.5          generics_0.1.0       fs_1.5.0            
#> [31] vctrs_0.3.8          tidyselect_1.1.1     grid_4.1.0          
#> [34] glue_1.4.2           R6_2.5.0             hash_2.2.6.1        
#> [37] fansi_0.5.0          rmarkdown_2.8        purrr_0.3.4         
#> [40] ggplot2_3.3.5        magrittr_2.0.1       scales_1.1.1        
#> [43] htmltools_0.5.1.1    ellipsis_0.3.2       colorspace_2.0-2    
#> [46] utf8_1.2.1           stringi_1.6.2        munsell_0.5.0       
#> [49] crayon_1.4.1

Created on 2021-08-09 by the reprex package (v2.0.1)
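
One possible way to make the message in the MRE more helpful is sketched below. Both functions are hypothetical and not popsom code: `check_alpha` enforces only what the documentation states (a positive real number), and `with_context` re-signals a low-level error together with a hint. Whether a large learning rate really is the cause is an assumption here; the hint text would need to be adapted once the cause is known.

```r
# Hypothetical argument check (not popsom's actual code). It enforces only
# the documented constraint: alpha is a single positive real number.
check_alpha <- function(alpha) {
  if (!is.numeric(alpha) || length(alpha) != 1L || is.na(alpha) ||
      !is.finite(alpha) || alpha <= 0) {
    stop("'alpha' must be a single positive, finite number; got: ",
         paste(format(alpha), collapse = ", "), call. = FALSE)
  }
  invisible(alpha)
}

# Hypothetical wrapper: re-signal a low-level error with a hint for the user.
with_context <- function(expr, hint) {
  tryCatch(expr, error = function(e) {
    stop(hint, "\n  Original error: ", conditionMessage(e), call. = FALSE)
  })
}

# Demonstration with the same low-level error as in the MRE.
res <- try(
  with_context(stats::ks.test(numeric(0), rnorm(10)),
               hint = "The trained map could not be evaluated; 'alpha' may be too large."),
  silent = TRUE
)
cat(res)
```

Keeping the original message in the re-signalled error preserves the information needed for debugging while giving the user something actionable.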

implement summary S3 function

Implement a summary.map for the generic summary. TODO: what should this function display?
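
As a starting point for the discussion, here is a minimal sketch of the usual S3 pattern: `summary.map` computes and returns an object, and a separate `print.summary.map` displays it. The toy constructor and the fields `xdim`, `ydim` and `train` are assumptions for illustration only; what a real popsom map should report is exactly the open question above.

```r
# Toy constructor standing in for a fitted map; the fields are assumptions
# made for this illustration only.
new_toy_map <- function(xdim, ydim, train) {
  structure(list(xdim = xdim, ydim = ydim, train = train), class = "map")
}

# S3 method for the generic summary(): compute the summary, do not print.
summary.map <- function(object, ...) {
  out <- list(dims = c(x = object$xdim, y = object$ydim),
              neurons = object$xdim * object$ydim,
              train = object$train)
  class(out) <- "summary.map"
  out
}

# Separate print method, so that summary() returns a reusable object.
print.summary.map <- function(x, ...) {
  cat("Self-organizing map\n")
  cat("  grid:       ", x$dims[["x"]], "x", x$dims[["y"]], "\n")
  cat("  neurons:    ", x$neurons, "\n")
  cat("  iterations: ", x$train, "\n")
  invisible(x)
}

s <- summary(new_toy_map(xdim = 15, ydim = 10, train = 1000))
print(s)
```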

citations in the description field of the DESCRIPTION file

For future submissions insert references about the methods in the Description field in the form Authors (year) doi:10..... or arXiv:.....?

Include Community guidelines and Installation instructions in the Readme

In my opinion, it would be good practice to include installation instructions in the Readme, for example:

Installation

Install the latest release from CRAN:

install.packages("popsom")

Moreover, I think a note on how to contribute/report bugs is missing.

Split up package code to improve readability

The package consists of a very large R file which makes navigation of the functions in the package difficult. Following common R practices, I suggest splitting this file up into multiple R files, ideally one file per function.

Please consider generating the man/* files using roxygen2. Having the documentation close to the functions allows developers to easily interpret the functionality of each function.
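
To illustrate what this could look like, here is a sketch of a single file (for example `R/grid_size.R`) holding one function together with its roxygen2 comments. `grid_size` is a hypothetical helper invented for this example and is not part of popsom.

```r
#' Count the neurons of a rectangular map grid
#'
#' Hypothetical helper used only to illustrate roxygen2 comments.
#'
#' @param xdim Number of grid columns (at least 1).
#' @param ydim Number of grid rows (at least 1).
#' @return The number of neurons, `xdim * ydim`.
#' @examples
#' grid_size(15, 10)
#' @export
grid_size <- function(xdim, ydim) {
  stopifnot(xdim >= 1, ydim >= 1)
  xdim * ydim
}
```

Running `roxygen2::roxygenise()` in the package directory would then generate the corresponding `man/grid_size.Rd` file and the export entry in `NAMESPACE`.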

S3 generics get masked

Functions popsom::summary, popsom::predict and popsom::fitted mask S3 generics.

This is very inconvenient as it breaks method dispatch for objects of other classes when popsom is loaded.
Appropriate S3 methods should therefore be implemented.

With this in mind, I also suggest moving the starburst functionality into an S3 plot method and adding an S3 print method.
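
The difference between masking a generic and adding a method to it can be shown in a small sketch. The environment `pkg` merely mimics an attached package, and the field name `fitted.obs` and the toy class are assumptions for illustration, not popsom internals.

```r
fit <- lm(Sepal.Length ~ Sepal.Width, data = iris)

# Problem: a plain function named like a generic. Evaluating in `pkg`
# mimics what happens once a package defining such a function is attached.
pkg <- new.env()
pkg$fitted <- function(map) map$fitted.obs   # hypothetical field name
masked <- with(pkg, fitted(fit))             # finds pkg$fitted, not stats::fitted
is.null(masked)                              # the lm object is silently mishandled

# Fix: add methods to the existing generics instead of redefining them.
fitted.map <- function(object, ...) object$fitted.obs
print.map <- function(x, ...) {
  cat("<map with", length(x$fitted.obs), "fitted observations>\n")
  invisible(x)
}

toy <- structure(list(fitted.obs = c(1, 2, 3)), class = "map")
fitted(toy)           # dispatches to fitted.map
print(toy)            # dispatches to print.map
length(fitted(fit))   # stats::fitted still works for lm objects
```

Inside a package, such methods additionally need to be registered in `NAMESPACE`, for example with `S3method(fitted, map)`, which roxygen2 generates from an `@export` tag on the method.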

fix the copy content on the map man pages

fix the copy content on the map man pages:

function call def too long
references need to be fixed.

Elaborate on performance claims

The authors compare the performance of their method to the som and kohonen packages.

Authors claim measured speedups of 60× in comparison to som (Yan, 2016). However, according to the authors' own time measurements, this speedup is closer to 15×. Can you provide more insight into how
Can the authors also provide insight into how the execution time scales as a function of input dataset size (number of observations and number of features)?
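
A scalability experiment of this kind could be set up roughly as follows. This is a sketch under stated assumptions, not the authors' benchmark: `time_scaling` is a hypothetical harness, the data are standard normal noise, and the sizes in the usage example are arbitrary.

```r
# Hypothetical harness (not from the paper): time `fit_fun` on random data
# of increasing size and return one row per (n, p) combination.
time_scaling <- function(fit_fun, n_obs, n_feat, reps = 3) {
  grid <- expand.grid(n = n_obs, p = n_feat)
  grid$seconds <- mapply(function(n, p) {
    dat <- as.data.frame(matrix(rnorm(n * p), nrow = n, ncol = p))
    times <- replicate(reps, tryCatch(
      system.time(fit_fun(dat))[["elapsed"]],
      error = function(e) NA_real_   # a failed fit is reported as NA
    ))
    median(times)
  }, grid$n, grid$p)
  grid
}

# Usage (runs only if popsom is installed; sizes are illustrative).
if (requireNamespace("popsom", quietly = TRUE)) {
  res <- time_scaling(function(d) popsom::map(d, train = 1000),
                      n_obs = c(100, 1000), n_feat = c(4, 16))
  print(res)
}
```

Passing wrappers for som and kohonen as `fit_fun` would give comparable timing curves for the other two packages.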

popsom masking fitted and predict from the stats package

Figure out why this masking is happening. I assume that fitted and predict are implementations of the generic functions fitted and predict.

Improve quality of writing

Paragraphs should consist of full sentences. Partial sentences are used when referring to pieces of code or when citing articles.

Example of incomplete sentence due to code block:

This is easily verified with a scatter plot matrix of the iris dataset using
-code block-
and shown in Figure 2.

Example of incomplete sentence due to citation:

A number of R-packages exist that implement self-organizing maps including (Wehrens & Kruisselbrink, 2018) and (Yan, 2016).

If you omit everything between parentheses, the sentence that remains is incomplete:

A number of R-packages exist that implement self-organizing maps including and.
