mhahsler / seriation Goto Github PK

View Code? Open in Web Editor NEW

74.0 7.0 16.0 28.58 MB

Infrastructure for Ordering using Seriation - R Package

License: GNU General Public License v3.0

R 82.35% C 7.93% Fortran 6.18% TeX 3.54%

seriation cran r combinatorial-optimization ordination

seriation's Introduction

R package seriation - Infrastructure for Ordering Objects Using Seriation

Introduction

Seriation arranges a set of objects into a linear order given available data with the goal of revealing structural information. This package provides the infrastructure for ordering objects with an implementation of many seriation/ordination techniques to reorder data matrices, dissimilarity matrices, correlation matrices, and dendrograms (see below for a complete list). The package provides several visualizations (grid and ggplot2) to reveal structural information, including permuted image plots, reordered heatmaps, Bertin plots, clustering visualizations like dissimilarity plots, and visual assessment of cluster tendency plots (VAT and iVAT).

Here are some quick guides on applications of seriation:

Implemented seriation methods and criteria:

The following R packages use seriation: adepro, arulesViz, baizer, ChemoSpec, ClusteredMutations, corrgram, corrplot, corrr, dendextend, DendSer, dendsort, disclapmix, flexclust, ggraph, heatmaply, MEDseq, ockc, protti, qlcVisualize, RMaCzek, SFS, tidygraph, treeheatr, vcdExtra

To cite package ‘seriation’ in publications use:

Hahsler M, Hornik K, Buchta C (2008). “Getting things in order: An introduction to the R package seriation.” Journal of Statistical Software, 25(3), 1-34. ISSN 1548-7660, doi:10.18637/jss.v025.i03 https://doi.org/10.18637/jss.v025.i03.

@Article{,
  title = {Getting things in order:  An introduction to the R package seriation},
  author = {Michael Hahsler and Kurt Hornik and Christian Buchta},
  year = {2008},
  journal = {Journal of Statistical Software},
  volume = {25},
  number = {3},
  pages = {1--34},
  doi = {10.18637/jss.v025.i03},
  month = {March},
  issn = {1548-7660},
}

Available seriation methods to reorder dissimilarity data

Seriation methods for dissimilarity data reorder the set of objects in the data. The methods fall into several groups based on the criteria they try to optimize, constraints (like dendrograms), and the algorithmic approach.

Dendrogram leaf order

These methods create a dendrogram using hierarchical clustering and then derive the seriation order from the leaf order in the dendrogram. Leaf reordering may be applied.

DendSer - Dendrogram seriation heuristic to optimize various criteria
GW - Hierarchical clustering reordered by the Gruvaeus and Wainer heuristic
HC - Hierarchical clustering (single link, avg. link, complete link)
OLO - Hierarchical clustering with optimal leaf ordering

Dimensionality reduction

Find a seriation order by reducing the dimensionality to 1 dimension. This is typically done by minimizing a stress measure or the reconstruction error.

MDS - classical metric multidimensional scaling
MDS_angle - order by the angular order in the 2D MDS projection space split by the larges gap
isoMDS - 1D Krusakl’s non-metric multidimensional scaling
isomap - 1D isometric feature mapping ordination
monoMDS - order along 1D global and local non-metric multidimensional scaling using monotone regression (NMDS)
metaMDS - 1D non-metric multidimensional scaling (NMDS) with stable solution from random starts
Sammon - Order along the 1D Sammon’s non-linear mapping
smacof - 1D MDS using majorization (ratio MDS, interval MDS, ordinal MDS)
TSNE - Order along the 1D t-distributed stochastic neighbor embedding (t-SNE)
UMAP - Order along the 1D embedding produced by uniform manifold approximation and projection

Optimization

These methods try to optimize a seriation criterion directly, typically using a heuristic approach.

ARSA - optimize the linear seriation critreion using simulated annealing
Branch-and-bound to minimize the unweighted/weighted column gradient
GA - Genetic algorithm with warm start to optimize any seriation criteria
GSA - General simulated annealing to optimize any seriation criteria
SGD - stochastic gradient descent to find a local optimum given an initial order and a seriation criterion.
QAP - Quadratic assignment problem heuristic (optimizes 2-SUM, linear seriation, inertia, banded anti-Robinson form)
Spectral seriation to optimize the 2-SUM criterion (unnormalized, normalized)
TSP - Traveling salesperson solver to minimize the Hamiltonian path length

Other Methods

Identity permutation
OPTICS - Order of ordering points to identify the clustering structure
R2E - Rank-two ellipse seriation
Random permutation
Reverse order
SPIN - Sorting points into neighborhoods (neighborhood algorithm, side-to-site algorithm)
VAT - Order of the visual assessment of clustering tendency

A detailed comparison of the most popular methods is available in the paper An experimental comparison of seriation methods for one-mode two-way data. (read the preprint).

Available seriation methods to reorder data matrices, count tables, and data.frames

For matrices, rows and columns are reordered.

Seriating rows and columns simultaneously

Row and column order influence each other.

BEA - Bond Energy Algorithm to maximize the measure of effectiveness (ME)
BEA_TSP - TSP to optimize the measure of effectiveness
CA - calculates a correspondence analysis of a matrix of frequencies (count table) and reorders according to the scores on a correspondence analysis dimension

Seriating rows and columns separately using dissimilarities

Heatmap - reorders rows and columns independently by calculating row/column distances and then applying a seriation method for dissimilarities (see above)

Seriate rows in a data matrix

These methods need access to the data matrix instead of dissimilarities to reorder objects (rows). The same approach can be applied to columns.

PCA_angle - order by the angular order in the 2D PCA projection space split by the larges gap
LLE reorder along a 1D locally linear embedding
Means - reorders using row means
PCA - orders along the first principal component
TSNE - Order along the 1D t-distributed stochastic neighbor embedding (t-SNE)
UMAP - Order along the 1D embedding produced by uniform manifold approximation and projection

Other methods

AOE - order by the angular order of the first two eigenvectors for correlation matrices.
Identity permutation
Random permutation
Reverse order

Installation

Stable CRAN version: Install from within R with

install.packages("seriation")

Current development version: Install from r-universe.

install.packages("seriation",
    repos = c("https://mhahsler.r-universe.dev". "https://cloud.r-project.org/"))

Usage

The used example dataset contains the joint probability of disagreement between Supreme Court Judges from 1995 to 2002. The goal is to reveal structural information in this data. We load the library, read the data, convert the data to a distance matrix, and then use the default seriation method to reorder the objects.

library(seriation)
data("SupremeCourt")

d <- as.dist(SupremeCourt)
d

##           Breyer Ginsburg Kennedy OConnor Rehnquist Scalia Souter Stevens
## Ginsburg   0.120                                                         
## Kennedy    0.250    0.267                                                
## OConnor    0.209    0.252   0.156                                        
## Rehnquist  0.299    0.308   0.122   0.162                                
## Scalia     0.353    0.370   0.188   0.207     0.143                      
## Souter     0.118    0.096   0.248   0.220     0.293  0.338               
## Stevens    0.162    0.145   0.327   0.329     0.402  0.438  0.169        
## Thomas     0.359    0.368   0.177   0.205     0.137  0.066  0.331   0.436

order <- seriate(d)
order

## object of class 'ser_permutation', 'list'
## contains permutation vectors for 1-mode data
## 
##   vector length seriation method
## 1             9         Spectral

Here is the resulting permutation vector.

get_order(order)

##    Scalia    Thomas Rehnquist   Kennedy   OConnor    Souter    Breyer  Ginsburg 
##         6         9         5         3         4         7         1         2 
##   Stevens 
##         8

Next, we visualize the original and permuted distance matrix.

pimage(d, main = "Judges (original alphabetical order)")
pimage(d, order, main = "Judges (reordered by seriation)")

Darker squares around the main diagonal indicate groups of similar objects. After seriation, two groups are visible.

We can compare the available seriation criteria. Seriation improves all measures. Note that some measures are merit measures while others represent cost. See the manual page for details.

rbind(
 alphabetical = criterion(d),
 seriated = criterion(d, order)
)

##              2SUM AR_deviations AR_events BAR Gradient_raw Gradient_weighted
## alphabetical  872        10.304        80 1.8            8              0.54
## seriated      811         0.064         5 1.1          158             19.76
##              Inertia Lazy_path_length Least_squares LS MDS_stress  ME
## alphabetical     267              6.9           967 99       0.62  99
## seriated         364              4.6           942 86       0.17 101
##              Moore_stress Neumann_stress Path_length RGAR   Rho
## alphabetical          7.0            3.9         1.8 0.48 0.028
## seriated              2.5            1.3         1.1 0.03 0.913

Some seriation methods also return a linear configuration where more similar objects are located closer to each other.

get_config(order)

##    Breyer  Ginsburg   Kennedy   OConnor Rehnquist    Scalia    Souter   Stevens 
##      0.24      0.28     -0.15     -0.11     -0.27     -0.42      0.21      0.61 
##    Thomas 
##     -0.41

plot_config(order)

We can see a clear divide between the two groups in the configuration.

References

Michael Hahsler, Kurt Hornik and Christian Buchta, Getting Things in Order: An Introduction to the R Package seriation, Journal of Statistical Software, 25(3), 2008. DOI: 10.18637/jss.v025.i03
Michael Hahsler. An experimental comparison of seriation methods for one-mode two-way data. European Journal of Operational Research, 257:133-143, 2017. DOI: 10.1016/j.ejor.2016.08.066 (read the preprint)
Hahsler, M. and Hornik, K. (2011): Dissimilarity plots: A visual exploration tool for partitional clustering. Journal of Computational and Graphical Statistics, 10(2):335–354. doi:10.1198/jcgs.2010.09139 (read the preprint; code examples)
Reference manual for package seriation.

seriation's People

Contributors

Stargazers

Watchers

Forkers

irenge u-u-h cerebis nemochina2008 behnido laixn libo70 dirkseidensticker yusufaltinok kassambara clinicopath dhtc pantheracorp david-barnett dawnsong xinming-dai

seriation's Issues

Feature request for sparse matrix

I would like to see two new seriate methods to permute a sparse matrix into doubly-bordered block-diagonal (DBBD) form and singly-bordered block-diagonal (SBBD) form.
Permuting Sparse Rectangular Matrices into Block-Diagonal Form (2002)
Cevdet Aykanat , Ali Pinar , Ümit V. Catalyürek

fill color in bertinplot has not effect

In the code:

bertinplot(fmat, fser,  options=list(panel=panel.squares, spacing=0,
                                                    gp_panels=gpar(col="green", fill="red")))

The col setting results in green lines, but the fill option has no effect (i.e., the fill color remains black)

Looking at the code, the variable hl does not appears to get its value from the highlight variable.

register_*() functions no longer exported

It seems the different variants of register_*() are no longer being exported (except for register_GA()). However, the documentation still links to its use. Have they been removed in error, or are they no longer needed?

tidygraph is being flagged by CRAN because it relies on register_DendSer() so if it has been removed in error I hope it can be fixed quickly so I don't have to momentarily remove seriation as a dependency

Coordinate of sample in Hamiltonian path

Hi everyone,

First, I would like to say that your package is really nice, I'm a biologist who try to understand how statistic and bio-informatics works together so sorry if I ask some basic questions.
I have a time course analysis of RNAsequencing from naive toward differentiated cells, and I would like to class my sample by the differentiation path. I have 8 time-points with 3 replicates by time point. I sorted a count_matrix for all the genes, and extract the matrix distance for them:

df<-dist(t(Count_matrix)
I choose the "OLO" method to order my sample :
res <- dissplot(d, method = "OLO",
options = list(main = "Seriation OLO method"))
I understand that the Hamiltonian path is the minimum distance path that can relies all my sample. So I would like to extract the Hamiltonian length:
criterion(d, method = "Path_length")
Path_length
4635.551

And know, I would like to take the position of my sample on this path but I cannot find the way to do that.
Is there a possibility to do that?

Thank a lot,
Nicolas

Feature suggestion for seriate: Value labels

Very useful command! One thing that would make it even more useful is if it was possible to add the values to the bars in some way. See the example attached from the SPAD package (sorry about the screenshot only showing part of the plot).

Jan Fredrik Hovden

Save output of pimage (to plot along with other plots)

Hi,

I'm using pimage() and seriate() to visualize adjacency matrices of my networks and they are great. However, I need to put pimage plots next to each other (and other plots) using, for instance, par() or screen().

Do you have an idea how to do this? Or, could you modify the function so that it does the job?
Thanks.

Unable to use "reorder.hclust" function

Hi,
for some reason (unknown to me) I can't use the "reorder.hclust" function.
I installed seriation package correctly, but there is no option as "reorder" at all. other functions do appear and I can use them (seriate for example).

I tried reinstalling the package, and even R itself (including Rstudio) but nothing changes.
any ideas?

in this picture you can see that there is no "reorder.hclust" option in the package functions.

thanks!

hmap throws an error: expecting integer, sees a list (apparently)

Unexpectedly, my package ChemoSpec broke where it uses seriation::hmap. The error can be seen via:

data("Wood")
hmap(Wood)

I don't see anything in the release notes for the current version that would explain this. sessionInfo() below; basically everything is up-to-date. Thanks.

> sessionInfo()
R Under development (unstable) (2021-12-03 r81290)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.0.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] seriation_1.3.1

loaded via a namespace (and not attached):
[1] colorspace_2.0-2 compiler_4.2.0   TSP_1.1-11       tools_4.2.0     
[5] codetools_0.2-18 grid_4.2.0       iterators_1.0.13 foreach_1.5.1   
[9] registry_0.5-1

README

The README claims that get_order shows names but it does not

What README claims

library(seriation)
data("SupremeCourt")

SupremeCourt
## ...snip...

d <- as.dist(SupremeCourt)
order <- seriate(d)
order
## ...snip...

get_order(order)
##    Scalia    Thomas Rehnquist   Kennedy   OConnor    Souter    Breyer  Ginsburg 
##         6         9         5         3         4         7         1         2 
##   Stevens 
##         8

How it actually works:

library(seriation)
data("SupremeCourt")

SupremeCourt
## ...snip...

d <- as.dist(SupremeCourt)
order <- seriate(d)
order
## ...snip...

get_order(order)
## [1] 6 9 5 3 4 7 1 2 8   <--------------- no names

Feature request: seriate a frequency matrix using correspondence analysis

With mosaic displays, I often want to permute the rows/columns of a frequency matrix to show the pattern of association, a task that is easily done using correspondence analysis, permuting by the scores on the first CA dimension.

This is analogous to your PCA method, but CA uses an SVD of the matrix of residuals from an independence model. It would be a useful and welcome addition to seriate.

Before I remembered the seriate package, I cobbled up a quick function reorder.matrix(),

#' Reorder a matrix according to a correspondence analysis dimension
#' 
#' This function is designed to help simplify a mosaic plot or other displays of a matrix of
#' frequencies.  It calculates a correspondence analysis of the matrix and reorders rows or columns
#' or both according to the scores on a correspondence analysis dimension.
#'
#' @param mat    A numeric matrix
#' @method reorder matrix
#' @param which  Which dimension(s) to reorder? Either \code{1} (rows) or \code{2} (columns) or \code{1:2} (both)
#' @param dim    Correspondence analysis dimension to reorder upon
#' @param ...    Other options, passed to \code{ca()}.
#'
#' @return       The matrix, with its rows and or columns reordered
#' @author Michael Friendly
#' @export
#'
#' @examples
#' data(HairEyeColor)
#' HairEye <- margin.table(HairEyeColor, 2:1)
#' HairEye
#' # Reordering by eye color gives a nicer mosaic display
#' reorder.matrix(HairEye, 1)

reorder.matrix <- function(x, which, dim=1, ...) {
  library(ca)
  mat.ca <- ca::ca(x)
  rord <- 1:nrow(x)
  cord <- 1:ncol(x)
  if (1 %in% which) {
    rcoord <- mat.ca$rowcoord    # row coordinates
    rord <- order(rcoord[, dim]) 
  }
  if (2 %in% which) {
    ccoord <- mat.ca$colcoord    # col coordinates
    cord <- order(ccoord[, dim])
  }
  x <- x[rord, cord]
  x
}

I took a look at the code in seriate, and what I'm proposing is similar in intent to seriate_PCA.R,
but there is too much infrastructure there for me to know how to write an analogous seriate_CA.R

Ideally, the method I'd like would be applicable to matrix, table and data.frame objects.

Can someone help?
If you don't want to put it in the seriate package, I could include it in vcdExtra, if only I knew how
to write it as a method to add to the seriate generic.

Dependency error

R

R version 3.5.3 (2019-03-11) -- "Great Truth"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-mageia-linux-gnu (64-bit)

install.packages("seriation")

Warnung: Abhängigkeit ‘caTools’ nicht verfügbar
installiere auch Abhängigkeit ‘gplots’

versuche URL 'https://ftp.gwdg.de/pub/misc/cran/src/contrib/gplots_3.0.1.2.tar.gz'
Content type 'application/octet-stream' length 638102 bytes (623 KB)
downloaded 623 KB

versuche URL 'https://ftp.gwdg.de/pub/misc/cran/src/contrib/seriation_1.2-8.tar.gz'
Content type 'application/octet-stream' length 898343 bytes (877 KB)
downloaded 877 KB

ERROR: dependency ‘caTools’ is not available for package ‘gplots’

removing ‘/usr/lib64/R/library/gplots’
ERROR: dependency ‘gplots’ is not available for package ‘seriation’
removing ‘/usr/lib64/R/library/seriation’

Die heruntergeladenen Quellpakete sind in
‘/tmp/RtmplhSKyj/downloaded_packages’
Aktualisiere HTML Index der Pakete in '.Library'
Making 'packages.html' ... fertig
Warnmeldungen:
1: In install.packages("seriation") :
Installation des Pakets ‘gplots’ hatte Exit-Status ungleich 0
2: In install.packages("seriation") :
Installation des Pakets ‘seriation’ hatte Exit-Status ungleich 0

update.packages(checkBuild=TRUE, ask=FALSE)

Doesn't help.

Support for xlab and ylab

seriation is a great package, but not being able to add labels (in the sense of xlab="From file", ylab="To file" is very frustrating).

Please add support for the functionality provided by xlab and ylab, in other plotting functions.

seration to return a dendrogram object after reodering

Hi,

Many thanks for this package, I am playing around with it.
I usually use ComplexHeatmap for making heatmaps.

The following code works fine.

library(ComplexHeatmap)
library(seriation)

data("iris")
x <- as.matrix(iris[-5])

seri_order<- c(seriate(dist(x, method = "euclidean"), method = "OLO_ward"),
  seriate(dist(t(x), method = "euclidean"), method = "OLO_ward"))

Heatmap( x[get_order(seri_order, 1), get_order(seri_order, 2)], cluster_rows = F, cluster_columns = F, show_row_names = F)

but I want to show the dendrograms in the heatmap as well, I know in the package hmap function can do it, but I want to stick with Complexheatmap.

I need seriation to return back a dendrogram object (reordered with OLO) and pass it to Complexheatmap

something like below so the reordered dendrogram can be shown.

rowdend<- seriate(dist(x, method = "euclidean"), method = "OLO_ward")
col_dend <- seriate(dist(t(x), method = "euclidean"), method = "OLO_ward")

Heatmap( x,  cluster_rows = row_dend,
        cluster_columns = col_dend)

Thanks!
Tommy

Redundant write in loop

In the file bea.f there are two loops on line 38 and 180. It writes the first row/column to the reordered matrix. The following lines are contained inside the loop but shouldn't be, given subsequent calls make no change to the ifin and ib or jb variables.

 ifin(istart) = 0
 ib(nplace) = istart

The loop should only encompass the assignment to b(1,j) and b(i,1) respectively.

new function: seriate.dataframe

Hi Michael,

First, thanks for this excellent package! I use it a lot and it works well. Thanks!

I just wanted to suggest that users might enjoy having a new function like seriate.dataframe() for simultaneously seriating 2 factors in a dataframe.

I work with genomics, so I often have data in the "long" format instead of a matrix format. Below, I wrote my own function that converts the long format to a matrix and then calls the seriate.matrix() function on that. Would you consider adding such a function to the seriation package?

Here is a notebook that demonstrates the new function:

https://gist.github.com/slowkow/e0f86b6944db58019a4573e96eb04f59

What do you think?

Here's a first draft of the function copied from the notebook:

# Return a dataframe where 'col1' and 'col2' are factors with levels in order.
seriate_dataframe <- function(
  d, col1, col2, value.var = "percent", fun.aggregate = NULL, method = "BEA_TSP"
) {
  mat <- as.data.frame(data.table::dcast(
    data          = d,
    formula       = as.formula(sprintf("%s ~ %s", col1, col2)),
    value.var     = value.var,
    fun.aggregate = fun.aggregate
  ))
  rownames(mat) <- mat[[1]]
  mat[[1]] <- NULL
  mat <- as.matrix(mat)
  mat[is.na(mat)] <- 0
  mat_order <- seriation::seriate(mat, method = method)
  d[[col1]] <- factor(as.character(d[[col1]]), rownames(mat)[mat_order[[1]]])
  d[[col2]] <- factor(as.character(d[[col2]]), colnames(mat)[mat_order[[2]]])
  return(d)
}