Getting Started | Features | Installation | References

Scanpy - Single-Cell Analysis in Python

Efficient tools for analyzing and simulating large-scale single-cell data that aim at an understanding of dynamic biological processes from snapshots of transcriptome or proteome. The draft Wolf, Angerer & Theis (2017) explains conceptual ideas of the package. Any comments are appreciated!

Getting started

Download or clone the repository - green button on top of the page - and cd into its root directory. With Python 3.5 or 3.6 (preferably Miniconda) installed, type

pip install -e .

Aside from enabling import scanpy.api as sc anywhere on your system, you can also work with the top-level command scanpy on the command-line (more info on installation here).

Then go through the use cases compiled in scanpy_usage, in particular, the recent additions

17-05-22 | link | Visualizing the publicly available 10X 1.3 mio brain cell data set using the command line.
17-05-05 | link | We reproduce parts of the recent Guided Clustering tutorial of Seurat (Macosko et al., Cell 2015).
17-05-03 | link | Analyzing 68 000 cells from (Zheng et al., Nat. Comms. 2017), we find that Scanpy is about a factor 5 to 10 faster and more memory efficient than the Cell Ranger R kit for secondary analysis. For large-scale data, this becomes crucial for an interactive analysis.
17-05-01 | link | Diffusion Pseudotime analysis resolves developmental processes in data of Moignard et al, Nat. Biotechn. (2015), reproducing results of Haghverdi et al., Nat. Meth. (2016). Also, note that DPT has recently been very favorably discussed by the authors of Monocle.

Features

We first give an overview of the toplevel user functions of scanpy.api, followed by a few words on Scanpy's basic features and more details. For usage of the command-line interface, which is parallels usage of the API, see this introductory example.

Overview

Scanpy user functions are grouped into the following modules

sc.tools - Machine Learning and statistics tools. Abbreviation sc.tl.
sc.preprocessing - Preprocessing. Abbreviation sc.pp.
sc.plotting - Plotting. Abbreviation sc.pl.
sc.settings - Settings. Abbreviation sc.sett.

Preprocessing

pp.* - Filtering of highly-variable genes, batch-effect correction, per-cell (UMI) normalization.

Visualization

tl.pca - PCA (Pedregosa et al., 2011).
tl.diffmap - Diffusion Maps (Coifman et al., 2005; Haghverdi et al., 2015; Wolf et al., 2017).
tl.tsne - t-SNE (Maaten & Hinton, 2008; Amir et al., 2013; Pedregosa et al., 2011).
tl.spring - Force-directed graph drawing (Wikipedia; Weinreb et al., 2016).

Branching trajectories and pseudotime, clustering, differential expression

tl.dpt - Infer progression of cells, identify branching subgroups (Haghverdi et al., 2016; Wolf et al., 2017).
tl.dbscan - Cluster cells into subgroups (Ester et al., 1996; Pedregosa et al., 2011).
tl.diffrank - Rank genes according to differential expression (Wolf et al., 2017).

Simulation

tl.sim - Simulate dynamic gene expression data (Wittmann et al., 2009; Wolf et al., 2017).

Basic features

The typical workflow consists of subsequent calls of data analysis tools of the form

sc.tl.tool(adata, **params)

where adata is an AnnData object and params is a dictionary that stores optional parameters. Each of these calls adds annotation to an expression matrix X, which stores n d-dimensional gene expression measurements. By default, Scanpy tools operate inplace and return None. If you want to copy the AnnData object, pass the copy argument

adata_copy = sc.tl.tool(adata, copy=True, **params)

AnnData objects

Instantiate AnnData via

adata = sc.AnnData(X[, smp][, var][, add])

The instance adata stores X as adata.X and sample annotation as adata.smp, variable annotation as adata.var and additional unstructured annotation as adata.add. While adata.X is array-like and adata.add is a conventional dictionary, adata.smp and adata.var are instances of a Pandas dataframe-like class, which is based on a Numpy array and, respectively. Values can be retrieved and appended via adata.smp['foo_key'] and adata.var['bar_key']. Sample and variable names can be accessed via adata.smp_names and adata.var_names, respectively. AnnData objects can be sliced like Pandas dataframes, for example, adata = adata[:, list_of_gene_names]. The AnnData class is similar to R's ExpressionSet (Huber et al., 2015); the latter though is not implemented for sparse data.

Reading and writing data files and AnnData objects

Instead of invoking the explicit constructor, one usually calls

adata = sc.read(filename)

to initialize an AnnData object, and sc.write(filename, adata) to write it back to a file. Reading foresees filenames with extensions h5, xlsx, mtx, txt, csv and others. Writing foresees writing h5, csv and txt. Scanpy is smart about file storage and extensions. Instead of providing a full filename, you can provide filekeys. By default, Scanpy writes to ./write/filekey.h5, an hdf5 file, which is configurable by setting sc.sett.writedir and sc.sett.file_format_data.

Plotting

For each tool, there is an associated plotting function

sc.pl.tool(adata)

that retrieves and plots the elements of adata that were previously written by sc.tl.tool(adata). To not display figures interactively but save all plots to default locations, you can set sc.sett.savefigs = True. By default, figures are saved as png to ./figs/. Reset sc.sett.file_format_figs and sc.sett.figdir if you want to change this. Scanpy's plotting module can be seen similar to Seaborn: an extension of matplotlib that allows visualizing certain frequent tasks with one-line commands. Detailed configuration has to be done via matplotlib functions, which is easy as Scanpy's plotting functions usually return a Matplotlib.Axes object.

Builtin examples

Show all builtin example data using sc.show_exdata() and all builtin example use cases via sc.show_examples(). Load annotated and preprocessed data using an example key, here 'paul15', via

adata = sc.get_example('paul15')

The key 'paul15' can also be used within sc.read('paul15') and sc.write('paul15', adata) to write the current state of the AnnData object to disk.

Visualization

pca

[source] Computes the PCA representation X_pca of data, principal components and variance decomposition. Uses the implementation of the scikit-learn package (Pedregosa et al., 2011).

tsne

[source] Computes the tSNE representation X_tsne of data.

The algorithm has been introduced by Maaten & Hinton (2008) and proposed for single-cell data by Amir et al. (2013). By default, Scanpy uses the implementation of the scikit-learn package (Pedregosa et al., 2011). You can achieve a huge speedup if you install the Multicore-TSNE package by Ulyanov (2016), which will be automatically detected by Scanpy.

diffmap

[source] Computes the diffusion maps representation X_diffmap of data.

Diffusion maps (Coifman et al., 2005) has been proposed for visualizing single-cell data by Haghverdi et al. (2015). The tool uses the adapted Gaussian kernel suggested by Haghverdi et al. (2016). The Scanpy implementation is due to Wolf et al. (2017).

spring

Beta version.

[source] Force-directed graph drawing is a long-established algorithm for visualizing graphs, see Wikipedia. It has been suggested for visualizing single-cell data by Weinreb et al. (2016).

Here, the Fruchterman & Reingold (1991) algorithm is used. The implementation uses elements of the NetworkX implementation (Hagberg et al., 2008).

Discrete clustering of subgroups and continuous progression through subgroups

dpt

[source] Reconstruct the progression of a biological process from snapshot data and detect branching subgroups. Diffusion Pseudotime analysis has been introduced by Haghverdi et al. (2016) and implemented for Scanpy by Wolf et al. (2017).

The functionality of diffmap and dpt compare to the R package destiny of Angerer et al. (2015).

Examples: See one of the early examples [notebook/command line] dealing with data of Moignard et al., Nat. Biotechn. (2015).

dbscan

[source] Cluster cells using DBSCAN (Ester et al., 1996), in the implementation of scikit-learn (Pedregosa et al., 2011).

This is a very simple clustering method. A better one - in the same framework as DPT and Diffusion Maps - will come soon.

Differential expression

diffrank

[source] Rank genes by differential expression.

Simulation

sim

[source] Sample from a stochastic differential equation model built from literature-curated boolean gene regulatory networks, as suggested by Wittmann et al. (2009). The Scanpy implementation is due to Wolf et al. (2017).

The tool compares to the Matlab tool Odefy of Krumsiek et al. (2010).

Installation

If you do not have a current Python distribution (Python 3.5 or 3.6), download and install Miniconda (see below).

Then, download or clone the repository - green button on top of the page - and cd into its root directory. To install with symbolic links (stay up to date with your cloned version after you update with git pull) call

pip install -e .

and work with the top-level command scanpy or

import scanpy as sc

in any directory.

Installing Miniconda

After downloading Miniconda, in a unix shell (Linux, Mac), run

cd DOWNLOAD_DIR
chmod +x Miniconda3-latest-VERSION.sh
./Miniconda3-latest-VERSION.sh

and accept all suggestions. Either reopen a new terminal or source ~/.bashrc on Linux/ source ~/.bash_profile on Mac. The whole process takes just a couple of minutes.

PyPi

The package is registered in the Python Packaging Index, but versioning has not started yet. In the future, installation will also be possible without reference to GitHub via pip install scanpy.

References

Amir et al. (2013), viSNE enables visualization of high dimensional single-cell data and reveals phenotypic heterogeneity of leukemia Nature Biotechnology 31, 545.

Angerer et al. (2015), destiny - diffusion maps for large-scale single-cell data in R, Bioinformatics 32, 1241.

Coifman et al. (2005), Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps, PNAS 102, 7426.

Ester et al. (1996), A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR, pp. 226-231.

Haghverdi et al. (2015), Diffusion maps for high-dimensional single-cell analysis of differentiation data, Bioinformatics 31, 2989.

Haghverdi et al. (2016), Diffusion pseudotime robustly reconstructs branching cellular lineages, Nature Methods 13, 845.

Krumsiek et al. (2010), Odefy - From discrete to continuous models, BMC Bioinformatics 11, 233.

Krumsiek et al. (2011), Hierarchical Differentiation of Myeloid Progenitors Is Encoded in the Transcription Factor Network, PLoS ONE 6, e22649.

Maaten & Hinton (2008), Visualizing data using t-SNE, JMLR 9, 2579.

Moignard et al. (2015), Decoding the regulatory network of early blood development from single-cell gene expression measurements, Nature Biotechnology 33, 269.

Pedregosa et al. (2011), Scikit-learn: Machine Learning in Python, JMLR 12, 2825.

Paul et al. (2015), Transcriptional Heterogeneity and Lineage Commitment in Myeloid Progenitors, Cell 163, 1663.

Wittmann et al. (2009), Transforming Boolean models to continuous models: methodology and application to T-cell receptor signaling, BMC Systems Biology 3, 98.

jhung0 / scanpy Goto Github PK

scanpy's Introduction

Scanpy - Single-Cell Analysis in Python

Getting started

Features

Overview

Preprocessing

Visualization

Branching trajectories and pseudotime, clustering, differential expression

Simulation

Basic features

AnnData objects

Reading and writing data files and AnnData objects

Plotting

Builtin examples

Visualization

pca

tsne

diffmap

spring

Discrete clustering of subgroups and continuous progression through subgroups

dpt

dbscan

Differential expression

diffrank

Simulation

sim

Installation

Installing Miniconda

PyPi

References

Recommend Projects

Recommend Topics

Recommend Org