Coder Social home page Coder Social logo

epigen / dea_seurat Goto Github PK

View Code? Open in Web Editor NEW
11.0 5.0 10.0 343 KB

A Snakemake workflow for performing differential expression analyses (DEA) on (multimodal) sc/snRNA-seq data powered by the R package Seurat.

License: MIT License

Python 32.56% R 67.44%
bioinformatics biomedical-data-science differential-expression-analysis scrna-seq single-cell snakemake snrna-seq visualization workflow volcano-plot

dea_seurat's Introduction

DOI

Single-cell RNA sequencing (scRNA-seq) Differential Expression Analysis & Visualization Snakemake Workflow

A Snakemake workflow for performing differential expression analyses (DEA) of processed (multimodal) scRNA-seq data powered by the R package Seurat's functions FindMarkers and FindAllMarkers.

This workflow adheres to the module specifications of MR.PARETO, an effort to augment research by modularizing (biomedical) data science. For more details, instructions and modules check out the project's repository. Please consider starring and sharing modules that are useful to you, this helps me in prioritizing my efforts!

If you use this workflow in a publication, please don't forget to give credits to the authors by citing it using this DOI 10.5281/zenodo.10689139.

Workflow Rulegraph

Table of contents

Authors

Software

This project wouldn't be possible without the following software and their dependencies:

Software Reference (DOI)
data.table https://r-datatable.com
EnhancedVolcano https://doi.org/10.18129/B9.bioc.EnhancedVolcano
future https://doi.org/10.32614/RJ-2021-048
ggplot2 https://ggplot2.tidyverse.org/
pheatmap https://cran.r-project.org/package=pheatmap
Seurat https://doi.org/10.1016/j.cell.2021.04.048
Snakemake https://doi.org/10.12688/f1000research.29032.2

Methods

This is a template for the Methods section of a scientific publication and is intended to serve as a starting point. Only retain paragraphs relevant to your analysis. References [ref] to the respective publications are curated in the software table above. Versions (ver) have to be read out from the respective conda environment specifications (workflow/envs/*.yaml file) or post execution in the result directory (/envs/scrnaseq_processing_seurat/*.yaml). Parameters that have to be adapted depending on the data or workflow configurations are denoted in squared brackets e.g., [X].

The outlined analyses were performed using the R package Seurat (ver) [ref] unless stated otherwise.

Differential Expression Analysis (DEA). DEA was performed on the assay [X] and data slot [X] with Seurat's [FindMarkers|FindAllMarkers] function using the statistical test [X] with the parameters log2(fold change) threshold of [X] and minimal percentage of expression [X]. The results were filtered for relevant features by adjusted p-value of [X], absolute log2(fold change) of [X] and minimum percentage of expression [X].

Visualization. All filtered result statistics, i.e., number of statistically significant results split by positive (up) and negative (down) effect-sizes, were separately visualized with stacked bar plots using ggplot (ver) [ref]. To visually summarize results of the same analysis the filtered log2(fold change) values of features that were found to be at least in one comparison statistically significantly differentially expressed were visualized in a hierarchically clustered heatmap using pheatmap (ver) [ref]. Volcano plots were generated for each analysis using EnhancedVolcano (ver) [ref] with adjusted p-value threshold of [X] and log2(fold change) threshold of [X] as visual cut-offs for the y- and x-axis, respectively.

The analysis and visualizations described here were performed using a publicly available Snakemake [ver] (ref) workflow [10.5281/zenodo.10689139].

Features

The workflow performs the following steps to produce the outlined results (dea_seurat/{analysis}/).

  • Differential Expression Analysis (DEA)
    • using Seurat's FindMarkers or FindAllMarkers depending on the configuration (results.csv). This step is parallelized using the R package future.
    • feature list per comparison group and direction (up/down) for downstream analysis (e.g., enrichment analysis) (feature_lists/{group}_{up/down}_features.txt)
    • (optional) feature score tables (with two columns: "feature" and "score") per comparison group using {score_formula} for downstream analyses (e.g., preranked enrichment analysis) (feature_lists/{group}_featureScores.csv).
    • results filtered according to the configured thresholds:
      • statistical significance (adjusted p-value).
      • effect-size (log2 fold change).
      • expression (minimum percentage of expression) in one of the comparison groups.
    • filtered result statistics: number of statistically significant results split by positive (up) and negative (down) change (stats.csv).
  • Visualizations (dea_seurat/{analysis}/plots)
    • filtered result statistics: number of features and direction as bar plot (stats.png).
    • volanco plots per comparison group with effect size on the x-axis and raw p-value(*_rawp)/adjusted p-value (_adjp) on the y-axis (volcano/{feature_list}/{group}.png).
      • highlighting features according to configured cut-offs for statistical significance (pCutoff) and effect size (FCcutoff).
      • (optional) highlighting features according to configured feature lists.
    • hierarchically clustered heatmap of effect sizes (LFC) per comparison (features x comparisons) indicating statistical significance with a star '*' (heatmap/{feature_list}.png).
      • using all filtered features (FILTERED)
      • (optional) using configured feature lists
      • in case of more than 100 features the row labels and significance indicators (*) are removed

Usage

Here are some tips for the usage of this workflow:

  • Perform your first run with loose filtering options/cut-offs and set the same for filtering and plotting to see if further filtering is even necessar or useful.
  • Try one small/simple analysis first before running all desired analyses.

Configuration

Detailed specifications can be found here ./config/README.md

Examples

We selected a scRNA-seq data set consisting of 15 CRC samples from Lee et al (2020) Lineage-dependent gene expression programs influence the immune landscape of colorectal cancer. Nature Genetics. Downloaded from the Weizmann Institute - Curated Cancer Cell Atlas (3CA) - Colorectal Cancer section.

  • samples/patients: 15
  • cells: 21657
  • features (genes): 22276
  • preprocessed using the compatible MR.PARETO module for scRNA-seq data processing & visualization
  • We performed a 1 vs rest analysis using the cell type annotation ("ALL").
  • total runtime on HPC w/ SLURM (32GB RAM; only DEA with 8 cores otherwise 1 core): <25 minutes for 17 jobs in total

A comparison of the cell type marker expression split by cell types visualized as a dot plot with the DEA results as hierarchically clustered heatmap of the effect sizes.

data source/authors Workflow Output
Cell Type Marker Dot plot Cell Type Marker Dot plot

We provide metadata, annotation and configuration files for this data set in ./test. The processed and prepared Seurat RDS object has to be downloaded from Zenodo by following the instructions below.

# download Zenodo records using zenodo_get

# install zenodo_get v1.3.4
conda install -c conda-forge zenodo_get=1.3.4

# download the prepare Seurat RDS object
zenodo_get --record 10688824 --output-dir=test/data/Lee2020NatGenet/

Links

Resources

  • Recommended compatible MR.PARETO modules:
    • for upstream processing (before)
    • for downstream analyses (after)
      • Unsupervised Analysis to understand and visualize similarities and variations between groups using DEA results, including dimensionality reduction and cluster analysis. Useful for both group and gene level analyses.
    • Enrichment Analysis for biomedical interpretation of differential analysis results using prior knoweledge.

Publications

The following publications successfully used this module for their analyses.

  • ...

dea_seurat's People

Contributors

sreichl avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

Forkers

roblehmann

dea_seurat's Issues

DEA statistics barplot split of up and down by sign

To get a better overview of the number of differential features in each direction(!), up and down, the barplot could go literally up and down from zero. "down" features need a negative sign (-).
plot_df are the DEA statistics.

# plot config
direction_col <- list("up"="red", "down"="blue")

# plot
ggplot(plot_df, aes(x=groups, y=n_features, fill=direction)) + 
        geom_bar(stat="identity", position="identity") +
        scale_fill_manual(values=direction_col) +
        custom_theme +
        theme(legend.position = "none",
             axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)
             )

update and extend documentation

  • update current docs (work through current README)
    • rulegraph
    • remove patchwork from env.yamls
    • add package version numbers
  • incorporate changes to latest features
    • restructuring of results and plots
  • add latest features
    • feature lists
    • raw & adj p value volcano plots
    • statistics barplot
    • parallelization
    • example data set
      • provide rds object via Zenodo (3.2GB)
      • LFC heatmap of celltypemarkers
  • update config.yaml

support test-specific parameters

only implement if a use case is presented.
neg. binomial tests use the following parameters, which are currently hard-coded (to default) or not supported:

  • latent.vars
  • min.cells.feature
  • ...

Error in hclust(d, method = method) : must have n >= 2 objects to cluster

If only two cell groups are provided, there is only one column in the pheatmap plot (heatmap.R) and hclust causes an error.

This fixes it:

clusterCols <- ncol(dea_lfc) > 1

make heatmap

lfc_heatmap <- as.ggplot(pheatmap(dea_lfc,
show_rownames=F,
show_colnames=T,
cluster_cols = clusterCols,
fontsize = 5,
angle_col = 45,
treeheight_row = 25,
treeheight_col = 10,

annotation_col = annot,

           breaks=seq(-max(abs(dea_lfc)), max(abs(dea_lfc)), length.out=200),
           color=colorRampPalette(c("blue", "white", "red"))(200),
           annotation_names_col = F,
           silent = TRUE
          ))

NameError: name 'srcdir' is not defined

Hello,
thanks for the effort to turn the Seurat pipeline into a snakemake.
The following line in the Snakefile gives me an error since the function srcdir is not defined:

SDIR = os.path.realpath(os.path.dirname(srcdir("Snakefile")))

Had to comment it out to get to to run.
cheers

consider capping heatmaps

  • Get input from the team.
  • cap heatmaps like in enrichment analysis e.g.,
    • simple cap ie LFC>cap_th -> set to cap_th
    • log? (log again the LFC?)
    • scale the heatmaps?

Consider: it is a breaking change as it requires a new config field...

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.