hbctraining / scrna-seq_online

Home Page: https://hbctraining.github.io/scRNA-seq_online/.

scrna-seq_online's Introduction

Single-cell RNA-seq data analysis workshop

Audience	Computational skills required	Duration
Biologists	Introduction to R	3-session online workshop (~7.5 hours of trainer-led time)

Description

This repository has teaching materials for a hands-on Introduction to single-cell RNA-seq analysis workshop. The workshop instructs participants on how to design a single-cell RNA-seq experiment, and how to efficiently manage and analyze the data starting from count matrices. It is a hands-on workshop that focuses on the Seurat package in R/RStudio. A working knowledge of R, or completion of the Introduction to R workshop, is required.

Note for Trainers: The schedule linked below assumes that learners will spend 3-4 hours between classes reading through selected lessons and completing their exercises. The online component of the workshop focuses on additional exercises and discussion/Q&A.

These materials were developed for a trainer-led workshop, but are also amenable to self-guided learning.

Learning Objectives

Describe best practices for designing a single-cell RNA-seq experiment
Describe steps in a single-cell RNA-seq analysis workflow
Use Seurat and associated tools to perform analysis of single-cell expression data, including data filtering, QC, integration, clustering, and marker identification
Understand practical considerations for performing scRNA-seq, rather than in-depth exploration of algorithm theory

Lessons

Installation Requirements

Applications

Download the most recent versions of R and RStudio for your laptop:

R (version 4.0.0 or above)
RStudio

Packages for R

Note 1: Install the packages in the order listed below.

Note 2: All the package names listed below are case sensitive!

Note 3: If you have a Mac with an M1 chip, download and install this tool before installing your packages: https://mac.r-project.org/tools/gfortran-12.2-universal.pkg

Note 4: At any point (especially if you’ve used R/Bioconductor in the past), in the console R may ask you if you want to update any old packages by asking Update all/some/none? [a/s/n]:. If you see this, type "a" at the prompt and hit Enter to update any old packages. Updating packages can sometimes take quite a bit of time to run, so please account for that before you start with these installations.

Note 5: If you see a message in your console along the lines of “binary version available but the source version is later”, followed by a question, “Do you want to install from sources the package which needs compilation? y/n”, type n for no, and hit enter.

(1) Install the 4 packages listed below from Bioconductor using the BiocManager::install() function.

AnnotationHub
ensembldb
multtest
glmGamPoi

Please install them one-by-one as follows:

BiocManager::install("AnnotationHub")
BiocManager::install("ensembldb")
& so on ...

(2) Install the 8 packages listed below from CRAN using the install.packages() function.

tidyverse
Matrix
RCurl
scales
cowplot
BiocManager
Seurat
metap

Please install them one-by-one as follows:

install.packages("tidyverse")
install.packages("Matrix")
install.packages("RCurl")
& so on ...

(3) Finally, please check that all the packages were installed successfully by loading them one at a time using the library() function.

library(Seurat)
library(tidyverse)
library(Matrix)
library(RCurl)
library(scales)
library(cowplot)
library(metap)
library(AnnotationHub)
library(ensembldb)
library(multtest)
library(glmGamPoi)

(4) Once all packages have been loaded, run sessionInfo().

sessionInfo()
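
As an optional shortcut for step (3), here is a minimal sketch that reports which of the listed packages are not yet installed, without attaching them. The helper name `find_missing_packages` is hypothetical and not part of the workshop materials; `requireNamespace()` is base R.

```r
# Packages listed in steps (1) and (2) of the installation instructions
workshop_packages <- c(
  "Seurat", "tidyverse", "Matrix", "RCurl", "scales", "cowplot",
  "metap", "AnnotationHub", "ensembldb", "multtest", "glmGamPoi"
)

# Hypothetical helper: return the names of packages that cannot be loaded
find_missing_packages <- function(pkgs) {
  is_installed <- vapply(
    pkgs,
    function(pkg) requireNamespace(pkg, quietly = TRUE),
    logical(1)
  )
  pkgs[!is_installed]
}

not_installed <- find_missing_packages(workshop_packages)
if (length(not_installed) == 0) {
  message("All workshop packages are installed.")
} else {
  message("Still to install: ", paste(not_installed, collapse = ", "))
}
```

Any package named in the message can then be installed with the matching command from step (1) or (2).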

Citation

To cite material from this course in your publications, please use:

Mary Piper, Meeta Mistry, Jihe Liu, William Gammerdinger, & Radhika Khetani. (2022, January 6). hbctraining/scRNA-seq_online: scRNA-seq Lessons from HCBC (first release). Zenodo. https://doi.org/10.5281/zenodo.5826256.

A lot of time and effort went into the preparation of these materials. Citations help us understand the needs of the community, gain recognition for our work, and attract further funding to support our teaching activities. Thank you for citing this material if it helped you in your data analysis.

These materials have been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC) RRID:SCR_025373. These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

scrna-seq_online's Issues

Add note for subsetting datasets

The Seurat developers got back to me on our question: "It's difficult to advise on a specific number of cells below which the integration will not work well, all I can really say is to keep in mind that this can be an issue for small datasets, and you could try setting the value of k slightly lower in those cases."

This also addresses the question somewhat: satijalab/seurat#3868

Simpler way to go from seurat object to DESeq2 analysis?

Thanks for the wonderful tutorial! This, in combination with Seurat vignettes, has been incredibly helpful.

One question I had, as a primary wet lab worker who is learning more R and the computational side: Is there a simpler way to get from the merged Seurat object (e.g. 8 total samples, 4 control and 4 disease) to creating the DESeq2 object? As someone still self-teaching R, some of the pseudobulk tutorial (https://github.com/hbctraining/scRNA-seq_online/blob/master/lessons/pseudobulk_DESeq2_scrnaseq.md) assumes a better grasp of base R than I currently have.

Any guidance would be much appreciated! Thanks again for creating this fantastic tutorial for people like me!

Add integration code to homework

link the R script for running integration functions

algorithm theory

Add more links and Resources?
In the description of the course, present it as focused on practical considerations rather than an in-depth treatment of algorithm theory

Add DE as optional self-learning

update project download zip

It should include the annotation file (from marker identification) and the new Seurat integrated object (maybe saved under a new name in an "additional_files" folder).

All associated lessons should be updated with matching wording.

link to show an example of cell cycle regression

https://satijalab.org/seurat/v3.1/cell_cycle_vignette.html

An example of when you do see an effect

incorrect description of avg_logfc in markers lesson

From email:

I have a query regarding the avg_log fold change mentioned in this link: https://github.com/hbctraining/scRNA-seq_online/blob/master/lessons/09_merged_SC_marker_identification.md
The training materials indicate that this is a log2 fold change.
The link from the Seurat development team suggests that this is natural log fold change (satijalab/seurat#741 )

Getting DE lesson back on schedule page

small issue with new function names

update images for pseudobulk

update homework for QC

add questions that require interpretation of the plots rather than just reporting numbers

Update to sctransform requires additional package install

https://satijalab.org/seurat/articles/sctransform_vignette.html

The latest version of sctransform also supports using glmGamPoi package which substantially improves the speed of the learning procedure. It can be invoked by specifying method="glmGamPoi".

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")

BiocManager::install("glmGamPoi")
pbmc <- SCTransform(pbmc, method = "glmGamPoi", vars.to.regress = "percent.mt", verbose = FALSE)

How to split Seurat object according to sample or condition

Based on an issue for our previous version of this workshop posted by @learnyoung at hbctraining/scRNA-seq#34:

Thank you so much for your sharing,it has benefited me a lot.
I have 10 samples in 2 conditions; every condition has 5 samples. So when integrating the data for analysis, how should I split?
Should I use the condition or the 10 samples in Seurat?
data.list <- SplitObject(data, split.by = "sample") or by condition.
Looking forward to your reply

You should always explore your data prior to performing integration to determine whether any integration is needed. We explain this in more detail in our integration lesson: https://github.com/hbctraining/scRNA-seq_online/blob/master/lessons/06_integration.md. So, first explore the data without integration, decide whether or not you need to perform integration across samples or conditions, and move forward from there. If you decide you need to integrate, then we would recommend splitting the object by the variable that you would like to integrate across.

Best,
Mary

add resources

https://broadinstitute.github.io/2020_scWorkshop/

? https://nbisweden.github.io/workshop-scRNAseq/schedule.html

https://azimuth.hubmapconsortium.org/

Single-nucleus and single-cell transcriptomes compared in matched cortical cell types paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6306246/ and https://www.nature.com/articles/s41591-020-0844-1

cellmarker

cellphonedb (https://www.nature.com/articles/s41576-020-00292-x)
spatial transcriptomics
CITE-seq

Cell group 1 has fewer than 3 cells

Hello Guys,
I am facing this error Cell group 1 has fewer than 3 cells when running FindConservedMarkers. Any idea of what could be possibly going wrong?

Thanks in advance.
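
One likely cause (an assumption, since the issue does not include the data) is that a cluster contains fewer than three cells in one of the conditions, because FindConservedMarkers() tests each level of its grouping variable separately. Below is a minimal sketch of how to check this with base R on a made-up metadata table. With a real object the same `table()` call could be run on the corresponding metadata columns. The column names `cluster` and `condition` and all values are hypothetical.

```r
# Toy metadata: one row per cell (hypothetical column names and values)
cell_metadata <- data.frame(
  cluster   = c(rep("0", 6), rep("1", 5), rep("2", 4)),
  condition = c(rep("ctrl", 3), rep("stim", 3),   # cluster 0: 3 ctrl, 3 stim
                rep("ctrl", 4), rep("stim", 1),   # cluster 1: 4 ctrl, 1 stim
                rep("ctrl", 2), rep("stim", 2))   # cluster 2: 2 ctrl, 2 stim
)

# Cells per cluster (rows) and condition (columns)
cells_per_group <- table(cell_metadata$cluster, cell_metadata$condition)
print(cells_per_group)

# Clusters with fewer than 3 cells in at least one condition
min_cells <- 3
too_small <- rownames(cells_per_group)[apply(cells_per_group < min_cells, 1, any)]
print(too_small)
```

Clusters flagged this way would need to be skipped, or merged with a neighbouring cluster, before looking for conserved markers.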

update the workflow image

SCT and exploring unwanted variation is missing in the boxes

have learners create multiple R scripts

Add instructions to https://github.com/hbctraining/scRNA-seq_online/blob/master/lessons/06_SC_SCT_normalization.md to create a new script for normalization and integration.

Add instructions to https://github.com/hbctraining/scRNA-seq_online/blob/master/lessons/07_SC_clustering_cells_SCT.md to create new script for cluster and marker ID.

We do this internally and might have instructed learners to do it in the in-person classes, but this is missed in the remote/flipped classroom environment.

assess the effect of mito ratio?

we include this as a vars.to.regress yet we don't have an image to show that we observed an effect

Remove SingleCellExperiment package

Test if SingleCellExperiment package is required for this workshop. And if not, remove it from the installation instruction.

Extend question for clustering lesson

Ask how many clusters are obtained at different resolutions, and which resolution they should use.

Some warnings to look over (from student)

I got two warning messages, although I succeeded in following the homework.

-For “Clustering” part, when I run the following code, a warning message popped up.
Even though I found this message, I succeeded in the following tasks.
```
DimPlot(seurat_integrated,
        reduction = "umap",
        label = TRUE,
        label.size = 6)
```

Warning message:
Using as.character() on a quosure is deprecated as of rlang 0.3.0.
Please use as_label() or as_name() instead.
This warning is displayed once per session.

-For “Clustering quality control” part, when I run the following code, I found this warning.
As I said earlier, I could see the same result as the textbook.
```
# Determine metrics to plot present in seurat_integrated@meta.data
metrics <- c("nUMI", "nGene", "S.Score", "G2M.Score", "mitoRatio")
FeaturePlot(seurat_integrated,
            reduction = "umap",
            features = metrics,
            pt.size = 0.4,
            sort.cell = TRUE,
            min.cutoff = 'q10',
            label = TRUE)
```

Warning: The sort.cell parameter is being deprecated. Please use the order parameter instead for equivalent functionality.

QC set-up lesson note

Add text about why we use the raw 10X data instead of the filtered (be sure to mention this in class)

add a simple exercise

At the end of https://github.com/hbctraining/scRNA-seq_online/blob/master/lessons/06_SC_SCT_normalization.md, we should ask them to report on the output of split_seurat$stim@assays as an exercise.

update R skeleton scripts for homework

Some exercises have been updated and the skeleton scripts need to be updated to reflect that

Update marker ID to be log2 fc

Re: Archana:

In the recent update to Seurat (v4.0.0), the log Fold Change value has been changed to log2 scale.

References: https://github.com/satijalab/seurat/blob/4e868fcde49dc0a3df47f94f5fb54a421bfdf7bc/NEWS.md#changes-1

https://satijalab.org/seurat/reference/findallmarkers
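
For results produced with an older Seurat version, a natural-log fold change can be put on the log2 scale by dividing by ln(2). A minimal sketch in base R follows; the helper name `ln_to_log2` is hypothetical, not a Seurat function.

```r
# Convert a natural-log fold change to the log2 scale:
# log2(x) = ln(x) / ln(2)
ln_to_log2 <- function(ln_fc) ln_fc / log(2)

# A two-fold increase, expressed on both scales
ln_fc   <- log(2)              # natural-log fold change for a doubling
log2_fc <- ln_to_log2(ln_fc)   # the same change on the log2 scale
print(log2_fc)
```

Any fold-change threshold used for filtering markers would need the same conversion when moving between versions.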

update the pre-clustering RData object link

For clustering we need to update the link so that the object is consistent with the new code from the SCT lesson. This might require figure updates (and possibly other things) in downstream lessons

include the cycle.rda in the project download?

since we are having them download, a suggestion to have everything they need there

update any use of write_rds

Warning message:
The `path` argument of `write_rds()` is deprecated as of readr 1.4.0.
Please use the `file` argument instead.
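
A minimal sketch of the updated call, assuming readr 1.4.0 or later is installed. The object and file name are stand-ins for illustration; in the lessons the object would be a Seurat object.

```r
library(readr)

# Small stand-in object (hypothetical); the lessons save Seurat objects
example_object <- list(sample = "ctrl", n_cells = 100L)

out_file <- tempfile(fileext = ".rds")

# Deprecated form: write_rds(example_object, path = out_file)
# Current form: name the argument `file` (or pass it positionally)
write_rds(example_object, file = out_file)

# Read the object back to confirm the round trip
restored <- read_rds(out_file)
```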

DefaultAssay for marker identification

Hello,

Thank you for creating these lessons, they truly are a valuable resource to understanding single cell data processing.
I have a question with regards to setting defaultAssay in lesson '09_merged_SC_marker_identification.md'.

Shouldn't the defaultAssay be set to integrated instead of RNA?

Since we integrated samples from two conditions, wouldn't we want to find conserved markers using integrated dataset? We have performed SCTransform on 3000 variable features (which are anchor features to integrate data), we do not have top variable features identified in our RNA assay. Wouldn't it throw an error while trying to use FindConservedMarkers() on RNA assay?

I will be happy to hear your thoughts on this.

Best,
Khushbu

are we normalizing too many times?

A couple of issues with the SCT lesson.

First, we use the full dataset (merged object with both samples) to check for cellcycle effects. After evaluation, we then decide that we do not need to regress it out. We then perform SCT on the full dataset. Was this all done just for example purposes?

Because next we split the samples into separate objects and run the for loop:

for (i in 1:length(split_seurat)) {
    split_seurat[[i]] <- NormalizeData(split_seurat[[i]], verbose = TRUE)
    split_seurat[[i]] <- CellCycleScoring(split_seurat[[i]], g2m.features=g2m_genes, s.features=s_genes)
    split_seurat[[i]] <- SCTransform(split_seurat[[i]], vars.to.regress = c("mitoRatio"))
}

If the code described before this is simply an example, we should say so.
Should cell cycle effects be assessed per sample? If so, then it might not be a good idea to run this loop; the SCTransform vars.to.regress argument should be evaluated for each individual case before running it.
If it's fine to assess cell cycle effects (and any other sources of unwanted variation) across all samples, and we are running CellCycleScoring inside the loop simply to have the columns in our metadata post-integration, can we instead do the following:

normalize filtered_seurat
cellcycle scoring
PCA
split this object to run each through SCT (assuming the cellcycle scores columns will persist in the split objects)
integrate

Revise the first day poll question

For the first poll question, change "scRNA-seq generates a lot of data" to "computational requirement for dealing with large data size".
Might need some work with the correct answer too.
Right now, students' answers are widely distributed.

poll question edits

Clarify the responses and the double-negative questions, i.e. questions that ask learners to pick the NOT option and then also have "not" in the responses.

error when running FindConservedMarkers

Error: Please install the metap package to use FindConservedMarkers.
This can be accomplished with the following commands: 
----------------------------------------
install.packages('BiocManager')
BiocManager::install('multtest')
install.packages('metap')
----------------------------------------

Add answer key

Add answer key of exercises from this lesson: https://github.com/hbctraining/scRNA-seq_online/blob/master/lessons/06_SC_SCT_normalization.md

Wrong formula or description when calculating log10GenesPerUMI

In this file https://github.com/hbctraining/scRNA-seq_online/blob/master/lessons/04_SC_quality_control.md either the description or formula should be corrected, because log10(x/y) = log10(x) - log10(y) and not log10(x/y) = log10(x) / log10(y)

Number of genes detected per UMI

This value is quite easy to calculate, as we simply divide the number of genes detected per cell by the number of UMIs per cell. We will log10 transform the result for a better comparison between samples.

# Add number of genes per UMI for each cell to metadata
merged_seurat$log10GenesPerUMI <- log10(merged_seurat$nFeature_RNA) / log10(merged_seurat$nCount_RNA)
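
A quick numeric check of the point raised above, using made-up counts for a single cell (the numbers are hypothetical): the ratio of the two logs is a different quantity from the log of the ratio.

```r
# Toy values for one cell (hypothetical numbers)
n_genes <- 1000    # genes detected (nFeature_RNA)
n_umis  <- 10000   # UMIs counted (nCount_RNA)

ratio_of_logs <- log10(n_genes) / log10(n_umis)   # what the lesson code computes
log_of_ratio  <- log10(n_genes / n_umis)          # what the description implies
diff_of_logs  <- log10(n_genes) - log10(n_umis)   # identical to log_of_ratio

print(c(ratio_of_logs = ratio_of_logs,
        log_of_ratio  = log_of_ratio,
        diff_of_logs  = diff_of_logs))
```

With these values the lesson formula gives 0.75, while the log of the ratio is -1, so the description and the code cannot both stand as written.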

look over poll questions

check the language for clarity

Add new viz for PCA

Update and add Jihe's visualization for PCA

add homework question to explore the Seurat object

Reference for the concept of complexity when processing scRNA-seq

Hi!

Thank you very much for the great tutorial for scRNA-seq analysis!

I am interested in the concept of complexity and the novelty score used here as a filtering metric. Normally nGenes and nUMI are used to filter out low quality cells, and I am curious about what kind of additional low quality cells can this complexity metric filter out? If a cell has large enough number of genes (e.g., > 500), in which situation it would have extremely large number of UMI thus a log10GenesPerUMI score < 0.8? I read the explanation below but could not think this through. Could you please give me an example or point me to some references/publications about this concept and the 0.8 cutoff?

Thanks,
Mao

"
We can evaluate each cell in terms of how complex the RNA species are by using a measure called the novelty score. The novelty score is computed by taking the ratio of nGenes over nUMI. If there are many captured transcripts (high nUMI) and a low number of genes detected in a cell, this likely means that you only captured a low number of genes and simply sequenced transcripts from those lower number of genes over and over again. These low complexity (low novelty) cells could represent a specific cell type (i.e. red blood cells which lack a typical transcriptome), or could be due to some other strange artifact or contamination. Generally, we expect the novelty score to be above 0.80 for good quality cells."
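
To make the question concrete, here is a worked sketch assuming the score is computed as log10(nGene) / log10(nUMI), as in the QC lesson code. For a fixed number of genes, the score drops below a cutoff c once nUMI exceeds nGene^(1/c). The function and variable names below are hypothetical.

```r
# Complexity score, assuming the log10(genes) / log10(UMIs) form
complexity_score <- function(n_genes, n_umis) log10(n_genes) / log10(n_umis)

n_genes <- 500
cutoff  <- 0.8

# score < cutoff  <=>  n_umis > n_genes^(1 / cutoff)
umi_threshold <- n_genes^(1 / cutoff)
print(round(umi_threshold))

score_2000 <- complexity_score(n_genes, 2000)   # above the cutoff
score_3000 <- complexity_score(n_genes, 3000)   # below the cutoff
print(c(score_2000, score_3000))
```

Under this assumption, a cell with 500 detected genes falls below 0.8 once it has more than roughly 2,364 UMIs, so the UMI count does not need to be extreme.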

Default assay to use for finding marker

Based on these two threads: satijalab/seurat#1501, satijalab/seurat#1836, one could use sctransform data to perform marker identification, or the normalized and scaled data. The note in our current material says "the functions for finding markers will automatically pull the raw counts".

next dataset for revamping workshop

The dataset from Vascular smooth muscle-derived Trpv1 + progenitors are a source of cold-induced thermogenic adipocytes.

GEO: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE160585
SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRP290688

edit language in intro lesson

In the intro lesson, add some language about situations in which bulk RNA-seq could also work for biomarker discovery:

"This can be the best choice of method if looking for disease biomarkers "

Suggestion for variable.features.n argument in SCTransform()

Thanks a lot for providing this beautiful tutorial here! I have a question regarding setting variable.features.n argument in SCTransform() function. You said,

NOTE: By default, after normalizing, adjusting the variance, and regressing out uninteresting sources of variation, SCTransform will rank the genes by residual variance and output the 3000 most variant genes. If the dataset has larger cell numbers, then it may be beneficial to adjust this parameter higher using the variable.features.n argument.

I have a dataset containing ~70,000 cells. In your opinion, what would be a good value to set for variable.features.n ? Should I use default 3000 or I should increase? Sorry for this basic question as I am new to the scRNAseq analysis!

day 1 poll language

Replace the word complexity with challenges in scRNA-seq?

Have learners save sessionInfo

Create instructions in https://github.com/hbctraining/scRNA-seq_online/blob/master/lessons/09_merged_SC_marker_identification.md to run sessionInfo and save it as a text file.

writeLines(capture.output(sessionInfo()), "sessionInfo.txt")
OR

sink("sessionInfo.txt")
sessionInfo()
sink()