
scrna-seq_online's Introduction

Single-cell RNA-seq data analysis workshop

Audience: Biologists
Computational skills required: Introduction to R
Duration: 3-session online workshop (~7.5 hours of trainer-led time)

Description

This repository contains teaching materials for a hands-on introductory workshop on single-cell RNA-seq analysis. The workshop teaches participants how to design a single-cell RNA-seq experiment and how to efficiently manage and analyze the data, starting from count matrices. We focus on using the Seurat package in R/RStudio. Participants need a working knowledge of R or must have completed the Introduction to R workshop.

Note for Trainers: The schedule linked below assumes that learners will spend 3-4 hours between classes reading through selected lessons and completing their exercises. The online component of the workshop focuses on additional exercises and discussion/Q&A.

These materials were developed for a trainer-led workshop, but are also amenable to self-guided learning.

Learning Objectives

  • Describe best practices for designing a single-cell RNA-seq experiment
  • Describe steps in a single-cell RNA-seq analysis workflow
  • Use Seurat and associated tools to perform analysis of single-cell expression data, including data filtering, QC, integration, clustering, and marker identification
  • Understand practical considerations for performing scRNA-seq, rather than in-depth exploration of algorithm theory

Lessons

Installation Requirements

Applications

Download the most recent versions of R and RStudio for your laptop:

Packages for R

Note 1: Install the packages in the order listed below.

Note 2: All the package names listed below are case sensitive!

Note 3: If you have a Mac with an M1 chip, download and install this tool before installing your packages: https://mac.r-project.org/tools/gfortran-12.2-universal.pkg

Note 4: At any point (especially if you’ve used R/Bioconductor in the past), in the console R may ask you if you want to update any old packages by asking Update all/some/none? [a/s/n]:. If you see this, type "a" at the prompt and hit Enter to update any old packages. Updating packages can sometimes take quite a bit of time to run, so please account for that before you start with these installations.

Note 5: If you see a message in your console along the lines of “binary version available but the source version is later”, followed by a question, “Do you want to install from sources the package which needs compilation? y/n”, type n for no, and hit enter.

(1) Install the 4 packages listed below from Bioconductor using the BiocManager::install() function.

  1. AnnotationHub
  2. ensembldb
  3. multtest
  4. glmGamPoi

Please install them one-by-one as follows:

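# BiocManager itself comes from CRAN and must be present before this step
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("AnnotationHub")
BiocManager::install("ensembldb")
BiocManager::install("multtest")
BiocManager::install("glmGamPoi")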

(2) Install the 8 packages listed below from CRAN using the install.packages() function.

  1. tidyverse
  2. Matrix
  3. RCurl
  4. scales
  5. cowplot
  6. BiocManager
  7. Seurat
  8. metap

Please install them one-by-one as follows:

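install.packages("tidyverse")
install.packages("Matrix")
install.packages("RCurl")
install.packages("scales")
install.packages("cowplot")
install.packages("BiocManager")
install.packages("Seurat")
install.packages("metap")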

(3) Finally, please check that all the packages were installed successfully by loading them one at a time using the library() function.

library(Seurat)
library(tidyverse)
library(Matrix)
library(RCurl)
library(scales)
library(cowplot)
library(metap)
library(AnnotationHub)
library(ensembldb)
library(multtest)
library(glmGamPoi)

(4) Once all packages have been loaded, run sessionInfo().

sessionInfo()
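
As an optional complement to step (3), the check can be sketched in a single pass that reports every missing package at once instead of stopping at the first failed library() call. The names required_pkgs, is_installed, and missing_pkgs are illustrative and not part of the workshop materials.

```r
# Packages from steps (1) and (2) that are loaded in step (3)
required_pkgs <- c("Seurat", "tidyverse", "Matrix", "RCurl", "scales",
                   "cowplot", "metap", "AnnotationHub", "ensembldb",
                   "multtest", "glmGamPoi")

# requireNamespace() returns TRUE or FALSE instead of raising an error
is_installed <- vapply(required_pkgs, requireNamespace,
                       FUN.VALUE = logical(1), quietly = TRUE)

# Names of the packages that could not be found
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) == 0) {
  message("All workshop packages are installed.")
} else {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
}
```

Any package reported as missing can then be installed again with the function given for it in step (1) or (2).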

Citation

To cite material from this course in your publications, please use:

Mary Piper, Meeta Mistry, Jihe Liu, William Gammerdinger, & Radhika Khetani. (2022, January 6). hbctraining/scRNA-seq_online: scRNA-seq Lessons from HCBC (first release). Zenodo. https://doi.org/10.5281/zenodo.5826256.

A lot of time and effort went into the preparation of these materials. Citations help us understand the needs of the community, gain recognition for our work, and attract further funding to support our teaching activities. Thank you for citing this material if it helped you in your data analysis.


These materials have been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC) RRID:SCR_025373. These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

scrna-seq_online's People

Contributors

amelie-tghn, gammerdinger, hackdna, hwick, jihe-liu, kant, kew24, mariasimoneau, marypiper, mistrm82, nsohail19, rkhetani


scrna-seq_online's Issues

suggested homework question for SCT lesson

Extract the entire section on exploring mito ratio effects as an exercise.

Have them plot and decide on whether or not there is an effect. Explain why or why not.

Later in the lesson include a note to say that we observed an effect and so we include it as a variable to regress out.

Add note for subsetting datasets

The Seurat developers got back to me on our question: "It's difficult to advise on a specific number of cells below which the integration will not work well, all I can really say is to keep in mind that this can be an issue for small datasets, and you could try setting the value of k slightly lower in those cases."

This also addresses the question somewhat: satijalab/seurat#3868

Simpler way to go from seurat object to DESeq2 analysis?

Thanks for the wonderful tutorial! This, in combination with Seurat vignettes, has been incredibly helpful.

One question I had, as a primary wet lab worker who is learning more R and the computational side: Is there a simpler way to get from the merged Seurat object (e.g. 8 total samples, 4 control and 4 disease) to creating the DESeq2 object? As someone still self-teaching R, some of the pseudobulk tutorial (https://github.com/hbctraining/scRNA-seq_online/blob/master/lessons/pseudobulk_DESeq2_scrnaseq.md) assumes a better grasp of base R than I currently have.

Any guidance would be much appreciated! Thanks again for creating this fantastic tutorial for people like me!

algorithm theory

Add more links and resources?
In the course description, present the course as covering practical considerations rather than in-depth algorithm theory.

update project download zip

Should include the annotation file (from marker identification) and the new Seurat integrated object (maybe saved under a new name in an "additional_files" folder).

All associated lessons should be updated with wording accordingly.

suggested QC homework questions

Perform all of the same plots using the filtered data as we had done with the unfiltered data and answer the following questions:

  • Report the number of cells left for each sample.

  • Did we lose a lot of cells per sample? If the cell numbers remaining are much lower than the number of cells we loaded, how can we explain this loss?

  • After filtering for nGene per cell, you should still observe a small shoulder to the right of the main peak. What might this shoulder represent?

Question for us: would this apply to nUMI? What does the shoulder represent in the nUMI plot (i.e., if a cell has a lot of transcripts associated, it's sequenced a lot)?

  • The normal range of values for genes detected is ______ to _____. We set a fairly liberal threshold and keep cells that have as few as 250 genes detected. How do we explain cells that have so few genes being detected, considering the large number of genes present in the genome? (this one could use some work)

  • After filtering, when plotting the nGene against nUMI do you observe any data points in the bottom right quadrant of the plot? If you don't see anything, what can you say about these cells that have been removed?

A good exercise would be to provide a nUMI or nGene plot (from a previous consult?) and ask them to choose the threshold. Explain why you chose the threshold

Update to sctransform requires additional package install

https://satijalab.org/seurat/articles/sctransform_vignette.html

The latest version of sctransform also supports using glmGamPoi package which substantially improves the speed of the learning procedure. It can be invoked by specifying method="glmGamPoi".

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")

BiocManager::install("glmGamPoi")
pbmc <- SCTransform(pbmc, method = "glmGamPoi", vars.to.regress = "percent.mt", verbose = FALSE)
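
Since glmGamPoi is an optional extra install, one way to keep a script working on machines without it is to choose the method at run time. This is only a sketch: "poisson" is assumed here to be an accepted fallback value for the method argument, sct_method is an illustrative name, and the SCTransform() call is skipped unless Seurat and the vignette's pbmc object are available.

```r
# Use glmGamPoi only when it is installed; "poisson" is an assumed fallback
sct_method <- if (requireNamespace("glmGamPoi", quietly = TRUE)) {
  "glmGamPoi"
} else {
  "poisson"
}

# `pbmc` is the Seurat object from the vignette snippet above; the call is
# skipped when Seurat or the object is not present in the session
if (requireNamespace("Seurat", quietly = TRUE) && exists("pbmc")) {
  pbmc <- Seurat::SCTransform(pbmc, method = sct_method,
                              vars.to.regress = "percent.mt", verbose = FALSE)
}
```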

How to split Seurat object according to sample or condition

Based on an issue for our previous version of this workshop posted by @learnyoung at hbctraining/scRNA-seq#34:

Thank you so much for sharing; it has benefited me a lot.
I have 10 samples in 2 conditions, and every condition has 5 samples. When integrating the data for analysis, how should I split?
Should I use the condition or the 10 samples in Seurat?
data.list <- SplitObject(data, split.by = "sample") or by condition.
Looking forward to your reply.


You should always explore your data prior to performing integration to determine whether any integration is needed. We explain this in more detail in our integration lesson: https://github.com/hbctraining/scRNA-seq_online/blob/master/lessons/06_integration.md. So, first explore the data without integration, decide whether or not you need to perform integration across samples or conditions, and move forward from there. If you decide you need to integrate, then we would recommend splitting the object by the variable that you would like to integrate across.

Best,
Mary
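
To make the difference between the two choices concrete, here is a minimal base-R sketch on made-up per-cell metadata matching the question (10 samples, 5 per condition). The data frame meta, the condition labels "ctrl" and "disease", and the cell counts are hypothetical; split() stands in for what SplitObject() does to a Seurat object.

```r
# Hypothetical per-cell metadata: 10 samples, 5 per condition, 3 cells each
meta <- data.frame(
  sample    = rep(paste0("sample", 1:10), each = 3),
  condition = rep(c("ctrl", "disease"), each = 15)
)

# Splitting by sample yields one group per sample
by_sample <- split(meta, meta$sample)
length(by_sample)

# Splitting by condition yields one group per condition
by_condition <- split(meta, meta$condition)
length(by_condition)
```

With a Seurat object, the counterpart is the SplitObject(data, split.by = ...) call quoted in the question, pointed at whichever metadata column you decide to integrate across.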

Cell group 1 has fewer than 3 cells

Hello Guys,
I am facing this error Cell group 1 has fewer than 3 cells when running FindConservedMarkers. Any idea of what could be possibly going wrong?

Thanks in advance.
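
The error text says that one of the groups being compared contains fewer than three cells. One possible cause, offered here as an assumption and not as a confirmed diagnosis for this report, is a cluster that has very few cells in one of the conditions. A minimal base-R sketch of how to look for such clusters, using made-up metadata (the column names cluster and condition and the labels "ctrl" and "stim" are hypothetical):

```r
# Hypothetical per-cell metadata; with a Seurat object these columns
# would come from the object's metadata table
meta <- data.frame(
  cluster   = c(rep("0", 40), rep("1", 12), rep("2", 6)),
  condition = c(rep(c("ctrl", "stim"), each = 20),
                rep(c("ctrl", "stim"), times = c(10, 2)),
                rep("ctrl", 6))
)

# Number of cells per cluster within each condition
cell_counts <- table(meta$cluster, meta$condition)
cell_counts

# Clusters with fewer than 3 cells in at least one condition
small_clusters <- rownames(cell_counts)[apply(cell_counts < 3, 1, any)]
small_clusters
```

Clusters flagged this way would be candidates to skip or to inspect before testing for conserved markers.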

have learners create multiple R scripts

Add instructions to https://github.com/hbctraining/scRNA-seq_online/blob/master/lessons/06_SC_SCT_normalization.md to create a new script for normalization and integration.

Add instructions to https://github.com/hbctraining/scRNA-seq_online/blob/master/lessons/07_SC_clustering_cells_SCT.md to create new script for cluster and marker ID.

We do this internally and might have instructed learners to do it in the in-person classes, but this is missed in the remote/flipped classroom environment.

Some warnings to look over (from student)

I have two warning messages, although I succeeded in following the homework.

- For the “Clustering” part, when I run the following code, a warning message pops up.
Even though I got this message, I succeeded in the following tasks.
DimPlot(seurat_integrated,
        reduction = "umap",
        label = TRUE,
        label.size = 6)

Warning message:
Using as.character() on a quosure is deprecated as of rlang 0.3.0.
Please use as_label() or as_name() instead.
This warning is displayed once per session.

- For the “Clustering quality control” part, when I run the following code, I get this warning.
As I said earlier, I could see the same result as the textbook.
# Determine metrics to plot present in seurat_integrated@meta.data
metrics <- c("nUMI", "nGene", "S.Score", "G2M.Score", "mitoRatio")
FeaturePlot(seurat_integrated,
            reduction = "umap",
            features = metrics,
            pt.size = 0.4,
            sort.cell = TRUE,
            min.cutoff = 'q10',
            label = TRUE)

Warning: The sort.cell parameter is being deprecated. Please use the order parameter instead for equivalent functionality.

QC set-up lesson note

Add text about why we use the raw 10X data instead of the filtered (be sure to mention this in class)

update the pre-clustering RData object link

For clustering we need to update the link so that the object is consistent with the new code from the SCT lesson. This might require figure updates (and possibly other things) in downstream lessons

update any use of write_rds

Warning message:
The `path` argument of `write_rds()` is deprecated as of readr 1.4.0.
Please use the `file` argument instead.

DefaultAssay for marker identification

Hello,

Thank you for creating these lessons, they truly are a valuable resource to understanding single cell data processing.
I have a question with regards to setting defaultAssay in lesson '09_merged_SC_marker_identification.md'.

Shouldn't the defaultAssay be set to integrated instead of RNA?

Since we integrated samples from two conditions, wouldn't we want to find conserved markers using integrated dataset? We have performed SCTransform on 3000 variable features (which are anchor features to integrate data), we do not have top variable features identified in our RNA assay. Wouldn't it throw an error while trying to use FindConservedMarkers() on RNA assay?

I will be happy to hear your thoughts on this.

Best,
Khushbu

are we normalizing too many times?

A couple of issues with the SCT lesson.

First, we use the full dataset (merged object with both samples) to check for cellcycle effects. After evaluation, we then decide that we do not need to regress it out. We then perform SCT on the full dataset. Was this all done just for example purposes?

Because next we split the samples into separate objects and run the for loop:

    for (i in seq_along(split_seurat)) {
        # Log-normalize, score cell cycle phase, then run SCTransform per sample
        split_seurat[[i]] <- NormalizeData(split_seurat[[i]], verbose = TRUE)
        split_seurat[[i]] <- CellCycleScoring(split_seurat[[i]], g2m.features = g2m_genes, s.features = s_genes)
        split_seurat[[i]] <- SCTransform(split_seurat[[i]], vars.to.regress = c("mitoRatio"))
    }
  1. If the code described before this is simply an example, we should say so.
  2. Should cell cycle effects be assessed per sample? If so, then it might not be a good idea to run this loop. The SCT argument vars.to.regress should be evaluated for each individual case before running.
  3. If it's fine to assess cell cycle effects (and any other sources of unwanted variation) across all samples, and if we are running CellCycleScoring inside the loop simply to have the columns in our metadata post-integration, can we instead do the following:
  • normalize filtered_seurat
  • cellcycle scoring
  • PCA
  • split this object to run each through SCT (assuming the cellcycle scores columns will persist in the split objects)
  • integrate

Revise the first day poll question

For the first poll question, change "scRNA-seq generates a lot of data" to "computational requirement for dealing with large data size".
Might need some work with the correct answer too.
Right now, students' answers are widely distributed.

poll question edits

Clarify the responses and the double-negative questions, i.e., "Pick the NOT" questions that then have "not"s in the responses.

error when running FindConservedMarkers

Error: Please install the metap package to use FindConservedMarkers.
This can be accomplished with the following commands: 
----------------------------------------
install.packages('BiocManager')
BiocManager::install('multtest')
install.packages('metap')
----------------------------------------

Wrong formula or description when calculating log10GenesPerUMI

In this file https://github.com/hbctraining/scRNA-seq_online/blob/master/lessons/04_SC_quality_control.md either the description or formula should be corrected, because log10(x/y) = log10(x) - log10(y) and not log10(x/y) = log10(x) / log10(y)

Number of genes detected per UMI

This value is quite easy to calculate, as we simply divide the number of genes detected per cell by the number of UMIs per cell. We will log10 transform the result for a better comparison between samples.

# Add number of genes per UMI for each cell to metadata
merged_seurat$log10GenesPerUMI <- log10(merged_seurat$nFeature_RNA) / log10(merged_seurat$nCount_RNA)
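
The mismatch the issue points out can be checked with a couple of numbers. The sketch below uses a hypothetical cell with 1,000 detected genes and 10,000 UMIs; the variable names are illustrative only.

```r
n_genes <- 1000    # genes detected in a hypothetical cell
n_umis  <- 10000   # UMIs counted in the same cell

# What the code computes: a ratio of two logs
ratio_of_logs <- log10(n_genes) / log10(n_umis)

# What the description says: the log of the ratio
log_of_ratio <- log10(n_genes / n_umis)

# The log of a ratio equals a difference of logs, not a quotient of logs
difference_of_logs <- log10(n_genes) - log10(n_umis)

c(ratio_of_logs = ratio_of_logs,
  log_of_ratio = log_of_ratio,
  difference_of_logs = difference_of_logs)
```

The last two values agree with each other and differ from the first, so the description and the code are describing two different quantities.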

Reference for the concept of complexity when processing scRNA-seq

Hi!

Thank you very much for the great tutorial for scRNA-seq analysis!

I am interested in the concept of complexity and the novelty score used here as a filtering metric. Normally nGenes and nUMI are used to filter out low-quality cells, and I am curious what additional low-quality cells this complexity metric can filter out. If a cell has a large enough number of genes (e.g., > 500), in which situation would it have such a large number of UMIs that its log10GenesPerUMI score falls below 0.8? I read the explanation below but could not think this through. Could you please give me an example or point me to some references/publications about this concept and the 0.8 cutoff?

Thanks,
Mao

"
We can evaluate each cell in terms of how complex the RNA species are by using a measure called the novelty score. The novelty score is computed by taking the ratio of nGenes over nUMI. If there are many captured transcripts (high nUMI) and a low number of genes detected in a cell, this likely means that you only captured a low number of genes and simply sequenced transcripts from those lower number of genes over and over again. These low complexity (low novelty) cells could represent a specific cell type (i.e. red blood cells which lack a typical transcriptome), or could be due to some other strange artifact or contamination. Generally, we expect the novelty score to be above 0.80 for good quality cells."
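
As a purely arithmetic aid for this question: if the score is computed as log10(nGene) / log10(nUMI), as in the lesson's log10GenesPerUMI code, then for nUMI > 1 a score below 0.8 is the same as nUMI > nGene^(1/0.8) = nGene^1.25. The helper name umi_cutoff below is hypothetical, and the sketch says nothing about which cell types reach such counts.

```r
# nUMI above which log10(nGene) / log10(nUMI) drops below `cutoff`
umi_cutoff <- function(n_genes, cutoff = 0.8) {
  n_genes^(1 / cutoff)
}

# UMI counts at which cells with these gene counts would score exactly 0.8
n_genes <- c(500, 1000, 2000)
data.frame(nGene = n_genes, nUMI_cutoff = round(umi_cutoff(n_genes)))
```

For example, a cell with 1,000 detected genes would need more than roughly 5,600 UMIs to score below 0.8 under this formula.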

edit language in intro lesson

In the intro lesson, add some language about situations in which bulk RNA-seq could also work for biomarker discovery:

"This can be the best choice of method if looking for disease biomarkers "

Suggestion for variable.features.n argument in SCTransform()

Thanks a lot for providing this beautiful tutorial here! I have a question regarding setting variable.features.n argument in SCTransform() function. You said,

NOTE: By default, after normalizing, adjusting the variance, and regressing out uninteresting sources of variation, SCTransform will rank the genes by residual variance and output the 3000 most variant genes. If the dataset has larger cell numbers, then it may be beneficial to adjust this parameter higher using the variable.features.n argument.

I have a dataset containing ~70,000 cells. In your opinion, what would be a good value to set for variable.features.n ? Should I use default 3000 or I should increase? Sorry for this basic question as I am new to the scRNAseq analysis!

add a "what's next?" markdown

In this lesson we can outline things one might do after marker identification (DE, subclustering, trajectory analysis)
