vertesy / pseudobulk

Cluster-specific pseudo-bulk analysis of 10X single-cell RNA-seq data by connecting Seurat to the VBC RNA-seq pipeline.

Home Page: https://vertesy.github.io/pseudoBulk

License: MIT License

Languages: R 41.65% · Shell 48.40% · Perl 9.96%
Topics: single-cell bulk rna-seq seurat split-bam pseudo-bulk


Cluster-specific pseudo-bulk analysis of 10X single-cell RNA-seq data

Cluster-specific pseudo-bulk analysis of 10X single-cell RNA-seq data by connecting Seurat to the VBC RNA-seq pipeline.

Introduction

See the document on pseudo-bulk analysis and my presentation on pseudo-bulk analysis. In brief, pseudo-bulk analysis allows:

  1. a top-level overview & comparison of all single-cell datasets generated in the lab,
  2. comparison of single-cell clusters vs. bulk (sorted) RNA-seq samples,
  3. in-depth study of sequencing libraries (read position distributions, extensive quality control),
  4. separation of .bam files per sample (in case of sample pooling, e.g. MULTI-seq).

Pseudo-bulk analysis for top-level overview

[figure: top-level pseudo-bulk overview]

The top-level overview is very simple and does not require preprocessing in R.

Pseudo-bulk analysis for cluster-specific comparisons

[figure: cluster-specific pseudo-bulk comparison]

The cluster-specific overview requires preprocessing in R and the export of barcodes. The exported barcodes are then used to subset the BAM files with subset-bam (see the steps below).

Steps for the cluster-specific analysis

1. Run Seurat: find clusters, etc.

[explained elsewhere]

2. Export clusters

Depends on: CodeAndRoll.R and MarkdownReports

Use get.ls.of.CBCs and write.out.CBCs.per.cl to write out the barcodes per dataset, per cluster. Script found in: /Users/abel.vertesy/GitHub/Seurat.multicore/Seurat3.Write.Out.CBCs.for.subset-bam.R

get.ls.of.CBCs <- function(scobj = combined.obj, ident = 'integrated_snn_res.0.3',
                           plotit = TRUE, trim.libName = FALSE) {
  # Collect cell barcodes (CBCs) per cluster, per library.
  Idents(scobj) <- ident
  id_x <- Idents(scobj)
  dsets <- unique(stringr::str_split_fixed(names(id_x), pattern = '_', n = 2)[, 2])
  iprint("Returns a list of lists (", l(dsets), "libraries [", dsets, "] /",
         l(levels(id_x)), "clusters [", levels(id_x), "] )")
  ls_CBCs <- list.fromNames(levels(id_x))
  print("Cluster:")

  for (cl in 1:l(levels(id_x))) {
    print(cl)
    cells_x <- WhichCells(scobj, idents = levels(id_x)[cl])  # cells in cluster cl
    cells_perLib <- stringr::str_split_fixed(cells_x, pattern = '_', n = 2)
    cells_perLib <- cbind(cells_perLib, scobj$orig.ident[cells_x])
    ls_CBCs[[cl]] <- split(x = cells_perLib, f = cells_perLib[, 3])  # split by library (orig.ident)
  }
  revlist <- reverse.list.hierarchy(ls_CBCs)  # libraries on top, clusters inside

  if (!isFALSE(trim.libName)) { # remove suffixes such as .WT and .TSC2
    names(revlist) <- stringr::str_split_fixed(names(revlist), pattern = "\\.", n = 2)[, 1]
  }
  if (plotit) {
    for (i in 1:l(revlist)) {
      ClusterSizes <- unlapply(revlist[[i]], l)
      wpie(ClusterSizes, savefile = FALSE, plotname = paste('ClusterSizes,', names(revlist)[i]))
    }
  }
  return(revlist)
}
# ls.of.CBCs <- get.ls.of.CBCs(trim.libName = TRUE); names(ls.of.CBCs)

# ------------------------------------------------------------

write.out.CBCs.per.cl <- function(ls_CBCs = ls.of.CBCs, add.suffix = "-1", openOutDir = TRUE) {
  # Take the output of get.ls.of.CBCs() as input and write one CSV of barcodes per cluster, per dataset.
  (depth <- l(ls_CBCs))
  (dsets <- names(ls_CBCs))
  for (i in 1:l(dsets)) {
    outputDir <- p0(OutDir, "CBCs/", dsets[i])
    dir.create(outputDir, recursive = TRUE)
    inside.ls <- ls_CBCs[[i]]
    for (j in 1:l(inside.ls)) {
      CBCs <- inside.ls[[j]]
      if (!isFALSE(add.suffix)) CBCs <- p0(CBCs, add.suffix)  # Cell Ranger barcodes end in "-1"
      write.simple.vec(input_vec = CBCs, ManualName = p0(outputDir, "/Cl.", j, ".csv"))
    }
  }
  if (openOutDir) system(paste("open", outputDir))  # open the last output folder (macOS)
}
# write.out.CBCs.per.cl()
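A minimal usage sketch chaining the two functions above: combined.obj and the ident are the defaults from the function signatures, while the OutDir value is a hypothetical placeholder that write.out.CBCs.per.cl() prepends to its output paths.

# Assumes combined.obj is an integrated Seurat object and the functions above are sourced.
OutDir <- "~/Data/pseudobulk/"  # hypothetical output root; "CBCs/<dataset>/" is appended to it

ls.of.CBCs <- get.ls.of.CBCs(scobj = combined.obj,
                             ident = "integrated_snn_res.0.3",
                             trim.libName = TRUE)
write.out.CBCs.per.cl(ls_CBCs = ls.of.CBCs, add.suffix = "-1")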

3. Copy the output folder next to the BAM folder

Example folder structure:

[figure: example folder structure]

4. Subset the BAM files with run.split-bam.sh

The barcoded BAM format is described here. An alternative method is described here.

Depends: 10XGenomics/subset-bam

Script found in: /Users/abel.vertesy/GitHub/Seurat.multicore/split.bam.files.scripts/run.split-bam.sh:

# https://github.com/10XGenomics/subset-bam
# alternative scripts: https://divingintogeneticsandgenomics.rbind.io/post/split-a-10xscatac-bam-file-by-cluster/

# --bam (-b): Input 10x Genomics BAM. This BAM must have the CB tag to define the barcodes of cell barcodes (or the tag defined by --bam-tag). Must also have an index (.bai) file. REQUIRED.
# --cell-barcodes (-c): A cell barcodes file as produced by Cell Ranger that defines which barcodes were called as cells. One barcode per line. In Cell Ranger runs, this can be found in the sub-folder outs/filtered_gene_bc_matrices_mex/${refGenome}/barcodes.tsv where ${refGenome} is the name of the reference genome used in your Cell Ranger run. This file can be used as column labels for the output matrix. REQUIRED.
# --out-bam (-o): A path to write the subsetted BAM file to. REQUIRED.
# --cores: Number of parallel cores to use. DEFAULT: 1.
# --log-level: One of info, error or debug. Increasing levels of logging. DEFAULT: error.
# --bam-tag: Change this to use an alternative tag to the default CB tag. This can be useful for subsetting BAMs from LongRanger.


"on server"
tmux 
cdd /groups/knoblich/users/abel/Data/

# lib="101147"
lib="101146"

bamdir="bam.files.cellranger/"$lib
bamfile=$bamdir/$(**ls** $bamdir | grep ".bam$")
outdir=$bamdir"/bam.per.cl/"
**mkdir** $outdir
BCdir="CBCs/"$lib"/"

**for** BC **in** $(**ls** $BCdir | grep ".csv")
**do**
  BCfile=$BCdir$BC
  echo $BCfile
  echo $bamfile
  outbam=$outdir$BC".bam"
  echo $outbam
  subset-bam --cores 8 --bam $bamfile --cell-barcodes $BCfile --**out**-bam $outbam --**log**-level debug
done

# samtools **view** /Volumes/abel/Data/bam.files.cellranger/101146/bam.per.**cl**/**Cl**.1.csv.bam | head
# samtools **view** /Volumes/abel/Data/bam.files.cellranger/101146/Oli.d110.101146.WT.bam | head -100

5. Reanalyzing with the VBC RNA-seq pipeline

Described elsewhere.


pseudobulk's Issues

Strand-specific mapping location of reads via stranded bigwig files

Motivation

Sense and antisense transcripts have overlapping regions for some interesting genes, such as TSC2 or MALAT1.
You need stranded bigwig files to see whether reads map to the sense or the antisense strand.

How to

You can get both stranded and unstranded bigwig files with the standard pipeline.

Results

Load the stranded .bw files into IGV from:
/Volumes/abel-1/Data/pseudobulk/iiRNAseq_ii.GRCh38_20191120162110/bigwig/
or:
/groups/knoblich/users/abel/Data/pseudobulk/iiRNAseq_ii.GRCh38_20191120162110/bigwig/

Interpretation

Note that deepTools names the stranded tracks counterintuitively: the file labeled for one strand contains the reads from the opposite strand.

Biological findings from pseudoBulk: intronic content and MALAT1

General

  • Check if you see the same in our other sc datasets
    • rerun with Oli all 6
  • Check if you see the same in bulk

MALAT1

  • Is the MALAT1 correlation to RC true in general?
  • Is the MALAT1 fraction correlation to TSS-RC true in general?

Intronic content

  • Where do we find high intronic content?

Better PCA analysis

Motivation

To better understand how similar samples are, PCA is a good start.
However, the current analysis is insufficient:

  • the first 3 PCs often do not separate all identified clusters,
  • we need to scale each axis in proportion to the variance explained by each PC.

Output needed

  • Get the variance explained by each PC
  • Plot more PCs (see the sketch below)
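
A minimal R sketch of both outputs, assuming expr is a samples × genes pseudo-bulk expression matrix (the variable name and plotting choices are illustrative, not part of the pipeline):

# expr: samples x genes pseudo-bulk expression matrix (hypothetical input).
pca <- prcomp(expr, scale. = TRUE)

# Variance explained by each PC, from the standard deviations prcomp() returns.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
barplot(100 * var_explained,
        names.arg = paste0("PC", seq_along(var_explained)),
        ylab = "Variance explained (%)")

# Plot PCs beyond the first three; asp = 1 keeps the data scale equal on both
# axes, so each axis's visual spread is proportional to the variance it explains.
plot(pca$x[, 4], pca$x[, 5], asp = 1,
     xlab = paste0("PC4 (", round(100 * var_explained[4], 1), "%)"),
     ylab = paste0("PC5 (", round(100 * var_explained[5], 1), "%)"))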

Turn a pseudobulk object from scRNA-seq with logCPM values back into a Seurat object with count values?

May I ask one question:

If the data is already a pseudobulk object from scRNA-seq data with logCPM values, how can I change it back into a Seurat object with count values? Can I still use the above method to turn the data back into a Seurat object?

My data was normalized into pseudobulk data as follows:
"Normalizing count data
After excluding poor quality cells, we normalized the sequencing depth of each cell by dividing
each cell’s counts by the total counts in that cell, resulting in a matrix where the entries represent
the proportion of a cell’s reads allocated to each gene (i.e. values in the range [0,1]). To estimate
a library size for each dataset, we summed the total counts in each cell, and then we took the
median as the library size for the dataset. Next, we multiplied the proportions by the library size
to get a count matrix that was normalized for sequencing depth. Finally, we transformed the
normalized count matrix with log2(1 + count). We referred to this log-transformed quantity in the
figures as log2CPM.

We created pseudobulk expression (L. Lun, Bach, and Marioni 2016) for the cells in a
cluster for each donor such that the pseudobulk matrix had one row for each gene and one column
for each cluster from each patient.
We normalized the pseudobulk counts to log2CPM as
described in the previous section. Then we use limma::lmFit() to test for differential gene
expression with the log2CPM pseudobulk matrix (Ritchie et al. 2015). We also use
presto::wilcoxauc() to compute the area under receiver operator curve (AUROC or AUC) for the
log2CPM value of each gene as a predictor of the cluster membership for each cluster
(Korsunsky, Nathan, et al. 2019).
"
Thanks, best wishes,
J.
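
For reference, a minimal R sketch of the normalization described in the quoted methods; the function name and the input matrix counts are hypothetical. Note that 2^x - 1 undoes the log transform but only recovers the depth-normalized counts, not the original raw counts, since per-cell depth information is lost:

# counts: a genes x cells matrix of raw UMI counts (hypothetical input).
normalize_log2CPM <- function(counts) {
  lib_sizes   <- colSums(counts)                   # total counts per cell
  proportions <- sweep(counts, 2, lib_sizes, "/")  # per-gene fraction of each cell's reads
  target_size <- median(lib_sizes)                 # median total as the dataset's library size
  norm_counts <- proportions * target_size         # depth-normalized counts
  log2(1 + norm_counts)                            # the "log2CPM" of the quoted text
}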

To Do

Milestones to finalize the pipeline

  • Barcode export from Seurat
  • Barcode de-duplications
  • Find the cause of repeated qNames in the 10X aligned bam file

Barcode de-duplications

  • Explore: is it already done by 10X?
  • Find a solution via 10X, or alternatively via a Python script

Possible future improvements

  • Use the exact same index & reference as 10X Cell Ranger uses.
  • Make it executable directly on a 10X output folder (using the clustering that 10X does).

Using the exact same index & reference

→ This is actually needed if you want to analyze 10X runs that have been mapped to the pre-mRNA reference.

When you run the VBC RNA-seq pipeline, provide the gene model (GTF) and the exact same human genome they use (you have to link both directly).

  • --gtf to use exactly the same gene names
    • if you provide both --gtf and --fasta, it recreates the reference
    • provide both --gtf and --fasta and --saveReference, but not --genome, to save the custom reference
    • Pass on to Maria → add it to the nextflow config
