rna-seq-variant-calling's Introduction

RNA-Seq Variant Calling Pipeline

This workflow calls variants from RNA-Seq data using GATK4. The pipeline starts from raw FASTQ files and ends with a jointly called VCF file.

Main Steps

Mapping to the Reference

Tools involved: STAR

The pipeline begins by mapping RNA reads to a reference. We use the STAR aligner because it showed increased sensitivity compared to other aligners (especially for indels), and we use STAR's two-pass mode to get better alignments around novel splice junctions.
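As a rough illustration of the mapping step, the sketch below builds a STAR alignment command as an argument list. The function name, file names, and thread count are hypothetical and not taken from the pipeline. Passing junction files through `--sjdbFileChrStartEnd` at mapping time is one way to run a second pass; the pipeline itself may instead rebuild the index.

```python
from typing import List, Optional


def star_align_cmd(genome_dir: str, fastq1: str, fastq2: str, prefix: str,
                   threads: int = 8,
                   sj_files: Optional[List[str]] = None) -> List[str]:
    """Build a STAR alignment command (hypothetical helper, paired-end reads)."""
    cmd = [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", fastq1, fastq2,
        "--readFilesCommand", "zcat",        # assumes gzipped FASTQ input
        "--outSAMtype", "BAM", "Unsorted",
        "--outFileNamePrefix", prefix,
    ]
    if sj_files:
        # Second pass: supply junctions discovered in the first pass.
        cmd += ["--sjdbFileChrStartEnd", *sj_files]
    return cmd


first_pass = star_align_cmd("star_index", "s1_R1.fastq.gz", "s1_R2.fastq.gz", "pass1/s1_")
second_pass = star_align_cmd("star_index", "s1_R1.fastq.gz", "s1_R2.fastq.gz", "pass2/s1_",
                             sj_files=["pass1/s1_SJ.out.tab"])
```

The lists can be handed to `subprocess.run` or joined into a Snakemake `shell` string.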

Add read groups, sort, mark duplicates, and create index using Picard and Samtools

The STAR mapping step produces a BAM/SAM file, which we then put through the usual Picard processing steps: adding read group information, sorting, marking duplicates, and indexing for downstream processing.
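The Picard steps above could be assembled roughly as follows. This is a sketch under assumptions: the helper name, output file names, and read group values (all derived from the sample name, with `RGPL=ILLUMINA`) are placeholders, not the pipeline's actual settings.

```python
from typing import List


def picard_cmds(sample: str, in_bam: str, out_dir: str) -> List[List[str]]:
    """Return Picard commands for read groups + sorting, then duplicate marking + indexing."""
    rg_bam = f"{out_dir}/{sample}.rg.sorted.bam"
    dedup_bam = f"{out_dir}/{sample}.dedup.bam"
    metrics = f"{out_dir}/{sample}.dedup.metrics.txt"

    add_read_groups = [
        "picard", "AddOrReplaceReadGroups",
        f"I={in_bam}", f"O={rg_bam}",
        "SORT_ORDER=coordinate",              # sort while adding read groups
        f"RGID={sample}", f"RGLB={sample}",   # placeholder read group values
        "RGPL=ILLUMINA", f"RGPU={sample}", f"RGSM={sample}",
    ]
    mark_duplicates = [
        "picard", "MarkDuplicates",
        f"I={rg_bam}", f"O={dedup_bam}", f"M={metrics}",
        "CREATE_INDEX=true",                  # write a BAM index next to the output
    ]
    return [add_read_groups, mark_duplicates]


cmds = picard_cmds("s1", "pass2/s1_Aligned.out.bam", "picard")
```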

Split'N'Trim and Reassign mapping qualities

Tools involved: SplitNCigarReads

This step splits reads into exon segments (removing Ns but maintaining grouping information) and hard-clips any sequences overhanging into intronic regions. It also reassigns mapping qualities for the aligned reads, because STAR assigns good alignments a MAPQ of 255, which technically means "unknown" and is therefore meaningless to GATK.
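To make the splitting idea concrete, here is a simplified sketch of what happens to a single spliced alignment. It is not GATK's implementation: it only cuts the CIGAR at each N operator and tracks the reference start of each segment, while the real tool also hard-clips the remainder of the read and handles overhangs. The function names are hypothetical. The remapping of 255 to 60 mirrors the default behaviour of GATK4's SplitNCigarReads.

```python
import re
from typing import List, Tuple

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operators that advance the reference position


def split_on_n(pos: int, cigar: str) -> List[Tuple[int, str]]:
    """Split one alignment into (reference start, CIGAR) exon segments at each N."""
    segments = []
    ref = pos          # current reference coordinate
    start = pos        # reference start of the segment being built
    ops: List[str] = []
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op == "N":
            if ops:
                segments.append((start, "".join(ops)))
            ref += n   # skip the intron
            start = ref
            ops = []
        else:
            ops.append(f"{n}{op}")
            if op in REF_CONSUMING:
                ref += n
    if ops:
        segments.append((start, "".join(ops)))
    return segments


def reassign_mapq(mapq: int) -> int:
    """Map STAR's 255 ('unknown' per the SAM spec) to 60; leave other values alone."""
    return 60 if mapq == 255 else mapq


segments = split_on_n(100, "50M1000N25M")
```

For a read starting at position 100 with CIGAR `50M1000N25M`, this yields one segment at 100 (`50M`) and one at 1150 (`25M`).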

DAG

Base Quality Recalibration

This step corrects systematic bias observed in the data. These biases can originate from biochemical processes that occurred during library preparation and sequencing, from manufacturing defects in the chips, or from instrumentation defects in the sequencer. The recalibration step involves collecting covariate statistics from all base calls in the dataset, building a model from those statistics, and applying base quality adjustments to the dataset based on the resulting model.
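The collect, model, and apply stages can be illustrated with a toy recalibration table. This is a deliberately simplified sketch, not GATK's model: it uses only two covariates (reported quality and machine cycle), a plain smoothed error rate, and hypothetical function names. In the real tool, mismatches at known variant sites are excluded before counting.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Covariate = Tuple[int, int]  # (reported quality, machine cycle)


def build_recal_table(
    observations: Iterable[Tuple[int, int, bool]]
) -> Dict[Covariate, Tuple[int, int]]:
    """Count (observations, mismatches) per covariate bin."""
    counts = defaultdict(lambda: [0, 0])
    for reported_q, cycle, is_mismatch in observations:
        entry = counts[(reported_q, cycle)]
        entry[0] += 1
        entry[1] += int(is_mismatch)
    return {key: (n, e) for key, (n, e) in counts.items()}


def empirical_quality(n_obs: int, n_err: int) -> int:
    """Phred-scale a smoothed error rate: Q = -10 * log10((err + 1) / (obs + 2))."""
    p_error = (n_err + 1) / (n_obs + 2)
    return round(-10 * math.log10(p_error))


def recalibrate(reported_q: int, cycle: int,
                table: Dict[Covariate, Tuple[int, int]]) -> int:
    """Replace the reported quality with the empirical one when the bin was observed."""
    if (reported_q, cycle) not in table:
        return reported_q
    return empirical_quality(*table[(reported_q, cycle)])


# 998 bases reported as Q30 at cycle 1, of which 9 mismatch the reference.
obs = [(30, 1, False)] * 989 + [(30, 1, True)] * 9
table = build_recal_table(obs)
```

In this example the bases claimed Q30 but behave like Q20, so `recalibrate(30, 1, table)` lowers them accordingly.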

Variant Calling

Tools involved: HaplotypeCaller

This step calls SNPs and indels simultaneously via local de novo assembly of haplotypes in an active region.
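A HaplotypeCaller invocation for this step might be assembled as in the sketch below. The helper name and file names are hypothetical, and the specific options are assumptions rather than the pipeline's confirmed settings: GVCF output (`-ERC GVCF`) is assumed because the workflow ends in joint calling, and ignoring soft-clipped bases with a calling threshold of 20 follows common RNA-Seq practice.

```python
from typing import List


def haplotype_caller_cmd(reference: str, bam: str, out_gvcf: str,
                         min_conf: float = 20.0) -> List[str]:
    """Build a per-sample GATK4 HaplotypeCaller command (hypothetical helper)."""
    return [
        "gatk", "HaplotypeCaller",
        "-R", reference,
        "-I", bam,
        "-O", out_gvcf,
        "-ERC", "GVCF",                      # per-sample GVCF for later joint calling
        "--dont-use-soft-clipped-bases",     # avoid calls from clipped intron overhangs
        "--standard-min-confidence-threshold-for-calling", str(min_conf),
    ]


hc_cmd = haplotype_caller_cmd("hg38.fa", "bqsr/s1.recal.bam", "calls/s1.g.vcf.gz")
```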

Required Tools

  • FastQC (A quality control tool for high throughput sequence data)

  • Trim-galore (Automates quality control and adapter trimming of fastq files)

  • STAR (Splice-aware ultrafast aligner of transcripts to a reference genome)

  • Picard (Command-line tool set for manipulating high-throughput sequencing data)

  • Samtools (Tool for manipulating alignments in the SAM/BAM format, including sorting, merging, indexing and generating alignments in a per-position format)

  • GATK4 (Software package covering all major variant classes from NGS datasets)

Index the genome for the 1st-pass alignment. The 2nd-pass alignment uses a new index built from the merged SJ.out.tab files produced by the script.

 STAR  --runMode genomeGenerate --runThreadN 24 --genomeDir ./ --genomeFastaFiles hg38.fa   --sjdbGTFfile gencode.v30.annotation.gtf 
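Merging the per-sample SJ.out.tab files before the second pass could look like the sketch below. It is an illustration, not the pipeline's script: the function name and the `min_unique` filter are assumptions. It relies on the SJ.out.tab column order (chromosome, intron start, intron end, strand as 0/1/2, motif, annotated flag, uniquely mapped read count, ...) and emits the four-column chromosome/start/end/strand form that `--sjdbFileChrStartEnd` accepts.

```python
from typing import Iterable, List, Tuple

Junction = Tuple[str, int, int, str]
STRAND = {"0": ".", "1": "+", "2": "-"}  # SJ.out.tab strand codes


def merge_junctions(tables: Iterable[Iterable[str]],
                    min_unique: int = 1) -> List[Junction]:
    """Union junctions across samples, dropping those with too few unique reads."""
    seen = set()
    for lines in tables:
        for line in lines:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 7:
                continue  # skip malformed or empty lines
            chrom, start, end, strand, _motif, _annotated, unique = fields[:7]
            if int(unique) < min_unique:
                continue
            seen.add((chrom, int(start), int(end), STRAND.get(strand, ".")))
    return sorted(seen)


sample1 = ["chr1\t100\t200\t1\t1\t0\t5\t0\t30\n",
           "chr1\t300\t400\t2\t2\t1\t0\t3\t20\n"]   # second junction has 0 unique reads
sample2 = ["chr1\t100\t200\t1\t1\t0\t2\t0\t25\n",
           "chr2\t50\t90\t0\t0\t0\t4\t1\t12\n"]
merged = merge_junctions([sample1, sample2])
```

Each tuple can be written out as one tab-separated line and passed to STAR when the index is regenerated.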

To run the pipeline on a cluster, use the command below. Modify the cluster.json parameters according to your cluster configuration.

snakemake -j 999 --configfile config.yaml --use-conda --nolock --cluster-config cluster.json --cluster "sbatch -A {cluster.account} -p {cluster.partition} -c {cluster.ncpus} -n {cluster.ntasks} -t {cluster.time} --mem {cluster.mem}"
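The sketch below shows how the `{cluster.*}` placeholders might resolve against a cluster.json file. It only imitates the substitution for illustration. The JSON values and the rule name `star_align` are made up, while the keys match the placeholders in the command and `__default__` is the section Snakemake uses for settings shared by all rules.

```python
import json
from types import SimpleNamespace

SBATCH_TEMPLATE = (
    "sbatch -A {cluster.account} -p {cluster.partition} -c {cluster.ncpus} "
    "-n {cluster.ntasks} -t {cluster.time} --mem {cluster.mem}"
)

# Hypothetical cluster.json content; adjust every value to your own cluster.
EXAMPLE_CLUSTER_JSON = """
{
  "__default__": {"account": "mylab", "partition": "general",
                  "ncpus": 1, "ntasks": 1, "time": "02:00:00", "mem": "4G"},
  "star_align":  {"ncpus": 24, "mem": "40G"}
}
"""


def render_submit(cluster_json: str, rule: str) -> str:
    """Merge __default__ with rule-specific settings and fill the sbatch template."""
    config = json.loads(cluster_json)
    settings = dict(config.get("__default__", {}))
    settings.update(config.get(rule, {}))   # rule entries override the defaults
    return SBATCH_TEMPLATE.format(cluster=SimpleNamespace(**settings))


submit_line = render_submit(EXAMPLE_CLUSTER_JSON, "star_align")
```

A placeholder with no matching key raises an error, which is a quick way to spot a missing entry in cluster.json before submitting jobs.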

rna-seq-variant-calling's People

Contributors

khandaud15

rna-seq-variant-calling's Issues

unindent does not match any outer indentation level (<tokenize>, line 230)

I am getting the following error when running your pipeline on an HPC cluster. Our cluster uses Slurm for scheduling jobs. What could be the main cause of this problem?

IndentationError in line 230 of :
unindent does not match any outer indentation level (<tokenize>, line 230)
File "/opt/exp_soft/anaconda3/lib/python3.7/tokenize.py", line 572, in _tokenize
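One common cause of this message in Python-based files such as a Snakefile is indentation that mixes tabs and spaces. The sketch below is a hypothetical helper that flags such lines. It is not a confirmed diagnosis of what is wrong at line 230.

```python
from typing import List, Tuple


def find_indent_problems(text: str) -> List[Tuple[int, str]]:
    """Report (line number, problem) for lines whose indentation contains tabs."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines cannot cause indentation errors
        indent = line[: len(line) - len(line.lstrip())]
        if " " in indent and "\t" in indent:
            problems.append((number, "mixed tabs and spaces"))
        elif "\t" in indent:
            problems.append((number, "tab indentation"))
    return problems


example = "rule a:\n    input: 'x'\n\tshell: 'y'\n  \toutput: 'z'\n"
report = find_indent_problems(example)
```

Running it on the contents of the Snakefile and checking the lines it reports around line 230 would show whether converting tabs to spaces is likely to help.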
