Coder Social home page Coder Social logo

bcgsc / impala Goto Github PK

View Code? Open in Web Editor NEW
1.0 4.0 0.0 20.52 MB

Integrated Mapping and Profiling of Allelically-expressed Loci with Annotations

License: GNU General Public License v3.0

Python 30.17% Shell 0.15% R 67.78% Perl 1.40% Dockerfile 0.50%
allele-specific-expression bioinformatics cancer snakemake-workflow

impala's Introduction

Integrated Mapping and Profiling of Allelically-expressed Loci with Annotations

DOI example workflow Snakemake

This Snakemake workflow calls allele-specific expression genes using short-read RNA-seq. Phasing information derived from long-read data by tools such as WhatsHap can be provided to increase the performance of the tool, and to link results to features of interest. Copy number variant data, allelic methylation data and somatic variant data can also be provided to analyze genes with allele specific expression.

Table of Contents

Overall Workflow


Installation

This will clone the repository. You can run the IMPALA within this directory.

git clone https://github.com/bcgsc/IMPALA.git

Dependencies

To run this workflow, you must have snakemake (v6.12.3) and singularity (v3.5.2-1.1.el7). You can install snakemake using this guide and singularity using this guide. The remaining dependencies will be downloaded automatically within the snakemake workflow.

Input Files

Method 1: RNA reads:

  • RNA paired end reads (R1 & R2 fastq file)

Method 2§: RNA alignment:

  • RNA alignment alignment (bam file)
  • Expression Matrix
    • Expression in RPKM/TPM
    • Gene name must be in HGNC format
    • Column name is "Gene" and sample names

Optional Inputs:

  • Phase VCF
    • Can be obtained using WhatsHap with DNA long reads
    • Significantly improves precision of ASE calling
    • Adds TFBS mutation and stop gain/loss information
  • Copy Number Variant Data
  • Allelic Methylation
  • Somatic mutations
    • Finds somatic mutations in ASE gene and promoters
  • Tumor Content
    • Used to calcualte the expected major allele frequency
    • Assumes 1.0 if not specified
  • Tissue type
    • Include data for average MAF in normal tissue in summary table
    • Otained from GTex database which ran phASER to calcualte allelic expression

Running Workflow

Edit the config files

Example parameters.yaml:

Config files to specify parameters and paths needed for the workflow. The main parameter to include is the genome name, path to expression matrix, major allele frequency threshold and threads as well as settings for using phased vcf and doing cancer analysis.

# genome_name should match bams
genome_name: hg38/hg19/hg38_no_alt_TCGA_HTMCP_HPVs

# RPKM matrix of the samples
matrix: /path/to/expression/matrix.tsv

# Major allele frequency threshold for ASE (0.5 - 0.75)
maf_threshold: 0.65

# Threads for STAR, RSEM, Strelka and MBASED
threads: 72

# Use phased vcf (True or False)
# Uses pseudphasing algorithm if False
phased: True

# Perform cancer analysis 
# Intersect with optional input
cancer_analysis: True


# Paths for annotation
annotationPath:
    snpEff_config:
        /path/to/snpEff/config
    snpEff_datadir:
        /path/to/snpEff/binaries/data
    snpEff_genomeName:
        GRCh38.100
    snpEff_javaHeap:
        64g

# Paths for references
# Only needed if RNA read is provided instead of RNA bam
starReferencePath:
    /path/to/star/ref
rsemReferencePath:
    /path/to/rsem/ref

Example samples.yaml:

Main config file to specify input files. For input method 1 using R1 and R2 fastq file, use R1 and R2 tag. For input method 2 using RNA bam file, use rna tag. All other tags are optional.

samples:
    # Sample Name must match expression matrix
    sampleName_1: # Method 1
        R1:
            /path/to/RNA/R1.fq
        R2:
            /path/to/RNA/R2.fq
        somatic_snv:
            /path/to/somatic/snv.vcf
        somatic_indel:
            /path/to/somatic/indel.vcf
        tissueType:
            Lung
    sampleName_2: # Method 2
        rna:
            /path/to/RNA/alignment.bam
        phase:
            /path/to/phase.vcf.gz
        cnv:
            /path/to/cnv/data
        methyl:
            /path/to/methyl/data.bed
        tumorContent:
            0.80

Example defaults.yaml:

Config file for specify path for reference genome, annotation bed file and centromere bed file. Annotation and centromere bed file for hg38 are included in the repository.

genome:
    hg19:
        /path/to/hg19/ref.fa
    hg38:
        /path/to/hg38/ref.fa
    hg38_no_alt_TCGA_HTMCP_HPVs:
        /path/to/hg38_no_alt_TCGA_HTMCP_HPVs/ref.fa

annotation:
    hg19:
        /path/to/hg19/annotation.fa
    hg38:
        annotation/biomart_ensembl100_GRCh38.sorted.bed  
    hg38_no_alt_TCGA_HTMCP_HPVs:
        annotation/biomart_ensembl100_GRCh38.sorted.bed

centromere:
    hg19:
        /path/to/hg19/centromere.bed
    hg38:
        annotation/hg38_centromere_positions.bed
    hg38_no_alt_TCGA_HTMCP_HPVs:
        annotation/hg38_centromere_positions.bed

Run snakemake

This is the command to run it with singularity. The -c parameter can be used to specify maximum number of threads. The -B parameter is used to speceify paths for the docker container to bind.

snakemake -c 30 --use-singularity --singularity-args "-B /projects,/home,/gsc"

Output Files

All output and intermediary files is found in output/{sample} directory. The workflow has four main section, alignment, variant calling, mbased and cancer analysis and their outputs can be found in the corrosponding directories. The key outputs from the workflow is located below

  1. MBASED related outputs (found in output/{sample}/mbased)
    • The tabular results of the output MBASED_expr_gene_results.txt
    • The rds object of the MBASED raw output MBASEDresults.rds
  2. Summary table of all outputs
    • Found in output/{sample}/summaryTable.tsv
    • Data of all phased genes with ASE information along potential causes based on optional inputs
  3. Figures
    • Found in output/{sample}/figures
    • Example figure shown below

Summary Table Description

Column Description
gene HGNC gene symbol
Expression Expression level
allele1IsMajor T/F if allele 1 is the major allele (allele 1 = HP1)
majorAlleleFrequency Major allele frequency
padj Benjamini-Hochberg adjusted pvalue
aseResults ASE result based on MAF threshold (and pval)
cnv.A1 Copy Number for allele 1
cnv.B1 Copy Number for allele 2
expectedMAF1 Expect Major Allele Frequency based on CNV
cnv_state1 Allelic CNV state (Loss of Heterozygosity, Allelic balance/imbalabnce)
methyl_state2 Methylation difference in promter region (Allele 1 - Allele 2)
tf_allele3 Allele where there is gain of transcription factor binding site
transcriptionFactor3 Transcription Factor for gain TFBS
stop_variant_allele3 Allele where stop gain/stop loss variant is found
somaticSNV4 Somatic SNV found in (or around) gene (T/F)
somaticIndel4 Somatic Indel found in (or around) gene (T/F)
normalMAF5 Add MAF for gene in normal tissue
cancer_gene T/F if gene is a known cancer gene (based on annotation/cancer_gene.txt)
sample Sample Name

Columns only included if optional input is included:

1 Copy number variant 2 Allelic methylation 3 Phased vcf 4 Somatic SNV and Indel 5 Tissue type

Example Figures

Several figures are automatically generate based on the optional inputs. They can be found in output/{sample}/figures. The main figure is karyogram.pdf which show co-locationzation of ASE genes with allelic methylation and somatic copy number alteration. Example figures can be found here.

Contributors

The contributors of this project are Glenn Chang, Vannessa Porter, and Kieran O'Neill.

License

IMPALA is licensed under the terms of the GNU GPL v3.

License: GPL v3

impala's People

Contributors

glenn032787 avatar oneillkza avatar vanessa-porter avatar vanessaporter avatar

Stargazers

 avatar

Watchers

 avatar  avatar  avatar  avatar

impala's Issues

Potential issues from simulated data

Hey! One thing I forgot to mentioned during the meeting is that I am in the process of doing titration testing using the simulated data. Specifically, I am changing tumor content (ratio between biallelic normal and allelic cancer genes) and ASE fold change. The results could potentially be used as a option to automatically adjust the major allele frequency threshold based on tumor content in IMPALA. It could also be a good figure in the IMPALA paper.

However, many of the data still have low recall rate of 0.45 (a lot of false negatives) but higher precision (0.85-0.9). After investigating, I noticed some issues or limitations with IMPALA:

  1. SnpEff annotates using ensembl106 rather than ensembl100 so some gene names do not match with the rest of the workflow. It is not easy to make snpEff use ensembl100 instead but this shouldn't affect too many genes.
  2. The biomart_ensembl100_GRCh38.sorted.bed annotation file is missing pseudogenes (should be an easy fix)
  3. Most of the false negatives comes from overlapping genes. When there is a variant in overlapping region, SnpEff will annotated as one of the genes only. Overlapping genes seems more common in pseudogenes so I am rerunning the simulator to exclude pseudogenes to see if the issue with false negative mainly comes from this issue. However, I think it is important to note than this is one of the limitation for IMPALA.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.