Tool for integrative gene-based association analysis using GWAS summary stats

GAMBIT

A C++ tool for Gene-based Analysis with oMniBus, Integrative Tests

  • Implements several gene-based test forms (quadratic: weighted sum of Zsq, linear: weighted sum of Z, and maximum Zsq) to aggregate GWAS single-variant summary statistics cross-referenced with variant- or region-based functional annotations
  • Calculates annotation-stratified gene-based tests (e.g., TWAS/PrediXcan tests using eSNPs, gene-based tests using only coding variants, and gene-based tests using enhancer-to-target-gene maps), and omnibus tests by combining p-values for each gene
  • Inputs: GWAS association summary statistics file (chromosome, position, ref/alt allele, and z-score or beta-hat + se), annotation files, and LD reference panel
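The three test forms named above can be illustrated with a minimal sketch. It computes only the raw statistics from z-scores and weights. Turning them into p-values requires the LD reference panel and is not shown. The function names are hypothetical and are not part of GAMBIT.

```python
def quadratic_stat(z, w):
    """Weighted sum of squared z-scores (the 'quadratic' form)."""
    return sum(wj * zj * zj for zj, wj in zip(z, w))


def linear_stat(z, w):
    """Weighted sum of z-scores (the 'linear' form)."""
    return sum(wj * zj for zj, wj in zip(z, w))


def max_stat(z):
    """Maximum squared z-score across variants."""
    return max(zj * zj for zj in z)
```

For example, with z-scores `[1.0, -2.0, 0.5]` and weights `[1.0, 1.0, 2.0]`, the quadratic statistic is 5.5, the linear statistic is 0.0, and the maximum squared z-score is 4.0.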

GWAS Summary Statistics

  • GWAS summary statistics files can be specified via --gwas my_summary_stats.txt.gz. Input files must be ordered by chromosome and genomic position, with input fields as shown below:
#CHR  POS     REF  ALT  SNP_ID      N         ZSCORE   ANNO
1     721290  C    G    rs12565286  58663.62  0.86661  Intergenic
1     752566  G    A    rs3094315   57135     0.5521   Intergenic
1     775659  A    G    rs2905035   54570     1.12098  Intron:LOC643837
1     777122  A    T    rs2980319   54570     1.11906  Exon:LOC643837
  • The first four fields and ZSCORE are required, while SNP_ID, ANNO and N (effective sample size) are optional.
  • See format_gwas_summary_stats.sh for annotating GWAS summary statistics files using EPACTS/TabAnno.
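As a rough illustration of the input requirements above, the following sketch reads the header, parses records, and checks ordering. It assumes whitespace-delimited fields and treats "ordered by chromosome" as "each chromosome appears in one contiguous run"; GAMBIT's own parser may be stricter. All function names are hypothetical.

```python
REQUIRED = ("CHR", "POS", "REF", "ALT", "ZSCORE")


def parse_header(line):
    """Map column names to indices, dropping the leading '#'."""
    names = line.lstrip("#").split()
    missing = [c for c in REQUIRED if c not in names]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    return {name: i for i, name in enumerate(names)}


def parse_record(line, cols):
    """Extract the required fields from one data line."""
    f = line.split()
    return {
        "chr": f[cols["CHR"]],
        "pos": int(f[cols["POS"]]),
        "ref": f[cols["REF"]],
        "alt": f[cols["ALT"]],
        "z": float(f[cols["ZSCORE"]]),
    }


def z_from_beta_se(beta, se):
    """Z-score from an effect estimate and its standard error."""
    return beta / se


def is_sorted(records):
    """True if chromosomes are contiguous and positions never decrease."""
    seen, prev_chr, prev_pos = set(), None, None
    for r in records:
        if r["chr"] != prev_chr:
            if r["chr"] in seen:
                return False  # chromosome reappears after another one
            seen.add(r["chr"])
            prev_chr = r["chr"]
        elif r["pos"] < prev_pos:
            return False
        prev_pos = r["pos"]
    return True
```

Running `is_sorted` over a file before passing it to `--gwas` is one cheap way to catch ordering problems early.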

Annotation-Stratified Gene-Based Tests

Gene-Based Analysis with Regulatory Elements

  • To compute gene-based tests using regulatory element annotations, specify an annotation bed file with regulatory-element-to-target-gene weights via --anno-bed my_reg_elems.txt.gz, formatted as below:
#CHR  START   END     CLASS     ELEMENT_ID          TARGET_GENES                     ANNO
chr1  567400  567600  Enhancer  chr1:567400:567600  MIB2:4.12|CPTP:2.53|GLTPD1:2.53  .
chr1  568000  568200  Enhancer  chr1:568000:568200  ATAD3A:2.75                      .
chr1  758600  758800  Enhancer  chr1:758600:758800  C1orf170:2.57|PERM1:2.57         .
chr1  769200  769400  Enhancer  chr1:769200:769400  C1orf170:3.36|PERM1:3.36         .
  • Association tests for individual regulatory elements are reported in *.stratified_out.txt files, and gene-based p-values (aggregating across regulatory elements for each gene) in *.summary_out.txt files.

  • Aggregation Methods for Regulatory Elements. By default, GAMBIT aggregates test statistics across variants in regulatory elements using a weighted sum of single-variant chi-squared statistics (SKAT gene-based test). To instead use weighted ACAT or HMP to combine single-variant p-values, specify --no-skat and a p-value combination method via --pcomb.
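The TARGET_GENES field in the annotation bed example packs gene/weight pairs into one string. A minimal sketch of decoding it, assuming the `GENE:WEIGHT|GENE:WEIGHT` layout shown in the example rows (the function names are hypothetical):

```python
def parse_target_genes(field):
    """'MIB2:4.12|CPTP:2.53' -> {'MIB2': 4.12, 'CPTP': 2.53}"""
    weights = {}
    for item in field.split("|"):
        # Split on the last ':' so the weight is always the final piece.
        gene, _, weight = item.rpartition(":")
        weights[gene] = float(weight)
    return weights


def elements_by_gene(rows):
    """Group (element_id, target_genes_field) pairs by target gene."""
    by_gene = {}
    for element_id, field in rows:
        for gene, weight in parse_target_genes(field).items():
            by_gene.setdefault(gene, []).append((element_id, weight))
    return by_gene
```

Grouping by gene mirrors the two output levels described above: one entry per regulatory element, and one aggregate per gene.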

Gene-Based Analysis with Coding and Other Annotated Variants

  • To compute gene-based tests using coding and other variants, GAMBIT relies on the ANNO field in GWAS summary statistics and an annotation hierarchy definitions file specified via --anno-defs my_defs.txt, formatted as below:
#CLASS    SUBCLASS          ANNO_TERMS
Coding    Protein_Altering  Nonsynonymous,Start_Loss,Stop_Gain,Stop_Loss,CodonGain,CodonLoss,Frameshift
Coding    Splice_Site       Essential_Splice_Site,Normal_Splice_Site
Coding    Exon_Other        Exon,Synonymous
UTR       UTR3              Utr3
UTR       UTR5              Utr5
  • The ANNO_TERMS field specifies a comma-separated list of annotation terms (matching terms from the GWAS summary statistics file's ANNO field), and CLASS and SUBCLASS determine the annotation hierarchy and classes reported in output files.

  • Gene-Based Test Output. Test statistics stratified by gene and annotation subclass are provided in *.stratified_out.txt files, and gene-based p-values (aggregating across annotation classes for each gene) in *.summary_out.txt files.

  • Variant Aggregation Methods. By default, GAMBIT aggregates test statistics across variants using a weighted sum of single-variant chi-squared statistics (SKAT gene-based test). To instead use weighted ACAT or HMP to combine single-variant p-values, specify --no-skat and a p-value combination method via --pcomb.
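To make the annotation hierarchy concrete, here is a minimal sketch that loads a definitions file into a lookup table and classifies an ANNO value. It assumes ANNO values look like `Term:Gene` or a bare `Term`, as in the GWAS example rows; this is inferred from the examples rather than from a format specification, and the function names are hypothetical.

```python
def parse_anno_defs(lines):
    """Map each annotation term to its (CLASS, SUBCLASS) pair."""
    term_to_class = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header
        cls, subclass, terms = line.split()
        for term in terms.split(","):
            term_to_class[term] = (cls, subclass)
    return term_to_class


def classify(anno, term_to_class):
    """Return ((CLASS, SUBCLASS) or None, gene or None) for an ANNO value."""
    term, _, gene = anno.partition(":")
    return term_to_class.get(term), (gene or None)
```

Terms that do not appear in the definitions file, such as `Intergenic` in the example, map to `None` in this sketch.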

TWAS Analysis

  • To compute TWAS/PrediXcan gene-based tests using GAMBIT, specify an eWeight file via --eweights my_eWeights.txt.gz, formatted as below:
##TISSUE_IDS=0:Adipose_Subcutaneous,1:Adipose_Visceral_Omentum,2:Adrenal_Gland,3:Artery_Aorta
#CHR  POS     RSID       REF  ALT  BETAS
1     752566  rs3094315  G    A    C1orf159=3.92e-02@0|UBE2J2=-1.49e-02@0|FAM87B=2.75e-01@1;1.25e-01@2;1.17e-01@3
1     752721  rs3131972  A    G    LINC00115=1.15e-01@0;1.75e-02@3;4.90e-02@4|RP11-206L10.8=3.21e-02@1
1     754182  rs3131969  A    G    LINC00115=-2.1e-02@1|RP5-857K21.2=-8.27e-02@2|RP11-206L10.9=-1.11e-01@2
1     760912  rs1048488  C    T    C1orf159=3.35e-04@0|TTLL10=-1.4e-02@3|FAM87B=1.75e-01@1;1.12e-01@2;9.51e-02@3|SAMD11=-1.27e-02@2
  • The BETAS field format is eGene_A=Weight_A1@Tissue_A1;Weight_A2@Tissue_A2|eGene_B=Weight_B1@Tissue_B1, and labels for tissue IDs can be specified in the header.
  • Subsetting tissues. To restrict analysis to a subset of tissues/cell-types, specify a comma-separated list of tissues following the --tissues flag. By default, GAMBIT includes all tissues/cell-types present in the eWeight file.
  • Tissue Aggregation for Omnibus Tests. GAMBIT reports both single-tissue TWAS/PrediXcan results and omnibus test results that aggregate across all specified tissues/cell-types for each eGene. Omnibus p-values for multi-tissue TWAS/PrediXcan analysis can be calculated using 1) the maximum single-tissue test statistic, evaluated against the joint distribution of single-tissue statistics (MinP), 2) the sum of squared single-tissue z-scores, analogous to SKAT (SKAT), or 3) p-value combination via ACAT or HMP (PCOMB, the default). The omnibus method is specified via --tissue-aggreg (PCOMB, MinP, SKAT, or ALL), and the p-value combination method via --pcomb (ACAT or HMP).
  • Single-tissue and omnibus test output. Gene-based tests and p-values for each eGene-tissue pair are reported in *.stratified_out.txt files, and omnibus p-values (aggregating across all tissues for each eGene) in *.summary_out.txt files.
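The BETAS field and the tissue header can be decoded with a few string splits. The sketch below follows the `eGene=Weight@Tissue;Weight@Tissue|eGene=...` layout described above and assumes tissue IDs are integers, as in the example file. The function names are hypothetical.

```python
def parse_tissue_header(line):
    """'##TISSUE_IDS=0:Adipose_Subcutaneous,1:...' -> {0: 'Adipose_Subcutaneous', ...}"""
    _, _, body = line.strip().partition("=")
    labels = {}
    for item in body.split(","):
        idx, _, name = item.partition(":")
        labels[int(idx)] = name
    return labels


def parse_betas(field):
    """Return {eGene: {tissue_id: weight}} for one BETAS value."""
    out = {}
    for gene_block in field.split("|"):
        # Gene names may contain '-' or '.', so split on '=' first.
        gene, _, pairs = gene_block.partition("=")
        per_tissue = {}
        for pair in pairs.split(";"):
            weight, _, tissue = pair.partition("@")
            per_tissue[int(tissue)] = float(weight)
        out[gene] = per_tissue
    return out
```

Restricting to a subset of tissues, as with `--tissues`, would then amount to filtering the inner dictionaries by tissue ID.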

dTSS-Weighted Gene-Based Tests

  • To incorporate un-annotated regulatory variants in gene-based analysis, GAMBIT implements a dTSS (distance to Transcription Start Site) weighted gene-based test, which aggregates all single-variant p-values within a specified window from each gene's TSS using weighted ACAT or HMP and assigns higher weight to variants nearer the TSS using an exponential decay function.
  • To compute dTSS-weighted gene-based tests, specify a TSS bed file via --tss-bed my_tss_bed.bed.gz, formatted as below:
#CHR  START   END     SYMBOL      GENE             GENE_ANNO
1     11868   11869   DDX11L1     ENSG00000223972  transcribed_unprocessed_pseudogene
1     62947   62948   OR4G11P     ENSG00000240361  transcribed_unprocessed_pseudogene
1     69090   69091   OR4F5       ENSG00000186092  protein_coding
1     131024  131025  CICP27      ENSG00000233750  processed_pseudogene
  • Window size. The window size for dTSS-weighted gene-based tests can be modified by specifying --tss-window BASEPAIRS (500 Kbp by default).
  • dTSS decay function. The relative weight assigned to variants nearer to or farther from the TSS can be modified by specifying --tss-alpha ALPHA, where alpha=0 gives all variants equal weight, and larger values give more weight to variants nearer the TSS. --tss-alpha also accepts a comma-separated list of alpha values, in which case GAMBIT computes a global test p-value across all specified values (individual p-values are reported in the INFO output field). By default, GAMBIT uses dTSS alpha values 1e-4,5e-5,1e-5,5e-6.
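The dTSS weighting idea can be sketched as follows. Two assumptions are made here that the text above does not confirm: the decay is taken to be exp(-alpha * |distance in bp|), and the window is applied on each side of the TSS. The combination step uses the standard weighted Cauchy (ACAT) formula; GAMBIT's implementation may differ in details such as handling of extreme p-values. The function names are hypothetical.

```python
import math


def dtss_weights(positions, tss, alpha, window=500_000):
    """Weight per variant position, assuming exp(-alpha * |pos - tss|).

    Variants farther than `window` bp from the TSS are dropped.
    With alpha = 0 every retained variant gets weight 1.
    """
    return {
        pos: math.exp(-alpha * abs(pos - tss))
        for pos in positions
        if abs(pos - tss) <= window
    }


def acat(pvalues, weights):
    """Weighted Cauchy combination of p-values.

    T = sum(w_i * tan((0.5 - p_i) * pi)) / sum(w_i)
    p = 0.5 - atan(T) / pi
    No special handling for p-values extremely close to 0 or 1.
    """
    total = sum(weights)
    t = sum(w * math.tan((0.5 - p) * math.pi) for p, w in zip(pvalues, weights)) / total
    return 0.5 - math.atan(t) / math.pi
```

To mimic the multi-alpha behaviour described above, one could compute one combined p-value per alpha and then combine those again, although how GAMBIT forms its global p-value is not specified here.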

Feedback and bug reports

  • Feel free to contact Corbin Quick ([email protected]) with questions, bug reports, or feedback

gambit's Issues

LD Reference parameter use is mixed up

The Issue

Whilst trying to run GAMBIT on other (non default) LD reference panels for other populations I noticed that the software kept defaulting to "G1K_EUR_3V5/chr$.vcf.gz". This was because I was trying to pass in the --ldref parameter rather than --ldfile. This incorrect usage is also referenced here: https://xqwen.github.io/ptwas/ - the "Quick Start" use case I was following.

Lines of interest

cerr << " --ldref [data/chr*.vcf.gz] : LD reference panel (\"*\" as wildcard when split by chr)\n";

TLDR;

  • in the line above:
    Should be --ldfile rather than --ldref

GAMBIT is restricted to Chr1-22

GAMBIT/src/Main.cpp

Lines 196 to 198 in 1644b80

for(int i = 22; i>0; --i){
    ldfile = gsubstr(ldfile, "chr" + to_string(i), "chr*");
}

Currently the LD references are restricted to chromosomes 1-22, with no way to provide X, Y or MT chromosomes. Supporting these would be really handy when generating and providing alternative LD reference datasets, including those for other populations. Linked to #1 for limitations in the LD reference data that can be used.
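To illustrate the kind of generalisation being asked for, here is a rough Python sketch of the same wildcard substitution extended to X, Y, and MT. It is not a patch for Main.cpp, and the accepted chromosome labels (`X`, `Y`, `MT`, `M`) are an assumption about how such files would be named. Unlike the loop above, it only replaces a label that is not followed by another letter or digit.

```python
import re

# Longer numeric alternatives come first so 'chr22' is not matched as 'chr2'.
_CHR_TOKEN = re.compile(r"chr(?:2[0-2]|1[0-9]|[1-9]|X|Y|MT|M)(?![0-9A-Za-z])")


def to_wildcard(path):
    """Replace a chromosome label in an LD reference path with 'chr*'."""
    return _CHR_TOKEN.sub("chr*", path)
```

Ordering matters in the original loop for the same reason: counting down from 22 means "chr1" never matches inside "chr10".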
