
# CHiGP

CHiGP stands for Capture Hi-C Gene Prioritisation. Its purpose is to take a set of p-values from a genome-wide association study and integrate them with capture Hi-C data to prioritise genes. The computational engine is written in R and, whilst it should be possible to run everything on a laptop, things will go considerably faster if you have access to a compute cluster. Please note that at the moment we do not attempt to do anything sensible with the MHC region, which we define as chr6:25-35Mb (GRCh37 coordinates).

## Credits

The method and software were co-developed by Chris Wallace and Olly Burren; Olly maintains this code. Test GWAS summary statistics come from Okada et al. 2014, "Genetics of Rheumatoid Arthritis Contributes to Biology and Drug Discovery", Nature 506 (7488): 376–81. The CHiCAGO-called interactions from Mifsud et al. 2015, "Mapping Long-Range Promoter Contacts in Human Cells with High-Resolution Capture Hi-C", Nature Genetics, doi:10.1038/ng.3286, were kindly supplied by Mikhail Spivakov and Paula Freire Pritchett from the Babraham Institute, Cambridge, UK.

# Prerequisites

* Linux or Mac OS X (preferably the former)
* R (developed with 3.2.2)
  * data.table
  * snpStats (Bioconductor)
  * GenomicRanges (Bioconductor)
  * reshape2
  * yaml
  * data.tree
* (optional) make
* (optional) Perl
  * Macd queue software and its dependencies
* htslib (for tabix)

# Data files

* capture Hi-C data: we support the PeakMatrix format as generated by CHiCAGO
* 1000 Genomes VCF files
* a coding SNP BED file

# Initial setup

## Tabix installation

Follow the instructions available from htslib to install tabix. The test scripts assume that the tabix binary is available on your PATH.

First try

```sh
which tabix
```

If this returns a valid path then you are OK. If tabix is installed elsewhere, or is not on your PATH, set the following environment variable:

```sh
TABIX_BIN=/path/to/tabix
export TABIX_BIN
```

## Setup environment

All scripts and instructions assume that you have a git repository directory in which you have cloned CHIGP, e.g. /home/USERNAME/gitr/CHIGP. We need to set a shell environment variable pointing to that directory:

```sh
## assumes the git repo directory is gitr in your home dir and CHIGP is cloned inside it; substitute your own USERNAME
GRPATH="/home/USERNAME/gitr/"
export GRPATH
```

## Install R dependencies

This assumes that you have R installed and are able to install R packages.

Install the Bioconductor dependencies:

```R
source("http://bioconductor.org/biocLite.R")
biocLite(c("snpStats", "GenomicRanges"))
```

Then install the CRAN packages:

```R
install.packages(c("data.table", "reshape2", "yaml", "data.tree"))
q()
```

## Get test data

Next grab the test file bundle. This contains a cut-down set of real data taken from Mifsud et al., GWAS summary statistics from Okada et al. and reference genotypes from the EUR samples of the 1000 Genomes Project.

```sh
cd /home/USERNAME/gitr/CHIGP
curl -s https://www.immunobase.org/downloads/CHIGP/CHIGP.databundle.tgz | tar -xvz
```

The above command should have created the DATA directory.

# Test run

Once you have all of the above dependencies and data installed you should be able to take CHIGP for a spin. Running CHIGP consists of three steps:

1. Convert p-values to posterior probabilities using a reference set of genotypes.
2. Generate support files for the gene score algorithm.
3. Integrate CHiC interaction and other functional data to prioritise genes.

For a given GWAS it should only be necessary to run step 1 once (unless of course you want to fiddle with parameters). You might want to run step 2 multiple times depending on which capture Hi-C datasets are available to you.

## Quickstart

See below for directions on running each of these steps individually. For convenience, and to test whether you have installed all dependencies, you can run:

```sh
cd CHIGP/sh
make -f Makefile.ppi
## if you want to start again at anytime run
## make -f Makefile.ppi clean
```

This should create gene scores for all of chr22, based on the Mifsud et al. interactions and the Okada et al. GWAS for rheumatoid arthritis.

## Converting p-values to posterior probabilities

For a detailed discussion of how we do this please see Wakefield (2009), Giombartelli (2014) and Pickrell (2014); in fact the code used here is adapted from Giombartelli. Briefly, we split the genome into blocks based on a recombination frequency of 0.1cM. For each block we estimate the minor allele frequency (MAF) of every SNP in that block that is available in the reference genotyping dataset (note that if a SNP in your GWAS is not in the reference genotyping set it will be omitted from all subsequent steps). We use this to compute an approximate Bayes factor for each SNP being causal given the data, and from these we can estimate, within each block, the posterior probability (PPi) that each SNP is causal.
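To make the calculation concrete, here is a minimal, illustrative R sketch of a Wakefield-style approximate Bayes factor and the within-block normalisation to PPi. The prior effect variance `W`, the variance approximation for a case/control log odds ratio and the toy inputs are assumptions for this example only; they are not necessarily the values or the exact code CHIGP uses.

```R
## Illustrative sketch only: Wakefield-style approximate Bayes factors and
## per-block posterior probabilities. W and the variance approximation are
## assumptions for this example, not necessarily CHIGP's internal defaults.
approx.bf <- function(p, maf, n, prop.cases, W = 0.2^2) {
  z <- qnorm(p / 2, lower.tail = FALSE)   # |Z| score from a two-sided p-value
  ## approximate variance of the estimated log odds ratio
  V <- 1 / (2 * n * maf * (1 - maf) * prop.cases * (1 - prop.cases))
  r <- W / (W + V)
  0.5 * (log(1 - r) + r * z^2)            # log approximate Bayes factor
}

## Toy block: three SNPs with GWAS p-values and reference-panel MAFs
lbf <- approx.bf(p = c(1e-6, 1e-3, 0.2),
                 maf = c(0.12, 0.30, 0.45),
                 n = 1e5, prop.cases = 0.25)

## Normalise within the recombination block to obtain a PPi for each SNP
ppi <- exp(lbf - max(lbf))
ppi <- ppi / sum(ppi)
round(ppi, 4)
```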

By way of illustration, using our test dataset we can compute PPi for all variants on chr22:

```sh
cd CHIGP/sh
./test_ppi.sh
# results are returned in CHIGP/data/out/
```

If you have access to high-performance computing the above step is easily parallelised; see the section below on how to do this.

## Generating support files

We next need to do some housekeeping to generate the files that the algorithm uses to compute gene scores and thus prioritise genes. Again we illustrate this process using data from:

1. Interactions from Mifsud et al. (2015)
2. Recombination frequency data from the HapMap project (URL needed)
3. Functional annotation taken from Ensembl e75 using VEP (citation needed)

```sh
cd CHIGP/sh ## if not already there from the previous step
./test_gen_resource_files.sh
```

## Computing gene scores

Next we compute the final gene scores, using chr22 as an example, with data taken from Okada et al. (2014):

```sh
cd CHIGP/sh ## if not already there from the previous step
./test_compute_gene_scores_ppi.sh
# results are returned in CHIGP/data/out/
```

## Interpreting the results

Here is a results table (split over two tables, with headings truncated for readability) of the top results ranked by the 'all' gene score:

| disease | ensg | name | biotype | strand | baitChr |
|---|---|---|---|---|---|
| 0.1cM_chr22.imp | ENSG00000100321 | SYNGR1 | protein_coding | + | 22 |
| 0.1cM_chr22.imp | ENSG00000161180 | CCDC116 | protein_coding | + | 22 |
| 0.1cM_chr22.imp | ENSG00000161179 | YDJC | protein_coding | - | 22 |
| 0.1cM_chr22.imp | ENSG00000100023 | PPIL2 | protein_coding | + | 22 |
| 0.1cM_chr22.imp | ENSG00000128228 | SDF2L1 | protein_coding | + | 22 |

| name | all | CD34 | coding | GM12878 | promoter |
|---|---|---|---|---|---|
| SYNGR1 | 0.6919344 | 0.6919344 | 0.00e+00 | 0.6917125 | 0.6917124 |
| CCDC116 | 0.5201607 | 0.5062241 | 1.07e-05 | 0.5128092 | 0.3676899 |
| YDJC | 0.5064901 | 0.4987388 | NA | 0.5064809 | 0.3825915 |
| PPIL2 | 0.3698667 | 0.0007144 | NA | 0.3698667 | 0.0006832 |
| SDF2L1 | 0.3680642 | 0.3679398 | NA | 0.3680642 | 0.0003168 |

This is just an example analysis; for a real analysis we would probably want to set a cut-off of 0.5 on the 'all' gene score (genes below this are likely to be highly speculative) before manual interpretation.

For this example we can infer that the SYNGR1 prioritisation is driven by a signal within the promoter region of the gene. The CCDC116 and YDJC genes have some evidence for non-tissue-specific interactions (we see similar scores across cell types); however, the promoter region seems to be the major driver of these prioritisations. PPIL2 appears to be driven by a tissue-specific interaction in the GM12878 cell line. Finally, SDF2L1 appears to be driven by a non-tissue-specific interaction.

# PMI

PMI stands for Poor Man's Imputation. It uses a reference genotype set together with a set of GWAS summary statistics to infer association p-values for SNPs not covered in the study. It does this by computing LD and then assigning each missing SNP the p-value of the SNP in highest LD with it that does have a p-value. Users can set a threshold (default r^2 > 0.6); if a SNP's maximum LD with any SNP that has a p-value falls below this threshold, it is dropped. It is a simple method, implemented to boost resolution when computing posterior probabilities.
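The idea can be sketched in a few lines of R using snpStats to compute r² from a reference panel. The object names below (`ref.geno`, `typed`, `untyped`, `gwas.p`) are placeholders for this illustration, not CHIGP internals, and the 0.6 threshold simply mirrors the default described above.

```R
## Sketch of the PMI idea: give an untyped SNP the p-value of the typed SNP it
## is in strongest LD with, provided r^2 exceeds a threshold (default 0.6).
## ref.geno (a snpStats SnpMatrix of reference genotypes), typed/untyped
## (column-name vectors) and gwas.p (named p-value vector) are placeholders.
library(snpStats)

pmi <- function(ref.geno, typed, untyped, gwas.p, r2.thresh = 0.6) {
  ## r^2 between every untyped SNP (rows) and every typed SNP (columns)
  r2 <- as.matrix(ld(ref.geno[, untyped], ref.geno[, typed], stats = "R.squared"))
  best <- apply(r2, 1, which.max)            # typed SNP in highest LD with each untyped SNP
  best.r2 <- r2[cbind(seq_along(best), best)]
  imputed <- gwas.p[typed[best]]             # borrow that SNP's p-value
  imputed[best.r2 < r2.thresh] <- NA         # drop SNPs below the LD threshold
  data.frame(snp = untyped, proxy = typed[best], r2 = best.r2, p = imputed,
             row.names = NULL)
}
```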

## Quickstart

This follows a similar route to the PPi method described previously, except for a parameter change when computing the initial posterior probabilities.

```sh
cd CHIGP/sh
make -f Makefile.pmi
## if you want to start again at anytime run
## make -f Makefile.pmi clean
```

You can run both and compare the results using the ./R/testPPiVsPMIGeneScores.R script.

## BLOCKSHIFTER

Blockshifter is software that allows us to compare the distribution of GWAS signals between two sets of 'prey' regions. Because the underlying technology detects restriction fragments, there is significant correlation between fragments, which we need to take into account when looking for GWAS signal enrichment. The GWAS data are also correlated due to linkage disequilibrium, so we implement a technique based on GOSHIFTER, whereby the correlation structure due to LD is preserved and instead the annotations are permuted by shifting their locations while maintaining their order. We adapt the method to also take into account the block structure of interactions, and thus shift blocks of interactions over the SNP data to estimate the null distribution of the difference in weighted means between the test and control sets.
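As a rough schematic of the permutation idea (not the actual Blockshifter implementation), the null can be built by rotating block labels along the SNP order, as in the sketch below; `ppi`, `snp.block` and `block.set` are placeholder inputs for this illustration, and plain means stand in for the weighted means described above.

```R
## Schematic of the block-shift permutation: the observed statistic is the
## difference in mean GWAS signal (here, PPi) between SNPs in 'test' and
## 'control' prey blocks; the null is built by circularly rotating the block
## labels along the genome so LD and interaction-block structure are preserved.
## ppi (per-SNP PPi in genomic order), snp.block (block index of each SNP) and
## block.set (per-block "test"/"control" label) are placeholders.
delta <- function(ppi, labels) {
  mean(ppi[labels == "test"]) - mean(ppi[labels == "control"])
}

blockshift.null <- function(ppi, snp.block, block.set, nperm = 100) {
  obs <- delta(ppi, block.set[snp.block])
  nblocks <- length(block.set)
  null <- replicate(nperm, {
    shift <- sample(nblocks - 1, 1)          # random rotation of the block labels
    rotated <- block.set[((seq_len(nblocks) + shift - 1) %% nblocks) + 1]
    delta(ppi, rotated[snp.block])
  })
  list(observed = obs, null = null)
}
```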

```sh
cd CHIGP/sh ## if not already there from the previous step
./test_blockshifter_pmi.sh
# results are returned in CHIGP/data/out/
```

## Interpreting the results

| gwas | test | control | perm | p.emp | z | p.val.z | delta |
|---|---|---|---|---|---|---|---|
| 0.1cM_chr22.ppi | GM12878 | CD34 | 100 | 0.97 | -1.781786 | 0.0747841 | -2.13e-05 |

For our test dataset we have compared GM12878 vs CD34 for enrichment of RA GWAS signals from Okada et al., rescaled as posterior probabilities, for chromosome 22. Unsurprisingly, given only 100 permutations, we find no evidence for enrichment. p.emp = 0.97 is the empirically computed p-value and is robust to the distribution of the permuted nulls. The z score and p.val.z are computed under the assumption that the permuted null deltas are normally distributed, which might inflate significance if violated; they are included to facilitate comparison between different GWAS.
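Given an observed delta and a vector of permuted null deltas (as in the sketch above), summary statistics of this form could be derived roughly as follows; this is a sketch of one common convention, not necessarily the exact definitions Blockshifter uses.

```R
## Sketch: summary statistics from an observed delta and permuted null deltas.
summarise.blockshifter <- function(obs, null) {
  p.emp   <- (sum(null >= obs) + 1) / (length(null) + 1)  # empirical, distribution-free
  z       <- (obs - mean(null)) / sd(null)                # assumes null deltas are ~normal
  p.val.z <- 2 * pnorm(-abs(z))                           # two-sided normal p-value
  c(p.emp = p.emp, z = z, p.val.z = p.val.z)
}
```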

# TODO

* Add a section on parallelising over an HPC cluster.
* Describe the set-based and hierarchical Bayes factor comparison method.
* Citations for HapMap recombination data and the Ensembl VEP tool.
* Add code for annotating interaction sets (adding gene names and biotypes) for peakMatrix and WashU formats.


# chigp's Issues

## Test data no longer available

1. The test data set for CHIGP is no longer available at https://www.immunobase.org/downloads/CHIGP/. Could you make it available on another site?

2. I am wondering about the format of the --region_file used in computePPi.R and am unclear on how it is generated. As the test data set is no longer available, I am unsure how to proceed. Could you provide information on how to generate this?

3. I have a capture Hi-C peak matrix, full summary statistics from a GWAS study, and VCF files for each chromosome downloaded from the 1000 Genomes Project. As detailed in the prerequisites of the README, this should be sufficient to run CHIGP. However, the R scripts seem to require additional supporting files (the --region_file used in computePPi.R, the --cSNP_file for generateResourceFiles.R, recombination frequency data from the HapMap project [which is also no longer available], etc.). Could you provide links for where to source all the supporting files, or instructions on how to generate them? I am sure some of this will become clearer once I look at the test data.

Thank you!

## TNFAIP3 has very different gene score between DIRECT and IMB

The gene score for TNFAIP3 is significantly different between the two GWAS sources even though they should be similar. We need to understand whether this is an issue with the ImmunoBase-curated data or with the CHIGP data processing and analysis.

## Trouble running tutorial

Hi,

I'm trying to run the test for COGS. Unfortunately, I'm unable to complete the tutorial, as I hit the following error during test_ppi.sh:

```
Error in fread(paste(tabix_bin, kg_file, region, " | grep PASS"), sep = "\t", :
  File is empty: /dev/shm/file48bccbe4682
```

Any help is fully appreciated.

Ariel
