#CHiGP
CHiGP stands for Capture Hi-C Gene Prioritisation. It's purpose is to take a bunch of p-vals from a genome-wide association study and integrate these with capture Hi-C data to prioritise genes. The computational engine is written in R and whilst it should be possible to process on a laptop, things will go considerably faster if you have access to a compute cluster. Please note that at the moment we don't attempt to do anything sensible with MHC regions which we define as a region on chr6(25-35Mb) GRCh37 coords.
##Credits
The method and software was co-developed by Chris Wallace and Olly Burren. Olly maintains this code. Test GWAS summary statistics come from Okada et al. 2014. “Genetics of Rheumatoid Arthritis Contributes to Biology and Drug Discovery.” Nature 506 (7488): 376–81. and CHiCAGO called interactions from Mifsud et al. 2015. “Mapping Long-Range Promoter Contacts in Human Cells with High-Resolution Capture Hi-C.” Nature Genetics, May. Nature Publishing Group. doi:10.1038/ng.3286. were kindly supplied by Mikhail Spivakov and Paula Freire Pritchett from the Babraham Institute, Cambridge UK.
#Prerequisites
- Linux or Mac OS X (preferably the former)
- R (developed with 3.2.2)
- data.table
- snpStats (bioconductor)
- GenomicRanges (bioconductor)
- reshape2
- yaml
- data.tree
- (optional) make
- (optional) PERL
- Macd queue software and dependencies.
- htslib for tabix
#Data files
- capture Hi-C data we support PeakMatrix format as generated by CHiCAGO
- 1000 Genomes VCF files.
- coding SNP bed file
#Initial setup
##Tabix installation
Follow instructions available from htslib to install tabix. Test software assumes that the location of the tabix binary is available on the path.
First try
which tabix
if this returns a valid path then you are OK. If you have a different path then you should do the following
TABIX_BIN=/path/to/tabix
export TABIX_BIN
##Setup Environment
All scripts and instructions assume that you have a git repository directory in which you have cloned CHIGP called /home/USERNAME/gitr/CHIGP. We need to setup a shell environment variable as to where that is
## assume git repo is in your home dir as CHIGP and is called gitr, you need to add your own USERNAME.
GRPATH="/home/USERNAME/gitr/"
export GRPATH
##Install R dependencies
This assumes that you have R installed and have the ability to install R packages.
Install bioconductor dependencies
source("http://bioconductor.org/biocLite.R")
biocLite(c("snpStats","GenomicRanges"))
After you have these install data.table and reshape2
install.packages(c("data.table","reshape2","yaml","data.tree"))
q()
Next grab the test file bundle. This contains a cut down set of real data taken from Misfud et al., GWAS from Okada et al. and reference genotypes from EUR 1000 Genomes Project.
cd /home/USERNAME/gitr/CHIGP
curl -s https://www.immunobase.org/downloads/CHIGP/CHIGP.databundle.tgz | tar -xvz
The above command should have created the DATA directory.
#Test run Once you have all of the above dependencies and data installed you should be able to take CHIGP for a spin. Running CHICGP consists of three steps
- Convert p-vals to posterior probabilities using a reference set of genotypes.
- Generate support files for gene score algorithm.
- Integrate CHiC interaction and other functional data to prioritise genes. For a given GWAS it should only be neccessary to run step 1 only once (unless of course you want to fiddle with parameters). You might want to run step 2 multiple times depending on what capture HiC datasets are available to you.
##Quickstart
See below for directions to run each of these step individually. For convenience and to test whether you have installed all dependencies you can do
cd CHIGP/sh
make -f Makefile.ppi
## if you want to start again at anytime run
## make -f Makefile.ppi clean
This should create gene scores for all of chr22 based on Mifsud interactions and Okada GWAS for Rheumatoid Arthritis.
For a detailed discussion on how we do this please see Wakefield(2009), Giombartelli(2014) and Pickrell(2014). In fact the code used here is adapted from Giombartelli. Briefly, we split the genome into blocks based on a recombination frequency of 0.1cM, for each region we then estimate the minor allele frequency (MAF) of each SNP within that region available in the reference genotyping dataset. Note that if a SNP in your GWAS is not in your reference genotyping set then it will be ommited from future steps. We use this to estimate an approximate Bayes factor for that SNP to be causal given the data, we can then for a given region estimate the posterior probability (PPi) that a SNP is causal.
By way of illustration using our test dataset we can compute PPi for all variants on chr22
cd CHIGP/sh
./test_ppi.sh
# results are returned in CHIGP/data/out/
If you have access to high performance computing the above step is easily parallelised, see section below on how to do this.
We next need to do some housekeeping to generate files that the algorithm can use to compute the gene scores and thus prioritise the genes. Again we illustrate this process using data from :-
- Interactions Mifsud et al.(2015)
- Recombination freq data from HapMap project need URL
- Functional annotatation taken from Ensembl e75 using VEP Need citation.
cd CHICGP/sh ## if not already there from previous step
./test_gen_resource_files.sh
Next we compute final gene scores using chr22 as an example using data taken from Okada et al. 2014
cd CHICGP/sh ## if not already there from previous step
./test_compute_gene_scores_ppi.sh
# results are returned in CHIGP/data/out/
Here is a results table (split over two tables with headings truncated for readability) of the top results by 'all' gene score
disease | ensg | name | biotype | strand | baitChr |
---|---|---|---|---|---|
0.1cM_chr22.imp | ENSG00000100321 | SYNGR1 | protein_coding | + | 22 |
0.1cM_chr22.imp | ENSG00000161180 | CCDC116 | protein_coding | + | 22 |
0.1cM_chr22.imp | ENSG00000161179 | YDJC | protein_coding | - | 22 |
0.1cM_chr22.imp | ENSG00000100023 | PPIL2 | protein_coding | + | 22 |
0.1cM_chr22.imp | ENSG00000128228 | SDF2L1 | protein_coding | + | 22 |
name | all | CD34 | coding | GM12878 | promoter |
---|---|---|---|---|---|
SYNGR1 | 0.6919344 | 0.6919344 | 0.00e+00 | 0.6917125 | 0.6917124 |
CCDC116 | 0.5201607 | 0.5062241 | 1.07e-05 | 0.5128092 | 0.3676899 |
YDJC | 0.5064901 | 0.4987388 | NA | 0.5064809 | 0.3825915 |
PPIL2 | 0.3698667 | 0.0007144 | NA | 0.3698667 | 0.0006832 |
SDF2L1 | 0.3680642 | 0.3679398 | NA | 0.3680642 | 0.0003168 |
This is just an example analysis for true analysis we would probably want to set a cut off of 0.5 on all_gene_score (things below this are likely to be highly speculative) for manual interpretation.
For this example we can infer that the SYNGR1 prioritisation is driven by a signal within the promoter region of this gene. CCDC116 and YDJC genes have some evidence for non tissue specific interaction (we see similar scores across cell types). However the promoter region seem to be a major driver of these prioritisations. PPIL2 appears to be driven by a tissue specific interaction in GM12878 cell line. Finally the SDF2L1 appears to be driven by a non tissue specific interaction.
#PMI
PMI stands for Poor Mans Imputation and it attempts using reference genotype set and a set of GWAS summary statistics to infer association p-values for SNPs not covered in a study. It does this by computing LD and then for missing p-values it assigns the same value as a snp that has a p-value with the maximum LD. Users can assign a threshold (default is r^2>0.6), if a SNP has a max LD score less than this with a SNP with a p-value then it is dropped. It's a simple method that is implemented to boost resolution when attempting to compute posterior probabilities.
##Quickstart
It follows a similar method to the ppi method previously described excepting a parameter chance when computing initial posterior probabilities.
cd CHIGP/sh
make -f Makefile.pmi
## if you want to start again at anytime run
## make -f Makefile.pmi clean
You can run both and compare the results by using ./R/testPPiVsPMIGeneScores.R script
##BLOCKSHIFTER
Blockshifter is software that allows us to compare the distribution of GWAS signals between two sets of 'prey' regions. Due to the underlying technology that detects restriction fragments there is significant correlation between fragments, which we need to take into account when looking for GWAS signal enrichment. The GWAS data are also correlated due to linkage disequilibrium, we have chosen to implement a technique based on GOSHIFTER, whereby the correlation structure due to LD is preserved but instead we permute the annotations by shifting their location but maintain order. We adapt the method to take into account the block structure of interactions also and thus shift blocks of interactions over the snp data to estimate the null distribution of the difference in weighted means between the test and control set.
cd CHICGP/sh ## if not already there from previous step
./test_blockshifter_pmi.sh
# results are returned in CHIGP/data/out/
##Interpreting the results
gwas | test | control | perm | p.emp | z | p.val.z | delta |
---|---|---|---|---|---|---|---|
0.1cM_chr22.ppi | GM12878 | CD34 | 100 | 0.97 | -1.781786 | 0.0747841 | -2.13e-05 |
For our test data set we have compared GM12878 vs CD34 for enrichment of RA GWAS signals from OKADA rescaled as posterior probabilities for chromosome 22. Unsuprisingly given 100 perms we find no evidence for enrichment. p.emp=0.97 which is the empirically computed p value and resistant to the distribution of permuted nulls. The Z scores and p.val.z are computed under the assumption that the set of permuted null delta's are distributed normally, which might inflate significance if violated. It's included to facillitate comparison between different gwas.
- Add in part for parallelising over HPC cluster.
- Describe set based and hierachical Bayes factor comparison method.
- Citations for HapMap recombination data and Ensembl VEP tool.
- Add code for annotating interaction sets (adding gene names and biotypes) for peakMatrix and WashU formats.