Coder Social home page Coder Social logo

vntr-eqtl's Introduction

DOI

VNTR-eQTL

Scripts for analyzing VNTR-eQTLs in GTEx cohort (link to the paper) and genotyping performance comparisons

Inputs of the pipeline

  1. VNTR genotypes of the entire GTEx cohort as outputted by adVNTR in the default format (one text file for each individual) using WGS data aligned to GRCh38 (available from dbGaP phs000424.v7.p2).
  2. Phenotype file for GTEx cohort (ethnicity and sex are used). accession: pht002742.v7.p2 and exact version used in paper: phs000424.v7.pht002742.v7.p2.c1.GTEx_Subject_Phenotypes.GRU.txt
  3. SNP genotypes file for GTEx cohort (phg001219.v1.GTEx_v8_WGS.genotype-calls-vcf.c1). This file is used to compare effect of VNTR variants and SNPs, and identifying population structure.
  4. RNA-expression data for GTEx cohort in different tissues as a table (phe000020.v1.GTEx_RNAseq.expression-data-matrixfmt.c1).
  5. SRA run table for GTEx expreiments downloaded from dbGaP. (used to identify corresponding tissue for each expreiment id)

Output of the pipeline

Linear regression results showing association of the length of each VNTR with the expression level of its nearest gene in each of the 46 tissues. The pipeline will create a directory called regression_results for the output. It contains 46 subdirectories (one for each tissue) and each of them have a tab separated file for each VNTR showing linear test for the VNTR in the corresponding tissue.

regression_results
└───Whole-Blood
│   │   100374.txt
│   │   100436.txt
│   │   ...
│   
└───Lung
│   │   100374.txt
│   │   100436.txt
│   │   ...
│   
└───etc 

Each text file contains 6 tab separated values:

Genename  Chromosome  VNTR Coordinate Start   Effect Size (B)   P-value   Standard Error (Bse)

For example, regression_results/Whole-Blood/423956.txt (result of the association test for VNTR 423956 in Whole Blood) will have the following content:

POMC    chr2    25161573        0.21217799507799368     1.1501572764235368e-05  0.047746755235986996

Requirements

  1. adVNTR: this can be installed with conda config --add channels bioconda; conda install advntr
  2. eigensoft: this can be installed with apt install eigensoft on linux or conda install -c bioconda/label/cf201901 eigensoft using conda package manager on supported systems.
  3. PEER: to correct for non genetic factors affecting gene expression level.
  4. Python libraries: statsmodels, pandas, numpy
    These can be installed with pip install pandas numpy statsmodels using pip or with conda install -c conda-forge statsmodels using conda.
  5. TRTools: for filtering target loci
  6. CAVIAR: for fine-mapping the causal variant.

How to run

Preprocessing

  1. From sra run table, we first extract the entries related to RNA-seq experiments:
zcat SraRunTable.txt.gz | grep "RNA-Seq" >  Sra_table_RNA-Seq_only
  1. Using Sra_table_RNA-Seq_only, we then break up the rpkm expression matrix (phe000020.v1.GTEx_RNAseq.expression-data-matrixfmt.c1/GTEx_Data_20160115_v7_RNAseq_RNASeQCv1.1.8_gene_rpkm.gct) for different tissues:
python extract_expression_by_tissue.py Sra_table_RNA-Seq_only rpkm_file.gct Expression_by_Subtissue/

Expression_by_Subtissue will contain Whole Blood.rpkm, Brain - Cortex.rpkm, etc. This command usually takes an hour to complete.

  1. To convert microarray genotypes to plink format and keep common variants to infer population structure, run following:
vcftools --gzvcf GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_866Indiv.vcf.gz --recode --maf 0.05 --remove-filtered-all --out filtered_snps
plink --vcf filtered_snps.vcf --biallelic-only --maf 0.05 --recode --out plink_866Ind_filtered_biallelic_snps_0.05maf

Since the vcf files are large, this step may take up to one day depending on the system.

Finding population structure

set GTEXDIR to parent directory of GTEx dataset in the principal_component_identification.sh scripts and run PCA and store results in PCA_results directory:

./principal_component_identification.sh PCA_results

Computing PEER factors

Run PEER factor identification scripts for each tissue (Expression_by_Subtissue directory was specified in preprocessing-part2).

python peer_factor_identification.py Expression_by_Subtissue PEER_results

This step may take 1-2 days to finish.

Running association test

This step will take less than a day to generate the regression_results

python run_regression.py VNTR_genotypes/ Expression_by_Subtissue/

VNTR_genotypes should contain 652 files (one per individual) as outputted by adVNTR for each sample.

Identifying significance threshold (5% FDR)

Run following script to read regression_results and identify tissue-specific significance thresholds.

compute_significance_cutoff.py

vntr-eqtl's People

Contributors

mehrdadbakhtiari avatar jong-hun-park avatar

Stargazers

Ricardo A. Vialle avatar Sander W. van der Laan avatar

Watchers

 avatar  avatar

Forkers

wangdi2016

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.