Coder Social home page Coder Social logo

pbk-qc-pipeline's Introduction

Partners Biobank Genomics Data QC

This repository details the quality control (QC) pipeline of the Partners Biobank genotype data, which largely follows the recommendations described in:

Peterson, R. E., Kuchenbaecker, K., Walters, R. K., Chen, C.-Y., Popejoy, A. B., Periyasamy, S., et al. (2019). Genome-wide Association Studies in Ancestrally Diverse Populations: Opportunities, Methods, Pitfalls, and Recommendations. Cell, 179(3), 589–603. http://doi.org/10.1016/j.cell.2019.08.051

The current dataset (as of Dec. 2019) includes 36,424 individuals genotyped on Illumina’s Multi-Ethnic Global array in hg19 coordinates.

Quality control pipeline

  • QC for each genotyping batch (metrics not affected by ancestry; scripts 01-04)

    • SNP-level call rate >0.95
    • Sample-level call rate >0.98
    • SNP-level call rate >0.98
    • Maximum SNP-level missing rate difference between two batches < an empirical threshold (e.g., 1%)
  • Merge genotyping batches (scripts 05-06)

    • Remove duplicated SNPs
    • Remove monomorphic SNPs
    • Remove SNPs not confidently mapped (chr=0)
  • Population assignment (scripts 07-08)

    • Select common, high-quality SNPs for population inference
      • SNP-level call rate >0.98
      • MAF >5%
      • Remove strand ambiguous SNPs and long-range LD regions (chr6:25-35Mb; chr8:7-13Mb inversion)
      • Prune to <100K independent SNPs
    • Identify individuals with European ancestry using selected SNPs
      • Run PCA combining study samples + 1KG data
      • Use Random Forest to classify genetic ancesty with a prediciton prob. >0.9
  • QC within European samples (metrics may be affected by ancestry; scripts 09-13)

    • Remove samples that fail sex check (--check-sex)
    • Absolute value of autosomal heterozygosity rate deviating from the mean (e.g., 3SD; --het)
    • Identify unrelated Individuals (Pi_hat <0.2) within European samples
    • Remove SNPs that show batch associations
      • Regress each batch indicator on SNPs, adjusting for sex (association P < an empirical threshold)
  • Calculate PCs within unrelated European samples using common, high-quality SNPs (script 14)

  • Final SNP-level QC within unrelated European samples (script 15)

    • SNP-level call rate >0.98
    • HWE >1e-10
    • Retain only autosomal SNPs, excluding indels and monomorphic SNPs (for imputation; HRC: SNP only)
  • Prepare data for HRC/1KG imputation using Michigan server (scripts 16-17)

    • Harmonize study data with HRC/1KG data
    • Convert plink to vcf by chromosome
  • Send unrelated European samples to Michigan server for imputation (using HRC as the reference panel)

  • Post-imputation QC (converting vcf dosages to plink hard-call genotypes; scripts 18-20)

    • INFO score/Imputation R2 >0.8
    • MAF (based on imputed dosages) >1%
    • HWE >1e-10
    • SNP-level call rate >0.98

Summary of pre-imputation QC

Genetic ancestry assignment

  • Random Forest prediction probability > 0.9 based on top 6 PCs, with 1KG samples as the training data
  • "Unclassified" consists of mostly admixed individuals
1KG superpop EUR AMR AFR EAS SAS Unclassified Total
# Samples 26,677 1,840 1,607 504 297 5,499 36,424
% Total 73.2% 5.1% 4.4% 1.4% 0.8% 15.1% 100%

Sample QC

  • Samples are genotyped on 8 batches, with the first 7 batches each containing ~5K individuas and the 8th batch around 900 individuals
Sample QC metric # Samples % Total
Initial sample size 36,424 100%
Batch QC:
Sample-level call rate < 0.98 0 0.0%
Merged QC:
Non-European 9,747 26.8%
EUR (pop-specific) QC:
Failing sex check (reported != imputed sex, using F < 0.25
for female & >0.75 for male)
25 0.07%
Outlying heterozygosity rate (>5SD from the mean) 50 0.14%
IBD relatedness > 0.2 908 2.5%
Any of the above three 979 2.7%
Post-QC 25,698 70.6%

Variant QC

Variant QC metric # Variants % Total
Initial variant count (avg. across batches) 1.7M -
Batch QC:
SNP call rate < 0.95 and then < 0.98 (avg. across batches) 20K -
Common across batches 1,370,695 -
Missing rate diff > 0.0075 between any two batches 48,929 -
Merged QC:
Total 1,321,766 100%
Monomorphic SNPs 122,742 9.3%
Duplicated SNPs 19,973 1.5%
Not confidently mapped (chr & pos=0) 2,183 0.17%
Any of the above three 144,456 10.9%
EUR (pop-specific) QC:
Showing batch association (p < 1e-04) 2,316 0.18%
Final QC:
SNP-level call rate < 0.98 1 7e-05%
pHWE < 1e-10 2375 0.18%
Non-autosomal, indel, or monomorphic 125,030 9.5%
Post-QC 1,047,588 79.3%
HRC QC (for imputation):
Not in HRC or mismatched info 138,359 10.5%
Send to Michigan imputation server 909,229 68.8%

Summary of post-imputation QC

Variant filter (1) # Variants % Total Variant filter (2) # Variants % Total
Total imputed 33,822,636 100% Total imputed 33,822,636 100%
MAF (dosage) < 0.01 26,051,805 77.0% MAF (dosage) < 0.005 24,946,776 73.8%
Imputation R2 < 0.8 17,443,564 51.6% Imputation R2 < 0.6 9,993,397 29.5%
Any of the above two 26,468,624 78.3% Any of the above two 25,100,024 74.2%
Call rate < 0.98 0 0.0% Call rate < 0.98 0 0.0%
pHWE < 1e-10 167 5e-04% pHWE < 1e-10 213 6e-04%
Post-QC 7,353,845 21.7% Post-QC 8,722,399 25.8%

pbk-qc-pipeline's People

Contributors

annefeng avatar getian107 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.