The ezprepro pipeline provides the preprocessing described in S.M. Smith et al. 2021. An expanded set of genome-wide association studies of brain imaging phenotypes in UK Biobank for brain imaging genetics on UK Biobank. This pipeline has been used for the following work:
The scripts for the ezprepro pipeline are provided here. To reproduce this pipeline, the scripts must be run on a high performance computer with a SLURM-like cluster job scheduler, and with a local (to the high performance computer) copy of the UKB genetic data (i.e., UKB Data-Field 22828 and associated Data-Fields). The input to this pipeline is a list of sample IDs (the study sample). The output is a preprocessed version of the UKB genetic data with the study sample split into a discovery cohort and a replication cohort. ezprepro is released under the BSD 2-Clause License. That license, and the copyright for ezprepro, are listed at the bottom of this README file.
The preprocessing done by this pipeline consists of the following ordered list of operations:
-
Subset the UKB genetic data to a provided list of samples (these are the study samples for which the brain phenotype of interest is measured).
-
Further subset to samples with white British ancestry (using the UKB variable in.white.British), without aneuploidy and with genetic sex matching reported sex.
-
Further subset to a maximal subset of unrelated samples.
-
Split the remaining samples into a discovery cohort with approximately 2/3 of the samples, and a replication cohort with approximately 1/3 of the samples.
-
Subset the variants in the discovery cohort according to the filters: MAF > 0.1% and INFO > 0.3 and HWE -log10 p-value <7.
-
Subset the variants in the replication cohort to match the variants passing the filters for the discovery cohort.
This process results in files containing genotypes for the discovery cohort and the replication cohort on the autosomal chromosomes, the X chromosome, and the pseudoautosomal regions of the Y chromosomes. In addition to these genotypes, the pipeline also forms a version of these files stratified by genetic sex.
To run the ezprepro script, follow these steps:
-
Step 1: Modify
ezprepro/code/global.sh
and set the environmental variables to your system specific values. -
Step 2: Populate
ezprepro/bin/
with binaries (or symlinks to binaries) following the inventory inezprepro/bin/index.txt
. -
Step 3: With your current working directory set to
ezprepro/
, run the shell command./A10.sh <STUDY.TXT>
. Here<STUDY.TXT>
should be the path to a file containing the sample IDs of your study sample, with one sample ID per line. -
Step 4: Run all the remaining shell commands that end in
10.sh
inezprepro/
in lexigraphical order, with your current working directory still set toezprepro/
. These remaining shell commands (afterA10.sh
) have no command line arguments. (That is, run./B10.sh
, ...,./N10.sh
).
Each shell command will batch out jobs on SLURM, and then block until the jobs end. After running each shell command, you must examine the error and output logs of the jobs. For the shell command ./X10.sh
, the logs will be in ezprepro/logs/X10/*
. Here X
is the letter of the shell command that was run. If the error logs or output logs indicate a problem, fix the problem before proceeding. Errors may indicate incorrect settings in ezprepro/bin/global.sh
, or oom-killer events or walltime-exceeded events.
-
For
./E10.sh
, the plink call is intended to produce an error message. In addition to this error message, a filemerged-merge.missnp
containing multiallelic SNPs should also be created. If this file is correct, then the error messages in the log files for./E10.sh
do not need to be further resolved. -
For
./I10.sh
, the batched job should run in an environment with a recent version of R and the R package igraph available. On some high performance computers, this may be done by using themodule load R
command (as is done in line 16 of./I10.sh
). In this case,install.packages('igraph')
should be run once in an R session in the environment loaded bymodule load R
.
Note that not all of the software used in this pipeline can operate with set RAM limits. So, if RAM usage exceeds the default limits on SLURM, you may need to modify the SLURM flags in the shell commands to specify limits. These modifications (to *10.sh
) may be system-specific: providing names for projects, queues, or numbers for nodes. In general, modifying the SLURM limits can be done by modifying the lines in each of the *10.sh
scripts before each occurrence of the following echo command:
echo "T1=\$(date \"+%s\");" >> $file
Note that there is some variance in the syntax SLURM implementations use to submit or query jobs, or to investigate failed jobs. Errors (likely appearing in the shell command's standard output) may indicate that the SLURM syntax must be modified.
The ezprepro shell commands are approximately idempotent. Should you rerun a failed shell command, ezprepro will attempt to resume and only batch out portions of the shell command that failed. Should you rerun a successful shell command, ezprepro will attempt to not batch out any new jobs. By cross-referencing output and error logs with the conditional statements governing the resume-logic, you can determine if it is safe to rerun and resume the shell command. If it is not safe to resume (or if you are unable to determine the safety), then delete the intermediate files mentioned in the conditional statements governing the resume-logic before rerunning.
After all of the ezprepro shell commands run successfully, the output will be structured as follows:
-
ezprepro/genetics/disco2/chr01.{bgen,bgi,sample}
...ezprepro/genetics/disco2/chr22.{bgen,bgi,sample}
,ezprepro/genetics/disco2/chrX.{bgen,bgi,sample}
,ezprepro/genetics/disco2/chrXY.{bgen,bgi,sample}
: Genotypes for the discovery cohort for autosomes, pseudoautosomal X chromosome, and non-pseudoautosomal X chromosome (respectively), for variants passing the filters mentioned in the Preprocessing section above. -
ezprepro/genetics/repro2/chr01.{bgen,bgi,sample}
...ezprepro/genetics/repro2/chr22.{bgen,bgi,sample}
,ezprepro/genetics/repro2/chrX.{bgen,bgi,sample}
,ezprepro/genetics/repro2/chrXY.{bgen,bgi,sample}
: Genotypes for the reproduction cohort. -
ezprepro/genetics/disco3/chr01m.{bgen,bgi,sample}
...ezprepro/genetics/disco3/chr22m.{bgen,bgi,sample}
,ezprepro/genetics/disco3/chrXm.{bgen,bgi,sample}
,ezprepro/genetics/disco3/chrXYm.{bgen,bgi,sample}
,ezprepro/genetics/repro3/chr01f.{bgen,bgi,sample}
...ezprepro/genetics/repro3/chr22f.{bgen,bgi,sample}
,ezprepro/genetics/repro3/chrXf.{bgen,bgi,sample}
,ezprepro/genetics/repro3/chrXYf.{bgen,bgi,sample}
: Genotypes for the discovery cohort (in the directorydisco3
) and reproduction cohort (in the directoryrepro3
) stratified by genetic sex. The male genetic sex is indicated by the postfixm
before the extension, and the female genetic sex is indicated by the postfixf
before the extension.
All genotype files are provided in indexed bgen v1.1 format, with indexing provided by bgenix.
-
Due to a limitation in qctool v1.4, the INFO filter might not be applied to the X chromosome genotype files. If univariate GWAS is subsequently performed (for example, through ezgwas), then the resulting summary statistics files may be further subset to apply the INFO filter.
-
To prevent confounding, studies on UK Biobank often subset to participants with recent white British ancestry (the largest ancestry in the consortium). As the sample size for imaged subjects in UK Biobank increases, more power is available for meta-analysis across ancestries. Consider modifying the second step described in the Preprocessing section to stratify by ancestry, as is done for sex in the output directories
disco3
andrepro3
. In addition, with further methods this pipeline may be improved to relax the exclusion for genetic sex/self reported sex mismatch.
After the shell script X10.sh
completes, the runtime for each batched-out job are recorded in seconds in the files ezprepro/runtimes/X10/*
(here, X
ranges over A
... N
). The total runtime in days for the pipeline can thus be determined using the following bash command:
cat ezprepro/runtimes/*/* | awk '{ sum += $1 } END { printf "%.2f\n", sum/60/60/24 }'
Note that if a step is rerun, the previous runtimes are overwritten (and so while the result is the number of days of compute used to produce the output, this is an underestimate of the amount of compute used to execute the pipeline).
ezprepro
Copyright 2023 Lloyd T. Elliott, Stephen M. Smith, Gwenaëlle Douaud
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.