Coder Social home page Coder Social logo

ezprepro2's Introduction

ezprepro

The ezprepro pipeline provides the preprocessing described in S.M. Smith et al. 2021. An expanded set of genome-wide association studies of brain imaging phenotypes in UK Biobank for brain imaging genetics on UK Biobank. This pipeline has been used for the following work:

The scripts for the ezprepro pipeline are provided here. To reproduce this pipeline, the scripts must be run on a high performance computer with a SLURM-like cluster job scheduler, and with a local (to the high performance computer) copy of the UKB genetic data (i.e., UKB Data-Field 22828 and associated Data-Fields). The input to this pipeline is a list of sample IDs (the study sample). The output is a preprocessed version of the UKB genetic data with the study sample split into a discovery cohort and a replication cohort. ezprepro is released under the BSD 2-Clause License. That license, and the copyright for ezprepro, are listed at the bottom of this README file.

Preprocessing

The preprocessing done by this pipeline consists of the following ordered list of operations:

  • Subset the UKB genetic data to a provided list of samples (these are the study samples for which the brain phenotype of interest is measured).

  • Further subset to samples with white British ancestry (using the UKB variable in.white.British), without aneuploidy and with genetic sex matching reported sex.

  • Further subset to a maximal subset of unrelated samples.

  • Split the remaining samples into a discovery cohort with approximately 2/3 of the samples, and a replication cohort with approximately 1/3 of the samples.

  • Subset the variants in the discovery cohort according to the filters: MAF > 0.1% and INFO > 0.3 and HWE -log10 p-value <7.

  • Subset the variants in the replication cohort to match the variants passing the filters for the discovery cohort.

This process results in files containing genotypes for the discovery cohort and the replication cohort on the autosomal chromosomes, the X chromosome, and the pseudoautosomal regions of the Y chromosomes. In addition to these genotypes, the pipeline also forms a version of these files stratified by genetic sex.

Execution

To run the ezprepro script, follow these steps:

  • Step 1: Modify ezprepro/code/global.sh and set the environmental variables to your system specific values.

  • Step 2: Populate ezprepro/bin/ with binaries (or symlinks to binaries) following the inventory in ezprepro/bin/index.txt.

  • Step 3: With your current working directory set to ezprepro/, run the shell command ./A10.sh <STUDY.TXT>. Here <STUDY.TXT> should be the path to a file containing the sample IDs of your study sample, with one sample ID per line.

  • Step 4: Run all the remaining shell commands that end in 10.sh in ezprepro/ in lexigraphical order, with your current working directory still set to ezprepro/. These remaining shell commands (after A10.sh) have no command line arguments. (That is, run ./B10.sh, ..., ./N10.sh).

Each shell command will batch out jobs on SLURM, and then block until the jobs end. After running each shell command, you must examine the error and output logs of the jobs. For the shell command ./X10.sh, the logs will be in ezprepro/logs/X10/*. Here X is the letter of the shell command that was run. If the error logs or output logs indicate a problem, fix the problem before proceeding. Errors may indicate incorrect settings in ezprepro/bin/global.sh, or oom-killer events or walltime-exceeded events.

  • For ./E10.sh, the plink call is intended to produce an error message. In addition to this error message, a file merged-merge.missnp containing multiallelic SNPs should also be created. If this file is correct, then the error messages in the log files for ./E10.sh do not need to be further resolved.

  • For ./I10.sh, the batched job should run in an environment with a recent version of R and the R package igraph available. On some high performance computers, this may be done by using the module load R command (as is done in line 16 of ./I10.sh). In this case, install.packages('igraph') should be run once in an R session in the environment loaded by module load R.

Note that not all of the software used in this pipeline can operate with set RAM limits. So, if RAM usage exceeds the default limits on SLURM, you may need to modify the SLURM flags in the shell commands to specify limits. These modifications (to *10.sh) may be system-specific: providing names for projects, queues, or numbers for nodes. In general, modifying the SLURM limits can be done by modifying the lines in each of the *10.sh scripts before each occurrence of the following echo command:

echo "T1=\$(date \"+%s\");" >> $file

Note that there is some variance in the syntax SLURM implementations use to submit or query jobs, or to investigate failed jobs. Errors (likely appearing in the shell command's standard output) may indicate that the SLURM syntax must be modified.

The ezprepro shell commands are approximately idempotent. Should you rerun a failed shell command, ezprepro will attempt to resume and only batch out portions of the shell command that failed. Should you rerun a successful shell command, ezprepro will attempt to not batch out any new jobs. By cross-referencing output and error logs with the conditional statements governing the resume-logic, you can determine if it is safe to rerun and resume the shell command. If it is not safe to resume (or if you are unable to determine the safety), then delete the intermediate files mentioned in the conditional statements governing the resume-logic before rerunning.

Output

After all of the ezprepro shell commands run successfully, the output will be structured as follows:

  • ezprepro/genetics/disco2/chr01.{bgen,bgi,sample} ... ezprepro/genetics/disco2/chr22.{bgen,bgi,sample}, ezprepro/genetics/disco2/chrX.{bgen,bgi,sample}, ezprepro/genetics/disco2/chrXY.{bgen,bgi,sample}: Genotypes for the discovery cohort for autosomes, pseudoautosomal X chromosome, and non-pseudoautosomal X chromosome (respectively), for variants passing the filters mentioned in the Preprocessing section above.

  • ezprepro/genetics/repro2/chr01.{bgen,bgi,sample} ... ezprepro/genetics/repro2/chr22.{bgen,bgi,sample}, ezprepro/genetics/repro2/chrX.{bgen,bgi,sample}, ezprepro/genetics/repro2/chrXY.{bgen,bgi,sample}: Genotypes for the reproduction cohort.

  • ezprepro/genetics/disco3/chr01m.{bgen,bgi,sample} ... ezprepro/genetics/disco3/chr22m.{bgen,bgi,sample}, ezprepro/genetics/disco3/chrXm.{bgen,bgi,sample}, ezprepro/genetics/disco3/chrXYm.{bgen,bgi,sample}, ezprepro/genetics/repro3/chr01f.{bgen,bgi,sample} ... ezprepro/genetics/repro3/chr22f.{bgen,bgi,sample}, ezprepro/genetics/repro3/chrXf.{bgen,bgi,sample}, ezprepro/genetics/repro3/chrXYf.{bgen,bgi,sample}: Genotypes for the discovery cohort (in the directory disco3) and reproduction cohort (in the directory repro3) stratified by genetic sex. The male genetic sex is indicated by the postfix m before the extension, and the female genetic sex is indicated by the postfix f before the extension.

All genotype files are provided in indexed bgen v1.1 format, with indexing provided by bgenix.

Errors

  • Due to a limitation in qctool v1.4, the INFO filter might not be applied to the X chromosome genotype files. If univariate GWAS is subsequently performed (for example, through ezgwas), then the resulting summary statistics files may be further subset to apply the INFO filter.

  • To prevent confounding, studies on UK Biobank often subset to participants with recent white British ancestry (the largest ancestry in the consortium). As the sample size for imaged subjects in UK Biobank increases, more power is available for meta-analysis across ancestries. Consider modifying the second step described in the Preprocessing section to stratify by ancestry, as is done for sex in the output directories disco3 and repro3. In addition, with further methods this pipeline may be improved to relax the exclusion for genetic sex/self reported sex mismatch.

Profiling

After the shell script X10.sh completes, the runtime for each batched-out job are recorded in seconds in the files ezprepro/runtimes/X10/* (here, X ranges over A ... N). The total runtime in days for the pipeline can thus be determined using the following bash command:

cat ezprepro/runtimes/*/* | awk '{ sum += $1 } END { printf "%.2f\n", sum/60/60/24 }'

Note that if a step is rerun, the previous runtimes are overwritten (and so while the result is the number of days of compute used to produce the output, this is an underestimate of the amount of compute used to execute the pipeline).

License (BSD 2-Clause)

ezprepro

Copyright 2023 Lloyd T. Elliott, Stephen M. Smith, Gwenaëlle Douaud

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

ezprepro2's People

Contributors

lell avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.