
The CGR GWAS QC processing workflow.

Home Page: https://nci-cgr.github.io/GwasQcPipeline/

License: MIT License

Languages: Python 82.33%, R 0.12%, Shell 0.94%, Jinja 4.63%, Dockerfile 0.10%, Perl 9.45%, HTML 2.44%
Topics: snakemake, gwas, cancer-genomics, illumina-bead-chip, plink

gwasqcpipeline's Introduction


GwasQcPipeline

QC pipeline for Illumina SNP Array data generated at CGR.

Usage

If you use this workflow in a paper, please cite the URL of this repository (https://github.com/NCI-CGR/GwasQcPipeline).

The documentation of this pipeline is at https://nci-cgr.github.io/GwasQcPipeline/

Deploying with Docker

Build docker image from within GwasQcPipeline directory

docker build -t gwas_qc_pipe .

Test docker image if you have test data

docker run -v $(pwd):/home/data -i -t gwas_qc_pipe snakemake -k --use-conda -npr

Installing GwasQcPipeline on ccad2

  • Install miniconda and then create GwasQcPipeline production environment
$ mkdir /scratch/myfolder/GwasQcPipeline_v1.4.0
$ cd /scratch/myfolder/GwasQcPipeline_v1.4.0
$ wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.12.0-Linux-x86_64.sh
$ bash Miniconda3-py39_4.12.0-Linux-x86_64.sh -p /scratch/myfolder/GwasQcPipeline_v1.4.0/conda -b
$ source conda/bin/activate base
(base) $ conda update -n base -c defaults conda
(base) $ conda install -n base -c conda-forge mamba
(base) $ conda create -n GwasQcPipeline -y python=3.8 pip
(base) $ conda deactivate
  • Install GwasQcPipeline source code
$ source /scratch/myfolder/GwasQcPipeline_v1.4.0/conda/bin/activate GwasQcPipeline
(GwasQcPipeline) $ pip install https://github.com/NCI-CGR/GwasQcPipeline/releases/download/v1.4.0/cgr_gwas_qc-1.4.0-py3-none-any.whl
(GwasQcPipeline) $ cgr --help
(GwasQcPipeline) $ cgr version
v1.4.0

Usage

cgr submit --ccad2 --local-mem-mb 8000

gwasqcpipeline's People

Contributors

1teaspoon, carynwillis, dependabot[bot], dmh-bah, dorks81, erickarlins, jaamarks, jfear, jsanjak, kliao12, n8sean


gwasqcpipeline's Issues

Fix flaky test

The test test_wrangle_sample_sheet runs fine on its own but fails intermittently when running the entire suite. I suspect it is related to which sample sheet is being used: individual runs appear to use the short manifest, while the full suite uses the full manifest. Needs investigation.

Add grouping of filter/stat rules

I have a number of linked plink steps that I want to group into single jobs. I need to ensure that the group directive has prefix wildcards.

Add real test data to touch more edge cases

My current test data set (206 samples) does not touch all QC corner cases. I am probably missing bugs associated with these corner cases.

Needed Corner Cases

  • Unexpected replicates (i.e., two samples from different subjects are highly correlated IBS/IBD)

Add Unexpected Replicate Filter to subject level workflow

I am not correctly handling unexpected replicates in the subject/population-level analyses.

After filtering by Subject_Representative I should remove any pairs of subject representatives that are unexpected replicates. Right now I am thinking of the following approach.

  1. Add a list of unexpected replicates to the qc_summary.csv.
  2. Filter by Subject_Representative
  3. Filter subjects if their unexpected replicate is also a subject representative.
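A rough pandas sketch of this approach; the unexpected_replicate_IDs column is the hypothetical |-separated column from step 1, and all column names are assumptions:

import pandas as pd

def drop_unexpected_replicate_representatives(qc_summary: pd.DataFrame) -> pd.DataFrame:
    """Keep subject representatives whose unexpected replicates are not also representatives."""
    reps = qc_summary[qc_summary["Subject_Representative"]]
    rep_ids = set(reps["Sample_ID"])

    def has_rep_replicate(ids) -> bool:
        # `unexpected_replicate_IDs` is a "|"-separated list (may be empty/NaN)
        if pd.isna(ids) or not ids:
            return False
        return any(other in rep_ids for other in ids.split("|"))

    return reps[~reps["unexpected_replicate_IDs"].apply(has_rep_replicate)]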

Pre-commit is misbehaving on cgems

On cgems @1teaspoon is having a confusing problem setting up pre-commit. When she runs pre-commit install && pre-commit run or pre-commit install-hooks, pre-commit freezes while downloading and installing the flake8 environment. However, when I do the same thing everything runs fine.

Things we tried

@jwang has tried:

  • two versions of conda: module load miniconda, and downloading and installing miniconda into /scratch/wangj41/miniconda3.
  • recreating the GwasQcPipeline environment a few times.
  • deleting ~/.cache/pre-commit

Things to try

Change location of pre-commit cache

Luckily pre-commit follows the XDG Base Directory Specification, so we can change where the cache is installed by setting the XDG_CACHE_HOME environment variable.

For the conda environment and the clone of the repository, we are going to use the ones that I set up and already tested with pre-commit. So the only thing @1teaspoon is doing differently is using a different cache location.

We are doing everything on /scratch to make sure there are no file system latency issues.

conda activate /scratch/fearjm/GwasQcPipelineEnv  # Use my environment which is working
cd /scratch/fearjm/GwasQcPipeline  # Use my clone of the repository which is working

export XDG_CACHE_HOME=/scratch/wangj41/cache  # Use your own cache, which is what we are testing
pre-commit install-hooks

Change handling of related subjects for PCA and HWE

The legacy workflow removes related subjects prior to HWE/PCA. I forgot to add this in the current dev workflow and I do not have test data for this particular case.

HWE

Removal is appropriate but somewhat annoying. Computing the optimal removal set is (NP?) hard; computing a naive removal set (prune any subject in a pair with IBD PI_HAT > X) is easy but far too conservative for datasets with structured pedigrees (if it's a set of trios, you're going to remove a huge chunk of the dataset). However, an over-conservative pruning matters less for HWE, as the filter is approximate anyway.
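For reference, a minimal pandas sketch of the naive removal set described above, assuming a standard plink .genome table (IID1/IID2/PI_HAT columns):

import pandas as pd

def naive_removal_set(genome_file: str, pi_hat_cutoff: float = 0.1) -> set:
    """Every subject appearing in any pair with PI_HAT above the cutoff.

    This is the easy-but-conservative prune: for a set of trios, all
    members end up in the removal set.
    """
    genome = pd.read_csv(genome_file, delim_whitespace=True)
    related = genome[genome["PI_HAT"] > pi_hat_cutoff]
    return set(related["IID1"]) | set(related["IID2"])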

PCA

Removal is needed for the actual component estimation, but you still want to estimate the PCs for all subjects. smartpca supports PCA projection, which computes the SNP weights from the unrelateds and then applies the resulting linear combination to all subjects. To do this you assign different population names (e.g. "rel" and "unrel") to the subjects in the .ind file, and then set "unrel" as the value of the poplistname parameter in the parameter file [1].
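A rough sketch of writing the projection inputs; the helper, the file names, and the use of "U" for unknown sex are assumptions, while genotypename/snpname/indivname/evecoutname/evaloutname/poplistname are standard smartpca parameters:

from pathlib import Path

def write_projection_inputs(subject_ids, related_ids, prefix="pca/subjects"):
    """Write a smartpca .ind file and poplist so PCs are estimated on the
    unrelateds ("unrel") and projected onto everyone else ("rel")."""
    related = set(related_ids)
    ind_lines = [
        f"{sid} U {'rel' if sid in related else 'unrel'}" for sid in subject_ids
    ]
    Path(f"{prefix}.ind").write_text("\n".join(ind_lines) + "\n")
    Path(f"{prefix}.poplist").write_text("unrel\n")

    par = "\n".join([
        f"genotypename: {prefix}.bed",      # hypothetical file names
        f"snpname:      {prefix}.bim",
        f"indivname:    {prefix}.ind",
        f"evecoutname:  {prefix}.evec",
        f"evaloutname:  {prefix}.eval",
        f"poplistname:  {prefix}.poplist",  # only "unrel" subjects drive the PCs
    ])
    Path(f"{prefix}.par").write_text(par + "\n")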

Level of relatedness

Most methods don't support relatives. Some would argue that "relateds" for PCA/HWE go down to IBD PI_HAT == 0.1 or lower. It's standard practice to remove subjects with any degree of relatedness from a population-based dataset.

Ref #9

Add jinja2 template directory.

I want to switch reporting over to jinja2 templating. I am already using jinja in workflow/scripts/sample_qc_report_summary_stats.py, but I embedded the template directly in the script. Instead, I want to create a folder src/cgr_gwas_qc/templates and place the various templates there (similar to a Flask app).

Refactor VCF interactions

The code in the VCF interaction scripts (bpm2abf, bim_filter_vcf, and update_snps_to_1kg_rsID) has a lot of duplication. I just finished updating all three of these scripts to work with GRCh38, so it makes sense to spend some time refactoring them while they are fresh in my head.

Add rules to copy files to the delivery folder.

The delivery folder has the following files copied into it:

  • HWP.zip < population_level/{population}/controls_maf{maf}_snps_autosome_cleaned.hwe >
  • samples.{bed,bim,fam} < sample_level/samples.{bed,bim,fam} >
  • subjects.{bed,bim,fam} < subject_level/samples.{bed,bim,fam} >
  • list of samples used as subjects < subject_level/subject_representative.txt >
  • original sample sheet < cfg.sample_sheet_file >

There are 3 additional files that I still need to generate.

  • README
  • QC Report .docx
  • QC Report .xlsx

Add `pysam.VariantFile` interface

To support multiple references we need to add logic to handle annotation differences, specifically contig names ("1" vs "chr1"). The biggest issue is when there are discrepancies between the BIM/BPM and the VCF file.

I think the cleanest solution is to create a VCF API that handles chromosome naming differences instead of adding lots of logic to various scripts. To do this I can build a thin wrapper around the pysam.VariantFile class. This wrapper can automatically try different versions of contig names.
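A minimal sketch of what such a wrapper could look like (the class name, method names, and fallback order are assumptions, not the final API):

import pysam

class VcfFile:
    """Thin wrapper around pysam.VariantFile that tolerates "1" vs "chr1" contig names."""

    def __init__(self, path: str):
        self.vcf = pysam.VariantFile(path)
        self.contigs = set(self.vcf.header.contigs)

    def _resolve(self, contig: str) -> str:
        # Try the name as given, then the version with/without the "chr" prefix.
        candidates = [contig]
        if contig.startswith("chr"):
            candidates.append(contig[3:])
        else:
            candidates.append("chr" + contig)
        for name in candidates:
            if name in self.contigs:
                return name
        raise KeyError(f"{contig} not found in VCF header")

    def fetch(self, contig: str, start=None, stop=None):
        return self.vcf.fetch(self._resolve(contig), start, stop)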

Rel #33 #34

Create jobscript test data

Snakemake creates a shell script for each job. In the header of the script are all of the job properties. When submitting to the cluster we can use these properties to modify the submission. I want to create a test version of one of these scripts for use in testing.

Detect the number of samples and set grouping options

To improve performance, we need to group fast-executing rules. This is particularly true of the steps that process individual samples (per_sample_*). To reduce the number of submitted jobs I want to group individual sample steps into batches that take ~1 hr. We can easily do this using snakemake's grouping functionality, for example:

snakemake --groups somerule=group0 --group-components group0=5

which puts 5 group0 jobs together in a single submission.

We need to update the basic cgr-submit (#50) to:

  • detect the number of samples by reading the sample sheet
  • determine the grouping size
  • add snakemake options to the submission script for all per_sample_* rules.

I need to experiment with snakemake's grouping functionality to see which set of options works best.
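As a starting point, a hedged sketch of how cgr-submit could derive the grouping options from the sample count; the per-sample rule names and the one-minute-per-sample estimate are placeholders:

# Assumed numbers: each per-sample job takes ~1 minute and we target ~1 hr groups.
MINUTES_PER_SAMPLE = 1
TARGET_GROUP_MINUTES = 60
PER_SAMPLE_RULES = [
    "per_sample_verifyIDintensity_contamination",  # illustrative rule names
    "per_sample_gtc_to_ped",
]

def grouping_options(n_samples: int) -> list:
    """Build the extra snakemake CLI options for grouping per-sample rules."""
    group_size = max(1, min(n_samples, TARGET_GROUP_MINUTES // MINUTES_PER_SAMPLE))
    groups = [f"{rule}=per_sample" for rule in PER_SAMPLE_RULES]
    return ["--groups", *groups, "--group-components", f"per_sample={group_size}"]

# e.g. grouping_options(5000) ends with '--group-components', 'per_sample=60'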

Fix workflow scripts #! and permissions

I have a number of scripts in workflow/scripts/ that are designed to work with snakemake or to run on their own. I want to clean them up and make sure they have #!/usr/bin/env python and are executable. Snakemake doesn't care, but when running them on their own it is nice to have this all set up correctly.

Directly use snakemake CLI

cgr snakemake implements a subset of snakemake command-line options. I am wondering if I can use snakemake's argument parser directly. Then cgr snakemake would be a very thin wrapper around the snakemake CLI allowing me to inject the main Snakefile and cluster profiles.

Parallelize GTC checks in pre-flight

The pre-flight command takes a really long time on large data sets (~25 min for 5k samples). The GTC checks take especially long. This is a great place to add some parallelization.
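One possible approach, sketched with concurrent.futures; check_gtc is a stand-in for the real GTC validation, not the actual pre-flight function:

from concurrent.futures import ProcessPoolExecutor
from pathlib import Path
from typing import List, Tuple

def check_gtc(gtc_file: Path) -> Tuple[Path, str]:
    """Hypothetical per-file check: the file exists and parses."""
    if not gtc_file.exists():
        return gtc_file, "missing"
    # ... parse the GTC and validate its contents (elided) ...
    return gtc_file, "ok"

def check_all_gtcs(gtc_files: List[Path], workers: int = 8) -> List[Tuple[Path, str]]:
    """Run the per-file GTC checks in parallel worker processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(check_gtc, gtc_files))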

Refactor LD filter to match other plink filters

I have created a rule per plink filter. All of the filters, except LD, are designed to be temp (deleted by snakemake). I then have a cleaned_filter rule which just copies the final filtered data to a permanent dataset. The idea is you could do the following:

  • samples.bed: starting point
  • samples_maf0.2.bed: MAF filter (temp)
  • samples_maf0.2_snps.bed: MAF + --snps-only (temp)
  • samples_maf0.2_snps_autosome.bed: MAF + --snps-only + --autosome (temp)
  • samples_maf0.2_snps_autosome_cleaned.bed: MAF + --snps-only + --autosome (final)

It would be nice for LD to fit into this framework.

Population Summary

I think it makes sense to generate subject_level/{population}/summary.csv and possibly an aggregated subject_level/summary.csv.

I don't know what values are needed downstream, so this needs some thought.

Create a separate sample concordance table.

In the workflow we look at sample-level and subject-level concordance. While subjects are technically a pruned copy of samples, I think it makes more sense to build both a sample and a subject concordance table. Currently, known_concordant_samples.py converts the *.genome file to a concordance table by adding concordance = IBS2 / (IBS0 + IBS1 + IBS2). Instead of doing this calculation inside the sample-specific known_concordant_samples.py, I want to make a separate rule to convert *.genome --> *.concordance.csv.
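A rough sketch of what that rule's script could do, assuming the .genome file was produced with --genome full so the IBS columns are present:

import pandas as pd

def genome_to_concordance(genome_file: str, out_csv: str) -> None:
    """Convert a plink .genome table into a pairwise concordance table."""
    genome = pd.read_csv(genome_file, delim_whitespace=True)
    genome["concordance"] = genome["IBS2"] / (genome["IBS0"] + genome["IBS1"] + genome["IBS2"])
    genome[["IID1", "IID2", "PI_HAT", "concordance"]].to_csv(out_csv, index=False)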

Action Items

  • Create rule to create *.concordance.csv.
  • Update known_concordant_samples.py to use this table instead.

Grouping jobs affecting modification time

I am grouping short-running filter jobs using "{prefix}_{name}". The problem is that modification times seem a little messed up on CGEMs. This may be related to snakemake/snakemake#789.

I am running on Biowulf now and will see if I have the same problem. If so, then it is likely a snakemake bug.

Two possible solutions are to only group temp files or to switch to named pipes.

GRAF ancestry not working with GRCh38

The GRCh38 testing VCF I was using to update variant IDs does not actually contain rsIDs. I changed the script logic to not update the BIM variant ID unless the VCF has an rsID.
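For reference, the guarded update boils down to something like this (a hypothetical helper, not the actual script code):

def updated_variant_id(bim_id: str, vcf_id: str) -> str:
    """Only swap in the VCF ID when it looks like a real rsID."""
    if vcf_id and vcf_id.startswith("rs"):
        return vcf_id
    return bim_id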

Add py3.9 actions

This is a placeholder issue to update GitHub Actions to add Python 3.9. Commit db4fc89 removes py3.9 because not all packages are available on conda yet.

Remove related prior to PCA and HWE

The legacy workflow prunes sets of subjects using IBS/IBD and call rate [1]. These related subjects are removed prior to running PCA [2] and HWE [3]. There is some discussion about doing this in different ways (#10), but here I just want to replicate the legacy process. Unfortunately, testing is going to be difficult because the current test data does not contain any related samples/subjects (#2).

Action Items

  • Create a list of related subjects using IBS/IBD and Call Rate (i.e., refactor [1])
  • Create a rule to filter these subjects and plug them into the PCA (MAF+LD) and HWE (MAF+Autosome+SNPs) filter process.
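A hedged sketch of the first action item, assuming standard plink .genome and .imiss tables and that the member of each related pair with the higher missing rate is removed (the legacy rule may pick differently):

import pandas as pd

def related_subjects_to_remove(genome_file: str, imiss_file: str, pi_hat_cutoff: float = 0.2) -> set:
    """For each related pair, drop the member with the worse call rate (higher F_MISS)."""
    genome = pd.read_csv(genome_file, delim_whitespace=True)
    f_miss = pd.read_csv(imiss_file, delim_whitespace=True).set_index("IID")["F_MISS"]

    to_remove = set()
    for _, row in genome[genome["PI_HAT"] > pi_hat_cutoff].iterrows():
        iid1, iid2 = row["IID1"], row["IID2"]
        to_remove.add(iid1 if f_miss.get(iid1, 1.0) >= f_miss.get(iid2, 1.0) else iid2)
    return to_remove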

Links

  • [1]
    rule make_related_list:
        input:
            track = 'subject_level/SampleUsedforSubject.csv',
            ibd = 'ibd/samples.genome',
            fam = 'subject_level/subjects.fam',
            imiss = 'subject_level/subjects_qc.imiss'
        output:
            'remove_related/subjects_to_remove.txt'
  • [2]
    rule extract_ld_prune_pca:
        input:
            bed = 'split_by_pop/{pop}_subjects.bed',
            bim = 'split_by_pop/{pop}_subjects.bim',
            fam = 'split_by_pop/{pop}_subjects.fam',
            prune = 'pca/{pop}_ldPruneList.prune.in',
            related = 'remove_related/subjects_to_remove.txt'
  • [3]
    rule subset_controls:
        input:
            bed = 'split_by_pop/{pop}_subjects.bed',
            bim = 'split_by_pop/{pop}_subjects.bim',
            fam = 'split_by_pop/{pop}_subjects.fam',
            keep = 'HWP/{pop}_controls.txt',
            related = 'remove_related/subjects_to_remove.txt'

Related

  • #6 Add new columns to qc_summary to simplify workflow (i.e., stop making some intermediate outputs?)
  • #9 Generalize the plink pruning
  • #10 Change what we do for PCA and HWE

Update workflow to use unrelated subjects.

In #24 I added the steps required to prune related subjects. I still need to make the workflow actually use these. In Phase 2 development I want to change how this all works (#10), but for now I just want to replicate what is done by the legacy workflow.

Change branching model

I have been using a variant of the GitFlow branching model. However, this model is more complicated than we need; this is a small project with few developers and a limited number of releases. To make things more accessible to developers who are less familiar with git, I plan on switching to GitHub Flow, which uses a main + feature-branch model.

Tasks

  • Merge dev -> default
  • Delete dev

Add more unit tests to bpm2abf conversion script

Regression testing is really difficult because there are some bugs in the legacy version of the code. To compensate, it would be good to add more unit test coverage of the script and its functions. In particular, it would be good to cover all edge cases where the abf would be "NA".

Refactor use_lims_name

I have a little helper function in deliver.smk that generates names based on CGR's typical naming scheme. I really need access to this in the main Snakefile. I could move it to the Snakefile, but I want to keep that file as clean as possible. I think it belongs in paths.py.

The function currently uses the global config object. I need to refactor it to take the sample sheet file name as an argument instead.

Improve Pre-Flight Messaging

I demonstrated the pre-flight system during our large group meeting. There was a discussion about adding better messaging.

  1. Null rows are extremely rare, so instead of silently ignoring them, it would be better to raise a WARNING during the Pre-Flight checks. This will alert the user that they should check with project managers that everything is ok with the sample table. The workflow itself is OK ignoring null rows.

  2. There is general room for improvement in how issues are reported. We need to capture the various exceptions and produce clear messages about what is wrong. I think it makes sense to split this out into functions, but I am not sure where to put the code. It could go in the pre-flight script, or it could go in the validation library.

Add replicate and unexpected replicate ID column to qc_summary table.

I use the sample_level/qc_summary.csv table extensively in downstream steps. Life will be easier if I add two columns:

  • expected_replicate_IDs
  • unexpected_replicate_IDs

These columns will contain a |-separated list of the IDs that are expected/unexpected replicates of the given sample. I am thinking of something like this:

Sample_ID expected_replicate_IDs unexpected_replicate_IDs
S001
S002 S003|S004
S003 S002|S004
S004 S003|S004
S005 S006
S006 S005
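A rough pandas sketch of building these columns from a pairwise replicate table; the input columns Sample_ID1/Sample_ID2/is_expected are assumptions:

from collections import defaultdict
import pandas as pd

def replicate_id_columns(pairs: pd.DataFrame) -> pd.DataFrame:
    """Build expected_replicate_IDs / unexpected_replicate_IDs per sample."""
    expected, unexpected = defaultdict(set), defaultdict(set)
    for _, row in pairs.iterrows():
        bucket = expected if row["is_expected"] else unexpected
        bucket[row["Sample_ID1"]].add(row["Sample_ID2"])
        bucket[row["Sample_ID2"]].add(row["Sample_ID1"])

    samples = sorted(set(pairs["Sample_ID1"]) | set(pairs["Sample_ID2"]))
    return pd.DataFrame(
        {
            "Sample_ID": samples,
            "expected_replicate_IDs": ["|".join(sorted(expected[s])) for s in samples],
            "unexpected_replicate_IDs": ["|".join(sorted(unexpected[s])) for s in samples],
        }
    )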

Update pre-commit hooks

It has been a while since I updated the pre-commit hooks. I need to go through and bump/test the newest release of each tool. I will also need to bump pyproject.toml for tools that I also have installed (black, isort, flake8,...).

Auto drop samples with missing GTCs

The legacy workflow automatically skipped samples that had missing GTCs. I would like to add a section to the config that gives a list of samples to ignore.

Then I can add a simple method to the sample sheet parser to drop these samples from the sample sheet.
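A minimal sketch of that method as a standalone function; the config option name and the Sample_ID column are assumptions:

import pandas as pd

def drop_ignored_samples(sample_sheet: pd.DataFrame, samples_to_ignore: list) -> pd.DataFrame:
    """Remove samples listed in the (hypothetical) samples_to_ignore config option."""
    return sample_sheet[~sample_sheet["Sample_ID"].isin(samples_to_ignore)].copy()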

Add GRCh38 1KG VCF to RealData cache.

From what I can tell, there are two versions of GRCh38 1KG SNV calls that I could use.

  1. The lift-over from B37.
  2. The direct remapping and calls on GRCh38.

I think (2) is a better option because:

  • We don't need to reason about SNVs that did not lift over
  • It provides a merged (all chromosome) version of the VCF
  • It provides a VCF with only bi-allelic SNVs

Rel #15

Refactor PLINK pruning rules

Plink pruning should be a generalizable problem (to_keep or to_remove). I want to refactor all pruning steps to use a generalized rule.

Auto download 1KG VCF

The one reference data set used by this workflow is the 1KG VCF:

  • ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz
  • ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz.tbi

It may make sense to add a rule to download/symlink these data. Right now their paths are set using a config option.

Refactor verifyIDintensity aggregation to remove unnecessary dependency

During contamination score aggregation, I am updating %Mix to NA if the sample is not present after call rate 2 filtering. I think this step makes more sense during QC table building than here.

rule agg_verifyIDintensity_contamination:
    """Aggregate sample contamination scores.

    Aggregates sample level contamination scores into a single file (each
    row is a sample). The script sets ``%Mix`` to ``NA`` if the intensity
    is below the threshold and the file is not in the ``imiss`` file.
    """
    input:
        contamination=cfg.expand(rules.per_sample_verifyIDintensity_contamination.output),
        median_idat_intensity=rules.agg_median_idat_intensity.output[0],
        imiss="sample_level/call_rate_2/samples.imiss",
    params:
        intensity_threshold=cfg.config.software_params.intensity_threshold,
    output:
        "sample_level/contamination/verifyIDintensity_contamination.csv",
    script:
        "../scripts/agg_contamination_test.py"
