
The CGR GWAS QC processing workflow.

Home Page: https://nci-cgr.github.io/GwasQcPipeline/

License: MIT License

Languages: Python 82.33%, R 0.12%, Shell 0.94%, Jinja 4.63%, Dockerfile 0.10%, Perl 9.45%, HTML 2.44%
Topics: snakemake, gwas, cancer-genomics, illumina-bead-chip, plink

gwasqcpipeline's Introduction


GwasQcPipeline

QC pipeline for Illumina SNP Array data generated at CGR.

Usage

If you use this workflow in a paper, please cite the URL of this repository (https://github.com/NCI-CGR/GwasQcPipeline).

The documentation of this pipeline is at https://nci-cgr.github.io/GwasQcPipeline/

Deploying with Docker

Build docker image from within GwasQcPipeline directory

docker build -t gwas_qc_pipe .

Test docker image if you have test data

docker run -v $(pwd):/home/data -i -t gwas_qc_pipe snakemake -k --use-conda -npr

Installing GwasQcPipeline on ccad2

  • Install miniconda and then create GwasQcPipeline production environment
$ mkdir /scratch/myfolder/GwasQcPipeline_v1.4.0
$ cd /scratch/myfolder/GwasQcPipeline_v1.4.0
$ wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.12.0-Linux-x86_64.sh
$ bash Miniconda3-py39_4.12.0-Linux-x86_64.sh -p /scratch/myfolder/GwasQcPipeline_v1.4.0/conda -b
$ source conda/bin/activate base
(base) $ conda update -n base -c defaults conda
(base) $ conda install -n base -c conda-forge mamba
(base) $ conda create -n GwasQcPipeline -y python=3.8 pip
(base) $ conda deactivate
  • Install GwasQcPipeline source code
$ source /scratch/myfolder/GwasQcPipeline_v1.4.0/conda/bin/activate GwasQcPipeline
(GwasQcPipeline) $ pip install https://github.com/NCI-CGR/GwasQcPipeline/releases/download/v1.4.0/cgr_gwas_qc-1.4.0-py3-none-any.whl
(GwasQcPipeline) $ cgr --help
(GwasQcPipeline) $ cgr version
v1.4.0

Usage

cgr submit --ccad2 --local-mem-mb 8000

gwasqcpipeline's People

Contributors

1teaspoon, carynwillis, dependabot[bot], dmh-bah, dorks81, erickarlins, jaamarks, jfear, jsanjak, kliao12, n8sean


gwasqcpipeline's Issues

Fix flaky test

The test test_wrangle_sample_sheet runs fine on its own but fails intermittently when running the entire suite. I suspect it is related to which sample sheet is being used: individual runs appear to use the short manifest, while the full suite uses the full manifest. Needs investigation.

Add grouping of filter/stat rules

I have a number of linked plink steps that I want to group into single jobs. I need to ensure that the group directive has prefix wildcards.

Add real test data to touch more edge cases

My current test data set (206 samples) does not touch all QC corner cases. I am probably missing bugs associated with these corner cases.

Needed Corner Cases

  • Unexpected replicates (i.e., two samples from different subjects are highly correlated IBS/IBD)

Add Unexpected Replicate Filter to subject level workflow

I am not correctly handling unexpected replicates in the subject/population-level analyses.

After filtering by Subject_Representative I should remove any pairs of subject representatives that are unexpected replicates. Right now I am thinking of the following approach.

  1. Add a list of unexpected replicates to the qc_summary.csv.
  2. Filter by Subject_Representative
  3. Filter subjects if their unexpected replicate is also a subject representative.
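A rough pandas sketch of this approach; the unexpected_replicate_IDs column is the hypothetical |-separated column from step 1, and all column names are assumptions:

import pandas as pd

def drop_unexpected_replicate_representatives(qc_summary: pd.DataFrame) -> pd.DataFrame:
    """Keep subject representatives whose unexpected replicates are not also representatives."""
    reps = qc_summary[qc_summary["Subject_Representative"]]
    rep_ids = set(reps["Sample_ID"])

    def has_rep_replicate(ids) -> bool:
        # `unexpected_replicate_IDs` is a "|"-separated list (may be empty/NaN)
        if pd.isna(ids) or not ids:
            return False
        return any(other in rep_ids for other in ids.split("|"))

    return reps[~reps["unexpected_replicate_IDs"].apply(has_rep_replicate)]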

Pre-commit is misbehaving on cgems

On cgems @1teaspoon is having a confusing problem setting up pre-commit. When she runs pre-commit install && pre-commit run or pre-commit install-hooks, pre-commit freezes while downloading and installing the flake8 environment. However, when I do the same thing everything runs fine.

Things we tried

@jwang has tried:

  • two versions of conda: module load miniconda, and downloading and installing miniconda into /scratch/wangj41/miniconda3.
  • recreating the GwasQcPipeline environment a few times.
  • deleting ~/.cache/pre-commit

Things to try

Change location of pre-commit cache

Luckily pre-commit follows the XDG Base Directory Specification, so we can change where the cache is installed by setting the XDG_CACHE_HOME environment variable.

For the conda environment and the clone of the repository, we are going to use the ones that I set up and already tested with pre-commit. So the only thing @1teaspoon is doing differently is using a different cache location.

We are doing everything on /scratch to make sure there are no file system latency issues.

conda activate /scratch/fearjm/GwasQcPipelineEnv  # Use my environment which is working
cd /scratch/fearjm/GwasQcPipeline  # Use my clone of the repository which is working

export XDG_CACHE_HOME=/scratch/wangj41/cache  # Use your own cache, which is what we are testing
pre-commit install-hooks

Change handling of related subjects for PCA and HWE

The legacy workflow removes related subjects prior to HWE/PCA. I forgot to add this in the current dev workflow and I do not have test data for this particular case.

HWE

Removal is appropriate but somewhat annoying. Computing the optimal removal set is (NP?) hard; computing a naive removal set (prune any subject in a pair with IBD PI_HAT > X) is easy but far too conservative for datasets with structured pedigrees (if it's a set of trios, you're going to remove a huge chunk of the dataset). However, an over-conservative pruning matters less for HWE, as the filter is approximate anyway.
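For reference, a minimal pandas sketch of the naive removal set described above, assuming a standard plink .genome table (IID1/IID2/PI_HAT columns):

import pandas as pd

def naive_removal_set(genome_file: str, pi_hat_cutoff: float = 0.1) -> set:
    """Every subject appearing in any pair with PI_HAT above the cutoff.

    This is the easy-but-conservative prune: for a set of trios, all
    members end up in the removal set.
    """
    genome = pd.read_csv(genome_file, delim_whitespace=True)
    related = genome[genome["PI_HAT"] > pi_hat_cutoff]
    return set(related["IID1"]) | set(related["IID2"])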

PCA

Removal is needed for the actual component estimation, but you still want to estimate the PCs for all subjects. smartpca supports PCA projection, which computes the SNP weights from the unrelateds and then applies the resulting linear combination to all subjects. To do this you assign different population names (e.g. "rel" and "unrel") to the subjects in the .ind file, and then set "unrel" as the value of the poplistname parameter in the parameter file [1].
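A rough sketch of writing the projection inputs; the helper, the file names, and the use of "U" for unknown sex are assumptions, while genotypename/snpname/indivname/evecoutname/evaloutname/poplistname are standard smartpca parameters:

from pathlib import Path

def write_projection_inputs(subject_ids, related_ids, prefix="pca/subjects"):
    """Write a smartpca .ind file and poplist so PCs are estimated on the
    unrelateds ("unrel") and projected onto everyone else ("rel")."""
    related = set(related_ids)
    ind_lines = [
        f"{sid} U {'rel' if sid in related else 'unrel'}" for sid in subject_ids
    ]
    Path(f"{prefix}.ind").write_text("\n".join(ind_lines) + "\n")
    Path(f"{prefix}.poplist").write_text("unrel\n")

    par = "\n".join([
        f"genotypename: {prefix}.bed",      # hypothetical file names
        f"snpname:      {prefix}.bim",
        f"indivname:    {prefix}.ind",
        f"evecoutname:  {prefix}.evec",
        f"evaloutname:  {prefix}.eval",
        f"poplistname:  {prefix}.poplist",  # only "unrel" subjects drive the PCs
    ])
    Path(f"{prefix}.par").write_text(par + "\n")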

Level of relatedness

Most methods don't support relatives. Some would argue that "relateds" for PCA/HWE go down to IBD PI_HAT == 0.1 or lower. It's standard practice to remove subjects with any degree of relatedness from a population-based dataset.

Ref #9

Add jinja2 template directory.

I want to switch reporting over to jinja2 templating. I am already using jinja in workflow/scripts/sample_qc_report_summary_stats.py, but I embedded the template directly in the script. Instead, I want to create a folder src/cgr_gwas_qc/templates and place the various templates there (similar to a Flask app).

Refactor VCF interactions

The code in the VCF interaction scripts (bpm2abf, bim_filter_vcf, and update_snps_to_1kg_rsID) has a lot of duplication. I just finished updating all three of these scripts to work with GRCh38, so it makes sense to spend some time refactoring them while they are fresh in my head.

Add rules to copy files to the delivery folder.

The delivery folder has the following files copied into it:

  • HWP.zip < population_level/{population}/controls_maf{maf}_snps_autosome_cleaned.hwe >
  • samples.{bed,bim,fam} < sample_level/samples.{bed,bim,fam} >
  • subjects.{bed,bim,fam} < subject_level/samples.{bed,bim,fam} >
  • list of samples used as subjects < subject_level/subject_representative.txt >
  • original sample sheet < cfg.sample_sheet_file >

There are 3 additional files that I still need to generate.

  • README
  • QC Report .docx
  • QC Report .xlsx

Add `pysam.VariantFile` interface

To support multiple references we need to add logic to handle annotation differences, specifically contig names ("1" vs "chr1"). The biggest issue is when there are discrepancies between the BIM/BPM and the VCF file.

I think the cleanest solution is to create a VCF API that handles chromosome naming differences instead of adding lots of logic to various scripts. To do this I can build a thin wrapper around the pysam.VariantFile class. This wrapper can automatically try different versions of contig names.
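A minimal sketch of what such a wrapper could look like (the class name, method names, and fallback order are assumptions, not the final API):

import pysam

class VcfFile:
    """Thin wrapper around pysam.VariantFile that tolerates "1" vs "chr1" contig names."""

    def __init__(self, path: str):
        self.vcf = pysam.VariantFile(path)
        self.contigs = set(self.vcf.header.contigs)

    def _resolve(self, contig: str) -> str:
        # Try the name as given, then the version with/without the "chr" prefix.
        candidates = [contig]
        if contig.startswith("chr"):
            candidates.append(contig[3:])
        else:
            candidates.append("chr" + contig)
        for name in candidates:
            if name in self.contigs:
                return name
        raise KeyError(f"{contig} not found in VCF header")

    def fetch(self, contig: str, start=None, stop=None):
        return self.vcf.fetch(self._resolve(contig), start, stop)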

Rel #33 #34

Create jobscript test data

Snakemake creates a shell script for each job. In the header of the script are all of the job properties. When submitting to the cluster we can use these properties to modify the submission. I want to create a test version of one of these scripts for use in testing.

Detect the number of samples and set grouping options

To improve performance, we need to group fast-executing rules. This is particularly true of the steps that process individual samples (per_sample_*). To reduce the number of submitted jobs I want to group individual sample steps into batches that take ~1 hr. We can easily do this using snakemake's grouping functionality, for example:

snakemake --groups somerule=group0 --group-components group0=5

which puts 5 group0 jobs together in a single submission.

We need to update the basic cgr-submit (#50) to:

  • detect the number of samples by reading the sample sheet
  • determine the grouping size
  • add snakemake options to the submission script for all per_sample_* rules.

I need to experiment with snakemake's grouping functionality to see which set of options works best.
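As a starting point, a hedged sketch of how cgr-submit could derive the grouping options from the sample count; the per-sample rule names and the one-minute-per-sample estimate are placeholders:

# Assumed numbers: each per-sample job takes ~1 minute and we target ~1 hr groups.
MINUTES_PER_SAMPLE = 1
TARGET_GROUP_MINUTES = 60
PER_SAMPLE_RULES = [
    "per_sample_verifyIDintensity_contamination",  # illustrative rule names
    "per_sample_gtc_to_ped",
]

def grouping_options(n_samples: int) -> list:
    """Build the extra snakemake CLI options for grouping per-sample rules."""
    group_size = max(1, min(n_samples, TARGET_GROUP_MINUTES // MINUTES_PER_SAMPLE))
    groups = [f"{rule}=per_sample" for rule in PER_SAMPLE_RULES]
    return ["--groups", *groups, "--group-components", f"per_sample={group_size}"]

# e.g. grouping_options(5000) ends with '--group-components', 'per_sample=60'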

Fix workflow scripts #! and permissions

I have a number of scripts in workflow/scripts/ that are designed to work with snakemake or to run on their own. I want to clean them up and make sure they have #!/usr/bin/env python and are executable. Snakemake doesn't care, but when running them on their own it is nice to have this all set up correctly.

Directly use snakemake CLI

cgr snakemake implements a subset of snakemake command-line options. I am wondering if I can use snakemake's argument parser directly. Then cgr snakemake would be a very thin wrapper around the snakemake CLI allowing me to inject the main Snakefile and cluster profiles.

Parallelize GTC checks in pre-flight

The pre-flight command takes a really long time on large data sets (~25 min for 5k samples). The GTC checks take especially long. This is a great place to add some parallelization.
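One possible approach, sketched with concurrent.futures; check_gtc is a stand-in for the real GTC validation, not the actual pre-flight function:

from concurrent.futures import ProcessPoolExecutor
from pathlib import Path
from typing import List, Tuple

def check_gtc(gtc_file: Path) -> Tuple[Path, str]:
    """Hypothetical per-file check: the file exists and parses."""
    if not gtc_file.exists():
        return gtc_file, "missing"
    # ... parse the GTC and validate its contents (elided) ...
    return gtc_file, "ok"

def check_all_gtcs(gtc_files: List[Path], workers: int = 8) -> List[Tuple[Path, str]]:
    """Run the per-file GTC checks in parallel worker processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(check_gtc, gtc_files))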

Refactor LD filter to match other plink filters

I have created a rule per plink filter. All of the filters, except LD, are designed to be temp (deleted by snakemake). I then have a cleaned_filter rule which just copies the final filtered data to a permanent dataset. The idea is you could do the following:

  • samples.bed: starting point
  • samples_maf0.2.bed: MAF filter (temp)
  • samples_maf0.2_snps.bed: MAF + --snps-only (temp)
  • samples_maf0.2_snps_autosome.bed: MAF + --snps-only + --autosome (temp)
  • samples_maf0.2_snps_autosome_cleaned.bed: MAF + --snps-only + --autosome (final)

It would be nice for LD to fit into this framework.

Population Summary

I think it makes sense to generate subject_level/{population}/summary.csv and possibly an aggregated subject_level/summary.csv.

I don't know what values are needed downstream, so this needs some thought.

Create a separate sample concordance table.

In the workflow we look at sample-level and subject-level concordance. While subjects are technically a pruned copy of samples, I think it makes more sense to build both a sample and a subject concordance table. Currently, known_concordant_samples.py converts the *.genome file to a concordance table by adding concordance = IBS2 / (IBS0 + IBS1 + IBS2). Instead of doing this calculation inside the sample-specific known_concordant_samples.py, I want to make a separate rule to convert *.genome --> *.concordance.csv.
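A rough sketch of what that rule's script could do, assuming the .genome file was produced with --genome full so the IBS columns are present:

import pandas as pd

def genome_to_concordance(genome_file: str, out_csv: str) -> None:
    """Convert a plink .genome table into a pairwise concordance table."""
    genome = pd.read_csv(genome_file, delim_whitespace=True)
    genome["concordance"] = genome["IBS2"] / (genome["IBS0"] + genome["IBS1"] + genome["IBS2"])
    genome[["IID1", "IID2", "PI_HAT", "concordance"]].to_csv(out_csv, index=False)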

Action Items

  • Create rule to create *.concordance.csv.
  • Update known_concordant_samples.py to use this table instead.

Grouping jobs affecting modification time

I am grouping short-running filter jobs using "{prefix}_{name}". The problem is that modification times seem a little messed up on CGEMs. This may be related to snakemake/snakemake#789.

I am running on Biowulf now and will see if I have the same problem. If so, then it is likely a snakemake bug.

Two possible solutions are to only group temp files or to switch to named pipes.

GRAF ancestry not working with GRCh38

The GRCh38 testing VCF I was using to update variant IDs does not actually contain rsIDs. I changed the script logic to not update the BIM variant ID unless the VCF has an rsID.
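For reference, the guarded update boils down to something like this (a hypothetical helper, not the actual script code):

def updated_variant_id(bim_id: str, vcf_id: str) -> str:
    """Only swap in the VCF ID when it looks like a real rsID."""
    if vcf_id and vcf_id.startswith("rs"):
        return vcf_id
    return bim_id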

Add py3.9 actions

This is a placeholder issue to update GitHub Actions to add Python 3.9. Commit db4fc89 removes py3.9 because not all packages are available on conda yet.

Remove related prior to PCA and HWE

The legacy workflow prunes sets of subjects using IBS/IBD and call rate [1]. These related subjects are removed prior to running PCA [2] and HWE [3]. There is some discussion about doing this in different ways (#10), but here I just want to replicate the legacy process. Unfortunately, testing is going to be difficult because the current test data does not contain any related samples/subjects (#2).

Action Items

  • Create a list of related subjects using IBS/IBD and Call Rate (i.e., refactor [1])
  • Create a rule to filter these subjects and plug them into the PCA (MAF+LD) and HWE (MAF+Autosome+SNPs) filter process.
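A hedged sketch of the first action item, assuming standard plink .genome and .imiss tables and that the member of each related pair with the higher missing rate is removed (the legacy rule may pick differently):

import pandas as pd

def related_subjects_to_remove(genome_file: str, imiss_file: str, pi_hat_cutoff: float = 0.2) -> set:
    """For each related pair, drop the member with the worse call rate (higher F_MISS)."""
    genome = pd.read_csv(genome_file, delim_whitespace=True)
    f_miss = pd.read_csv(imiss_file, delim_whitespace=True).set_index("IID")["F_MISS"]

    to_remove = set()
    for _, row in genome[genome["PI_HAT"] > pi_hat_cutoff].iterrows():
        iid1, iid2 = row["IID1"], row["IID2"]
        to_remove.add(iid1 if f_miss.get(iid1, 1.0) >= f_miss.get(iid2, 1.0) else iid2)
    return to_remove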

Links

  • [1]
    rule make_related_list:
        input:
            track = 'subject_level/SampleUsedforSubject.csv',
            ibd = 'ibd/samples.genome',
            fam = 'subject_level/subjects.fam',
            imiss = 'subject_level/subjects_qc.imiss'
        output:
            'remove_related/subjects_to_remove.txt'
  • [2]
    rule extract_ld_prune_pca:
        input:
            bed = 'split_by_pop/{pop}_subjects.bed',
            bim = 'split_by_pop/{pop}_subjects.bim',
            fam = 'split_by_pop/{pop}_subjects.fam',
            prune = 'pca/{pop}_ldPruneList.prune.in',
            related = 'remove_related/subjects_to_remove.txt'
  • [3]
    rule subset_controls:
        input:
            bed = 'split_by_pop/{pop}_subjects.bed',
            bim = 'split_by_pop/{pop}_subjects.bim',
            fam = 'split_by_pop/{pop}_subjects.fam',
            keep = 'HWP/{pop}_controls.txt',
            related = 'remove_related/subjects_to_remove.txt'

Related

  • #6 Add new columns to qc_summary to simplify workflow (i.e., stop making some intermediate outputs?)
  • #9 Generalize the plink pruning
  • #10 Change what we do for PCA and HWE

Update workflow to use unrelated subjects.

In #24 I added the steps required to prune related subjects. I still need to make the workflow actually use these. In Phase 2 development I want to change how this all works (#10), but for now I just want to replicate what is done by the legacy workflow.

Change branching model

I have been using a variant of the GitFlow branching model. However, this model is more complicated than we need; this is a small project with few developers and a limited number of releases. To make things more accessible to developers who are less familiar with git, I plan on switching to GitHub Flow, which uses a main + feature-branch model.

Tasks

  • Merge dev -> default
  • Delete dev

Add more unit tests to bpm2abf conversion script

Regression testing is really difficult because there are some bugs in the legacy version of the code. To compensate, it would be good to add more unit test coverage of the script and its functions. In particular, it would be good to cover all edge cases where the abf would be "NA".

Refactor use_lims_name

I have a little helper function in deliver.smk that generates names based on CGR's typical naming scheme. I really need access to this in the main Snakefile. I could move it to the Snakefile, but I want to keep that file as clean as possible. I think it belongs in paths.py.

The function currently uses the global config object. I need to refactor it to take the sample sheet file name as an argument instead.

Improve Pre-Flight Messaging

I demonstrated the pre-flight system during our large group meeting. There was a discussion about adding better messaging.

  1. Null rows are extremely rare, so instead of silently ignoring them, it would be better to raise a WARNING during the Pre-Flight checks. This will alert the user that they should check with project managers that everything is ok with the sample table. The workflow itself is OK ignoring null rows.

  2. There is general room for improvement in how issues are reported. We need to capture the various exceptions and produce clear messages about what is wrong. I think it makes sense to split this out into functions, but I am not sure where to put the code. It could go in the pre-flight script, or it could go in the validation library.

Add replicate and unexpected replicate ID column to qc_summary table.

I use the sample_level/qc_summary.csv table extensively in downstream steps. Life will be easier if I add two columns:

  • expected_replicate_IDs
  • unexpected_replicate_IDs

These columns will contain a |-separated list of the IDs that are expected/unexpected replicates of the given sample. I am thinking of something like this:

Sample_ID expected_replicate_IDs unexpected_replicate_IDs
S001
S002 S003|S004
S003 S002|S004
S004 S003|S004
S005 S006
S006 S005
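A rough pandas sketch of building these columns from a pairwise replicate table; the input columns Sample_ID1/Sample_ID2/is_expected are assumptions:

from collections import defaultdict
import pandas as pd

def replicate_id_columns(pairs: pd.DataFrame) -> pd.DataFrame:
    """Build expected_replicate_IDs / unexpected_replicate_IDs per sample."""
    expected, unexpected = defaultdict(set), defaultdict(set)
    for _, row in pairs.iterrows():
        bucket = expected if row["is_expected"] else unexpected
        bucket[row["Sample_ID1"]].add(row["Sample_ID2"])
        bucket[row["Sample_ID2"]].add(row["Sample_ID1"])

    samples = sorted(set(pairs["Sample_ID1"]) | set(pairs["Sample_ID2"]))
    return pd.DataFrame(
        {
            "Sample_ID": samples,
            "expected_replicate_IDs": ["|".join(sorted(expected[s])) for s in samples],
            "unexpected_replicate_IDs": ["|".join(sorted(unexpected[s])) for s in samples],
        }
    )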

Update pre-commit hooks

It has been a while since I updated the pre-commit hooks. I need to go through and bump/test the newest release of each tool. I will also need to bump pyproject.toml for tools that I also have installed (black, isort, flake8,...).

Auto drop samples with missing GTCs

The legacy workflow automatically skipped samples that had missing GTCs. I would like to add a section to the config that gives a list of samples to ignore.

Then I can add a simple method to the sample sheet parser to drop these samples from the sample sheet.
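A minimal sketch of that method as a standalone function; the config option name and the Sample_ID column are assumptions:

import pandas as pd

def drop_ignored_samples(sample_sheet: pd.DataFrame, samples_to_ignore: list) -> pd.DataFrame:
    """Remove samples listed in the (hypothetical) samples_to_ignore config option."""
    return sample_sheet[~sample_sheet["Sample_ID"].isin(samples_to_ignore)].copy()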

Add GRCh38 1KG VCF to RealData cache.

From what I can tell, there are two versions of GRCh38 1KG SNV calls that I could use.

  1. The lift-over from B37.
  2. The direct remapping and calls on GRCh38.

I think (2) is a better option because:

  • We don't need to reason about SNVs that did not lift over
  • It provides a merged (all chromosome) version of the VCF
  • It provides a VCF with only bi-allelic SNVs

Rel #15

Refactor PLINK pruning rules

Plink pruning should be a generalizable problem (to_keep or to_remove). I want to refactor all pruning steps to use a generalized rule.

Auto download 1KG VCF

The one reference data set used by this workflow is the 1KG VCF:

  • ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz
  • ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz.tbi

It may make sense to add a rule to download/symlink these data. Right now their paths are set using a config option.

Refactor verifyIDintensity aggregation to remove unnecessary dependency

During contamination score aggregation, I am updating %Mix to NA if the sample is not present after call rate 2 filtering. I think this step makes more sense during QC table building than here.

rule agg_verifyIDintensity_contamination:
    """Aggregate sample contamination scores.

    Aggregates sample level contamination scores into a single file (each
    row is a sample). The script sets ``%Mix`` to ``NA`` if the intensity
    is below the threshold and the file is not in the ``imiss`` file.
    """
    input:
        contamination=cfg.expand(rules.per_sample_verifyIDintensity_contamination.output),
        median_idat_intensity=rules.agg_median_idat_intensity.output[0],
        imiss="sample_level/call_rate_2/samples.imiss",
    params:
        intensity_threshold=cfg.config.software_params.intensity_threshold,
    output:
        "sample_level/contamination/verifyIDintensity_contamination.csv",
    script:
        "../scripts/agg_contamination_test.py"
