broadinstitute/gatk-sv

A structural variation pipeline for short-read sequencing

License: BSD 3-Clause "New" or "Revised" License


gatk-sv's Introduction

GATK-SV

A structural variation discovery pipeline for Illumina short-read whole-genome sequencing (WGS) data.

Deployment and execution:

  • A Google Cloud account.
  • A workflow execution system supporting the Workflow Description Language (WDL), either:
    • Cromwell (v36 or higher). A dedicated server is highly recommended.
    • or Terra (note preconfigured GATK-SV workflows are not yet available for this platform)
  • Recommended: MELT. Due to licensing restrictions, we cannot provide a public docker image or reference panel VCFs for this algorithm.
  • Recommended: cromshell for interacting with a dedicated Cromwell server.
  • Recommended: WOMtool for validating WDL/json files.
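
For example, a WDL and its input json can be validated with WOMtool before submission (the jar path is illustrative):

> java -jar womtool.jar validate wdl/GATKSVPipelineBatch.wdl -i GATKSVPipelineBatch.my_run.json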

Alternative backends

Because GATK-SV has been tested only on the Google Cloud Platform (GCP), we are unable to provide specific guidance or support for other execution platforms, including HPC clusters and AWS. Contributions from the community to improve portability between backends will be considered on a case-by-case basis. We ask contributors to please adhere to the following guidelines when submitting issues and pull requests:

  1. Code changes must be functionally equivalent on GCP backends, i.e. not result in changed output
  2. Increases to cost and runtime on GCP backends should be minimal
  3. Avoid adding new inputs and tasks to workflows. Simpler changes are more likely to be approved, e.g. small in-line changes to scripts or WDL task command sections
  4. Avoid introducing new code paths, e.g. conditional statements
  5. Additional backend-specific scripts, workflows, tests, and Dockerfiles will not be approved
  6. Changes to Dockerfiles may require extensive testing before approval

We still encourage members of the community to adapt GATK-SV for non-GCP backends and share code on forked repositories. Here are some considerations:

  • Refer to Cromwell's documentation for configuration instructions.
  • The handling and ordering of glob commands may differ between platforms.
  • Shell commands that are potentially destructive to input files (e.g. rm, mv, tabix) can cause unexpected behavior on shared filesystems. Enabling copy localization may help to more closely replicate the behavior on GCP.
  • For clusters that do not support Docker, Singularity is an alternative. See Cromwell documentation on Singularity.
  • The GATK-SV pipeline takes advantage of the massive parallelization possible in the cloud. Local backends may not have the resources to execute all of the workflows. Workflows that use fewer resources or that are less parallelized may be more successful. For instance, some users have been able to run GatherSampleEvidence on a SLURM cluster.

Data:

  • Illumina short-read whole-genome CRAMs or BAMs, aligned to hg38 with bwa-mem. BAMs must also be indexed.
  • Family structure definitions file in PED format.

The PED file format is described here. Note that GATK-SV imposes additional requirements:

  • The file must be tab-delimited.
  • The sex column must only contain 0, 1, or 2: 1=Male, 2=Female, 0=Other/Unknown. Sex chromosome aneuploidies (detected in EvidenceQC) should be entered as sex = 0.
  • All family, individual, and parental IDs must conform to the sample ID requirements.
  • Missing parental IDs should be entered as 0.
  • Header lines are allowed if they begin with a # character. To validate the PED file, you may use src/sv-pipeline/scripts/validate_ped.py -p pedigree.ped -s samples.list.
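
For illustration, a minimal PED file for one trio plus one unrelated sample of unknown sex might look like the following (columns are tab-separated; all IDs are hypothetical):

  #FAM_ID  SAMPLE_ID  FATHER_ID  MOTHER_ID  SEX  PHENO
  fam1     sample_01  sample_02  sample_03  1    0
  fam1     sample_02  0          0          1    0
  fam1     sample_03  0          0          2    0
  fam2     sample_04  0          0          0    0

> python src/sv-pipeline/scripts/validate_ped.py -p pedigree.ped -s samples.list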

We recommend filtering out samples with a high percentage of improperly paired reads (>10% or an outlier for your data) as technical outliers prior to running GatherSampleEvidence. A high percentage of improperly paired reads may indicate issues with library prep, degradation, or contamination. Artifactual improperly paired reads could cause incorrect SV calls, and these samples have been observed to have longer runtimes and higher compute costs for GatherSampleEvidence.

Sample IDs must:

  • Be unique within the cohort
  • Contain only alphanumeric characters and underscores (no dashes, whitespace, or special characters)

Sample IDs should not:

  • Contain only numeric characters
  • Be a substring of another sample ID in the same cohort
  • Contain any of the following substrings: chr, name, DEL, DUP, CPX, CHROM

The same requirements apply to family IDs in the PED file, as well as batch IDs and the cohort ID provided as workflow inputs.
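
As a quick sanity check, a shell sketch along these lines flags most violations in a samples.list file containing one ID per line (any output indicates a problem; the substring-of-another-ID rule is not covered here):

> grep -vE '^[A-Za-z0-9_]+$' samples.list            # disallowed characters
> grep -xE '[0-9]+' samples.list                     # purely numeric IDs
> grep -E 'chr|name|DEL|DUP|CPX|CHROM' samples.list  # forbidden substrings
> sort samples.list | uniq -d                        # duplicate IDs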

Sample IDs are provided to GatherSampleEvidence directly and need not match sample names from the BAM/CRAM headers. GetSampleID.wdl can be used to fetch BAM sample IDs and also generates a set of alternate IDs that are considered safe for this pipeline; alternatively, this script transforms a list of sample IDs to fit these requirements. Currently, sample IDs can be replaced again in GatherBatchEvidence.

The following inputs will need to be updated with the transformed sample IDs:

Please cite the following publication: Collins, Brand, et al. 2020. "A structural variation reference for medical and population genetics." Nature 581, 444-451.

Additional references: Werling et al. 2018. "An analytical framework for whole-genome sequence association studies and its implications for autism spectrum disorder." Nature genetics 50.5, 727-736.

The following resources were produced using data from the All of Us Research Program and have been approved by the Program for public dissemination:

  • Genotype filtering model: "aou_recalibrate_gq_model_file" in "inputs/values/resources_hg38.json"

The All of Us Research Program is supported by the National Institutes of Health, Office of the Director: Regional Medical Centers: 1 OT2 OD026549; 1 OT2 OD026554; 1 OT2 OD026557; 1 OT2 OD026556; 1 OT2 OD026550; 1 OT2 OD 026552; 1 OT2 OD026553; 1 OT2 OD026548; 1 OT2 OD026551; 1 OT2 OD026555; IAA #: AOD 16037; Federally Qualified Health Centers: HHSN 263201600085U; Data and Research Center: 5 U2C OD023196; Biobank: 1 U24 OD023121; The Participant Center: U24 OD023176; Participant Technology Systems Center: 1 U24 OD023163; Communications and Engagement: 3 OT2 OD023205; 3 OT2 OD023206; and Community Partners: 1 OT2 OD025277; 3 OT2 OD025315; 1 OT2 OD025337; 1 OT2 OD025276. In addition, the All of Us Research Program would not be possible without the partnership of its participants.

WDLs

There are two scripts for running the full pipeline:

  • wdl/GATKSVPipelineBatch.wdl: Runs GATK-SV on a batch of samples.
  • wdl/GATKSVPipelineSingleSample.wdl: Runs GATK-SV on a single sample, given a reference panel

Building inputs

Example workflow inputs can be found in /inputs. Build using scripts/inputs/build_default_inputs.sh, which generates input jsons in /inputs/build. Except for the MELT docker image, all required resources are available in public Google buckets.

Some workflows require a Google Cloud Project ID to be defined in a cloud environment parameter group. Workspace builds require a Terra billing project ID as well. An example is provided at /inputs/values/google_cloud.json but should not be used, as modifying this file will cause tracked changes in the repository. Instead, create a copy in the same directory with the format google_cloud.my_project.json and modify as necessary.

Note that these inputs are required only when certain data are located in requester pays buckets. If this does not apply, users may use placeholder values for the cloud configuration and simply delete the inputs manually.

MELT

Important: The example input files contain MELT inputs that are NOT public (see Requirements). These include:

  • GATKSVPipelineSingleSample.melt_docker and GATKSVPipelineBatch.melt_docker - MELT docker URI (see Docker readme)
  • GATKSVPipelineSingleSample.ref_std_melt_vcfs - Standardized MELT VCFs (GatherBatchEvidence)

The input values are provided only as an example and are not publicly accessible. In order to include MELT, these values must be provided by the user. MELT can be disabled by deleting these inputs and setting GATKSVPipelineBatch.use_melt to false.
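
For example, a built batch json could be converted to a MELT-free configuration with a jq sketch like the following (file names are illustrative):

> jq 'del(."GATKSVPipelineBatch.melt_docker") | ."GATKSVPipelineBatch.use_melt" = false' \
    GATKSVPipelineBatch.my_run.json > GATKSVPipelineBatch.no_melt.json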

Execution

We recommend running the pipeline on a dedicated Cromwell server with a cromshell client. A batch run can be started with the following commands:

> mkdir gatksv_run && cd gatksv_run
> mkdir wdl && cd wdl
> cp $GATK_SV_ROOT/wdl/*.wdl .
> zip dep.zip *.wdl
> cd ..
> # Define your google project id (for Cromwell inputs) and Terra billing project (for workspace inputs)
> echo '{ "google_project_id": "my-google-project-id", "terra_billing_project_id": "my-terra-billing-project" }' > $GATK_SV_ROOT/inputs/values/google_cloud.my_project.json
> # Build the default input jsons, then copy the batch template for this run
> bash $GATK_SV_ROOT/scripts/inputs/build_default_inputs.sh -d $GATK_SV_ROOT -c google_cloud.my_project
> cp $GATK_SV_ROOT/inputs/build/ref_panel_1kg/test/GATKSVPipelineBatch/GATKSVPipelineBatch.json GATKSVPipelineBatch.my_run.json
> cromshell submit wdl/GATKSVPipelineBatch.wdl GATKSVPipelineBatch.my_run.json cromwell_config.json wdl/dep.zip

where cromwell_config.json is a Cromwell workflow options file. Note users will need to re-populate batch/sample-specific parameters (e.g. BAMs and sample IDs).
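
Once submitted, the run can be monitored with cromshell, for example:

> cromshell status                      # status of the most recently submitted workflow
> cromshell metadata > metadata.json    # fetch full workflow metadata for debugging
> cromshell logs                        # list task log locations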

The pipeline consists of a series of modules that perform the following:

  • GatherSampleEvidence: SV evidence collection, including calls from a configurable set of algorithms (Manta, MELT, and Wham), read depth (RD), split read positions (SR), and discordant pair positions (PE).
  • EvidenceQC: Dosage bias scoring and ploidy estimation
  • GatherBatchEvidence: Copy number variant calling using cn.MOPS and GATK gCNV; B-allele frequency (BAF) generation; call and evidence aggregation
  • ClusterBatch: Variant clustering
  • GenerateBatchMetrics: Variant filtering metric generation
  • FilterBatch: Variant filtering; outlier exclusion
  • GenotypeBatch: Genotyping
  • MakeCohortVcf: Cross-batch integration; complex variant resolution and re-genotyping; vcf cleanup
  • Module 07: Downstream filtering, including minGQ, batch effect checks, outlier sample removal, and final recalibration
  • AnnotateVcf: Annotations, including functional annotation, allele frequency (AF) annotation, and AF annotation with external population callsets
  • Module 09: Visualization, including scripts that generate IGV screenshots and RD plots
  • Additional modules to be added: de novo and mosaic scripts

Repository structure:

  • /dockerfiles: Resources for building pipeline docker images
  • /inputs: files for generating workflow inputs
    • /templates: Input json file templates
    • /values: Input values used to populate templates
  • /wdl: WDLs running the pipeline. There is a master WDL for running each module, e.g. ClusterBatch.wdl.
  • /scripts: scripts for running tests, building dockers, and analyzing cromwell metadata files
  • /src: main pipeline scripts
    • /RdTest: scripts for depth testing
    • /sv-pipeline: various scripts and packages used throughout the pipeline
    • /svqc: Python module for checking that pipeline metrics fall within acceptable limits
    • /svtest: Python module for generating various summary metrics from module outputs
    • /svtk: Python module of tools for SV-related datafile parsing and analysis
    • /WGD: whole-genome dosage scoring scripts

A minimum cohort size of 100 is required, and a roughly equal number of males and females is recommended. For modest cohorts (~100-500 samples), the pipeline can be run as a single batch using GATKSVPipelineBatch.wdl.

For larger cohorts, samples should be split up into batches of about 100-500 samples. Refer to the Batching section for further guidance on creating batches.

The pipeline should be executed module by module, in the order listed above.

Note: GatherBatchEvidence requires a trained gCNV model.

For larger cohorts, samples should be split up into batches of about 100-500 samples with similar characteristics. We recommend batching based on overall coverage and dosage score (WGD), which can be generated in EvidenceQC. An example batching process is outlined below:

  1. Divide the cohort into PCR+ and PCR- samples
  2. Partition the samples by median coverage from EvidenceQC, grouping samples with similar median coverage together. The end goal is to divide the cohort into roughly equal-sized batches of about 100-500 samples; if your partitions based on coverage are larger or uneven, you can partition the cohort further in the next step to obtain the final batches.
  3. Optionally, divide the samples further by dosage score (WGD) from EvidenceQC, grouping samples with similar WGD score together, to obtain roughly equal-sized batches of about 100-500 samples
  4. Maintain a roughly equal sex balance within each batch, based on sex assignments from EvidenceQC
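
As a rough illustration of the coverage-partition step only, given a hypothetical two-column TSV of sample IDs and median coverage for the PCR- subset (pcr_minus_coverage.tsv), samples can be sorted by coverage and chunked into candidate batches of 200 (GNU split):

> sort -k2,2n pcr_minus_coverage.tsv | cut -f1 | split -l 200 -d - pcr_minus_batch_

The resulting batches would still need to be refined for WGD similarity and sex balance as described in steps 3-4.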

GATKSVPipelineSingleSample.wdl runs the pipeline on a single sample using a fixed reference panel. An example run with a reference panel containing 156 samples from the NYGC 1000G Terra workspace can be found in inputs/build/NA12878/test (after building inputs).

Both the cohort and single-sample modes use the GATK-gCNV depth calling pipeline, which requires a trained model as input. The samples used for training should be technically homogeneous and similar to the samples to be processed (i.e. same sample type, library prep protocol, sequencer, sequencing center, etc.). The samples to be processed may comprise all or a subset of the training set. For small, relatively homogenous cohorts, a single gCNV model is usually sufficient. If a cohort contains multiple data sources, we recommend training a separate model for each batch or group of batches with similar dosage score (WGD). The model may be trained on all or a subset of the samples to which it will be applied; a reasonable default is 100 randomly-selected samples from the batch (the random selection can be done as part of the workflow by specifying a number of samples to the n_samples_subsample input parameter in /wdl/TrainGCNV.wdl).
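
For example, subsampling could be requested with an input along these lines (the fully-qualified key assumes TrainGCNV is run as the top-level workflow):

  "TrainGCNV.n_samples_subsample": 100,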

New reference panels can be generated easily from a single run of the GATKSVPipelineBatch workflow. If using a Cromwell server, we recommend copying the outputs to a permanent location by adding the following options to the workflow configuration file:

  "final_workflow_outputs_dir" : "gs://my-outputs-bucket",
  "use_relative_output_paths": false,

Here is an example of how to generate workflow input jsons from GATKSVPipelineBatch workflow metadata:

> cromshell -t60 metadata 38c65ca4-2a07-4805-86b6-214696075fef > metadata.json
> python scripts/inputs/create_test_batch.py \
    --execution-bucket gs://my-exec-bucket \
    --final-workflow-outputs-dir gs://my-outputs-bucket \
    metadata.json \
    > inputs/values/my_ref_panel.json
> # Define your google project id (for Cromwell inputs) and Terra billing project (for workspace inputs)
> echo '{ "google_project_id": "my-google-project-id", "terra_billing_project_id": "my-terra-billing-project" }' > inputs/values/google_cloud.my_project.json
> # Build test files for batched workflows (google cloud project id required)
> python scripts/inputs/build_inputs.py \
    inputs/values \
    inputs/templates/test \
    inputs/build/my_ref_panel/test \
    -a '{ "test_batch" : "ref_panel_1kg", "cloud_env": "google_cloud.my_project" }'
> # Build test files for the single-sample workflow
> python scripts/inputs/build_inputs.py \
    inputs/values \
    inputs/templates/test/GATKSVPipelineSingleSample \
    inputs/build/NA19240/test_my_ref_panel \
    -a '{ "single_sample" : "test_single_sample_NA19240", "ref_panel" : "my_ref_panel" }'
> # Build files for a Terra workspace
> python scripts/inputs/build_inputs.py \
    inputs/values \
    inputs/templates/terra_workspaces/single_sample \
    inputs/build/NA12878/terra_my_ref_panel \
    -a '{ "single_sample" : "test_single_sample_NA12878", "ref_panel" : "my_ref_panel" }'

Note that the inputs to GATKSVPipelineBatch may be used as resources for the reference panel and therefore should also be in a permanent location.

The following sections briefly describe each module and highlight inter-dependent input/output files. Note that input/output mappings can also be gleaned from GATKSVPipelineBatch.wdl, and example input templates for each module can be found in /inputs/templates/test.

GatherSampleEvidence

Formerly Module00a

Runs raw evidence collection on each sample with the following SV callers: Manta, Wham, and/or MELT. For guidance on pre-filtering prior to GatherSampleEvidence, refer to the Sample Exclusion section.

Note: a list of sample IDs must be provided. Refer to the sample ID requirements for specifications of allowable sample IDs. IDs that do not meet these requirements may cause errors.

Inputs:

  • Per-sample BAM or CRAM files aligned to hg38. Index files (.bai) must be provided if using BAMs.

Outputs:

  • Caller VCFs (Manta, MELT, and/or Wham)
  • Binned read counts file
  • Split reads (SR) file
  • Discordant read pairs (PE) file

EvidenceQC

Formerly Module00b

Runs ploidy estimation, dosage scoring, and optionally VCF QC. The results from this module can be used for QC and batching.

For large cohorts, this workflow can be run on arbitrary cohort partitions of up to about 500 samples. Afterwards, we recommend using the results to divide samples into smaller batches (~100-500 samples) with ~1:1 male:female ratio. Refer to the Batching section for further guidance on creating batches.

We also recommend using sex assignments generated from the ploidy estimates and incorporating them into the PED file, with sex = 0 for sex aneuploidies.

Prerequisites:

Inputs:

Outputs:

  • Per-sample dosage scores with plots
  • Median coverage per sample
  • Ploidy estimates, sex assignments, with plots
  • (Optional) Outlier samples detected by call counts

The purpose of sample filtering at this stage after EvidenceQC is to prevent very poor quality samples from interfering with the results for the rest of the callset. In general, samples that are borderline are okay to leave in, but you should choose filtering thresholds to suit the needs of your cohort and study. There will be future opportunities (as part of FilterBatch) for filtering before the joint genotyping stage if necessary. Here are a few of the basic QC checks that we recommend:

  • Look at the X and Y ploidy plots, and check that sex assignments match your expectations. If there are discrepancies, check for sample swaps and update your PED file before proceeding.
  • Look at the dosage score (WGD) distribution and check that it is centered around 0 (the distribution of WGD for PCR- samples is expected to be slightly lower than 0, and the distribution of WGD for PCR+ samples is expected to be slightly greater than 0. Refer to the gnomAD-SV paper for more information on WGD score). Optionally filter outliers.
  • Look at the low outliers for each SV caller (samples with much lower than typical numbers of SV calls per contig for each caller). An empty low outlier file means there were no outliers below the median and no filtering is necessary. Check that no samples had zero calls.
  • Look at the high outliers for each SV caller and optionally filter outliers; samples with many more SV calls than average may be poor quality.
  • Remove samples with autosomal aneuploidies based on the per-batch binned coverage plots of each chromosome.

TrainGCNV

Trains a gCNV model for use in GatherBatchEvidence. The WDL can be found at /wdl/TrainGCNV.wdl. See the gCNV training overview for more information.

Prerequisites:

Inputs:

Outputs:

  • Contig ploidy model tarball
  • gCNV model tarballs

GatherBatchEvidence

Formerly Module00c

Runs CNV callers (cn.MOPS, GATK-gCNV) and combines single-sample raw evidence into a batch. See above for more information on batching.

Prerequisites:

Inputs:

  • PED file (updated with EvidenceQC sex assignments, including sex = 0 for sex aneuploidies. Calls will not be made on sex chromosomes when sex = 0 in order to avoid generating many confusing calls or upsetting normalized copy numbers for the batch.)
  • Read count, BAF, PE, SD, and SR files (GatherSampleEvidence)
  • Caller VCFs (GatherSampleEvidence)
  • Contig ploidy model and gCNV model files (gCNV training)

Outputs:

  • Combined read count matrix, SR, PE, and BAF files
  • Standardized call VCFs
  • Depth-only (DEL/DUP) calls
  • Per-sample median coverage estimates
  • (Optional) Evidence QC plots

ClusterBatch

Formerly Module01

Clusters SV calls across a batch.

Prerequisites:

Inputs:

Outputs:

  • Clustered SV VCFs
  • Clustered depth-only call VCF

GenerateBatchMetrics

Formerly Module02

Generates variant metrics for filtering.

Prerequisites:

Inputs:

Outputs:

  • Metrics file

FilterBatch

Formerly Module03

Filters poor quality variants and filters outlier samples. This workflow can be run all at once with the WDL at wdl/FilterBatch.wdl, or it can be run in two steps to enable tuning of outlier filtration cutoffs. The two subworkflows are:

  1. FilterBatchSites: Per-batch variant filtration. Visualize SV counts per sample per type to help choose an IQR cutoff for outlier filtering, and preview outlier samples for a given cutoff
  2. FilterBatchSamples: Per-batch outlier sample filtration; provide an appropriate outlier_cutoff_nIQR based on the SV count plots and outlier previews from step 1. Note that not removing high outliers can result in increased compute cost and a higher false positive rate in later steps.
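
For example, the cutoff chosen from the step 1 plots might be supplied to step 2 with an input like the following (the value 6 is purely illustrative, and the key assumes FilterBatchSamples is run as the top-level workflow):

  "FilterBatchSamples.outlier_cutoff_nIQR": 6,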

Prerequisites:

Inputs:

Outputs:

  • Filtered SV (non-depth-only a.k.a. "PESR") VCF with outlier samples excluded
  • Filtered depth-only call VCF with outlier samples excluded
  • Random forest cutoffs file
  • PED file with outlier samples excluded

MergeBatchSites

Formerly MergeCohortVcfs

Combines filtered variants across batches. The WDL can be found at: /wdl/MergeBatchSites.wdl.

Prerequisites:

Inputs:

Outputs:

  • Combined cohort PESR and depth VCFs

GenotypeBatch

Formerly Module04

Genotypes a batch of samples across unfiltered variants combined across all batches.

Prerequisites:

Inputs:

Outputs:

  • Filtered SV (non-depth-only a.k.a. "PESR") VCF with outlier samples excluded
  • Filtered depth-only call VCF with outlier samples excluded
  • PED file with outlier samples excluded
  • List of SR pass variants
  • List of SR fail variants
  • (Optional) Depth re-genotyping intervals list

RegenotypeCNVs

Formerly Module04b

Re-genotypes probable mosaic variants across multiple batches.

Prerequisites:

Inputs:

Outputs:

  • Re-genotyped depth VCFs

MakeCohortVcf

Formerly Module0506

Combines variants across multiple batches, resolves complex variants, re-genotypes, and performs final VCF clean-up.

Prerequisites:

Inputs:

Outputs:

  • Finalized "cleaned" VCF and QC plots

Module 07 (in development)

Apply downstream filtering steps to the cleaned VCF to further control the false discovery rate; all steps are optional, and users should decide which to apply based on the specific goals of their project.

Filtering methods include:

  • minGQ - remove variants based on the genotype quality across populations. Note: trio families are required to build the minGQ filtering model in this step. For projects that lack family structures, we provide tables pre-trained with 1000 Genomes samples at different FDR thresholds; they can be found at the paths below. These tables assume that GQ has a scale of [0,999], so they will not work with newer VCFs where GQ has a scale of [0,99].
gs://gatk-sv-resources-public/hg38/v0/sv-resources/ref-panel/1KG/v2/mingq/1KGP_2504_and_698_with_GIAB.10perc_fdr.PCRMINUS.minGQ.filter_lookup_table.txt
gs://gatk-sv-resources-public/hg38/v0/sv-resources/ref-panel/1KG/v2/mingq/1KGP_2504_and_698_with_GIAB.1perc_fdr.PCRMINUS.minGQ.filter_lookup_table.txt
gs://gatk-sv-resources-public/hg38/v0/sv-resources/ref-panel/1KG/v2/mingq/1KGP_2504_and_698_with_GIAB.5perc_fdr.PCRMINUS.minGQ.filter_lookup_table.txt
  • BatchEffect - remove variants that show significant discrepancies in allele frequencies across batches
  • FilterOutlierSamplesPostMinGQ - remove outlier samples with unusually high or low number of SVs
  • FilterCleanupQualRecalibration - sanitize filter columns and recalibrate variant QUAL scores for easier interpretation

AnnotateVcf (in development)

Formerly Module08Annotation

Add annotations, such as the inferred function and allele frequencies of variants, to final VCF.

Annotation methods include:

  • Functional annotation - The GATK tool SVAnnotate is used to annotate SVs with their inferred functional consequences on protein-coding regions, regulatory regions such as UTRs and promoters, and other non-coding elements.
  • Allele frequency annotation - annotate SVs with their allele frequencies across all samples, within samples of each sex, and within specific sub-populations.
  • Allele frequency annotation with an external callset - annotate SVs with the allele frequencies of their overlapping SVs in another callset, e.g. the gnomAD-SV callset.

Module 09 (in development)

Visualize SVs with IGV screenshots and read depth plots.

Visualization methods include:

  • RD Visualization - generate RD plots across all samples, ideal for visualizing large CNVs.
  • IGV Visualization - generate IGV plots of each SV for individual samples, ideal for visualizing small de novo SVs.
  • Module09.visualize.wdl - generate RD plots and IGV plots, and combine them for easy review.

CI/CD

This repository is maintained following the norms of continuous integration (CI) and continuous delivery (CD). GATK-SV CI/CD is developed as a set of GitHub Actions workflows, available under the .github/workflows directory. Please refer to the workflows' README for their current coverage and setup.

VM runs out of memory or disk

  • Default pipeline settings are tuned for batches of 100 samples. Larger batches or cohorts may require additional VM resources. Most runtime attributes can be modified through the RuntimeAttr inputs. These are formatted like this in the json:
"MyWorkflow.runtime_attr_override": {
  "disk_gb": 100,
  "mem_gb": 16
},

Note that a subset of the struct attributes can be specified. See wdl/Structs.wdl for available attributes.

Calculated read length causes error in MELT workflow

Example error message from GatherSampleEvidence.MELT.GetWgsMetrics:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: The requested index 701766 is out of counter bounds. Possible cause of exception can be wrong READ_LENGTH parameter (much smaller than actual read length)

This error message was observed for a sample with an average read length of 117, but for which half the reads were of length 90 and half were of length 151. As a workaround, override the calculated read length by providing a read_length input of 151 (or the expected read length for the sample in question) to GatherSampleEvidence.
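
For example (the key assumes GatherSampleEvidence is run as the top-level workflow):

  "GatherSampleEvidence.read_length": 151,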


gatk-sv's Issues

split Module05_06 into separate workflows for 05 and 06

Feature request

Module(s) or script(s) involved

Module05_06.wdl

Description

Consider splitting Module05_06.wdl into two workflows because the metadata for the combined workflow is too large to display in Terra & breaks the Job Manager, even on just ~100 samples.


Too many json files

Maintaining them all is cumbersome and error-prone. We do have docker updates automated, but it vandalizes even the smallest of PRs with tons of unnecessary edits.

We need scripts to build jsons locally. This way fixed resources, used by many workflows (e.g. dockers, reference files, etc.), can be located in a central file. If we use consistent variable names across all wdls (see #16), a script would be able to use womtool to retrieve workflow inputs, and then reference both common and workflow-specific resources automatically.

Migrate to 1KGP testing

Testing data currently contain sensitive data that cannot be released publicly. We should migrate over to a 2-batch test dataset built using the 1000 Genome Project (1KGP) high-coverage data that we are using for the cohort mode Terra workspace (not to be confused with the single-sample mode 1KGP reference panel which is a single batch composed of a different set of samples from 1KGP).

Ideally, all our testing should be self-contained, meaning that prerequisite cohort-dependent inputs for all modules (e.g. vcfs, metrics files, etc.) can be generated from the tests of earlier modules. Therefore, we will need separate tests for batch1 and batch2 starting at GenerateSampleMetricsBatch through FilterBatch and GenotypeBatch. Other downstream modules are run on the whole cohort (batch1 and batch2 together).

We will replace small/large test set designations. In the future, we can think about options to run on a subset of chromosomes to speed up testing. The one exception would be GenerateSampleMetrics - currently we test the batch version of this (GenerateSampleMetricsBatch). We should add another template for GenerateSampleMetrics itself to run on one sample, since this workflow is quite expensive.

A few technical notes:

  • New input values need to be defined for batch1 and batch2 in /input_values. For cohort-level steps (mentioned above), let's define a third inputs file for a 1kgp_test cohort (i.e. 1kgp_test.json).
  • Input data and configurations can be found in inputs/terra_workspaces/cohort_mode (after running scripts/inputs/build_default_inputs.sh). This includes CRAM and gVCF paths, batch membership assignments, and cohort-specific resource files (e.g. ped file).
  • Copy and organize workflow inputs/outputs in gs://gatk-sv-resources-public/test, including metrics generated by enabling run_module_metrics.

Error running tasks04.CountPE

I'm running GATKSVPipelineSingleSample.wdl and I'm facing this error in Module04:

A USER ERROR has occurred: An index is required but was not found for file /tmp/scratch/xxxxx-gwf-core/cromwell-execution/GATKSVPipelineSingleSample/4e9e57d2-48ef-4b52-9cb8-5324c64cfddc/call-Module00c/Module00c/d2cdda29-616b-44ea-8e1a-9225c05e73da/call-EvidenceMerging/EvidenceMerging/c4f03868-e71b-452b-83f2-edf54f0f580b/call-MergePEShards/test_na12878.PE.txt.gz. Support for unindexed block-compressed files has been temporarily disabled. Try running IndexFeatureFile on the input.

However, Module00c also outputs test_na12878.PE.txt.gz.tbi, but it is not being fed to this task. Is this correct?

Thanks in advance

No coercion from '4' of type java.lang.integer to string WDL

Description

I am trying to test GATKSVPipelineBatch.wdl with GATKSVPipelineBatch.ref_panel_1kg.json, and encountered the following log:

No coercion from '4' of type java.lang.integer to string WDL 

Is there any way to solve this problem?
Thank you very much for discussing this with me.

Cannot find generate_rdtest_bed.R in the gatksv/sv-pipeline-qc docker environment

In Module03Qc.wdl,

task VcfExternalBenchmark_to_Rdtest {
  input {
    File vcf_stats
    File comparator
    String sv_pipeline_qc_docker
    RuntimeAttr? runtime_attr_override
  }

  Float input_size = 2 * size(vcf_stats, "GiB")
  Float compression_factor = 5.0
  Float base_disk_gb = 5.0
  Float base_mem_gb = 2.0
  RuntimeAttr runtime_default = object {
    mem_gb: base_mem_gb + compression_factor * input_size,
    disk_gb: ceil(base_disk_gb + input_size * (2.0 + 2.0 * compression_factor)),
    cpu_cores: 1,
    preemptible_tries: 3,
    max_retries: 1,
    boot_disk_gb: 10
  }
  RuntimeAttr runtime_override = select_first([runtime_attr_override, runtime_default])
  runtime {
    memory: "~{select_first([runtime_override.mem_gb, runtime_default.mem_gb])} GiB"
    disks: "local-disk ~{select_first([runtime_override.disk_gb, runtime_default.disk_gb])} HDD"
    cpu: select_first([runtime_override.cpu_cores, runtime_default.cpu_cores])
    preemptible: select_first([runtime_override.preemptible_tries, runtime_default.preemptible_tries])
    maxRetries: select_first([runtime_override.max_retries, runtime_default.max_retries])
    docker: sv_pipeline_qc_docker
    bootDiskSizeGb: select_first([runtime_override.boot_disk_gb, runtime_default.boot_disk_gb])
  }

  command <<<
    set -eu -o pipefail
    # Run benchmarking script
    /opt/sv-pipeline/scripts/vcf_qc/compare_callsets_V2.sh \
      -O ~{vcf_stats}.SENS.bed \
      ~{vcf_stats} \
      ~{comparator}
    /opt/sv-pipeline/scripts/vcf_qc/compare_callsets_V2.sh \
      -O ~{vcf_stats}.SPEC.bed \
      ~{comparator} \
      ~{vcf_stats}
    awk '{if ($8=="NO_OVR" && $9=="NO_OVR" && $3-$2>5000) print}' \
      ~{vcf_stats}.SENS.bed \
      | awk '{if ($5=="DEL" || $5=="DEL_ALU" || $5=="DEL_HERV" || $5=="DEL_LINE1" || $5=="DEL_SVA" || $5=="DUP") print}' \
      > ~{vcf_stats}.missed.large_cnv.bed
    awk '{if ($8=="NO_OVR" && $9=="NO_OVR" && $3-$2>5000) print}' \
      ~{vcf_stats}.SPEC.bed \
      | awk '{if ($5=="DEL" || $5=="DEL_ALU" || $5=="DEL_HERV" || $5=="DEL_LINE1" || $5=="DEL_SVA" || $5=="DUP") print}' \
      > ~{vcf_stats}.newUniq.large_cnv.bed
    Rscript generate_rdtest_bed.R \
      -r ~{vcf_stats}.missed.large_cnv.bed \
      -b ~{comparator} \
      -o ~{vcf_stats}.missed.large_cnv.rdtest.bed
    Rscript generate_rdtest_bed.R \
      -r ~{vcf_stats}.newUniq.large_cnv.bed \
      -b ~{vcf_stats} \
      -o ~{vcf_stats}.newUniq.large_cnv.rdtest.bed
    # Prep outputs
  >>>

  output {
    File sens = "~{vcf_stats}.SENS.bed"
    File spec = "~{vcf_stats}.SPEC.bed"
    File rdtest_missed = "~{vcf_stats}.missed.large_cnv.rdtest.bed"
    File rdtest_newunique = "~{vcf_stats}.newUniq.large_cnv.rdtest.bed"
  }
}

generate_rdtest_bed.R is used in the code, but I did not find this R script in the latest docker image (gatksv/sv-pipeline-qc:b3af2e3 and epiercehoffman/sv-pipeline-qc:eph_03qc_variables-e597d85). Could anyone provide it?

Thanks.

Outlier Sample Removal in Module 2 and 3

We have found that outlier samples with significantly more variants can influence variant metrics (module 2) and subsequent filtering (module 3). The challenge is that some family-based studies have more tolerance for including these samples, so we suggest allowing the option of filtering outliers out for training purposes in modules 2 and 3 while including them in subsequent downstream steps. This could work as follows: at the end of module 1, generate a list of outlier samples akin to the one produced after module 3, though likely more stringent. Exclude these samples from the generation of variant metrics in module 2, unless a variant is carried only by outlier samples; in that case, generate variant metrics as normal. Exclude variants carried only by outlier samples for training purposes in module 3, but include them when using the random forest for assessment.

Update WDL headers

These headers can be deleted from all WDLs:

##########################################################################################

## Base script:   https://portal.firecloud.org/#methods/Talkowski-SV/00_batch_evidence_merging/15/wdl

## Github commit: talkowski-lab/gatk-sv-v1:<ENTER HASH HERE IN FIRECLOUD>

##########################################################################################

For consistency, we should add new headers to all top-level WDLs (GATKSVPipeline*.wdl, Module0*.wdl, etc) and adopt the GATK style, e.g. joint genotyping wdl. Use the description information from the README for each workflow.

Add Module04b to Batch WDL

GATKSVPipelineBatch.wdl currently skips regenotyping (Module04b). This step is important for keeping large, rare depth calls, which are of particular interest, so Module04b should be added to the Batch workflow. This will also require running MergeCohortVcfs to generate some of the intermediate bed files for regenotyping.

Change include/exclude BED nomenclature

Feature request

Module(s) or script(s) involved

Several (TBD)

Description

Per discussion with @mwalker174 + others today, we should change the use of blacklist and whitelist throughout the pipeline to alternative nomenclature, like includelist and excludelist, where appropriate, when someone has bandwidth.

Somatic SV detection

Hello,

I've read the published gnomAD-SV paper and I'd like to use it for my project.
I currently have two projects, one is to detect germline SV from simplex family WGS and
the other one is to detect somatic SV from paired tissues WGS.

For the germline one, I think I can apply this whole workflow straightforwardly.
However, I think I should redesign or optimize the GATK-SV workflow for somatic SV calling.

Which jobs would be recommended if I want to optimize it for somatic SV calling? (e.g., adjusting some parameters)

Thank you for reading this

Dohyeon

Add max gnomAD allele frequency to single sample pipeline output

More feedback from clinical users:

The pipeline currently annotates variants with the overall AF in gnomAD and the AF in each continental subgroup. It would be helpful to provide an additional annotation with the max AF in any of the continental subgroups.

Investigate upgrading Manta to v. 1.6.0

The release notes for v 1.6 (https://github.com/Illumina/manta/releases) say:

This is a significant engineering update from the v1.5.1 release, in which the SV candidate generation step has been changed to a single multi-threaded process to improve task scheduling. This slightly improves average runtime and reduces runtime variability. With this update, support for job distribution over SGE is dropped.

Added
Add configuration option to turn off the evidence signal filter during candidate generation (DRAGEN-1873)
This enables very high sensitivity during high depth tumor-only calling

Changed
Change the SV candidate discovery and genotyping phase from a multi-process to a multi-thread design for better CPU utilization (MANTA-1521)
As a result, runtime is faster and less variable than before.
Runtime improvements vary by workload and server configuration, for typical WGS workloads on a modern server an improvement of ~5-10% may be expected, but improvements of up to 50% have been observed for cases where work previously was poorly distributed across processes.
SGE support is removed with this change.
Update htslib/samtools to 1.9 (MANTA-1483)

It sounds like we might get performance gains by upgrading (we currently use v1.5.0 in the pipeline).

Module04b fails when all regeno_bed files are empty

Bug Report

Affected module(s) or script(s)

Module04b

Description

When all files in the regeno_beds input to Module04b are empty, Module04b.Genotype_2.GetRegenotype fails because it does not produce an output file, so Cromwell fails to de-localize the expected output file.

Expected behavior

Module04b should handle empty inputs rather than failing.
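
A possible guard in the task's command section would be to create the expected output unconditionally, e.g. (a sketch with a hypothetical file name, not the actual task code):

# Ensure the expected output exists even when all regeno_beds inputs are empty
touch regenotyped.bed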

Use GATK to retrieve VCF records in JoinContigFromRemoteVcfs

This almost-working code can replace remote tabix, but there are a few issues:

  1. Downstream scripts expect empty format fields to be present (as ".") but GATK removes them
  2. Duplicate records are problematic for bcftools annotate
  3. Fields should only be replaced when missing (depth-only calls)

    touch subsetted_vcfs.list
    paste ~{write_lines(batches)} ~{write_lines(vcfs)} | while read BATCH VCF_PATH; do
      java -Xmx~{java_mem_mb}M -jar ${GATK_JAR} SelectVariants \
        -V "${VCF_PATH}" \
        -L "~{contig}" \
        -O "tmp.vcf.gz"

      # GATK removed empty FORMAT fields
      bcftools query -f "%CHROM\t%POS\t%REF\t%ALT\t[.\t]\n" tmp.vcf.gz | bgzip -c > ann.tab.gz
      tabix -s1 -b2 -e2 ann.tab.gz
      bcftools annotate -a ann.tab.gz -c CHROM,POS,REF,ALT,FORMAT/PE_GT tmp.vcf.gz \
        | bcftools annotate -a ann.tab.gz -c CHROM,POS,REF,ALT,FORMAT/PE_GQ \
        | bcftools annotate -a ann.tab.gz -c CHROM,POS,REF,ALT,FORMAT/SR_GT \
        | bcftools annotate -a ann.tab.gz -c CHROM,POS,REF,ALT,FORMAT/SR_GQ \
        | sed "s/AN=[0-9]*;//g" \
        | sed "s/AC=[0-9]*;//g" \
        | bgzip -c \
        > $BATCH.~{contig}.vcf.gz
      rm tmp.vcf.gz ann.tab.gz
      tabix $BATCH.~{contig}.vcf.gz
      echo "$BATCH.~{contig}.vcf.gz" >> subsetted_vcfs.list
    done

[Problem] PreprocessPESR.std_delly_vcf: []

Instructions

I am very interested in learning more about GATKSVPipelineBatch.wdl.
I plan to test GATKSVPipelineBatch.wdl with HG00096.final.bam and HG00129.final.bam.
During the test, I found an unusual condition.
The generated directory structure of Module00c.call-PreprocessPESR.call-StandardizeVCFs is shown below.

├── execution
│   ├── docker_cid
│   ├── glob-31de091d1940ddb276ce7b6086a74725
│   │   └── cromwell_glob_control_file
│   ├── glob-31de091d1940ddb276ce7b6086a74725.list
│   ├── manta.HG00096_unsorted.vcf
│   ├── manta.HG00129_unsorted.vcf
│   ├── rc
│   ├── script
│   ├── script.background
│   ├── script.submit
│   ├── std_000.manta.HG00096.vcf.gz
│   ├── std_001.manta.HG00129.vcf.gz
│   ├── stderr
│   ├── stderr.background
│   ├── stdout
│   └── stdout.background
├── inputs
│   ├── 1978095992
│   │   └── HG00129.manta.vcf.gz
│   ├── 745429519
│   │   └── contig.fai
│   └── -841028741
│       └── HG00096.manta.vcf.gz
└── tmp.92bdb17f

The related code is as follows:

  output {
    Array[File] std_vcf = glob("std_*.vcf.gz")
  }

Log output:

PreprocessPESR complete. Final Outputs:
  "PreprocessPESR.std_melt_vcf": null,
  "PreprocessPESR.std_wham_vcf": null,
  "PreprocessPESR.std_delly_vcf": [],
  "PreprocessPESR.std_manta_vcf": []

I tested a previous version of the code with only one BAM file and did not encounter this problem.
Do you have any good suggestions for solving this problem?


Thank you very much.

Single sample mode results benchmarking

Dear @cwhelan,

We have run the single-sample (SS) mode for two samples - HG00514 and HG002, both of which have been extensively studied for SV. We compared the results to the truth sets, but the outcome looks poor. I'm wondering if there's something we missed:

  • Is there a "best practice" or set of parameters we can tweak for SS mode? It looks like the parameters in the json input file are already tuned.
  • Is the 1KGP reference panel based on low-coverage sequences? I noticed someone asked about NYGC high-cov data (#73); will high-cov sequences improve the calling significantly?
  • I assume Broad is running gatk-sv in batch mode using high-cov sequences? Is that the case?

Would you please share your ideas?

Best wishes,
Fengyuan

Running with Docker Image

Hi,

My cluster doesn't support Docker, is there a way to run this on Singularity or is the only way to port the WDLs to change the Docker paths to local paths?

Thanks,
Luke

Upgrade to GATK v4.1.8.0

This will fix a header error encountered during BAF generation with the 1KGP panel and improve gCNV stability.

  • Upgrade GATK docker in all json files
  • Synchronize gCNV WDLs with those from GATK (pending https://github.com/broadinstitute/gatk/pull/6607/files)
  • New gCNV VCF formatting may require changes to src/sv-pipeline/00_preprocessing/scripts/defragment_cnvs.py
  • Rebuild test and 1KGP gCNV models

Manta WDL can break with unexpected cram/bam index extensions

For example, Manta's configuration script always expects the cram index to be at <cram_path>.crai, i.e. ending in .cram.crai. Currently, one can pass non-standard index paths, such as one in a different directory or lacking .cram in the file name. This causes Manta to crash.

This bug may apply to the other callers as well. A fix could use mv to ensure the index has a proper path, as sketched below.
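
A sketch of such a fix inside the caller task's command section (variable names are hypothetical):

# Make sure the index sits next to the CRAM with the name Manta expects
if [ ! -f "~{cram_file}.crai" ]; then
  mv "~{cram_index}" "~{cram_file}.crai"
fi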

Switched order of Number and Type keys in VCF END2 meta info line causes problems in htsjdk

Module0506's postCPX_cleanup.py writes the END2 meta info line as:

##INFO=<ID=END2,Type=Integer,Number=1,Description="Position of breakpoint on CHR2">

The switch from the usual ordering of Number,Type causes htsjdk to throw an error when trying to parse the file:

htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: Tag Type in wrong order (was #2, expected #3) in line <ID=END2,Type=Integer,Number=1,Description="Position of breakpoint on CHR2">
        at htsjdk.variant.vcf.VCF4Parser.parseLine(VCFHeaderLineTranslator.java:172)
        at htsjdk.variant.vcf.VCFHeaderLineTranslator.parseLine(VCFHeaderLineTranslator.java:58)
        at htsjdk.variant.vcf.VCFCompoundHeaderLine.<init>(VCFCompoundHeaderLine.java:215)
        at htsjdk.variant.vcf.VCFInfoHeaderLine.<init>(VCFInfoHeaderLine.java:56)
        at htsjdk.variant.vcf.AbstractVCFCodec.parseHeaderFromLines(AbstractVCFCodec.java:192)
        at htsjdk.variant.vcf.VCFCodec.readActualHeader(VCFCodec.java:111)
        at htsjdk.tribble.AsciiFeatureCodec.readHeader(AsciiFeatureCodec.java:79)
        at htsjdk.tribble.AsciiFeatureCodec.readHeader(AsciiFeatureCodec.java:37)
        at htsjdk.tribble.TribbleIndexedFeatureReader.readHeader(TribbleIndexedFeatureReader.java:261)
        ... 16 more

Error code 1 at DetermineGermlineContigPloidyCaseMode in SingleSample mode

Dear developers,

I'm trying to run the SingleSample pipeline example (inputs/GATKSVPipelineSingleSample.ref_panel_1kg.na12878.no_melt.json), however I got an error at Module00c DetermineGermlineContigPloidyCaseMode step. Please see the detail in the following:

Affected module(s) or script(s)

module: Module00c > gCNVCase > CNVGermlineCaseWorkflow > DetermineGermlineContigPloidyCaseMode
script: wdl/CollectCoverage.wdl

Affected version(s)

  • master branch cloned on 15-Sep-2020

Description

I got the following error message in stderr and stderr.background, the value in rc is 1.

gzip: /cromwell-executions/GATKSVPipelineSingleSample/f4842cc6-581c-4811-9041-bc4f238ad7f7/call-Module00c/Module00c/b866890f-efd3-432b-b517-8c5ec01273e5/call-gCNVCase/CNVGermlineCaseWorkflow/6dbc4a66-fea4-4575-9376-930c883ea04d/call-DetermineGermlineContigPloidyCaseMode/inputs/881714510/condensed_counts.HG02178.tsv.gz has 1 other link -- unchanged

I looked into this and noticed that the potential cause is this line - https://github.com/broadinstitute/gatk-sv/blob/master/wdl/CollectCoverage.wdl#L115,

bgzip condensed_counts.~{sample}.tsv

so I changed it to

bgzip -f condensed_counts.~{sample}.tsv

to force the compression; however, it didn't work. At the moment, I force Cromwell to ignore this error code, but the consequences for the other steps are not clear to me.

Could you please advise on a fix?

Best wishes,
Fengyuan

Module07 task command improvements

Some commands in this wdl can be replaced with faster/more robust bcftools commands, and some commands simply need to be rewritten. See #51 for unaddressed comments.

[ERROR] run GATKSVPipelineBatch.wdl with /gatk-sv/inputs/GATKSVPipelineBatch.ref_panel_1kg.json

Description


I tried to run GATKSVPipelineBatch.wdl with /gatk-sv/inputs/GATKSVPipelineBatch.ref_panel_1kg.json and got the following error:

java.lang.RuntimeException: Failed to evaluate 'BAF_file' (reason 1 of 1): Evaluating select_first([baf_out]) failed: select_first was called with 1 empty values. We needed at least one to be filled.
        at cromwell.engine.workflow.lifecycle.execution.keys.ExpressionKey.processRunnable(ExpressionKey.scala:29)
        at cromwell.engine.workflow.lifecycle.execution.WorkflowExecutionActor.$anonfun$startRunnableNodes$7(WorkflowExecutionActor.scala:536)
        at cats.instances.ListInstances$$anon$1.$anonfun$traverse$2(list.scala:74)
        at cats.instances.ListInstances$$anon$1.loop$2(list.scala:64)
        at cats.instances.ListInstances$$anon$1.$anonfun$foldRight$1(list.scala:64)
        at cats.Eval$.loop$1(Eval.scala:338)
        at cats.Eval$.cats$Eval$$evaluate(Eval.scala:368)
        at cats.Eval$Defer.value(Eval.scala:257)
        at cats.instances.ListInstances$$anon$1.traverse(list.scala:73)
        at cats.instances.ListInstances$$anon$1.traverse(list.scala:12)
        at cats.Traverse$Ops.traverse(Traverse.scala:19)
        at cats.Traverse$Ops.traverse$(Traverse.scala:19)
        at cats.Traverse$ToTraverseOps$$anon$2.traverse(Traverse.scala:19)

Related documents


How to fix it?


I see "GATKSVPipelineBatch.GATKSVPipelinePhase1.run_matrix_qc": "true" in GATKSVPipelineBatch.ref_panel_1kg.json without any mention of baf_files, gvcfs, or snp_vcfs.
I want to know whether I should provide baf_files, gvcfs, or snp_vcfs to solve this problem.


I am anxiously waiting for your reply. Thank you very much.

1000G callset?

Hi there,

Has the pipeline been run on the high coverage 1000G data released by NYGC? I'd love to get my hands on the callset if available!

Thanks,

Jared

whole exome data

Very useful tool. Thank you. Could I ask if I can use this tool for my whole exome data?

validate json script error with womtool v51

Bug Report

Affected module(s) or script(s)

/scripts/test/validate.sh

Description

/scripts/test/validate.sh reports invalid inputs in /test/single-sample/GATKSVPipelineSingleSampleTest.test_na19240.json for wdl/GATKSVPipelineSingleSampleTest.wdl when run on the master branch with womtool-51.jar. When the validate script is run with womtool-47.jar, it is successful. This appears to be a version 51-specific bug rather than a legitimate issue with the input json. It should be addressed once we move to cromwell v51.

Expected behavior (v47)

java -jar /Users/epierceh/womtool-47.jar validate wdl/GATKSVPipelineSingleSampleTest.wdl -i ./test/single-sample/GATKSVPipelineSingleSampleTest.test_na19240.json
Success!
PASS

Actual behavior (v51)

java -jar /usr/local/Cellar/cromwell/51/libexec/womtool.jar validate wdl/GATKSVPipelineSingleSampleTest.wdl -i ./test/single-sample/GATKSVPipelineSingleSampleTest.test_na19240.json
WARNING: Unexpected input provided: GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.GATKSVPipelineSingleSample.SingleSampleMetrics.baseline_genotyped_pesr_vcf (expected inputs: [GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.depth_merge_sample_runtime_attr, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_override_make_cpx_cnv_input_file, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_wham, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.genotype_pesr_pesr_sepcutoff, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.min_large_pesr_depth_overlap_fraction, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.autosome_file, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_split_variants, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.empty_file, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.gcnv_mapping_error_rate, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.reference_index, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_genotype_pe, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.bam_or_cram_index, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.wgd_score_runtime_attr, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.gcnv_model_tars, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.metrics_intervals, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_cram_to_bam, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.clean_vcf_max_shards_per_chrom, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.gcnv_convergence_snr_averaging_window, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.gcnv_num_thermal_advi_iters, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_merge_allo, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.ref_panel_bincov_matrix, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_melt_metrics, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.gcnv_copy_number_posterior_expectation_mode, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.cnmops_mem_gb_override_sample3, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_add_genotypes, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_count_sr, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.min_svsize, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_triple_stream_cat, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_integrate_depth_gq, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.ref_panel_del_bed, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.requester_pays_cram, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.min_large_pesr_call_size_for_filtering, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.delly_docker, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_integrate_gq, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.wgd_build_runtime_attr, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.clean_vcf_min_sr_background_fail_batches, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.genotype_depth_pesr_sepcutoff, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.wham_docker, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.gatk_docker, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.ref_panel_vcf, 
GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_depth_vcf, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.clean_vcf_samples_per_clean_vcf_step2_shard, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.NONE_STRING_, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_filter_vcf_by_id, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.gatk4_jar_override, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.manta_docker, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.ref_pesr_split_files, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.manta_region_bed, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.noncoding_bed, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_depth_merge_pre_01, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.sv_base_mini_docker, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.max_ref_panel_carrier_freq, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_integrate_pesr_gq, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.protein_coding_gtf, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_override_merge_pesr_depth, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.batch, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.gcnv_caller_external_admixing_rate, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_filter_large_pesr, GATKSVPipelineSingleSampleTest.case_sample, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.pesr_svsize, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.gcnv_qs_cutoff, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_pesr_concat, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.promoter_bed, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_depth_concat, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.gcnv_learning_rate, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_merge_sr, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.reference_dict, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.case_manta_vcf, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.gcnv_convergence_snr_trigger_threshold, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.ref_copy_number_autosomal_contigs, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_index_vcf, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.cutoffs, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.ref_std_melt_vcfs, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.gcnv_active_class_padding_hybrid_mode, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.sv_pipeline_rdtest_docker, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_override_rename_variants, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.melt_standard_vcf_header, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.primary_contigs_fai, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.matrix_qc_distance, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_wham_whitelist, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_qc, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.use_wham, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.linux_docker, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_ploidy, 
GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.pesr_blacklist, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.matrix_qc_pesrbaf_runtime_attr, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_override_integrate_resolved_vcfs, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.manta_jobs_per_cpu, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.insert_size, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.primary_contigs_list, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.gcnv_caller_internal_admixing_rate, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.case_melt_vcf, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_melt_coverage, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.Collins_2017_tarball, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_merge_stats, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.melt_metrics_intervals, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.pesr_frac, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.coverage, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.ref_std_manta_vcfs, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.run_vcf_qc, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.gcnv_p_alt, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_case, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.ref_std_wham_vcfs, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.genomes_in_the_cloud_docker, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.genotype_pesr_depth_sepcutoff, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_bundle, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_merge_counts, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.gcnv_caller_update_convergence_threshold, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.ref_pesr_disc_files, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.contig_ploidy_model_tar, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.case_wham_vcf, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.genotyping_n_per_split, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_baf, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.genotype_depth_depth_sepcutoff, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.melt_docker, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.allosome_file, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.n_RD_genotype_bins, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.cytobands, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.use_manta, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.cnmops_blacklist, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.gcnv_max_training_epochs, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.reference_fasta, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.gcnv_adamax_beta_1, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.cnmops_large_min_size, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_pesr, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.genome_file, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.depth_blacklist, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.sv_pipeline_qc_docker, 
GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.cnmops_mem_gb_override_sample10, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_override_clean_background_fail, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.gcnv_log_emission_sampling_median_rel_error, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_count_pe, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.clean_vcf_min_variants_per_shard, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.matrix_qc_rd_runtime_attr, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_override_clean_bothside_pass, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.preprocess_calls_runtime_attr, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_make_subset_vcf, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.gcnv_initial_temperature, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.gcnv_max_copy_number, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.cnmops_sample3_runtime_attr, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.segdups, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_srtest, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.use_delly, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.median_cov_runtime_attr, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.wham_whitelist_bed_file, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.gcnv_log_emission_samples_per_round, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.pe_blacklist, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.pf_reads_improper_pairs, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.ploidy_build_runtime_attr, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.gcnv_max_advi_iter_subsequent_epochs, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.depth_frac, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.clean_vcf_min_records_per_shard_clean_vcf_step1, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.linc_rna_gtf, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.depth_flags, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_rdtest_genotype, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.pct_chimeras, GATKSVPipelineSingleSampleTest.ref_samples, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_melt, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_split_vcf, GATKSVPipelineSingleSampleTest.PlotMetrics.preemptible_attempts, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.median_cov_mem_gb_per_sample, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.cnmops_clean_runtime_attr, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.cnmops_ped_runtime_attr, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.pesr_distance, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_merge_pe, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.Sanders_2015_tarball, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_shard_sr, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.gcnv_disable_annealing, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.use_melt, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.pesr_flags, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_genotype_sr, 
GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_depth_cluster, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.add_sample_to_ped_runtime_attr, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.gcnv_sample_psi_scale, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.manta_region_bed_index, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.bam_or_cram_file, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.allosomal_contigs, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.total_reads, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.qc_definitions, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.clean_vcf_random_seed, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_shard_pe, GATKSVPipelineSingleSampleTest.base_metrics, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.mei_bed, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.annotation_sv_per_shard, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_explode, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.gcnv_max_calling_iters, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.read_length, GATKSVPipelineSingleSampleTest.test_name, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_baf_gather, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.ref_ped_file, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.wgd_scoring_mask, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_set_sample, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.samtools_cloud_docker, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_override_merge_fam_file_list, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.preprocessed_intervals, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.gcnv_convergence_snr_countdown_window, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.ploidy_score_runtime_attr, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.cnmops_sample10_runtime_attr, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.gcnv_max_advi_iter_first_epoch, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.PE_metrics, GATKSVPipelineSingleSampleTest.PlotMetrics.disk_gb, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.gcnv_cnv_coherence_length, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.inclusion_bed, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.case_delly_vcf, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.Werling_2018_tarball, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.gcnv_min_training_epochs, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.condense_counts_docker, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.ref_panel_dup_bed, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_postprocess, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_rewritesrcoords, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.manta_mem_gb_per_job, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.depth_merge_set_runtime_attr, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_concat_vcfs, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.gcnv_adamax_beta_2, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_add_batch, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_rdtest_bed, 
GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.sv_pipeline_docker, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.gcnv_log_emission_sampling_rounds, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_manta, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.rmsk, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.reference_version, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_qc_outlier, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.clean_vcf_max_shards_per_chrom_clean_vcf_step1, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_override_update_sr_list, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.sv_pipeline_base_docker, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.bin_exclude, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.evidence_merging_bincov_runtime_attr, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.tabix_retries, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_pesr_cluster, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.cnmops_docker, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.SR_metrics, GATKSVPipelineSingleSampleTest.PlotMetrics.sv_pipeline_base_docker, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.sv_base_docker, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.gcnv_depth_correction_tau, GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.runtime_attr_merge_pesr_vcfs, GATKSVPipelineSingleSampleTest.PlotMetrics.mem_gib])

This error message is reported for the following "unexpected inputs":

  • GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.GATKSVPipelineSingleSample.SingleSampleMetrics.baseline_cleaned_vcf
  • GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.GATKSVPipelineSingleSample.SingleSampleMetrics.baseline_non_genotyped_unique_depth_calls_vcf
  • GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.GATKSVPipelineSingleSample.SingleSampleMetrics.baseline_genotyped_depth_vcf
  • GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.GATKSVPipelineSingleSample.SingleSampleMetrics.baseline_final_vcf
  • GATKSVPipelineSingleSampleTest.GATKSVPipelineSingleSample.GATKSVPipelineSingleSample.SingleSampleMetrics.baseline_genotyped_pesr_vcf

CollectCounts task ignores bam index parameter

If the bam index name can't be inferred from the bam name, CollectCounts fails with the following error:

A USER ERROR has occurred: Traversal by intervals was requested but some input files are not indexed.

I believe that we need to add the --read-index parameter to our invocation of GATK CollectReadCounts.

This issue may affect other tasks.
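
For illustration, a minimal sketch of the proposed invocation with the index passed explicitly (file paths are placeholders; --read-index is the standard GATK engine argument for supplying an index whose name cannot be inferred):

# Minimal sketch (hypothetical paths): invoke CollectReadCounts with the
# index passed explicitly via --read-index instead of relying on name
# inference from the BAM path.
import subprocess

bam = "sample.bam"
bam_index = "sample_index.bai"  # a name that cannot be inferred from the BAM
intervals = "preprocessed.interval_list"

subprocess.run(
    [
        "gatk", "CollectReadCounts",
        "-I", bam,
        "--read-index", bam_index,  # the proposed fix
        "-L", intervals,
        "--interval-merging-rule", "OVERLAPPING_ONLY",
        "--format", "TSV",
        "-O", "sample.counts.tsv",
    ],
    check=True,
)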

Sample ID cross-checking

Modules 00b through 05_06 have multiple Array[File] and File inputs that assume consistent sample IDs.

For example, Module 01 has manta_vcfs, delly_vcfs, wham_vcfs, melt_vcfs, del_bed, and dup_bed. The sample IDs in these files are never cross-checked for consistency, e.g. that the sample ID in manta_vcfs[i] matches that of wham_vcfs[i] and also exists in the two CNV bed files.

We should implement some simple checks in these modules to catch user input errors, along the lines of the sketch below.
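
A minimal sketch of such a check, assuming single-sample VCFs from each caller and using pysam (function and variable names are hypothetical):

# Minimal sketch (hypothetical names; assumes pysam): verify that the i-th
# VCF from each caller declares the same sample, and that the sample also
# appears in the set of IDs parsed from the CNV bed files.
import pysam

def vcf_samples(path):
    """Return the sample IDs declared in a VCF header."""
    with pysam.VariantFile(path) as vcf:
        return list(vcf.header.samples)

def check_sample_ids(manta_vcfs, wham_vcfs, bed_sample_ids):
    for i, (manta, wham) in enumerate(zip(manta_vcfs, wham_vcfs)):
        manta_ids, wham_ids = vcf_samples(manta), vcf_samples(wham)
        if manta_ids != wham_ids:
            raise ValueError(
                f"Sample mismatch at index {i}: {manta_ids} vs {wham_ids}")
        for sample in manta_ids:
            if sample not in bed_sample_ids:
                raise ValueError(f"{sample} not found in CNV bed files")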

svtk collect-pesr returns inaccurate split read results on small contigs

There's a bug in svtk collect-pesr that causes split reads to be assigned to the incorrect contig in some cases. Specifically, this happens when the previously processed split read maps to the previous contig at a position within 300bp of the current read's position on its own contig. This is only likely when contigs are very small, so in current hg38 data we see it only on alternate, decoy, and HLA contigs; it should not affect current primary-contig calling pipelines. Here is an illustration of the problem:

root@c56df9af077b:/# samtools view /data/test_pesr_bug.bam | cut -f1-6
H57MYDSXX180605:4:1302:29143:33896	177	chrUn_KN707973v1_decoy	1013	60	4S147M
H57MYDSXX180605:4:1633:8341:1595	113	chrUn_KN707974v1_decoy	1090	9	75S76M
H57MYDSXX180605:4:1633:8341:1596	113	chrUn_KN707974v1_decoy	1290	9	75S76M
H57MYDSXX180605:4:1633:8341:1597	113	chrUn_KN707974v1_decoy	1990	9	75S76M
root@c56df9af077b:/# svtk collect-pesr /data/test_pesr_bug.bam ${SAMPLE} test.split.txt test.disc.txt
root@c56df9af077b:/# cat test.split.txt 
chrUn_KN707973v1_decoy	1012	left	1	${SAMPLE}
chrUn_KN707973v1_decoy	1089	left	1	${SAMPLE}
chrUn_KN707973v1_decoy	1289	left	1	${SAMPLE}
chrUn_KN707974v1_decoy	1989	left	1	${SAMPLE}
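
A sketch of the likely fix (hypothetical; this is not the actual svtk source): the test for whether a split position belongs to the current pileup should compare contigs as well as positions:

# Hypothetical sketch -- not the actual svtk code. Consecutive split reads
# should only be pooled into the same pileup when they share a contig and
# fall within the clustering window.
def same_pileup(prev_contig, prev_pos, curr_contig, curr_pos, window=300):
    return prev_contig == curr_contig and abs(curr_pos - prev_pos) < window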

BAF from BAM

BAF generation could be improved by using GATK's CollectAllelicCounts on sample BAMs rather than on gVCFs/VCFs, which are not only more costly to produce but also operationally more complicated (e.g. see #8). I propose collecting counts only at gnomAD SNP sites with >0.1% AF.
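
A rough sketch of how the site list could be derived, assuming a gnomAD sites VCF with an INFO/AF field (paths and field locations are assumptions):

# Rough sketch (hypothetical paths; assumes INFO/AF in the gnomAD sites
# VCF): emit biallelic SNP sites with AF > 0.1% in GATK interval format,
# for use with CollectAllelicCounts -L.
import pysam

MIN_AF = 0.001

with pysam.VariantFile("gnomad.sites.vcf.gz") as vcf, \
        open("baf_sites.intervals", "w") as out:
    for rec in vcf:
        alts = rec.alts or ()
        if len(alts) != 1 or len(rec.ref) != 1 or len(alts[0]) != 1:
            continue  # keep biallelic SNPs only
        af = rec.info.get("AF")
        af = af[0] if isinstance(af, tuple) else af
        if af is not None and af > MIN_AF:
            out.write(f"{rec.chrom}:{rec.pos}\n")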

Merge cohort vcfs bug

merge_vcfs.py in sv-pipeline checks for duplicates based only on start/stop position and svtype:

def records_match(record, other):
    """Test if two records are same SV"""
    return (record.pos == other.pos and
            record.stop == other.stop and
            record.info['SVTYPE'] == other.info['SVTYPE'])

The start/stop contigs should also be checked, as in the sketch below.
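
Something along these lines (a sketch; it assumes the stop contig is stored in INFO/CHR2, as is conventional for SV VCFs):

def records_match(record, other):
    """Test if two records are the same SV, including contig checks."""
    return (record.chrom == other.chrom and
            record.pos == other.pos and
            record.info.get('CHR2') == other.info.get('CHR2') and
            record.stop == other.stop and
            record.info['SVTYPE'] == other.info['SVTYPE'])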

Remove bash try-catch from GermlineCNVCallerCaseMode

In GermlineCNVCase.wdl, GermlineCNVCallerCaseMode implements a pseudo try-catch statement to work around rare NaN errors. Development is ongoing to pull this retry into the GATK tool itself; once that is complete, the existing task command block should be simplified:

  • The try-catch block (actually two {} command blocks separated by ||) should be removed.
  • The function run_gcnv_case() should simply be inlined as code to be executed.
  • The function get_seeded_random() should be removed.

hg37 samples

Hi and thank you very much for this new and clear repository.

I was wondering whether this method is somehow suitable for hg37 samples, or if it is strictly limited to hg38.

Thank you very much in advance!

How to make a 1kg_ref_panel_v1.ped?

Description

Recently, I wanted to use gs://gcp-public-data--broad-references/hg38/v0/sv-resources/ref-panel/1KG/v1/ped/1kg_ref_panel_v1.ped.
The content is as follows:

HG00129 HG00129 0 0 1 0
HG00140 HG00140 0 0 1 0

I want to make a file like this for my own dataset, but I have no idea what these columns mean. Where can I find an explanation of them?


Thank you very much.

Replace NAs with `.` in single sample BED file output

Feedback from clinical users mentioned that NA was once used as a gene name, so it can be confusing to have it in output fields; they recommended replacing it with `.`.

It might be nice to do this (at least optionally) in svtk vcf2bed, which currently seems to use NA by default for missing values.
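
In the meantime, a trivial post-processing sketch (file names are placeholders) that rewrites literal NA fields as `.`:

# Sketch (placeholder file names): replace whole-field "NA" values with "."
# in a tab-delimited BED so they cannot be mistaken for a gene symbol.
import csv

with open("variants.bed") as fin, open("variants.dot.bed", "w") as fout:
    reader = csv.reader(fin, delimiter="\t")
    writer = csv.writer(fout, delimiter="\t", lineterminator="\n")
    for row in reader:
        writer.writerow(["." if field == "NA" else field for field in row])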

Wham emits DEL/DUP calls with SVLEN=0

These appear to be misclassified insertions, and at least some of them are called as INS by manta. Several candidate solutions need to be evaluated (a sketch of options 2 and 3 follows the list):

  1. Upgrading wham (currently 1.7.0)
  2. Converting these calls to INS
  3. Filtering these calls

In addition, these sites are not clustered properly in Module 01.
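
For reference, a sketch of what options 2 and 3 might look like with pysam (file names hypothetical; this is not a tested remediation):

# Sketch of options 2 and 3 (hypothetical file names): reclassify DEL/DUP
# records with SVLEN=0 as INS, or drop them, depending on CONVERT_TO_INS.
import pysam

CONVERT_TO_INS = True  # False = filter the records out instead (option 3)

with pysam.VariantFile("wham.vcf") as vcf_in, \
        pysam.VariantFile("wham.fixed.vcf", "w", header=vcf_in.header) as vcf_out:
    for rec in vcf_in:
        svlen = rec.info.get("SVLEN")
        svlen = svlen[0] if isinstance(svlen, tuple) else svlen
        if rec.info.get("SVTYPE") in ("DEL", "DUP") and svlen == 0:
            if not CONVERT_TO_INS:
                continue                # option 3: filter the call
            rec.info["SVTYPE"] = "INS"  # option 2: convert to INS
        vcf_out.write(rec)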

Out of memory at CNMOPS in SingleSample mode

Dear developers,

I'm trying to run the SingleSample pipeline example (inputs/GATKSVPipelineSingleSample.ref_panel_1kg.na12878.no_melt.json); however, the run was killed due to an out-of-memory issue. Details below:

Affected module(s) or script(s)

module: Module00c > CNMOPS or CNMOPSLarge
script:

  • runcnMOPS.R
  • cnMOPS_workflow.sh
  • CNMOPS.wdl

Affected version(s)

  • master branch cloned on 15-Sep-2020

Description

I noticed that a few R processes for runcnMOPS.R took all the memory (120G), but it's not clear why this happened or how to adjust the number of processes.

[Screenshot: memory usage of the runcnMOPS.R processes]

Could you please advise?

Best wishes,
Fengyuan

MELT.melt_standard_vcf_header

Hi guys, quick question. I downloaded MELT and built the docker image. How can I generate the MELT.melt_standard_vcf_header file?

Add number of samples in cohort output from MergeCohortVcfs

We are planning to add a "number of samples in the cohort" input to Module04.GenotypeDepthPart2.GetRegenotype to enable determination of pre-genotyping variant frequency in the cohort. For now, this can be a ballpark figure or determined from other sources, but going forward we should determine the number of samples in the cohort - after filtering - during MergeCohortVcfs and feed it into Module04. This can be done simply by reading the header of each batch VCF in merge_vcfs.py and keeping a running count of samples, as in the sketch below.
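
A minimal sketch of that counter (function name hypothetical; assumes pysam):

# Minimal sketch (hypothetical function name): count the unique sample IDs
# declared across the headers of the batch VCFs being merged.
import pysam

def count_cohort_samples(batch_vcfs):
    samples = set()
    for path in batch_vcfs:
        with pysam.VariantFile(path) as vcf:
            samples.update(vcf.header.samples)
    return len(samples)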

Avoid repeated calculation

Hey,

I was wondering whether, in BatchEvidenceMerging.wdl, it wouldn't make sense to avoid this calculation in the MergeEvidenceFilesByContig task:

Int disk_size_gb = 10 + ceil(size(files, "GB") * disk_size_factor)

Since this task is scattered, each shard will recompute this value even though it is the same for all of them. So, what do you think about calculating it once outside the scatter and passing it in as an input?

Peg GATK release in sv-base Dockerfile

We can first use a permanent commit from master once PrintSVEvidence is merged (currently a commit hash from a development branch is used, which may become unreachable if the branch is deleted or rebased). Ultimately this should be pegged to a GATK release.

Formatting improvements

To-be-addressed from https://github.com/talkowski-lab/gatk-sv-v1/pull/252:

  • Some workflows currently require index files as inputs (e.g. module00c/CNMOPS). They should either be given default values or removed as inputs altogether.
  • Some tasks require index files as inputs. This matters in some cases because tasks are called from other workflows, requiring passing the input files around (e.g. ClinicalFiltering.wdl::resetFilter line 392, called from GATKSVPipelineClinical).
  • Because of the previous points, some workflows are still passing around index files. Note that some index files have _idx in the file name and some have _index.
  • Camel case convention: I think in most projects acronyms are treated as "words", hence CNMOPS -> Cnmops, GATKSVPipelinePhase1 -> GatkSvPipelinePhase1. We've just been treating them as a grey area. I don't think this is a big deal, but since you're going through and making a camel case fix, I wanted to take the time to get an explicit statement on how we treat acronyms.
  • There are a few remaining references to "fam_file" instead of "ped_file"
    -- "runtime_override_merge_fam_file_list" in GATKSVPipelineClinical.wdl
    -- "cleaned_fam_file" in master_vcf_qc.wdl
    -- "fam_file" in several of the mosaic WDLs

Reconfigure external benchmarking datasets for automatic QC in Module05_06

Currently, Module05_06 takes the following as optional inputs:

Array[File]? thousand_genomes_tarballs
Array[File]? hgsv_tarballs
Array[File]? asc_tarballs
File? sanders_2015_tarball
File? collins_2017_tarball
File? werling_2018_tarball

These files are used to ensure appropriate calibration of allele frequencies, etc., and (where applicable) to directly compare sensitivity and PPV per sample vs. previous callsets.

Not an urgent request, but we should eventually reconfigure these inputs to flexibly accept any number of cohort-level and sample-level tarballs if external benchmarking datasets are desired. For example, I would imagine gnomAD & 1kG being the default cohort-level benchmarking datasets, with a default of zero per-sample benchmarking datasets (although either of these defaults could be overridden by the user depending on their cohort/study design, etc.).

Happy to discuss in more detail if/when anyone reworks the standard QC subroutines in Module05_06.
