nf-core / raredisease

Call and score variants from WGS/WES of rare disease patients.

Home Page: https://nf-co.re/raredisease

License: MIT License

HTML 0.64% Python 2.88% Nextflow 96.02% Groovy 0.29% Shell 0.16%
nf-core nextflow workflow pipeline wgs wes variant-calling snv structural-variants variant-annotation

raredisease's Issues

Add feature/MarkDuplicates

Is your feature request related to a problem? Please describe

Describe the solution you'd like

Add this finishing touch to the mapping subworkflow so that the preprocessing of BAM files is complete before branching out into other tools, e.g. variant callers.

Describe alternatives you've considered

Additional context

Add bcftools/annotate

Description of feature

Hello 👋, we use this to add additional annotations after vcfanno, and to include a header with software and case info.

Create an interactive chart to use during pipeline development

To complement the project board, we would like an interactive chart that reflects the progress of the development work and that can easily be modified, e.g. when we want to include more tools.
nf-core recommends LucidChart or Google Drawings for this task. For the moment we are going with Google Drawings.

Draft overview of future pipeline

nf-raredisease

This overview is based on the WGS/WES rare disease pipeline (MIP) currently in use at Clinical Genomics Stockholm. It outlines the basic functionality and modules that we would like from a pipeline specialised in calling, annotating and scoring variants relevant for rare disease patients.

Overview

Fastq files are prepared for variant calling by alignment with bwa-mem/bwa-mem2 followed by MarkDuplicates. From this point the workflow splits into an SNV/indel part and an SV part.

SNV/Indels

SNV/indels are primarily called with DeepVariant/GLnexus, with the possibility of turning on the GATK HaplotypeCaller workflow. These two callsets can be combined into one for maximum sensitivity. Vcfanno annotates the callset with population allele frequencies (gnomAD) and predicted pathogenicity (CADD). Common variation is removed from the callset and CADD scores are calculated for indels. VEP is used for transcript annotation, including annotation with ClinVar, SpliceAI and pLI scores. The SNV/indels are split into a clinical callset and a research callset based on a bed file with genes of interest. Finally, the variants are ranked for predicted pathogenicity based on their annotations as well as their modes of inheritance.
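The clinical/research split described above can be sketched as follows. This is an illustrative Python sketch, not pipeline code: `split_callset` and the pair-based variant representation are assumptions; the real pipeline works on VEP-annotated VCF records and a bed file of panel genes.

```python
def split_callset(variants, panel_genes):
    """Split annotated variants into a clinical and a research callset.

    variants: iterable of (variant_id, gene) pairs (simplified stand-in
    for annotated VCF records); panel_genes: genes of interest from the
    bed file.
    """
    clinical, research = [], []
    for variant_id, gene in variants:
        # clinical callset: restricted to the genes-of-interest panel
        if gene in panel_genes:
            clinical.append(variant_id)
        else:
            research.append(variant_id)
    return clinical, research
```

Note that in practice the research callset typically contains the full set of variants; the sketch keeps the two sets disjoint only for simplicity.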

SV

We use Cnvnator, Manta, (Delly) and Tiddit to call structural variants. Using SVDB we combine the variants into one callset, and using a local frequency database we remove common variants and sequencing/calling artefacts. The callset is annotated with vcfanno and VEP, followed by a split into a clinical callset and a research callset. The SVs are then ranked in the same manner as the SNVs.

But wait, there's more

Aside from SNVs and SVs, the pipeline identifies and visualizes runs of homozygosity/autozygosity as well as UPDs. Also included are the identification and annotation of pathogenic STRs with ExpansionHunter and Stranger. SMNCopyNumberCaller is used to diagnose patients with spinal muscular atrophy.

The tools mentioned here are not set in stone and we are certainly open to adding and changing tools as we continue development. Below is a list of tools used in the workflow.

Bcftools
BedTools
BWA
CADD
Chanjo
Chromograph
Cnvnator
Cyrius
Delly
Deepvariant
Expansionhunter
FastQC
GATK
GENMOD
Gffcompare
Glnexus
Manta
MultiQC
Peddy
PicardTools
PLINK
Rhocall
Sambamba
Samtools
SMNCopyNumberCaller
Stranger
Svdb
Telomerecat
Tiddit
Upd
Vcf2cytosure
Vcfanno
VEP

Bcftools norm

Is your feature request related to a problem? Please describe

Normalize and split multi allelic variants using bcftools norm prior to annotation

Describe the solution you'd like

Incorporate the bcftools norm module from nf-core modules.
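To illustrate the decomposition step, here is a minimal Python sketch of splitting a multi-allelic record into one record per ALT allele. It is a simplification: real `bcftools norm -m-` also rewrites genotypes and per-allele INFO fields and can left-align indels, all omitted here.

```python
def split_multiallelic(vcf_line):
    """Split one multi-allelic VCF data line into one line per ALT allele."""
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt = fields[:5]
    rest = fields[5:]
    # one output record per comma-separated ALT allele
    return ["\t".join([chrom, pos, vid, ref, a] + rest) for a in alt.split(",")]
```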

Describe alternatives you've considered

We could use vt decompose and normalize

Additional context

Add svdb/merge to pipeline

Description of feature

Add this to call_structural_variants.nf to combine VCFs from manta, cnvpytor, tiddit

Include a default variant catalog file

Maybe a default file for variant_catalog (in case the user doesn't provide one) should still be added? What do you think? If this should be included in this merge, I can try to look into how this could be done in prepare_genome.nf.

It's not a bad idea. However, I think we can go ahead and merge this one and add that option in a small PR later. We could bundle it with the pipeline or have it as a URL: https://raw.githubusercontent.com/Illumina/ExpansionHunter/master/variant_catalog/hg19/variant_catalog.json
There has also been a discussion about adding a download workflow which would automatically download all the references.

Originally posted by @jemten in #51 (comment)
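The fallback discussed above could work along these lines; `resolve_variant_catalog` is a hypothetical helper (in Nextflow, the same effect would come from a params default in prepare_genome.nf), and the default URL is the one mentioned in the discussion.

```python
# hg19 catalog shipped with ExpansionHunter (URL from the discussion above)
DEFAULT_CATALOG_URL = (
    "https://raw.githubusercontent.com/Illumina/ExpansionHunter/"
    "master/variant_catalog/hg19/variant_catalog.json"
)

def resolve_variant_catalog(user_catalog=None):
    """Return the user-supplied variant catalog if given, else the bundled default."""
    return user_catalog if user_catalog else DEFAULT_CATALOG_URL
```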

RevertSam-GATK

RevertSam

Produce unmapped BAM (uBAM) from aligned BAM

New modules required: Sentieon

Here is a list of Sentieon tools that are relevant for the pipeline and for which issues have been opened in https://github.com/nf-core/modules.

Another tool that might be relevant but for which there is no open issue at the moment:

  • WgsMetricsAlgo

Add TIDDIT/cov module

Our in-house pipeline (MIP) uses this tool and we want to add this to the nextflow pipeline. It's not part of nf-core modules yet: nf-core/modules#792.

Once the module is added to nf-core/modules, it'll be added to the subworkflow qc_bam.nf

Adding read groups to meta

It would be good to add read_group to meta so that bwa_mem2 can use it, as well as other future programs (e.g. peddy needs it).

I have tested adding the line:
meta.read_group = "'@RG\tID:" + row.sample + "_" + row.fastq_1.split('/')[-1].split('R1*.fastq')[0] + "_" + row.lane + "\tPL:ILLUMINA\tSM:" + row.sample.split('_')[0] + "'"
in subworkflows/local/input_check.nf.
But it creates issues when GLnexus needs to combine the different channels again (see nextflow log).
This problem does, however, not arise with
meta.read_group = "'@RG\tID:myid\tPL:ILLUMINA\tSM:" + row.sample.split('_')[0] + "'"

The problem arises both with a single sample and with multiple samples in the samplesheet.
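For reference, the shape of the read-group string can be sketched in Python; `make_read_group` is a hypothetical helper, and the ID scheme (sample plus lane) is just one way to keep IDs unique per run, not the pipeline's final convention.

```python
def make_read_group(sample, lane, platform="ILLUMINA"):
    """Build a read-group string for bwa-mem2's -R option.

    The literal backslash-t ("\\t") is intentional: bwa expands it
    itself, so the string must not contain real tab characters.
    """
    rg_id = f"{sample}_{lane}"
    sm = sample.split("_")[0]  # sample name without any suffix after '_'
    return f"@RG\\tID:{rg_id}\\tPL:{platform}\\tSM:{sm}"
```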

Parse input vcf to check for normalization

Is your feature request related to a problem? Please describe

We need to know that the input VCFs used in, for example, the annotation process have been decomposed.

Describe the solution you'd like

Write a small script that parses the header and checks for the bcftools norm command.
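A minimal sketch of such a check. It relies on bcftools recording its invocation in the header as a `##bcftools_normCommand=...` line, which is bcftools' usual convention but should be verified against the bcftools version in use; `has_norm_in_header` is an illustrative name.

```python
def has_norm_in_header(header_lines):
    """Return True if a bcftools norm command is recorded in the VCF header.

    header_lines: iterable of '##'-prefixed header lines from the VCF.
    """
    return any(line.startswith("##bcftools_normCommand=") for line in header_lines)
```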

Describe alternatives you've considered

Additional context

Java memory issue on SLURM

Check Documentation

I have checked the following places for your error:

Description of the bug

Steps to reproduce

Steps to reproduce the behaviour:

  1. nextflow run nf-core/raredisease -profile test,singularity,hasta,dev_prio -r dev (-c customconf.conf )
  2. See error:

Without customconf.conf

[dd/f36687] NOTE: Process `NFCORE_RAREDISEASE:RAREDISEASE:ALIGN_BWAMEM2:MARKDUPLICATES (1234N)` terminated with an error exit status (134) -- Execution is retried (1)
WARN: Input tuple does not match input set cardinality declared by process `NFCORE_RAREDISEASE:RAREDISEASE:DEEPVARIANT_CALLER:GLNEXUS` -- offending value: [id:caseydonkey]
Error executing process > 'NFCORE_RAREDISEASE:RAREDISEASE:ALIGN_BWAMEM2:MARKDUPLICATES (1234N)'

Caused by:
  Process `NFCORE_RAREDISEASE:RAREDISEASE:ALIGN_BWAMEM2:MARKDUPLICATES (1234N)` terminated with an error exit status (134)

Command executed:

  picard \
      -Xmx6g \
      MarkDuplicates \
      --CREATE_INDEX \
      -I 1234N.bam \
      -O 1234N_sorted.bam \
      -M 1234N_sorted.MarkDuplicates.metrics.txt

  cat <<-END_VERSIONS > versions.yml
  MARKDUPLICATES:
      markduplicates: $(echo $(picard MarkDuplicates --version 2>&1) | grep -o 'Version:.*' | cut -f2- -d:)
  END_VERSIONS

Command exit status:
  134

Command output:
  #
  # A fatal error has been detected by the Java Runtime Environment:
  #
  #  Internal Error (g1PageBasedVirtualSpace.cpp:43), pid=211157, tid=211219
  #  guarantee(rs.is_reserved()) failed: Given reserved space must have been reserved already.
  #
  # JRE version:  (11.0.9.1) (build )
  # Java VM: OpenJDK 64-Bit Server VM (11.0.9.1-internal+0-adhoc..src, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
  # No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
  #
  # An error report file with more information is saved as:
  # hs_err_pid211157.log
  #
  #

Command error:
  /usr/local/bin/picard: line 5: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8): No such file or directory
  /usr/local/bin/picard: line 66: 211157 Aborted                 /usr/local/bin/java -Xmx6g -jar /usr/local/share/picard-2.25.7-0/picard.jar MarkDuplicates "--CREATE_INDEX" "-I" "1234N.bam" "-O" "1234N_sorted.bam" "-M" "1234N_sorted.MarkDuplicates.metrics.txt"

With customconf.conf:

process {
    withName: PICARD_MARKDUPLICATES {
        memory = 5.GB
    }
}
Error executing process > 'NFCORE_RAREDISEASE:RAREDISEASE:ALIGN_BWAMEM2:MARKDUPLICATES (1234N)'

Caused by:
  Process `NFCORE_RAREDISEASE:RAREDISEASE:ALIGN_BWAMEM2:MARKDUPLICATES (1234N)` terminated with an error exit status (1)

Command executed:

  picard \
      -Xmx5g \
      MarkDuplicates \
      --CREATE_INDEX \
      -I 1234N.bam \
      -O 1234N_sorted.bam \
      -M 1234N_sorted.MarkDuplicates.metrics.txt

  cat <<-END_VERSIONS > versions.yml
  MARKDUPLICATES:
      markduplicates: $(echo $(picard MarkDuplicates --version 2>&1) | grep -o 'Version:.*' | cut -f2- -d:)
  END_VERSIONS

Command exit status:
  1

Command output:
  Error occurred during initialization of VM
  Could not reserve enough space for 5242880KB object heap

Command error:
  /usr/local/bin/picard: line 5: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8): No such file or directory

Expected behaviour

Successful completion of the analysis

Log files

Have you provided the following extra information/files:

  • The command used to run the pipeline
  • The .nextflow.log file

System

  • Hardware: HPC, hasta
  • Executor: slurm
  • OS: CentOS
  • Version: 7

Nextflow Installation

  • Version: 21.04.3.5560

Container engine

  • Engine: singularity
  • version: 3.1.1-1.el7

Quick fix that solves the problem until a more elegant solution is found:

modules/nf-core/modules/picard/markduplicates/main.nf:
avail_mem = task.memory.giga - 2
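The idea behind the quick fix, in a language-neutral sketch (the actual change is the one-line Groovy edit above): reserve headroom between the scheduler's memory limit and the JVM heap, since the JVM also needs off-heap memory and SLURM kills the job when the cgroup limit is exceeded. `heap_gb` is an illustrative helper.

```python
def heap_gb(task_memory_gb, headroom_gb=2):
    """JVM -Xmx value in GB: task memory minus headroom for off-heap JVM usage.

    Never go below 1 GB so small allocations still get a usable heap.
    """
    return max(1, task_memory_gb - headroom_gb)
```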

Next related issue

A similar error occurs for bamqc.

Additional context

For the first error, markduplicates:

nextflow-customconf.log
nextflow-no-customconf.log

Add picardtools collecthsmetrics to BamQC subworkflow

Our in-house pipeline uses this tool and we want to add this to the nextflow pipeline. It's not part of nf-core modules yet: nf-core/modules#793.

Once the module is added to nf-core/modules, it'll be added to the subworkflow qc_bam.nf

EDIT: The module is part of nf-core/modules now. Please go ahead and add it to the subworkflow qc_bam.nf

VEP

Is your feature request related to a problem? Please describe

Add VEP from nf-core modules

Describe the solution you'd like

Describe alternatives you've considered

Additional context

Gens preprocessing

Description of feature

Add preprocessing for Gens to the pipeline.

  • GATK CollectReadCounts added to nfcore/modules
  • GATK DenoiseReadCounts added to nfcore/modules
  • Gens perl-scripts added as a local module
  • Local subworkflow added
  • Subworkflow added to main workflow

Create subworkflow to prepare indices

Is your feature request related to a problem? Please describe

The pipeline currently rebuilds the bwamem2 index on every run. To save resources, there should be a check for existing indices that can be used instead.

Describe the solution you'd like

This subworkflow should 1) check for existing reference index files and 2) allow the re-use of indices in different downstream processes.
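The existence check could work along these lines. The extension list reflects the files `bwa-mem2 index` writes to the best of our knowledge (verify against your bwa-mem2 version), and `missing_bwamem2_index_files` is a hypothetical helper, not pipeline code.

```python
import os

# Files produced by `bwa-mem2 index <fasta>` (assumed; check your bwa-mem2 version)
BWAMEM2_EXTS = (".0123", ".amb", ".ann", ".bwt.2bit.64", ".pac")

def missing_bwamem2_index_files(fasta_path):
    """Return the index files that are absent; [] means the index is reusable."""
    return [fasta_path + ext for ext in BWAMEM2_EXTS
            if not os.path.exists(fasta_path + ext)]
```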

Describe alternatives you've considered

Additional context

Mitochondria workflow

We have agreed to use the mitochondria workflow currently implemented at GATK best practices.

The following steps are included. Modules already exist for some of them; all modules need to be included in a subworkflow. We plan to have the mitochondrial subworkflow run by default, but to have the possibility to turn it off and also to turn off the calling of variants for the autosomes.

  • Samtools subsampling [nf-core/raredisease] #49
  • RevertSam [nf-core/raredisease]#106
  • SamtoFastq [nf-core/raredisease]#107
  • BWA
  • GATK MergeBamAlignment
  • Picard MarkDuplicates
  • Haplocheck [nf-core/raredisease]#111
  • call variants with GATK Mutect2
  • Picard LiftoverVCF
  • GATK Mergevcfs [nf-core/raredisease]#113
  • GATK4 [FilterMutectCalls] [nf-core/raredisease]#115
  • GATK Filterblacklist
  • annotation with HmtNote
  • annotation with VEP. Because it requires a database, special care has to be taken to run it offline. Cf. nf-core/sarek.
  • bcftools query and
  • bcftools view to prepare input for haplogrep2
  • call mitochondrial haplogroup with haplogrep2 (To be included in workflow)
  • detect mitochondrial deletions with eKLIPse (not in bioconda) (To be included in workflow) (to be checked)

This list can be modified as new issues are created and new modules are added.

Test dataset including mtDNA: https://github.com/nf-core/test-datasets/tree/raredisease

update current module versions

Is your feature request related to a problem? Please describe

Samtools and MultiQC are outdated.

Describe the solution you'd like

Update their versions 😃

Describe alternatives you've considered

Additional context

add tiddit/sv

Description of feature

In MIP, we combine the callsets from manta, tiddit/sv, and cnvnator using svdb. We should add tiddit/sv.

Add Vcfanno to nf-core modules

Is your feature request related to a problem? Please describe

Describe the solution you'd like

Describe alternatives you've considered

Additional context

Update the way module versions are emitted

Is your feature request related to a problem? Please describe

nf-core/modules updated the way versions are emitted: from <software>.version.txt to versions.yml. This allows emitting multiple versions in cases where a module or subworkflow uses multiple tools. Updated documentation here.

This pipeline has not been updated accordingly yet.

Describe the solution you'd like

Update the subworkflows and the main workflow accordingly 😄

Describe alternatives you've considered

Additional context

SamtoFastQ

SamtoFastQ

Convert SAM or BAM file to FastQ

refactor alignment modules

Description of feature

Currently, this code snippet lives in the raredisease.nf script, but when there are more mappers/aligners in the picture we should hide the logic away in a bigger subworkflow with switches for which tool to use. This way we can declutter the raredisease.nf script.

if (params.aligner == 'bwamem2') {
        ALIGN_BWAMEM2 (
            INPUT_CHECK.out.reads,
            PREPARE_GENOME.out.bwamem2_index
        )
...

turns into...

if (aligner == 'bwamem2') {
        ALIGN_BWAMEM2 (
            reads,
            bwamem2_index
        )
...

stowed in the bigger subworkflow 👍 - where aligner is a value defined in the take: block.

Add vcfanno to the annotation workflow

Is your feature request related to a problem? Please describe

Describe the solution you'd like

Describe alternatives you've considered

Additional context
