nf-core / exoseq

Please consider using/contributing to https://github.com/nf-core/sarek

Home Page: http://nf-co.re

License: MIT License

Languages: Nextflow 94.10%, Python 3.68%, Dockerfile 0.74%, Shell 1.48%
Topics: nf-core, nextflow, workflow, exome, exome-sequencing, genomics, variant-calling

exoseq's Introduction

nf-core/ExoSeq


This is still a work in progress, but will hopefully soon reach a stable state and be published as a release.

Introduction

nf-core/ExoSeq is a bioinformatics analysis pipeline that performs best-practice analysis of exome sequencing data.

The pipeline is built around the GATK best practices using Nextflow, a bioinformatics workflow tool. The main steps performed by the pipeline are the following (more information about the processes can be found here).

  • Alignment
  • Marking Duplicates
  • Recalibration
  • Realignment
  • Variant Calling
  • Variant Filtration
  • Variant Evaluation
  • Variant Annotation

Documentation

The nf-core/ExoSeq pipeline comes with documentation, found in the docs/ directory:

  1. Pipeline installation and configuration instructions
  2. Pipeline configuration
  3. Running the pipeline
  4. Output and how to interpret the results
  5. Troubleshooting

Credits

The pipeline was initially developed by Senthilkumar Panneerselvam (@senthil10) with a little help from Phil Ewels (@ewels) at the National Genomics Infrastructure, part of SciLifeLab in Stockholm, and has been extended by Alex Peltzer (@apeltzer) and Marie Gauder (@mgauder) from QBIC Tuebingen/Germany, as well as Marc Hoeppner (@marchoeppner) from IKMB Kiel/Germany.

Many thanks also to others who have helped out along the way, including @pditommaso and @colindaven.

exoseq's People

Contributors: alneberg, apeltzer, ewels, maxulysse, senthil10


exoseq's Issues

Enable multi-lane support

Some sequencing setups will split libraries across lanes. This is currently not modeled in the pipeline.

Using a CSV to keep track of IndividualID and sampleID, we could do something along these lines:

runBWAOutput_grouped_by_sample = runBWAOutput.groupTuple(by: [0,1])

process mergeBamFiles_bySample {

    tag "${indivID}|${sampleID}"

    input:
    set indivID, sampleID, file(aligned_bam_list) from runBWAOutput_grouped_by_sample

    output:
    set indivID, sampleID, file(merged_bam) into mergedBamFile_by_Sample

    script:
    merged_bam = sampleID + "_merged.bam"

    """
    java -Djava.io.tmpdir=tmp/ -jar ${PICARD} MergeSamFiles \
        INPUT=${aligned_bam_list.join(' INPUT=')} \
        OUTPUT=${merged_bam} \
        CREATE_INDEX=false \
        CREATE_MD5_FILE=false \
        SORT_ORDER=coordinate
    """
}

Move FastQC into trimGalore

Move the FastQC analysis into Trim Galore using its built-in --fastqc option. This way we get metrics about the data actually used for alignment rather than the initial raw data (which sequencing centers usually provide anyway).
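A minimal sketch of such a process, assuming DSL1-style channels (the channel names and output globs are assumptions, not the pipeline's current code):

process trim_galore {

    tag "$name"

    input:
    set val(name), file(reads) from read_files_trimming

    output:
    set val(name), file('*.fq.gz') into trimmed_reads
    file '*_fastqc.{zip,html}' into fastqc_results
    file '*trimming_report.txt' into trimgalore_results

    script:
    """
    trim_galore --paired --gzip --fastqc ${reads}
    """
}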

Include switch to skip GenotypeGVCFs?

GenotypeGVCFs requires > 30 exomes. We can't test this properly unfortunately, but we should aim to provide

  • a possibility to include additional (pre-computed?) gVCFs so the step runs properly even if a user has fewer than 30 exome gVCFs at hand
  • a possibility to skip this step entirely (see the sketch below)
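A minimal sketch of the skip switch, assuming a new --skip_gvcf_genotyping flag (the flag name, channel names and the GATK4-style command are assumptions, not the pipeline's current code):

params.skip_gvcf_genotyping = false

process genotypegvcfs {

    tag "joint genotyping"

    input:
    file(gvcf) from combined_gvcf // assumed output of an upstream CombineGVCFs step

    output:
    file('joint.vcf.gz') into joint_genotyped_vcf

    // Skip joint genotyping entirely when the flag is set
    when:
    !params.skip_gvcf_genotyping

    script:
    """
    gatk GenotypeGVCFs -R ${params.fasta} -V ${gvcf} -O joint.vcf.gz
    """
}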

Cheers

Configuration - nest assemblies and references

At the moment, the config file does not group genome assemblies with e.g. kit targets/baits. Since the latter are usually relative to a given coordinate system (assembly), they should be grouped. For example:

params {

    genomes {

        'hg19' {
            fasta = "/ifs/data/nfs_share/ikmb_repository/references/gatk/bundle/2.8/hg19/ucsc.hg19.fasta"
            dict = "/ifs/data/nfs_share/ikmb_repository/references/gatk/bundle/2.8/hg19/ucsc.hg19.dict"
            dbsnp = "/ifs/data/nfs_share/ikmb_repository/references/gatk/bundle/2.8/hg19/dbsnp_138.hg19.vcf.gz"
            gold = "/ifs/data/nfs_share/ikmb_repository/references/gatk/bundle/2.8/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz"
            g1k = "/ifs/data/nfs_share/ikmb_repository/references/gatk/bundle/2.8/hg19/1000G_phase1.indels.hg19.sites.vcf.gz"
            omni = "/ifs/data/nfs_share/ikmb_repository/references/gatk/bundle/2.8/hg19/1000G_omni2.5.hg19.sites.vcf.gz"
            hapmap = "/ifs/data/nfs_share/ikmb_repository/references/gatk/bundle/2.8/hg19/hapmap_3.3.hg19.sites.vcf.gz"
            kits {
                'Nextera' {
                    targets = "/ifs/data/nfs_share/ikmb_repository/references/exomes/nextera_exome_target_2017/nexterarapidcapture_exome_target_v1.2_hg19.interval_list"
                    baits = "/ifs/data/nfs_share/ikmb_repository/references/exomes/nextera_exome_target_2017/nexterarapidcapture_exome_intervals_v1.2_hg19.interval_list"
                }
                'xGen' {
                    targets = "/ifs/data/nfs_share/ikmb_repository/references/exomes/idt_xgen/v1.0/xgen-exome-research-panel-targets.interval_list"
                    baits = "/ifs/data/nfs_share/ikmb_repository/references/exomes/idt_xgen/v1.0/xgen-exome-research-panel-probes.interval_list"
                }
            }
        }
    }
}
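With such a layout, the pipeline could resolve every reference and kit file from two command-line options, e.g. --genome hg19 --kit xGen. A minimal sketch of the lookup (the option names are assumptions):

// Look up assembly- and kit-specific files from the nested config
genome_cfg = params.genomes[ params.genome ]

REF     = file(genome_cfg.fasta)
DBSNP   = file(genome_cfg.dbsnp)
TARGETS = file(genome_cfg.kits[ params.kit ].targets)
BAITS   = file(genome_cfg.kits[ params.kit ].baits)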

Split MultiQC output into fastq, library and sample

If we add an input mode that supports sample IDs and multi-lane sequencing configs, it would make sense to split the MultiQC reports into these three tiers (a sketch follows the list):

the fastq file(s)

  • fastqc report

the sequencing library

  • MarkDuplicate stats

the sample

  • Picard HS
  • alignment stats
  • picard multiple metrics (CollectMultipleMetrics)
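A minimal sketch of producing one MultiQC report per tier, assuming the per-tier metric channels already exist (channel names and report names are assumptions):

process multiqc_per_tier {

    publishDir "${params.outdir}/MultiQC", mode: 'copy'

    input:
    file('fastq/*')   from fastqc_results.collect()
    file('library/*') from markduplicates_metrics.collect()
    file('sample/*')  from picard_sample_metrics.collect()

    output:
    file('*_report.html')

    script:
    """
    multiqc -n fastq_report fastq/
    multiqc -n library_report library/
    multiqc -n sample_report sample/
    """
}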

Add HSMetrics

Picard tools includes comprehensive metrics for exome analysis (target coverage, off-target reads, etc.): CollectHsMetrics.

Include this as a MultiQC input after duplicate marking.

Example code (report attached): sample_custom_targets_only.html.zip

process runHybridCaptureMetrics {

    tag "${indivID}|${sampleID}"
    publishDir "${OUTDIR}/Common/${indivID}/${sampleID}/Processing/Picard_Metrics", mode: 'copy'

    input:
    set indivID, sampleID, file(bam), file(bai) from runPrintReadsOutput_for_HC_Metrics

    output:
    file(outfile) into HybridCaptureMetricsOutput mode flatten

    script:
    mem_adjust = 1 // JVM heap headroom in GB; assumed value, as it is not defined in the original snippet
    outfile = sampleID + ".hybrid_selection_metrics.txt"

    """
        java -XX:ParallelGCThreads=1 -Xmx${task.memory.toGiga()-mem_adjust}G -Djava.io.tmpdir=tmp/ -jar $PICARD CollectHsMetrics \
                INPUT=${bam} \
                OUTPUT=${outfile} \
                TARGET_INTERVALS=${TARGETS} \
                BAIT_INTERVALS=${BAITS} \
                REFERENCE_SEQUENCE=${REF} \
                TMP_DIR=tmp
        """
}

Parallelization for HaplotypeCaller

To speed up the HaplotypeCaller step, parallelization can be introduced per chromosome (or per interval), merging the resulting gVCFs afterwards. This will be particularly useful for reducing the time needed to analyze WGS samples.
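A rough sketch of per-chromosome scattering followed by a gVCF merge (channel names and the GATK4-style commands are assumptions, not the pipeline's current code):

chromosomes = Channel.from( (1..22) + ['X', 'Y'] ).map { "chr${it}" }

process haplotypecaller_by_chrom {

    tag "${sampleID}|${chrom}"

    input:
    set sampleID, file(bam), file(bai) from recalibrated_bams
    each chrom from chromosomes

    output:
    set sampleID, file("${sampleID}.${chrom}.g.vcf.gz") into per_chrom_gvcfs

    script:
    """
    gatk HaplotypeCaller -R ${params.fasta} -I ${bam} -L ${chrom} -ERC GVCF -O ${sampleID}.${chrom}.g.vcf.gz
    """
}

process merge_gvcfs {

    tag "${sampleID}"

    input:
    set sampleID, file(gvcfs) from per_chrom_gvcfs.groupTuple()

    output:
    set sampleID, file("${sampleID}.g.vcf.gz") into merged_gvcfs

    script:
    """
    gatk MergeVcfs ${gvcfs.collect{ "-I $it" }.join(' ')} -O ${sampleID}.g.vcf.gz
    """
}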

Remove support for genome processing

Genome processing would require a somewhat different flow to fully leverage HPC systems (interval-based parallelization starting at HaplotypeCaller). I suggest removing WGS support from the design and having the Sarek pipeline as the go-to option for that.

Joint Discovery merge with main script

In case we don't have > 30 exomes, we could simply set up the VQSR recalibration step to take 30 exome samples from 1000G and use these as calibration samples. On bigger clusters, we could even store these so they don't have to be reprocessed every time we analyse a single exome sample.

What do you think, @marchoeppner?
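A minimal sketch of how the calibration samples could be fed in, assuming a new --calibration_gvcfs parameter pointing at the pre-computed 1000G gVCFs (parameter and channel names are assumptions):

// Optional set of pre-computed 1000G exome gVCFs used only as calibration samples
params.calibration_gvcfs = false

calibration_gvcfs = params.calibration_gvcfs ? Channel.fromPath(params.calibration_gvcfs) : Channel.empty()

// Mix the cohort's own gVCFs with the calibration set before joint genotyping / VQSR
gvcfs_for_joint_genotyping = sample_gvcfs.mix(calibration_gvcfs)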

Switch to a CSV/TSV-based input

For the sake of pulling in relevant metadata, I suggest using a CSV/TSV file as the default input format rather than a folder with a bunch of FastQ files.

Suggested format would be:

IndivID;SampleID;libraryID;rgID;rgPU;platform;platform_model;Center;Date;R1;R2

Peter;Germline;G00077-L2;HGJJMBBXX.3.G00077-L2;HGJJMBBXX.3.TCCTGAGC+ATAGAGAG;Illumina;NextSeq500;IKMB;2018-02-06;/ifs/data/nfs_share/sukmb352/projects/pipelines/exomes/trio/original_sequences/G00077-L2_S20_L003_R1_001.fastq.gz;/ifs/data/nfs_share/sukmb352/projects/pipelines/exomes/trio/original_sequences/G00077-L2_S20_L003_R2_001.fastq.gz

Peter;Tumor;G00078-L2;HGJJMBBXX.3.G00078-L2;HGJJMBBXX.3.GGACTCCT+ATAGAGAG;Illumina;NextSeq500;IKMB;2018-02-06;/ifs/data/nfs_share/sukmb352/projects/pipelines/exomes/trio/original_sequences/G00078-L2_S21_L003_R1_001.fastq.gz;/ifs/data/nfs_share/sukmb352/projects/pipelines/exomes/trio/original_sequences/G00078-L2_S21_L003_R2_001.fastq.gz
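Reading such a sample sheet in Nextflow could look roughly like this (the params.samples name is an assumption; the column names follow the suggested header above):

Channel.fromPath(params.samples)
    .splitCsv(sep: ';', header: true)
    .map { row ->
        [ row.IndivID, row.SampleID, row.libraryID, row.rgID, row.rgPU,
          row.platform, row.platform_model, row.Center, row.Date,
          file(row.R1), file(row.R2) ]
    }
    .set { reads_from_samplesheet }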
