nf-core / exoseq

Please consider using/contributing to https://github.com/nf-core/sarek

Home Page: http://nf-co.re

License: MIT License

Languages: Nextflow 94.10%, Python 3.68%, Dockerfile 0.74%, Shell 1.48%
Topics: nf-core, nextflow, workflow, exome, exome-sequencing, genomics, variant-calling

exoseq's Introduction

nf-core/ExoSeq


This is still a work in progress, but will hopefully soon reach a stable state and be published as a release.

Introduction

nf-core/ExoSeq is a bioinformatics analysis pipeline that performs best-practice analysis of exome sequencing data.

The pipeline is built around the GATK best practices using Nextflow, a bioinformatics workflow tool. The main steps performed by the pipeline are the following (more information about the processes can be found here).

  • Alignment
  • Marking Duplicates
  • Recalibration
  • Realignment
  • Variant Calling
  • Variant Filtration
  • Variant Evaluation
  • Variant Annotation

Documentation

The nf-core/ExoSeq pipeline comes with documentation, found in the docs/ directory:

  1. Pipeline installation and configuration instructions
  2. Pipeline configuration
  3. Running the pipeline
  4. Output and how to interpret the results
  5. Troubleshooting

Credits

The pipeline was initially developed by Senthilkumar Panneerselvam (@senthil10) with a little help from Phil Ewels (@ewels) at the National Genomics Infrastructure, part of SciLifeLab in Stockholm, and has been extended by Alex Peltzer (@apeltzer) and Marie Gauder (@mgauder) from QBIC Tuebingen/Germany, as well as Marc Hoeppner (@marchoeppner) from IKMB Kiel/Germany.

Many thanks also to others who have helped out along the way, including @pditommaso and @colindaven.

exoseq's People

Contributors: alneberg, apeltzer, ewels, maxulysse, senthil10


exoseq's Issues

Enable multi-lane support

Some sequencing setups will split libraries across lanes. This is currently not modeled in the pipeline.

Using a CSV to keep track of IndividualID and sampleID, we could do something along these lines:

runBWAOutput_grouped_by_sample = runBWAOutput.groupTuple(by: [0,1])

process mergeBamFiles_bySample {

    tag "${indivID}|${sampleID}"

    input:
    set indivID, sampleID, file(aligned_bam_list) from runBWAOutput_grouped_by_sample

    output:
    set indivID, sampleID, file(merged_bam) into mergedBamFile_by_Sample

    script:
    merged_bam = sampleID + "_merged.bam"

    """
    java -Djava.io.tmpdir=tmp/ -jar ${PICARD} MergeSamFiles \
        INPUT=${aligned_bam_list.join(' INPUT=')} \
        OUTPUT=${merged_bam} \
        CREATE_INDEX=false \
        CREATE_MD5_FILE=false \
        SORT_ORDER=coordinate
    """
}

Move FastQC into trimGalore

Move the FastQC analysis into Trim Galore using its built-in --fastqc option. This way we get metrics about the data actually used for alignment rather than the initial raw data (which sequencing centers usually provide anyway).
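A minimal sketch of such a process, assuming DSL1-style channels (the channel names and output globs are assumptions, not the pipeline's current code):

process trim_galore {

    tag "$name"

    input:
    set val(name), file(reads) from read_files_trimming

    output:
    set val(name), file('*.fq.gz') into trimmed_reads
    file '*_fastqc.{zip,html}' into fastqc_results
    file '*trimming_report.txt' into trimgalore_results

    script:
    """
    trim_galore --paired --gzip --fastqc ${reads}
    """
}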

Include switch to skip GenotypeGVCFs?

GenotypeGVCFs requires > 30 exomes. We can't test this properly unfortunately, but we should aim to provide

  • a possibility to include additional (pre-computed?) gVCFs so the step runs properly even if a user has fewer than 30 exome gVCFs at hand
  • a possibility to skip this step entirely (see the sketch below)
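A minimal sketch of the skip switch, assuming a new --skip_gvcf_genotyping flag (the flag name, channel names and the GATK4-style command are assumptions, not the pipeline's current code):

params.skip_gvcf_genotyping = false

process genotypegvcfs {

    tag "joint genotyping"

    input:
    file(gvcf) from combined_gvcf // assumed output of an upstream CombineGVCFs step

    output:
    file('joint.vcf.gz') into joint_genotyped_vcf

    // Skip joint genotyping entirely when the flag is set
    when:
    !params.skip_gvcf_genotyping

    script:
    """
    gatk GenotypeGVCFs -R ${params.fasta} -V ${gvcf} -O joint.vcf.gz
    """
}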

Cheers

Configuration - nest assemblies and references

At the moment, the config file does not group genome assemblies with e.g. kit targets/baits. Since the latter are usually relative to a given coordinate system (assembly), they should be grouped. For example:

params {

    genomes {

        'hg19' {
            fasta = "/ifs/data/nfs_share/ikmb_repository/references/gatk/bundle/2.8/hg19/ucsc.hg19.fasta"
            dict = "/ifs/data/nfs_share/ikmb_repository/references/gatk/bundle/2.8/hg19/ucsc.hg19.dict"
            dbsnp = "/ifs/data/nfs_share/ikmb_repository/references/gatk/bundle/2.8/hg19/dbsnp_138.hg19.vcf.gz"
            gold = "/ifs/data/nfs_share/ikmb_repository/references/gatk/bundle/2.8/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz"
            g1k = "/ifs/data/nfs_share/ikmb_repository/references/gatk/bundle/2.8/hg19/1000G_phase1.indels.hg19.sites.vcf.gz"
            omni = "/ifs/data/nfs_share/ikmb_repository/references/gatk/bundle/2.8/hg19/1000G_omni2.5.hg19.sites.vcf.gz"
            hapmap = "/ifs/data/nfs_share/ikmb_repository/references/gatk/bundle/2.8/hg19/hapmap_3.3.hg19.sites.vcf.gz"
            kits {
                'Nextera' {
                    targets = "/ifs/data/nfs_share/ikmb_repository/references/exomes/nextera_exome_target_2017/nexterarapidcapture_exome_target_v1.2_hg19.interval_list"
                    baits = "/ifs/data/nfs_share/ikmb_repository/references/exomes/nextera_exome_target_2017/nexterarapidcapture_exome_intervals_v1.2_hg19.interval_list"
                }
                'xGen' {
                    targets = "/ifs/data/nfs_share/ikmb_repository/references/exomes/idt_xgen/v1.0/xgen-exome-research-panel-targets.interval_list"
                    baits = "/ifs/data/nfs_share/ikmb_repository/references/exomes/idt_xgen/v1.0/xgen-exome-research-panel-probes.interval_list"
                }
            }
        }
    }
}
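With such a layout, the pipeline could resolve every reference and kit file from two command-line options, e.g. --genome hg19 --kit xGen. A minimal sketch of the lookup (the option names are assumptions):

// Look up assembly- and kit-specific files from the nested config
genome_cfg = params.genomes[ params.genome ]

REF     = file(genome_cfg.fasta)
DBSNP   = file(genome_cfg.dbsnp)
TARGETS = file(genome_cfg.kits[ params.kit ].targets)
BAITS   = file(genome_cfg.kits[ params.kit ].baits)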

Split MultiQC output into fastq, library and sample

If we add an input mode that supports sample IDs and multi-lane sequencing configs, it would make sense to split the MultiQC reports into these three tiers (a sketch follows the list):

the fastq file(s)

  • fastqc report

the sequencing library

  • MarkDuplicate stats

the sample

  • Picard HS
  • alignment stats
  • picard multiple metrics (CollectMultipleMetrics)
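A minimal sketch of producing one MultiQC report per tier, assuming the per-tier metric channels already exist (channel names and report names are assumptions):

process multiqc_per_tier {

    publishDir "${params.outdir}/MultiQC", mode: 'copy'

    input:
    file('fastq/*')   from fastqc_results.collect()
    file('library/*') from markduplicates_metrics.collect()
    file('sample/*')  from picard_sample_metrics.collect()

    output:
    file('*_report.html')

    script:
    """
    multiqc -n fastq_report fastq/
    multiqc -n library_report library/
    multiqc -n sample_report sample/
    """
}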

Add HSMetrics

Picard tools includes comprehensive metrics for exome analysis (target coverage, off-target reads, etc.): CollectHsMetrics.

Include this as a MultiQC input after duplicate marking.

Example code (report attached): sample_custom_targets_only.html.zip

process runHybridCaptureMetrics {

    tag "${indivID}|${sampleID}"
    publishDir "${OUTDIR}/Common/${indivID}/${sampleID}/Processing/Picard_Metrics", mode: 'copy'

    input:
    set indivID, sampleID, file(bam), file(bai) from runPrintReadsOutput_for_HC_Metrics

    output:
    file(outfile) into HybridCaptureMetricsOutput mode flatten

    script:
    mem_adjust = 1 // JVM heap headroom in GB; assumed value, as it is not defined in the original snippet
    outfile = sampleID + ".hybrid_selection_metrics.txt"

    """
        java -XX:ParallelGCThreads=1 -Xmx${task.memory.toGiga()-mem_adjust}G -Djava.io.tmpdir=tmp/ -jar $PICARD CollectHsMetrics \
                INPUT=${bam} \
                OUTPUT=${outfile} \
                TARGET_INTERVALS=${TARGETS} \
                BAIT_INTERVALS=${BAITS} \
                REFERENCE_SEQUENCE=${REF} \
                TMP_DIR=tmp
        """
}

Parallelization for HaplotypeCaller

To speed up the HaplotypeCaller step, parallelization can be introduced per chromosome (or per interval), merging the resulting gVCFs afterwards. This will be particularly useful for reducing the time needed to analyze WGS samples.
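A rough sketch of per-chromosome scattering followed by a gVCF merge (channel names and the GATK4-style commands are assumptions, not the pipeline's current code):

chromosomes = Channel.from( (1..22) + ['X', 'Y'] ).map { "chr${it}" }

process haplotypecaller_by_chrom {

    tag "${sampleID}|${chrom}"

    input:
    set sampleID, file(bam), file(bai) from recalibrated_bams
    each chrom from chromosomes

    output:
    set sampleID, file("${sampleID}.${chrom}.g.vcf.gz") into per_chrom_gvcfs

    script:
    """
    gatk HaplotypeCaller -R ${params.fasta} -I ${bam} -L ${chrom} -ERC GVCF -O ${sampleID}.${chrom}.g.vcf.gz
    """
}

process merge_gvcfs {

    tag "${sampleID}"

    input:
    set sampleID, file(gvcfs) from per_chrom_gvcfs.groupTuple()

    output:
    set sampleID, file("${sampleID}.g.vcf.gz") into merged_gvcfs

    script:
    """
    gatk MergeVcfs ${gvcfs.collect{ "-I $it" }.join(' ')} -O ${sampleID}.g.vcf.gz
    """
}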

Remove support for genome processing

Genome processing would require a somewhat different flow to fully leverage HPC systems (interval-based parallelization starting at HaplotypeCaller). I suggest removing WGS support from the design and having the Sarek pipeline as the go-to option for that.

Joint Discovery merge with main script

In case we don't have > 30 exomes, we could simply set up the VQSR recalibration step to take 30 exome samples from 1000G and use these as calibration samples. On bigger clusters, we could even store these so they don't have to be reprocessed every time we analyse a single exome sample.

What do you think, @marchoeppner?
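A minimal sketch of how the calibration samples could be fed in, assuming a new --calibration_gvcfs parameter pointing at the pre-computed 1000G gVCFs (parameter and channel names are assumptions):

// Optional set of pre-computed 1000G exome gVCFs used only as calibration samples
params.calibration_gvcfs = false

calibration_gvcfs = params.calibration_gvcfs ? Channel.fromPath(params.calibration_gvcfs) : Channel.empty()

// Mix the cohort's own gVCFs with the calibration set before joint genotyping / VQSR
gvcfs_for_joint_genotyping = sample_gvcfs.mix(calibration_gvcfs)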

Switch to a CSV/TSV-based input

For the sake of pulling in relevant metadata, I suggest using a CSV/TSV file as the default input format rather than a folder with a bunch of FastQ files.

Suggested format would be:

IndivID;SampleID;libraryID;rgID;rgPU;platform;platform_model;Center;Date;R1;R2

Peter;Germline;G00077-L2;HGJJMBBXX.3.G00077-L2;HGJJMBBXX.3.TCCTGAGC+ATAGAGAG;Illumina;NextSeq500;IKMB;2018-02-06;/ifs/data/nfs_share/sukmb352/projects/pipelines/exomes/trio/original_sequences/G00077-L2_S20_L003_R1_001.fastq.gz;/ifs/data/nfs_share/sukmb352/projects/pipelines/exomes/trio/original_sequences/G00077-L2_S20_L003_R2_001.fastq.gz

Peter;Tumor;G00078-L2;HGJJMBBXX.3.G00078-L2;HGJJMBBXX.3.GGACTCCT+ATAGAGAG;Illumina;NextSeq500;IKMB;2018-02-06;/ifs/data/nfs_share/sukmb352/projects/pipelines/exomes/trio/original_sequences/G00078-L2_S21_L003_R1_001.fastq.gz;/ifs/data/nfs_share/sukmb352/projects/pipelines/exomes/trio/original_sequences/G00078-L2_S21_L003_R2_001.fastq.gz
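Reading such a sample sheet in Nextflow could look roughly like this (the params.samples name is an assumption; the column names follow the suggested header above):

Channel.fromPath(params.samples)
    .splitCsv(sep: ';', header: true)
    .map { row ->
        [ row.IndivID, row.SampleID, row.libraryID, row.rgID, row.rgPU,
          row.platform, row.platform_model, row.Center, row.Date,
          file(row.R1), file(row.R2) ]
    }
    .set { reads_from_samplesheet }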
