tonyliang19 / variant_catalogue_pipeline

This project forked from scorreard/variant_catalogue_pipeline

Variant catalogue pipeline test

License: Apache License 2.0

Shell 0.09% Python 2.28% R 1.16% HTML 95.07% Nextflow 1.41%

variant_catalogue_pipeline's Introduction

Important disclaimer about this repo

This repo is a clone of https://github.com/wassermanlab/Variant_catalogue_pipeline and a work in progress.

https://github.com/wassermanlab/Variant_catalogue_pipeline must be considered the main repo and is the one to cite or mention if you use this code.

The purpose of this repo is to migrate the variant catalogue pipeline to the nf-core standard.

It is possible to use this repo to test the pipeline on public, "reduced", dummy data.

The test data contains FASTQ files from 5 individuals (regions of chr20, chrX, chrY, chrM) and the reference genome (GRCh38) for chr20, chrX, chrY and chrM. It also contains a subset of the files necessary to run some of the tools.

To test the pipeline on your server, you need Nextflow, Singularity and conda installed, then run:

git clone https://github.com/scorreard/Variant_catalogue_pipeline.git
gzip -d Variant_catalogue_pipeline/testdata/reference/hg38_full_analysis_set_plus_decoy_hla_chr20_X_Y_MT.fa.gz 
nextflow run scorreard/Variant_catalogue_pipeline -resume -latest -r main -profile GRCh38

Because the FASTA reference files were too big to be uploaded as one file, I broke them down, hence the step to merge them before running the pipeline.
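
Before running the commands above, it may help to confirm that the three required tools are on the PATH. The following is a minimal sketch, not part of the pipeline: `check_tools` is a hypothetical helper name, and it only checks that each command can be found, not which version is installed.

```shell
# Hypothetical helper (not part of the pipeline): report which commands are missing.
# Prints one "missing: <name>" line per absent command and returns non-zero if any is absent.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# The tools named in the instructions above.
check_tools nextflow singularity conda || echo "Install the missing tools before running the test."
```

If nothing is printed, all three commands were found.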

The test is complete for the SNV and MT parts.

The test is incomplete for the SV part, as ExpansionHunter v5 (for STR calling) and MELT (for MEI calling) do not have containers.

The test can only be run with GRCh38.

If the test fails, please create an issue. Thanks.

Variant catalogue Pipeline

Introduction

The variant catalogue pipeline is a workflow designed to generate variant catalogues (lists of variants and their frequencies in a population) from whole genome sequences.

The variant catalogue pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It takes Whole Genome Sequence (WGS) data as input and outputs multiple VCF files that include the variant allele frequencies in the cohort and some basic annotation.
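
To make "variant allele frequency in the cohort" concrete, here is a small sketch of the usual definition: the number of alternate alleles observed divided by the number of called alleles. It is an illustration only and not the pipeline's implementation. `allele_frequency` is a hypothetical helper, the genotypes are made-up examples in VCF GT notation, and any non-reference allele is counted as alternate, which simplifies multi-allelic sites.

```shell
# Hypothetical helper (illustration only): allele frequency from VCF-style genotypes.
# Each argument is one individual's genotype, e.g. 0/1 or 1|1; "." marks a missing allele.
allele_frequency() {
    printf '%s\n' "$@" |
        tr '/|' '\n\n' |
        awk 'NF && $1 != "." { an++; if ($1 != "0") ac++ }
             END { if (an > 0) printf "%.3f\n", ac / an; else print "NA" }'
}

# Five individuals at one site: 3 alternate alleles out of 8 called alleles.
allele_frequency 0/0 0/1 1/1 0/0 ./.   # prints 0.375
```

Missing genotypes are excluded from the denominator, so an individual with no call does not lower the frequency.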

The variant catalogue pipeline includes detection of Single Nucleotide Variants (SNV), small insertions and deletions (indels), Mitochondrial variants, Structural Variants (SV), Mobile Element Insertions (MEI), and Short Tandem Repeats (STR). The output variant catalogue can be generated for GRCh37 and/or GRCh38 human reference genomes.

Figure: Overview of the variant catalogue pipeline

Pipeline description

The variant catalogue pipeline is composed of four sub-workflows, represented by the grey boxes in the figure. This structure allows users to run the pipeline as a whole or to run only the sub-workflow(s) of interest. Each sub-workflow is composed of modules, which call open-access genomic software.

A more detailed description of the pipeline will be available soon.

Detailed representation of the variant catalogue pipeline: CAFE_supp.pdf

Pipeline availability

The variant catalogue pipeline is implemented in the Nextflow framework and relies only on open-access tools, so any user with sufficient compute capacity should be able to use it. Users who want to run this pipeline on their local servers will have to install the necessary software on their instance.

All the software required to run the variant catalogue pipeline is open-source, and links to the installation guidelines are available in supplementary_information/software_information.md.

All the other resources necessary to run the pipeline (reference genomes, annotation plugins, etc.) are also publicly available. Information about them is available for GRCh37 in supplementary_information/GRCh37_specific_files.md, for GRCh38 in supplementary_information/GRCh38_specific_files.md and for the mitochondrial genome in supplementary_information/Mitochondrial_references.md.

Future of the pipeline

There are discussions with the nf-core team to move the variant catalogue pipeline to nf-core and to improve and maintain it with the help of the community. Feel free to join in the effort!

Pipeline test on 100 genomes

In order to test the variant catalogue pipeline, 100 samples from the IGSR (International Genome Sample Resource) were processed. A more precise description of the method and results will be available soon.

The samples were processed in two batches: batch_1 contained 80 samples and batch_2 contained 20 samples. Output files generated by Nextflow (report, timeline, etc.) are available in the test_case folder, along with the VCF files containing annotated variant frequencies. Intermediate files were not uploaded to GitHub.

variant_catalogue_pipeline's People

Contributors

scorreard, phillip-a-richmond, melsiddieg, brittanyhewitson