Introduction

GBSapp v2.2 is an automated pipeline for variant calling and filtering. The pipeline intuitively integrates existing/novel best practices, some of which can be controlled by user-defined parameters. It optimizes memory and speed at various steps of the pipeline, for example, a novel approach performs compression and decompression of unique reads before and after read alignment, respectively. Summary reports and visualizations allow for QC at each step of the pipeline.

For questions, bugs, and suggestions, please contact [email protected].

Features

Easy use and designed for biologist.
Dosage-based variant calling and filtering.
Fully-automated: “walk-away” and “walk-through” mode.
Allows for use of haplomes (haplotype-resolved), subgenomes (haploid), and pan-genomes (haploid or haplotype-resolved) references.
Minimizes excess heterozygosity and allele dropout.
Variant calling implemented for any ploidy-level.
Input data: shotgun WGS, reduced representation sequence (e.g., OmeSeq-qRRS, GBS, ddRADseq), and multiplexed-PCR data.
Can subsample shotgun whole genome data for variant calling, i.e. in silico reduced representation sequencing (RRS).
Parallelization of job on multiple compute cluster nodes (spark cluster infrastructure not required).
Splice-aware aligner allows for RNAseq data as input (recommended only for haploid or diploid genomes).
Generates variant sequence context (useful for applications such as oligo/primer design & sequenced-based phylogenetic analysis).
Variant calling delineates SNP from uniquely mapped and paralogs.
Fast variant calling due to dynamic downsampling (avoids allele dropout due to biased downsampling).
Fast alignment due to joint-alignment method.
Visualizations for report and QC.
Functions under-development:
- calling microhaplotypes
- estimating ploidy level and aneuploidy

Installation
Usage
Related Software
Select Article Referencing GBSapp
Acknowledgment
Troubleshooting
Versioning
License

Installation

Currently, GBSapp is only available for unix-based systems.
Clone or download the Git repository to your desired folder.

git clone https://github.com/bodeolukolu/GBSapp.git

Installation occurs automatically the first time you run the pipeline.
To install dependencies without running a job:

GBSapp_dir/GBSapp install

With the exception of R, and Python (v2.6 or greater) all dependencies are installed to a local directory within GBSapp.

Dependencies:

Installed on first run of pipeline:
-----------------------------------
NextGenMap (ngm), samtools, picard, bcftools, GATK, java, R-ggplot2, CMplot, and R-AGHmatrix


Pre-install before running GBSapp:
----------------------------------
- R
- python v2.6 or greater

Usage

Basic Usage

The project directory should contain the following files and directories:

config file: specifies run parameters (for details: GBSapp_vx.x/examples/config).
samples directory: contains “se” and/or “pe” (“paired” and “single” name format acceptable) fastq file(s). Paired-end (pe) sample filenames might require formatting so that they end in “_R1.fastq” or “.R1.fastq” and “_R2.fastq” or “.R2.fastq” (for details: GBSapp_vx.x/misc/format_fastq_filenames.txt).
refgenomes directory: contains fasta file(s) of the reference genome.
* Genomes with subgenome assemblies in single fasta file: such as allopolyploids and segmental allopolyploids might require formatting to split fasta file into multiple file containing each subgenome (for details: GBSapp_vx.x/misc/split_subgenomes_format_fasta_headers.txt).
* Haploids, diploids and autopolyploids with single reference genomes: splitting fasta file not required.
* Note:formatting of fasta headers to contain minimal text (e.g. >Chr05) might be required (for details: GBSapp_v0.1/misc/format_fasta_headers.txt)
for help: --help or -h
for version: --version or -v

From command line, run GBSapp with options shown below (absolute or relative path)

$ bash	<path-to-GBSapp-directory/GBSapp>	<path-to-project-directory>

Project directory setup

A project directory should contain the following sub-directories:

samples folder: this contains your quality filtered sequence data.
refgenomes folder: this contains your reference genome fasta file.
config.sh file: a template of the configuration file is provided in the examples folder of the GBSapp download.

Overview of workflow

The figure below outlines the order of steps in the GBSapp pipeline

In Progress

Configuration

Using a text editor, save a file containing any of the following variables as 'config.sh' file and include it in your project directory.

General parameters

Variable	Default	Usage	Input	required/Optional
threads	na	number of cores/processors	integer	Optional
walkaway	true	run in walk-away or walk-through mode	true or false	Optional
cluster	false	run on compute cluster node (default: slurm) or workstation	true or false	Optional
nodes	1	number of nodes	integer	Optional
samples_alt_dir	false	links samples in separate directory to project directory	true or false	Optional
lib_type	RRS	RRS (reduced representation sequence e.g. GBS, ddRADseq, qRRS) or WGS (shotgun whole genome sequence)	string	Optional
subsample_WGS_in_silico_qRRS	false	Fast alternative to variant calling on whole genome data	true or false	Optional

Variant calling parameters

Variable	Default	Usage	Input	required/Optional
ploidy	na	value = 1,2,4,6, or 8	integer	Required
haplome_number	1	variant call using haplomes/pangenomes/subgenomes (haplome # typically = ploidy)	integer	Optional
ref1	na	reference subgenome as .fasta file. Anchor-genome when other subgenomes specified	integer	Optional
ref2	na	2nd reference genome as .fasta file	integer	Optional
ref3	na	3rd reference genome as .fasta file	integer	Optional
ploidy_ref1	na	ploidy-level for subgenome 1	integer	Optional
ploidy_ref2	na	ploidy-level for subgenome 2, only specify for subgenome specific variants	integer	Optional
ploidy_ref3	na	ploidy-level for subgenome 3, only specify for subgenome specific variants	integer	Optional
genomes_ref	na	haplotype-resolved reference genome or pangenomes (# of haplomes typically = ploidy level)	integer	Optional
Get_Chromosome	na	variant calling on specific chromosomes, scaffolds,and contigs	comma delimited string(s)	optional
Exclude_Chromosome	na	variant calling to exclude specific chromosomes, scaffolds,and contigs	comma delimited string(s)	optional

*note: designate haplomes/pangenomes/subgenomes with single character prefix (alphabets: A-Z and a-z) in a single file *note: for polyploids with variable ploidy of subgenomes (e.g. hexapploid: 4x + 2x), indicate ref2-ref3 and ploidy_ref2-ploidy_ref3 (i.e. multiple reference genome files)

Variant filtering parameters

Variable	Default	Usage	Input	required/Optional
p1	na	maternal parent (specified only for biparental populations)	string	Optional
p2	na	paternal parent (specified only for biparental populations)	string	Optional
biallelic	false	filter to output only biallelic variants	string	Optional
genotype_missingness	1	maximum proportion of missing genotypes allowed per sample	comma delimited decimal number(s)	Optional
sample_missingness	1	maximum proportion of missing samples allowed per variant	comma delimited decimal number(s)	Optional
exclude_samples	na	sample IDs to be exclude from filtered variant data set	comma delimited string(s)	Optional
select_samples	na	limit variant filtering to samples IDs in file delimited by newline	filename	Optional
minRD_1x	2	minimum read depth threshold	integer	Optional
minRD_2x	6	minimum read depth threshold	integer	Optional
minRD_4x	25	minimum read depth threshold	integer	Optional
minRD_6x	45	minimum read depth threshold	integer	Optional
minRD_8x	100	minimum read depth threshold	integer	Optional
pseg	0.001	p-value threshold for chi-square test of segregation distortion	decimal number	Optional
maf	0.02	minor allele frequency threshold	decimal number	Optional
filtered_vcf	false	generate filtered vcf file	string	Optional

Advanced parameters

Variable	Default	Usage	Input	required/Optional
uniquely_mapped	true	include uniquely mapped for variant calling	string	Optional
paralogs	true	include paralogs for variant calling	string	Optional
minmapq	20	minimum mapping quality	integer	Optional
downsample_2x	50	value for unbiased downsampling for 2x ploidy	integer	Optional
downsample_4x	100	value for unbiased downsampling for 4x ploidy	integer	Optional
downsample_6x	150	value for unbiased downsampling for 6x ploidy	integer	Optional
downsample_8x	200	value for unbiased downsampling for 8x ploidy	integer	Optional
maxHaplotype	128	maximum number of haplotypes per haploid genome across population(increase for polyploids/high heterozygosity/high background mutational load)	integer	Optional
use_softclip	false	use soft-clipped bases for variant calling	string	Optional
joint_calling	false	cohort calling will be performed if set to false	string	Optional
keep_gVCF	false	keep sample gVCF files, if additional samples will be included for future joint calling)	string	Optional
RE1	NA	sequence motif at start of R1 reads	string	Optional
RE2	NA	sequence motif at start of R2 reads	string	Optional
filter_ExcHet	false	test and filter for excess heterozygosity	string	Optional
genomecov_est	false	compute genome coverage for each sample	string	Optional

*note: na indicates that variable is user-defined or hard-coded/computed intuitively, as well as a function of ploidy.

Below is an example of a configuration file:

config.sh

# General_parameters
###################################################
threads=16
walkaway=true
cluster=true
nodes=1
samples_alt_dir=false
lib_type=RRS
subsample_WGS_in_silico_qRRS=false

# Variant calling
###################################################
ploidy=6
haplome_number=6
# Variant calling with haploid subgenome(s)
# Anchored to ref1 for loci conserved across all subgenomes
ref1=TF.fasta
ref2=TL.fasta
ploidy_ref1=4
ploidy_ref2=2
# Variant calling with haplotype-resolved reference genome or pangenomes
# Anchored to haplome_ref1 for loci conserved across all haplomes
genomes_ref=Ib.fasta
# exclue or limit variant calling to specific chromosomes
Get_Chromosome=TF_Chr01,TF_Chr02
Exclude_Chromosome=TF_Chr00,TL_Chr00
genomecov_est=false

# SNP-filtering:
####################################################
p1=M9
p2=M19
biallelic=false
genotype_missingness=1
sample_missingness=1
exclude_samples=S1,S2,S3
select_samples=pop.txt
minRD_2x=6
minRD_4x=25
minRD_6x=45
pseg=0.001
maf=0.05
filtered_vcf=true

# Advanced_parameters
###################################################
uniquely_mapped=true
paralogs=true
minmapq=20
downsample_2x=50
downsample_4x=100
downsample_6x=150
downsample_8x=200
variant_intervals=false
interval_list=variant_intervals.list
interval_list_ref1=variant_intervals_TF.list
interval_list_ref2=variant_intervals_TL.list
maxHaplotype=128
use_softclip=false
joint_calling=false
keep_gVCF=false
RE1=TGCAT
RE2=CATG
filter_ExcHet=false

Alternatively, a configuration file (outlined below) specifying only the ploidy level is sufficient to run GBSapp.

config.sh

### Variant calling
###################################################
ploidy=2

Since most of the parameters are hard-coded in an intuitive manner, by specifying only the ploidy levels, the pipelines determines the other parameters as stated below:

threads: computes available number of cores (n) and and uses n-2 threads
defaults: refer to parameters above

Related Software

Select Article Referencing GBSapp

ngsComposer: an automated pipeline for empirically based NGS data quality filtering. Kuster et al. 2021
Genome-wide association study identified candidate genes controlling continuous storage root formation and bulking in hexaploid sweetpotato. Bararyenya et al. 2020
Sequencing depth and genotype quality: accuracy and breeding operation considerations for genomic selection applications in autopolyploid crops Gemenet et al. 2020
Genetic Diversity and Population Structure of the USDA Sweetpotato (Ipomoea batatas) Germplasm Collections Using GBSpoly Wadl et al. 2019

Acknowledgment

This package has been developed as part of the Genomic Tools for Sweetpotato Improvement project (GT4SP) and SweetGAINS, both funded by Bill & Melinda Gates Foundation.

Troubleshooting

Pre-Installation of R and Python:

    - To view R and python version (or to check if installed), from the terminal type:
          $ R --version
          $ python --version

    - For Ubuntu, install R and python using apt:
          $ sudo apt update && sudo apt upgrade
          $ sudo apt install r-base
          $ sudo apt install python3

    - For macOS, install using homebrew:
          brew install r
          brew install python3

If GATK can't find python:

- Make sure python v2.6 or greater is installed and then type the command below in terminal
- $ sudo ln -sf /usr/bin/python2 /usr/bin/python
- or
- If you are using python3
- $ sudo apt update
- $ sudo apt install python-is-python3

If samtools and bcftools doesn't install properly:

While the installation of samtools and bcftools are automated, the installation requires some dependencies:
  $ sudo apt-get update
  $ sudo apt-get install gcc
  $ sudo apt-get install make
  $ sudo apt-get install libbz2-dev
  $ sudo apt-get install zlib1g-dev
  $ sudo apt-get install libncurses5-dev
  $ sudo apt-get install libncursesw5-dev
  $ sudo apt-get install liblzma-dev
  $ sudo apt-get install libcurl4-gnutls-dev
  $ sudo apt-get install libssl-dev

If NextGenMap doesn't install properly:

While the installation of NextGenMap (ngm) is automated, the installation requires some dependencies:
  $ sudo apt install cmake

Problem with amount of memory and/or processors/cores specified:

- This might be due to specifing values greater than available resources
- Re-submit job with appropriate values or modify header of the GBSapp_run.sh batch file.
- If using compute cluster managers other than SLURM, header of GBSapp_run.sh batch can also be modified to fit the syntax of the cluster manager been used.

Versioning

Versioning will follow major.minor.patch semantic versioning format.

License

Apache License Version 2.0

bodeolukolu / gbsapp Goto Github PK

gbsapp's Introduction

Introduction

Features

Contents

Installation

Usage

Basic Usage

Project directory setup

Overview of workflow

Configuration

Related Software

Select Article Referencing GBSapp

Acknowledgment

Troubleshooting

Versioning

License

gbsapp's People

Contributors

Stargazers

Watchers

Forkers

Recommend Projects

Recommend Topics

Recommend Org