BEAT v 1.0.0 Documentation
#################
#################
The Basic Exon Assembly Tool (BEAT) is a freeware, copyleft-compliant (GNU GPL3, see LICENSE.md) bioinformatics pipeline developed by R Bankoff, with significant inputs from L Kistler, E Lauterbur, M Jerjos, and B Hohman at the Pennsylvania State University. BEAT's scripts are written in perl, and may be freely modified to fit the user's needs. The purpose of the program is to reconstruct targeted areas of a genome by aligning short read files from an unassembled species to a relative with a well annotated genome OR orthologous sequence and extracting the reads which map to the coordinates of orthologous loci. It is structured as a pair of core programs (BEAT_assembly.pl and BEAT_consensus.pl) that can either be accessed individually by the user, or through a central portal program called "BEAT.pl”.
BEAT_assembly.pl: This is the script which directs the file splitting, indexing, and mapping of short reads for both tracks of BEAT. First, BEAT subdivides input short read files into small (default 1GB) chunks in order to take advantage the time benefits of parallel processing. There are two tracks to BEAT_assembly.pl:
Track 1: fast - Retrieve only short reads which align to a predetermined query sequence from a raw read file using NCBI’s blast+; map matching reads to the query.
Track 2: complete - Map all reads against an annotated reference genome from a related species; can be used for open-ended querying based on known coordinates of orthologs.
BEAT_consensus.pl: The resultant map (.bam) files from either track of BEAT_assembly can be fed into this script, which will generate a consensus sequence for the whole query sequence (fast track) or loci specified by coordinates on an annotated reference (complete track). This step is automatically completed for fast track queries.
What follows is the outline and general use guide to the command-line release version of the pipeline without the use of the BEAT.pl portal, whose operation should be self-explanatory because of its interactive nature.
################################
################################
None of the scripts need root access to run, but most of them require the installation of three key dependencies to run properly (or at all), which may require root permissions to install depending on your hardware configuration. The dependencies are:
-
Burrows-Wheeler Aligner (BWA; Li & Durbin, 2009; http://bio-bwa.sourceforge.net/)
-
Samtools, v0.1.19 (Li et al., 2009; http://samtools.sourceforge.net/)
-
Entrez Direct (Kans, 2013; http://www.ncbi.nlm.nih.gov/books/NBK179288/)
BEAT checks that these programs are installed and in the user's path before executing; if they are not, BEAT will throw an error to the effect of "edirect not found" and exit. It is highly recommended that users add the directory containing BEAT to their path. This can be accomplished with the following commands in a bash terminal:
PATH = $PATH:~/Downloads/BEAT_v1_0_0/; export PATH
###################################
###################################
The BEAT suite is optimized for use on a linux cluster running the TORQUE Resource Manager (http://www.adaptivecomputing.com/products/open-source/torque/), but it can be run on a multithreading-capable local linux or unix system as well. To determine if you are using a system with the TORQUE Resource Manager installed, enter the command "qstat" in the terminal; if a list of running and queued jobs is returned, you are. This package is accompanied by a template file ("template_qsub.qsub"), which you may use, modify or replace to alter the default settings of each subdivided job. If you would like to adapt BEAT for other resource management systems, there is a built-in switch called "$type" in BEAT_MAIN.pl which has only one option at present ("qsub"); adding another option that mimicks the use of "$type" along with a system-specific template file in place of "template_qsub.qsub" should enable relatively easy conversion.
BEAT is designed to intake FASTQ-formatted paired-end reads of indeterminate length, but a key strength of the pipeline is its ability to paralellize jobs to maximize efficiency. We therefore recommend that you use BEAT_splitter.pl to break up large short read files prior to assembling your reads to a reference genome, provided you have a cluster capable of running multiple simultaneous instances of BWA mem/samtools.
##############################
##############################
##BEAT.pl
To use BEAT in the interactive mode, simply enter perl BEAT.pl [program name]
into a bash terminal in the working directory. Available programs are as follows:
PROGRAM | DESCRIPTION |
---|---|
raw_fast | rapid extraction of targeted sequences from raw read files |
raw_complete | fully map read files to reference for random-access consensus generation |
entrez_fetch | retrieve full sequences from NCBI and generate coordinate files |
consensus | generate consensus sequences and associated data for user-specified genes |
batch | generate consensus sequences for large numbers of sequences in parallel |
split | break large short read files into properly associated subfiles for more efficient parallelization |
help | this menu |
For command-line instructions, the arguments and flags are detailed below.
##BEAT_assembly.pl usage:
perl BEAT_assembly.pl [1] [2] [mandatory flags] <optional flags>
ORDER | NAME | EXAMPLE | DESCRIPTION |
---|---|---|---|
1 | Number of queries | 4 | number of queries for BEAT_fast; if using BEAT_complete, enter 1 |
2 | Number of read files | 10 | number of paired-end short-read files (2 minimum) |
###Mandatory flags; these must be given, but order doesn’t matter
FLAG | EXAMPLE | DESCRIPTION |
---|---|---|
-sr | SR_1_1.fastq SR_1_2.fastq | filenames of FASTQ-formatted paired-end short-read files |
OR | ||
-list | List_of_SR_files.txt | filename of list of short-reads in local directory |
-query | Query1.fa Query2.fa | filename(s) of homologous query sequences for BEAT_fast |
OR | ||
-querylist | List_of_query_files.fa | filename of list of queries in local directory |
-usr | abc123 | cluster-recognized username for qsubbing (TORQUE) clusters |
-chrom | chr3 | name of the chromosome for each query in order entered |
-ref | hg19.fa | name of reference genome in fasta format |
-config | config.qsub | name of configuration file with cluster header |
-type | qsub | set up to receive “qsub”, but can be modified (see below) |
-mode | fast OR complete | defines which track the user will use, targeted or complete (see below) |
-species | Homo sapiens | the name of the species used as the reference file |
###Semi-optional flags; these must be included if certain preconditions are not fulfilled
FLAG | DESCRIPTION |
---|---|
-first | indicates to BEAT_fast to make BLAST databases of short read files if none are present |
###Optional flags; these may be specified or left at default values
FLAG | EXAMPLE | DEFAULT | DESCRIPTION |
---|---|---|---|
-samtools_qflag | 20 | 1 | defines custom quality flag value |
-e_value | 100 | 50 | minimum e value to keep in blast alignment |
-perc_identity | 85 | undefined | minimum percentage sequence identity to in blast alignment |
####IMPORTANT NOTE: To ensure proper results if using the TORQUE Resource Manager, you MUST set up your BEAT_assembly.pl job within a qsub script.
###Example run walkthorugh
Included in the BEAT github repository are a set of example files to demonstrate a properly set up run on the BEAT_fast track. Everything needed to run the pipeline is in there except for a reference genome, which for these data can be retrived from here. The demo short read files are a greatly subsetted group of reads from an aye-aye sequence repository (SRA066444; Perry et al., 2012) that have been spiked with reads that match to the human TMC1 orthologue in a previous run of BLAST+. To modify or inspect the parameters of the job, simply open the "Demo_Master.qsub" file in a text editor. Two parameters need to be changed in the example job file: your username, after the -usr
flag, and the name of the reference you download after the -ref
flag. To run the job after inserting the correct data for these fields, enter qsub Demo_Master.qsub
on a TORQUE queue manager compliant system.
##BEAT_consensus.pl usage:
perl BEAT_consensus.pl [1] [2] [3] [4]
###Arguments must be given in order
ORDER | NAME | EXAMPLE | DESCRIPTION |
---|---|---|---|
1 | Listfile | List.txt | list file in format specified in format section below |
2 | BAMfile | Master.bam | mapped .bam file from BEAT_assembly |
3 | Outfile | Out_folder | suffix to be added to query name for results directory |
4 | Reference | hg38.fa | FASTA-formatted reference |
NOTE: Unlike BEAT_assembly.pl, BEAT_consensus.pl can be run either locally or as a qsub job without issue.
###########################
###########################
##BEAT_splitter.pl
###Purpose:
A script to subdivide FASTQ files over 2GB into smaller chunks (default 1GB) that can be run in parallel. After subdivision, BEAT_splitter.pl outputs a properly formatted file listing the subdivisions to be input using the BEAT_MAIN.pl -list option.
###Usage:
perl BEAT_splitter.pl [1] [2] [3] <4>
ORDER | NAME | EXAMPLE | DESCRIPTION |
---|---|---|---|
1 | Short read list | List.txt | short reads to be split in a newline-delimited file |
2 | Job name | Monday_map | name of prefix to be associated with the split files |
3 | Outfile | Out_list.txt | name of file containing the split short read file names |
4 | Split value | 15000000 | optional: number of FASTQ-formatted lines in each daughter file |
NOTE: 1 GB is approximately 15,313,576 lines; if you do not enter a 4th argument, this is the value that will be used
##blast2fq.pl/fqselect.pl ###Purpose: A pair of scripts that reassociate sequence data from reads matched to the sequence of interest by BLAST+ with FASTQ quality data. fqselect4.pl is handed data by blast2fq.pl; running it alone is useless and thus arguments will not be specified.
###Usage:
perl blast2fq.pl [1] [2] [3] [4] [5]
ORDER | NAME | EXAMPLE |
---|---|---|
1 | Blast results filename | file.blast |
2 | Minimum score | 50 |
3 | Forward reads | s_1_1.fastq(.gz allowed) |
4 | Reverse reads | s_1_2.fastq(.gz allowed) |
5 | Out prefix | TMC1 |
##Consensus_mpileup.pl ###Purpose: Generates the actual consensus sequences produced from samtools mpileup data, hard to use without going through BEAT_consensus.pl
###Usage:
perl Consensus_mpileup.pl [1] [2] [3] [4]
ORDER | NAME | EXAMPLE |
---|---|---|
1 | FASTA file name | TMC1.fasta |
2 | Chromosome name | 9 |
3 | Mpileup file name | TMC1.mpileup |
4 | Outfolder prefix | TMC1 |
##Streamline_formatDB.pl
###Purpose:
Subscript used to batch-process the creation of BLAST+ databases with the makeblastdb utility.
###Usage:
perl Streamline_FormatDB.pl [name of file to be databased]
##############################
##############################
BEAT uses the Entrez Direct suite of tools from the National Center for Biotechnology Information (NCBI) to fetch data and sequences relevant to user's tasks using the Entrez_fetch.pl script. The script is designed to run as a subroutine within BEAT_consensus entirely autonomously of the user, but the script can also be used to directly query and refine data retrieved from any of the NCBI's databases that are available to Entrez Direct users. The syntax for commands to Entrez_fetch.pl is as follows:
perl Entrez_fetch.pl [1] [2] [3]
ORDER | NAME | EXAMPLE |
---|---|---|
1 | Gene symbol (NCBI) | TMC1 |
2 | Generic/species name | Homo |
3 | Download sequences? | (Y/n) |
This will produce a listfile in the format needed by BEAT_consensus.pl to generate a consensus for the targeted query/reference, and can also be used to download sequences directly from NCBI.
#######################
#######################
##Standard formats:
###.bam: BAM files are the binary versions of SAM files, and cannot be read by humans without being passed through the samtools view interpreter.
###.fa/.fasta: A text-based format for representing nucleotide sequences.
Example:
>GENE_NAME
ACGTCCTGACACTGATGTATGCCACAATGTAACGTCCTGACACTGATGTATGCCACAATGTA
###.fq/.fastq: A text-based format for storing read sequences along with their PHRED-formatted quality scores.
Example:
@SEQ_ID
ACGTCCTGACACTGATGTATGCCACAATGTAACGTCCTGACACTGATGTATGCCACAATGTA
+
!!''(%%%+*******!!$$CCGF928))))))'''!*.15HHHHHH(H!!!$%$$!'',.$
###.qsub: A generic suffix to denote a script that is meant to be queued by the TORQUE Resource Management system.
Example:
#PBS -l walltime=4:00:00
#PBS -l nodes=1:ppn=1
#PBS -l pmem=2gb
cd $PBS_O_WORKDIR
###.sam: Samtools alignment specification, see (https://samtools.github.io/hts-specs/SAMv1.pdf) for full specifications.
##BEAT-specific formats:
###Listfile format: The listfile for BEAT_consensus [1] is an internal file which can be generated automatically by using the Entrez_fetch.pl script (see above), or can be created manually with the following format, where each field is separated by a tab:
FIELD 1 | FIELD 2 | FIELD 3 | FIELD 4 |
---|---|---|---|
name of .gbl file | chromosome/scaffold name | start coordinates along chromosome/scaffold | end coordinates along chromosome/scaffold |
NOTE: for “fast” track jobs, input 1 for field 3 and the length of the sequence for field 4.
###.mapping_script: An internally-generated queueable script that contains mapping instructions for a given pair of short reads.
###.merging_script: An internally-generated queueable script that merges and indexes all short reads mapped by the merging script.
###.consensus.fa: FASTA-formatted consensus for specified query/region
###.out: The position-wise mapping scores from samtools mpileup used by BEAT_consensus to extract targeted consensus sequences.
Example:
REF NAME | REF COORD | REF ID | READ DEPTH | READ ID | READ QUALITY |
---|---|---|---|---|---|
scaffold_774 | 72731 | G | 9 | ..,,,,,,. | FFHHH@HH6 |
scaffold_774 | 72732 | T | 9 | GGgggg,gG | HFHHHEHH6 |