A pipeline is essentially a list of commands. More precisely, a pipeline combines
a list of independent steps or jobs, where each job is composed of closely
related commands (for example, samtools view, samtools sort and samtools index
are always embedded in one step). Once the dependencies between jobs are known,
the complete dependency tree can be built.
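As a rough illustration of how a dependency tree determines the order in which jobs can be submitted, the sketch below topologically sorts a small set of jobs. The job names and the %depends_on hash are invented for this example and are not part of this repository.

```perl
use strict;
use warnings;

# Hypothetical jobs: each key is a job, each value lists the jobs it waits for.
my %depends_on = (
    trim     => [],
    align    => ['trim'],
    sort_bam => ['align'],
    flagstat => ['sort_bam'],
    metrics  => ['sort_bam'],
);

# Kahn's algorithm: repeatedly emit jobs whose dependencies have all been emitted.
sub submission_order {
    my (%deps) = @_;
    # Number of unfinished dependencies per job.
    my %remaining = map { $_ => scalar @{ $deps{$_} } } keys %deps;
    # Reverse lookup: which jobs are waiting for a given job.
    my %children;
    for my $job (keys %deps) {
        push @{ $children{$_} }, $job for @{ $deps{$job} };
    }
    my @ready = sort grep { $remaining{$_} == 0 } keys %remaining;
    my @order;
    while (@ready) {
        my $job = shift @ready;
        push @order, $job;
        for my $child (sort @{ $children{$job} || [] }) {
            push @ready, $child if --$remaining{$child} == 0;
        }
    }
    die "dependency cycle detected\n" if @order != keys %deps;
    return @order;
}

my @order = submission_order(%depends_on);
print join(" -> ", @order), "\n";
```

Jobs that share the same dependency (flagstat and metrics here) become ready at the same time, so a scheduler is free to run them in parallel.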
On a PBS system, a pipeline maker should at least do the following:
- put commands into shell scripts.
- set PBS parameters such as memory, requested time and number of CPUs.
- resolve the dependencies between jobs.
A more powerful pipeline maker can also:
- monitor the status of jobs (i.e. whether they failed or succeeded).
- check whether jobs finished successfully.
- recover from the last failed job instead of starting from the beginning of the pipeline.
Here I developed a lightweight pipeline maker. It allows quick implementation
of an NGS pipeline. Once a pipeline is fully tested, it can be migrated to a more
powerful pipeline manager such as Roddy.
In this repository, there is a CO::PipelineMaker class, which does the
pipeline-making work, and CO::NGSPipeline classes, which implement tools and
pipelines for the WGBS and RNAseq pipelines.
CO::PipelineMaker mainly does the following:
- write commands into shell script.
- add job checking in shell script.
- add job flags
- handle job dependency
- submit by qsub
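To make these steps concrete, here is a minimal sketch of how a job script and its qsub call could be assembled. It is not the code used by CO::PipelineMaker: the helper names, the flag file name (.success) and the failure check are assumptions made for illustration. The only PBS feature relied on is the standard `-W depend=afterok:` option of qsub.

```perl
use strict;
use warnings;

# Build the text of a job script: run each command, stop at the first failure,
# and leave a flag file behind when every command succeeded.
# (Hypothetical helper; the flag file layout is an assumption.)
sub build_job_script {
    my ($job_name, $flag_dir, @commands) = @_;
    my @lines = ("#!/bin/bash", "# job: $job_name");
    for my $cmd (@commands) {
        push @lines, $cmd;
        push @lines, 'if [ $? -ne 0 ]; then echo "failed: ' . $job_name . '" >&2; exit 1; fi';
    }
    push @lines, "touch $flag_dir/$job_name.success";
    return join("\n", @lines) . "\n";
}

# Build the qsub argument list; afterok makes PBS hold the job until the
# listed jobs have finished successfully.
sub build_qsub_args {
    my ($script, @dependency_ids) = @_;
    my @args = ('qsub');
    push @args, '-W', 'depend=afterok:' . join(':', @dependency_ids) if @dependency_ids;
    push @args, $script;
    return @args;
}

print build_job_script('sort_bam', '/tmp/flags',
    'samtools sort -o out.bam in.bam',
    'samtools index out.bam');
print join(' ', build_qsub_args('sort_bam.sh', '1234.server', '1235.server')), "\n";
```

With a flag file per job, a rerun can skip every job whose flag already exists, which is one simple way to resume after a failure.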
To use this class, first create a CO::PipelineMaker object, which must contain
the path of the working directory for the pipeline.
my $pm = CO::PipelineMaker->new("dir" => $dir);
Set the job name and dependency
$pm->set_job_name("test_job");
$pm->set_job_dependency($previous_pid1, $previous_pid2);
Add commands to this step
$pm->add_command($cmd1);
$pm->add_command($cmd2);
Check the file size of some output files. This is important because some NGS commands hit an error but still finish without returning a non-zero exit code.
$pm->check_filesize($output);
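For illustration only, a file size check of this kind can be as small as the sketch below. The function name filesize_ok and the optional minimum size are assumptions; the actual check_filesize method adds its check to the generated shell script and may work differently.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Return 1 if the file exists and is larger than $min_bytes, 0 otherwise.
# (Hypothetical helper, not the repository's implementation.)
sub filesize_ok {
    my ($file, $min_bytes) = @_;
    $min_bytes = 0 unless defined $min_bytes;
    my $size = (-s $file) || 0;   # -s is false for missing or empty files
    return $size > $min_bytes ? 1 : 0;
}

# Demonstration with a temporary file: empty first, then with some content.
my ($fh, $path) = tempfile(UNLINK => 1);
close $fh;
print "empty file ok? ", filesize_ok($path), "\n";

open my $out, '>', $path or die "cannot write $path: $!";
print {$out} "\@HD\tVN:1.0\n";
close $out;
print "non-empty file ok? ", filesize_ok($path), "\n";
```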
Finally, when all settings for the current job are done, you can prepare the shell
script, configure PBS settings and submit the job. run will return the PID of
this job, which you can use to set the next job's dependency.
my $pid = $pm->run("-l" => { walltime => '10:00:00',
                             memory   => '10G',
                             nodes    => '1:ppn=8' }
);
Normally, for an NGS step, I embed the above code in a single subroutine.
Examples can be found under the CO::NGSPipeline::Tool classes.
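The sketch below shows what such a subroutine might look like. So that it runs on its own, it uses a small stand-in class (StubPipelineMaker) that only records calls and returns made-up job IDs; it mimics the method names shown above but is not CO::PipelineMaker. The sort_bam subroutine and the file names are also hypothetical.

```perl
use strict;
use warnings;

# Stand-in for CO::PipelineMaker so the sketch is self-contained. It only
# records what was requested; it does not write scripts or call qsub.
package StubPipelineMaker;
sub new { my ($class, %args) = @_; return bless { %args, log => [], next_id => 1 }, $class }
sub set_job_name       { my ($self, $name) = @_; $self->{name} = $name }
sub set_job_dependency { my ($self, @ids)  = @_; $self->{deps} = [@ids] }
sub add_command        { my ($self, $cmd)  = @_; push @{ $self->{commands} }, $cmd }
sub check_filesize     { my ($self, $file) = @_; push @{ $self->{checks} }, $file }
sub run {
    my ($self, %opt) = @_;
    push @{ $self->{log} }, {
        name     => $self->{name},
        deps     => $self->{deps},
        commands => $self->{commands},
    };
    # Reset per-job settings so the object can be reused for the next job.
    @{$self}{qw(deps commands checks)} = ([], [], []);
    return "job" . $self->{next_id}++;
}

package main;

# Hypothetical step: sort and index a BAM file as one job.
sub sort_bam {
    my ($pm, $input, $output, @previous_ids) = @_;
    $pm->set_job_name("sort_bam");
    $pm->set_job_dependency(@previous_ids);
    $pm->add_command("samtools sort -o $output $input");
    $pm->add_command("samtools index $output");
    $pm->check_filesize($output);
    return $pm->run("-l" => { walltime => '10:00:00',
                              memory   => '10G',
                              nodes    => '1:ppn=8' });
}

my $pm  = StubPipelineMaker->new(dir => "/tmp/pipeline");
my $id1 = sort_bam($pm, "sample.bam", "sample.sorted.bam");
my $id2 = sort_bam($pm, "other.bam", "other.sorted.bam", $id1);
print "$id1 $id2\n";
```

Returning the job ID from each subroutine is what lets the caller chain steps by passing it on as the next step's dependency.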
There are two classes in this namespace. CO::NGSPipeline::Tool defines
independent jobs and CO::NGSPipeline::Pipeline defines pipelines by integrating
methods from CO::NGSPipeline::Tool.
Under each level of the namespace, there is a Common class, which defines
common methods and serves as the base class of the more specific classes. There
is also a Config module, which defines global variables.
- fastqc
- trim
- sort sam/bam
- samtools view
- merge and remove duplicates
- flagstat
- bwa aln
- bwa sampe
- merge sam/bam
- picard metric
- picard insertsize
- global variables for common methods
Currently I have implemented methods for the WGBS and RNAseq pipelines.
Methods for WGBS pipeline
Common methods for WGBS pipeline, including:
- QC
- save methylation into an R Bsseq object.
Global variables for WGBS pipeline
Methods for Bismark pipeline, including:
- alignment
- Bismark's methylation calling
- lambda conversion rate
Methods for BSMAP pipeline, including:
- alignment
- BSMAP's methylation calling
- lambda conversion rate
Methods for methylCtools pipeline, including:
- fqconv
- bconv
- bcall
- lambda conversion rate
- steps for BisSNP methylation calling
- save methylation as RData for downstream DMR calling
Methods for RNAseq pipeline
- rnaseqqc
- rpkm
- counting
Global variables for RNA seq pipeline
Methods for gene fusion pipeline
- defuse
- fusionmap
- fusionhunter
- tophatfusion
Methods for GSNAP pipeline
- alignment
Methods for STAR pipeline
- alignment
Methods for TopHat pipeline
- alignment
Integrated pipelines, each of which is a collection of methods from CO::NGSPipeline::Tool.
This class provides 'shortcut' methods to call the real methods in the
CO::NGSPipeline::Tool namespace. For example, there is already a method called
align in CO::NGSPipeline::Tool::BSseq::BSMAP. To use this method in a pipeline,
you do not need to initialize the object and deal with the PipelineMaker details
yourself. Just use the 'shortcut' method:
$pipeline->bsmap->align(@param);
Here $pipeline is a pipeline object, which should be initialized with a pipeline
maker object. $pipeline->bsmap is a shortcut method that initializes a
CO::NGSPipeline::Tool::BSseq::BSMAP object and attaches the pipeline maker object
to it, so that you can then call the align method on this BSMAP object.
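One way such a shortcut could be wired up is sketched below. The Sketch::* package names, the set_pipeline_maker method and the placeholder pipeline maker are all invented for this example; the real classes may attach the pipeline maker differently.

```perl
use strict;
use warnings;

# Hypothetical tool class standing in for CO::NGSPipeline::Tool::BSseq::BSMAP.
package Sketch::Tool::BSMAP;
sub new { my ($class) = @_; return bless {}, $class }
sub set_pipeline_maker { my ($self, $pm) = @_; $self->{pm} = $pm; return $self }
sub align {
    my ($self, %param) = @_;
    # A real tool would add commands to the pipeline maker and submit the job.
    return "align $param{sample} using maker in $self->{pm}{dir}";
}

# Hypothetical pipeline class that holds the shared pipeline maker.
package Sketch::Pipeline;
sub new { my ($class, $pm) = @_; return bless { pm => $pm }, $class }

# Shortcut: create the tool object, attach the pipeline maker and return it,
# so the caller can chain the real method directly.
sub bsmap {
    my ($self) = @_;
    return Sketch::Tool::BSMAP->new->set_pipeline_maker($self->{pm});
}

package main;

my $pm       = { dir => "/tmp/pipeline" };   # placeholder for a pipeline maker object
my $pipeline = Sketch::Pipeline->new($pm);
print $pipeline->bsmap->align(sample => "s1"), "\n";
```

The benefit of this design is that pipeline code never constructs tool objects itself, so every tool automatically shares the same pipeline maker.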
Scripts for pipeline reports; currently only reports for the WGBS pipeline are available.
All established pipelines need paired-end FastQ files, so we need the paths of the FastQ files and the sample names. The working directory and some running modes are also common to all pipelines. This class therefore takes charge of the command-line parameters: it constructs and prints help messages, validates parameters and finally returns the validated and transformed variables.
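A minimal sketch of this kind of parameter handling, using the core Getopt::Long module, is shown below. The option names (--r1, --r2, --sample, --dir, --test) and the comma-separated lane lists are assumptions for illustration and do not necessarily match the options this class accepts.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Parse and validate a hypothetical option set; returns a hash reference
# holding the validated and transformed values.
sub parse_params {
    my (@argv) = @_;
    my %opt = (dir => ".", test => 0);   # defaults
    GetOptionsFromArray(\@argv, \%opt, 'r1=s', 'r2=s', 'sample=s', 'dir=s', 'test!')
        or die "invalid command line\n";
    for my $key (qw(r1 r2 sample)) {
        die "missing required option --$key\n" unless defined $opt{$key};
    }
    # Assumed convention: comma-separated lists give one file per lane.
    my @r1 = split /,/, $opt{r1};
    my @r2 = split /,/, $opt{r2};
    die "--r1 and --r2 must list the same number of files\n" unless @r1 == @r2;
    return { %opt, r1 => \@r1, r2 => \@r2 };
}

my $param = parse_params(
    '--r1', 'a_1.fq.gz,b_1.fq.gz',
    '--r2', 'a_2.fq.gz,b_2.fq.gz',
    '--sample', 's1',
);
print "sample $param->{sample}: ", scalar @{ $param->{r1} },
      " lane(s), dir $param->{dir}\n";
```

Validating paired-end input up front means a mismatch between the two read lists is reported before any job is submitted.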
Utility methods for all CO classes.