bauhaus is a prototype implementation of a minimal tertiary-analysis system for use in-house at PacBio. It is not intended as an official solution, but rather as an experimental playground for some ideas about how users can specify input and analysis conditions.
bauhaus is best understood as a compiler. It accepts the user's specification of the experiment (a CSV table with a well-defined schema), validates the table, resolves the inputs (which can be referred to symbolically, using runcodes or job identifiers, or explicitly, using paths), and then generates an output directory containing the files run.sh and build.ninja. build.ninja is a jobscript runnable using the "ninja" build tool; run.sh is the main entry point, which sets up the software environment properly and then executes the Ninja script.
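For concreteness, the generated run.sh amounts to a small wrapper along the following lines. This is a hypothetical sketch, not the literal generated script; the environment-setup step in particular is site-specific.

    #!/bin/bash
    # Hypothetical sketch of a generated run.sh: prepare the software
    # environment, then hand execution over to the ninja build tool.
    source /path/to/environment-setup.sh   # made-up path; the real setup is installation-specific
    ninja -f build.ninja                   # run the generated jobscript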
Please read the tutorial.
To run a workflow, with inputs and variables specified by a condition table, you can invoke the run subcommand:
% bauhaus -o myWorkflow -w {workFlowName} -t condition-table.csv run
This command does a few things: 1) it validates that the condition table honors the schema and refers to valid inputs; 2) it compiles the ninja script for the workflow and copies it, along with other required scripts, to the output directory ("myWorkflow" here); and 3) it executes the workflow.
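For orientation, a condition table is just a small CSV that groups inputs by condition and records the associated experimental variables. The sketch below is purely illustrative; the column names and runcode values are made up, so consult the tutorial for the actual schema.

    Condition,RunCode,Reference
    ConditionA,3150005-0001,lambdaNEB
    ConditionA,3150005-0002,lambdaNEB
    ConditionB,3150005-0003,lambdaNEB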
You can also run the validate and generate steps without running the workflow. The generate subcommand performs validation and generates the workflow directory, ready for execution.
% bauhaus -o myWorkflow -w {workFlowName} -t condition-table.csv generate
The validate subcommand performs just the validation.
% bauhaus -w {workFlowName} -t condition-table.csv validate
- experiment subpackage: A model and API for the condition table that is provided by the user as a specification of the input data, grouped by "conditions" with associated experimental variables. This model includes a specification; the API validates the condition table, enabling quick feedback on problems in experiment setup. This feature was lacking in the original Milhouse system.
- pbls2 subpackage: An API for resolving internal-PacBio input data specifiers (runcodes, job IDs) to concrete NFS paths to dataset objects.
- pflow subpackage: A minimal, experimental workflow engine I wrote just as a lark. It takes a different approach from other engines: running pflow just generates a ninja build file that can then be invoked to execute the workflow (a rough sketch of this approach follows this list). There are advantages to this approach: it enables workflows to be composed in a language (Python) that has genuine capabilities for composability, and then leaves execution to be driven by a separate, robust tool. However, it lacks dynamic capabilities, and the ninja build ...
- workflows subpackage: Workflows building on the pflow engine and the experiment model. The workflows are divided into secondary and tertiary analyses. The distinction is that secondary workflows treat input conditions independently, just shepherding a condition through mapping, variant calling, etc.; tertiary workflows build on secondary workflows and then perform a meta-analysis, comparing conditions and generating plots and tables.
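To make the pflow approach above concrete, here is a rough sketch of the general "compile to ninja" idea in Python. This is not the real pflow API; the NinjaWorkflow class and the map_reads command are invented purely for illustration.

    # Sketch of composing a workflow in Python and emitting a ninja build file.
    class NinjaWorkflow:
        def __init__(self):
            self.rules = {}    # rule name -> command template
            self.builds = []   # (outputs, rule name, inputs)

        def rule(self, name, command):
            self.rules[name] = command

        def build(self, outputs, rule, inputs):
            self.builds.append((outputs, rule, inputs))

        def write(self, path="build.ninja"):
            with open(path, "w") as f:
                for name, command in self.rules.items():
                    f.write("rule %s\n  command = %s\n\n" % (name, command))
                for outputs, rule, inputs in self.builds:
                    f.write("build %s: %s %s\n" % (" ".join(outputs), rule, " ".join(inputs)))

    # Ordinary Python (loops, functions) is available for composing the workflow.
    wf = NinjaWorkflow()
    wf.rule("map", "map_reads $in $out")   # "map_reads" is a made-up command
    for condition in ["ConditionA", "ConditionB"]:
        wf.build(["mapped/%s.bam" % condition], "map", ["inputs/%s.subreads.bam" % condition])
    wf.write()
    # Execution is then delegated to the separate ninja tool:  % ninja -f build.ninja

The point of this division of labor is that composition happens in a full programming language, while scheduling, incremental re-execution, and failure handling are left to ninja.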
The following subpackages will be spun out into their own Python packages:
- pbls2
- experiment (as pbexperiment)
The pflow workflow engine is just a lark and is not intended to be used for the "real" tertiary analysis system. For that, we are going to leverage pbsmrtpipe.