lcr-bccrc / lcr-modules

Collection of standard analytical pipelines for genomic and transcriptomic data

Home Page: https://lcr-modules.rtfd.io

License: MIT License

Python 84.99% Shell 0.66% R 6.90% Perl 7.41% Awk 0.05%

lcr-modules's People

Contributors

adossantos1, brunograndephd, ckrushton, faranehym, gillissierra, hayashaalan, jacky-yiu, jwong684, kdreval, kmcoyle, lakshay-sethi, lkhilton, mannycruz, mattssca, nthomas11, ppararaj, rdmorin, whelena


lcr-modules's Issues

null scratch_directory causes fatal error

It looks like the setup_subdirs function expects a non-null value for scratch_directory. I was loading two configs, one with scratch_directory set to "scratch/" and one with it set to "null". The latter was loaded by header.smk and clobbered the former, which results in a TypeError (see below).

TypeError in line 25 of /projects/rmorin/projects/gambl-repos/gambl-rmorin/src/lcr-modules/modules/battenberg/1.0/battenberg.smk:
expected str, bytes or os.PathLike object, not NoneType
  File "/projects/rmorin/projects/gambl-repos/gambl-rmorin/Snakefile", line 17, in <module>
  File "/projects/rmorin/projects/gambl-repos/gambl-rmorin/src/lcr-modules/modules/battenberg/1.0/battenberg.smk", line 25, in <module>
  File "/projects/rmorin/software/lcr-modules/oncopipe/oncopipe/__init__.py", line 1545, in setup_module
  File "/projects/rmorin/software/lcr-modules/oncopipe/oncopipe/__init__.py", line 1630, in setup_subdirs
  File "/projects/rmorin_scratch/conda_environments/gambl-default/lib/python3.7/posixpath.py", line 80, in join

GRIDSS-PURPLE-LINX (CNV and SV callers)

There is a working Snakefile I've been using for this pipeline, based on Laura's implementation. It generates a massive number of temporary files and really needs to make proper use of scratch space and the shadow directive.
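
For context, Snakemake's shadow directive runs a rule inside an isolated working directory that is discarded afterwards, which is one way to contain the temp-file sprawl. A minimal sketch (rule name, paths, and the gridss command line are placeholders, not the working Snakefile mentioned above):

rule _gridss_call:
    input:
        normal_bam = "data/{normal_id}.bam",
        tumour_bam = "data/{tumour_id}.bam",
    output:
        vcf = "results/gridss/{tumour_id}--{normal_id}.vcf.gz",
    shadow: "minimal"  # temp files written to the working directory are cleaned up
    shell:
        "gridss --output {output.vcf} {input.normal_bam} {input.tumour_bam}"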

RNAseq Variant Calling Pipeline

I have an RNAseq Variant Calling Pipeline in snakemake form, adapted from the one made by Jeffrey for BL. I can modify it to fit lcr-modules guidelines.

cookiecutter always adds unmatched normal to default config

Could the cookiecutter be modified to ask whether the user wants to analyze unpaired tumours with an unmatched normal? I don't think this is asked (or at least not in an intuitive way), so I keep ending up with this config:

            genome:
                run_paired_tumours: True
                run_unpaired_tumours_with: "unmatched_normal"
                run_paired_tumours_as_unpaired: False
            capture:
                run_paired_tumours: True
                run_unpaired_tumours_with: "unmatched_normal"
                run_paired_tumours_as_unpaired: False

input bams are broken symlinks when results directory is a symlink

I've modified the demo while creating a new module. I created a symlink named "results" that points to another location where I want the pipeline to write. The symlinks that get created are broken and Snakemake exits with an error. Oddly, when I rename that symlink to results_symlink and replace it with an actual directory, the links work fine.

.
├── BATTENBERG_Snakefile
├── config.yaml
├── data
│   ├── 02-13135N.bam -> /projects/rmorin/projects/gambl-repos/gambl-rmorin/data/genome_bams/02-13135N.grch37.bam
│   ├── 02-13135N.bam.bai -> /projects/rmorin/projects/gambl-repos/gambl-rmorin/data/genome_bams/02-13135N.grch37.bam.bai
│   ├── 02-13135T.bam -> /projects/rmorin/projects/gambl-repos/gambl-rmorin/data/genome_bams/02-13135T.grch37.bam
│   ├── 02-13135T.bam.bai -> /projects/rmorin/projects/gambl-repos/gambl-rmorin/data/genome_bams/02-13135T.grch37.bam.bai
│   ├── samples.tsv
│   ├── TCRBOA7-N-WEX.bam -> /projects/bgrande/lcr-modules/test_data/TCRBOA7-N-WEX.bam
│   ├── TCRBOA7-N-WEX.bam.bai -> /projects/bgrande/lcr-modules/test_data/TCRBOA7-N-WEX.bam.bai
│   ├── TCRBOA7-T-RNA.bam -> /projects/bgrande/lcr-modules/test_data/TCRBOA7-T-RNA.bam
│   ├── TCRBOA7-T-RNA.bam.bai -> /projects/bgrande/lcr-modules/test_data/TCRBOA7-T-RNA.bam.bai
│   ├── TCRBOA7-T-RNA.read1.fastq.gz -> /projects/bgrande/lcr-modules/test_data/TCRBOA7-T-RNA.read1.fastq.gz
│   ├── TCRBOA7-T-RNA.read2.fastq.gz -> /projects/bgrande/lcr-modules/test_data/TCRBOA7-T-RNA.read2.fastq.gz
│   ├── TCRBOA7-T-WEX.bam -> /projects/bgrande/lcr-modules/test_data/TCRBOA7-T-WEX.bam
│   └── TCRBOA7-T-WEX.bam.bai -> /projects/bgrande/lcr-modules/test_data/TCRBOA7-T-WEX.bam.bai
├── dry-run.sh
├── reference
│   ├── chrom_mappings -> /projects/bgrande/reference_files/chrom_mappings
│   ├── downloads -> /projects/bgrande/reference_files/downloads/
│   └── genomes -> /projects/bgrande/reference_files/genomes/
├── results
│   └── battenberg-1.0
├── results_symlink -> /projects/rmorin_scratch/rmorins_stuff/results/
├── run.sh
├── scratch
│   └── star-1.0
└── Snakefile

(lcr-modules) -bash-4.2$ ls -lhH results/battenberg-1.0/00-inputs/bam/genome--grch37/02-13135T.bam
-rw-r--r-- 1 bioapps users 164G Mar 24 13:02 results/battenberg-1.0/00-inputs/bam/genome--grch37/02-13135T.bam
(lcr-modules) -bash-4.2$ ls -lhH results_symlink/battenberg-1.0/00-inputs/bam/genome--grch37/02-13135T.bam
ls: cannot access results_symlink/battenberg-1.0/00-inputs/bam/genome--grch37/02-13135T.bam: No such file or directory
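
A likely explanation (an assumption, not confirmed from the code) is that the relative symlink targets are computed from the logical path of the results directory; when results is itself a symlink, the kernel resolves the link from its physical location, so the relative target no longer points at the data. A minimal illustration:

import os

bam = "/projects/example/data/sample.bam"                 # real file (hypothetical path)
dest_dir = "/projects/example/results/module/00-inputs"   # "results" is a symlink to scratch

# Target computed from the logical (unresolved) destination directory:
target = os.path.relpath(bam, dest_dir)                   # "../../../data/sample.bam"

# That target is resolved from the *physical* location of dest_dir (inside
# scratch), where ../../../data/sample.bam does not exist, so the link is broken.

# Computing the target from the resolved destination avoids the problem:
safe_target = os.path.relpath(bam, os.path.realpath(dest_dir))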

Rainstorm/Doppler

Python and R implementations exist. Need to consider if we should make Snakemake workflows for both or just move forward with the most useful/efficient.

Reference subworkflow error: Permission denied

I tried pointing my Snakefile to the pre-built reference directory at /projects/bgrande/reference_files, but I get the following error on a dry run:

PermissionError: [Errno 13] Permission denied: '/projects/bgrande/reference_files/.snakemake/log/2020-06-10T143949.122053.snakemake.log'

The Snakefile I'm running is at /home/lhilton/rrDLBCL_trios, and looks like this:

import oncopipe as op 

# Load default module config 
configfile: "/home/lhilton/repos/lcr-modules/modules/manta/2.0/config/default.yaml"

# Load project config
configfile: "config/config.yaml"

# Load samples
SAMPLES = op.load_samples("data/metadata/samples.tsv")
config["lcr-modules"]["_shared"]["samples"] = SAMPLES 

subworkflow reference_files: 
    workdir: 
        "/projects/bgrande/reference_files"
    snakefile: 
        "/home/lhilton/repos/lcr-modules/workflows/reference_files/1.0/reference_files.smk"
    configfile: 
       "/home/lhilton/repos/lcr-modules/workflows/reference_files/1.0/config/default.yaml"


# Include manta module Snakefile
include: "/home/lhilton/repos/lcr-modules/modules/manta/2.0/manta.smk"

oncopipe installation from pip appears to be broken

I get the following error when I attempt to build the reference directory after installing oncopipe from pip (pip install oncopipe):

AttributeError in line 125 of /home/rmorin/lcr-modules/workflows/reference_files/1.0/reference_files.smk:
module 'oncopipe' has no attribute 'as_one_line'
File "/home/rmorin/lcr-modules/workflows/reference_files/1.0/prepare_reference_files.smk", line 28, in
File "/home/rmorin/lcr-modules/workflows/reference_files/1.0/reference_files.smk", line 125, in

Workaround:
Installing the oncopipe version from the repo fixes this issue:

pip install -e lcr-modules/oncopipe

Installing collected packages: oncopipe
Attempting uninstall: oncopipe
Found existing installation: oncopipe 1.0.0
Uninstalling oncopipe-1.0.0:
Successfully uninstalled oncopipe-1.0.0
Running setup.py develop for oncopipe
Successfully installed oncopipe

/workflows/reference_files/1.0/prepare_reference_files.sh /projects/rmorin/TEMP/lcr-modules-reference/
Building DAG of jobs...

Refactor modutils as a more object-oriented package

What started off as a small collection of functions has become more complex and could use a refactor now that we know what is required of modutils. The package would be much simpler if an object-oriented approach were taken for describing modules. Here's a list of features that would be easier to implement with OOP (a rough sketch follows the list):

  • Format {REPODIR} and {MODSDIR} in formatters (e.g. switch_on_*() functions)
  • Use inspect package to pull standard variables from parent frame (example)
  • Automatically check/store locals() when setting up a module and restoring locals() when cleaning up a module, allowing switch_on_*() dictionaries to be created separately.
  • Better handling of unmatched_normal_id and unmatched_normals.
  • Better warning/error messages. For example, the module name should be included in the following warning:

    Some samples have seq_types {'mirna'} that are not configured in the pairing config. They will be dropped.
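
A very rough, hypothetical sketch of what an object-oriented interface might look like (names and behaviour are illustrative only, not a design decision):

class Module:
    """Hypothetical object-oriented wrapper around one lcr-modules module."""

    def __init__(self, config, name, version):
        self.name = name
        self.version = version
        self.config = config["lcr-modules"][name]

    def format_path(self, template, **placeholders):
        # One central place to expand {REPODIR}, {MODSDIR}, etc. for formatters
        return template.format(**placeholders)

    def warn(self, message):
        # Warnings automatically include the module name and version for context
        print(f"WARNING [{self.name}-{self.version}]: {message}")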

Manta fails if index files aren't produced simultaneously

I'm running a RNAseq pipeline where BAMs are aligned with STAR and duplicates are marked with GATK MarkDuplicates. The .bai index files are produced in a subsequent step. Because the Manta module takes only .bam files as input and assumes the .bai indexes exist, Snakemake proceeded to run Manta before the .bai indexes were created, meaning it populated the 00-inputs directory with dead symlinks for the .bai files. To ensure this doesn't happen, should the .bai files be specified as inputs to the Manta module?
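
Declaring the index explicitly would make Snakemake wait for it, along the lines of the Strelka2 input rule quoted further down. A sketch (the input key names are placeholders, not necessarily the actual Manta config):

rule _manta_input_bam:
    input:
        bam = CFG["inputs"]["sample_bam"],
        bai = CFG["inputs"]["sample_bam"] + ".bai",
    output:
        bam = CFG["dirs"]["inputs"] + "bam/{seq_type}--{genome_build}/{sample_id}.bam",
        bai = CFG["dirs"]["inputs"] + "bam/{seq_type}--{genome_build}/{sample_id}.bam.bai",
    run:
        op.relative_symlink(input.bam, output.bam)
        op.relative_symlink(input.bai, output.bai)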

Allow module output subdirectories to be symlinks to a scratch directory

Some modules create large intermediate files. Normally, the temp() feature in Snakemake can ensure that these files are automatically deleted. However, some file systems take snapshots (e.g. every hour) as a form of backup, and this will cause the large intermediate files to be retained in snapshots even after Snakemake deletes them via temp(). Also, not everyone is running their modules in a large space.

A workaround is to allow some module output subdirectories to be stored in a separate scratch space, i.e. one large enough to house the large intermediate files (at least some of them) and without snapshots.

My current proposal is to add an optional argument to setup_module() that specifies which output subdirectories should be made into symlinks to the scratch directory. For example, if the manta subdirectory is expected to contain large files, the manta module could be set up as follows:

CFG = md.setup_module(
    config = config, 
    name = "manta", 
    version = "1.0",
    subdirs = ["inputs", "chrom_bed", "manta", "bedpe", "outputs"],
    req_references = ["genome_fasta", "genome_fasta_index", "main_chroms"],
    scratch_subdirs = ["manta"]
)

BEDPE files from Manta have a massive header

It looks like the VCF-to-BEDPE conversion is leaving the VCF header in place or adding a new (similar) header to the BEDPE files. The header is about 236 lines, with the last line naming the columns and the rest somewhat extraneous. Is the inclusion of this header intentional? If so, do we really want it?

##fileformat=BEDPE
##fileDate=20200630
##source=GenerateSVCandidates 1.6.0
##reference=file:///projects/rmorin/projects/gambl-repos/gambl-rmorin/ref/lcr-modules-references/genomes/hg38/genome_fasta/genome.fa
##contig=<ID=chr1,length=248956422>
##contig=<ID=chr2,length=242193529>
##contig=<ID=chr3,length=198295559>
##contig=<ID=chr4,length=190214555>
##contig=<ID=chr5,length=181538259>
##contig=<ID=chr6,length=170805979>
##contig=<ID=chr7,length=159345973>
##contig=<ID=chr8,length=145138636>
##contig=<ID=chr9,length=138394717>
##contig=<ID=chr10,length=133797422>
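
If the metadata lines are not wanted, a small post-processing rule could keep only the column line and the records. A sketch (rule name and paths are hypothetical):

rule _manta_bedpe_strip_header:
    input:
        bedpe = "results/manta/bedpe/{tumour_id}--{normal_id}.bedpe",
    output:
        bedpe = "results/manta/bedpe/{tumour_id}--{normal_id}.slim.bedpe",
    shell:
        "grep -v '^##' {input.bedpe} > {output.bedpe}"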

Strelka2 doesn't work properly with Manta vcfs

I tried running Strelka2 1.0 with output from the latest Manta module (2.2), and it failed when trying to compress files that were already compressed during the Manta workflow. I don't know whether the compression and indexing are a more recent addition to Manta, but if so, the Strelka2 module needs to be updated to fix this. If someone can confirm, I'm happy to make the patch. It's currently working with this modification:

rule _strelka_input_vcf:
    input:
        vcf = CFG["inputs"]["candidate_small_indels"],
        tbi = CFG["inputs"]["candidate_small_indels"] + ".tbi"
    output:
        vcf = CFG["dirs"]["inputs"] + "{seq_type}--{genome_build}/vcf/{tumour_id}--{normal_id}--{pair_status}.candidateSmallIndels.vcf.gz",
        tbi = CFG["dirs"]["inputs"] + "{seq_type}--{genome_build}/vcf/{tumour_id}--{normal_id}--{pair_status}.candidateSmallIndels.vcf.gz.tbi"
    run:
        op.relative_symlink(input.vcf, output.vcf)
        op.relative_symlink(input.tbi, output.tbi)

#    params:
#        vcf = CFG["inputs"].get("candidate_small_indels_vcf") or "",
#        tbi = CFG["inputs"].get("candidate_small_indels_tbi") or ""
#    conda:
#        CFG["conda_envs"]["tabix"]
#    shell:
#        op.as_one_line("""
#        bgzip -c {input.manta_vcf} > {output.vcf}
#            &&
#        tabix {output.vcf}
#        """)

sambamba error

I was testing lcr-modules using the demo Snakefile and demo samples. The workflow failed at the sambamba markdup step. The log shows that the step failed because it could not open a new file:
sambamba-markdup: Cannot open file "/tmp/sambamba-pid2186658-markdup-lhow/PairedEndsInfoqbqa9" in mode "w+" (Too many open files)
This appears to be a common error; in biod/sambamba#177 it was suggested to increase the overflow list size with the flag --overflow-list-size 600000. I tried this and the workflow completed. I doubt that many temporary files were actually needed, but the flag let the workflow proceed.
However, to add the flag I had to edit the Snakefile utils.smk in the utils module directly.
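
For reference, the relevant shell command in utils.smk would end up looking something like this (a sketch; the actual rule in the utils module differs):

rule _utils_bam_markdups:
    input:
        bam = "prefix/{sample_id}.bam",
    output:
        bam = "prefix/{sample_id}.mdups.bam",
    threads: 8
    shell:
        "sambamba markdup --overflow-list-size 600000 "
        "--nthreads {threads} {input.bam} {output.bam}"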

Specify unmatched_normal_id based on wildcard

For gambl we have three different genome builds, and each will need an unmatched normal specified. I've tried implementing this a couple of ways, most recently by including this in header.smk:

GENOME_DEFAULT_SWITCH = op.switch_on_wildcard("genome_build", {
    "grch37": "14-11247N", 
    "hg38": "BLGSP-71-06-00286-99A-01D", 
    "hs37d5": "SP116656"
})
shared_config["pairing_config"]["genome"]["unmatched_normal_id"] = GENOME_DEFAULT_SWITCH
genome_default = shared_config["pairing_config"]["genome"]["unmatched_normal_id"]
PAIRS = op.generate_pairs(SAMPLES, genome=("allow_unmatched", genome_default), mrna="no_normal")

I get the following error when I run the above with snakemake -np -s src/snakemake/header.smk:

AssertionError in line 71 of /projects/rmorin/projects/gambl-repos/gambl-lhilton/src/snakemake/header.smk:
There are 0 genome samples matching the normal ID <function switch_on_wildcard.<locals>._switch_on_wildcard at 0x7fb56528a440> (instead of just one).
  File "/projects/rmorin/projects/gambl-repos/gambl-lhilton/src/snakemake/header.smk", line 71, in <module>
  File "/home/lhilton/miniconda3/lib/python3.7/site-packages/oncopipe/__init__.py", line 1254, in generate_pairs
  File "/home/lhilton/miniconda3/lib/python3.7/site-packages/oncopipe/__init__.py", line 1072, in generate_runs

oncopipe version 1.0.2
snakemake-minimal version 5.14.0

What can I change to get this to work?
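
One possible workaround (an untested sketch; it assumes SAMPLES is a pandas DataFrame with a genome_build column and that op.generate_pairs returns a DataFrame) is to resolve the normal ID per genome build before calling generate_pairs, instead of passing the switch function itself as the ID:

import pandas as pd

GENOME_DEFAULT_NORMALS = {
    "grch37": "14-11247N",
    "hg38": "BLGSP-71-06-00286-99A-01D",
    "hs37d5": "SP116656",
}

pairs_by_build = []
for build, normal_id in GENOME_DEFAULT_NORMALS.items():
    # Hand generate_pairs only the samples from one build, with that build's
    # unmatched normal as a plain string rather than a function
    subset = SAMPLES[SAMPLES["genome_build"] == build]
    pairs_by_build.append(
        op.generate_pairs(subset, genome=("allow_unmatched", normal_id), mrna="no_normal")
    )

PAIRS = pd.concat(pairs_by_build, ignore_index=True)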

Allow for a shared space for lcr-modules conda environments

Conda environments can take up a considerable amount of space and they take time to create. By default, they are installed locally in .snakemake/conda, meaning that time and space would be duplicated for each project.

It would be ideal to have a shared space to store the conda environments in lcr-modules since they are identical between different projects. This will speed up module launching since the environments will have already been created, and it will reduce disk space usage.

The --conda-prefix option is meant for this purpose, but it's set globally, not on a rule-by-rule basis. Hence, any personal rules or work-in-progress environments would get added to the shared space unless the user is diligent about using --conda-prefix only when running lcr-modules.

An alternative approach is a script (or code added to modutils) that symlinks the lcr-modules environments from the shared directory into .snakemake/conda. This would involve replicating (or leveraging) the Snakemake code that generates the hash for an environment, which determines the conda folder name.
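
For reference, the global option looks like this on the command line (the prefix path is a placeholder):

snakemake --use-conda --conda-prefix /path/to/shared/lcr-modules-envs --cores 24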

Manta won't rerun if input BAM files are updated

I have made changes to some of the BAM files I'm using as input to Manta because of errors. I deleted the BAM files in question and was planning to run Snakemake with _manta_all as the target rule, which should also trigger regenerating the input BAM files, but this doesn't seem to be working. I deleted the symlinks in the 00-inputs folder, but my Snakemake dry run still tells me that the only jobs that will be run are as follows:

Job counts:
        count   jobs
        1       _manta_all
        3       _manta_all_dispatch
        3       _manta_run
        7

Given that the input BAM files don't currently exist, this is troubling behaviour: it suggests that changes to input files won't propagate to output files without significant manual curation of the contents of the manta-1.0 directory by the user.
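
One way to force things through in the meantime (a suggestion, not a confirmed fix) is to explicitly force the input-symlinking rule so that it, and everything depending on its output, is scheduled again:

snakemake _manta_all --dry-run --forcerun _manta_input_bam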

change config in demo to not use relative paths

I am wondering if it would be better to use the REPODIR and MODSDIR variables explicitly in the demo Snakefile instead of relative paths. This would give users concrete examples of how to specify paths with these variables and prevent them from needing to change the paths if they use this Snakefile to start their own project outside the demo directory.

subworkflow reference_files:
    workdir:
        "reference/"
    snakefile:
        "{REPODIR}/workflows/reference_files/1.0/reference_files.smk"
    configfile:
        "{REPODIR}/workflows/reference_files/1.0/config/default.yaml"

# Load module-specific configuration
configfile: "{MODSDIR}/star/1.0/config/default.yaml"
configfile: "{MODSDIR}/manta/2.0/config/default.yaml"

STAR

It is becoming increasingly clear to me that we need a STAR pipeline that runs it with sensible parameters. The default settings are terrible. We should generate BAM files that can be run through STAR-Fusion later on and that retain all the original reads, namely (a rough sketch of relevant flags follows the list):

Chimeric/discordant read pairs
Unmapped reads
No hard clipping
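
The STAR options below are a hedged starting point for those requirements, not a vetted parameter set (the index, read files, and values would all need adjusting): --outSAMunmapped Within keeps unmapped reads in the BAM, and --chimOutType WithinBAM SoftClip keeps chimeric alignments in the main BAM without hard clipping.

STAR --runThreadN 8 \
     --genomeDir ref/star_index \
     --readFilesIn sample.read1.fastq.gz sample.read2.fastq.gz \
     --readFilesCommand zcat \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMunmapped Within \
     --chimSegmentMin 12 \
     --chimOutType WithinBAM SoftClip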

Manta 99-outputs contains only "dispatched" files

The current (v 2.2) Manta workflow does not symlink any of the actual results from Manta to 99-outputs. Shouldn't at least the bed files be symlinked there or am I misunderstanding the purpose of this directory?

ls results/icgc_dart/manta-2.2/99-outputs/dispatched/genome--hg38/
SP116652--SP116651--matched.dispatched  SP59270--SP59269--matched.dispatched  SP59284--SP59282--matched.dispatched  SP59332--SP59330--matched.dispatched
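
If 99-outputs is meant to carry the actual deliverables, a rule along these lines (paths and key names are placeholders) could symlink the final VCFs there, mirroring the input-symlinking rules:

rule _manta_output_vcf:
    input:
        vcf = CFG["dirs"]["manta"] + "{seq_type}--{genome_build}/{tumour_id}--{normal_id}--{pair_status}/results/variants/somaticSV.vcf.gz",
    output:
        vcf = CFG["dirs"]["outputs"] + "vcf/{seq_type}--{genome_build}/{tumour_id}--{normal_id}--{pair_status}.somaticSV.vcf.gz",
    run:
        op.relative_symlink(input.vcf, output.vcf)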

Directing Manta to look at STAR bams

I've pulled the latest master (00d1cca) and installed oncopipe. When I try to launch the demo Snakefile I get this error message:

MissingInputException in line 41 of /projects/rmorin/software/lcr-modules/modules/manta/2.0/manta.smk:
Missing input files for rule _manta_input_bam:
data/TCRBOA7-T-RNA.bam.bai
data/TCRBOA7-T-RNA.bam

I guess since the manta module is set to look for BAM files in the data directory, it expects to find an RNA-seq BAM file there. However, we really want it to look at the BAM in data for capture samples, and at the BAM output by STAR for RNA-seq. How can a user set up their config to use a different input directory for Manta depending on whether the input is RNA-seq or something else?
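
One possible approach (untested; the sample_bam key and the STAR output path are assumptions, and it relies on the module accepting a callable input) is to make the BAM input a function of the seq_type wildcard with op.switch_on_wildcard:

config["lcr-modules"]["manta"]["inputs"]["sample_bam"] = op.switch_on_wildcard(
    "seq_type",
    {
        # Hypothetical paths: STAR-aligned BAMs for RNA-seq, the raw data directory otherwise
        "mrna": "results/star-1.0/99-outputs/bam/mrna--{genome_build}/{sample_id}.bam",
        "genome": "data/{sample_id}.bam",
        "capture": "data/{sample_id}.bam",
    },
)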

ChildIOException

After my last successful run of my sequenza pipeline I deleted an intermediate file and tried re-running it. Interestingly, this error is not from the same sample as the file I deleted so I don't know what caused this. Any idea what this might relate to?

Building DAG of jobs...
ChildIOException:
File/directory is a child to another output:
('/home/rmorin/lcr-modules/demo/results/sequenza-1.0/03-plots/capture--grch37/TCRBOA7-T-WEX--TCRBOA7-N-WEX--matched/TCRBOA7-T-WEX_filtered_segments.igv.seg', _sequenza_filtered_igv_segments)
('/home/rmorin/lcr-modules/demo/results/sequenza-1.0/03-plots/capture--grch37/TCRBOA7-T-WEX--TCRBOA7-N-WEX--matched/TCRBOA7-T-WEX_filtered_segments.igv.seg', _sequenza_output_seg)

vcf2maf pipeline

I've created one of these that uses a Docker image. It should be easily modified to make that optional and use conda instead. I haven't tried to get vcf2maf working via conda.

ichorCNA

I expect we will begin applying this to normals to identify tumour contamination and I hope it will be useful to estimate purity in pre-selecting tumours for deeper sequencing (e.g. WGS).

Manta (SV caller)

This should be a very straightforward one. It's unclear to me if Manta on FFPE genomes is giving useful results. If so, is there anything we need to do differently for this data type?

U1 spliceosomal RNA variant calling

I've been given most pieces of the pipeline implemented by Elias Campo's group to realign reads and call and annotate somatic variants in U1. This should be easy to convert to Snakemake.

Feature Requests

I've put together the following list of feature requests to keep in mind when designing the pipelines based on a discussion with the CLC bioinformaticians. I've also listed some pros and cons to consider for each feature.

Terminology

Before going over the features, I wanted to bring up the issue of terminology. During the meeting, we agreed to refer to the modular units as modules (what I've referred to as pipelines so far), and reserve the term pipeline for an arrangement of modules. For example, Strelka variant calling would be encapsulated into a module whereas genome analysis would be a pipeline consisting of multiple modules (probably one for read alignment and one for each mutation type). The steps within each module are referred to as rules to adopt the Snakemake term.

Flexible Modules

The motivation for these modules is to standardize commonly performed analyses. Hence, they are expected to be run without modification in most cases. In practice though, tweaks are often required in the research setting. For this reason, these modules should remain flexible and easily allow for modifications. This primarily hinges on readability and accessibility of the Snakefiles. As long as users can understand them (assuming some Snakemake knowledge), they should be able to add, remove or edit rules.

Pros

  • Modules are more useful for a larger number of people if they can be easily tweaked.
  • By having the modules more readable, they will be easier to maintain in the long run.

Cons

  • Prioritizing readability might preclude more advanced features that require more complicated code in the Snakefiles.

Granular Rules

Some have experienced modules where many tasks are done in one rule. In turn, this obfuscates the module, making it harder to understand, and complicates the addition or removal of individual tasks (e.g. inserting a step between two tasks captured in one rule). To avoid this issue, I propose we follow the Unix philosophy and limit each rule to a small task.

Pros

  • Modules will be easier to understand.
  • Rules can more readily be shared between modules if we pursue the route of Snakemake wrappers.

Cons

  • Obviously, this will make modules longer to develop.

Modular Design

Ideally, the various modules should be modular and easily feed into one another. For example, a read alignment module should be readily "connectable" with a variant detection module. The modules should expose an easy way of defining the input files. Practically speaking, this means we should ensure that the inputs and outputs of the various modules match up as much as possible. In other words, in a pipeline where the modules are the edges and the intermediate file types are the nodes, we want to minimize the number of different node types.

For example, a Salmon RNA-seq module would be nicely complemented by a STAR/featureCount RNA-seq module. Both modules should take paired FASTQ files as input and output expression matrices using the same format (ideally). Here, while STAR/featureCount will produce gene-level counts, Salmon will produce gene-level and transcript-level counts. It's okay if one module generates additional files.

Pros

  • Running similar pipelines and comparing their results will be easy.
  • Less "glue" will be needed if the number of different intermediate files is kept to a minimum.
  • A modular design will make it easy to assemble meta-modules (or pipelines) that incorporate related modules (e.g. an exome analysis pipeline).

Cons

  • Trying to stick with the fewest number of different intermediate files might reduce the flexibility of the modules we can create (if we are too restrictive).

User-configurable Cluster Parameters

Given that resource requirements can vary wildly, even for the same rule, based on the input data, it is important that the number of CPUs and the amount of memory can be configured by the user. This way, if an exome is being run through a module typically run on genomes, it will be easy to reduce the resource requirements, which in turn will reduce the job wait time in the cluster queue.
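
A sketch of what per-rule, user-configurable resources might look like (the CFG keys and values are hypothetical, not the current config schema):

rule _module_heavy_step:
    input:
        bam = "inputs/{sample_id}.bam",
    output:
        txt = "outputs/{sample_id}.txt",
    threads:
        CFG["threads"].get("heavy_step", 4)
    resources:
        mem_mb = CFG["mem_mb"].get("heavy_step", 8000)
    shell:
        "some_tool --threads {threads} {input.bam} > {output.txt}"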

Pros

  • This will make modules more portable to different compute environments.
  • By being able to refine the resource requirements for each module, jobs will run sooner on the cluster.
  • By storing resource requirements for different types of module inputs, that knowledge can be shared among the module users (e.g. FF genomes need 8 GB in this rule, but FFPE genomes need double that).

Cons

  • If we add configurable cluster parameters for each rule, this will add significant bulk to the configuration required for each module.

Software Versioning

These modules should ideally ensure that the versions of software tools being used are tracked. This can naturally be done using conda environments. However, given our past experience with packages suddenly disappearing online and environments being unable to be instantiated just three weeks after they were originally created, I propose we adopt a mixed strategy: (1) we store the complete environment YAML file (with or without build IDs); (2) we also store the environment specification for just the packages that we requested at the command line (i.e. using the --from-history conda option); and (3) after instantiating a read-only local copy on the GSC file system, we store the path to this conda environment for local use.
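
For reference, capturing the first two specifications from an activated environment might look like this (file names are arbitrary):

conda env export > environment.full.yaml
conda env export --from-history > environment.from-history.yaml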

Pros

  • This will tremendously help with ensuring the reproducibility of our analyses over time.

Cons

  • As I mentioned, conda can be flaky, so we must take precautions here to avoid getting bitten.
  • This involves extra work for each module because three environments will need to be specified.

Automatic Logging

The automatic logging of commands, stdout, and stderr would make these modules even more useful and reproducible. Apparently, the level of logging is slightly different between running jobs locally vs. on the cluster. This discrepancy should be eliminated. I'm aware that a lot is stored in the .snakemake directory, but I don't think those logs are very discoverable.
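
For per-rule logging, the Snakemake log directive captures stdout and stderr the same way whether the job runs locally or on the cluster; a sketch (paths and the command are placeholders):

rule _module_step:
    input:
        bam = "inputs/{sample_id}.bam",
    output:
        txt = "outputs/{sample_id}.txt",
    log:
        "logs/{sample_id}/module_step.log"
    shell:
        "some_tool {input.bam} > {output.txt} 2> {log}"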

Pros

  • These logs can help answer questions if they come up down the line.

Cons

  • This feature might add a lot of overhead to each rule depending on how it's implemented.

File Permissions

These modules should come with guidelines on how to ensure proper file permissions. In particular, it would be great if the owning group and the group read/write permissions were automatically set for all output files as well as the hidden .snakemake files.
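
One possible mechanism (a sketch; the group name is hypothetical, and whether this belongs in each module or in a top-level Snakefile is an open question) is an onsuccess handler that fixes group ownership and permissions once the workflow finishes:

onsuccess:
    # Hypothetical group name; adjust group ownership and permissions on
    # the results and the hidden Snakemake metadata
    shell("chgrp -R morinlab results .snakemake && chmod -R g+rwX results .snakemake")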

Pros

  • More than one person could easily work on a project if this was implemented.

Cons

  • This feature might add a lot of overhead to each rule depending on how it's implemented.

IGV screenshot pipeline

A beta version with Python helper script has already been implemented by Ryan and Jasper.

Automatic cram-to-bam for compressed bam files

We will inevitably start encountering CRAM-compressed genomes during GAMBL and other projects. It would be wonderful if we had a standard rule or module that would handle this scenario. If it were an optional module, anyone encountering a CRAM could add it when necessary. I envision the functionality being (a rough sketch follows the list):

  1. decompress to bam in scratch directory
  2. regenerate index for bam
  3. symlink as usual, as per the first step in an lcr-modules pipeline
  4. optionally remove the bam after all steps in the workflow that rely on that bam have completed
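
A minimal sketch of steps 1 and 2 (rule name, paths, and the genome FASTA are placeholders), with step 4 handled by Snakemake's temp():

rule _utils_cram_to_bam:
    input:
        cram = "data/{sample_id}.cram",
        fasta = "reference/genome.fa",
    output:
        bam = temp("scratch/cram2bam/{sample_id}.bam"),
        bai = temp("scratch/cram2bam/{sample_id}.bam.bai"),
    threads: 4
    shell:
        "samtools view -@ {threads} -b -T {input.fasta} -o {output.bam} {input.cram} "
        "&& samtools index {output.bam}"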
