aryarm / as_analysis
A complete Snakemake pipeline for detecting allele specific expression in RNA-seq
License: MIT License
STAR should be called with the --alignEndsType EndToEnd
argument to force reads to be globally aligned (instead of locally)
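A sketch of the mapping call with that flag; all other arguments here are assumptions based on a typical paired-end STAR invocation, not the pipeline's actual command:

```shell
STAR --runThreadN 8 \
     --genomeDir star_index/ \
     --readFilesIn sample_R1.fq.gz sample_R2.fq.gz \
     --readFilesCommand zcat \
     --alignEndsType EndToEnd \
     --outSAMtype BAM Unsorted
```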
The following directories should be created, and READMEs should be added for all but the schema dir.
Currently, the pipeline uses grep '0|1' to extract heterozygotes from the counts. However, grep '1|0' should also be allowed, since both 0|1 and 1|0 denote heterozygous genotypes.
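A minimal sketch of the combined filter (file contents hypothetical). With grep's default BRE syntax, '|' is a literal character, so two -e patterns cover both phasings:

```shell
# match either heterozygous phasing; '|' is literal under BRE
printf 'chr1 100 A G 0|1\nchr1 200 C T 1|0\nchr1 300 G A 0|0\n' \
  | grep -e '0|1' -e '1|0'
```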
Snakemake can validate config files against a schema. It will be easier for people to use your pipeline if they know what its inputs are. Get on this.
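A sketch of what schema validation looks like in a Snakefile; the schema path is an assumption:

```python
# at the top of the Snakefile; checks config against a YAML/JSON schema
from snakemake.utils import validate

configfile: "config.yaml"
validate(config, "schemas/config.schema.yaml")
```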
Both Snakefile-counts and Snakefile-WASP need four rules for downloading and installing WASP, splitting VCFs by chromosome, and running snp2h5. It might be cleaner and simpler to extract these rules into their own Snakefile and then include it in the other Snakefiles.
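Extracting the shared rules could look like this; the shared file's name is hypothetical:

```python
# at the top of both Snakefile-counts and Snakefile-WASP
include: "Snakefile-WASP-setup"
```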
We try to automatically install WASP for those who don't have it, but
If a user says they don't want to perform variant score recalibration, they probably still want to perform hard-filtering. Make this the alternative.
the message for this and this isn't very accurate
Sometimes it outputs a negative number because queryHits() can return duplicated hits.
Ideally, the Snakefile would also generate the required index files on the fly. Let's do that
Sometimes, intermediate files can take up a lot of space. Currently, we do a lazy job of minimizing space usage; make it better. One example: the gq_subset.table files.
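One low-effort improvement is marking such intermediates with temp() so Snakemake deletes them once every consumer has finished (rule name and path hypothetical):

```python
rule subset_gq:
    output:
        temp("gq_subset/{sample}.gq_subset.table")  # deleted after use
```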
Snakemake offers the ability to generate and maintain reports for the user from each rule. This might be a great way to log information about resource usage (specifically, memory). You may want to look into this eventually.
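Relatedly, Snakemake's benchmark directive records wall time and peak memory per job, which could feed such a report (rule name hypothetical):

```python
rule map_reads:
    benchmark:
        "benchmarks/{sample}.map_reads.tsv"  # includes max RSS per run
```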
Some users might choose to forego GATK's recalibration, perhaps because they want quick and dirty output. Others might want to compare their results with and without recalibration. In any case, it would be nice if your pipeline could offer this functionality as an option or switch.
Some of the WASP and counts pipelines use samtools, which can be multi-threaded. You should add threads parameters to those rules.
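A sketch of threading one samtools step; the paths and thread count are assumptions:

```python
rule sort_bam:
    input:
        "mapped/{sample}.bam"
    output:
        "sorted/{sample}.bam"
    threads: 8
    shell:
        "samtools sort -@ {threads} -o {output} {input}"
```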
use SnpSift and change the R script to accept it
Snakemake params can take input and output as arguments to their lambda functions. Use that to resolve them more cleanly.
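For example (rule, tool, and paths hypothetical):

```python
rule example:
    input:
        "counts/{sample}.txt"
    output:
        "results/{sample}.tsv"
    params:
        # the lambda receives wildcards plus output, so the prefix
        # is derived from the declared output rather than duplicated
        prefix=lambda wildcards, output: str(output[0]).rsplit(".", 1)[0]
    shell:
        "some_tool --prefix {params.prefix} {input}"  # hypothetical tool
```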
The current implementation runs both DNA and RNA counts for each sample in the same shell command, but you can use wildcards to create a single command that covers both.
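A sketch of the wildcard approach, assuming hypothetical paths; a constrained wildcard lets one rule serve both count types:

```python
rule get_counts:
    input:
        "align/{type}/{sample}.bam"
    output:
        "counts/{type}/{sample}.txt"
    wildcard_constraints:
        type="dna|rna"  # one rule, two instantiations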
Sample files as of now have been single-space delimited. That will become a problem when a file path contains a space. Use tab-separated sample files instead.
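A minimal sketch of tab-separated parsing; the column names and path are assumptions:

```python
import csv
import io

# a tab-separated samples table; the embedded space in the path survives
tsv = "sample\tfastq\nNA12878\t/data/my reads/NA12878_R1.fq.gz\n"
samples = {row["sample"]: row["fastq"]
           for row in csv.DictReader(io.StringIO(tsv), delimiter="\t")}
print(samples["NA12878"])  # → /data/my reads/NA12878_R1.fq.gz
```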
There's some inefficiency in how global functions are called. It shouldn't be too hard to just create a global variable for this instead.
Most of the counts output by WASP are genotyped as "0|1", some are genotyped as "0|0", fewer as "1|1", and fewer still as "-1|-1". What are these genotypes, and why are they appearing in a file that is supposed to contain only heterozygotes? Does WASP do its own genotyping?
The pipeline could attempt to download and install WASP if the user hasn't specified a WASP directory. This might be especially good given that all of WASP's dependencies are already listed as dependencies of ours. It might be a nice addition; think about it. Another benefit is that it would prevent problems where the version of hdf5 used to compile a library differs from the version used to run snp2h5.
It might be nice to have some print statements, just so one can watch the progress of the program and figure out what might be going wrong, should that ever happen.
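A minimal sketch of a progress helper; the prefix and message are assumptions:

```python
import sys

def progress(msg: str) -> str:
    # a fixed prefix makes progress lines easy to grep in cluster logs
    return f"[as_analysis] {msg}"

# write to stderr so stdout stays clean for piped data
print(progress("reading counts for sample NA12878"), file=sys.stderr)
```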
You can make the pipeline more robust to cluster mishaps using the --restart-times option in combination with a function that automatically scales resource requests depending on how often a job has had to be restarted. See the documentation for an explanation of how the scaling can be done.
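The documented pattern scales a resource with the attempt number, which --restart-times increments on each retry (rule name and values hypothetical):

```python
rule map_reads:
    resources:
        # 8 GB on the first try, 16 GB after one restart, and so on
        mem_mb=lambda wildcards, attempt: 8000 * attempt
```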
Perhaps Snakefile-variant_calling should have an option for running the entire pipeline, since it is much farther upstream than the other Snakefiles? You could use includes to do this. Add it to the pipeline and revise Snakefile-counts appropriately.
Add an option to run the pipeline with the RNA-only version instead of the DNA+RNA version.
for all of the Snakefiles. It should be 5.1.4
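If 5.1.4 refers to the minimum required Snakemake version (an assumption), it can be pinned at the top of each Snakefile:

```python
from snakemake.utils import min_version

min_version("5.1.4")  # fail early on older Snakemake
```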
I excluded it because it isn't clear yet whether it has all the config options that it needs. You should also decide whether you want to attempt to get it working or whether you want to use STAR's WASP mode (see issue #1).
Use of a separate samples file for the R portion of Snakefile-counts will prevent detection of rules that need to be rerun before it (if it already exists). You might want to figure out how to call the script from bash first. Presumably Rscript will work, but you'll want to make sure that Anaconda's version is being called, not the system default.
There are a lot of inputs in the config files. Find out whether you can get rid of some of them (like config-WASP in Snakefile-counts) or get the info you need from a smaller number of files.
You should specify environments for each rule, so that Snakemake can automatically handle them.
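A sketch of a per-rule environment, run with `snakemake --use-conda`; the env file and paths are hypothetical:

```python
rule find_imbalance:
    input:
        "results/counts.tsv"
    output:
        "results/imbalance.tsv"
    conda:
        "envs/r.yaml"  # hypothetical env file pinning R and its packages
    shell:
        "Rscript scripts/find_imbalance.r {input} {output}"
```

This also addresses the Rscript concern above: with --use-conda, the env's Rscript is on PATH for the rule, not the system default.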
WASP has updated to v0.3.x. Let's change envs/default.yaml so that the new version of WASP can be used. The new update will probably also allow us to get rid of envs/pytables2.yaml.
Make sure that you're actually requiring only the R packages you need in find_imbalance.r and prepare_counts.r. Delete each call to library() and see if anything breaks. This is to resolve the issues we were seeing in #26.
Unfortunately, it doesn't seem like Anaconda can reliably install all of the R packages currently required by allele_imbalance.r. You should investigate this further and see whether you can fix it so that all dependencies can be handled by Snakemake.
If a rule needs an index file, you should explicitly include it as an input. You should also include output index files, but only if another rule needs them.
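For example (rule, command, and paths hypothetical):

```python
rule get_as_counts:
    input:
        bam="sorted/{sample}.bam",
        bai="sorted/{sample}.bam.bai",  # index declared explicitly
    output:
        "counts/{sample}.txt"
    shell:
        "get_counts {input.bam} > {output}"  # hypothetical command
```

Declaring the .bai means Snakemake reruns this rule if the index is missing or stale, instead of failing mid-job.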
We decided it would be a good idea to split the Snakefile into multiple parts, so that it could be used more flexibly. That's why I thought it might be a good idea to split them up according to the FASTQ inputs. Here are the different sections:
However, I could also split them up according to the BAM outputs:
Which one should I do?
The current implementation calls find_imbalance.r once for each sample, which is quite inefficient. We can take advantage of R's efficiency with large data sets by having find_imbalance.r handle all of the samples at once (simply distinguishing samples using a column that contains sample names). This would require converting the input arguments to a file for the script to read.
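A sketch of writing that arguments file from Python; the column names, paths, and filename are assumptions:

```python
import csv

# one table of per-sample arguments for find_imbalance.r to read
# (column names are assumptions)
rows = [
    {"sample": "S1", "counts": "counts/S1.txt"},
    {"sample": "S2", "counts": "counts/S2.txt"},
]
with open("imbalance_args.tsv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["sample", "counts"],
                            delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)
```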
Snakefiles are supposed to figure out the minimum number of files they need to create in order to generate a requested output file. But some of the rules in the pipeline create a single file from a bunch of sample-specific files, like a funnel. As currently implemented, these rules necessarily require all sample-specific files as input. So when a downstream rule creates sample-specific files, Snakemake will always think it needs to create all upstream sample-specific files, because the funnel doesn't know which samples the user chose. In this way, the pipeline might be partly run on all samples even if only one of them is requested. Fix the funnel rules so that they determine which inputs they need based only on which outputs are requested downstream. This may involve creating global variables in the Snakefile.
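One sketch of the global-variable approach; the config key and paths are assumptions:

```python
# derive the sample list once, from config, so funnel rules only ask
# for the samples that were actually requested
SAMPLES = list(config["samples"])

rule combine_counts:
    input:
        expand("counts/{sample}.txt", sample=SAMPLES)
    output:
        "counts/all_samples.tsv"
```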
You may want to implement shadow rules to keep temporary files and other output out of the working directory.
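A sketch of the directive; the rule and command are hypothetical:

```python
rule noisy_tool:
    shadow: "minimal"  # runs in a scratch dir; only declared outputs persist
    output:
        "results/{sample}.out"
    shell:
        "noisy_tool {wildcards.sample} > {output}"  # hypothetical command
```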
You should incorporate STAR's WASP mode into the pipeline.
I think GenotypeVCFs can be multi-threaded. You should attempt to implement multi-threading with as many other GATK commands as possible.
Named pipes are a great way to save space on the file system! They can also speed up the pipeline, since no intermediate I/O is performed. Let's use them with Snakemake, and make the pipeline fail gracefully if a single sample has illegitimate data.
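Newer Snakemake versions provide a pipe() output marker for exactly this; a sketch with hypothetical rules and commands:

```python
rule extract_reads:
    output:
        pipe("piped/{sample}.sam")  # a named pipe; data never hits disk
    shell:
        "extract_reads {wildcards.sample} > {output}"  # hypothetical command

rule sort_reads:
    input:
        "piped/{sample}.sam"
    output:
        "sorted/{sample}.bam"
    shell:
        "samtools sort -o {output} {input}"
```

Snakemake schedules both jobs together on the same node, since a pipe's producer and consumer must run concurrently.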
In the find_intersecting_snps step, WASP separates reads that don't overlap a SNP from those that do. Later, the reads that didn't overlap a SNP are merged back into a final BAM file with the filtered reads. However, allele specific analysis really only cares about the reads that overlap a SNP. Perhaps we can get rid of the merging step so that reads that don't overlap a SNP are discarded instead?
This would certainly help save space and speed up execution of downstream analysis, since only a small percentage of reads actually overlap a SNP in most samples.
Our pipeline can sometimes use a lot of memory (especially when running the GTEx samples). This often results in jobs getting killed because we use all the memory on the cluster. Last I checked, this was happening when I used STAR to map reads (the second time). It also occurred when I was using WASP's get_as_counts.
We can fix get_as_counts by implementing a lighter hdf5 version of it (see my fork of WASP), but I'm not sure what to do about STAR. My guess is that it has something to do with loading all of the VCF data at once, but you should generally look into how much memory each rule uses and whether there is anything you can do about it (besides just scaling down the number of jobs that Snakemake will run at any given time).
We're only supposed to have SNPs in our counts.
see https://github.com/snakemake-workflows/rna-seq-star-deseq2 for an example