aryarm / as_analysis

A complete Snakemake pipeline for detecting allele specific expression in RNA-seq

License: MIT License

Languages: Shell 11.01%, R 59.94%, Python 29.06%
Topics: snakemake, allele-specific-expression, rna-seq, atac-seq, whole-genome-sequencing

as_analysis's People

Contributors: aryarm

as_analysis's Issues

clean up repo

the following directories should be created:

  • Snakefiles
  • schemas
  • scripts

and READMEs should be created for all but the schema dir

extracting heterozygotes in rules.extract_hets

Currently, the pipeline uses grep '0|1' to extract heterozygotes from the counts. However, '1|0' heterozygotes should also be matched.

  • revise the grep command to allow both 0|1 and 1|0
  • revise the grep command to be more specific (i.e., also match the whitespace around the genotype fields), as sketched below
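A minimal sketch of a stricter pattern, assuming the counts are whitespace-delimited and the genotype sits between delimiters (rule and file names are illustrative):

    rule extract_hets:
        input: "counts/{sample}.counts.txt"
        output: "counts/{sample}.hets.txt"
        # match either phase of a heterozygote, anchored on surrounding whitespace
        shell: "grep -E '[[:blank:]](0\\|1|1\\|0)([[:blank:]]|$)' {input} > {output}"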

use snakemake validation

Snakemake can validate config files against a schema. It will be easier for people to use your pipeline if they know what its inputs are, and validation catches malformed configs early. Get on this
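A minimal sketch using snakemake.utils.validate; the schema path assumes the schemas directory proposed above:

    from snakemake.utils import validate

    configfile: "config.yaml"
    # check the loaded config against a JSON-Schema-style YAML file
    validate(config, schema="schemas/config.schema.yaml")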

consider extracting common rules to separate snakefile

Both Snakefile-counts and Snakefile-WASP need four rules for downloading and installing WASP, splitting VCFs by chromosome, and running snp2h5.
It might be cleaner and simpler to extract these rules into their own snakefile and then include it in the other Snakefiles (see the sketch below).
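Assuming the shared rules live in a file named Snakefile-common (the name is hypothetical), each pipeline pulls them in with a single directive:

    # at the top of both Snakefile-counts and Snakefile-WASP
    include: "Snakefile-common"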

LD_LIBRARY_PATH problems

We try to automatically install WASP for those who don't have it, but

  1. The conda envs must have LD_LIBRARY_PATH set inside them. Otherwise, snp2h5 won't work.
  2. Also, if the user is manually installing dependencies, you shouldn't assume that their HDF5 libraries are in the conda directory (or indeed, that conda even exists on their machine). Instead, specify this as an optional config option, as sketched after this list.
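A minimal sketch of the fallback logic, assuming a hypothetical hdf5_lib_dir config key:

    import os

    # optional override: let users point at their own HDF5 libraries,
    # falling back to the active conda env's lib directory (if any)
    HDF5_LIB_DIR = config.get(
        "hdf5_lib_dir",
        os.path.join(os.environ.get("CONDA_PREFIX", ""), "lib"),
    )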

gzip all of the things

intermediate files can take up a lot of space. currently, we do a lazy job of minimizing space usage. make it better

one example: gq_subset.table files (see the sketch below)
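One approach is to compress outputs as they are written, so the uncompressed file never hits disk. A sketch with hypothetical rule, path, and column choices:

    rule subset_gq:
        input: "gatk/{sample}.table"
        output: "gatk/{sample}.gq_subset.table.gz"
        # stream straight through gzip; downstream tools can read it with zcat
        shell: "cut -f 1-4 {input} | gzip > {output}"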

a report

snakemake offers the ability to generate and maintain reports for the user from each rule. this might be a great way to log information about resource usage (specifically, memory)
you may want to look into this eventually
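Relatedly, the benchmark directive records wall-clock time and peak memory (max RSS) for every run of a rule, and snakemake --report report.html bundles results into an HTML report. A sketch with illustrative paths:

    rule genotype:
        input: "gvcfs/all.g.vcf.gz"
        output: "calls/all.vcf.gz"
        # writes a TSV recording runtime and max RSS for each run of this rule
        benchmark: "benchmarks/genotype.tsv"
        shell: "gatk GenotypeGVCFs -R ref.fa -V {input} -O {output}"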

create config option for base score and variant score recalibration

Some users might choose to forgo GATK's recalibration, perhaps because they want quick and dirty output. Others might want to compare their results with and without recalibration. In any case, it would be nice if your pipeline offered this functionality as an option or switch (see the sketch below).
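One way to wire this up is an input function keyed on a hypothetical boolean config option, here called recalibrate (directory names are illustrative):

    def bam_for_calling(wildcards):
        # skip the recalibrated BAM when recalibration is disabled in the config
        subdir = "recal" if config.get("recalibrate", True) else "mapped"
        return f"{subdir}/{wildcards.sample}.bam"

    rule call_variants:
        input: bam_for_calling
        output: "calls/{sample}.g.vcf.gz"
        shell: "gatk HaplotypeCaller -R ref.fa -I {input} -O {output} -ERC GVCF"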

samtools multi-threading

some of the WASP and counts pipelines use samtools, which can be multi-threaded. you should add a threads directive to those rules
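A sketch of one such rule, passing the thread count through samtools' -@ flag (paths are illustrative):

    rule sort_bam:
        input: "wasp/{sample}.bam"
        output: "wasp/{sample}.sort.bam"
        threads: 4
        # -@ gives samtools additional worker threads
        shell: "samtools sort -@ {threads} -o {output} {input}"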

use wildcards for get_as_counts dna/rna

The current implementation runs both DNA and RNA counts for each sample in the same shell command, but you can use wildcards to cover both with a single rule (see the sketch below).
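A sketch using a hypothetical seqtype wildcard; the get_as_counts invocation itself is elided:

    rule get_as_counts:
        input: "wasp/{sample}.{seqtype}.bam"
        output: "counts/{sample}.{seqtype}.counts.txt"
        wildcard_constraints:
            # restrict the wildcard so unrelated files never match this rule
            seqtype="dna|rna"
        # same get_as_counts command as before; only {wildcards.seqtype} varies
        shell: "python get_as_counts.py ... {input} > {output}"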

switch to tsvs

sample files have so far been single-space delimited. that will become a problem as soon as a file path contains a space

use tab-separated sample files instead (see the sketch below)
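Reading the samples table then looks like this, assuming pandas and a hypothetical sample_file config key:

    import pandas as pd

    # tab-separated samples file; file paths containing spaces no longer break parsing
    samples = pd.read_csv(config["sample_file"], sep="\t")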

refine python code in Snakefiles

There are some global functions that get called repeatedly even though their result never changes. Shouldn't be too hard to compute the result once and store it in a global variable instead.

investigate heterozygotes in counts data

Most of the counts output by WASP are genotyped as "0|1", some are genotyped as "0|0", fewer are genotyped as "1|1", and fewer still are genotyped as "-1|-1"

What are these genotypes and why are they appearing in a file that is supposed to only represent heterozygotes? Does WASP do its own genotyping?
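A quick way to tally what is actually in a counts file (the genotype column index is an assumption):

    # count occurrences of each genotype string, assuming it sits in column 4
    cut -f 4 sample.counts.txt | sort | uniq -c | sort -rn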

download WASP for the user

The pipeline could attempt to download and install WASP if the user hasn't specified a WASP directory. This might be especially good given that all of WASP's dependencies are already listed as dependencies of ours.
It might be a nice addition. Think about it.

Another benefit is that it will prevent problems where the version of hdf5 used to compile a library is different from the version that is used to run snp2h5.
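A sketch of a download rule; the repository URL is WASP's real one, but the pinned tag is illustrative:

    rule download_wasp:
        output: directory("WASP")
        # clone a fixed tag so results stay reproducible
        shell: "git clone --branch v0.3.4 https://github.com/bmvdgeijn/WASP.git {output}"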

re-attempt failed jobs and scale resource usage

you can make the pipeline more robust to cluster mishaps using the --restart-times option in combination with a function that automatically scales a job's resource requests according to how many times it has had to be restarted
see the documentation for an explanation of how the scaling can be done, and the sketch below
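The documented pattern passes an attempt parameter to a resources callable:

    rule compress_table:
        input: "gatk/{sample}.table"
        output: "gatk/{sample}.table.gz"
        resources:
            # first try asks for 4 GB; each retry under --restart-times adds another 4 GB
            mem_mb=lambda wildcards, attempt: 4000 * attempt
        shell: "gzip -c {input} > {output}"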

running entire pipeline all at once

Perhaps Snakefile-variant_calling should have an option for running the entire pipeline, since it is much farther upstream than the other Snakefiles? You could use include directives to do this.

rna version of allele.imbalance script

  • add it to the pipeline and revise Snakefile-counts appropriately
  • add an option to run the pipeline with the rna version instead of the dna+rna version

incorporate WASP pipeline back into Snakefile

I excluded it because it isn't clear yet whether it has all the config options that it needs.
You should also decide whether you want to attempt to get it working or whether you want to use STAR's WASP mode (see issue #1)

use of samples file prevents reruns

using a separate samples file for the R portion of Snakefile-counts prevents Snakemake from detecting upstream rules that need to be rerun when the samples file already exists

incorporate R script into Snakefile

You might want to figure out how to call the script from bash first. Presumably Rscript will work, but you'll want to make sure that Anaconda's version is being called, not the system default (see the sketch below).
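A sketch of the rule; with a conda directive, the env's bin directory is prepended to PATH, so Rscript resolves to the conda version (paths are illustrative):

    rule find_imbalance:
        input: "counts/{sample}.csv"
        output: "results/{sample}.imbalance.tsv"
        conda: "envs/default.yaml"
        # resolves to the conda env's Rscript, not /usr/bin/Rscript
        shell: "Rscript scripts/find_imbalance.r {input} {output}"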

provide flexibility for specifying samples

There are a lot of inputs in the config files. Find out whether you can get rid of some of them (like config-WASP in Snakefile-counts) or get the info you need from a smaller number of files.

conda envs

You should specify environments for each rule, so that Snakemake can automatically handle them when run with --use-conda (see the sketch below).
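A sketch of a per-rule env file; the package list is illustrative:

    # envs/default.yaml
    channels:
      - conda-forge
      - bioconda
    dependencies:
      - samtools
      - r-base

Rules then point at it with conda: "envs/default.yaml", as in the Rscript sketch above.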

update default env to reflect new WASP version

WASP has updated to v0.3.x

Let's change envs/default.yaml so that the new version of WASP can be used. Also, the new update will probably allow us to get rid of envs/pytables2.yaml.

r packages

Make sure that you're actually requiring only the r packages you need in find_imbalance.r and prepare_counts.r. Delete each library() call in turn and see if anything breaks.

incorporate R package dependencies into default conda env

Unfortunately, it doesn't seem like Anaconda can reliably install all of the current R packages that are required by allele_imbalance.r. You should investigate this further and see whether you can fix this so that all dependencies can be handled by snakemake.

include index files in all rules

If a rule needs an index file, you should explicitly include it as input. You should also declare output index files, but only if another rule needs them (see the sketch below).
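A sketch with illustrative paths:

    rule depth:
        input:
            bam="mapped/{sample}.bam",
            # declare the index too, so Snakemake regenerates it when missing or stale
            bai="mapped/{sample}.bam.bai",
        output: "depth/{sample}.txt"
        shell: "samtools depth {input.bam} > {output}"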

split Snakefile

We decided it would be a good idea to split the Snakefile into multiple parts, so that it can be used more flexibly. One way is to split it up according to the FASTQ inputs. Here are the different sections:

  1. Variant Calling
  2. WASP
  3. Allelic Imbalance Detection

However, I could also split them up according to the BAM outputs:

  1. DNA BAM pre-processing
  2. Variant Calling + WASP
  3. Allelic Imbalance Detection

Which one should I do?

redevelop find_imbalance.r for multiple samples

The current implementation has us calling find_imbalance.r once per sample, which is quite inefficient.
We can take advantage of R's efficiency with large datasets by having find_imbalance.r handle all of the samples at once (distinguishing samples via a column that contains sample names). This would require converting the input arguments to a file for the script to read (see the sketch below).
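A sketch of the aggregated call, assuming a global SAMPLES list and illustrative paths:

    rule find_imbalance_all:
        input:
            expand("counts/{sample}.csv", sample=SAMPLES)
        output:
            "results/imbalance.tsv"
        run:
            # write one input path per line; the R script reads this file
            # instead of taking every sample as a command-line argument
            with open("results/input_files.txt", "w") as f:
                f.write("\n".join(input) + "\n")
            shell("Rscript scripts/find_imbalance.r results/input_files.txt {output}")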

specifying specific samples

Snakefiles are supposed to figure out the minimum number of files they need to create in order to generate a requested output file.

But some of the rules in the pipeline create a single file from a bunch of sample-specific files, like a funnel. As currently implemented, these rules always require every sample-specific file as input.
So when a downstream rule creates sample-specific files, Snakemake will always think it needs to create all upstream sample-specific files, because the funnel doesn't know which samples the user chose. In this way, the pipeline might be partly run on all samples even if only one of them is requested.

Fix the funnel rules so that they determine which inputs they need based only on which outputs are requested downstream. This may involve creating global variables in the Snakefile eeek (see the sketch below)
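One approach: drive the funnel's inputs from a global SAMPLES list that users can narrow through the config (the samples key is hypothetical, and all_samples would be parsed from the samples file):

    # default to every sample, but let a config entry restrict the set
    SAMPLES = config.get("samples", all_samples)

    rule merge_counts:
        input:
            expand("counts/{sample}.csv", sample=SAMPLES)
        output:
            "results/merged_counts.csv"
        shell: "cat {input} > {output}"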

shadow rules

you may want to implement shadow rules to keep temporary files and other output out of the working directory
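Snakemake's shadow directive runs a rule in a scratch copy of the working directory and keeps only the declared outputs; a sketch with illustrative paths:

    rule flagstat:
        input: "mapped/{sample}.bam"
        output: "qc/{sample}.flagstat.txt"
        # any stray temp files land in the shadow directory, which is discarded
        shadow: "shallow"
        shell: "samtools flagstat {input} > {output}"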

multi-threading for gatk

I think GenotypeGVCFs can be multi-threaded. You should attempt to implement multi-threading for as many other GATK commands as possible.

named pipes

named pipes are a great way to save space on the file system!
they can also speed up the pipeline, since no disk I/O is performed for the intermediate file
let's use them with snakemake (see the sketch below)
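Snakemake supports this directly: wrapping an output in pipe() turns it into a named pipe and schedules the producer and consumer together. A sketch with illustrative paths:

    rule decompress_vcf:
        input: "vcf/{chrom}.vcf.gz"
        # the intermediate VCF never touches disk
        output: pipe("vcf/{chrom}.vcf")
        shell: "zcat {input} > {output}"

    rule count_records:
        input: "vcf/{chrom}.vcf"
        output: "stats/{chrom}.count.txt"
        # reads straight from the pipe as the producer writes
        shell: "grep -vc '^#' {input} > {output}"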

modify our use of WASP snakemake pipeline to save space

In the find_intersecting_snps step, WASP separates reads that don't overlap a SNP from those that do. Later, the reads that didn't overlap a SNP are merged back with the filtered reads into a final BAM file. However, allele-specific analysis really only cares about the reads that overlap a SNP. Perhaps we can drop the merging step so that reads that don't overlap a SNP are discarded instead?
This would certainly help save space and speed up downstream analysis, since only a small percentage of reads overlap a SNP in most samples.

memory

Our pipeline can sometimes use a lot of memory (especially when running the GTEx samples). This often results in jobs getting killed because we use all the memory on the cluster. Last I checked, this was happening when I used STAR to map reads (the second time). It also occurred when I was using WASP's get_as_counts.

We can fix get_as_counts by implementing a lighter hdf5 version of it (see my fork of WASP), but I'm not sure what to do about STAR. My guess is that it has something to do with loading all of the VCF data at once, but you should generally look into how much memory each of the rules uses and whether there is anything you can do about it (besides just scaling down the number of jobs that snakemake will run at any given moment).

SNPs only

we're only supposed to have SNPs in our counts. make sure indels and other variant types get filtered out upstream (see the sketch below)
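One way to enforce this on the VCF, using bcftools' type filter (paths are illustrative):

    rule snps_only:
        input: "calls/all.vcf.gz"
        output: "calls/snps.vcf.gz"
        # keep only SNP records, dropping indels and other variant types
        shell: "bcftools view -v snps -O z -o {output} {input}"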
