aryarm / as_analysis
A complete Snakemake pipeline for detecting allele specific expression in RNA-seq
License: MIT License
STAR should be called with the --alignEndsType EndToEnd
argument to force reads to be globally aligned (instead of locally)
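A sketch of the mapping call with that flag; all other arguments here are assumptions based on a typical paired-end STAR invocation, not the pipeline's actual command:

```shell
STAR --runThreadN 8 \
     --genomeDir star_index/ \
     --readFilesIn sample_R1.fq.gz sample_R2.fq.gz \
     --readFilesCommand zcat \
     --alignEndsType EndToEnd \
     --outSAMtype BAM Unsorted
```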
The following directories should be created, and READMEs should be added for all but the schema dir.
Currently, the pipeline uses grep '0|1' to extract heterozygotes from the counts. However, grep '1|0' should also be allowed, since both 0|1 and 1|0 denote heterozygous genotypes.
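A minimal sketch of the combined filter (file contents hypothetical). With grep's default BRE syntax, '|' is a literal character, so two -e patterns cover both phasings:

```shell
# match either heterozygous phasing; '|' is literal under BRE
printf 'chr1 100 A G 0|1\nchr1 200 C T 1|0\nchr1 300 G A 0|0\n' \
  | grep -e '0|1' -e '1|0'
```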
Snakemake can validate config files against a schema. It will be easier for people to use your pipeline if they know what its inputs are. Get on this.
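A sketch of what schema validation looks like in a Snakefile; the schema path is an assumption:

```python
# at the top of the Snakefile; checks config against a YAML/JSON schema
from snakemake.utils import validate

configfile: "config.yaml"
validate(config, "schemas/config.schema.yaml")
```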
Both Snakefile-counts and Snakefile-WASP need four rules for downloading and installing WASP, splitting VCFs by chromosome, and running snp2h5. It might be cleaner and simpler to extract these rules into their own Snakefile and then include it in the other Snakefiles.
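Extracting the shared rules could look like this; the shared file's name is hypothetical:

```python
# at the top of both Snakefile-counts and Snakefile-WASP
include: "Snakefile-WASP-setup"
```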
We try to automatically install WASP for those who don't have it, but
If a user says they don't want to perform variant score recalibration, they probably still want to perform hard-filtering. Make this the alternative.
the message for this and this isn't very accurate
Sometimes it outputs a negative number because queryHits() can return duplicated hits.
Ideally, the Snakefile would also generate the required index files on the fly. Let's do that
Sometimes, intermediate files can take up a lot of space. Currently, we do a lazy job of minimizing space usage; make it better. One example: the gq_subset.table files.
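One low-effort improvement is marking such intermediates with temp() so Snakemake deletes them once every consumer has finished (rule name and path hypothetical):

```python
rule subset_gq:
    output:
        temp("gq_subset/{sample}.gq_subset.table")  # deleted after use
```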
Snakemake offers the ability to generate and maintain reports for the user from each rule. This might be a great way to log information about resource usage (specifically, memory). You may want to look into this eventually.
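Relatedly, Snakemake's benchmark directive records wall time and peak memory per job, which could feed such a report (rule name hypothetical):

```python
rule map_reads:
    benchmark:
        "benchmarks/{sample}.map_reads.tsv"  # includes max RSS per run
```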
Some users might choose to forego GATK's recalibration, perhaps because they want quick and dirty output. Others might want to compare their results with and without recalibration. In any case, it would be nice if your pipeline could offer this functionality as an option or switch.
Some of the WASP and counts pipelines use samtools, which can be multi-threaded. You should add threads parameters to those rules.
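A sketch of threading one samtools step; the paths and thread count are assumptions:

```python
rule sort_bam:
    input:
        "mapped/{sample}.bam"
    output:
        "sorted/{sample}.bam"
    threads: 8
    shell:
        "samtools sort -@ {threads} -o {output} {input}"
```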
use SnpSift and change the R script to accept it
Snakemake params can take input and output as arguments to their lambda functions. Use that to resolve them more cleanly.
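For example (rule, tool, and paths hypothetical):

```python
rule example:
    input:
        "counts/{sample}.txt"
    output:
        "results/{sample}.tsv"
    params:
        # the lambda receives wildcards plus output, so the prefix
        # is derived from the declared output rather than duplicated
        prefix=lambda wildcards, output: str(output[0]).rsplit(".", 1)[0]
    shell:
        "some_tool --prefix {params.prefix} {input}"  # hypothetical tool
```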
The current implementation runs both DNA and RNA counts for each sample in the same shell command, but you can use wildcards to create a single command that covers both.
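A sketch of the wildcard approach, assuming hypothetical paths; a constrained wildcard lets one rule serve both count types:

```python
rule get_counts:
    input:
        "align/{type}/{sample}.bam"
    output:
        "counts/{type}/{sample}.txt"
    wildcard_constraints:
        type="dna|rna"  # one rule, two instantiations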
Sample files as of now have been single-space delimited. That will become a problem when a file path contains a space. Use tab-separated sample files instead.
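A minimal sketch of tab-separated parsing; the column names and path are assumptions:

```python
import csv
import io

# a tab-separated samples table; the embedded space in the path survives
tsv = "sample\tfastq\nNA12878\t/data/my reads/NA12878_R1.fq.gz\n"
samples = {row["sample"]: row["fastq"]
           for row in csv.DictReader(io.StringIO(tsv), delimiter="\t")}
print(samples["NA12878"])  # → /data/my reads/NA12878_R1.fq.gz
```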
There's some inefficiency in how global functions are called. It shouldn't be too hard to just create a global variable for this instead.
Most of the counts output by WASP are genotyped as "0|1", some are genotyped as "0|0", fewer as "1|1", and fewer still as "-1|-1". What are these genotypes, and why are they appearing in a file that is supposed to contain only heterozygotes? Does WASP do its own genotyping?
The pipeline could attempt to download and install WASP if the user hasn't specified a WASP directory. This might be especially good given that all of WASP's dependencies are already listed as dependencies of ours. It might be a nice addition; think about it. Another benefit is that it would prevent problems where the version of hdf5 used to compile a library differs from the version used to run snp2h5.
It might be nice to have some print statements, just so one can watch the progress of the program and figure out what might be going wrong, should that ever happen.
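A minimal sketch of a progress helper; the prefix and message are assumptions:

```python
import sys

def progress(msg: str) -> str:
    # a fixed prefix makes progress lines easy to grep in cluster logs
    return f"[as_analysis] {msg}"

# write to stderr so stdout stays clean for piped data
print(progress("reading counts for sample NA12878"), file=sys.stderr)
```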
You can make the pipeline more robust to cluster mishaps using the --restart-times option in combination with a function that automatically scales resource requests depending on how often a job has had to be restarted. See the documentation for an explanation of how the scaling can be done.
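The documented pattern scales a resource with the attempt number, which --restart-times increments on each retry (rule name and values hypothetical):

```python
rule map_reads:
    resources:
        # 8 GB on the first try, 16 GB after one restart, and so on
        mem_mb=lambda wildcards, attempt: 8000 * attempt
```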
Perhaps Snakefile-variant_calling should have an option for running the entire pipeline, since it is much farther upstream than the other Snakefiles? You could use includes to do this. Add it to the pipeline and revise Snakefile-counts appropriately.
Add an option to run the pipeline with the RNA-only version instead of the DNA+RNA version.
for all of the Snakefiles. It should be 5.1.4
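If 5.1.4 refers to the minimum required Snakemake version (an assumption), it can be pinned at the top of each Snakefile:

```python
from snakemake.utils import min_version

min_version("5.1.4")  # fail early on older Snakemake
```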
I excluded it because it isn't clear yet whether it has all the config options that it needs. You should also decide whether you want to attempt to get it working or whether you want to use STAR's WASP mode (see issue #1).
Use of a separate samples file for the R portion of Snakefile-counts will prevent detection of rules that need to be rerun before it (if it already exists). You might want to figure out how to call the script from bash first. Presumably Rscript will work, but you'll want to make sure that Anaconda's version is being called, not the system default.
There are a lot of inputs in the config files. Find out whether you can get rid of some of them (like config-WASP in Snakefile-counts) or get the info you need from a smaller number of files.
You should specify environments for each rule, so that Snakemake can automatically handle them.
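A sketch of a per-rule environment, run with `snakemake --use-conda`; the env file and paths are hypothetical:

```python
rule find_imbalance:
    input:
        "results/counts.tsv"
    output:
        "results/imbalance.tsv"
    conda:
        "envs/r.yaml"  # hypothetical env file pinning R and its packages
    shell:
        "Rscript scripts/find_imbalance.r {input} {output}"
```

This also addresses the Rscript concern above: with --use-conda, the env's Rscript is on PATH for the rule, not the system default.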
WASP has updated to v0.3.x. Let's change envs/default.yaml so that the new version of WASP can be used. The new update will probably also allow us to get rid of envs/pytables2.yaml.
Make sure that you're actually requiring only the R packages you need in find_imbalance.r and prepare_counts.r. Delete each call to library() and see if anything breaks. This is to resolve the issues we were seeing in #26.
Unfortunately, it doesn't seem like Anaconda can reliably install all of the R packages currently required by allele_imbalance.r. You should investigate this further and see whether you can fix it so that all dependencies can be handled by Snakemake.
If a rule needs an index file, you should explicitly include it as an input. You should also include output index files, but only if another rule needs them.
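For example (rule, command, and paths hypothetical):

```python
rule get_as_counts:
    input:
        bam="sorted/{sample}.bam",
        bai="sorted/{sample}.bam.bai",  # index declared explicitly
    output:
        "counts/{sample}.txt"
    shell:
        "get_counts {input.bam} > {output}"  # hypothetical command
```

Declaring the .bai means Snakemake reruns this rule if the index is missing or stale, instead of failing mid-job.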
We decided it would be a good idea to split the Snakefile into multiple parts, so that it could be used more flexibly. That's why I thought it might be a good idea to split them up according to the FASTQ inputs. Here are the different sections:
However, I could also split them up according to the BAM outputs:
Which one should I do?
The current implementation calls find_imbalance.r once for each sample, which is quite inefficient. We can take advantage of R's efficiency with large data sets by having find_imbalance.r handle all of the samples at once (simply distinguishing samples using a column that contains sample names). This would require converting the input arguments to a file for the script to read.
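A sketch of writing that arguments file from Python; the column names, paths, and filename are assumptions:

```python
import csv

# one table of per-sample arguments for find_imbalance.r to read
# (column names are assumptions)
rows = [
    {"sample": "S1", "counts": "counts/S1.txt"},
    {"sample": "S2", "counts": "counts/S2.txt"},
]
with open("imbalance_args.tsv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["sample", "counts"],
                            delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)
```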
Snakefiles are supposed to figure out the minimum number of files they need to create in order to generate a requested output file. But some of the rules in the pipeline create a single file from a bunch of sample-specific files, like a funnel. As currently implemented, these rules necessarily require all sample-specific files as input. So when a downstream rule creates sample-specific files, Snakemake will always think it needs to create all upstream sample-specific files, because the funnel doesn't know which samples the user chose. In this way, the pipeline might be partly run on all samples even if only one of them is requested. Fix the funnel rules so that they determine which inputs they need based only on which outputs are requested downstream. This may involve creating global variables in the Snakefile.
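One sketch of the global-variable approach; the config key and paths are assumptions:

```python
# derive the sample list once, from config, so funnel rules only ask
# for the samples that were actually requested
SAMPLES = list(config["samples"])

rule combine_counts:
    input:
        expand("counts/{sample}.txt", sample=SAMPLES)
    output:
        "counts/all_samples.tsv"
```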
You may want to implement shadow rules to keep temporary files and other output out of the working directory.
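A sketch of the directive; the rule and command are hypothetical:

```python
rule noisy_tool:
    shadow: "minimal"  # runs in a scratch dir; only declared outputs persist
    output:
        "results/{sample}.out"
    shell:
        "noisy_tool {wildcards.sample} > {output}"  # hypothetical command
```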
You should incorporate STAR's WASP mode into the pipeline.
I think GenotypeVCFs can be multi-threaded. You should attempt to implement multi-threading with as many other GATK commands as possible.
Named pipes are a great way to save space on the file system! They can also speed up the pipeline, since no intermediate I/O is performed. Let's use them with Snakemake, and make the pipeline fail gracefully if a single sample has illegitimate data.
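Newer Snakemake versions provide a pipe() output marker for exactly this; a sketch with hypothetical rules and commands:

```python
rule extract_reads:
    output:
        pipe("piped/{sample}.sam")  # a named pipe; data never hits disk
    shell:
        "extract_reads {wildcards.sample} > {output}"  # hypothetical command

rule sort_reads:
    input:
        "piped/{sample}.sam"
    output:
        "sorted/{sample}.bam"
    shell:
        "samtools sort -o {output} {input}"
```

Snakemake schedules both jobs together on the same node, since a pipe's producer and consumer must run concurrently.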
In the find_intersecting_snps step, WASP separates reads that don't overlap a SNP from those that do. Later, the reads that didn't overlap a SNP are merged back into a final BAM file with the filtered reads. However, allele specific analysis really only cares about the reads that overlap a SNP. Perhaps we can get rid of the merging step so that reads that don't overlap a SNP are discarded instead?
This would certainly help save space and speed up execution of downstream analysis, since only a small percentage of reads actually overlap a SNP in most samples.
Our pipeline can sometimes use a lot of memory (especially when running the GTEx samples). This often results in jobs getting killed because we use all the memory on the cluster. Last I checked, this was happening when I used STAR to map reads (the second time). It also occurred when I was using WASP's get_as_counts.
We can fix get_as_counts by implementing a lighter hdf5 version of it (see my fork of WASP), but I'm not sure what to do about STAR. My guess is that it has something to do with loading all of the VCF data at once, but you should generally look into how much memory each rule uses and whether there is anything you can do about it (besides just scaling down the number of jobs that Snakemake will run at any given time).
We're only supposed to have SNPs in our counts.
see https://github.com/snakemake-workflows/rna-seq-star-deseq2 for an example