jsa-aerial / aerobio



Extensible full DAG streaming computation server with services and jobs for RNA-Seq, Tn-Seq, WG-Seq and Term-Seq.

License: MIT License

Clojure 59.26% R 3.19% JavaScript 0.04% Ruby 6.90% Perl 2.01% Python 28.60%

clojure genome-sequencing pipeline-framework pipelines rna-seq streaming-data term-seq tn-seq wg-seq


aerobio's Issues

Completion emails (any?) should always include relevant EID

From Stephen:

I am running some more samples for Eddie, and one feature I would like to request is that the email saying a phase was successful should include the EID. I'm running some of the plasmids in parallel and it's not clear from the email which sample is which.

Generate experiment sheets?

Is it plausible to do a good job of generating the various experiment description sheets? See #4 as supporting information.

Split off flow graph to separate lib

pgmgraph and friends should be a separate library! Additionally, the tool and job db operations (create, delete, update) should be part of it and removed from the server (which is a crazy place for them).

Flow graph should be rewritten using Specter

While issue #2 covers the need to split this off into a separate lib, which would then be a dependency, this issue marks the need to rewrite the (rather ugly) custom graph rewriting code using Specter. Specter has vastly cleaner general-purpose rewrite code for all manner of nested data structures, including graphs, trees, and maps. This is exactly what the flow graph recursive rewriting code needs.

Validation for missing replicates in Tn-Seq comparisons

Per page 9 of the cookbook, this needs to be accounted for by including missing expansion factor fields. The validator should check for this and report an error if it is not done.

WGseq phase-2 flow needs better parallelism

The simplest way would be to put a strm-take between get-comparison-files and breseq-runs, then stagger-launch the given number to spread the alignments out.
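A minimal sketch of the batching idea, in Python rather than the server's Clojure. It assumes (the issue does not say) that strm-take would hand downstream nodes a fixed number of items at a time; `take_batches` is a hypothetical helper, not an Aerobio function.

```python
from itertools import islice

def take_batches(items, n):
    """Yield successive lists of at most n items from any iterable."""
    if n <= 0:
        raise ValueError("batch size must be positive")
    it = iter(items)
    while True:
        batch = list(islice(it, n))
        if not batch:
            return
        yield batch

# Hypothetical comparison files; each batch would be launched together,
# and the next batch only once the previous one has been started or finished.
comparison_files = ["cmp1.fq", "cmp2.fq", "cmp3.fq", "cmp4.fq", "cmp5.fq"]
for batch in take_batches(comparison_files, 2):
    print(batch)
```

Whether the next batch waits for completion or only for a delay is a design choice the issue leaves open.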

Change status progress report indication

Currently, status reports show in-progress output as [done: [list of nodes that have finished]]. The 'done' label is confusing; 'run progress' would be more accurate.

Validator misses sample name/id check

Sample names and IDs in the SampleSheet [Data] section must not contain underscores (_), because the conversion software (bcl2fastq and bcl-convert) uses them as the delimiter that demarcates sequencer conversion information, NOT sample names/IDs. Underscores there can result in 'missing input fastqs' during experiment barcode demultiplexing.
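One way such a check could look, as a Python sketch only. It assumes the [Data] section starts with a header row and that the relevant columns are called `Sample_ID` and `Sample_Name`; `find_underscore_ids` is a hypothetical name, not part of the existing validator.

```python
import csv
import io

def find_underscore_ids(samplesheet_text, columns=("Sample_ID", "Sample_Name")):
    """Return (row_number, column, value) for each [Data] entry containing '_'."""
    problems = []
    in_data = False
    header = None
    for rowno, row in enumerate(csv.reader(io.StringIO(samplesheet_text)), start=1):
        if not any(cell.strip() for cell in row):
            continue  # skip blank rows
        first = row[0].strip()
        if first.startswith("["):
            # A new section begins; only [Data] is of interest here.
            in_data = (first == "[Data]")
            header = None
            continue
        if not in_data:
            continue
        if header is None:
            header = [cell.strip() for cell in row]  # first [Data] row is the header
            continue
        for col in columns:
            if col in header:
                idx = header.index(col)
                if idx < len(row) and "_" in row[idx]:
                    problems.append((rowno, col, row[idx].strip()))
    return problems
```

An empty result means no offending names or IDs were found; otherwise each tuple points at the row and column to fix.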

calc_fitness and aggregate for TNSeq should be rewritten

Title pretty much says it all. This is for both performance and maintenance reasons.

Need RErun command which resets data areas and experiment DB information

Need to build a specification verification system for all experiment sheets

Make `replicates` the default; add new `combined` modifier

Running with replicates enabled is what is needed most often. Combination runs are less likely to be used or desired, though they can still happen. So, change the default behavior to be replicate oriented, while still allowing an explicit `replicates` modifier for backward compatibility and adding a new `combined` modifier to indicate that a combination run is desired.

Replicate IDs - allow more than a single character

It looks like replicate IDs can only be a single character, which limits the number of possible replicates in an experiment. For large (e.g. human population) studies there might be hundreds of samples in one group, so it would be nice to have replicate IDs that can be longer than a single character. A workaround for now is to do things in batches of 26 and name the replicates a-z.
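If multi-character IDs were allowed, one possible naming scheme (purely an illustration, not something the issue specifies) would extend the a-z workaround spreadsheet-style: a..z, then aa, ab, and so on. A Python sketch, with `replicate_ids` as a hypothetical helper:

```python
from itertools import count, islice, product
from string import ascii_lowercase

def replicate_ids():
    """Yield a, b, ..., z, aa, ab, ... without end."""
    for width in count(1):
        for letters in product(ascii_lowercase, repeat=width):
            yield "".join(letters)

# Enough IDs for a group of 300 samples.
ids = list(islice(replicate_ids(), 300))
print(ids[:3], ids[25:28], len(ids))
```

The first 26 IDs match the current single-character names, so existing sheets would remain valid under this scheme.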

config.clj file missing?

Hi, I'm trying to see if I can install the package on my MacBook Pro. It looks like config.clj is missing. Do you have it, or is it not necessary for running? Thank you!

Use GTFs in TnSeq analysis

calc_fitness and aggregate both use gbk parsing to determine annotations but this has two problems:

- It introduces a dependency on bioperl and biopython, which nothing else in the programs uses
- Neither of them properly parses gbks with multiple locus entries, say for a whole genome and some associated plasmids

Using GTFs:

- eliminates these dependencies, making installation simpler
- simplifies the 'parse': basically it is just a csv read and pick fields
- makes it easy to create GTFs with multiple locus entries (the 'chromosome' field) from multiple gbks
- lets gbks be kept simple: a single locus per gbk
- makes runs involving a strain with a whole genome and associated plasmids simple to accommodate
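To illustrate the "csv read and pick fields" point, here is a minimal Python sketch that relies only on the standard nine tab-separated GTF columns. The function names and the choice to key on `gene_id` are assumptions for illustration; this is not the existing calc_fitness or aggregate code.

```python
import csv
import io

GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attribute"]

def parse_attributes(text):
    """Split 'key "value"; key "value";' into a dict (no ';' inside values)."""
    attrs = {}
    for part in text.split(";"):
        part = part.strip()
        if part:
            key, _, value = part.partition(" ")
            attrs[key] = value.strip().strip('"')
    return attrs

def read_gtf(handle, feature="gene"):
    """Return one dict per matching feature, keeping the locus (seqname)."""
    records = []
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 9 or row[0].startswith("#"):
            continue  # skip comments and malformed lines
        rec = dict(zip(GTF_COLUMNS, row))
        if rec["feature"] != feature:
            continue
        records.append({
            "locus": rec["seqname"],  # chromosome or plasmid name
            "start": int(rec["start"]),
            "end": int(rec["end"]),
            "strand": rec["strand"],
            "gene_id": parse_attributes(rec["attribute"]).get("gene_id"),
        })
    return records
```

Because every record carries its own locus, a genome plus its plasmids can live in one GTF and be grouped afterwards by the `locus` value.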

Flow for making STAR indices from reference genome files

It would be great to also have the option of providing either fasta+gtf or a genbank file (which would have to be converted to fasta/gtf first).

Validate no repeating replicates

Make sure that no replicate names repeat in Exp-SampleSheet.csv. This applies to RNASeq, TNSeq and TermSeq.
Make sure that SampleName/IDs do not repeat in SampleSheet. bcl2fastq will catch this, but it has no good way of relaying the error to Aerobio.
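Both checks reduce to finding repeated values in one column. A Python sketch under loud assumptions: the sheet is passed as CSV text with a header row, and the column names used below (`replicate`) are hypothetical, since the issue does not describe the actual layout of Exp-SampleSheet.csv.

```python
import csv
import io
from collections import Counter

def duplicated_values(csv_text, column):
    """Return the sorted values that occur more than once in the given column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(
        row[column].strip() for row in reader if row.get(column)
    )
    return sorted(value for value, n in counts.items() if n > 1)

# Hypothetical sheet: replicate name 'T1-a' appears twice.
sheet = "replicate,barcode\nT1-a,ACGT\nT1-b,TTGA\nT1-a,GGCC\n"
print(duplicated_values(sheet, "replicate"))
```

Running this before conversion would let Aerobio report the duplicates itself instead of depending on bcl2fastq to surface them.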