The tidysq from biogenies

Set LenSq as an alias for R_xlen_t

It's because length(sq) should be compatible with R_xlen_t.

Recreate sqapply and Operation interface

Add human readable alias and blossum matrix

What are suitable human-readable names?

anyNA() for sq class

While it's possible to write any(is.na(sequence)), anyNA(sequence) would be cleaner and possibly faster.

Add more iterators and improve existing

Sq::iterator
ProtoSq::iterator
ProtoSequence::iterator
Alphabet::iterator (it would be better if it was iterator over Letters, not over pairs)

sessionInfo for tidysq.options

constructing unt sq with ignore_case = TRUE doesn't ignore case enough

Describe the bug
Using alphabet = "unt" with ignore_case = TRUE does not ignore case completely, instead, ignores case for those lowercase letters, that appear as uppercase as well. If no uppercase equivalent is present, letter is treated as NA.

To Reproduce

# simply run
sq(c("oXYOqwwKCNJLo"), alphabet = "unt", ignore_case = TRUE)

Expected behavior
Above call should return sq object of type unt with all letters as uppercase, with no NA, that is, !.

Implement summary.sq

For a long time, summary.sq() was simply a call to summary.default(). We can do so much better and provide a valuable insight into sq objects. However, first and foremost, we have to decide what should be included in this summary.

Add subclass of OperationVectorToVector that supports ResultWrapper

Refactor sq and read_fasta, extract common code and make more compact functions

Add safe mode for atp alphabet when constructing sq and reading_fasta

Replace import_from_R and export_to_R with Rcpp::wrap and Rcpp::as

Create generified vector wrapper template

for example - we should be able to use VectorWrappper instead of std::vector and Rcpp::LogicalVector respectively

Allow random_sq to use various distributions to generate lengths

Investigate memory copying when using Proxy

Extract common code between find_motifs and %has% operator

Right now there are repeated lines of code like this one:

y <- lapply(y, function(s) replace(s, s == "D", "[DATG]"))

It would make sense to group them in one method (or three methods, each one for separate class) instead of having them in find_motifs() and %has% simultaneously.

Internal generic lengths.sq does not dispatch properly

Calling lengths(sq) results in calling default implementation of lengths instead of method lengths.sq. lengths is not a common generic function, but an internal generic, which may be the reason why it fails to work correctly.

Stop loading pkg::tibble upon loading tidysq

Add paste functionality to sq class

The idea is that the user might want to paste some sequences together. I imagine it would be used like that:

sq_dna_1 <- sq(c("CTTCGCCA", "CGATCTTG"), "dna_bsc")
sq_dna_2 <- sq(c("ATTGC", "TCACC"), "dna_bsc")
paste(sq_dna_1, sq_dna_2)
# above would be identical to
sq(c("CTTCGCCAATTGC", "CGATCTTGTCACC"), "dna_bsc")

We could also have sep and collapse arguments implemented as well.

If it would make it easier for bioinformaticians to use it, we could extract specific function called collapse to operate on single sq object:

sq_dna <- sq(c("CTTCGCCA", "CGATCTTG"), "dna_bsc")
collapse(sq_dna)
# above would be identical to
sq("CTTCGCCACGATCTTG", "dna_bsc")

I dare even say that the second functionality would be easier to implement, as no type compatibility checking would be necessary.

Add Sequence::(const_)reverse_iterator

Actually Sequence::const_reverse_iterator would be used in reverse(). Right now it has to be iterated over manually. Also, it's just a useful thing to have.

Add construct_sq_[unt/atp] as a convenience shortcut

There are currently methods called construct_sq_ami() and construct_sq_nuc() – the latter scheduled to be replaced with construct_sq_dna() and construct_sq_rna() – that allow a subset of parameters, making their use easier and less confusing. I'd like to see similar methods for unt and atp types, expecting something like:

construct_sq_unt(sq)
construct_sq_atp(sq, non_standard) or construct_sq_atp(sq, alphabet)

Add safe_mode to import_sq()

We should implement safe_mode for all input functions. We have it done or scheduled for sq() and read_fasta(), so only import_sq() is left unsafe.

Can we call method of base class directly?

In some of the operations, we have to override the base class method ELEM_OUT operator(ELEM_IN), even though it looks exactly the same as in the base class. Furthermore, when we want to use the base class method initilaize_element_out we have to specify superclass (e.g. OperationSqToSq<INTERNAL_IN, INTERNAL_OUT>::initialize_element_out(sequence_in) in complement.h even though it is not overridden.

Choose unified color palette and create logo

Unique visual identification is very important if we want to be recognizable ;) Color palette can be used for vignettes, presentations etc.

Extract common code from FastaSampler and obtain_alphabet

Upgrade random to use seed and random library

Make tests for find_motifs more organized

Function suggestion: complement()

Like so?

complement = function(x){
  tbl = c('A', 'C', 'G', 'T')
  names(tbl) = c('T', 'G', 'C', 'A')
  x_clt = sapply(X = strsplit(x = x, split = ''),
                 FUN = function(x_i){ paste0(tbl[x_i], collapse = '') })
  return(x_clt)
}

Faster implementation of functions

Performance of most of the functions can still be improved by moving their work to C++. Below we will keep list of them:

Important (improvement gain is relatively big in comparison to workload):

%has% operator,
== operator (it uses as.character -- that can be inefficient; it doesn't even need to be written in C++).
encode, typify, substitute_letters (checking for presence of unspecified letters in R is way too slow)
encsq_to_list
write_fasta
reverse

More trickier (they require quite a lot work):

bite (it could unpack sequences "intelligently" ),
clean
complement

Implement bite for subsequences

Whenever we call bite() with indices like 7:32, we access every element separately, that is, 7, 8, 9... Computing all these bit and byte indices and shifting three/five bits at once is very inefficient. It would be better to have dedicated function that interprets these subsequences and computes shifting indices only for the first and the last of the passed indices (i.e. 7 and 32 in this example).

Allow using c() on mixed clean/unclean sequences

As of now, c.sq() checks for the identity of alphabets. However, concatenating clean and unclean sequences of one type (say, "nuc" type) is a plausible use case. While I understand that they might (and probably do!) use different encodings, it would make sense to include such possibility.

Extract common code from Sequence and ProtoSequence, Sq and ProtoSq

Improve Sequence::iterator

Every time element is accessed bit shifting is performed despite the fact that when accessing is sequential caching may be used and result in efficiency improvement

Move type mappings from TypeMapper to classes responsible for internal types

More functionality to c() function

Should c function generalize type of combined sq objects?

Replace temporary exceptions with enable_if etc.

Add "original_length" attribute to each sequence in sq

Such a solution would make some functions faster (e.g. print)

Move internal structures to separate files

Add more convenient aliases for types for interpret_type

Thoughts on PepTools

So, my original thought with PepTools, was a small super light weight, non-dependent (I.e. only base code) toolbox for working with peptide data (which is what we do in the group). E.g.

Create random peptides drawing from different backgrounds
Translate between 1-3-full AA names
Extend a set of peptides of different lengths, to have same length using X
Create a set of mutant peptides from a wild type peptide
Encode for machine learning using different schemes one-hot, BLOSUM, atchley-factors, BLOSUM_pca, etc.
...and alike

At the same time, I wanted to use it in my teaching ("Immunological Bioinformatics" and "R for Bio Data Science")

Some of the functions would be simple wrappers, primarily to match the terminology of bioinformatics, e.g.

PepTools2::pep_split
function(pep){
  # Check input
  pep_check(pep = pep)
  # Convert to matrix
  # do.call applies a function to the list returned from args
  # so rbind to form matrix each of the elements in the list returned
  # by strsplit
  return( do.call(what = rbind, args = strsplit(x = pep, split = '')) )
}

and then also include standard data, like the PepTools2::BLOSUM62 and PepTools2::BLOSUM50, natural background frequencies PepTools2::BGFREQS and example peptides PepTools2::PEPTIDES. Furthermore, the ggseqlogo package is quite nice, but it only support simple shannon entropy based logos, which is sub-optimal compared to Kullback-Leibler logos. So basically, I wanted to extend with the ability to compute PSSMs to match the functionality of Seq2Logo, these matrices could then be visualised using the custom functionality of ggseqlogo. Lastly, my intention was to name all functions using the prefix pep_

Thinking about it, perhaps, we should make the PepTools package as a separate package, but still as a sub-part of tidysq? A bit like ggplot2 is a part of tidyverse?

I'm interested in your thoughts? 👍

Create flow diagram for construct_sq()

Currently (as of 7 VII 2020) method construct_sq() has four parameters and their usage varies greatly depending on other parameters. All cases are described in documentation, but even clearest of lists have limited clarity. Thus I'd recommend creating a flow diagram that illustrates all possible (and impossible) parameter combinations. It could be used within vignettes, Github readme and possibly cheatsheet.

Add warning handling on C++ side where possible

Suitable parameter should be passed to the function on C++ side

Add support for relative paths for read and write fasta

Citations on reproducible computational biology research

Tidy Data. Hadley Wickham (2014)
A Guide to Reproducible Code in Ecology and Evolution. Cooper et al. (2017)
Our path to better science in less time using open data science tools. Lowndes et al. (2017)
A Quick Guide to Organizing Computational Biology Projects. Noble (2009)
Good enough practices in scientific computing. Wilson et al. (2017)
Practical Data Science for Stats - a PeerJ Collection

I'll add more...

Change parameters type in integration layer to reduce number of casting

i.e. replace util::convert_to_scalar(Rcpp::IntegerVector) with just int

Use roxygen templates for common documentation parameters

Override "head" for sq so that it prints original length of sq object

Problem description
I think head.sq() would be more descriptive if it printed the length of original (not-headed) sq object.

Proposed solution
Override head.sq() so that it sets an attribute with original length value (maybe name it original_length and rename original_length attribute of single sequence to seq_length or something like that?).

Additional notes
Actually the same request works for tail.sq() as well.

Discussion on compare operator `==`

There are two or three things that came across my mind when reading the code and I'd like to discuss them.
First, should we be able to compare, say, "dnasq" and "rnasq" objects? If so, should they return TRUE whenever they match? What about "dnasq" and "amisq" or "untsq", where letters may match, but may have different meanings?
Another question is more of an improvement: even if we compare two sq objects, they are still both coerced to character vectors. I think it could be done better, like, first compare alphabets and then simply compare raw vectors of the sq objects.

biogenies / tidysq Goto Github PK

tidysq's People

Contributors

Stargazers

Watchers

Forkers

tidysq's Issues

Recommend Projects

Recommend Topics

Recommend Org