biogenies / tidysq Goto Github PK
View Code? Open in Web Editor NEWtidy processing of biological sequences in R
Home Page: https://BioGenies.github.io/tidysq/
tidy processing of biological sequences in R
Home Page: https://BioGenies.github.io/tidysq/
It's because length(sq)
should be compatible with R_xlen_t
.
What are suitable human-readable names?
While it's possible to write any(is.na(sequence))
, anyNA(sequence)
would be cleaner and possibly faster.
Describe the bug
Using alphabet = "unt"
with ignore_case = TRUE
does not ignore case completely, instead, ignores case for those lowercase letters, that appear as uppercase as well. If no uppercase equivalent is present, letter is treated as NA
.
To Reproduce
# simply run
sq(c("oXYOqwwKCNJLo"), alphabet = "unt", ignore_case = TRUE)
Expected behavior
Above call should return sq
object of type unt
with all letters as uppercase, with no NA
, that is, !
.
For a long time, summary.sq()
was simply a call to summary.default()
. We can do so much better and provide a valuable insight into sq
objects. However, first and foremost, we have to decide what should be included in this summary.
for example - we should be able to use VectorWrappper instead of std::vector and Rcpp::LogicalVector respectively
Right now there are repeated lines of code like this one:
y <- lapply(y, function(s) replace(s, s == "D", "[DATG]"))
It would make sense to group them in one method (or three methods, each one for separate class) instead of having them in find_motifs()
and %has%
simultaneously.
Calling lengths(sq)
results in calling default implementation of lengths
instead of method lengths.sq
. lengths
is not a common generic function, but an internal generic, which may be the reason why it fails to work correctly.
The idea is that the user might want to paste some sequences together. I imagine it would be used like that:
sq_dna_1 <- sq(c("CTTCGCCA", "CGATCTTG"), "dna_bsc")
sq_dna_2 <- sq(c("ATTGC", "TCACC"), "dna_bsc")
paste(sq_dna_1, sq_dna_2)
# above would be identical to
sq(c("CTTCGCCAATTGC", "CGATCTTGTCACC"), "dna_bsc")
We could also have sep
and collapse
arguments implemented as well.
If it would make it easier for bioinformaticians to use it, we could extract specific function called collapse
to operate on single sq
object:
sq_dna <- sq(c("CTTCGCCA", "CGATCTTG"), "dna_bsc")
collapse(sq_dna)
# above would be identical to
sq("CTTCGCCACGATCTTG", "dna_bsc")
I dare even say that the second functionality would be easier to implement, as no type compatibility checking would be necessary.
Actually Sequence::const_reverse_iterator
would be used in reverse()
. Right now it has to be iterated over manually. Also, it's just a useful thing to have.
There are currently methods called construct_sq_ami()
and construct_sq_nuc()
โ the latter scheduled to be replaced with construct_sq_dna()
and construct_sq_rna()
โ that allow a subset of parameters, making their use easier and less confusing. I'd like to see similar methods for unt
and atp
types, expecting something like:
construct_sq_unt(sq)
construct_sq_atp(sq, non_standard)
or construct_sq_atp(sq, alphabet)
We should implement safe_mode
for all input functions. We have it done or scheduled for sq()
and read_fasta()
, so only import_sq()
is left unsafe.
In some of the operations, we have to override the base class method ELEM_OUT operator(ELEM_IN)
, even though it looks exactly the same as in the base class. Furthermore, when we want to use the base class method initilaize_element_out we have to specify superclass (e.g. OperationSqToSq<INTERNAL_IN, INTERNAL_OUT>::initialize_element_out(sequence_in)
in complement.h
even though it is not overridden.
Unique visual identification is very important if we want to be recognizable ;) Color palette can be used for vignettes, presentations etc.
Like so?
complement = function(x){
tbl = c('A', 'C', 'G', 'T')
names(tbl) = c('T', 'G', 'C', 'A')
x_clt = sapply(X = strsplit(x = x, split = ''),
FUN = function(x_i){ paste0(tbl[x_i], collapse = '') })
return(x_clt)
}
Performance of most of the functions can still be improved by moving their work to C++. Below we will keep list of them:
Important (improvement gain is relatively big in comparison to workload):
%has%
operator,==
operator (it uses as.character
-- that can be inefficient; it doesn't even need to be written in C++).encode
, typify
, substitute_letters
(checking for presence of unspecified letters in R is way too slow)encsq_to_list
write_fasta
reverse
More trickier (they require quite a lot work):
bite
(it could unpack sequences "intelligently" ),clean
complement
Whenever we call bite()
with indices like 7:32
, we access every element separately, that is, 7
, 8
, 9
... Computing all these bit and byte indices and shifting three/five bits at once is very inefficient. It would be better to have dedicated function that interprets these subsequences and computes shifting indices only for the first and the last of the passed indices (i.e. 7
and 32
in this example).
As of now, c.sq()
checks for the identity of alphabets. However, concatenating clean and unclean sequences of one type (say, "nuc"
type) is a plausible use case. While I understand that they might (and probably do!) use different encodings, it would make sense to include such possibility.
Every time element is accessed bit shifting is performed despite the fact that when accessing is sequential caching may be used and result in efficiency improvement
Should c
function generalize type of combined sq
objects?
Such a solution would make some functions faster (e.g. print)
So, my original thought with PepTools
, was a small super light weight, non-dependent (I.e. only base
code) toolbox for working with peptide data (which is what we do in the group). E.g.
X
one-hot
, BLOSUM
, atchley-factors
, BLOSUM_pca
, etc.At the same time, I wanted to use it in my teaching ("Immunological Bioinformatics" and "R for Bio Data Science")
Some of the functions would be simple wrappers, primarily to match the terminology of bioinformatics, e.g.
PepTools2::pep_split
function(pep){
# Check input
pep_check(pep = pep)
# Convert to matrix
# do.call applies a function to the list returned from args
# so rbind to form matrix each of the elements in the list returned
# by strsplit
return( do.call(what = rbind, args = strsplit(x = pep, split = '')) )
}
and then also include standard data, like the PepTools2::BLOSUM62
and PepTools2::BLOSUM50
, natural background frequencies PepTools2::BGFREQS
and example peptides PepTools2::PEPTIDES
. Furthermore, the ggseqlogo
package is quite nice, but it only support simple shannon entropy based logos, which is sub-optimal compared to Kullback-Leibler logos. So basically, I wanted to extend with the ability to compute PSSMs to match the functionality of Seq2Logo, these matrices could then be visualised using the custom
functionality of ggseqlogo
. Lastly, my intention was to name all functions using the prefix pep_
Thinking about it, perhaps, we should make the PepTools package as a separate package, but still as a sub-part of tidysq
? A bit like ggplot2
is a part of tidyverse
?
I'm interested in your thoughts? ๐
Currently (as of 7 VII 2020) method construct_sq()
has four parameters and their usage varies greatly depending on other parameters. All cases are described in documentation, but even clearest of lists have limited clarity. Thus I'd recommend creating a flow diagram that illustrates all possible (and impossible) parameter combinations. It could be used within vignettes, Github readme and possibly cheatsheet.
Suitable parameter should be passed to the function on C++ side
Tidy Data. Hadley Wickham (2014)
A Guide to Reproducible Code in Ecology and Evolution. Cooper et al. (2017)
Our path to better science in less time using open data science tools. Lowndes et al. (2017)
A Quick Guide to Organizing Computational Biology Projects. Noble (2009)
Good enough practices in scientific computing. Wilson et al. (2017)
Practical Data Science for Stats - a PeerJ Collection
I'll add more...
i.e. replace util::convert_to_scalar(Rcpp::IntegerVector) with just int
Problem description
I think head.sq()
would be more descriptive if it printed the length of original (not-headed) sq
object.
Proposed solution
Override head.sq()
so that it sets an attribute with original length value (maybe name it original_length
and rename original_length
attribute of single sequence to seq_length
or something like that?).
Additional notes
Actually the same request works for tail.sq()
as well.
There are two or three things that came across my mind when reading the code and I'd like to discuss them.
First, should we be able to compare, say, "dnasq"
and "rnasq"
objects? If so, should they return TRUE
whenever they match? What about "dnasq"
and "amisq"
or "untsq"
, where letters may match, but may have different meanings?
Another question is more of an improvement: even if we compare two sq
objects, they are still both coerced to character vectors. I think it could be done better, like, first compare alphabets and then simply compare raw vectors of the sq
objects.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.