
biogenies / tidysq


tidy processing of biological sequences in R

Home Page: https://BioGenies.github.io/tidysq/

R 47.85% C++ 51.97% C 0.18%
biological-sequences rstats bioinformatics tidyverse r tidy bioconductor fasta sequences tibble

tidysq's People

Contributors

devsjr, dominikrafacz, erdaradungaztea, fpietluch, jarochi, ksidorczuk, michbur, slowikj, werpuc


tidysq's Issues

anyNA() for sq class

While it's possible to write any(is.na(sequence)), anyNA(sequence) would be cleaner and possibly faster.
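A minimal sketch of what such a method could look like, written against a toy class rather than the real sq class. The class name toy_sq and its list-of-letters layout are assumptions made for illustration only; tidysq stores sequences differently.

```r
# Toy stand-in for an sq object: a list of character vectors, one per sequence.
# "toy_sq" is a hypothetical class, not part of tidysq.
new_toy_sq <- function(letters_list) structure(letters_list, class = "toy_sq")

# anyNA() is an internal generic, so an S3 method is picked up for classed objects.
# The method stops at the per-sequence level and never builds a full is.na() result.
anyNA.toy_sq <- function(x, recursive = FALSE) {
  any(vapply(unclass(x), anyNA, logical(1)))
}

with_na    <- new_toy_sq(list(c("A", "C", NA, "T"), c("G", "G")))
without_na <- new_toy_sq(list(c("A", "C"), c("G", "G")))

anyNA(with_na)
anyNA(without_na)
```

Whether this is actually faster than any(is.na(sequence)) for real sq objects depends on how is.na() is implemented for them, so it would need benchmarking.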

constructing unt sq with ignore_case = TRUE doesn't ignore case enough

Describe the bug
Using alphabet = "unt" with ignore_case = TRUE does not ignore case completely. Instead, case is ignored only for those lowercase letters that also appear in uppercase. If no uppercase equivalent is present, the letter is treated as NA.

To Reproduce

# simply run
sq(c("oXYOqwwKCNJLo"), alphabet = "unt", ignore_case = TRUE)

Expected behavior
The above call should return an sq object of type unt with all letters in uppercase and with no NA values (that is, no !).

Implement summary.sq

For a long time, summary.sq() was simply a call to summary.default(). We can do so much better and provide a valuable insight into sq objects. However, first and foremost, we have to decide what should be included in this summary.

Extract common code between find_motifs and %has% operator

Right now there are repeated lines of code like this one:

y <- lapply(y, function(s) replace(s, s == "D", "[DATG]"))

It would make sense to group them in one method (or three methods, each one for separate class) instead of having them in find_motifs() and %has% simultaneously.
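One possible shape for such a shared helper is sketched below. The function name expand_ambiguous and the lookup table are hypothetical; only the "D" entry is taken from the line quoted above, and the remaining entries would have to come from the existing code.

```r
# Hypothetical lookup table of ambiguous letters and their regex classes.
# Only the "D" entry comes from the issue text.
ambiguity_patterns <- c(D = "[DATG]")

# Replace every ambiguous letter in each motif (a character vector of single
# letters) with its regex class; other letters are left untouched.
expand_ambiguous <- function(motifs, patterns = ambiguity_patterns) {
  lapply(motifs, function(s) {
    hit <- s %in% names(patterns)
    s[hit] <- unname(patterns[s[hit]])
    s
  })
}

expanded <- expand_ambiguous(list(c("A", "D", "T"), c("C", "G")))
```

Both find_motifs() and %has% could then call the same helper with the table appropriate for their sequence type.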

Internal generic lengths.sq does not dispatch properly

Calling lengths(sq) results in calling default implementation of lengths instead of method lengths.sq. lengths is not a common generic function, but an internal generic, which may be the reason why it fails to work correctly.

Add paste functionality to sq class

The idea is that the user might want to paste some sequences together. I imagine it would be used like that:

sq_dna_1 <- sq(c("CTTCGCCA", "CGATCTTG"), "dna_bsc")
sq_dna_2 <- sq(c("ATTGC", "TCACC"), "dna_bsc")
paste(sq_dna_1, sq_dna_2)
# above would be identical to
sq(c("CTTCGCCAATTGC", "CGATCTTGTCACC"), "dna_bsc")

We could also have sep and collapse arguments implemented as well.

If it would make it easier for bioinformaticians to use it, we could extract specific function called collapse to operate on single sq object:

sq_dna <- sq(c("CTTCGCCA", "CGATCTTG"), "dna_bsc")
collapse(sq_dna)
# above would be identical to
sq("CTTCGCCACGATCTTG", "dna_bsc")

I dare even say that the second functionality would be easier to implement, as no type compatibility checking would be necessary.
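For reference, the intended semantics can be shown with plain character vectors and base R. This is only an analogue of the proposal, not an implementation for sq objects.

```r
a <- c("CTTCGCCA", "CGATCTTG")
b <- c("ATTGC", "TCACC")

# Element-wise pasting: what paste(sq_dna_1, sq_dna_2) is proposed to do.
pasted <- paste0(a, b)

# Collapsing one vector into a single sequence: what collapse(sq_dna) is proposed to do.
collapsed <- paste(a, collapse = "")
```

An sq version would presumably also have to check that both arguments share a type, which is the compatibility check mentioned above.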

Add Sequence::(const_)reverse_iterator

Actually Sequence::const_reverse_iterator would be used in reverse(). Right now it has to be iterated over manually. Also, it's just a useful thing to have.

Add construct_sq_[unt/atp] as a convenience shortcut

There are currently methods called construct_sq_ami() and construct_sq_nuc() (the latter scheduled to be replaced with construct_sq_dna() and construct_sq_rna()) that allow a subset of parameters, making their use easier and less confusing. I'd like to see similar methods for unt and atp types, expecting something like:

  • construct_sq_unt(sq)
  • construct_sq_atp(sq, non_standard) or construct_sq_atp(sq, alphabet)

Add safe_mode to import_sq()

We should implement safe_mode for all input functions. We have it done or scheduled for sq() and read_fasta(), so only import_sq() is left unsafe.

Can we call method of base class directly?

In some of the operations, we have to override the base class method ELEM_OUT operator(ELEM_IN), even though it looks exactly the same as in the base class. Furthermore, when we want to use the base class method initialize_element_out, we have to specify the superclass (e.g. OperationSqToSq<INTERNAL_IN, INTERNAL_OUT>::initialize_element_out(sequence_in) in complement.h), even though it is not overridden.

Function suggestion: complement()

Like so?

complement = function(x){
  tbl = c('A', 'C', 'G', 'T')
  names(tbl) = c('T', 'G', 'C', 'A')
  x_clt = sapply(X = strsplit(x = x, split = ''),
                 FUN = function(x_i){ paste0(tbl[x_i], collapse = '') })
  return(x_clt)
}

Faster implementation of functions

The performance of most of the functions can still be improved by moving their work to C++. Below we will keep a list of them:

Important (improvement gain is relatively big in comparison to workload):

  • %has% operator,
  • == operator (it uses as.character, which can be inefficient; the fix doesn't even need to be written in C++),
  • encode, typify, substitute_letters (checking for presence of unspecified letters in R is way too slow)
  • encsq_to_list
  • write_fasta
  • reverse

Trickier (they require quite a lot of work):

  • bite (it could unpack sequences "intelligently"),
  • clean
  • complement

Implement bite for subsequences

Whenever we call bite() with indices like 7:32, we access every element separately, that is, 7, 8, 9... Computing all these bit and byte indices and shifting three/five bits at a time is very inefficient. It would be better to have a dedicated function that interprets such subsequences and computes shifting indices only for the first and the last of the passed indices (i.e. 7 and 32 in this example).
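The index arithmetic for a contiguous range can be sketched as follows. It assumes letters are packed back to back with a fixed number of bits each and no padding between them; the function name packed_range and the returned fields are hypothetical, and the real packing scheme may differ in bit order.

```r
# For a contiguous run of letters first:last (1-based), find which bytes hold
# the run and how far into the first byte it starts.
# Assumes fixed-width, back-to-back packing; names here are illustrative only.
packed_range <- function(first, last, bits_per_letter) {
  first_bit <- (first - 1L) * bits_per_letter  # 0-based offset of the run's first bit
  last_bit  <- last * bits_per_letter - 1L     # 0-based offset of the run's last bit
  list(
    first_byte = first_bit %/% 8L + 1L,        # 1-based index of the first byte touched
    last_byte  = last_bit %/% 8L + 1L,         # 1-based index of the last byte touched
    shift      = first_bit %% 8L               # bit offset inside the first byte
  )
}

r <- packed_range(7L, 32L, 3L)
```

With these three numbers the whole run can be copied in one pass, instead of recomputing offsets for each of the 26 letters.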

Allow using c() on mixed clean/unclean sequences

As of now, c.sq() checks for the identity of alphabets. However, concatenating clean and unclean sequences of one type (say, "nuc" type) is a plausible use case. While I understand that they might (and probably do!) use different encodings, it would make sense to include such possibility.

Improve Sequence::iterator

Every time an element is accessed, bit shifting is performed, even though for sequential access the result could be cached to improve efficiency.

Thoughts on PepTools

So, my original thought with PepTools was a small, super lightweight, dependency-free (i.e. base code only) toolbox for working with peptide data (which is what we do in the group), e.g.

  • Create random peptides drawing from different backgrounds
  • Translate between 1-3-full AA names
  • Extend a set of peptides of different lengths, to have same length using X
  • Create a set of mutant peptides from a wild type peptide
  • Encode for machine learning using different schemes one-hot, BLOSUM, atchley-factors, BLOSUM_pca, etc.
  • ...and alike

At the same time, I wanted to use it in my teaching ("Immunological Bioinformatics" and "R for Bio Data Science")

Some of the functions would be simple wrappers, primarily to match the terminology of bioinformatics, e.g.

PepTools2::pep_split
function(pep){
  # Check input
  pep_check(pep = pep)
  # Convert to matrix
  # do.call applies a function to the list returned from args
  # so rbind to form matrix each of the elements in the list returned
  # by strsplit
  return( do.call(what = rbind, args = strsplit(x = pep, split = '')) )
}

and then also include standard data, like PepTools2::BLOSUM62 and PepTools2::BLOSUM50, natural background frequencies PepTools2::BGFREQS and example peptides PepTools2::PEPTIDES. Furthermore, the ggseqlogo package is quite nice, but it only supports simple Shannon entropy based logos, which is sub-optimal compared to Kullback-Leibler logos. So basically, I wanted to extend it with the ability to compute PSSMs to match the functionality of Seq2Logo. These matrices could then be visualised using the custom functionality of ggseqlogo. Lastly, my intention was to name all functions using the prefix pep_.

Thinking about it, perhaps we should make PepTools a separate package, but still a sub-part of tidysq? A bit like ggplot2 is a part of tidyverse?

I'm interested in your thoughts. 👍

Create flow diagram for construct_sq()

Currently (as of 7 VII 2020) the construct_sq() method has four parameters, and their usage varies greatly depending on the other parameters. All cases are described in the documentation, but even the clearest of lists have limited clarity. Thus I'd recommend creating a flow diagram that illustrates all possible (and impossible) parameter combinations. It could be used within vignettes, the GitHub readme and possibly a cheatsheet.

Override "head" for sq so that it prints original length of sq object

Problem description
I think head.sq() would be more descriptive if it printed the length of the original (not-headed) sq object.

Proposed solution
Override head.sq() so that it sets an attribute with original length value (maybe name it original_length and rename original_length attribute of single sequence to seq_length or something like that?).
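A rough sketch of the proposed behaviour on a toy class. The class toy_sq is hypothetical, and the attribute name original_length simply follows the suggestion above; the real head.sq() would also need to preserve the alphabet and other sq attributes.

```r
# Toy stand-in for an sq object: a classed character vector.
# "toy_sq" is a hypothetical class, not part of tidysq.
new_toy_sq <- function(x) structure(x, class = "toy_sq")

# head() method that remembers how long the object was before truncation.
# Only non-negative n is handled in this sketch.
head.toy_sq <- function(x, n = 6L, ...) {
  total <- length(x)
  kept  <- unclass(x)[seq_len(min(n, total))]
  structure(kept, class = "toy_sq", original_length = total)
}

x <- new_toy_sq(c("ACG", "TTA", "GGC", "CAT"))
h <- head(x, 2)
attr(h, "original_length")
```

A print method could then report something like "first 2 of 4 sequences", and tail() could set the same attribute.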

Additional notes
Actually the same request works for tail.sq() as well.

Discussion on compare operator `==`

There are two or three things that came across my mind when reading the code and I'd like to discuss them.
First, should we be able to compare, say, "dnasq" and "rnasq" objects? If so, should they return TRUE whenever they match? What about "dnasq" and "amisq" or "untsq", where letters may match, but may have different meanings?
Another question is more of an improvement: even if we compare two sq objects, they are still both coerced to character vectors. I think it could be done better, like, first compare alphabets and then simply compare raw vectors of the sq objects.
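The second idea could look roughly like the sketch below. It uses a toy representation (a list of raw vectors with an "alphabet" attribute) and hypothetical function names. Stopping on mismatched alphabets is just one possible answer to the first question, not a settled decision.

```r
# Toy packed representation: a list of raw vectors plus an "alphabet" attribute.
# Both helper names are hypothetical and do not exist in tidysq.
new_packed <- function(seqs, alphabet) structure(seqs, alphabet = alphabet)

# Compare alphabets once, then compare the raw vectors directly,
# without converting anything to character.
equal_packed <- function(x, y) {
  if (!identical(attr(x, "alphabet"), attr(y, "alphabet"))) {
    stop("alphabets differ; comparison is not defined in this sketch")
  }
  mapply(identical, unclass(x), unclass(y), USE.NAMES = FALSE)
}

dna <- c("A", "C", "G", "T")
x <- new_packed(list(as.raw(c(1, 2)), as.raw(3)), dna)
y <- new_packed(list(as.raw(c(1, 2)), as.raw(4)), dna)
result <- equal_packed(x, y)
```

This only works if equal sequences always have identical packed bytes, which would have to be confirmed for the real encoding.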
