Coder Social home page Coder Social logo

rexsearch's Introduction

rexsearch

Port of vsearch into Rust

The licensing applies from vsearch, since it is mostly a port. I intend no copyright or other infringment. All credit to Torbjørn Rognes (torognes), also took some code from rust::bio. Same applies, all credit to Johannes. And of course, usearch.

Rewriting vsearch with protein comparison and FOGSAA. I have a few ideas how to use this for annotation and other projects. Such as "overhang" extension. I need to a tool to try to fill in gaps in assemblies from PacBio to Illumina. My Illumina reads always drop coverage in HiSeq and I would like to "fill-in" these gaps by returning partial alignments and long overhangs. Many tools already do something similar, but not quite the way I wish to use it though. Still getting to the point of the original intent. Now, I am also thinking about making several smaller, more modular tools, as part of package perhaps.

Comments

REXSEARCH is nearing VSEARCH speed. It is 30% faster for one core, but overall 50% slower because I only use 1 core. This is also for simple tasks such as msking or filtering reads, not alignment.

Candidate tools:

rfilter        fast filtering
rsearch     for sequnce matches
rsort          sort by length and abundance
rsample     subsample and shuffle reads
ralign         alignment and splicing tools

FILTER

Added capability to change softmask to hardmask (mask::fast_mask). Now have max/minlen filter (untested), max/minqual filter (untested), any_n filter (tested, filters sequences with any 'N' in seq), count_n (untested, LESS than provided, not equal to), and DUST preserving 80 char lines, and without (tested). DUST has soft- and hardmasking.

vsearch --fastx_mask /Users/daniel/Desktop/gg.fasta --fastaout  --qmask dust   10.55s user 0.13s system 291% cpu 3.660 total
./rsearch /Users/daniel/Desktop/gg.fasta  7.23s user 0.14s system 99% cpu 7.434 total

This in fact means vsearch was probably using more cores and was slower. Let's see with more thorough testing.

ALIGNMENT

Next Phase is back to alignment:

Smith-Waterman            Not Started
FOGSAA                          Not Started
Needlemann-Wunsch    In progress

CLUSTERING

This seems to be where vsearch and usearch shine. Not really going to focus on this for now.

DEREPLICATION

Now this could be something for rsearch to implement, rather simple, just look at "size=" tag on fasta header, then derep and eventually rerep. Saves space and time.

2 issues, vsearch --fastx_mask puts everything to uppercase (softmasking goes away). Maybe this is how it should be, but I don't think so. The other issue is, vsearch removes everything after " " in the fasta header. Again, maybe this is for a reason but I prefer to keep the file as is. Other than that, I am fairly sure I give more or less, the same output.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.