laurentnoe / iedera

subset and spaced seed design tool

Home Page: https://bioinfo.univ-lille.fr/yass/iedera.php

License: BSD 3-Clause "New" or "Revised" License

iedera

(more at <http://bioinfo.univ-lille.fr/yass/iedera.php>)

iedera is a tool to select and design spaced seeds, transition constrained spaced seeds, or more generally subset seeds, and vectorized subset seed patterns.

Installation

(more at <http://bioinfo.univ-lille.fr/yass/iedera.php#downloadiedera>)

Binaries for Windows (x64) and OS X (x64) are available at <https://github.com/laurentnoe/iedera/releases>.

Otherwise, you need a C++ compiler and the autotools. On Linux, install g++, autoconf, and automake. On macOS, install Xcode or the command line developer tools (or use MacPorts to install, for example, g++-mp-5).

Using the command line, type:

git clone https://github.com/laurentnoe/iedera.git
cd iedera
./configure
make

or:

git clone https://github.com/laurentnoe/iedera.git
cd iedera
autoreconf
./configure
automake
make

You can then install iedera to the standard /usr/local/bin directory:

sudo make install

or copy the binary directly to your home directory:

cp src/iedera ~/.

Command-line

(more at <http://bioinfo.univ-lille.fr/yass/iedera.php#quick>)

First, use one of these two parameters, which are shortcuts for quite long command lines:

-spaced for spaced seeds
-transitive for transition-constrained spaced seeds

Then you can change the weight, span, and number of seeds being designed:

-w <N,N> for the weight range, where N = [1..16] seems reasonable
-s <N,N> for the span range, where N = [1..32] seems reasonable
-n <N> for the number of seeds, where N = [1..32]

as well as the length of the alignment:

-l <N> where N = [1..64] seems reasonable

NOTE: enumerating all combinations of multiple seeds may take time, so if -n is set to a value greater than one, consider the two following parameters:

-r <N> to run the tool on N randomly generated seed patterns
-k to activate the hill-climbing algorithm on the patterns generated by -r

(more at <http://bioinfo.univ-lille.fr/yass/iedera.php#details>)

Examples

Spaced seeds

A very small example where the seed weight is set to 11, and the span is at most 18 (full enumeration):

iedera -spaced -w 11,11 -s 11,18

gives the classical PatternHunter 1 spaced seed:

###-#--#-#--##-###    0.999999761581      0.467122       0.532878
(SEED PATTERN)        (selectivity)       (SENSITIVITY)  (distance to 1,1)
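The columns of this result can be cross-checked with a few lines of Python. The sketch below is not part of iedera: it counts the weight and span of the pattern, and it assumes (an inference from the value printed above, not a documented formula) that the selectivity of a weight-w seed on uniform random DNA is 1 - 4^(-w), and that the last column is the Euclidean distance from the point (selectivity, sensitivity) to (1,1). All function names are hypothetical.

```python
import math

def seed_weight(pattern: str) -> int:
    """Number of must-match positions ('#') in a spaced seed."""
    return pattern.count("#")

def seed_span(pattern: str) -> int:
    """Total length of the seed, including don't-care positions ('-')."""
    return len(pattern)

def selectivity_uniform_dna(weight: int) -> float:
    """ASSUMPTION: selectivity = 1 - 4**(-weight) on uniform random DNA."""
    return 1.0 - 0.25 ** weight

def distance_to_ideal(selectivity: float, sensitivity: float) -> float:
    """ASSUMPTION: last column = Euclidean distance to the point (1, 1)."""
    return math.hypot(1.0 - selectivity, 1.0 - sensitivity)

pattern = "###-#--#-#--##-###"
weight = seed_weight(pattern)
span = seed_span(pattern)
selectivity = selectivity_uniform_dna(weight)
# The sensitivity value is copied from the iedera output shown above.
distance = distance_to_ideal(selectivity, 0.467122)
print(weight, span, selectivity, distance)
```

Under these assumptions the computed selectivity and distance agree with the printed columns to the displayed precision.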

A second example where the number of seeds is now set to 2, the alignment length is set to 50, and 10000 seeds will be tested with the hill-climbing algorithm activated:

iedera -spaced -n 2 -w 11,11 -s 11,22 -l 50 -r 10000 -k

Transition seeds

A very small example for transition seeds (hill climbing):

iedera -transitive -w 11,11 -s 11,22 -r 10000 -k

Lossless seeds

A very small example for lossless seeds (from Burkhardt & Kärkkäinen): find a lossless seed of weight 12, span at most 19, on alignments of length 25 with 2 mismatches:

iedera -spaced -s 12,19 -w 12,12 -l 25 -L 1,0 -X 2
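What "lossless" means here can be checked by brute force on such a small problem: a seed (or set of seeds) is lossless if every alignment of the given length with at most the given number of mismatches is hit at least once. The following Python sketch is an independent check, not iedera's algorithm; the function names are hypothetical, and the weight-12, span-19 pattern is an illustrative input chosen for this sketch rather than output copied from the tool.

```python
from itertools import combinations

def match_offsets(pattern):
    """Offsets of the must-match positions ('#') inside a seed pattern."""
    return [i for i, symbol in enumerate(pattern) if symbol == "#"]

def is_lossless(seeds, length, max_mismatches):
    """True if every placement of at most max_mismatches mismatches in an
    alignment of the given length still leaves at least one seed hit."""
    parsed = [(len(seed), match_offsets(seed)) for seed in seeds]
    for k in range(max_mismatches + 1):
        for mismatches in combinations(range(length), k):
            bad = set(mismatches)
            hit = any(
                all(start + offset not in bad for offset in offsets)
                for span, offsets in parsed
                for start in range(length - span + 1)
            )
            if not hit:
                return False  # this mismatch placement escapes all seeds
    return True

# Illustrative inputs for the (length 25, 2 mismatches) problem.
spaced_ok = is_lossless(["###-#--###-#--###-#"], 25, 2)
contiguous_ok = is_lossless(["#" * 12], 25, 2)
print(spaced_ok, contiguous_ok)
```

The same function accepts several seeds at once, so it can also be applied to a set of two seeds.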

A second example for lossless seeds (from Kucherov, Noe & Roytberg) on the previous problem, but with two seeds of weight 14, and span between 20 and 21 (to ease the search):

iedera -spaced -l 25 -L 1,0 -X 2 -n 2 -s 20,21 -w 14,14  -r 100..some.zeros..00 -k

IUPAC seeds

IUPAC filtered seeds could challenge minimizer-based techniques <https://www.biorxiv.org/content/10.1101/2020.07.24.220616v2>, so we have extended the iedera tool to support such seeds.

First get the alignment probabilities from the TAM92 model <https://pubmed.ncbi.nlm.nih.gov/1630306/>, then launch the optimization from a starting shape with the given probabilities:

iedera -iupac -s 5,17 -m "RYYNNNNN,RRYNNNNN" -i shuffle  -r 10000 -k -z 100 -f  `./tam92.py -p 20 -k 1 --gc 50`

YNYRNNnnNN,RNYRNnnNNN       0.9999961853027 0.912921        0.087079

Here:

  • N is a match symbol (equivalent to #)
  • n is a don't care symbol (equivalent to -)
  • R and Y (uppercase) are respectively purine and pyrimidine matches (e.g. R is A-A or G-G matches but not A-G or G-A; use lowercase symbols to allow all)
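The symbol semantics can be made concrete with a short Python sketch. It is a reading of the list above, not iedera's implementation: purines are A and G, pyrimidines are C and T, and the lowercase r and y are assumed to accept any pair of purines (respectively pyrimidines), identical or not. The function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def symbol_accepts(symbol, a, b):
    """Does the seed symbol accept the aligned nucleotide pair (a, b)?"""
    if symbol == "n":                      # don't care: anything goes
        return True
    if symbol == "N":                      # match: identical nucleotides
        return a == b
    if symbol == "R":                      # purine match: A-A or G-G
        return a == b and a in PURINES
    if symbol == "Y":                      # pyrimidine match: C-C or T-T
        return a == b and a in PYRIMIDINES
    if symbol == "r":                      # ASSUMPTION: any purine pair
        return a in PURINES and b in PURINES
    if symbol == "y":                      # ASSUMPTION: any pyrimidine pair
        return a in PYRIMIDINES and b in PYRIMIDINES
    raise ValueError(f"unsupported seed symbol: {symbol!r}")

def seed_hits_at(seed, x, y, pos):
    """Check one seed against sequences x and y aligned without gaps."""
    return all(
        symbol_accepts(symbol, x[pos + i], y[pos + i])
        for i, symbol in enumerate(seed)
    )

print(seed_hits_at("RYn", "ACT", "ACG", 0))
```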

Input/Output and reoptimization

It may sometimes be helpful to rerun the same experiment several times and keep the best result of all runs. This is easily done with the input/output parameters:

-e <filename> for input file (filename can be a nonexistent file)
-o <filename> for output file (filename may be the same as the input file)

so running this command-line multiple times:

iedera -spaced -l 25 -L 1,0 -X 2 -n 2 -w 14,14 -s 20,21 -r 10000 -k -e file_n2_w14_l25_x2_lossless.txt -o file_n2_w14_l25_x2_lossless.txt

will probably find a lossless set of two seeds. Running this command-line multiple times:

iedera -spaced -l 64 -n 2 -w 11,11 -s 11,22 -r 10000 -k -e file_n2_w11_l64_lossy.txt -o file_n2_w11_l64_lossy.txt

will also probably improve the sensitivity result.
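Repeating such a run by hand quickly gets tedious, so it can be scripted. The Python sketch below only wraps the flags shown above; it assumes the iedera binary is on the PATH, and the helper names build_command and rerun are hypothetical.

```python
import subprocess

def build_command(binary, seed_file, search_args):
    """Build an iedera command line that reads and writes the same file,
    so that each run starts from the best result of the previous ones."""
    return [binary, *search_args, "-e", seed_file, "-o", seed_file]

def rerun(command, times):
    """Run the same command several times; stop on the first failure."""
    for _ in range(times):
        subprocess.run(command, check=True)

# Same parameters as the lossy example above.
command = build_command(
    "iedera",  # ASSUMPTION: binary available on the PATH
    "file_n2_w11_l64_lossy.txt",
    ["-spaced", "-l", "64", "-n", "2", "-w", "11,11",
     "-s", "11,22", "-r", "10000", "-k"],
)
print(" ".join(command))
```

Calling rerun(command, 10) would then launch ten successive runs, each one reusing the file written by the previous run.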

Polynomial form

Bernoulli model

When the probability p to generate a match is not fixed (for example p=0.7 was set in all the previous examples), Mak & Benson have proposed to use a polynomial form and select what they called dominant seeds. We have noticed that this dominance applies as well to any other i.i.d. criterion, such as Hit Integration (Chung & Park), to lossless seeds, and to several discrete models (see <http://doi.org/10.1186/s13015-017-0092-1>), so the flag:

-p to activate dominant selection and output polynomial coefficients

has been added in the current committed version of iedera (master branch).
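To illustrate what such a polynomial form contains, here is a small brute-force Python sketch, independent of iedera and far less efficient than its algorithm. For a seed and an alignment length, it counts, for each number k of matches, how many binary alignments with exactly k matches are hit; the sensitivity for any p is then the sum of c_k * p^k * (1-p)^(l-k). The seed ##-# and the length 8 are illustrative choices, and the function names are hypothetical.

```python
from itertools import product

def is_hit(pattern, alignment):
    """True if the seed fits at some position of a 0/1 alignment (1 = match)."""
    offsets = [i for i, symbol in enumerate(pattern) if symbol == "#"]
    span = len(pattern)
    return any(
        all(alignment[start + offset] for offset in offsets)
        for start in range(len(alignment) - span + 1)
    )

def hit_counts(pattern, length):
    """counts[k] = number of hit alignments with exactly k matches."""
    counts = [0] * (length + 1)
    for alignment in product((0, 1), repeat=length):
        if is_hit(pattern, alignment):
            counts[sum(alignment)] += 1
    return counts

def sensitivity(counts, p):
    """Evaluate the polynomial for a given match probability p."""
    length = len(counts) - 1
    return sum(c * p**k * (1 - p) ** (length - k) for k, c in enumerate(counts))

counts = hit_counts("##-#", 8)
print(counts)
print(sensitivity(counts, 0.7))
```

Because the coefficients do not depend on p, two seeds can be compared over the whole range of p from their coefficient lists alone, which is the idea behind selecting dominant seeds.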

Other multivariate models

When the probabilistic model is more complex than a simple Bernoulli model on a binary alphabet, it is possible to compute the probability as a multivariate polynomial form. For a given seed provided with the -m parameter, the output will contain this polynomial form set in square brackets. Selection of the best seeds is left as an exercise for the reader. The flag -pF <filename> activates the output of the multivariate polynomial on the given model. The next example gives the sensitivity of the seed 1101 on alignments of length 8:

iedera -spaced -pF model_bernoulli_simple_x_xp.txt  -m "##-#" -l 8

on the Bernoulli model provided by the file model_bernoulli_simple_x_xp.txt:

2
   0   1
      0   1
         0   x
      1   1
         0   xp
   1   0
      0   1
         1   x
      1   1
         1   xp

Tools provided with iedera

The iedera binary is located in src/iedera. The scripts plot_spaced_seeds.py and plot_mow_seeds.py are provided to plot:

  • the sensitivity for a 1st hit, on alignments generated with a (parameter-free) Bernoulli model,
  • the frequency for a 1st hit, on alignments generated with an increasing frequency of matches, for a set of given seeds.

[Figure: spaced seeds sensitivity or frequency on 1st hit]

References

How to cite this tool:

Kucherov G., Noe L., Roytberg M., A unifying framework for seed sensitivity and its application to subset seeds, Journal of Bioinformatics and Computational Biology, 4(2):553-569, 2006 <http://doi.org/10.1142/S0219720006001977>

Noe L., Best hits of 11110110111: model-free selection and parameter-free sensitivity calculation of spaced seeds, Algorithms for Molecular Biology, 12(1), 2017 <http://doi.org/10.1186/s13015-017-0092-1>
