

SlideVar

Sliding Window Variation Identifier (SlideVar)

SlideVar is a computational pipeline for detecting lineage-specific changes based on sequence consistency. Its main method is to slide a window of size 1 over previously aligned sequences to assess sequence consistency at the single-nucleotide level. It then searches for conserved positions that show lineage-specific variation in the target lineage. Because the pipeline only considers exact character matches, it can also identify lineage-specific amino acid substitutions. Whether in genes or in non-coding regulatory elements, functional sequences can be quite short. SlideVar.pl is therefore well suited to finding short, segment-specific changes within longer conserved regions. Overall, this pipeline uses windowed consistency scoring to identify lineage-specific variations embedded in multiple sequence alignments.


USAGE


Pipeline for finding lineage-specific divergent Conserved Noncoding Elements

  1. Build the LAST database. For species with relatively distant phylogenetic relationships, the MAM8 seeding scheme is recommended. (Here, we used LAST to perform pairwise whole-genome alignment: https://gitlab.com/mcfrith/last )
/path/to/last/bin/lastdb -u MAM8 ref ref.fa
  2. Use lastal to generate the pairwise whole-genome alignment. The HOXD70 scoring matrix works better for distant phylogenetic relationships.
/path/to/last/bin/lastal -m10 -pHOXD70 /path/to/ref_MAM8_db query.fa > query.maf
  3. Use last-split to get one-to-one alignments, then sort the MAF.
/path/to/last/bin/last-split -m10 query.maf | /path/to/last/bin/last-split -r -m10 | /path/to/last/bin/maf-sort > query.sort.maf 
  4. Combine all pairwise MAF files into a multiple-alignment MAF.
# remember to rename every pairwise MAF file to ref.query.sing.maf
/path/to/multiz/build/roast - T=. E=ref  "(ref, (species1, (species2 , species3)))" combined.maf > roast.maf.sh && sh roast.maf.sh

Or you can simply join all MAF files with maf-join:

/path/to/last/bin/maf-join *sort.maf > combined.maf # note: this keeps only the MAF blocks in which all species have aligned sequence
  5. phastCons (optional), see: http://compgen.cshl.edu/phast/phastCons-HOWTO.html

  6. You can also directly convert the MAF blocks into FASTA format and record the position.

perl maf2fasta_by_speceis_list.pl <input.maf> <species.list> <min_length> <output_dir_prefix>
  7. Run SlideVar.pl for each FASTA file.
for i in output_dir/*/*/Block*fasta; do 
	perl SlideVar.pl -in "$i" -l species.list -w 20 -con 18 -div 12 -bn 2
done
  8. Run mergeResults.pl to merge all results. In the results, a string such as "M1I2D3S4" means 1 site matched, 2 sites inserted, 3 sites deleted, and 4 sites substituted, which also means the window length is 10 (1+2+3+4=10).
perl mergeResults.pl output_dir species.list output_file
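
As a small illustration of how a result string such as "M1I2D3S4" can be read, the following Python sketch splits it into per-category counts and sums them to recover the window length. The function name `parse_change_code` is hypothetical, and the sketch assumes the string consists only of the letters M, I, D, and S, each followed by a non-negative integer; the real output of mergeResults.pl may contain additional fields.

```python
import re


def parse_change_code(code):
    """Split a string like 'M1I2D3S4' into counts of matched (M),
    inserted (I), deleted (D), and substituted (S) sites."""
    pairs = re.findall(r"([MIDS])(\d+)", code)
    # Reject anything that is not fully covered by letter+number pairs.
    if "".join(letter + number for letter, number in pairs) != code:
        raise ValueError(f"unexpected change code: {code!r}")
    counts = {"M": 0, "I": 0, "D": 0, "S": 0}
    for letter, number in pairs:
        counts[letter] += int(number)
    return counts


counts = parse_change_code("M1I2D3S4")
window_length = sum(counts.values())
print(counts, window_length)
```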

perl SlideVar.pl -in <input.fasta> -l <species.list> -w <window size> -con <conserved number> -div <diverged number> -bn <background diverged species>

    -in input.fasta : aligned sequences in FASTA format
    -l species.list : species list with the foreground species marked
    -w window size : default 20; the window size used to evaluate conservation
    -con conserved number : default 18; threshold for a conserved region (identity >= conserved number is conserved)
    -div diverged number : default 12; threshold for a divergent region (identity <= diverged number is divergent)
    -bn background diverged species : default 0; how many background species are allowed to be divergent (for cases where some background species are not conserved because of assembly errors or other reasons)
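
To make the interaction of the -w, -con, -div, and -bn thresholds concrete, here is a hedged Python sketch of one way a single window could be classified. It is an interpretation of the option descriptions above, not the logic of SlideVar.pl: the names `window_identity` and `is_lineage_specific_window` are hypothetical, identity is assumed to be the number of positions that exactly match the reference within the window, and every foreground species is assumed to need identity <= div.

```python
def window_identity(ref, seq, start, width):
    """Count positions in the window where seq exactly matches the reference."""
    window = zip(ref[start:start + width], seq[start:start + width])
    return sum(1 for a, b in window if a == b)


def is_lineage_specific_window(alignment, ref_name, foreground, start,
                               width=20, con=18, div=12, bn=0):
    """Return True if the window is conserved in the background
    (identity >= con, with at most bn exceptions) and divergent in
    every foreground species (identity <= div)."""
    ref = alignment[ref_name]
    background_failures = 0
    for name, seq in alignment.items():
        if name == ref_name:
            continue
        identity = window_identity(ref, seq, start, width)
        if name in foreground:
            if identity > div:
                return False  # foreground species is not divergent enough
        elif identity < con:
            background_failures += 1  # tolerated up to bn species
    return background_failures <= bn


# Toy 20-column alignment (hypothetical data).
alignment = {
    "human": "A" * 20,
    "mouse": "A" * 19 + "C",       # identity 19: conserved
    "frog": "A" * 15 + "C" * 5,    # identity 15: not conserved
    "snake": "A" * 12 + "C" * 8,   # identity 12: divergent
}
strict = is_lineage_specific_window(alignment, "human", {"snake"}, 0, bn=0)
tolerant = is_lineage_specific_window(alignment, "human", {"snake"}, 0, bn=1)
print(strict, tolerant)
```

In this toy case the window is rejected with -bn 0 because one background species falls below the conserved threshold, and accepted with -bn 1, which mirrors the stated purpose of tolerating assembly errors in background species.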

    ---
    Species list file format: one species per line, with foreground species marked with '*'. Note that the reference species should not be marked as a foreground species:
    human
    mouse
    snake *
    frog
    caecilian *
    ...
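
    The species list format shown above can be read with a few lines of Python. This is a minimal sketch, assuming that the '*' marker always comes at the end of the line; the function name `parse_species_list` is hypothetical and not part of SlideVar.

```python
def parse_species_list(text):
    """Split a species list into (background, foreground) name lists.

    Lines ending in '*' are foreground species; blank lines are skipped.
    """
    background, foreground = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.endswith("*"):
            foreground.append(line[:-1].strip())
        else:
            background.append(line)
    return background, foreground


example = "human\nmouse\nsnake *\nfrog\ncaecilian *\n"
background, foreground = parse_species_list(example)
print(background, foreground)
```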

    FASTA file format (species names should not contain '.', '-', '@', etc.; '_' is allowed; the first species should be the reference species):
    >species_name
    AAGCTTGGG
    or
    >species_name.seqId
    AAGCTTGGG
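
    The header rules above can be checked before running the pipeline. The following Python sketch reads aligned FASTA text, takes the part of each header before the first '.' as the species name, and treats the first record as the reference. The function name `read_aligned_fasta` is hypothetical, and restricting names to letters, digits, and '_' is an assumption about what "etc." covers.

```python
import re

# Assumption: only letters, digits, and '_' are safe in species names.
VALID_NAME = re.compile(r"^[A-Za-z0-9_]+$")


def read_aligned_fasta(text):
    """Return a dict of species name -> aligned sequence, in file order."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # '>species_name' or '>species_name.seqId'
            current = line[1:].split(".", 1)[0]
            if not VALID_NAME.match(current):
                raise ValueError(f"unsupported species name: {current!r}")
            records[current] = ""
        elif current is not None:
            records[current] += line
    return records


records = read_aligned_fasta(">human.chr1\nAAGCTTGGG\n>snake\nAAGCTTGGA\n")
reference = next(iter(records))  # the first species is the reference
print(reference, records)
```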

Contributors

jackie-duang
