SWAMP - Sliding Window Alignment Masker for PAML

Documentation for release version: 31/03/14

Documentation last updated: 12/11/15

About:

SWAMP analyses multiple sequence alignments in a phylogenetic context, looking for regions of higher than expected non-synonymous substitutions along a branch, over a short sequence window. If a user defined threshold is exceeded then the window of sequence is masked to prevent its inclusion in downstream evolutionary analyses. This masking approach removes sequence data that violates the assumptions of the phylogenetic models implemented in the software package PAML that could otherwise give a false signal of positive selection.

SWAMP requirements:

Python 2.6 or 2.7 (not compatible with Python 3)

PAML 4.7 or 4.8 (for calculation of non-synonymous substitutions)

Prior to running SWAMP:

SWAMP requires the branch information contained in the rst file generated by PAML. To create this file, run a one-ratio model (model = 0, NSsites = 0) in PAML codeml for each phylip alignment, leaving the remaining parameters at their defaults. For an example PAML control (.ctl) file, see example_dataset/data/44/44.ctl. Note that PAML should be run with the ‘cleandata’ option turned off (cleandata = 0) if branch-specific filtering is desired. The rst file is a PAML temporary file that is overwritten with each PAML run, so if multiple alignments are to be analysed, each PAML output should be placed in a separate subfolder.
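As a rough illustration of the settings named above, the sketch below writes a minimal codeml control file for a one-ratio run. The helper name `write_one_ratio_ctl`, the output name `codeml.ctl` and the file names passed to it are hypothetical; only the listed settings are written, so treat example_dataset/data/44/44.ctl as the authoritative template.

```python
import os


def write_one_ratio_ctl(folder, seqfile, treefile, outfile):
    """Write a minimal codeml control file for a one-ratio run (hypothetical helper)."""
    settings = [
        ("seqfile", seqfile),    # phylip alignment to analyse
        ("treefile", treefile),  # tree for the same sequences
        ("outfile", outfile),    # main codeml output file
        ("seqtype", 1),          # 1 = codon sequences
        ("model", 0),            # one ratio across all branches
        ("NSsites", 0),          # one ratio across all sites
        ("cleandata", 0),        # do not strip sites with gaps or ambiguity
    ]
    path = os.path.join(folder, "codeml.ctl")
    with open(path, "w") as handle:
        for key, value in settings:
            handle.write("%12s = %s\n" % (key, value))
    return path
```

Writing one control file per subfolder keeps each run's rst file alongside its own alignment, which matches the layout SWAMP expects.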

SWAMP usage:

python SWAMP.py [-h] [-i INFOLDER] [-b BRANCHNAMES] [-t THRESHOLD] [-w WINDOWSIZE] [-m MINSEQLENGTH] [-s]

Or, to print an alignment summary:

python SWAMP.py [-p PHYLIPFILE]

Required arguments:

-i INFOLDER or --infolder INFOLDER Provide the full path to an INFOLDER that contains either a single alignment (see example_dataset/data/44/) or multiple subfolders, each containing an alignment to scan with SWAMP (see example_dataset/data/).

Each folder must contain a multiple sequence alignment file in sequential phylip format (.phy). The first line of the phylip file gives the number of sequences and the length of the sequences in the alignment. Each sequence then has a header on a line by itself, followed by the sequence starting on a new line; the sequence may be split over multiple lines, each usually limited to 60 characters. DNA alignments must be in frame so that codons can be determined, and stop codons should be removed prior to analysis.

Each folder must also contain information about branch-specific substitutions in an rst file, generated using a one-ratio codeml PAML analysis as detailed above. Each subfolder should contain only one alignment .phy file and one rst file.

SWAMP ignores all files ending in "_masked.phy", as this is the suffix SWAMP adds to its output files. This prevents users from unintentionally inputting already masked files. To run SWAMP on already masked files, for example to apply different parameters to different branches, rename the output files to remove the "_masked.phy" ending before the next iteration of SWAMP.
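The phylip layout described above can be sketched as a small reader. This is an independent illustration, not SWAMP's own parser; the function name `read_sequential_phylip` is hypothetical, and it assumes every header sits on its own line and that each sequence fills exactly the length declared on the first line.

```python
def read_sequential_phylip(text):
    """Parse 'header line, then sequence lines' phylip text into {header: sequence}."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    n_seqs, length = (int(x) for x in lines[0].split()[:2])
    if length % 3 != 0:
        raise ValueError("alignment length is not a multiple of 3 (not in frame)")
    records = {}
    pos = 1
    for _ in range(n_seqs):
        header = lines[pos]
        pos += 1
        chunks = []
        # Keep reading lines until the declared sequence length is reached.
        while sum(len(c) for c in chunks) < length:
            chunks.append(lines[pos].replace(" ", ""))
            pos += 1
        seq = "".join(chunks)
        if len(seq) != length:
            raise ValueError("%s: expected %d characters, got %d"
                             % (header, length, len(seq)))
        records[header] = seq
    return records
```

A quick check like this before running codeml can catch out-of-frame alignments or headers that do not match the branch names file.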

-b BRANCHNAMESFILE or --branchnames BRANCHNAMESFILE The full path to a file listing which branches to analyse and which sequences to mask. Each line should name a branch (e.g. "8..9"), followed by a space or tab and then a comma-separated list of the sequence headers that the branch refers to. These must be identical to the sequence headers in the phylip file to be analysed. Internal branches influence more than one sequence, so every sequence to be masked for such a branch must be listed, separated by commas. For example, for the tree ((homo, pongo),(papio, colobus)); or ((2, 4),(3, 1));:

  6..2 homo
  6..4 pongo
  7..3 papio
  7..1 colobus
  5..6 pongo,homo
  5..7 papio,colobus

Also see: example_dataset/branchcodes.txt, example_dataset/branchcodes_nohomo.txt and example_dataset/branchcodes_onlyhuman.txt
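To make the branch file layout concrete, here is a minimal sketch that reads such a file into a dictionary. The function name `parse_branch_file` is hypothetical and this is not SWAMP's own parser; it only assumes the "branch, whitespace, comma-separated headers" layout described above.

```python
def parse_branch_file(text):
    """Map each branch label (e.g. '6..2') to its list of sequence headers."""
    branches = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first run of spaces or tabs: branch label, then headers.
        branch, headers = line.split(None, 1)
        branches[branch] = [h.strip() for h in headers.split(",") if h.strip()]
    return branches


example = "6..2 homo\n6..4 pongo\n7..3 papio\n7..1 colobus\n5..6 pongo,homo\n5..7 papio,colobus\n"
branches = parse_branch_file(example)
```

Here `branches["5..6"]` holds both `pongo` and `homo`, reflecting that an internal branch affects every sequence descending from it.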

-t THRESHOLD or --threshold THRESHOLD A positive integer giving the number of non-synonymous substitutions within a window at or above which the window is masked.

-w WINDOWSIZE or --windowsize WINDOWSIZE An integer window size for the sliding window scan, given in numbers of codons.
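The interaction of the threshold and window size can be sketched as follows. This is a simplified illustration rather than SWAMP's implementation: the function names are hypothetical, substitutions are given as 0-based codon positions for one branch, the whole of any qualifying window is masked, and masked codons are written as "NNN" (an assumption based on the example output in this document).

```python
def windows_to_mask(subst_positions, n_codons, window, threshold):
    """Return sorted codon indices covered by any window with >= threshold substitutions."""
    masked = set()
    for start in range(0, max(n_codons - window, 0) + 1):
        end = min(start + window, n_codons)
        # Count substitutions falling inside this window (repeats count separately).
        count = sum(1 for pos in subst_positions if start <= pos < end)
        if count >= threshold:
            masked.update(range(start, end))
    return sorted(masked)


def mask_sequence(seq, masked_codons):
    """Replace the given codon indices of an in-frame sequence with 'NNN'."""
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    for idx in masked_codons:
        codons[idx] = "NNN"
    return "".join(codons)
```

With substitutions at codons 2 and 3, a window of 3 and a threshold of 2, the two windows that contain both substitutions are masked, covering codons 1 to 4. A larger window or lower threshold widens the masked area, which is why Example 1 below is described as over-aggressive.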

Optional Arguments:

-h Print help.

-m MINSEQLENGTH or --minseqlength MINSEQLENGTH The required minimum number of informative codons in each of the sequences in the multiple sequence alignment post-masking. This is a positive integer. The program will print a warning to the user in the standard output if a masked sequence is shorter than this minimum length. The default is 33 codons (99 base pairs).

-s or --interscan Activates interscan masking. This will additionally mask regions adjacent to already masked regions based on relative sequence length. This additional masking is performed at the start or end of the sequence alignment if the unmasked sequence region length is shorter than twice the length of the preceding or subsequent masked section. Where a sequence contains multiple masked regions, interscan will also mask internal unmasked regions that are shorter than the combined length of their flanking masked regions. This process occurs repeatedly until no more sections that meet the interscan masking criteria are found. Interscan is useful for removing very short stretches of sequence, or sequence at the edge of masked regions, that are possibly unreliable, but that do not themselves meet the masking criteria.
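One possible reading of the interscan rules above is sketched below. It is an interpretation under stated assumptions, not SWAMP's code: the mask is a list of booleans per codon (True = masked), the function name `interscan` is hypothetical, and all qualifying regions found in one pass are masked before the next pass begins.

```python
from itertools import groupby


def interscan(mask):
    """Extend a per-codon mask (True = masked) until no region meets the criteria."""
    mask = list(mask)
    changed = True
    while changed:
        changed = False
        # Run-length encode the mask as (is_masked, start, length) tuples.
        runs = []
        pos = 0
        for is_masked, group in groupby(mask):
            length = len(list(group))
            runs.append((is_masked, pos, length))
            pos += length
        for i, (is_masked, start, length) in enumerate(runs):
            if is_masked or len(runs) == 1:
                continue
            prev_len = runs[i - 1][2] if i > 0 else None
            next_len = runs[i + 1][2] if i < len(runs) - 1 else None
            if prev_len is None:      # unmasked run at the alignment start
                limit = 2 * next_len
            elif next_len is None:    # unmasked run at the alignment end
                limit = 2 * prev_len
            else:                     # internal run between two masked regions
                limit = prev_len + next_len
            if length < limit:
                mask[start:start + length] = [True] * length
                changed = True
    return mask
```

Because newly masked regions lengthen their neighbours, a single pass can make further regions qualify, which is why the loop repeats until nothing changes.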

-p or --print-alignment Prints a summary of the given alignment phylip file detailing the number and percentage of masked codons for each sequence in the alignment. This is useful to assess alignments before and after masking with SWAMP. This option should be run separately without any other SWAMP commands. Example output:

python SWAMP.py --print-alignment example_dataset/data/44/44_masked.phy
                 Alignment length: 1905 codons
          pongo  ACAGATGCACATTATTCCATACTGTCACTTCTTCTGTGTCTGTCAGACTC...   635 codons  171 masked (27%)
        colobus  NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...   635 codons  340 masked (54%)
           homo  ACAGATGCACATTATTCCATACTGTCACTTCTTCTGTGTCTGTCAGACTC...   635 codons  114 masked (18%)
          papio  NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...   635 codons  221 masked (35%)
                 Total masked codons: 846
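The per-sequence figures in a summary like this can be reproduced with a few lines. The sketch below assumes that a masked codon is one written entirely as "NNN" and that percentages are rounded to the nearest integer; both are inferred from the example output rather than taken from SWAMP's source, and `masked_summary` is a hypothetical name.

```python
def masked_summary(seq):
    """Return (total codons, masked codons, percent masked) for one in-frame sequence."""
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    masked = sum(1 for codon in codons if codon.upper() == "NNN")
    percent = int(round(100.0 * masked / len(codons))) if codons else 0
    return len(codons), masked, percent
```

Subtracting the masked count from the total gives a rough number of remaining codons to compare against the -m MINSEQLENGTH setting.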

Example SWAMP runs are provided in the Makefile:

These examples are designed to demonstrate SWAMP's functionality and are described below. To run an example in the SWAMP directory type:

make example1

make example2

etc.

Example 1:

Basic example of running SWAMP on a single alignment, with a large window and a low substitution threshold. These parameters are purposefully over-aggressive to show the filtered alignment. This will mask the file 44.phy using a sliding window of 30 codons and a threshold of 1 non-synonymous change. The branchcodes.txt file details which branches and sequences will be used for masking.

  python SWAMP.py --print-alignment example_dataset/data/44/44.phy

  python SWAMP.py -i example_dataset/data/44/ \
  -b example_dataset/branchcodes.txt -t 1 -w 30

  python SWAMP.py --print-alignment example_dataset/data/44/44_masked.phy

Example 2:

Run SWAMP with more reasonable parameters, summarizing the masked alignment.

  python SWAMP.py -i example_dataset/data/44/ \
  -b example_dataset/branchcodes.txt -t 2 -w 20

  python SWAMP.py --print-alignment example_dataset/data/44/44_masked.phy

Example 3:

Run with interscan masking enabled (--interscan), which repeatedly merges masked regions to increase the filtering stringency.

  python SWAMP.py -i example_dataset/data/44/ \
  -b example_dataset/branchcodes.txt -t 2 -w 20 --interscan

  python SWAMP.py --print-alignment example_dataset/data/44/44_masked.phy

Example 4:

Run with very stringent filtering on just the human sequence.

  python SWAMP.py -i example_dataset/data/44/ \
  -b example_dataset/branchcodes_onlyhuman.txt -t 2 -w 50 --interscan

No codons were masked.

  python SWAMP.py --print-alignment example_dataset/data/44/44_masked.phy

Example 5:

Runs SWAMP on every subfolder of example_dataset/data/. This will mask all of the phylip files in subfolders of example_dataset/data/ using a sliding window of 15 codons and a threshold of 5 non-synonymous changes. It will additionally perform interscan masking. Because the homo sequences are not listed in branchcodes_nohomo.txt, they will not be masked in this example.

  python SWAMP.py -i example_dataset/data/ \
  -b example_dataset/branchcodes_nohomo.txt -t 5 -w 15 --interscan
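Before a batch run like this, it can help to check that each subfolder has the expected inputs. The sketch below is a hypothetical pre-flight check, not part of SWAMP: it lists folders that hold exactly one .phy file not ending in "_masked.phy" together with a file named rst, following the folder rules described for -i.

```python
import os


def find_swamp_inputs(infolder):
    """Return sorted (folder, phy_path, rst_path) tuples for folders ready to scan."""
    results = []
    for root, _dirs, files in os.walk(infolder):
        # Skip earlier SWAMP output, which carries the "_masked.phy" suffix.
        phys = [f for f in files
                if f.endswith(".phy") and not f.endswith("_masked.phy")]
        if len(phys) == 1 and "rst" in files:
            results.append((root,
                            os.path.join(root, phys[0]),
                            os.path.join(root, "rst")))
    return sorted(results)
```

Folders missing from the returned list either lack an rst file or contain more than one candidate alignment, and are worth inspecting by hand.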

Contact:

peter [at] ebi.ac.uk

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.
