Phyla_AMPHORA User Manual

A Phylum-specific Automated Phylogenomic Inference Pipeline for Bacterial Sequences. 

COPYRIGHT 
2012 by Martin Wu

Phyla_AMPHORA is free software: you may redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or any later version.

Phyla_AMPHORA is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more details (http://www.gnu.org/licenses/).

For any other inquiries send an Email to Martin Wu: [email protected]

CITATION
When publishing work that is based on the results from Phyla_AMPHORA, please cite:
Wang Z and Wu M: A Phylum-level Bacterial Phylogenetic Marker Database. Mol. Biol. Evol. Advance Access publication March 21, 2013. doi:10.1093/molbev/mst059

 
DEPENDENCY
Phyla_AMPHORA depends on several external programs.

1. HMMER3 (http://hmmer.janelia.org/). Required for marker identification, sequence alignment and trimming. Earlier versions of HMMER will not work!
2. RAxML version 7.3.0 or later (https://github.com/stamatak/standard-RAxML/downloads). Required for phylotyping.
3. Bioperl 1.5.2 or later (http://www.bioperl.org/wiki/Getting_BioPerl).
4. EMBOSS (http://emboss.sourceforge.net/download/). The 'getorf' program of the EMBOSS package is required only if you analyze DNA sequences using Phyla_AMPHORA.

Make sure that these programs are installed and are in your system's executable search path. To test, type the following in a terminal:

	raxmlHPC -version
	raxmlHPC-PTHREADS -version
	hmmsearch -h
	hmmalign -h
	getorf -help

If you see version or help messages, then these programs have been correctly installed. It is important to make sure they are the correct versions. 
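
The same check can be scripted. The following is a minimal sketch, not part of Phyla_AMPHORA: the function name check_deps is hypothetical, and it only tests whether each program is found in the executable search path. It does not verify versions, so those still have to be checked by hand.

```shell
#!/bin/sh
# Report which of the given programs are missing from the executable search path.
# Prints one "missing: NAME" line per absent program and returns 1 if any is missing.
# check_deps is a hypothetical helper name, not a Phyla_AMPHORA script.
check_deps() {
    status=0
    for prog in "$@"; do
        if ! command -v "$prog" >/dev/null 2>&1; then
            echo "missing: $prog"
            status=1
        fi
    done
    return $status
}

# Program names taken from the test commands listed above.
check_deps raxmlHPC raxmlHPC-PTHREADS hmmsearch hmmalign getorf || \
    echo "Install the missing programs or add them to your PATH."
```

If getorf is reported missing and you only analyze protein sequences, that is expected to be harmless, since getorf is required only for DNA input.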

A script named 'preinstall.pl' is also included with Phyla_AMPHORA to check and install the dependencies automatically. You need system administrator privileges to run the script. See below for instructions.
 	

INSTALLATION

1.Download Phyla_AMPHORA
2.Unpack Phyla_AMPHORA 
	tar -zxvf Phyla_AMPHORA.tar.gz
3.Install dependencies if they have not been installed
	cd Phyla_AMPHORA
	sudo perl preinstall.pl
4.Setup Phyla_AMPHORA. 
You need to set up the environment variable 'Phyla_AMPHORA_home' so the Phyla_AMPHORA scripts know where to look for the phylogenetic marker database and the NCBI taxonomy information. Let's suppose your unpacked Phyla_AMPHORA folder is at /home/foo/Phyla_AMPHORA. 
		
If you are using a bash shell, you can add the following line to the end of the file ~/.bashrc
	export Phyla_AMPHORA_home=/home/foo/Phyla_AMPHORA
Then in the terminal, issue this command
	source ~/.bashrc
		
If you are using a C shell (tcsh), you can add the following line to the end of the file ~/.tcshrc
	setenv Phyla_AMPHORA_home /home/foo/Phyla_AMPHORA
Then in the terminal, issue this command
	source ~/.tcshrc

5.Make the Phyla_AMPHORA scripts executable.   
	chmod +x /home/foo/Phyla_AMPHORA/Scripts/*

You should see five folders.

1. Marker 
This folder contains a seed alignment file in Stockholm format (*.stock), an alignment mask file (*.mask), a profile HMM file (*HMM) and a tree file in newick format (*.tre) for each marker gene.

For more information about the phylogenetic markers that are included in Phyla_AMPHORA, see the marker.list file in the Marker folder.

IMPORTANT: Because the Marker folder exceeds 1 GB (the size limit of GitHub), it is not included in the GitHub package. If you download Phyla_AMPHORA from GitHub, you should download the Marker database from http://www.wulabuva.org/software.html and move it here.

2. Scripts
This folder contains the scripts for marker identification, alignment, trimming and phylotyping.

3. Taxonomy
This folder contains the NCBI taxonomy database that is used by the Phylotyping.pl script for phylotyping.

4. Tree
This folder contains the genome trees for 20 bacterial phyla in newick format. The genome trees are RAxML maximum likelihood trees made from concatenated protein sequences of the phylum-specific markers.

5. TestData
This folder contains the E. coli genome assembly (ecoli.fasta) and proteome sequences (ecoli.pep) for testing Phyla_AMPHORA.
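
The folder layout described above can be checked with a short script. This is a sketch under stated assumptions: check_layout is a hypothetical helper name, and the only thing it tests is that the five folders exist under the directory given by the 'Phyla_AMPHORA_home' environment variable. It does not inspect their contents.

```shell
#!/bin/sh
# List the expected top-level folders that are absent from an installation directory.
# Usage: check_layout /path/to/Phyla_AMPHORA
# check_layout is a hypothetical helper name, not a Phyla_AMPHORA script.
check_layout() {
    home_dir=$1
    missing=0
    for folder in Marker Scripts Taxonomy Tree TestData; do
        if [ ! -d "$home_dir/$folder" ]; then
            echo "missing folder: $folder"
            missing=1
        fi
    done
    return $missing
}

# Only run the check when the environment variable has been set.
if [ -n "${Phyla_AMPHORA_home:-}" ]; then
    check_layout "$Phyla_AMPHORA_home" || \
        echo "Check the installation (the Marker folder is a separate download for GitHub users)."
fi
```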

 
RUNNING Phyla_AMPHORA

We recommend that you allocate at least 4GB of memory to Phyla_AMPHORA.

1. Marker identification
Use MarkerScanner.pl to identify phylum-specific bacterial marker sequences. Given a sequence file, this program will identify markers from the input sequences and generate a protein fasta file for each marker gene in your working directory (for example, Acidobacteria.102.pep and Aquificae.33.pep). When DNA sequences are used as input, this program first identifies ORFs longer than 100 bp in all six reading frames, then scans the translated peptide sequences for the phylogenetic markers.

Usage: perl MarkerScanner.pl <options> sequence-file

Options:
	-Phylum:0. All (Default)
		1. Alphaproteobacteria
		2. Betaproteobacteria
		3. Gammaproteobacteria
		4. Deltaproteobacteria
		5. Epsilonproteobacteria
		6. Acidobacteria
		7. Actinobacteria
		8. Aquificae
		9. Bacteroidetes
		10. Chlamydiae/Verrucomicrobia
		11. Chlorobi
		12. Chloroflexi
		13. Cyanobacteria
		14. Deinococcus/Thermus
		15. Firmicutes
		16. Fusobacteria
		17. Planctomycetes
		18. Spirochaetes
		19. Tenericutes
		20. Thermotogae
	-DNA: input sequences are DNA. Default: no
	-Evalue: HMMER evalue cutoff. Default: 1e-7 
	-ReferenceDirectory: the file directory that contains the reference alignments, hmms and masks.
	-Help: print help;

Examples:
1a. Identify phylogenetic markers from the E. coli proteome

	perl MarkerScanner.pl -Phylum 3 TestData/ecoli.pep 

1b. Identify phylogenetic markers from the E. coli genome assembly

	perl MarkerScanner.pl -Phylum 3 -DNA TestData/ecoli.fasta

If Phyla_AMPHORA has been installed correctly, at the end of the run in example 1a or 1b, you should see 294 marker protein sequences (*.pep) in your working directory.
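
Counting the output files by hand is tedious, so the check can be sketched as a small shell function. The name count_markers is hypothetical, and the file name pattern is an assumption based on the example names in this manual (Phylum.number.pep, such as Acidobacteria.102.pep); the pattern is chosen so that an input file like ecoli.pep in the same directory is not counted.

```shell
#!/bin/sh
# Count the marker protein files in a directory (default: current directory).
# count_markers is a hypothetical helper name. The pattern assumes marker files
# are named NAME.NUMBER.pep, as in the examples shown in this manual.
count_markers() {
    dir=${1:-.}
    # find prints one line per match, wc -l counts the lines,
    # and tr strips the padding that some wc implementations add.
    find "$dir" -maxdepth 1 -type f -name '*.[0-9]*.pep' | wc -l | tr -d ' '
}

n=$(count_markers .)
if [ "$n" -eq 294 ]; then
    echo "Found all 294 expected marker files."
else
    echo "Found $n marker files (294 expected for examples 1a and 1b)."
fi
```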

1c. Identify phylogenetic markers of all 20 phyla from a set of metagenomic sequence reads (e.g., 454 reads)

	perl MarkerScanner.pl -DNA -Phylum 0 metagenomic.fasta 

2. Marker sequence alignment and trimming
This program will align, mask and trim the marker protein sequences. The output is a set of aligned/trimmed sequences (for example, Acidobacteria.102.aln and Aquificae.33.aln) and their corresponding alignment masks. The alignment masks can be used to weight the alignment columns with RAxML's -a option (for untrimmed alignments only).

Usage:	perl  MarkerAlignTrim.pl <options>

Options:
	-Trim:	trim the alignment using masks embedded with the marker database. Default: no
	-Cutoff: the Zorro masking confidence cutoff value (0 - 1.0; default: 0.4);
	-ReferenceDirectory: the file directory that contains the reference alignments, hmms and masks.
	-Directory: the file directory where sequences to be aligned are located. Default: current directory
	-OutputFormat:  output alignment format. Default: phylip. Other supported formats include: fasta, stockholm, selex, clustal
	-WithReference: keep the reference sequences in the alignment. Default: no
	-Help:	print help 

Example:

	perl MarkerAlignTrim.pl -WithReference -OutputFormat phylip

If Phyla_AMPHORA has been installed correctly, at the end of the run, you should see an alignment file (*.aln) and a mask file (*.mask) for each marker gene in your working directory.

Note that for the Phylotyping.pl script to run properly, MarkerAlignTrim.pl must be run with the '-WithReference -OutputFormat phylip' options.

3. Phylotyping
Use Phylotyping.pl to assign a phylotype to each identified marker sequence, using either the parsimony method or the evolutionary placement algorithm of RAxML. The marker sequences need to be aligned first with the reference sequences using MarkerAlignTrim.pl (see above). The alignments should be in the phylip format.

Usage:	perl Phylotyping.pl <options>

Options:
	-Method: use 'maximum likelihood' (ml) or 'maximum parsimony' (mp) for phylotyping. Default: ml
	-CPUs: turn on the multiple thread option and specify the number of CPUs/cores to use. Important: Make sure raxmlHPC-PTHREADS is installed. If the number specified here is larger than the number of cores that are free and available, it will actually slow down the script.
	-Help: print help;  

If your computer has multiple CPUs/cores, the phylotyping process can be sped up by running RAxML with multiple threads. However, it is very important to check how many CPUs/cores are free and available to Phylotyping.pl. If you specify a number that is larger than the number of CPUs/cores that are actually available, it will slow down the script. For example, suppose your computer has 8 CPUs but 2 of them are used by other processes. In this case, you can run Phylotyping.pl on 6 CPUs by using the '-CPUs 6' option. Of course, raxmlHPC-PTHREADS needs to be installed.
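
The arithmetic in this example can be written as a small helper. This is only an illustration: suggest_cpus is a hypothetical name, both counts are supplied by you, and the lower bound of 1 is an assumption, since this manual does not say how Phylotyping.pl behaves with '-CPUs 1'.

```shell
#!/bin/sh
# Suggest a value for -CPUs: total cores minus cores already in use.
# Usage: suggest_cpus TOTAL BUSY
# suggest_cpus is a hypothetical helper; the floor of 1 is an assumption.
suggest_cpus() {
    total=$1
    busy=$2
    free=$((total - busy))
    if [ "$free" -lt 1 ]; then
        free=1
    fi
    echo "$free"
}

# The example from the text: 8 CPUs, 2 of them busy. Prints 6.
suggest_cpus 8 2
```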

Example: 
Assign phylotypes using the maximum likelihood method

	perl Phylotyping.pl -CPUs 6 > phylotype.result

Again, if Phyla_AMPHORA has been installed correctly, you should see something like this as the output:

Query   Marker  Superkingdom    Phylum  Class   Order   Family  Genus   Species
NP_414730-NC_000913     Gamma.134       Bacteria(1.00)  Proteobacteria(1.00)    Gammaproteobacteria(1.00)       Enterobacteriales(1.00) Enterobacteriaceae(1.00)        Escherichia(1.00)       Escherichia coli(1.00)
NP_417099-NC_000913     Gamma.167       Bacteria(0.96)  Proteobacteria(0.96)    Gammaproteobacteria(0.96)       Enterobacteriales(0.96) Enterobacteriaceae(0.96)        Escherichia(0.70)       Escherichia coli(0.70)
NP_416616-NC_000913     Gamma.252       Bacteria(0.96)  Proteobacteria(0.96)    Gammaproteobacteria(0.96)       Enterobacteriales(0.96) Enterobacteriaceae(0.96)        Escherichia(0.74)       Escherichia coli(0.74)
NP_418155-NC_000913     Gamma.286       Bacteria(0.96)  Proteobacteria(0.96)    Gammaproteobacteria(0.96)       Enterobacteriales(0.96) Enterobacteriaceae(0.96)        Escherichia(0.84)       Escherichia coli(0.84)
NP_417422-NC_000913     Gamma.296       Bacteria(0.96)  Proteobacteria(0.96)    Gammaproteobacteria(0.96)       Enterobacteriales(0.96) Enterobacteriaceae(0.96)        Escherichia(0.76)       Escherichia coli(0.76)
NP_417226-NC_000913     Gamma.306       Bacteria(0.97)  Proteobacteria(0.97)    Gammaproteobacteria(0.97)       Enterobacteriales(0.97) Enterobacteriaceae(0.97)        Escherichia(0.61)       Escherichia coli(0.61)
NP_417800-NC_000913     Gamma.44        Bacteria(0.95)  Proteobacteria(0.95)    Gammaproteobacteria(0.95)       Enterobacteriales(0.95) Enterobacteriaceae(0.95)        Escherichia(0.95)       Escherichia coli(0.95)


The phylotyping results are tab-delimited. The numbers within the parentheses are the confidence scores of the assignment.
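
Because the results are tab-delimited with the confidence score in parentheses, they can be filtered with standard tools. The sketch below keeps only rows whose genus-level confidence reaches a threshold. The function name filter_genus and the 0.80 threshold are illustrative assumptions, and the column position (Genus is column 8) is taken from the header line shown above.

```shell
#!/bin/sh
# Print query, marker, genus and confidence for rows whose genus-level
# confidence is at least MIN. Rows without a "(score)" suffix in the Genus
# column, such as the header line, are skipped.
# Usage: filter_genus MIN < phylotype.result
# filter_genus is a hypothetical helper name, not a Phyla_AMPHORA script.
filter_genus() {
    awk -F '\t' -v min="$1" '
        {
            genus = $8                       # column 8 is Genus in the header
            if (match(genus, /\([0-9.]+\)$/)) {
                conf = substr(genus, RSTART + 1, RLENGTH - 2)
                name = substr(genus, 1, RSTART - 1)
                if (conf + 0 >= min + 0)
                    print $1 "\t" $2 "\t" name "\t" conf
            }
        }'
}

# Demonstration on two rows copied from the sample output above.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    Query Marker Superkingdom Phylum Class Order Family Genus Species \
    NP_414730-NC_000913 Gamma.134 'Bacteria(1.00)' 'Proteobacteria(1.00)' \
    'Gammaproteobacteria(1.00)' 'Enterobacteriales(1.00)' \
    'Enterobacteriaceae(1.00)' 'Escherichia(1.00)' 'Escherichia coli(1.00)' \
    NP_417099-NC_000913 Gamma.167 'Bacteria(0.96)' 'Proteobacteria(0.96)' \
    'Gammaproteobacteria(0.96)' 'Enterobacteriales(0.96)' \
    'Enterobacteriaceae(0.96)' 'Escherichia(0.70)' 'Escherichia coli(0.70)' \
    | filter_genus 0.80
```

On a real run, the same function would be applied as 'filter_genus 0.80 < phylotype.result'.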

phyla_amphora's Issues

Errors in phylotyping step

Hi,

I installed phyla_AMPHORA and its dependencies as described in the documentation. Then I tried to run the example from the testData (ecoli.fasta).

I was running the following 3 commands as described in the documentation:

MarkerScanner.pl -Phylum 3 -DNA ecoli.fasta
MarkerAlignTrim.pl -WithReference -OutputFormat phylip
Phylotyping.pl -CPUs 24 > phylotype.result

The first command worked as described and produced the .pep files.

[samapps@e1001 test2]$ export OMP_NUM_THREADS=24
[samapps@e1001 test2]$ MarkerScanner.pl -Phylum 3 -DNA ecoli.fasta  
Find and extract open reading frames (ORFs)
[samapps@e1001 test2]$
[samapps@euler04 test2]$ ls
ecoli.fasta      Gamma.139.pep  Gamma.175.pep  Gamma.215.pep  Gamma.24.pep   Gamma.285.pep  Gamma.327.pep  Gamma.36.pep  Gamma.68.pep
ecoli.fasta.orf  Gamma.140.pep  Gamma.176.pep  Gamma.217.pep  Gamma.250.pep  Gamma.286.pep  Gamma.328.pep  Gamma.37.pep  Gamma.69.pep
Gamma.100.pep    Gamma.141.pep  Gamma.177.pep  Gamma.218.pep  Gamma.252.pep  Gamma.287.pep  Gamma.32.pep   Gamma.38.pep  Gamma.70.pep
Gamma.101.pep    Gamma.142.pep  Gamma.178.pep  Gamma.219.pep  Gamma.253.pep  Gamma.288.pep  Gamma.332.pep  Gamma.3.pep   Gamma.71.pep
Gamma.103.pep    Gamma.143.pep  Gamma.179.pep  Gamma.220.pep  Gamma.254.pep  Gamma.289.pep  Gamma.333.pep  Gamma.40.pep  Gamma.72.pep
Gamma.104.pep    Gamma.144.pep  Gamma.17.pep   Gamma.221.pep  Gamma.255.pep  Gamma.28.pep   Gamma.334.pep  Gamma.41.pep  Gamma.73.pep
Gamma.105.pep    Gamma.145.pep  Gamma.180.pep  Gamma.223.pep  Gamma.256.pep  Gamma.291.pep  Gamma.335.pep  Gamma.42.pep  Gamma.74.pep
Gamma.106.pep    Gamma.146.pep  Gamma.182.pep  Gamma.224.pep  Gamma.257.pep  Gamma.294.pep  Gamma.336.pep  Gamma.43.pep  Gamma.75.pep
Gamma.107.pep    Gamma.147.pep  Gamma.183.pep  Gamma.225.pep  Gamma.258.pep  Gamma.295.pep  Gamma.337.pep  Gamma.44.pep  Gamma.76.pep
Gamma.108.pep    Gamma.148.pep  Gamma.184.pep  Gamma.226.pep  Gamma.259.pep  Gamma.296.pep  Gamma.339.pep  Gamma.45.pep  Gamma.77.pep
Gamma.109.pep    Gamma.149.pep  Gamma.185.pep  Gamma.227.pep  Gamma.25.pep   Gamma.297.pep  Gamma.33.pep   Gamma.46.pep  Gamma.78.pep
Gamma.10.pep     Gamma.14.pep   Gamma.186.pep  Gamma.228.pep  Gamma.260.pep  Gamma.298.pep  Gamma.340.pep  Gamma.47.pep  Gamma.79.pep
Gamma.111.pep    Gamma.150.pep  Gamma.187.pep  Gamma.229.pep  Gamma.261.pep  Gamma.299.pep  Gamma.343.pep  Gamma.48.pep  Gamma.7.pep
Gamma.112.pep    Gamma.151.pep  Gamma.188.pep  Gamma.22.pep   Gamma.263.pep  Gamma.300.pep  Gamma.344.pep  Gamma.49.pep  Gamma.80.pep
Gamma.115.pep    Gamma.153.pep  Gamma.189.pep  Gamma.230.pep  Gamma.264.pep  Gamma.302.pep  Gamma.347.pep  Gamma.4.pep   Gamma.81.pep
Gamma.116.pep    Gamma.154.pep  Gamma.18.pep   Gamma.231.pep  Gamma.265.pep  Gamma.304.pep  Gamma.348.pep  Gamma.50.pep  Gamma.82.pep
Gamma.117.pep    Gamma.156.pep  Gamma.190.pep  Gamma.232.pep  Gamma.266.pep  Gamma.305.pep  Gamma.349.pep  Gamma.52.pep  Gamma.84.pep
Gamma.118.pep    Gamma.157.pep  Gamma.192.pep  Gamma.233.pep  Gamma.268.pep  Gamma.306.pep  Gamma.350.pep  Gamma.53.pep  Gamma.86.pep
Gamma.119.pep    Gamma.158.pep  Gamma.195.pep  Gamma.234.pep  Gamma.26.pep   Gamma.307.pep  Gamma.352.pep  Gamma.54.pep  Gamma.87.pep
Gamma.11.pep     Gamma.159.pep  Gamma.196.pep  Gamma.235.pep  Gamma.270.pep  Gamma.308.pep  Gamma.353.pep  Gamma.55.pep  Gamma.88.pep
Gamma.122.pep    Gamma.15.pep   Gamma.197.pep  Gamma.236.pep  Gamma.271.pep  Gamma.310.pep  Gamma.354.pep  Gamma.56.pep  Gamma.89.pep
Gamma.123.pep    Gamma.160.pep  Gamma.198.pep  Gamma.237.pep  Gamma.272.pep  Gamma.311.pep  Gamma.355.pep  Gamma.57.pep  Gamma.8.pep
Gamma.124.pep    Gamma.161.pep  Gamma.199.pep  Gamma.238.pep  Gamma.274.pep  Gamma.312.pep  Gamma.356.pep  Gamma.58.pep  Gamma.90.pep
Gamma.125.pep    Gamma.162.pep  Gamma.1.pep    Gamma.239.pep  Gamma.275.pep  Gamma.314.pep  Gamma.357.pep  Gamma.59.pep  Gamma.91.pep
Gamma.126.pep    Gamma.163.pep  Gamma.200.pep  Gamma.23.pep   Gamma.277.pep  Gamma.316.pep  Gamma.358.pep  Gamma.5.pep   Gamma.92.pep
Gamma.128.pep    Gamma.164.pep  Gamma.201.pep  Gamma.241.pep  Gamma.278.pep  Gamma.317.pep  Gamma.359.pep  Gamma.60.pep  Gamma.93.pep
Gamma.130.pep    Gamma.166.pep  Gamma.202.pep  Gamma.242.pep  Gamma.279.pep  Gamma.318.pep  Gamma.35.pep   Gamma.61.pep  Gamma.95.pep
Gamma.131.pep    Gamma.167.pep  Gamma.204.pep  Gamma.243.pep  Gamma.27.pep   Gamma.319.pep  Gamma.360.pep  Gamma.62.pep  Gamma.96.pep
Gamma.132.pep    Gamma.168.pep  Gamma.20.pep   Gamma.244.pep  Gamma.280.pep  Gamma.320.pep  Gamma.361.pep  Gamma.63.pep  Gamma.97.pep
Gamma.133.pep    Gamma.169.pep  Gamma.210.pep  Gamma.245.pep  Gamma.281.pep  Gamma.322.pep  Gamma.362.pep  Gamma.64.pep  Gamma.98.pep
Gamma.134.pep    Gamma.170.pep  Gamma.212.pep  Gamma.246.pep  Gamma.282.pep  Gamma.323.pep  Gamma.363.pep  Gamma.65.pep  Gamma.99.pep
Gamma.137.pep    Gamma.172.pep  Gamma.213.pep  Gamma.248.pep  Gamma.283.pep  Gamma.324.pep  Gamma.365.pep  Gamma.66.pep  Gamma.9.pep
Gamma.138.pep    Gamma.174.pep  Gamma.214.pep  Gamma.249.pep  Gamma.284.pep  Gamma.325.pep  Gamma.366.pep  Gamma.67.pep
[samapps@euler04 test2]$

The second command was also running fine and I got the .aln and .mask files:

[samapps@e1001 test2]$ MarkerAlignTrim.pl -WithReference -OutputFormat phylip                       
Aligning Gamma.280 ...

Aligning Gamma.89 ...

Aligning Gamma.142 ...

Aligning Gamma.164 ...

Aligning Gamma.163 ...

Aligning Gamma.355 ...

Aligning Gamma.22 ...

Aligning Gamma.217 ...

Aligning Gamma.82 ...

Aligning Gamma.286 ...

Aligning Gamma.202 ...

Aligning Gamma.88 ...
[samapps@euler04 test2]$ ls
ecoli.fasta      Gamma.144.mask  Gamma.184.mask  Gamma.22.mask   Gamma.268.mask  Gamma.311.mask  Gamma.359.mask  Gamma.64.mask
ecoli.fasta.orf  Gamma.144.pep   Gamma.184.pep   Gamma.22.pep    Gamma.268.pep   Gamma.311.pep   Gamma.359.pep   Gamma.64.pep
Gamma.100.aln    Gamma.145.aln   Gamma.185.aln   Gamma.230.aln   Gamma.26.aln    Gamma.312.aln   Gamma.35.aln    Gamma.65.aln
Gamma.100.mask   Gamma.145.mask  Gamma.185.mask  Gamma.230.mask  Gamma.26.mask   Gamma.312.mask  Gamma.35.mask   Gamma.65.mask
Gamma.100.pep    Gamma.145.pep   Gamma.185.pep   Gamma.230.pep   Gamma.26.pep    Gamma.312.pep   Gamma.35.pep    Gamma.65.pep
Gamma.101.aln    Gamma.146.aln   Gamma.186.aln   Gamma.231.aln   Gamma.270.aln   Gamma.314.aln   Gamma.360.aln   Gamma.66.aln
Gamma.101.mask   Gamma.146.mask  Gamma.186.mask  Gamma.231.mask  Gamma.270.mask  Gamma.314.mask  Gamma.360.mask  Gamma.66.mask
Gamma.101.pep    Gamma.146.pep   Gamma.186.pep   Gamma.231.pep   Gamma.270.pep   Gamma.314.pep   Gamma.360.pep   Gamma.66.pep
Gamma.103.aln    Gamma.147.aln   Gamma.187.aln   Gamma.232.aln   Gamma.271.aln   Gamma.316.aln   Gamma.361.aln   Gamma.67.aln

(truncated output)

But when running the third command, I only receive error messages:

[samapps@e1001 test2]$ Phylotyping.pl -CPUs 24 > phylotype.result 
Error occured when assigning phylotype for Gamma.7
Error occured when assigning phylotype for Gamma.294
Error occured when assigning phylotype for Gamma.103
Error occured when assigning phylotype for Gamma.68
Error occured when assigning phylotype for Gamma.133
Error occured when assigning phylotype for Gamma.124

(truncated output)

and at the end

cat: *.phylotype: No such file or directory

Do you have any idea what could be causing the errors during the Phylotyping.pl run?
