Phylogenetic orthology inference for comparative genomics

Home Page: https://davidemms.github.io/

License: GNU General Public License v3.0

Interested in a single gene? Try SHOOT.bio, the phylogenetic search engine: https://SHOOT.bio

SHOOT.bio searches your query sequence against a database of gene families and instantly provides you with a phylogenetic tree with your query sequence grafted into it.

Or, if you want to run an orthology analysis for all genes in multiple species then keep reading about OrthoFinder.

In addition to this README there is a set of OrthoFinder tutorials here: https://davidemms.github.io/

Downloading and running OrthoFinder
Running an example OrthoFinder analysis
Exploring OrthoFinder's results
OrthoFinder best practices

OrthoFinder: phylogenetic orthology inference for comparative genomics

Figure 1: Automatic OrthoFinder analysis

What does OrthoFinder do?

OrthoFinder is a fast, accurate and comprehensive platform for comparative genomics. It finds orthogroups and orthologs, infers rooted gene trees for all orthogroups and identifies all of the gene duplication events in those gene trees. It also infers a rooted species tree for the species being analysed and maps the gene duplication events from the gene trees to branches in the species tree. OrthoFinder also provides comprehensive statistics for comparative genomic analyses. OrthoFinder is simple to use and all you need to run it is a set of protein sequence files (one per species) in FASTA format.

For more details see the OrthoFinder papers below.

Emms, D.M. and Kelly, S. (2019) OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biology 20:238

Emms, D.M. and Kelly, S. (2015) OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biology 16:157

You can see an overview talk for OrthoFinder from the 2020 Virtual symposium on Phylogenomics and Comparative Genomics here:

Thanks to Rosa Fernández & Jesus Lozano-Fernandez for organising this excellent conference.

Getting started with OrthoFinder
- Installing OrthoFinder on Linux
- Installing OrthoFinder on Mac & Windows
Running OrthoFinder
OrthoFinder Results Files
Understanding Orthology
- Orthogroups, Orthologs & Paralogs
- Why Orthogroups
Trees from MSA: "-M msa"
Advanced usage
Methods
- Species Tree Inference
Command line options

Getting started with OrthoFinder

You can find a step-by-step tutorial here: Downloading and checking OrthoFinder. It includes instructions for Mac, for which Bioconda is recommended, and for Windows, for which the Windows Subsystem for Linux is recommended. There are also tutorials on that site that guide you through running your first analysis and exploring the results files.

Installing OrthoFinder on Linux

You can install OrthoFinder using Bioconda or download it directly from GitHub. These are the instructions for direct download; see the tutorials for other methods.

Download the latest release from github: https://github.com/davidemms/OrthoFinder/releases
- If you have python installed and the numpy and scipy libraries then download OrthoFinder_source.tar.gz.
- If not then download the larger bundled package, OrthoFinder.tar.gz.
In a terminal, 'cd' to where you downloaded the package
Extract the files: tar xzf OrthoFinder_source.tar.gz or tar xzf OrthoFinder.tar.gz
Test you can run OrthoFinder: python OrthoFinder_source/orthofinder.py -h or ./OrthoFinder/orthofinder -h. OrthoFinder should print its 'help' text.
That's it! You can now run OrthoFinder on a directory of protein sequence fasta files: e.g. ./OrthoFinder/orthofinder -f ./OrthoFinder/ExampleData/

If you want to move the orthofinder executable to another location then you must also place the accompanying config.json file and bin/ directory in the same directory as the orthofinder executable.

OrthoFinder is written in python, but the bundled version does not require python to be installed on your computer. Both versions contain the programs OrthoFinder needs in order to run (in bin/), and OrthoFinder will use these copies in preference to any of the same programs in your system path. You can delete the individual executables if you would prefer it not to do this.

Installing OrthoFinder on Mac & Windows

The easiest way to install OrthoFinder on Mac is using Bioconda:

via bioconda: conda install orthofinder

The easiest way to run OrthoFinder on Windows is using the Windows Subsystem for Linux or Docker: davidemms/orthofinder:

docker pull davidemms/orthofinder
docker run -it --rm davidemms/orthofinder orthofinder -h
docker run --ulimit nofile=1000000:1000000 -it --rm -v /full/path/to/fastas:/input:Z davidemms/orthofinder orthofinder -f /input

A more complete guide can be found here: https://davidemms.github.io/orthofinder_tutorials/alternative-ways-of-getting-OrthoFinder.html. Note that running OrthoFinder on Windows in a docker container will not be as fast as running it natively.

Running OrthoFinder

To run OrthoFinder on the Example Data type:

OrthoFinder/orthofinder -f OrthoFinder/ExampleData

To run on your own dataset, replace "OrthoFinder/ExampleData" with the directory containing your input fasta files, with one file per species. OrthoFinder will look for input fasta files with any of the following filename extensions:

.fa
.faa
.fasta
.fas
.pep
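
Before starting a run it can be useful to check which files in a directory would be recognised as input. The following is a minimal sketch that uses only the extensions listed above. The helper name `find_input_fastas` is hypothetical and not part of OrthoFinder, and the sketch matches extensions exactly as written (case handling in OrthoFinder itself is not covered here).

```python
from pathlib import Path

# Extensions listed in this README as recognised input FASTA extensions.
FASTA_EXTENSIONS = {".fa", ".faa", ".fasta", ".fas", ".pep"}


def find_input_fastas(directory):
    """Return sorted file names in `directory` ending in a listed extension.

    Hypothetical helper for illustration; not part of OrthoFinder.
    """
    return sorted(
        p.name
        for p in Path(directory).iterdir()
        if p.is_file() and p.suffix in FASTA_EXTENSIONS
    )
```

Calling `find_input_fastas("OrthoFinder/ExampleData")` should list one file per species; anything missing from the list probably has an extension that is not in the set above.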

OrthoFinder Results Files

There is a tutorial that provides a guided tour of the main results files here: https://davidemms.github.io/orthofinder_tutorials/exploring-orthofinders-results.html

A standard OrthoFinder run produces a set of files describing the orthogroups, orthologs, gene trees, resolved gene trees, the rooted species tree, gene duplication events and comparative genomic statistics for the set of species being analysed. These files are located in an intuitive directory structure.

Phylogenetic Hierarchical Orthogroups Directory

From version 2.4.0 onwards OrthoFinder infers HOGs, orthogroups at each hierarchical level (i.e. at each node in the species tree) by analysing the rooted gene trees. This is a far more accurate orthogroup inference method than the gene similarity/graph based approach used by all other methods and used previously by OrthoFinder (the deprecated Orthogroups/Orthogroups.tsv file). According to the Orthobench benchmarks, these new orthogroups are 12% more accurate than the OrthoFinder 2 orthogroups (Orthogroups/Orthogroups.tsv). The accuracy can be increased still further (20% more accurate on Orthobench) by including outgroup species, which help with the interpretation of the rooted gene trees.

It is important to ensure that the species tree OrthoFinder is using is accurate so as to maximise the accuracy of the HOGs. To reanalyse with a different species tree use the options -ft PREVIOUS_RESULTS_DIR -s SPECIES_TREE_FILE. This runs just the final analysis steps "from trees" and is relatively quick. If outgroup species are used, refer to "Species_Tree/SpeciesTree_rooted_node_labels.txt" to determine which N?.tsv file contains the orthogroups you require.

N0.tsv is a tab separated text file. Each row contains the genes belonging to a single orthogroup. The genes from each orthogroup are organized into columns, one per species. Additional columns give the HOG (Hierarchical Orthogroup) ID and the node in the gene tree from which the HOG was determined (note, this can be above the root of the clade containing the genes). This file effectively replaces the orthogroups in Orthogroups/Orthogroups.tsv from Markov clustering using MCL.
N1.tsv, N2.tsv, ...: Orthogroups inferred from the gene trees corresponding to the clades of species in the species tree N1, N2, etc. Because OrthoFinder now infers orthogroups at every hierarchical level within the species tree, it is now possible to include outgroup species within the analysis and then use the HOG files to get the orthogroups defined for your chosen clade within the species tree.
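
As an illustration of the row and column layout described above, here is a minimal parsing sketch. It rests on several assumptions that you should check against the header of your own file: that the first row is a header, that the first column holds the HOG ID, that the first `n_meta_cols` columns (3 by default here) are metadata rather than species, and that genes within a cell are comma-separated. The function name and the example data are hypothetical.

```python
import csv
import io


def read_hog_table(handle, n_meta_cols=3):
    """Parse an N?.tsv-style table into {row_id: {species: [genes]}}.

    Assumptions (check against your own file): the first row is a header,
    the first column is the HOG ID, the first `n_meta_cols` columns are
    metadata, every later column is one species, and genes in a cell are
    comma-separated.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    species = header[n_meta_cols:]
    table = {}
    for row in reader:
        if not row:
            continue
        cells = row[n_meta_cols:]
        # Pad short rows so that trailing empty species columns are kept.
        cells += [""] * (len(species) - len(cells))
        table[row[0]] = {
            sp: [g.strip() for g in cell.split(",") if g.strip()]
            for sp, cell in zip(species, cells)
        }
    return table


# Made-up two-species example, not real OrthoFinder output.
example = io.StringIO(
    "HOG\tOG\tGene Tree Parent Clade\tHuman\tMouse\n"
    "N0.HOG0000000\tOG0000000\tn0\tHuA, HuB\tMoA\n"
)
hogs = read_hog_table(example)
```

For a real file, pass an open file handle instead of the `io.StringIO` object, for example `open("Phylogenetic_Hierarchical_Orthogroups/N0.tsv", newline="")`.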

(Hierarchical orthogroup splitting: When analysing the gene trees, a nested hierarchical group (any HOG other than N0, the HOG at the level of the last common ancestor of all species) may sometimes have lost its genes from the earliest diverging species and then duplicated before the first extant genes. The two first diverging clades will then be paralogous even though the evidence suggests they belong to the same HOG. For most analyses it is often better to split these clades into separate groups. This can be requested using the option '-y'.)

Orthologues Directory

The Orthologues directory contains one sub-directory for each species that in turn contains a file for each pairwise species comparison, listing the orthologs between that species pair. Orthologues can be one-to-one, one-to-many or many-to-many depending on the gene duplication events since the orthologs diverged (see Section "Orthogroups, Orthologs & Paralogs" for more details). Each row in a file contains the gene(s) in one species that are orthologues of the gene(s) in the other species and each row is cross-referenced to the orthogroup that contains those genes.

Orthogroups Directory (deprecated)

The orthogroups in Phylogenetic_Hierarchical_Orthogroups/ should be used instead. They are identified using rooted gene trees and are 12%-20% more accurate.

Orthogroups.tsv (deprecated) is a tab separated text file. Each row contains the genes belonging to a single orthogroup. The genes from each orthogroup are organized into columns, one per species. The orthogroups in Phylogenetic_Hierarchical_Orthogroups/N0.tsv should be used instead.
Orthogroups_UnassignedGenes.tsv is a tab separated text file that is identical in format to Orthogroups.tsv but contains all of the genes that were not assigned to any orthogroup.
Orthogroups.txt (legacy format) is a second file containing the orthogroups described in the Orthogroups.tsv file but using the OrthoMCL output format.
Orthogroups.GeneCount.tsv is a tab separated text file that is identical in format to Orthogroups.tsv but contains counts of the number of genes for each species in each orthogroup.
Orthogroups_SingleCopyOrthologues.txt is a list of orthogroups that contain exactly one gene per species i.e. they contain one-to-one orthologues. They are ideally suited to between-species comparisons and to species tree inference.
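
The "exactly one gene per species" rule can be expressed directly on per-species gene counts. This is a minimal sketch only: the input layout (a dictionary of counts that includes every species, with zeros where a species has no gene) and the function name are assumptions made for illustration, not OrthoFinder's own code.

```python
def single_copy_orthogroups(gene_counts):
    """Return IDs of orthogroups with exactly one gene in every species.

    `gene_counts` maps orthogroup ID -> {species: gene count}; this layout
    is an assumption chosen for the sketch. Every species must be present
    in each inner dictionary, with 0 where it has no gene.
    """
    return sorted(
        og
        for og, counts in gene_counts.items()
        if counts and all(n == 1 for n in counts.values())
    )
```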

Gene Trees Directory

A rooted phylogenetic tree inferred for each orthogroup with 4 or more sequences (4 sequences is the minimum number required for tree inference with most tree inference programs).

Resolved Gene Trees Directory

A rooted phylogenetic tree inferred for each orthogroup with 4 or more sequences and resolved using the OrthoFinder hybrid species-overlap/duplication-loss coalescent model.

Species Tree Directory

SpeciesTree_rooted.txt A STAG species tree inferred from all orthogroups, containing STAG support values at internal nodes and rooted using STRIDE.
SpeciesTree_rooted_node_labels.txt The same tree as above but with the nodes given labels (instead of support values) to allow other results files to cross-reference branches/nodes in the species tree (e.g. location of gene duplication events).

Comparative Genomics Statistics Directory

Duplications_per_Orthogroup.tsv is a tab separated text file that gives the number of duplications identified in each orthogroup. The master file for this data is Gene_Duplication_Events/Duplications.tsv.
Duplications_per_Species_Tree_Node.tsv is a tab separated text file that gives the number of duplications identified as occurring along each branch of the species tree. The master file for this data is Gene_Duplication_Events/Duplications.tsv.
Orthogroups_SpeciesOverlaps.tsv is a tab separated text file that contains the number of orthogroups shared between each species-pair as a square matrix.
OrthologuesStats_*.tsv files are tab separated text files containing matrices giving the numbers of orthologues in one-to-one, one-to-many and many-to-many relationships between each pair of species.
- OrthologuesStats_one-to-one.tsv is the number of one-to-one orthologues between each species pair.
- OrthologuesStats_many-to-many.tsv contains the number of orthologues in a many-to-many relationship for each species pair (due to gene duplication events in both lineages post-speciation). Entry (i,j) is the number of genes in species i that are in a many-to-many orthology relationship with genes in species j.
- OrthologuesStats_one-to-many.tsv: entry (i,j) gives the number of genes in species i that are in a one-to-many orthology relationship with genes from species j. There is a walk-through of an example results file here: #259.
- OrthologuesStats_many-to-one.tsv: entry (i,j) gives the number of genes in species i that are in a many-to-one orthology relationship with a gene from species j. There is a walk-through of an example results file here: #259.
- OrthologuesStats_Total.tsv contains the totals for each species pair of orthologues of whatever multiplicity. Entry (i,j) is the total number of genes in species i that have orthologues in species j.
Statistics_Overall.tsv is a tab separated text file that contains general statistics about orthogroup sizes and proportion of genes assigned to orthogroups.
Statistics_PerSpecies.tsv is a tab separated text file that contains the same information as the Statistics_Overall.tsv file but for each individual species.

Most of the terms in the files 'Statistics_Overall.tsv' and 'Statistics_PerSpecies.tsv' are self-explanatory; the remainder are defined below.

Species-specific orthogroup: An orthogroup that consists entirely of genes from one species.
G50: The number of genes in the orthogroup such that 50% of genes are in orthogroups of that size or larger.
O50: The smallest number of orthogroups such that 50% of genes are in orthogroups of that size or larger.
Single-copy orthogroup: An orthogroup with exactly one gene (and no more) from each species. These orthogroups are ideal for inferring a species tree and many other analyses.
Unassigned gene: A gene that has not been put into an orthogroup with any other genes.
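
The G50 and O50 definitions above can be read as a single pass over the orthogroups from largest to smallest. The sketch below shows one such reading. It assumes that the 50% is taken over the genes in the orthogroups passed in (whether unassigned genes are counted is not stated above), and the function name is hypothetical.

```python
def g50_o50(orthogroup_sizes):
    """Return (G50, O50) for a list of orthogroup sizes (genes per orthogroup).

    Orthogroups are visited from largest to smallest until they cover at
    least half of all genes in the list. G50 is the size of the last
    orthogroup visited and O50 is how many orthogroups were visited.
    """
    sizes = sorted(orthogroup_sizes, reverse=True)
    half = sum(sizes) / 2
    covered = 0
    for n_groups, size in enumerate(sizes, start=1):
        covered += size
        if covered >= half:
            return size, n_groups
    raise ValueError("orthogroup_sizes must contain at least one orthogroup")
```

For example, with orthogroups of sizes 4, 3, 2 and 1 (10 genes in total), the two largest orthogroups together hold 7 genes, so G50 is 3 and O50 is 2.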

Gene Duplication Events Directory

Duplications.tsv is a tab separated text file that lists all the gene duplication events identified by examining each node of each orthogroup gene tree. The columns are "Orthogroup"; "Species Tree node" (branch of the species tree on which the duplication took place, see Species_Tree/SpeciesTree_rooted_node_labels.txt); "Gene tree node" (node corresponding to the gene duplication event, see corresponding orthogroup tree in Resolved_Gene_Trees/); "Support" (proportion of expected species for which both copies of the duplicated gene are present); "Type" ("Terminal": duplication on a terminal branch of the species tree, "Non-Terminal": duplication on an internal branch of the species tree & therefore shared by more than one species, "Non-Terminal: STRIDE": Non-Terminal duplication that also passes the very stringent STRIDE checks for what the topology of the gene tree should be post-duplication); "Genes 1" (the list of genes descended from one of the copies of the duplicate gene); "Genes 2" (the list of genes descended from the other copy of the duplicate gene).
SpeciesTree_Gene_Duplications_0.5_Support.txt provides a summation of the above duplications over the branches of the species tree. It is a text file in newick format. The numbers after each node or species name are the number of gene duplication events with at least 50% support that occurred on the branch leading to the node/species. The branch lengths are the standard branch lengths, as given in Species_Tree/SpeciesTree_rooted.txt.
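
The per-branch summation described above can be sketched as a filter on "Support" followed by a count per "Species Tree node". This is an illustration rather than OrthoFinder's own code: the function name is hypothetical, the example rows are made up, and the column names are matched case-insensitively against the names given in this section, so check them against the header of your own Duplications.tsv.

```python
import csv
import io
from collections import Counter


def count_supported_duplications(handle, min_support=0.5):
    """Count duplications per species tree node with support >= min_support.

    Column names are matched case-insensitively against the names given in
    this README ("Species Tree node", "Support"); check them against the
    header of your own Duplications.tsv.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    counts = Counter()
    for row in reader:
        fields = {k.strip().lower(): v for k, v in row.items() if k}
        if float(fields["support"]) >= min_support:
            counts[fields["species tree node"]] += 1
    return counts


# Made-up rows for illustration, not real OrthoFinder output.
example = io.StringIO(
    "Orthogroup\tSpecies Tree Node\tGene Tree Node\tSupport\tType\tGenes 1\tGenes 2\n"
    "OG0000001\tN1\tn3\t1.0\tNon-Terminal\tHuA, MoA\tHuB, MoB\n"
    "OG0000002\tN1\tn5\t0.25\tNon-Terminal\tHuC\tMoD\n"
    "OG0000003\tHuman\tn2\t1.0\tTerminal\tHuE\tHuF\n"
)
dup_counts = count_supported_duplications(example)
```

With the default threshold of 0.5, the second made-up row is excluded, leaving one duplication on N1 and one on the terminal branch Human.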

Orthogroup Sequences

A FASTA file for each orthogroup giving the amino acid sequences for each gene in the orthogroup.

Single Copy Orthologue Sequences

The same files as the "Orthogroup Sequences" directory but restricted to only those orthogroups which contain exactly one gene per species.

WorkingDirectory

This contains all the files necessary for orthofinder to run. You can ignore this.

Understanding Orthology

Orthogroups, Orthologs & Paralogs

Figure 2A shows an example gene tree for three species: human, mouse and chicken. Orthologs are pairs of genes that descended from a single gene in the last common ancestor (LCA) of two species (Fig. 2B). They can be thought of as 'equivalent genes' between two species. An orthogroup is the extension of this concept to groups of species. An orthogroup is the group of genes descended from a single gene in the LCA of a group of species (Figure 2A). Genes within an orthogroup may be orthologs of one another or they may be paralogs, as explained below.

The tree shows the evolutionary history of a gene. First, there was a speciation event where the chicken lineage diverged from the human-mouse ancestor. In the human-mouse ancestor, there was a gene duplication event at X producing two copies of the gene in that ancestor, Y & Z. When human and mouse diverged they each inherited gene Y (becoming HuA & MoA) and gene Z (HuB & MoB). In general, we can identify a gene duplication event because it creates two copies of a gene in a species (e.g. HuA & HuB).

Figure 2: Orthologues, Orthogroups & Paralogues

To tell which genes are orthologs and which genes are paralogs we need to identify the gene duplication events in the tree. Orthologs are genes that diverged at a speciation event (e.g. HuA & MoA) while paralogs diverged at a gene duplication event (e.g. HuA & MoB, and others: Fig 2C). Because orthologs only diverged at the point when the species diverged, they are as closely related as any gene can be between the two species. Paralogs are more distantly related: they diverged at a gene duplication event in a common ancestor. Such a gene duplication event must have occurred further back in time than when the species diverged and so paralogs between a pair of species are always less closely related than orthologs between that pair of species. Paralogs are also possible within a species (e.g. HuA & HuB).

The chicken gene diverged from the other genes when the lineage leading to chicken split from the lineage leading to human and mouse. Therefore, the chicken gene ChC is an ortholog of HuA & HuB in human and an ortholog of MoA & MoB in mouse. Depending on what happened after the genes diverged, orthologs can be in one-to-one relationships (HuA - MoA), many-to-one (HuA & HuB - ChC), or many-to-many (no examples in this tree, but would occur if there were a duplication in chicken). All of these relationships are identified by OrthoFinder.
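
The rule described above (a pair of genes is orthologous if their last common ancestor node is a speciation, and paralogous if it is a duplication) can be illustrated with the five genes from the example tree. In this sketch the speciation and duplication labels are written in by hand, and the tuple-based tree layout and function names are choices made for illustration. It is not OrthoFinder's inference method, which has to work out those events from the gene trees and species tree.

```python
# The example gene tree as nested tuples: (event, left, right).
# "S" marks a speciation node, "D" a duplication node; leaves are gene names.
GENE_TREE = ("S", ("D", ("S", "HuA", "MoA"), ("S", "HuB", "MoB")), "ChC")


def leaves(node):
    """Return the gene names below a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def classify_pairs(node, out=None):
    """Label each gene pair by the event at its last common ancestor node."""
    if out is None:
        out = {}
    if isinstance(node, str):
        return out
    event, left, right = node
    label = "ortholog" if event == "S" else "paralog"
    # Pairs split across the two children diverged at this node.
    for a in leaves(left):
        for b in leaves(right):
            out[frozenset((a, b))] = label
    classify_pairs(left, out)
    classify_pairs(right, out)
    return out


relationships = classify_pairs(GENE_TREE)
```

Looking up `relationships[frozenset(("HuA", "MoA"))]` gives "ortholog", while the pairs HuA & MoB and HuA & HuB both come out as "paralog" because they diverged at the duplication node X.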

Why Orthogroups

Orthogroups allow you to analyse all of your data

All of the genes in an orthogroup are descended from a single ancestral gene. Thus, all the genes in an orthogroup started out with the same sequence and function. As gene duplication and loss occur frequently in evolution, one-to-one orthologs are rare and limitation of analyses to one-to-one orthologs limits an analysis to a small fraction of the available data. By analysing orthogroups you can analyse all of your data.

Orthogroups allow you to define the unit of comparison

It is important to note that with orthogroups you choose where to define the limits of the unit of comparison. For example, if you just chose to analyse human and mouse in the above figure then you would have two orthogroups.

Orthogroups are the only way to identify orthologs

Orthology is defined by phylogeny. It is not definable by amino acid content, codon bias, GC content or other measures of sequence similarity. Methods that use such scores to define orthologs in the absence of phylogeny can only provide guesses. The only way to be sure that the orthology assignment is correct is by conducting a phylogenetic reconstruction of all genes descended from a single gene in the last common ancestor of the species under consideration. This set of genes is an orthogroup. Thus, the only way to define orthology is by analysing orthogroups.

Trees from MSA: `"-M msa"`

The following is not required for the standard OrthoFinder use cases and is only needed if you want to infer maximum likelihood trees from multiple sequence alignments (MSA). This is more costly computationally but more accurate. By default, MAFFT is used for the alignment and FastTree for the tree inference. The option for this is "-M msa". You should be careful using any other tree inference programs, such as IQTREE or RAxML, since inferring the gene trees for the complete set of orthogroups using anything that is not as quick as FastTree will require significant computational resources/time. The executables you wish to use should be in the system path.

Advanced usage

Python Source Code Version

There is a standalone binary for OrthoFinder which does not require python or scipy to be installed and is therefore the easiest option for many users. However, the python source code version is available from the github 'releases' page (e.g. 'OrthoFinder_source.tar.gz') and requires python 2.7 or python 3 plus scipy & numpy to be installed. Up-to-date and clear instructions for scipy/numpy are provided here: http://www.scipy.org/install.html. As websites can change, an alternative is to search online for "install scipy".

Manually Installing Dependencies

To perform an analysis OrthoFinder requires some dependencies. The OrthoFinder release package now contains these so you should just be able to download it and run.

Here are some brief instructions if you do need to download them manually. They will need to be in the system path, which you can check by using the 'which' command, e.g. which diamond. Each of these packages also contains more detailed installation instructions on their websites if you need them.

Standard workflow:

DIAMOND or MMseqs2 (recommended, although BLAST+ can be used instead)
The MCL graph clustering algorithm
FastME (The appropriate version for your system, e.g. 'fastme-2.1.5-linux64', should be renamed 'fastme', see instructions below.)

MSA workflow:

Multiple sequence alignment program: MAFFT recommended
Tree inference program: FastTree recommended

FastTree is highly recommended, especially for a first analysis. Note that even a program as fast as IQTREE will take a very large amount of time to run on a reasonable sized dataset. If you intend to do this, it is recommended to try a faster method first (e.g. the standard workflow). Once you've confirmed everything is ok, you can restart the previous analysis from the point where these workflows diverge using the -M msa option.

DIAMOND

Available here: https://github.com/bbuchfink/diamond/releases

Download the latest release, extract it and copy the executable to a directory in your system path, e.g.:

wget https://github.com/bbuchfink/diamond/releases/latest/download/diamond-linux64.tar.gz
tar xzf diamond-linux64.tar.gz
sudo cp diamond /usr/local/bin

Alternatively, if you don't have root privileges, instead of the last step above, add the directory containing the executable to your PATH variable. E.g.

mkdir ~/bin
cp diamond ~/bin
export PATH=$PATH:~/bin/

MCL

The mcl clustering algorithm is available in the repositories of some Linux distributions and so can be installed in the same way as any other package. For example, on Ubuntu, Debian, Linux Mint:

sudo apt-get install mcl

Alternatively, it can be built from source which will likely require the 'build-essential' or equivalent package on the Linux distribution being used. Instructions are provided on the MCL webpage, http://micans.org/mcl/.

FastME

FastME can be obtained from http://www.atgc-montpellier.fr/fastme/binaries.php. The package contains a 'binaries/' directory. Choose the appropriate one for your system and copy it to somewhere in the system path, e.g. '/usr/local/bin', and name it 'fastme'. I.e.:

sudo cp fastme-2.1.5-linux64 /usr/local/bin/fastme

Optional: BLAST+

BLAST may give a 1-2% accuracy increase over DIAMOND, but with a runtime approximately 20x longer. NCBI BLAST+ is available in the repositories of most Linux distributions and so can be installed in the same way as any other package. For example, on Ubuntu, Debian, Linux Mint:

sudo apt-get install ncbi-blast+

Alternatively, instructions are provided for installing BLAST+ on Mac and various flavours of Linux on the "Standalone BLAST Setup for Unix" page of the BLAST+ Help manual currently at http://www.ncbi.nlm.nih.gov/books/NBK1762/. Follow the instructions under "Configuration" in the BLAST+ help manual to add BLAST+ to the PATH environment variable.

Optional: MMseqs2

Available here: https://github.com/soedinglab/MMseqs2/releases

Download the appropriate version for your machine, extract it and copy the executable to a directory in your system path, e.g.:

wget https://github.com/soedinglab/MMseqs2/releases/download/3-be8f6/MMseqs2-Linux-AVX2.tar.gz
tar xzf MMseqs2-Linux-AVX2.tar.gz
sudo cp mmseqs2/bin/mmseqs /usr/local/bin

Alternatively, if you don't have root privileges, instead of the last step above, add the directory containing the executable to your PATH variable:

export PATH=$PATH:`pwd`/mmseqs2/bin/

config.json : Adding additional programs for tree inference, local alignment or MSA

You can actually use any alignment or tree inference program you like the best! Be careful with the method you choose: OrthoFinder typically needs to infer about 10,000-20,000 gene trees. If you have many species or if the tree/alignment method isn't super-fast then this can take a very long time! MAFFT + FastTree provides a reasonable compromise. OrthoFinder already knows how to call:

mafft
muscle
iqtree
raxml
raxml-ng
fasttree

For example, to use muscle and iqtree, the command line arguments you need to add are: "-M msa -A muscle -T iqtree"

OrthoFinder also knows how to use the following local sequence alignment programs:

BLAST
DIAMOND
MMSeqs2

If you want to use a different program, there is a simple configuration file called "config.json" in the orthofinder directory and you can also create a file of the same format called "config_orthofinder_user.json" in your user home directory. You just need to add an entry to tell OrthoFinder what the command line looks like for the program you want to use. There are lots of examples in the file that you can follow. The "config.json" file is read first and then the "config_orthofinder_user.json", if it is present. The config_orthofinder_user.json file can be used to add user-specific options and to overwrite options from config.json. In most cases it is best to add additional options to the "config_orthofinder_user.json" since these will continue to apply if you update your version of OrthoFinder.

Adding Extra Species

OrthoFinder allows you to add extra species without re-running the previously computed BLAST searches:

orthofinder -b previous_orthofinder_directory -f new_fasta_directory

This will add each species from the 'new_fasta_directory' to existing set of species, reuse all the previous BLAST results, perform only the new BLAST searches required for the new species and recalculate the orthogroups. The 'previous_orthofinder_directory' is the OrthoFinder 'WorkingDirectory/' containing the file 'SpeciesIDs.txt'.

Removing Species

OrthoFinder allows you to remove species from a previous analysis. In the 'WorkingDirectory/' from a previous analysis there is a file called 'SpeciesIDs.txt'. Comment out any species to be removed from the analysis using a '#' character and then run OrthoFinder using:

orthofinder -b previous_orthofinder_directory

where 'previous_orthofinder_directory' is the OrthoFinder 'WorkingDirectory/' containing the file 'SpeciesIDs.txt'.
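
If you need to remove several species, the commenting-out step can be scripted. The sketch below assumes that each line of 'SpeciesIDs.txt' has the form "<id>: <fasta file name>"; that layout is an assumption, so check your own file before relying on it. The function name is hypothetical and the sketch works on a list of lines rather than editing the file in place, so you can inspect the result before writing it back.

```python
def comment_out_species(lines, species_to_remove):
    """Prefix '#' to SpeciesIDs.txt lines naming a species to remove.

    Assumes each line looks like "<id>: <fasta file name>"; that layout is
    an assumption for this sketch, so check your own SpeciesIDs.txt first.
    """
    remove = set(species_to_remove)
    result = []
    for line in lines:
        name = line.split(":", 1)[-1].strip()
        if name in remove and not line.lstrip().startswith("#"):
            result.append("#" + line)
        else:
            result.append(line)
    return result
```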

Adding and Removing Species Simultaneously

The previous two options can be combined, comment out the species to be removed as described above and use the command:

orthofinder -b previous_orthofinder_directory -f new_fasta_directory

Inferring Multiple Sequence Alignment (MSA) Gene Trees

Trees can be inferred using multiple sequence alignments (MSA) by using the option "-M msa". By default MAFFT is used to generate the MSAs and FastTree to generate the gene trees. Alternatively, any other program can be used in place of these. Many popular programs have already been configured by having an entry in the config.json file in the orthofinder directory. All options currently available can be seen by using the option "-h" to see the help file. The config.json file is user-editable to allow for any other desired program to be added. MAFFT, FastTree, or whatever programs are used instead need to be in the system path.

OrthoFinder performs light trimming of the MSA to prevent overly long runtimes & RAM usage caused by very long, gappy alignments. A column is trimmed from the alignment if it is greater than 90% gaps, provided two conditions are met: 1. The length of the trimmed alignment cannot go below 500 AA. 2. No more than 25% of non-gap characters can be removed from the alignment. If either of these conditions is not met then the threshold for the percentage of gaps in removed columns is progressively increased beyond 90% until both conditions are met. The trimming can be turned off using the option "-z".
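
The trimming rule can be sketched as a loop that raises the gap threshold until both conditions hold. This is an illustration written from the description above, not OrthoFinder's actual implementation. In particular, the `step` by which the threshold is raised is an assumption, as is the function name. One consequence of this sketch is that an alignment already shorter than the minimum length is returned untrimmed.

```python
def trim_alignment(seqs, gap_threshold=0.9, min_length=500,
                   max_removed_fraction=0.25, step=0.01):
    """Drop gappy columns from an alignment, following the rules above.

    `seqs` is a list of equal-length aligned strings using '-' for gaps.
    The `step` by which the threshold is raised is an assumption; the text
    above only says the threshold is increased progressively.
    """
    n_seqs = len(seqs)
    n_cols = len(seqs[0])
    gaps = [sum(s[i] == "-" for s in seqs) for i in range(n_cols)]
    non_gaps = [n_seqs - g for g in gaps]
    total_non_gaps = sum(non_gaps)

    threshold = gap_threshold
    while True:
        # Keep columns whose gap fraction does not exceed the threshold.
        keep = [i for i in range(n_cols) if gaps[i] / n_seqs <= threshold]
        removed = total_non_gaps - sum(non_gaps[i] for i in keep)
        long_enough = len(keep) >= min_length
        few_removed = removed <= max_removed_fraction * total_non_gaps
        # At a threshold of 1.0 every column is kept, so the loop must end.
        if (long_enough and few_removed) or threshold >= 1.0:
            break
        threshold = min(1.0, threshold + step)
    return ["".join(s[i] for i in keep) for s in seqs]
```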

Parallelising OrthoFinder Algorithm

There are two separate options for controlling the parallelisation of OrthoFinder.

'-t number_of_threads': This option should always be used. It specifies the number of parallel processes for the BLAST/DIAMOND searches and tree inference steps. These steps represent most of the runtime and are highly-parallelisable and so you should typically use as many threads as there are cores available on your computer. This is the value it will default to if not specified by the user.
'-a number_of_orthofinder_threads' In addition to the above, all of the critical internal steps of the OrthoFinder algorithm have been parallelised. The number of threads for these steps is controlled using the '-a' option. These steps typically have larger RAM requirements and so using a value 4-8x smaller than that used for the '-t' option is usually a good choice. Since these steps are a small component of the overall runtime it is not important to set '-a' as high as possible in order to get good performance. Not running out of RAM is a more important consideration. If the '-a' parameter is not set it will default to 16 or one eighth of the '-t' parameter, whichever is smaller.
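
The default for '-a' described above is simple arithmetic. In this sketch, rounding down and never going below 1 are assumptions, and the function name is hypothetical.

```python
def default_orthofinder_threads(n_threads):
    """Default for '-a' as described above: min(16, one eighth of '-t').

    Rounding down and the floor of 1 are assumptions of this sketch.
    """
    return max(1, min(16, n_threads // 8))
```

For example, with '-t 64' this gives 8, and with '-t 256' it is capped at 16.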

Running BLAST Searches Separately (-op option)

The '-op' option will prepare the files in the format required by OrthoFinder and print the set of BLAST commands that need to be run.

orthofinder -f fasta_files_directory -op

This is useful if you want to manage the BLAST searches yourself. For example, you may want to distribute them across multiple machines. Once the BLAST searches have been completed the orthogroups can be calculated using the '-b' command as described in Section "Using Pre-Computed BLAST Results".

Using Pre-Computed BLAST Results

It is possible to run OrthoFinder with pre-computed BLAST results provided they are in the correct format. They can be prepared in the correct format using the '-op' command and, equally, the files from a previous OrthoFinder run are also in the correct format to rerun using the '-b' option. The command is simply:

orthofinder -b directory_with_processed_fasta_and_blast_results

If you are running the BLAST searches yourself it is strongly recommended that you use the '-op' option to prepare the files first (see Section "Running BLAST Searches Separately"). Should you need to prepare them manually, the required files and their formats are described in the appendix of the PDF Manual (for example, if you already have BLAST search results from another source and it will take too much computing time to redo them).

Regression Tests

A set of regression tests is included in the 'Tests' directory of the GitHub repository. They can be run by calling the script 'test_orthofinder.py'. They currently require version 2.2.28 of NCBI BLAST and the script will exit with an error message if this is not the case.

Methods

The orthogroup inference stage of OrthoFinder is described in the first paper: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0721-2

The second stage from orthogroups to gene trees, the rooted species tree, orthologs, gene duplication events etc. is described in the second paper: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1832-y

The workflow figure at the top of this page summarises this.

The rooting of the unrooted species tree is described in the STRIDE paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5850722/

Species tree inference is described in the second OrthoFinder paper and in the STAG paper: https://www.biorxiv.org/content/10.1101/267914v1. A summary is provided below.

Species Tree Inference

OrthoFinder infers orthologs from rooted gene trees. Since tree inference methods return unrooted gene trees, OrthoFinder requires a rooted species tree in order to root the gene trees before ortholog inference can take place. There are two methods that can be used for unrooted species tree inference (plus a fallback method that is employed in rare circumstances when there is insufficient data for the other methods). Additionally, if the user knows the topology of the rooted species tree they can provide it to OrthoFinder (the branch lengths aren't required). The rooted species tree is only required in the final step of the OrthoFinder analysis: the rooting of the gene trees and the inference of orthologs and gene duplication events. This step is comparatively fast, so it is easy to rerun just this last step using the '-ft' option with a corrected species tree if you want to use a different species tree from the one OrthoFinder used.

Default species tree method

The default species tree method is STAG, described here: https://www.biorxiv.org/content/10.1101/267914v1

1. The set of all orthogroups with all species present (regardless of gene copy number) is identified: X.
2. For each orthogroup x in X, a matrix of pairwise species distances is calculated. For x, the distance between each species pair is the tree distance for the closest pair of genes from that species pair in the gene tree for x.
3. For each orthogroup x in X, a species tree is inferred from the distance matrix.
4. A consensus tree of all these individual species trees is calculated as the final species tree.
5. The support value for each bipartition is the number of individual species trees that contained that bipartition.

When it is run, OrthoFinder outputs how many orthogroups it has identified with all species present. E.g. for the example dataset:

269 trees had all species present and will be used by STAG to infer the species tree
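The distance calculation for a single orthogroup can be sketched as follows. This is a minimal illustration rather than OrthoFinder's own code: it assumes the gene-to-gene tree distances have already been extracted into a dictionary, and the function and argument names are hypothetical.

```python
from typing import Dict, Tuple


def species_distance_matrix(
    gene_distances: Dict[Tuple[str, str], float],
    gene_to_species: Dict[str, str],
) -> Dict[Tuple[str, str], float]:
    """For each species pair, keep the distance of the closest gene pair."""
    closest: Dict[Tuple[str, str], float] = {}
    for (gene_a, gene_b), distance in gene_distances.items():
        species_a = gene_to_species[gene_a]
        species_b = gene_to_species[gene_b]
        if species_a == species_b:
            continue  # genes from the same species do not define a species-pair distance
        pair = tuple(sorted((species_a, species_b)))
        if pair not in closest or distance < closest[pair]:
            closest[pair] = distance
    return closest
```

Taking the closest pair means that an orthogroup with several gene copies in a species still yields one distance per species pair, which is why copy number does not matter when selecting the set X.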

Multiple Sequence Alignment species tree method (-M msa)

The MSA species tree method is also described in the STAG paper: https://www.biorxiv.org/content/10.1101/267914v1. It is used whenever the MSA method is used for tree inference via the '-M msa' option. It infers the species tree from a concatenated MSA of single-copy genes. For many datasets there will not be many orthogroups that have exactly one gene in every species since gene duplication and loss events make such orthogroups rare. For this reason, OrthoFinder will identify orthogroups that are single-copy in a proportion (p%) of species and use the single-copy genes from these orthogroups as additional data to infer the species tree. This is standard practice in most papers in which a species tree is inferred. OrthoFinder provides a formalised procedure for determining a suitable value of p. Let S be the number of species.

1. Identify n, the number of orthogroups with exactly one gene in s species, where s is initially equal to S, the number of species in the analysis. If n >= 1000, stop here and use these orthogroups.
2. While n < 1000:
   - Set s = s-1.
   - Recalculate n, the number of orthogroups with at least s species single-copy.
   - If n >= 100 and the proportional increase in the number of orthogroups, n, is less than two times the proportional decrease in s, then stop here and use the n orthogroups: reducing the minimum threshold for single-copy species is not giving a large amount of extra data, so it is not worth reducing the threshold further. If s < 0.5 x S, then require a four times proportional increase in the number of orthogroups for each decrement in s, to avoid lowering s too far.
3. Create a concatenated species MSA from the single-copy genes in the selected orthogroups.
4. Trim the MSA of any column that has more than (S - 0.5s) gaps. (I.e. S-s species could be gaps anyway because of the inclusion threshold that was determined, and then at most 50% gaps are allowed in a particular column for the s genes represented in that column.)

When it is run, OrthoFinder outputs how many orthogroups it has identified and with what minimum threshold percentage of species single-copy in each orthogroup (100*s/S). E.g. for the example dataset:

Species tree: Using 246 orthogroups with minimum of 100.0% of species having single-copy genes in any orthogroup
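The threshold search and the column trimming can be sketched as below. This follows a literal reading of the steps and is not taken from OrthoFinder's source. The function names are hypothetical, each orthogroup is represented only by the number of species in which it is single-copy, and stopping at s = 1 is an assumption added so that the loop always terminates. Whether the real implementation keeps the lowered threshold when the stopping rule fires is not specified in the text.

```python
from typing import List, Tuple


def choose_threshold(single_copy_counts: List[int], n_species: int,
                     n_min: int = 100, n_max: int = 1000) -> Tuple[int, int]:
    """Return (s, n): the species threshold and the number of orthogroups kept."""
    def count(s: int) -> int:
        return sum(1 for c in single_copy_counts if c >= s)

    s = n_species
    n = count(s)
    while n < n_max and s > 1:  # assumption: never lower s below 1
        s_prev, n_prev = s, n
        s -= 1
        n = count(s)
        decrease_s = (s_prev - s) / s_prev
        increase_n = (n - n_prev) / n_prev if n_prev else float("inf")
        factor = 4 if s < 0.5 * n_species else 2
        if n >= n_min and increase_n < factor * decrease_s:
            break  # lowering s further gains too little extra data
    return s, n


def trim_columns(rows: List[str], n_species: int, s: int) -> List[str]:
    """Drop alignment columns with more than (S - 0.5s) gap characters."""
    max_gaps = n_species - 0.5 * s
    keep = [i for i in range(len(rows[0]))
            if sum(row[i] == "-" for row in rows) <= max_gaps]
    return ["".join(row[i] for i in keep) for row in rows]
```

The reported percentage corresponds to 100*s/S for the returned s.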

Fallback species tree method

In most datasets there will be thousands of genes present in all species and so the default species tree inference method can be used. In some extreme cases there may not be any such orthogroups. In these cases, instead of the default method, the pairwise distances are calculated in each tree for each species pair that is present in that tree. A single distance matrix is then calculated for the species tree rather than one distance matrix per orthogroup. The distance between each species pair in this matrix is the median of all the closest distances across all the orthogroup gene trees. The species tree is inferred from this distance matrix.
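A minimal sketch of the pooling step, assuming the closest distance for each species pair has already been computed per gene tree (the input layout and function name are hypothetical):

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def fallback_species_distances(per_tree_closest: List[Dict[Pair, float]]) -> Dict[Pair, float]:
    """Median, over all gene trees, of the closest distance for each species pair."""
    pooled: Dict[Pair, List[float]] = defaultdict(list)
    for tree in per_tree_closest:
        for (species_a, species_b), distance in tree.items():
            # Store each pair in a fixed order so (A, B) and (B, A) are pooled together.
            pooled[tuple(sorted((species_a, species_b)))].append(distance)
    return {pair: median(values) for pair, values in pooled.items()}
```

Because each tree contributes only the species pairs it actually contains, a pair can be estimated even when no single orthogroup includes every species.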

Command line options

Options for starting an analysis

-f <dir>: Start analysis from directory of FASTA files
-b <dir>: Start analysis from BLAST results in OrthoFinder directory
-b <dir1> -f <dir2>: Start analysis from BLAST results in OrthoFinder dir1 and add FASTA files from dir2
-fg <dir>: Start analysis from orthogroups OrthoFinder directory
-ft <dir>: Start analysis from gene trees in OrthoFinder directory

Options for stopping an analysis

-op: Stop after preparing input files for all-vs-all sequence search (e.g. BLAST/DIAMOND)
-og: Stop after inferring orthogroups
-os: Stop after writing sequence files for orthogroups (requires '-M msa')
-oa: Stop after inferring multiple sequence alignments for orthogroups (requires '-M msa')
-ot: Stop after inferring gene trees for orthogroups

Options controlling the workflow

-M <opt>: Use MSA or DendroBLAST gene tree inference, opt=msa,dendroblast [default=dendroblast]

Options controlling the programs used

-S <opt>: Sequence search program opt=blast,diamond,mmseqs,... user-extendable [Default = diamond]
-A <opt>: MSA program opt=mafft,muscle,... user-extendable (requires '-M msa') [Default = mafft]
-T <opt>: Tree inference program opt=fasttree,raxml,iqtree,... user-extendable (requires '-M msa') [Default = fasttree]

Further options

-d: Input is DNA sequences
-t <int>: Number of threads for sequence search, MSA & tree inference [Default is number of cores on machine]
-a <int>: Number of parallel analysis threads for internal, RAM intensive tasks [Default = 1]
-s <file>: User-specified rooted species tree
-I <int>: MCL inflation parameter [Default = 1.5]
-x <file>: Info for outputting results in OrthoXML format
-p <dir>: Write the temporary pickle files to <dir>
-1: Only perform one-way sequence search
-X: Don't add species names to sequence IDs in output files
-y: Split paralogous clades below root of a HOG into separate HOGs
-z: Don't trim MSAs (columns>=90% gap, min. alignment length 500)
-n <txt>: Name to append to the results directory
-o <txt>: Non-default results directory
-h: Print this help text
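If you drive OrthoFinder from a script, a small helper that assembles the argument list can keep the options straight. The sketch below uses only flags listed above ('-f', '-b', '-M', '-t', '-a'). The helper itself and the executable name "orthofinder" are assumptions for illustration.

```python
from typing import List, Optional


def build_orthofinder_command(
    fasta_dir: Optional[str] = None,
    blast_dir: Optional[str] = None,
    tree_method: Optional[str] = None,
    search_threads: Optional[int] = None,
    analysis_threads: Optional[int] = None,
    executable: str = "orthofinder",  # assumption: the program is on the PATH under this name
) -> List[str]:
    """Assemble an argument list suitable for subprocess.run()."""
    if fasta_dir is None and blast_dir is None:
        raise ValueError("give fasta_dir (-f), blast_dir (-b), or both")
    command = [executable]
    if blast_dir is not None:
        command += ["-b", blast_dir]
    if fasta_dir is not None:
        command += ["-f", fasta_dir]
    if tree_method is not None:
        if tree_method not in ("msa", "dendroblast"):
            raise ValueError("tree_method must be 'msa' or 'dendroblast'")
        command += ["-M", tree_method]
    if search_threads is not None:
        command += ["-t", str(search_threads)]
    if analysis_threads is not None:
        command += ["-a", str(analysis_threads)]
    return command
```

The returned list can be passed to subprocess.run(command, check=True). Giving both blast_dir and fasta_dir corresponds to the '-b <dir1> -f <dir2>' form for adding species to a previous analysis.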


orthofinder's Issues

orthofinder: cannot execute binary file

Hello,

I am trying to get OrthoFinder up and running. When I run OrthoFinder-1.1.4/orthofinder -h in the terminal, I receive this response: -bash: Orthofinder-1.1.2/orthofinder: cannot execute binary file. From what I read online, this could mean that my macOS system is unable to run the OrthoFinder binary. Has anyone else had this issue?

Thanks,
Rob

OrthoFinder and Reference species

Hi every one,

I'm wondering how OrthoFinder builds orthogroups.
It looks like OrthoFinder is using a reference species (the last species in alphanumerical order?), which changes the number of orthogroups as well as the number of single-copy orthologues predicted by OrthoFinder...

Looking for single-copy orthologues (SCO) in my set of sequences, I used the Orthogroups.csv file from OrthoFinder and counted the number of genes found per orthogroup and per species. Then I extracted the orthogroups for which no more than one gene was found per species. I printed a presence/absence (or gene occupancy) matrix per species and per orthogroup.
Alph.SCO.pdf :

As you can see in the SCO.png file, which represents the gene occupancy for every SCO identified by OrthoFinder under my criteria, the first species (called "E") shows 100% presence for every SCO.
After I renamed the species (E corresponded to species A in the first run), I found the same pattern for E, the last species in alphanumerical order.
RevAlph.SCO.pdf.

Am I doing something wrong? Am I running into a limitation of OrthoFinder? Are my scripts inaccurate? Should I use some other tools already available?
Could you explain to me what's going on?

Pierre

Some more details about the example dataset I used:

From UniProt Proteomes (http://www.uniprot.org/proteomes/), I downloaded the first five entries: UP000000212, UP000000214, UP000000223, UP000000224 and UP000000229 (the FTP links are in the file ExplProteome.txt).
ExplProteome.txt

mkdir TestOrthoFinder
cd TestOrthoFinder
# ExplProteome.txt (one FTP link per line) must be in this directory
wget -i ExplProteome.txt
gunzip *.gz

I renamed the proteome ()

#Prepare Iteration
number=(0 1 2 3 4)
Alph=(A B C D E)
RevAlph=(E D C B A)
listFasta=($(ls *.fasta))

#Prepare Directory for both analysis :
mkdir TestAlph
mkdir TestRevAlph
for i in ${number[@]}
do
        ln ${listFasta[$i]} TestAlph/${Alph[$i]}.fa
        ln ${listFasta[$i]} TestRevAlph/${RevAlph[$i]}.fa
done

Then, running Orthofinder version 1.1.3 :

orthofinder.py -t 32 -f TestAlph/
orthofinder.py -t 32 -f TestRevAlph/
cp TestAlph/Result*/Orthogroups.csv AlphOrthogroups.csv
cp TestRevAlph/Result*/Orthogroups.csv RevAlphOrthogroups.csv

I used two home-made scripts (available on GitHub at: https://github.com/npchar/Phylogenomic ).
The first counts the number of genes/contigs/sequences per orthogroup and per species. The second identifies SCOs and creates a graphical representation of the gene occupancy matrix (needs R::optparse, R::reshape2, R::ggplot2).

git clone https://github.com/npchar/Phylogenomic.git
./Phylogenomic/OrthoFinder-NumberOfGenesPerOrthogroupsPerSpecies.pl RevAlphOrthogroups.csv > RevAlphOrthogroups.ocurrence.csv
./Phylogenomic/OrthoFinder-NumberOfGenesPerOrthogroupsPerSpecies.pl AlphOrthogroups.csv > AlphOrthogroups.ocurrence.csv
./Phylogenomic/OrthoFinder-SCOmatrixBuilder.R -f AlphOrthogroups.ocurrence.csv -o Alpha
./Phylogenomic/OrthoFinder-SCOmatrixBuilder.R -f RevAlphOrthogroups.ocurrence.csv -o RevAlpha

It produced the SCO matrix representation for different % of gene occupancy : SCO.pdf
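For comparison, the counting step described in this post can be sketched in a few lines of Python. This is an independent illustration, not the poster's scripts. It assumes Orthogroups.csv is tab-delimited, with a header row of species names and comma-separated gene lists in each cell, and both function names are hypothetical.

```python
import csv
import io
from typing import Dict, List


def gene_counts(orthogroups_text: str) -> Dict[str, Dict[str, int]]:
    """Count genes per orthogroup and per species from tab-delimited text."""
    reader = csv.reader(io.StringIO(orthogroups_text), delimiter="\t")
    species = next(reader)[1:]  # first header cell is the orthogroup column
    counts: Dict[str, Dict[str, int]] = {}
    for row in reader:
        if not row:
            continue
        # Pad short rows so that a missing trailing cell counts as zero genes.
        cells = row[1:] + [""] * (len(species) - (len(row) - 1))
        counts[row[0]] = {
            sp: len([gene for gene in cell.split(",") if gene.strip()])
            for sp, cell in zip(species, cells)
        }
    return counts


def at_most_single_copy(counts: Dict[str, Dict[str, int]]) -> List[str]:
    """Orthogroups with no more than one gene in every species."""
    return [og for og, per_species in counts.items()
            if all(n <= 1 for n in per_species.values())]
```

Running this on both Orthogroups.csv files and comparing the per-species occupancy would show whether the pattern depends on the column order of the file or on how it is parsed.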

Orthoxml Format

Hi David,

As I had trouble with this in the last two runs, I want to make sure I've got the format right for the file provided by -x to obtain OrthoXML output.

I am using the latest commit.
The example from the manual:
HomSap.fa	Homo sapiens	36	Ensembl	Homo_sapiens.NCBI36.52.pep.all.fa

Does HomSap.fa need to be the file name as I initially gave it to orthofinder i.e. the same as value in the SpeciesID.txt column 2?

If I understand it correctly the only other field with restricted text is the NCBI taxonomy ID, the rest is not further checked and just used to fill xml content?

Best,
Daniel

Cannot execute binary file

Hello,
Sorry for what is surely a very basic question. I am trying to install OrthoFinder on a Mac running Sierra and have only managed to get as far as trying to test if I can run the program. When I enter OrthoFinder-1.1.4/orthofinder -h, I get an error message saying "cannot execute binary file." Is there anything I can do about this, or is this just due to the fact that I am running a mac and not linux?
Thank you!

dlcpar installation problems

Hi David (great to see you at UK GS2016!)

When installing dlcpar for the new functionalities, I encountered two problems:

python setup.py install in the dlcpar folder gave an error that it could not find dlcpar/deps/rasmus/ply. I got around that by running pip install ply and changing setup.py to import ply rather than requiring the package dlcpar.deps.rasmus.ply. The install ran fine after that, but we may not have done this the right way.

Can you tell us how you installed dlcpar?
After doing the above, on running the program (the script/libs are all in the path etc, I checked), I get this error:

dlcpar_search 
Traceback (most recent call last):
  File "/exports/virt_env/python/orthofinder/bin/dlcpar_search", line 204, in <module>
    sys.exit(main())
  File "/exports/virt_env/python/orthofinder/bin/dlcpar_search", line 127, in main
    options, treefiles = parse_args()
  File "/exports/virt_env/python/orthofinder/bin/dlcpar_search", line 46, in parse_args
    common.add_common_options(grp_io,
AttributeError: 'module' object has no attribute 'add_common_options'

And on checking the code, add_common_options is not a function in the module (It was removed a few versions ago it seems).

Could you confirm which version of dlcpar you are using?

Thanks!

faster sequence comparison

Hi,
For large projects it could be beneficial to use https://github.com/soedinglab/MMseqs or another fast sequence comparison tool (faster than blast) and perhaps nice to have it wrapped over within the same orthofinder tool.

thanks

AttributeError: 'SequencesInfo' object has no attribute 'inputdir'

Running OrthoFinder with some incorrectly formatted BLAST results files produces the following error:

  File "/home/david/software/OrthoFinder/OrthoFinder-master/orthofinder.py", line 658, in GetBLAST6Scores
    sys.stderr.write("Malformatted line in %sBlast%d_%d.txt\nOffending line was:\n" % (seqsInfo.inputdir, iSpecies, jSpecies))
AttributeError: 'SequencesInfo' object has no attribute 'inputdir

It should correctly print where the problem in the BLAST files was

WriteGraphParallel ignores specified number of cpu's

I ran OrthoFinder v0.6.1 on a 32Gb machine using pre-computed BLAST results totaling 256 files at 163.2Gb. I specified only one process, but would run out of RAM when final scores are written to graph files. This seemed to be occurring because multiple processes were spawning at this step even though only one process was specified.

We worked around this by modifying the WriteGraphParallel function, line 834, changing pool=mp.Pool() to pool=mp.Pool(1) to restrict it to one process. This hard-coded change allowed the program to complete successfully in around 16 hours on this large amount of data. I thought you might be interested to know about this issue and workaround.

Thanks so much,

Rachel

orthofinder results: visualisation

Hi folks,

Now that I have the results from Orthofinder. I cannot find the number of genes overlapping and the one to one orthogroups.

Regards,
/SB

python error

Hi David

I am having problems getting OrthoFinder up and running on PSC Bridges.

/home/macmanes/OrthoFinder/orthofinder/orthofinder.py -f /home/macmanes/genomes/
OrthoFinder version 1.1.5 Copyright (C) 2014 David Emms

    This program comes with ABSOLUTELY NO WARRANTY.
    This is free software, and you are welcome to redistribute it under certain conditions.
    For details please see the License.md that came with this software.

16 thread(s) for highly parallel tasks (BLAST searches etc.)
1 thread(s) for OrthoFinder algorithm

Checking required programs are installed
----------------------------------------
Test can run "makeblastdb -help" - ok
Test can run "blastp -help" - ok
Test can run "mcl -h" - ok
Test can run "fastme -i /home/macmanes/genomes/SimpleTest.phy -o /home/macmanes/genomes/SimpleTest.tre" - ok
Test can run "dlcpar_search --version" - ok

Dividing up work for BLAST for parallel processing
--------------------------------------------------
2017-03-02 14:15:33 : Creating Blast database 1 of 45
Traceback (most recent call last):
  File "/home/macmanes/OrthoFinder/orthofinder/orthofinder.py", line 1375, in <module>
    CreateBlastDatabases(dirs)
  File "/home/macmanes/OrthoFinder/orthofinder/orthofinder.py", line 1204, in CreateBlastDatabases
    RunBlastDBCommand(command)
  File "/home/macmanes/OrthoFinder/orthofinder/orthofinder.py", line 71, in RunBlastDBCommand
    capture = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, env=my_env)
  File "/opt/packages/python/2_7_11_gcc/lib/python2.7/subprocess.py", line 710, in __init__
    errread, errwrite)
  File "/opt/packages/python/2_7_11_gcc/lib/python2.7/subprocess.py", line 1335, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

my fasta files are indeed at /home/macmanes/genomes/, and in fact in WorkingDirectory they are all copied there as per normal.

ls -lth /home/macmanes/genomes/
-rw-r--r-- 1 macmanes mc3bg6p 7.3M Jan 21  2016 Egretta_garzetta.fasta
-rw-r--r-- 1 macmanes mc3bg6p 6.1M Jan 21  2016 Fulmarus_glacialis.fasta
-rw-r--r-- 1 macmanes mc3bg6p 8.0M Jan 21  2016 Gallus_gallus.fasta
-rw-r--r-- 1 macmanes mc3bg6p 5.8M Jan 21  2016 Gavia_stellata.fasta
-rw-r--r-- 1 macmanes mc3bg6p 7.7M Jan 21  2016 Geospiza_fortis.gene.fasta
-rw-r--r-- 1 macmanes mc3bg6p 6.1M Jan 21  2016 Haliaeetus_albicilla.fasta
-rw-r--r-- 1 macmanes mc3bg6p 7.8M Jan 21  2016 Haliaeetus_leucocephalus.fasta
-rw-r--r-- 1 macmanes mc3bg6p 6.4M Jan 21  2016 Leptosomus_discolor.fasta
-rw-r--r-- 1 macmanes mc3bg6p 7.3M Jan 21  2016 Manacus_vitellinus.fasta
-rw-r--r-- 1 macmanes mc3bg6p 6.1M Jan 21  2016 Cariama_cristata.fasta
-rw-r--r-- 1 macmanes mc3bg6p 5.2M Jan 21  2016 Cathartes_aura.fasta

Any ideas?

support python 3.4 and later?

Are there plans to support python 3.4 and later?
I'm using scikit-bio for many of my algorithms and it requires python 3.4 and later making it incompatible with OrthoFinder :(
Thanks!

Did you think about making OrthoFinder a webapp?

Error: during Running Orthologue Prediction

Running Orthologue Prediction

1. Checking required programs are installed

Test can run "fastme -i myfilepath/SimpleTest.phy -o myfilepath/SimpleTest.tre" - ok
Test can run "dlcpar_search --version" - ok

2. Calculating gene distances

2016-09-15 09:39:04 : Done 20 of 49
2016-09-15 09:38:40 : Done 0 of 49
2016-09-15 09:38:53 : Done 10 of 49
2016-09-15 09:39:15 : Done 30 of 49
2016-09-15 09:39:29 : Processing species 0
2016-09-15 09:40:08 : Processing species 1
2016-09-15 09:40:43 : Processing species 2
2016-09-15 09:41:24 : Processing species 3
2016-09-15 09:43:03 : Processing species 4
2016-09-15 09:43:33 : Processing species 5
2016-09-15 09:44:03 : Processing species 6

3. Inferring gene and species trees

. Error: Invalid distance matrix : numerical value expected for taxon '5_6492' instead of '-inf'.
2016-09-15 09:45:33 : Done 4000 of 10413
2016-09-15 09:45:13 : Done 1000 of 10413

4. Best outgroup(s) for species tree

Traceback (most recent call last):
File "orthofinder.py", line 1190, in <module>
orthologuesResultsFilesString = get_orthologues.GetOrthologues(workingDir, resultsDir, clustersFilename_pairs, nBlast)
File "/home/george/programs/OrthoFinder-master/orthofinder/scripts/get_orthologues.py", line 566, in GetOrthologues
roots, clusters, rootedSpeciesTreeFN, nSupport = rfd.GetRoot(spTreeFN_ids, os.path.split(db.treesPatIDs)[0] + "/", rfd.GeneToSpecies_dash, nProcesses, treeFmt = 1)
File "/home/george/programs/OrthoFinder-master/orthofinder/scripts/root_from_duplications.py", line 401, in GetRoot
list_of_lists = pool.map(SupportedHierachies_wrapper2, [(fn, GeneToSpeciesMap, species, dict_clades, clade_names) for fn in glob.glob(treesDir + "/*")])
File "/home/george/anaconda2/lib/python2.7/multiprocessing/pool.py", line 251, in map
return self.map_async(func, iterable, chunksize).get()
File "/home/george/anaconda2/lib/python2.7/multiprocessing/pool.py", line 567, in get
raise self._value
scripts.newick.NewickError: Unexisting tree file or Malformed newick tree structure.
2016-09-15 09:35:32 : Writen final scores for species 0 to graph file
2016-09-15 09:35:43 : Writen final scores for species 1 to graph file
2016-09-15 09:35:54 : Writen final scores for species 2 to graph file
2016-09-15 09:36:17 : Writen final scores for species 3 to graph file
2016-09-15 09:36:27 : Writen final scores for species 4 to graph file
2016-09-15 09:36:36 : Writen final scores for species 5 to graph file
2016-09-15 09:36:45 : Writen final scores for species 6 to graph file

How to handle this error?

How to incrementally add more genomes/spp. to a previous run?

Let's say I did a test run with 4 species. Is it possible to add an additional FASTA file (i.e. another species) to the analysis, taking into account the results from the previous run, or do I have to start from scratch for the whole batch of 5 species? I read the "Using Pre-Computed BLAST Results" section and took a look at the files in the Results/Working directory, but I am lost.

-b option with results from AtomicBlast : ValueError: invalid literal for int() with base 10

Hello,

I just tried to run orthofinder with the "-b blast_result_directory" option, with the results of an atomicblast. It was a .tab format, and I renamed 'SpeciesIDs.txt', as required. I had the following error :
Traceback (most recent call last):
  File "orthofinder/orthofinder.py", line 1293, in <module>
  File "orthofinder/orthofinder.py", line 1121, in ProcessPreviousFiles
  File "orthofinder/scripts/util.py", line 260, in GetSpeciesToUse
ValueError: invalid literal for int() with base 10: 'Pu3207_1/1_1.000_1010\tPu3207_1/1_1.000_1010\t100.00\t336\t0\t0\t1\t1008\t1\t1008\t0.0\t 783\n'
Failed to execute script orthofinder

Could it be because of the .tab format ? The atomicblast results file has this format :
Ac450_1/1_1.000_646 Ac450_1/1_1.000_646 100.00 215 0 0 1 645 1 645 5e-145 509
Ac450_1/1_1.000_646 Ac450_1/1_1.000_646 100.00 215 0 0 2 646 2 646 2e-142 500
Ac450_1/1_1.000_646 Ac450_1/1_1.000_646 100.00 214 0 0 3 644 3 644 5e-138 486
Ac450_1/1_1.000_646 Ac450_1/1_1.000_646 100.00 177 0 0 644 114 644 114 4e-121 412
Ac450_1/1_1.000_646 Ac450_1/1_1.000_646 100.00 15 0 0 47 3 47 3 4e-121 39.6

Ortholog groups against a known set of proteins

Hello,
I have a set of 100 UniRef proteins. I want to find orthologs of these proteins in a set of 44 transcriptomes. Would OrthoFinder help me with this? I know OrthoFinder can find orthogroups among the 44 transcriptomes themselves, but I need orthologs of the 100 UniRef proteins in particular for a phylogenetics study.
Thanks,
Anuj,
University of Oklahoma

OSError: [Errno 39] Directory not empty

I'm attempting to get OrthoFinder working on my group's CentOS 7 system. When running the example data set I'm seeing:

conda run -- python orthofinder/orthofinder.py -f orthofinder/ExampleDataset/
Using Anaconda Cloud api site https://api.anaconda.org
Fetching package metadata: ....

This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it under certain conditions.
For details please see the License.md that came with this software.

16 thread(s) for highly parallel tasks (BLAST searches etc.)
1 thread(s) for OrthoFinder algorithm

1. Checking required programs are installed

Test can run "makeblastdb -help" - ok
Test can run "blastp -help" - ok
Test can run "mcl -h" - ok

2. Temporarily renaming sequences with unique, simple identifiers

3. Dividing up work for BLAST for parallel processing

2016-09-12 15:28:00 : Creating Blast database 1 of 4
2016-09-12 15:28:00 : Creating Blast database 2 of 4
2016-09-12 15:28:00 : Creating Blast database 3 of 4
2016-09-12 15:28:00 : Creating Blast database 4 of 4

4. Running BLAST all-versus-all

Using 16 thread(s)
2016-09-12 15:28:00 : This may take some time....
2016-09-12 15:28:00 : Done 0 of 16

5. Running OrthoFinder algorithm

2016-09-12 15:28:15 : Initial processing of each species
2016-09-12 15:28:15 : Initial processing of species 0 complete
2016-09-12 15:28:15 : Initial processing of species 1 complete
2016-09-12 15:28:16 : Initial processing of species 2 complete
2016-09-12 15:28:16 : Initial processing of species 3 complete
2016-09-12 15:28:21 : Connected putatitive homologs
2016-09-12 15:28:21 : Writen final scores for species 0 to graph file
2016-09-12 15:28:21 : Writen final scores for species 1 to graph file
2016-09-12 15:28:21 : Writen final scores for species 2 to graph file
2016-09-12 15:28:21 : Writen final scores for species 3 to graph file
2016-09-12 15:28:22 : Ran MCL

6. Writing orthogroups to file

Orthogroups have been written to tab-delimited files:
/scicomp/home/user/bin/OrthoFinder-master/orthofinder/ExampleDataset/Results_Sep12/Orthogroups.csv
/scicomp/home/user/bin/OrthoFinder-master/orthofinder/ExampleDataset/Results_Sep12/Orthogroups.txt (OrthoMCL format)
/scicomp/home/user/bin/OrthoFinder-master/orthofinder/ExampleDataset/Results_Sep12/Orthogroups_UnassignedGenes.csv

Running Orthologue Prediction

1. Checking required programs are installed

Test can run "fastme -i /scicomp/home/user/bin/OrthoFinder-master/orthofinder/ExampleDataset/Results_Sep12/WorkingDirectory/SimpleTest.phy -o /scicomp/home/user/bin/OrthoFinder-master/orthofinder/ExampleDataset/Results_Sep12/WorkingDirectory/SimpleTest.tre" - ok
Test can run "dlcpar_search --version" - ok

2. Calculating gene distances

2016-09-12 15:28:24 : Done 0 of 16
2016-09-12 15:28:25 : Processing species 0
2016-09-12 15:28:25 : Processing species 1
2016-09-12 15:28:25 : Processing species 2
2016-09-12 15:28:25 : Processing species 3

3. Inferring gene and species trees

2016-09-12 15:28:28 : Done 0 of 315
2016-09-12 15:28:28 : Done 100 of 315
2016-09-12 15:28:30 : Done 200 of 315

4. Best outgroup(s) for species tree

Observed 3 duplications. 3 support the best root and 0 contradict it.
Best outgroup for species tree:
Mycoplasma_agalactiae, Mycoplasma_hyopneumoniae

5. Reconciling gene and species trees

Outgroup: Mycoplasma_agalactiae, Mycoplasma_hyopneumoniae
2016-09-12 15:28:40 : Done 0 of 314
2016-09-12 15:28:48 : Done 100 of 314
2016-09-12 15:28:54 : Done 200 of 314

6. Inferring orthologues from gene trees

2016-09-12 15:29:06 : Processing orthologues for species 0
2016-09-12 15:29:06 : Processing orthologues for species 1
2016-09-12 15:29:06 : Processing orthologues for species 2
2016-09-12 15:29:06 : Processing orthologues for species 3
Traceback (most recent call last):
File "orthofinder/orthofinder.py", line 1190, in <module>
orthologuesResultsFilesString = get_orthologues.GetOrthologues(workingDir, resultsDir, clustersFilename_pairs, nBlast)
File "/scicomp/home/user/bin/OrthoFinder-master/orthofinder/scripts/get_orthologues.py", line 602, in GetOrthologues
CleanWorkingDir(db)
File "/scicomp/home/user/bin/OrthoFinder-master/orthofinder/scripts/get_orthologues.py", line 540, in CleanWorkingDir
shutil.rmtree(dFull)
File "/scicomp/home/user/bin/anaconda2/lib/python2.7/shutil.py", line 256, in rmtree
onerror(os.rmdir, path, sys.exc_info())
File "/scicomp/home/user/bin/anaconda2/lib/python2.7/shutil.py", line 254, in rmtree
os.rmdir(path)
OSError: [Errno 39] Directory not empty: '/scicomp/home/user/bin/OrthoFinder-master/orthofinder/ExampleDataset/Results_Sep12/Orthologues_Sep12/WorkingDirectory/Trees_ids_arbitraryRoot/'
user@monolith1> cd orthofinder/ExampleDataset/Results_Sep12/
user@monolith1> ls -a Orthologues_Sep12/WorkingDirectory/Trees_ids_arbitraryRoot/
. ..

I'm running Anaconda Python 2.7.11, MCL-edge 14-137, NCBI BLAST+ 2.3.0, Fastme 2.1.5, DLCpar 0.9.1.

The error seemed related to a discussion that I found here.

I'm very excited about this software and looking forward to hearing back about resolving this issue.

FastTree illegal characters

Hi David,
I had a problem running the trees_for_orthogroups.py script.
The alignment stage worked and I received alignments for all orthogroups in /Alignments. However, quite a lot of the tree output files in /Trees were blank. After some searching I found that some of my species names contained symbols that FastTree treats as illegal, such as brackets, so none of the trees containing those species had run. Once I found this out it was easily fixed, but I was wondering if you could add a check for illegal characters to the pipeline to stop this from happening.
Cheers
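A check of the kind requested here might look like the sketch below. The exact set of characters that FastTree rejects is an assumption. The set used is the characters that have structural meaning in Newick trees (brackets, colon, semicolon, comma, quotes) plus whitespace, and the function name is hypothetical.

```python
from typing import List

# Characters with structural meaning in the Newick format.
NEWICK_SPECIAL = set("()[]:;,'\"")


def unsafe_characters(name: str) -> List[str]:
    """Return the sorted distinct characters in name that may break tree inference."""
    return sorted({ch for ch in name if ch in NEWICK_SPECIAL or ch.isspace()})


if __name__ == "__main__":
    for species in ("Homo_sapiens", "E.coli(K12)"):
        print(species, unsafe_characters(species))
```

Running such a check on species and sequence names before alignment would flag the problem before any tree files come out blank.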

Syntax error

How can I solve this error?

File "orthofinder.py", line 229
with open(clustersFilename, 'rb') as clusterFile, open(newFilename, "wb") as output:
^
SyntaxError: invalid syntax

Include shebang line with the interpreter directive

Hi David,

I'm using OrthoFinder through a pipeline module I made for agalma. The aim is to integrate OrthoFinder into the general agalma pipelines as an alternative to the default methods used in agalma for building a concatenated matrix of orthologous loci.

These modules require that third-party software be installed in the PATH as executables, so Python scripts like OrthoFinder need a shebang pointing to python. I know that anyone can add it, and that there are workarounds (like using aliases), but I think it would simplify setting up the module if you could add the shebang #!/usr/bin/env python to your scripts, rather than having to explain to users how to deal with this.

Please consider this low-cost addition.

Cheers,
Andrés

Allow gzipped externally-computed blast results?

Hi David

Would it be easy to add support for gzipped Blast results? eg Blast0_1.txt.gz files in the case of externally computed Blasts?

Thanks!
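One way such support could work is to choose the opener from the file extension, as in the sketch below. The helper name is hypothetical and this is not how OrthoFinder currently reads its input; it only relies on Python's standard gzip module.

```python
import gzip
from typing import IO


def open_blast_results(path: str) -> IO[str]:
    """Open a BLAST results file as text, decompressing if it ends in .gz."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")  # "rt" yields str lines, like a plain text file
    return open(path, "r")
```

Code that iterates over the lines of the returned handle would then work unchanged for both Blast0_1.txt and Blast0_1.txt.gz.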

Having problems with the Blast run

Hi, I have 63 proteomes I am trying to run OrthoFinder on.
The script worked fine with the example dataset.
However, when I tried it on my own datasets, the initial files appear to have been created correctly, as were SpeciesIDs.txt and SequenceIDs.txt, but the BLAST files came out empty!
So for each species I am getting the error: "WARNING: Too few hits between species %d and species %d to normalise the scores, these hits will be ignored".
As a result, I do not get any orthogroups at all.
What could be the problem?

use of longest transcripts

Hi,

Why is it best to use the longest transcript for an OrthoFinder analysis?

Here is where you recommend this: https://github.com/davidemms/OrthoFinder/blob/d86154429ea5ca5977b0b24df9ddfb248edc20cd/README.md#performing-your-own-orthofinder-analysis

Best,
/SB

search for paralogs on genes w/o orthologs

I think this suggestion may be outside the scope of the program, but it would be a nice addition to have an option to also cluster paralogs that have no orthologs.

In most cases users are really interested in orthogroups, but sometimes we also want to check families of proteins that are "exclusive" to a single organism (or, more precisely, not yet found in other species).

Cannot allocate memory error

Hi David,

I keep getting the following error while running OrthoFinder on a cluster:

Traceback (most recent call last):
  File "/nfs/users/nfs_d/dd6/apps/OrthoFinder-0.4/orthofinder.py", line 1240, in <module>
    pool.map(RunCommandReport, commands)   
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 227, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 528, in get
    raise self._value
OSError: [Errno 12] Cannot allocate memory

When I check the job log file, it seems I've allocated enough memory (8Gb with the job only using ~300 Mb)... so I'm not sure what is going on here.

Best,
Daryl

Precomputed BLAST results and -x without adding new species fails

Hi,

I precomputed BLAST results by running the BLAST jobs produced by the -p option on a cluster...

Now I'd like to start Orthofinder like this:
orthofinder.py -b phytozome/Results_Jul21/WorkingDirectory/ -a 8 -t 2 -x combined3.txt

where combined3.txt looks like this:
head -n2 combined3.txt
Species63.fa Zea mays 4577 Phytozome 11 Zmays_284_5b+.protein_primaryTranscriptOnly.fa
Species14.fa Carica papaya 3649 Phytozome 11 Cpapaya_113_ASGPBv0.4.protein_primaryTranscriptOnly.fa

The run fails at the
2b. Reading species information file stage
Traceback (most recent call last):
  File "/home/pgsb/daniel.lang/software/OrthoFinder/orthofinder/orthofinder.py", line 1046, in <module>
    userFastaFilenames = previousFastaFiles + userFastaFilenames
NameError: name 'userFastaFilenames' is not defined

I tried to initialize userFastaFilenames as an empty list like previousFastaFiles but that also does not solve it because then
Traceback (most recent call last):
  File "/home/pgsb/daniel.lang/software/OrthoFinder/orthofinder/orthofinder.py", line 1075, in <module>
    speciesInfo[iSpecies] = line
IndexError: list assignment index out of range

Somehow I suspect that the -x switch only works if you start from scratch.

The original Fasta files are named like in the last column of combined3 but in two locations this run does not really know about phytozome/ and another directory...

Do I need to put the original Fasta files someplace together or at least symlink them there?

Best,
Daniel

Option -os not working

Hi David,

I am running OrthoFinder 1.1.4 and I noticed in the help info (-h) that it is possible to stop orthofinder after the orthogroup inference, while still outputting a fasta file per orthogroup (option -os). However, when I try this, OrthoFinder infers the orthogroups just fine, but then starts running Orthologue Prediction. This seems strange to me since it should stop before that step? In my case, this also produces the error "fastme: command not found" since I didn't install fastme (because I don't want OrthoFinder to infer trees). When I check the OrthoFinder output, I cannot find fasta files per orthogroup.

Am I understanding the "-os" option incorrectly, or is it possible that this option does not show the desired behaviour?

I enclosed my output in the file "OrthoFinder_error.txt" and the command I ran is the following:
orthofinder -f $pin_sequences -og -t 4

Kind regards,
Stijn
OrthoFinder_error.txt

cPickle

Another dependency:

  File "orthofinder.py", line 40, in <module>
    import cPickle as pic # Y

ImportError: No module named 'cPickle'
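
`cPickle` exists only in Python 2; in Python 3 the same functionality lives in the standard `pickle` module, so this error suggests the script is being run under Python 3. A common compatibility pattern, shown here as a sketch rather than as the project's actual fix, is to fall back when the import fails. The alias `pic` follows the traceback above; the sample data is made up.

```python
try:
    import cPickle as pic   # Python 2: the C implementation
except ImportError:
    import pickle as pic    # Python 3: cPickle was merged into pickle

# Round-trip a small made-up object to show the alias behaves the same way.
data = {"OG0000001": ["gene_a", "gene_b"]}
restored = pic.loads(pic.dumps(data))
print(restored == data)
```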

VisibleDeprecationWarning

FYI, while running OrthoFinder using the new 'removing species' functionality, I get a message that 'rank' is deprecated.

python /share/OrthoFinder/orthofinder.py -b ../Results_Jan21/WorkingDirectory/

OrthoFinder version 0.4.0 Copyright (C) 2014 David Emms

    This program comes with ABSOLUTELY NO WARRANTY.
    This is free software, and you are welcome to redistribute it under certain conditions.
    For details please see the License.md that came with this software.

Using previously calculated BLAST results in /mnt/data3/macmanes/bird/genomes/Results_Jan21/WorkingDirectory/

1. Checking required programs are installed
-------------------------------------------
Test can run "mcl -h" - ok

2. Temporarily renaming sequences with unique, simple identifiers
------------------------------------------------------------------
Skipping

3. Dividing up work for BLAST for parallel processing
-----------------------------------------------------
Skipping

4. Running BLAST all-versus-all
-------------------------------
Skipping

5. Running OrthoFinder algorithm
--------------------------------
2016-02-04 06:09:59.228121 : Started
2016-02-04 06:10:01.382200 : Got sequence lengths
2016-02-04 06:10:01.382320 : Initial processing of each species
/usr/local/lib/python2.7/dist-packages/numpy/core/fromnumeric.py:2645: VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
  VisibleDeprecationWarning)

fixed inflation parameter?

Hi
in reference to line 340
command = ["mcl", graphFilename, "-I", "1.5", "-o", clustersFilename]

Should "1.5" be str(inflation) ?
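
The change being suggested can be sketched as below. The helper name `build_mcl_command` and its default value are assumptions for illustration; only the `mcl ... -I ... -o ...` argument layout is taken from the line quoted above.

```python
def build_mcl_command(graph_filename, clusters_filename, inflation=1.5):
    """Return the argument list for an MCL run with a configurable inflation value."""
    # str() is needed because subprocess argument lists must contain strings.
    return ["mcl", graph_filename, "-I", str(inflation), "-o", clusters_filename]

if __name__ == "__main__":
    print(build_mcl_command("graph.txt", "clusters.txt", inflation=2.0))
```

With the value passed through like this, a user-supplied inflation parameter would actually reach MCL instead of being silently replaced by the hard-coded 1.5.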

Malformatted blast result error

I've been running OrthoFinder successfully using both the included example files and a subset (~30) of the 266 genomes I eventually need to analyze. I'm including OrthoFinder as part of a larger pipeline, so I'm gradually scaling up as I build things out.

Last week I tried to run the full 266 genome set for the first time, and I got an error in the step where all of the blast results are being processed for each genome (Date time : Initial processing of species # complete). The error was "Malformatted line in /path/Blast#_#.txt, offending line was: ..."

It appeared that the error might be related to Issue #31, so I updated my installation to the current release (1.1.2). I am using MCL-edge/14-137 and ncbi-blast+/2.3.0. I'm only interested in orthogroup inference, so I am not using any other dependencies.

When I retried the same set of genomes, I got a similar error.

2017-01-08 17:33:05 : Initial processing of species 186 complete
2017-01-08 17:35:50 : Initial processing of species 187 complete
2017-01-08 17:38:30 : Initial processing of species 188 complete
Malformatted line in /scicomp/home/igy7/git-repos/ts4/results/aaORFs/Results_Jan06/WorkingDirectory/Blast189_45.txt
Offending line was:
189_3812        45_1697 22.273  220     156     8       474     679     8       226     0.001   38.5
ERROR: An error occurred, please review previous error messages for more information.

When I reviewed the file Blast189_45.txt, this is what I found:

189_3812        45_1697 22.273  220     156     8       474     679     8       226     0.001   38.5
189_3813        45_4358 98.436  5561    85^@^@^@^@ [long run of ^@ (NUL) characters shortened] ^@^@^@^@189_3814    45_4357 99.765  425     1       0       1       425     1       425     0.0     842
189_3814        45_820  22.857  420     246     15      10      408     27      389     3.54e-16        77.0
189_3814        45_4357 99.765  425     1       0       1       425     1       425     0.0     842
189_3814        45_820  22.857  420     246     15      10      408     27      389     3.54e-16        77.0

So 189_3813 seems to be the real problem. When I ran grep 45_4358 Blast45_189.txt, the results seem normal.

45_4358 189_3813        98.436  5561    85      1       1       5559    1       5561    0.0     10909

The fasta input files for both of these appear normal to me. Any suggestions?

Thank you!
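
The `^@` characters shown above are NUL bytes, which usually means the results file itself was corrupted or only partially written rather than that the input sequences are at fault. As a rough sketch for locating every affected file before rerunning, something like the following could be used. The function name and the `WorkingDirectory` path are placeholders, and regenerating the affected BLAST files afterwards is an assumed remedy, not a confirmed one.

```python
import glob
import os

def find_nul_lines(path):
    """Return the 1-based numbers of lines in a file that contain NUL bytes."""
    bad = []
    with open(path, "rb") as handle:
        for number, line in enumerate(handle, start=1):
            if b"\x00" in line:
                bad.append(number)
    return bad

if __name__ == "__main__":
    # "WorkingDirectory" is a placeholder; point this at your own results folder.
    for path in sorted(glob.glob(os.path.join("WorkingDirectory", "Blast*.txt"))):
        lines = find_nul_lines(path)
        if lines:
            print("%s: NUL bytes on line(s) %s" % (path, lines))
```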

Error during Running Orthologue Prediction, Step 5, on ExampleDataset

Hi David

I tried running orthofinder (checked out today at 5pm), with the new functionality, and am getting the errors below even with your Tests/Input/ExampleDataset of 4 Mycoplasma protein files.

Any thoughts on what could be causing the errors? The program gives errors during Running Orthologue Prediction, Step 5, but then seems to finish correctly.

However, when I then look at Species-by-species orthologues:
.../Tests/Input/ExampleDataset/Results_Sep26/Orthologues_Sep26/Orthologues/Orthologues_Mycoplasma__/_ - the files there just have headers and nothing else in them.

From the error messages below, it looks like it is trying to create an output directory which is misnamed: eg

/scratch/skumar/Tests/Input/ExampleDataset/Results_Sep26/Orthologues_Sep26/WorkingDirectory/Trees_ids_arbitraryRoot/OG0000310_tree_id.txt/scratch/skumar/Tests/Input/ExampleDataset/Results_Sep26/Orthologues_Sep26/WorkingDirectory/dlcpar/OG0000310_tree_id.coal.tree

(which has the /scratch/skumar/.... bit in it twice)

Thanks,

Sujai

ps. A dump of the full ExampleDataset folder as run below is at: ftp://ftp.ed.ac.uk/edupload/ExampleDataset.tar.gz

orthofinder.py -f /scratch/skumar/Tests/Input/ExampleDataset -t 16

OrthoFinder version 1.0.5 Copyright (C) 2014 David Emms

    This program comes with ABSOLUTELY NO WARRANTY.
    This is free software, and you are welcome to redistribute it under certain conditions.
    For details please see the License.md that came with this software.

16 thread(s) for highly parallel tasks (BLAST searches etc.)
1 thread(s) for OrthoFinder algorithm

1. Checking required programs are installed
-------------------------------------------
Test can run "makeblastdb -help" - ok
Test can run "blastp -help" - ok
Test can run "mcl -h" - ok

2. Temporarily renaming sequences with unique, simple identifiers
------------------------------------------------------------------

3. Dividing up work for BLAST for parallel processing
-----------------------------------------------------
2016-09-26 18:30:33 : Creating Blast database 1 of 4
2016-09-26 18:30:33 : Creating Blast database 2 of 4
2016-09-26 18:30:33 : Creating Blast database 3 of 4
2016-09-26 18:30:33 : Creating Blast database 4 of 4

4. Running BLAST all-versus-all
-------------------------------
Using 16 thread(s)
2016-09-26 18:30:33 : This may take some time....
2016-09-26 18:30:33 : Done 0 of 16

5. Running OrthoFinder algorithm
--------------------------------
2016-09-26 18:30:53 : Initial processing of each species
2016-09-26 18:30:53 : Initial processing of species 0 complete
2016-09-26 18:30:54 : Initial processing of species 1 complete
2016-09-26 18:30:54 : Initial processing of species 2 complete
2016-09-26 18:30:54 : Initial processing of species 3 complete
2016-09-26 18:30:59 : Connected putatitive homologs
2016-09-26 18:30:59 : Writen final scores for species 0 to graph file
2016-09-26 18:30:59 : Writen final scores for species 1 to graph file
2016-09-26 18:30:59 : Writen final scores for species 2 to graph file
2016-09-26 18:30:59 : Writen final scores for species 3 to graph file
2016-09-26 18:31:00 : Ran MCL

6. Writing orthogroups to file
------------------------------
A duplicate accession was found using just first part: A1
Tried to use only the first part of the accession in order to list the sequences in each orthogroup
more concisely but these were not unique. The full accession line will be used instead.

Orthogroups have been written to tab-delimited files:
   /scratch/skumar/Tests/Input/ExampleDataset/Results_Sep26/Orthogroups.csv
   /scratch/skumar/Tests/Input/ExampleDataset/Results_Sep26/Orthogroups.txt (OrthoMCL format)
   /scratch/skumar/Tests/Input/ExampleDataset/Results_Sep26/Orthogroups_UnassignedGenes.csv

Running Orthologue Prediction
=============================

1. Checking required programs are installed
-------------------------------------------
Test can run "fastme -i /scratch/skumar/Tests/Input/ExampleDataset/Results_Sep26/WorkingDirectory/SimpleTest.phy -o /scratch/skumar/Tests/Input/ExampleDataset/Results_Sep26/WorkingDirectory/SimpleTest.tre" - ok
Test can run "dlcpar_search --version" - ok

2. Calculating gene distances
-----------------------------
2016-09-26 18:31:00 : Done 0 of 16
2016-09-26 18:31:02 : Processing species 0
2016-09-26 18:31:02 : Processing species 1
2016-09-26 18:31:02 : Processing species 2
2016-09-26 18:31:03 : Processing species 3

3. Inferring gene and species trees
-----------------------------------
2016-09-26 18:31:03 : Done 0 of 315
2016-09-26 18:31:03 : Done 100 of 315
2016-09-26 18:31:03 : Done 200 of 315
A duplicate accession was found using just first part: A1
Tried to use only the first part of the accession in order to list the sequences in each orthogroup
more concisely but these were not unique. The full accession line will be used instead.


4. Best outgroup(s) for species tree
------------------------------------
Observed 3 duplications. 3 support the best root and 0 contradict it.
Best outgroup for species tree:
  Mycoplasma_agalactiae_5632_FP671138, Mycoplasma_hyopneumoniae_AE017243

5. Reconciling gene and species trees
-------------------------------------
Outgroup: Mycoplasma_agalactiae_5632_FP671138, Mycoplasma_hyopneumoniae_AE017243
2016-09-26 18:31:07 : Done 0 of 314
Traceback (most recent call last):
  File "/exports/virt_env/python/orthofinder/bin/dlcpar_search", line 209, in <module>
    sys.exit(main())
  File "/exports/virt_env/python/orthofinder/bin/dlcpar_search", line 201, in main
    phyloDLC.write_dlcoal_recon(out, coal_tree, maxrecon)
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/yjw/bio/phyloDLC.py", line 339, in write_dlcoal_recon
    recon.write(filename, coal_tree, exts=exts, filenames=filenames)
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/yjw/bio/phyloDLC.py", line 169, in write
Traceback (most recent call last):
    rootData=True)
  File "/exports/virt_env/python/orthofinder/bin/dlcpar_search", line 209, in <module>
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/rasmus/treelib.py", line 602, in write_newick
    sys.exit(main())
  File "/exports/virt_env/python/orthofinder/bin/dlcpar_search", line 201, in main
    write_newick(self, util.open_stream(out, "w"),
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/rasmus/util.py", line 1170, in open_stream
    phyloDLC.write_dlcoal_recon(out, coal_tree, maxrecon)
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/yjw/bio/phyloDLC.py", line 339, in write_dlcoal_recon
    recon.write(filename, coal_tree, exts=exts, filenames=filenames)
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/yjw/bio/phyloDLC.py", line 169, in write
    rootData=True)
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/rasmus/treelib.py", line 602, in write_newick
    stream = open(filename, mode)
IOError: [Errno 20] Not a directory: '/scratch/skumar/Tests/Input/ExampleDataset/Results_Sep26/Orthologues_Sep26/WorkingDirectory/Trees_ids_arbitraryRoot/OG0000014_tree_id.txt/scratch/skumar/Tests/Input/ExampleDataset/Results_Sep26/Orthologues_Sep26/WorkingDirectory/dlcpar/OG0000014_tree_id.coal.tree'
    write_newick(self, util.open_stream(out, "w"),
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/rasmus/util.py", line 1170, in open_stream
    stream = open(filename, mode)
IOError: [Errno 20] Not a directory: '/scratch/skumar/Tests/Input/ExampleDataset/Results_Sep26/Orthologues_Sep26/WorkingDirectory/Trees_ids_arbitraryRoot/OG0000008_tree_id.txt/scratch/skumar/Tests/Input/ExampleDataset/Results_Sep26/Orthologues_Sep26/WorkingDirectory/dlcpar/OG0000008_tree_id.coal.tree'
Traceback (most recent call last):
  File "/exports/virt_env/python/orthofinder/bin/dlcpar_search", line 209, in <module>
    sys.exit(main())
  File "/exports/virt_env/python/orthofinder/bin/dlcpar_search", line 201, in main
    phyloDLC.write_dlcoal_recon(out, coal_tree, maxrecon)
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/yjw/bio/phyloDLC.py", line 339, in write_dlcoal_recon
    recon.write(filename, coal_tree, exts=exts, filenames=filenames)
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/yjw/bio/phyloDLC.py", line 169, in write
    rootData=True)
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/rasmus/treelib.py", line 602, in write_newick
    write_newick(self, util.open_stream(out, "w"),
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/rasmus/util.py", line 1170, in open_stream
Traceback (most recent call last):
  File "/exports/virt_env/python/orthofinder/bin/dlcpar_search", line 209, in <module>
    sys.exit(main())
  File "/exports/virt_env/python/orthofinder/bin/dlcpar_search", line 201, in main
    stream = open(filename, mode)
IOError:     phyloDLC.write_dlcoal_recon(out, coal_tree, maxrecon)
[Errno 20] Not a directory: '/scratch/skumar/Tests/Input/ExampleDataset/Results_Sep26/Orthologues_Sep26/WorkingDirectory/Trees_ids_arbitraryRoot/OG0000006_tree_id.txt/scratch/skumar/Tests/Input/ExampleDataset/Results_Sep26/Orthologues_Sep26/WorkingDirectory/dlcpar/OG0000006_tree_id.coal.tree'  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/yjw/bio/phyloDLC.py", line 339, in write_dlcoal_recon

    recon.write(filename, coal_tree, exts=exts, filenames=filenames)
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/yjw/bio/phyloDLC.py", line 169, in write
    rootData=True)
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/rasmus/treelib.py", line 602, in write_newick
    write_newick(self, util.open_stream(out, "w"),
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/rasmus/util.py", line 1170, in open_stream
    stream = open(filename, mode)
IOError: [Errno 20] Not a directory: '/scratch/skumar/Tests/Input/ExampleDataset/Results_Sep26/Orthologues_Sep26/WorkingDirectory/Trees_ids_arbitraryRoot/OG0000011_tree_id.txt/scratch/skumar/Tests/Input/ExampleDataset/Results_Sep26/Orthologues_Sep26/WorkingDirectory/dlcpar/OG0000011_tree_id.coal.tree'
Traceback (most recent call last):
  File "/exports/virt_env/python/orthofinder/bin/dlcpar_search", line 209, in <module>
    sys.exit(main())
  File "/exports/virt_env/python/orthofinder/bin/dlcpar_search", line 201, in main
    phyloDLC.write_dlcoal_recon(out, coal_tree, maxrecon)
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/yjw/bio/phyloDLC.py", line 339, in write_dlcoal_recon
    recon.write(filename, coal_tree, exts=exts, filenames=filenames)
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/yjw/bio/phyloDLC.py", line 169, in write
    rootData=True)
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/rasmus/treelib.py", line 602, in write_newick
    write_newick(self, util.open_stream(out, "w"),
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/rasmus/util.py", line 1170, in open_stream
    stream = open(filename, mode)
IOError: [Errno 20] Not a directory: '/scratch/skumar/Tests/Input/ExampleDataset/Results_Sep26/Orthologues_Sep26/WorkingDirectory/Trees_ids_arbitraryRoot/OG0000015_tree_id.txt/scratch/skumar/Tests/Input/ExampleDataset/Results_Sep26/Orthologues_Sep26/WorkingDirectory/dlcpar/OG0000015_tree_id.coal.tree'
Traceback (most recent call last):
  File "/exports/virt_env/python/orthofinder/bin/dlcpar_search", line 209, in <module>
    sys.exit(main())
  File "/exports/virt_env/python/orthofinder/bin/dlcpar_search", line 201, in main
    phyloDLC.write_dlcoal_recon(out, coal_tree, maxrecon)
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/yjw/bio/phyloDLC.py", line 339, in write_dlcoal_recon
    recon.write(filename, coal_tree, exts=exts, filenames=filenames)
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/yjw/bio/phyloDLC.py", line 169, in write
    rootData=True)
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/rasmus/treelib.py", line 602, in write_newick
    write_newick(self, util.open_stream(out, "w"),
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/rasmus/util.py", line 1170, in open_stream
    stream = open(filename, mode)
IOError: [Errno 20] Not a directory: '/scratch/skumar/Tests/Input/ExampleDataset/Results_Sep26/Orthologues_Sep26/WorkingDirectory/Trees_ids_arbitraryRoot/OG0000004_tree_id.txt/scratch/skumar/Tests/Input/ExampleDataset/Results_Sep26/Orthologues_Sep26/WorkingDirectory/dlcpar/OG0000004_tree_id.coal.tree'
Traceback (most recent call last):
  File "/exports/virt_env/python/orthofinder/bin/dlcpar_search", line 209, in <module>
    sys.exit(main())
  File "/exports/virt_env/python/orthofinder/bin/dlcpar_search", line 201, in main
    phyloDLC.write_dlcoal_recon(out, coal_tree, maxrecon)
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/yjw/bio/phyloDLC.py", line 339, in write_dlcoal_recon
    recon.write(filename, coal_tree, exts=exts, filenames=filenames)
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/yjw/bio/phyloDLC.py", line 169, in write
    rootData=True)
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/rasmus/treelib.py", line 602, in write_newick
    write_newick(self, util.open_stream(out, "w"),
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/rasmus/util.py", line 1170, in open_stream
    stream = open(filename, mode)
IOError: [Errno 20] Not a directory: '/scratch/skumar/Tests/Input/ExampleDataset/Results_Sep26/Orthologues_Sep26/WorkingDirectory/Trees_ids_arbitraryRoot/OG0000013_tree_id.txt/scratch/skumar/Tests/Input/ExampleDataset/Results_Sep26/Orthologues_Sep26/WorkingDirectory/dlcpar/OG0000013_tree_id.coal.tree'
Traceback (most recent call last):
  File "/exports/virt_env/python/orthofinder/bin/dlcpar_search", line 209, in <module>
    sys.exit(main())
  File "/exports/virt_env/python/orthofinder/bin/dlcpar_search", line 201, in main
    phyloDLC.write_dlcoal_recon(out, coal_tree, maxrecon)
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/yjw/bio/phyloDLC.py", line 339, in write_dlcoal_recon
    recon.write(filename, coal_tree, exts=exts, filenames=filenames)
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/yjw/bio/phyloDLC.py", line 169, in write
    rootData=True)
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/rasmus/treelib.py", line 602, in write_newick
    write_newick(self, util.open_stream(out, "w"),
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/rasmus/util.py", line 1170, in open_stream
    stream = open(filename, mode)
IOError: [Errno 20] Not a directory: '/scratch/skumar/Tests/Input/ExampleDataset/Results_Sep26/Orthologues_Sep26/WorkingDirectory/Trees_ids_arbitraryRoot/OG0000009_tree_id.txt/scratch/skumar/Tests/Input/ExampleDataset/Results_Sep26/Orthologues_Sep26/WorkingDirectory/dlcpar/OG0000009_tree_id.coal.tree'
Traceback (most recent call last):
  File "/exports/virt_env/python/orthofinder/bin/dlcpar_search", line 209, in <module>
    sys.exit(main())
  File "/exports/virt_env/python/orthofinder/bin/dlcpar_search", line 201, in main
    phyloDLC.write_dlcoal_recon(out, coal_tree, maxrecon)
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/yjw/bio/phyloDLC.py", line 339, in write_dlcoal_recon
    recon.write(filename, coal_tree, exts=exts, filenames=filenames)
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/yjw/bio/phyloDLC.py", line 169, in write
    rootData=True)
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/rasmus/treelib.py", line 602, in write_newick
    write_newick(self, util.open_stream(out, "w"),
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/rasmus/util.py", line 1170, in open_stream
    stream = open(filename, mode)
IOError: [Errno 20] Not a directory: '/scratch/skumar/Tests/Input/ExampleDataset/Results_Sep26/Orthologues_Sep26/WorkingDirectory/Trees_ids_arbitraryRoot/OG0000007_tree_id.txt/scratch/skumar/Tests/Input/ExampleDataset/Results_Sep26/Orthologues_Sep26/WorkingDirectory/dlcpar/OG0000007_tree_id.coal.tree'
Traceback (most recent call last):
  File "/exports/virt_env/python/orthofinder/bin/dlcpar_search", line 209, in <module>
    sys.exit(main())
  File "/exports/virt_env/python/orthofinder/bin/dlcpar_search", line 201, in main
    phyloDLC.write_dlcoal_recon(out, coal_tree, maxrecon)
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/yjw/bio/phyloDLC.py", line 339, in write_dlcoal_recon
    recon.write(filename, coal_tree, exts=exts, filenames=filenames)
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/yjw/bio/phyloDLC.py", line 169, in write
    rootData=True)
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/rasmus/treelib.py", line 602, in write_newick
    write_newick(self, util.open_stream(out, "w"),
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/rasmus/util.py", line 1170, in open_stream
    stream = open(filename, mode)
IOError: [Errno 20] Not a directory: '/scratch/skumar/Tests/Input/ExampleDataset/Results_Sep26/Orthologues_Sep26/WorkingDirectory/Trees_ids_arbitraryRoot/OG0000003_tree_id.txt/scratch/skumar/Tests/Input/ExampleDataset/Results_Sep26/Orthologues_Sep26/WorkingDirectory/dlcpar/OG0000003_tree_id.coal.tree'
Traceback (most recent call last):
  File "/exports/virt_env/python/orthofinder/bin/dlcpar_search", line 209, in <module>
    sys.exit(main())
  File "/exports/virt_env/python/orthofinder/bin/dlcpar_search", line 201, in main
    phyloDLC.write_dlcoal_recon(out, coal_tree, maxrecon)
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/yjw/bio/phyloDLC.py", line 339, in write_dlcoal_recon
    recon.write(filename, coal_tree, exts=exts, filenames=filenames)
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/yjw/bio/phyloDLC.py", line 169, in write
    rootData=True)
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/rasmus/treelib.py", line 602, in write_newick
    write_newick(self, util.open_stream(out, "w"),
  File "/exports/virt_env/python/orthofinder/lib/python2.7/site-packages/dlcpar/deps/rasmus/util.py", line 1170, in open_stream
    stream = open(filename, mode)
IOError: [Errno 20] Not a directory: '/scratch/skumar/Tests/Input/ExampleDataset/Results_Sep26/Orthologues_Sep26/WorkingDirectory/Trees_ids_arbitraryRoot/OG0000010_tree_id.txt/scratch/skumar/Tests/Input/ExampleDataset/Results_Sep26/Orthologues_Sep26/WorkingDirectory/dlcpar/OG0000010_tree_id.coal.tree'

(error messages continue for each Orthogroup)

6. Inferring orthologues from gene trees
----------------------------------------
2016-09-26 18:31:28 : Processing orthologues for species 0
2016-09-26 18:31:28 : Processing orthologues for species 1
2016-09-26 18:31:28 : Processing orthologues for species 2
2016-09-26 18:31:28 : Processing orthologues for species 3

7. Writing results files
------------------------
Orthogroups have been written to tab-delimited files:
   /scratch/skumar/Tests/Input/ExampleDataset/Results_Sep26/Orthogroups.csv
   /scratch/skumar/Tests/Input/ExampleDataset/Results_Sep26/Orthogroups.txt (OrthoMCL format)
   /scratch/skumar/Tests/Input/ExampleDataset/Results_Sep26/Orthogroups_UnassignedGenes.csv

Gene trees:
   /scratch/skumar/Tests/Input/ExampleDataset/Results_Sep26/Gene_Trees

Rooted species tree:
   /scratch/skumar/Tests/Input/ExampleDataset/Results_Sep26/Orthologues_Sep26/SpeciesTree_rooted.txt

Species-by-species orthologues:
   /scratch/skumar/Tests/Input/ExampleDataset/Results_Sep26/Orthologues_Sep26

Orthogroup statistics:
   Statistics_PerSpecies.csv   Statistics_Overall.csv   Orthogroups_SpeciesOverlaps.csv

OrthoFinder assigned 1938 genes (70.9% of total) to 536 orthogroups. Fifty percent of all genes were in orthogroups
with 4 or more genes (G50 was 4) and were contained in the largest 300 orthogroups (O50 was 300). There were 280
orthogroups with all species present and 253 of these consisted entirely of single-copy genes.

When publishing work that uses OrthoFinder please cite:
    D.M. Emms & S. Kelly (2015), OrthoFinder: solving fundamental biases in whole genome comparisons
    dramatically improves orthogroup inference accuracy, Genome Biology 16:157.

Any suggestions for an easy way to generate files containing CDS sequences for ortholog groups?

Hello,
I used OrthoFinder to cluster orthologous groups from the proteomes of 7 plants and it worked great! I plan to infer gene trees for ~6000 of these OGs. I used trees_for_orthogroups.py to generate a separate file for each OG (output in the Sequences/ directory). My problem is that I would like to use DNA sequences for downstream analyses, rather than the protein sequences. So I would like to generate a separate file for each OG that contains the CDS (DNA) sequences of the orthologs. I have CDS FASTA files for each of the species, but I am faced with the challenge of extracting the appropriate sequences and putting them into the appropriate files. I don't imagine it would be too difficult to code, but I figured I'd see whether anyone already has a solution so that I don't have to reinvent the wheel!
Thank you very much!
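
One way this could be scripted is sketched below, under several assumptions that should be checked against your own files: that Orthogroups.csv is tab-delimited with a header row, the orthogroup ID in the first column and comma-separated gene IDs in the remaining columns; that the first word of each CDS FASTA header matches the protein ID used in the orthogroups file; and that Python 3 is available. The function names are hypothetical.

```python
import csv
import os

def read_fasta(path):
    """Return {record_id: sequence}, taking the ID as the first word of each header."""
    records, current = {}, None
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                current = line[1:].split()[0]
                records[current] = []
            elif current is not None:
                records[current].append(line.strip())
    return {name: "".join(parts) for name, parts in records.items()}

def write_cds_per_orthogroup(orthogroups_csv, cds_fasta_paths, out_dir):
    """Write one CDS FASTA file per orthogroup; return gene IDs with no CDS found."""
    cds = {}
    for path in cds_fasta_paths:
        cds.update(read_fasta(path))
    os.makedirs(out_dir, exist_ok=True)
    missing = []
    with open(orthogroups_csv, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader)  # skip the header row of species names
        for row in reader:
            if not row:
                continue
            og_id = row[0]
            genes = [g.strip() for cell in row[1:] for g in cell.split(",") if g.strip()]
            with open(os.path.join(out_dir, og_id + ".fa"), "w") as out:
                for gene in genes:
                    if gene in cds:
                        out.write(">%s\n%s\n" % (gene, cds[gene]))
                    else:
                        missing.append(gene)
    return missing
```

The returned list of missing IDs is worth inspecting, since mismatches between protein and CDS identifiers (for example a transcript suffix) are the most likely reason for sequences being left out.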

Allow custom output directory?

Thank you very much for your work on Orthofinder; it's very useful to me!

I'm working with OrthoFinder in the context of a larger pipeline, and the inability to specify even the location of the output directory is extremely vexing when I'm trying to keep my output directory orderly. Ideally I'd like to be able to specify both the location and the name of the output directory. This would save having to execute a separate mv command on the output once the run has completed.

Again, thank you very much!

scipy.optimize

Hi there,

I am trying to run OrthoFinder. I have installed Python 2.7, MCL and BLAST, but I still get this error:

Traceback (most recent call last):
  File "orthofinder.py", line 34, in <module>
    from scipy.optimize import curve_fit # install
ImportError: No module named scipy.optimize

I thought it was related to NumPy, so I reinstalled a newer version, but I still get the same error.

Any idea what the problem is?

thanks in advance for the reply

Regards
Roberto
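
NumPy and SciPy are separate packages, so reinstalling NumPy does not provide `scipy.optimize`. A minimal sketch for checking which modules the interpreter can actually import; `check_modules` is a hypothetical helper, not part of OrthoFinder, and it should be run with the same Python that runs orthofinder.py.

```python
import importlib


def check_modules(names):
    """Return {module name: True/False} for whether each import succeeds."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


if __name__ == "__main__":
    for name, ok in check_modules(["numpy", "scipy", "scipy.optimize"]).items():
        print("%-15s %s" % (name, "ok" if ok else "MISSING"))
```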

Orthogroup IDs

The orthogroup IDs in the OrthologousGroups.txt and OrthologousGroups.csv files do not match exactly. The zero-padding differs by one digit, e.g. OG000002 in the .txt file and OG0000002 in the .csv file. This is using OrthoFinder version 0.2.8.
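
As a workaround, IDs from the two files can be matched on their numeric part. A minimal sketch, assuming every ID is `OG` followed only by digits; the helper name is hypothetical.

```python
import re


def og_number(og_id):
    """Return the integer part of an ID such as 'OG000002' or 'OG0000002'."""
    match = re.match(r"^OG(\d+)$", og_id.strip())
    if match is None:
        raise ValueError("not an orthogroup ID: %r" % og_id)
    return int(match.group(1))
```

With this, `og_number("OG000002") == og_number("OG0000002")` holds, so the two spellings can be used as the same key.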

Blastp halting and not using all the threads

Hi. I have had to restart OrthoFinder three times in the last week because the blastp searches keep getting stuck and the number of threads in use drops to 1. I even tried running OrthoFinder on an EC2 instance and had the same issue. Does anyone have any idea why this keeps happening? Thanks!

"A duplicate accession was found..." error

Hi again, I ran into this error. I ran phase 1 first:

[I removed a bunch of the output to fit it here...]

1. Checking required programs are installed

Test can run "makeblastdb -help" - ok
Test can run "blastp -help" - ok
Test can run "mcl -h" - ok

2. Temporarily renaming sequences with unique, simple identifiers

Done

3. Dividing up work for BLAST for parallel processing

3a. Creating BLAST databases

4. Running BLAST all-versus-all

Maximum number of BLAST processer: 11
2015-10-19 12:30:51.644003 : This may take some time....
Done!

5. Running OrthoFinder algorithm

2015-10-22 02:51:58.743852 : Started
2015-10-22 02:51:59.966677 : Got sequence lengths
2015-10-22 02:51:59.966706 : Initial processing of each species

2015-10-22 07:47:18.320688 : Writen final scores for species 15 to graph file
[mclIO] reading <sixteen/Results_Oct19/WorkingDirectory/OrthoFinder_v0.2.8_graph.txt>
.......................................
[mclIO] read native interchange 517135x517135 matrix with 7956242 entries
[mcl] pid 27111
ite ------------------- chaos time hom(avg,lo,hi) expa expb expc fmv
1 ................... 121.04 5.24 1.00/0.03/9.22 2.88 2.52 2.52 0

42 ................... 0.00 0.13 1.00/1.00/1.00 1.00 1.00 0.08 0
[mcl] cut <14> instances of overlap
[mcl] jury pruning marks: <93,92,94>, out of 100
[mcl] jury pruning synopsis: <92.9 or scrumptious> (cf -scheme, -do log)
[mclIO] writing <sixteen/Results_Oct19/WorkingDirectory/clusters_OrthoFinder_v0.2.8_I1.5.txt>
.......................................
[mclIO] wrote native interchange 517135x99346 matrix with 517135 entries to stream <sixteen/Results_Oct19/WorkingDirectory/clusters_OrthoFinder_v0.2.8_I1.5.txt>
[mcl] 99346 clusters found
[mcl] output is in sixteen/Results_Oct19/WorkingDirectory/clusters_OrthoFinder_v0.2.8_I1.5.txt

Please cite:

2015-10-19 12:30:52.010419 : Running command: blastp -outfmt 6 -evalue 0.001 -query

sixteen/Results_Oct19/WorkingDirectory/Blast10_4.txt
2015-10-20 13:31:28.710983 : Running command: blastp -outfmt 6 -evalue 0.001 -query sixteen/Results_Oct19/WorkingDirectory/Species4.fa -db sixteen/Results_Oct19/WorkingDirectory/BlastDBSpecies10 -out sixteen/Results_Oct19/WorkingDirectory/Blast4_10.txt
2015-10-20 16:23:13.283020 : Finished command: blastp -outfmt 6 -evalue 0.001 -query sixteen/Results_Oct19/WorkingDirectory/Species4.fa -db sixteen/Results_Oct19/WorkingDirectory/BlastDBSpecies10 -out sixteen/Results_Oct19/WorkingDirectory/Blast4_10.txt
2015-10-20 16:23:13.283062 : Running command
2015-10-22 07:50:39.753291 : Ran MCL

6. Creating files for Orthologous Groups

When publishing work that uses OrthoFinder please cite:
D.M. Emms & S. Kelly (2015), OrthoFinder: solving fundamental biases in whole genome comparisons
dramatically improves orthogroup inference accuracy, Genome Biology 16:157.

A duplicate accession was found using just first part: TR31292|c0_g1_i1|m.23117
Tried to use only the first part of the accession in order to list the sequences in each orthologous group more concisely but these were not unique. Will use the full accession line instead.
Orthologous groups have been written to tab-delimited files:
sixteen/Results_Oct19/OrthologousGroups.csv
sixteen/Results_Oct19/OrthologousGroups_UnassignedGenes.csv
And in OrthoMCL format:
sixteen/Results_Oct19/OrthologousGroups.txt

And then, when I ran the phase to obtain the alignments:

This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it under certain conditions.
For details please see the License.md that came with this software.

Generating trees for orthogroups in file:
sixteen/Results_Oct19/OrthologousGroups.txt

Using 11 threads for alignments and trees

Traceback (most recent call last):
  File "trees_for_orthogroups.py", line 310, in <module>
    idDict = GetIDsDict(orthofinderWorkingDir)
  File "trees_for_orthogroups.py", line 235, in GetIDsDict
    idExtract = orthofinder.FirstWordExtractor(orthofinderWorkingDir + "SequenceIDs.txt")
  File "/home/compartido2/andres/OrthoFinder/orthofinder.py", line 159, in __init__
    raise RuntimeError("A duplicate accession was found using just first part: % s" % accession)
RuntimeError: A duplicate accession was found using just first part: TR31292|c0_g1_i1|m.23117

And then it stopped.
How can I fix this?
Thanks in advance
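
The message suggests that several sequence headers share the same first part. A minimal sketch for listing such headers in the input FASTA files before running the script. It assumes the "first part" is the text up to the first whitespace; how `FirstWordExtractor` actually splits the line is not shown in the traceback, so that is an assumption, and the function name is hypothetical.

```python
from collections import Counter


def duplicate_first_words(header_lines):
    """Return the first words that occur in more than one FASTA header."""
    counts = Counter()
    for line in header_lines:
        if line.startswith(">"):
            words = line[1:].split()
            if words:
                counts[words[0]] += 1
    return sorted(word for word, n in counts.items() if n > 1)
```

Any identifier it returns would need to be made unique (for example by editing the headers) if the assumption about splitting holds.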

Statistics_PerSpecies.csv

Hi guys,

First, thank you for the great tool.

I am looking at the Statistics_PerSpecies.csv file for a few of my runs. There are four tables (below). The second and fourth are percentages, but what is the difference between Table 1 and Table 3?

Best,
/SB

Number of orthogroups (one column per species), by number of genes per-species in orthogroup

'0            8423    9583    9024    2421    2613
'1           16014   14982   15708   25100   24982
'2            3677    3556    3581    2253    2192
'3            1303    1270    1148     489     455
'4             544     581     532     157     173
'5             267     237     241      71      73
'6             143     168     150      34      29
'7              86      84      77      21      25
'8              44      51      55      16      25
'9              40      25      29      13      11
'10             28      20      22      13      10
11-15           44      46      34      19      20
16-20            8      14      12      10       9
21-50           13      18      17      11      10
51-100           2       1       4       1       3
101-150          2       2       3       5       4
151-200          0       0       1       2       0
201-500          0       0       0       2       4
501-1000         0       0       0       0       0
'1001+           0       0       0       0       0

Percentage of orthogroups (one column per species), by number of genes per-species in orthogroup

'0            27.5    31.3    29.5     7.9     8.5
'1            52.3    48.9    51.3    81.9    81.5
'2              12    11.6    11.7     7.4     7.2
'3             4.3     4.1     3.7     1.6     1.5
'4             1.8     1.9     1.7     0.5     0.6
'5             0.9     0.8     0.8     0.2     0.2
'6             0.5     0.5     0.5     0.1     0.1
'7             0.3     0.3     0.3     0.1     0.1
'8             0.1     0.2     0.2     0.1     0.1
'9             0.1     0.1     0.1       0       0
'10            0.1     0.1     0.1       0       0
11-15          0.1     0.2     0.1     0.1     0.1
16-20            0       0       0       0       0
21-50            0     0.1     0.1       0       0
51-100           0       0       0       0       0
101-150          0       0       0       0       0
151-200          0       0       0       0       0
201-500          0       0       0       0       0
501-1000         0       0       0       0       0
'1001+           0       0       0       0       0

Number of genes (one column per species), by number of genes per-species in orthogroup

'0               0       0       0       0       0
'1           16014   14982   15708   25100   24982
'2            7354    7112    7162    4506    4384
'3            3909    3810    3444    1467    1365
'4            2176    2324    2128     628     692
'5            1335    1185    1205     355     365
'6             858    1008     900     204     174
'7             602     588     539     147     175
'8             352     408     440     128     200
'9             360     225     261     117      99
'10            280     200     220     130     100
11-15          555     566     416     235     249
16-20          142     242     217     174     160
21-50          401     562     509     359     310
51-100         178      94     270      58     188
101-150        259     229     380     589     485
151-200          0       0     178     381       0
201-500          0       0       0     658    1085
501-1000         0       0       0       0       0
'1001+           0       0       0       0       0

Percentage of genes (one column per species), by number of genes per-species in orthogroup

'0               0       0       0       0       0
'1            44.3    39.5    44.8    67.1    67.5
'2            20.3    18.7    20.4    12.1    11.8
'3            10.8      10     9.8     3.9     3.7
'4               6     6.1     6.1     1.7     1.9
'5             3.7     3.1     3.4     0.9       1
'6             2.4     2.7     2.6     0.5     0.5
'7             1.7     1.5     1.5     0.4     0.5
'8               1     1.1     1.3     0.3     0.5
'9               1     0.6     0.7     0.3     0.3
'10            0.8     0.5     0.6     0.3     0.3
11-15          1.5     1.5     1.2     0.6     0.7
16-20          0.4     0.6     0.6     0.5     0.4
21-50          1.1     1.5     1.5       1     0.8
51-100         0.5     0.2     0.8     0.2     0.5
101-150        0.7     0.6     1.1     1.6     1.3
151-200          0       0     0.5       1       0
201-500          0       0       0     1.8     2.9
501-1000         0       0       0       0       0
'1001+           0       0       0       0       0
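
In the numbers posted above, each Table 3 entry for rows '1 to '10 equals the matching Table 1 entry multiplied by the row's gene count. This suggests that Table 1 counts orthogroups and Table 3 counts the genes inside those orthogroups. The check below uses the first species column; it is an observation from this data, not a statement from the OrthoFinder documentation.

```python
# First species column, rows '1 to '10, copied from the tables above.
orthogroups = [16014, 3677, 1303, 544, 267, 143, 86, 44, 40, 28]  # Table 1
genes = [16014, 7354, 3909, 2176, 1335, 858, 602, 352, 360, 280]  # Table 3

# Genes in row k should equal k * (number of orthogroups in row k).
products = [k * n for k, n in enumerate(orthogroups, start=1)]
consistent = products == genes
print(consistent)  # prints True
```

The binned rows (11-15 and above) cannot be checked this way because the exact gene count per orthogroup is not given.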

Removing species and aligning the new output

Hi David,
I reran OF after removing some species, and the new OrthologousGroups.txt is now in the WorkingDirectory, but the old files are still in the main Results_date directory. I would like to run the tree script next, but I want to make sure that I give the correct directory path to the -f option when running that script. It should be the path to where the new OrthologousGroups.txt is, right? Please let me know. Thanks!
Taruna

"same species" proteomes in a larger taxon set

Hi David,

first of all thank you for your continued effort into this excellent tool!

I need to infer gene families for a larger taxon set that also includes several taxonomically dispersed cases where I have proteomes from two or three isolates/strains of one species (e.g. maize B73/PH207). I am worried that they might end up in a lot of "lineage specific" clusters, as I observed in my last run when I accidentally included two different annotation versions of the same species. Is this a behavior you'd expect? If so, is there any way to circumvent or improve it?

Best,
Daniel

OF failure with -b option

Hi David,

I'm trying to run OF on a large dataset. Because of the size (191 spp) I performed the all-by-all blast manually as a series of job arrays on our research cluster, and am now using the -b option to run the actual orthofinder algorithm.

When I do so, however, it fails, with the following errors:

Process Process-1:
Traceback (most recent call last):
  File "/nfs/research2/marioni/claumer/anaconda2.7/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/nfs/research2/marioni/claumer/anaconda2.7/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "orthofinder.py", line 800, in AnalyseSequences
    wfAlg.RunWaterfallMethod(graphFilename)
  File "orthofinder.py", line 572, in RunWaterfallMethod
    Bij = self.thisBfp.GetBLAST6Scores(iSpecies, jSpecies)
  File "orthofinder.py", line 544, in GetBLAST6Scores
    if score > B[sequence1ID, sequence2ID]:
  File "/nfs/research2/marioni/claumer/anaconda2.7/lib/python2.7/site-packages/scipy/sparse/lil.py", line 246, in __getitem__
    i, j)
  File "_csparsetools.pyx", line 58, in scipy.sparse._csparsetools.lil_get1 (scipy/sparse/_csparsetools.c:3299)
  File "_csparsetools.pyx", line 81, in scipy.sparse._csparsetools.lil_get1 (scipy/sparse/_csparsetools.c:2944)
IndexError: row index (18054) out of bounds
___ [mcxIO] cannot open file </nfs/research2/marioni/claumer/metazoa/Blastout2/OrthoFinder_v0.3.0_graph.txt> in mode r
___ [mcl] no jive
___ [mcl] failed
Traceback (most recent call last):
  File "orthofinder.py", line 1146, in <module>
    MCL.ConvertSingleIDsToIDPair(speciesStartingIndices, clustersFilename, clustersFilename_pairs)
  File "orthofinder.py", line 230, in ConvertSingleIDsToIDPair
    with open(clustersFilename, 'rb') as clusterFile, open(newFilename, "wb") as output:
IOError: [Errno 2] No such file or directory: '/nfs/research2/marioni/claumer/metazoa/Blastout2/clusters_OrthoFinder_v0.3.0_I1.5.txt'

I've had this error with v. 0.2.8 and v. 0.3.0. The test suite runs just fine. I'm using python 2.7, with scipy etc installed using the Anaconda suite.

Any ideas on the source of this? Is it a true bug, or could it be a format error in my input data? It's dimly possible that a few of the BLAST outputs were truncated early due to stochastic node failure, but I'd be a bit surprised if that caused the entire algorithm to fail.

Grateful for what attention you can give to this,

Regards,
Chris L
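
Whether truncated files explain the failure is not established here, but they are easy to rule out. BLAST tabular output (`-outfmt 6`) has 12 tab-separated columns by default, so a line with a different field count, or a final line with no trailing newline, may point to an incomplete file. A minimal sketch with a hypothetical helper name:

```python
def malformed_blast6_lines(text, n_fields=12):
    """Return 1-based numbers of lines without n_fields tab-separated fields.

    A final line that lacks a trailing newline is also reported, since the
    writer may have been interrupted part-way through it.
    """
    if not text:
        return []
    bad = []
    ends_with_newline = text.endswith("\n")
    lines = text.split("\n")
    if ends_with_newline:
        lines = lines[:-1]  # drop the empty string after the last newline
    for number, line in enumerate(lines, start=1):
        if len(line.split("\t")) != n_fields:
            bad.append(number)
    if not ends_with_newline and len(lines) not in bad:
        bad.append(len(lines))
    return bad
```

Running this over the contents of each BLAST result file and re-running the searches for any file that reports line numbers would rule this cause in or out.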

Error in trees_for_orthogroups.py

Hi.
I'm trying to run trees_for_orthogroups.py but I got this error:
Traceback (most recent call last):
  File "trees_for_orthogroups.py", line 14, in <module>
    import orthofinder
ImportError: No module named orthofinder

Any suggestions?
Thanks in advance.
Felix Enciso
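
This ImportError means Python could not find `orthofinder.py` on its module search path. One common cause is that the script and `orthofinder.py` are not in the same directory, though the post does not confirm that. A minimal sketch of making a directory importable; `add_to_path` is a hypothetical helper and the directory passed to it is whatever folder holds `orthofinder.py`.

```python
import os
import sys


def add_to_path(directory):
    """Put directory at the front of sys.path if it is not already listed."""
    directory = os.path.abspath(directory)
    if directory not in sys.path:
        sys.path.insert(0, directory)
    return directory
```

Calling `add_to_path(...)` before `import orthofinder` has the same effect as setting the PYTHONPATH environment variable to that directory.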

Expose mcl's inflation parameter

It would be very useful to have a way to specify a custom value for mcl's -I.

Indirectly related: #18

Incorrect estimated species trees / user-specified option

I ran Orthofinder on a set of species for which the phylogenetic tree is relatively certain. But the two alternative species trees estimated by OrthoFinder are both completely off the mark: there are two extremely closely related species in the data set and they are never sisters in the estimated trees. In addition, the true root node is never selected as such. This renders the orthologue assessments basically useless.

Species tree estimation requires thorough phylogenetic analysis, and usually this has already been done. I strongly recommend allowing a user-specified species tree (perhaps with polytomies where some nodes are uncertain) and using that as a guide tree instead of estimating the tree from BLAST clustering results.

Error orthofinder.py : problems with BLAST

Hi,
I met this error while running BLAST all-versus-all:

Command line argument error: Argument "query". File is not accessible: `~/WorkingDirectory/Species32.fa'
Command line argument error: Argument "query". File is not accessible: `~/WorkingDirectory/Species4.fa'
Traceback (most recent call last):
  File "~/OrthoFinder/orthofinder.py", line 1129, in <module>
  File "~/OrthoFinder/orthofinder.py", line 504, in GetNumberOfSequencesInFileFromDir
ValueError: invalid literal for int() with base 10: ''

I was trying to infer the orthologous groups for 56 FASTA files (amino acid sequences), and the program stopped with over 70 BLAST files remaining. The program had already created over 3000 BLAST files without any problem (no empty files). Everything seems to be fine in the WorkingDirectory and I have no idea how to fix it.

Any clues?
I really appreciate any help you can provide.

María
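
To see which searches did not finish, the expected result files can be compared with what is in the working directory. A minimal sketch, assuming results are named `Blast<i>_<j>.txt` (the pattern that appears in OrthoFinder's log output) and that species are numbered 0 to n-1; the function name is hypothetical.

```python
import os


def missing_blast_files(working_dir, n_species):
    """Return expected Blast<i>_<j>.txt names that are absent or empty."""
    missing = []
    for i in range(n_species):
        for j in range(n_species):
            name = "Blast%d_%d.txt" % (i, j)
            path = os.path.join(working_dir, name)
            if not os.path.isfile(path) or os.path.getsize(path) == 0:
                missing.append(name)
    return missing
```

For 56 species this checks 56 x 56 = 3136 files. An empty file is reported as missing here, although a search with no hits would also produce one.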

OrthoFinder can't run dlcpar_search

I've downloaded OrthoFinder-1.0.7.tar.gz
from https://github.com/davidemms/OrthoFinder/releases/download/1.0.7/OrthoFinder-1.0.7.tar.gz

and set it up in my folder. I also installed the software below and added it to the system path:
1. BLAST+
2. MCL
3. FastME
4. DLCpar

I then ran the command "./orthofinder -f ExampleData/ -t 24" and got the following errors:

This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it under certain conditions.
For details please see the License.md that came with this software.

24 thread(s) for highly parallel tasks (BLAST searches etc.)
1 thread(s) for OrthoFinder algorithm

Checking required programs are installed

Test can run "makeblastdb -help" - ok
Test can run "blastp -help" - ok
Test can run "mcl -h" - ok

Temporarily renaming sequences with unique, simple identifiers

Dividing up work for BLAST for parallel processing

2016-11-03 23:30:02 : Creating Blast database 1 of 4
2016-11-03 23:30:03 : Creating Blast database 2 of 4
2016-11-03 23:30:03 : Creating Blast database 3 of 4
2016-11-03 23:30:03 : Creating Blast database 4 of 4

Running BLAST all-versus-all

Using 24 thread(s)
2016-11-03 23:30:03 : This may take some time....

Running OrthoFinder algorithm

2016-11-03 23:30:27 : Initial processing of each species
2016-11-03 23:30:28 : Initial processing of species 0 complete
2016-11-03 23:30:28 : Initial processing of species 1 complete
2016-11-03 23:30:29 : Initial processing of species 2 complete
2016-11-03 23:30:29 : Initial processing of species 3 complete
2016-11-03 23:30:33 : Connected putatitive homologs
2016-11-03 23:30:33 : Writen final scores for species 0 to graph file
2016-11-03 23:30:34 : Writen final scores for species 1 to graph file
2016-11-03 23:30:34 : Writen final scores for species 2 to graph file
2016-11-03 23:30:34 : Writen final scores for species 3 to graph file
2016-11-03 23:30:34 : Ran MCL

Writing orthogroups to file

Orthogroups have been written to tab-delimited files:
/biosoft/OrthoFinder-1.0.7/ExampleData/Results_Nov03_7/Orthogroups.csv

/biosoft/OrthoFinder-1.0.7/ExampleData/Results_Nov03_7/Orthogroups.txt
(OrthoMCL format)

/biosoft/OrthoFinder-1.0.7/ExampleData/Results_Nov03_7/Orthogroups_UnassignedGenes.csv

Running Orthologue Prediction

Checking required programs are installed

Test can run "fastme -i /biosoft/OrthoFinder-1.0.7/ExampleData/Results_Nov03_7/WorkingDirectory/SimpleTest.phy -o /biosoft/OrthoFinder-1.0.7/ExampleData/Results_Nov03_7/WorkingDirectory/SimpleTest.tre" - ok
Test can run "dlcpar_search --version" - failed
ERROR: Cannot run dlcpar_search
Please check DLCpar is installed and that the executables are in the
system path.

Orthogroups have been inferred but the dependencies for inferring gene trees and
orthologues have not been met. Please review previous messages for more information.

But when I run the command "dlcpar_search --version" at the prompt, it runs correctly and shows:

# dlcpar_search --version

dlcpar_search 1.0

# dlcpar_search

Usage: dlcpar_search [options] ...

dlcpar_search is a phylogenetic program for (heuristically) finding the most
parsimonious gene tree-species tree reconciliation by inferring speciation,
duplication, loss, and deep coalescence events. See
http://compbio.mit.edu/dlcpar for details.

Options:
Input/Output:

  -s <species tree>, --stree=<species tree>
                      species tree file in newick format

etc.

So I don't know what is wrong, and I hope you can give me some help. Many thanks!

Looking forward to hearing from you!
(submitted on behalf of user)
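
A program that runs from an interactive shell can still be invisible to a Python process if the two see different PATH values, for example because of shell aliases or a different login environment. Whether that is the cause here is not known. A minimal sketch for checking what Python itself can find, using the standard-library `shutil.which` (Python 3.3 or later); `find_on_path` is a hypothetical helper, not OrthoFinder code.

```python
import shutil


def find_on_path(names, path=None):
    """Return {name: full path or None} as resolved by shutil.which."""
    return {name: shutil.which(name, path=path) for name in names}


if __name__ == "__main__":
    for name, location in find_on_path(["dlcpar_search", "fastme", "mcl"]).items():
        print("%-15s %s" % (name, location or "NOT FOUND"))
```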

blastp error message prevents Orthofinder usage

Hello,

I have used Orthofinder for the past year or so. Recently, I needed to install it on a colleague's cluster. All the necessary programs are working, but when Orthofinder runs the initial checks blastp fails.
Test can run "makeblastdb -help" - ok
Test can run "blastp -help" - failed
ERROR: Cannot run BLAST+
Please check BLAST+ is installed and that the executables are in the system path

All of the needed executables are in the system path, but one issue I had with the NCBI install is that each time I run blastp searches I get the following error message.

blastp: /lib64/libz.so.1: no version information available (required by blastp)

Although blastp searches will continue to run with this error message, it appears that Orthofinder stops when it gets this error message. My colleague that created the cluster tried compiling NCBI to avoid this error but so far without success. Also, changing the library file is not an option.

Is there any way to edit Orthofinder code to allow it to run even with the blastp error message?

Thanks
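
How OrthoFinder's own check decides between "ok" and "failed" is not shown in this post. As an illustration of a more tolerant check, the sketch below treats a command as runnable when it exits with status 0, regardless of anything written to stderr (such as the libz warning). `can_run` is hypothetical, not OrthoFinder code.

```python
import subprocess


def can_run(command):
    """Return (ok, stderr_text); ok is True when the exit status is 0.

    Text printed to stderr is returned for inspection but does not
    affect the result.
    """
    try:
        proc = subprocess.Popen(command, stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE)
    except OSError:
        # The executable could not be started at all.
        return False, "could not start %r" % (command[0],)
    _, err = proc.communicate()
    return proc.returncode == 0, err.decode("utf-8", "replace")
```

The command is passed as a list, for example `can_run(["blastp", "-help"])`.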

Allowing for multiple inflation values

Hi,

Would it be possible to change the command line argument -I to accept multiple (comma-separated) inflation values?

cheers,

dom
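
A minimal sketch of how a comma-separated value could be parsed with Python's argparse. It is illustrative only: the `-I` handling and the `inflation_list` helper are hypothetical and do not reflect OrthoFinder's actual option parsing.

```python
import argparse


def inflation_list(text):
    """Convert a string such as '1.2,1.5,2' into a list of floats."""
    try:
        return [float(part) for part in text.split(",")]
    except ValueError:
        raise argparse.ArgumentTypeError(
            "expected comma-separated numbers, got %r" % text)


parser = argparse.ArgumentParser()
# Hypothetical option: one MCL run would be made per value in the list.
parser.add_argument("-I", dest="inflation", type=inflation_list, default=[1.5])

args = parser.parse_args(["-I", "1.2,1.5,2"])
print(args.inflation)  # prints [1.2, 1.5, 2.0]
```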

Get alignment without searching for trees

I don't know how computationally demanding the FastTree step in trees_for_orthogroups is, but is there a way to skip it and obtain just the FASTA alignments? Thanks for your help.

