naturalis / barcode-constrained-phylogeny Goto Github PK

View Code? Open in Web Editor NEW

1.0 9.0 3.0 3.52 MB

Pipeline for building topologically-constrained phylogenies from DNA barcode data

Home Page: https://naturalis.github.io/barcode-constrained-phylogeny/

License: Apache License 2.0

Python 97.61% Shell 2.39%

dna-barcoding metabarcoding phylogenetic-diversity phylogenetic-placement phylogenetics

barcode-constrained-phylogeny's Introduction

Bactria: BarCode TRee Inference and Analysis

This repository contains code and data for building very large, topologically-constrained barcode phylogenies through a divide-and-conquer strategy. Such trees are useful as reference materials for curating barcode data by detecting rogue terminals (indicating incorrect taxonomic annotation) and in the comparable calculation of alpha and beta biodiversity metrics across metabarcoding assays.

The input data for the approach we develop here currently comes from BOLD data dumps. The international database BOLD Systems contains DNA barcodes for hundreds of thousands of species, with multiple barcodes per species. The data dumps we use here are TSV files whose columns conform to the nascent BCDM (barcode data model) vocabulary. As such, other data sources that conform to this vocabulary could in the future be used as well, such as UNITE.

Theoretically, such data could be filtered and aligned per DNA marker to make phylogenetic trees. However, there are two limiting factors: building very large phylogenies is computationally intensive, and barcodes are not considered ideal for building big trees because they are short (providing insufficient signal to resolve large trees) and because they tend to saturate across large patristic distances.

Both problems can be mitigated by using the Open Tree of Life as a further source of phylogenetic signal. The BOLD data can be split into chunks that correspond to Open Tree of Life clades. These chunks can be made into alignments and subtrees. The OpenTOL can be used as a constraint in the algorithms to make these. The chunks are then combined in a large synthesis by grafting them on a backbone made from exemplar taxa from the subtrees. Here too, the OpenTOL is a source of phylogenetic constraint.

In this repository this concept is developed for the COI-5P marker, but the aim is to achieve equivalent functionality for plant barcoding markers (matK, rbcL) and for some part of the ITS region.

Installation

The pipeline and its dependencies are managed using conda. On a Linux-like system, you can follow these steps to set up the bactria Conda environment using the environment.yml file (for standalone executables that the pipeline needs) and a requirements.txt file (for Python packages that the pipeline scripts use):

Clone the Repository:
Clone the repository containing the environment files to your local machine:

git clone https://github.com/naturalis/barcode-constrained-phylogeny.git
cd barcode-constrained-phylogeny

Create the Conda Environment: Create the bactria Conda/Mamba environment using the environment.yml file with the following command:
```
mamba env create -f workflow/envs/environment.yml
```
This command will create a new Conda environment named bactria with the packages specified in the environment.yml file. This step is largely a placeholder because most of the dependency management is handled at the level of individual pipeline steps, which each have their own environment specification.
Activate the Environment: After creating the environment, activate it using the conda activate command:
```
mamba activate bactria
```
Verify the Environment: Verify that the bactria environment was set up correctly and that all packages were installed using the conda list command:
```
mamba list
```
This command will list all packages installed in the active conda environment. You should see all the packages specified in the environment.yml file and the requirements.txt file.

It is recommended that the mamba environment is configured to use strict channel priorities. This is crucial for having robust and correct environments (for details, see here). Consider configuring strict priorities by executing conda config --set channel_priority strict.

How to configure

The pipeline is configured using the config.yaml file. With the settings in this file you can affect, among other things:

Which higher taxonomic group to build the tree for. The example configuration shows how this is defined for the order Primates.
Where the BOLD data package TSV file is located. Note that the pipeline assumes data packages formatted according to the BCDM syntax from 2024 onwards.
Where the OpenToL synthetic Newick tree file is located. This has to be the tree file with only IDs as tip and node labels, not taxon names. (The file name should follow the same pattern as the example configuration.)
How many families are in the higher taxon for which to build the tree. At time of of writing this is a crucial variable that must match the number of families in the taxon according to BOLD, as pipeline parallelization is based on this. (Note that we are working on a better solution for this.)
Which marker gene to use. Note that any other value besides COI-5P (the default) means you need to provide an HMM file for that marker.
How verbose the pipeline needs to be in its log files.

How to run

The pipeline is implemented using snakemake, which is available within the conda environment that results from the installation.

How to run the entire pipeline:

snakemake -j {number of threads} --use-conda

Snakemake rules can be performed separately:

snakemake --until {Rule} -j {number of threads} --use-conda

Enter the same number at {number of threads} as you filled in previously in src/config.yaml. In {Rule} insert the rule to be performed.

Here is an overview of all the rules in the Snakefile:

(zoomed view is available here)

More detailed documentation of the individual rules is provided here.

Repository layout

Below is the top-level layout of the repository. This layout is in line with community standards and must be adhered to. All of these subfolders contains further explanatory READMEs to explain their contents in more detail.

config - configuration files
doc - documentation and background literature
logs - where log files are written during pipeline runtime
resources - external data resources (from BOLD and OpenTree) are downloaded here
results - intermediate and final results are generated here
workflow - script source code and driver snakefile

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

barcode-constrained-phylogeny's People

Contributors

Stargazers

Watchers

Forkers

pecholleyn bge-barcoding deyuanyang

barcode-constrained-phylogeny's Issues

Data thinning

We don't need all sequences to end up on the tree. There are several reasons why data thinning would be a good idea:

unevenness in the tip density introduces spurious uncertainty when running pplacer (Lena's thesis)
data thinning uses less drive space
data thinning reduces the running time of the alignment and tree-building processes

On the other hand, the placement of a sequence and its branch length could be an indicator used in reference library
curation. A sequence with a suspiciously long terminal branch or a problematic placement (e.g. not next to another
BIN within the same species, or not in the expected place within the family) might be a misidentification or a
sequencing error. Hence, data thinning should be optional. If enabled, it should be something along the following lines:

one sequence >= config['minlength'] per BIN
probably the centroid or otherwise the most common sequence

Reroot unrooted raxml subtree

When the outgroup of the subtree is not monophyletic with respect to the ingroup, RAxML issues a warning but then merely writes out the unrooted tree. We can still root this with fairly defensible logic but this needs to be applied such that there is a tree file for the next steps to work with.

macse in bin/

as per Noble, 2009

Handling empty FASTA files

if file is empty in the results/fasta/family/ directory it will still be used and iterated over in the family fasta steps. Current fix: delete the file.

Use logging, not print statements

import logging

# Create a custom logger
logger = logging.getLogger(__name__)

logger.warning('This is a warning message')
logger.error('This is an error message')

family_fasta.py feature needed after bugfix map_opentol.py

Need to select only barcodes with an opentol ID mapping. Right now this is not the case because it only has a very small percentage of mappings because of a request error in map_opentol.py

OPTION: Integrate perl tool db into custom db

Feedback Rutger: The database from perl-bio-phylo-forest-dbtree (rule megatree-loader) could be integrated into our own custom db (both use Opentol ID's).

Add different Raxml command for AA alignments

Raxml needs a different commandline/model to work with AA sequences.

hmmalign parameterization

The current wrapper around hmmalign has resulted in unaligned sequences (e.g. in Naomi's report). It's possible that this is because the produced output, in Stockholm format, is not parsed and interpreted correctly. What looks less error prone is to use a different output format. For example, with hmmalign --outformat afa [options] [hmm] [query] the result is in aligned FASTA. A quick check with some variable sequences (quite distantly related primates, with COI sequences of different lengths) resulted in pretty good alignments nonetheless:

ERROR: Github when running a 'build-linux' job after pushing branch

Error alignment_output = snakemake.output[0]
^
4 F821 undefined name 'snakemake'

Does not recognize snakemake in the python files? Has to do with conda env
https://github.com/naturalis/barcode-constrained-phylogeny/actions/runs/4989555728/jobs/8934184582

Open Tree of Life - extracting subtrees

Here is a SQLite representation of the opentree 13.4 supertree labelled with ott IDs. From this database, the DBTree toolkit can be used to extract subtrees with the megatree-pruner command. This will produce a newick tree. This tree still needs to be expanded so that tip labels that occur multiple times in the alignment are expanded into polytomies to disambiguate for raxml (see #27)

create_database produces inconsistencies

taxa and barcodes need to be loaded differently, review

Submit to WorkflowHub

At time of writing, the pipeline appears technically ready so submission to WFH except for the testing, which might be doable by generating tests from the results (?)

OpenToL 'broken' taxa

When requesting a subtree by its tip ott IDs, there are cases where one or more of the IDs are for species that OpenToL views as 'broken'. For example, Alouatta seniculus, or Callicebus personatus.

This is because at the subspecies level they are entangled with other species (respectively Alouatta sara and Callicebus coimbrai). In the requested subtree, such broken tips are only identified by a label of the form mrcaott126846ott126847, which means we can't immediately figure out which species triggered this.

The solution appears to be to:

parse the label to get the IDs (126846 and 126847)
requesting the induced subtree for those IDs
fetching the tip labels

Whichever label corresponds for the first parts with a species in the input (unaligned.fa) that hasn't otherwise been seen in previous output should be the one to keep and to graft onto the tree.

Rule family_fasta

Not sure how to run this one: for me it finishes in a second without extra output. Do I have to specify which family or sth?

raxml with parwise alignments

raxml only takes msa file inputs, how the rule raxml is right now it will not work with the new hmm alignments.
Does it work on raxml branch with hmm alignments @NI-S ?

Picking exemplars

When picking the shallowest exemplars for the backbone, exemplar clades where shallow tips have near zero distance to their subtree root result in a backbone where nearly all change is allocated to the root branch of the of the exemplar clade. When grafting the subtrees onto the basal split, the subtrees become "too tall". Here's the grafted example:

Here's the same topology but with the branch lengths re-estimated by raxml:

A follow up experiment should be to pick the tallest exemplars instead. It's possible that this will make them "too shallow" instead, especially when there is a lot of homoplasy within the clade.

Outgroup selection with blast

Plan:

progressively index unaligned.fa, only those sequences that have ott IDs with makeblastdb -in {unaligned.fa} -dbtype nucl -out {blastdb} -parse_seqids
when add outgroups, blast while ignoring the ingroup and only keep n outgroups, e.g. blastn -query {unaligned.fa} -db {blastdb} -out {outfile.txt} -outfmt 6 -max_target_seqs n -negative_seqidlist {exclude_ids.txt}
incorporate operations in snakefile
include outgroups in constraint tree

Documentation

I put some placeholders in the top-level README.md for bits of information that should be added. It would be good if that document could be updated. The idea is that this document is the 'landing page' for the project, so (potential) users will want to know quite a bit.

add user arguments to script retrieve_bold_data.py

Make it possible to for user to give arguments (like custom directory names) when calling script through commandline.

Release generated files using Zenodo

Rewrite into snakemake pipeline

Rewrite code to be able to run it with a snakemake file. See closed issue #23 for details

Skip `temporary_database.db` step

A rule in the Snakefile has as an output target a file (temporary_database.db) that is renamed later on. Once that renaming has happened, the output target has disappeared and so snakemake decides to rerun everything to recover that target. This needs to be fixed by bypassing this renaming stuff and going straight for the final database file name.

FIX: No constraint tree needed for families with one ott

Raxml gives an error when a family alignment with only one ott is being used (multiple barcodes from the same ott).
Change raxml rule from shell command line to python script that checks if a constraint tree newick file is empty first. If so run raxml without constraint tree, else with a constraint tree.

The backbone logic

The following consideration factor into the inference of the backbone:

For each family F, the two tips F1, F2 that are most distant must be the exemplars for that family. This ensures that they will be on opposite sides of the root so that they form a fork Y in the backbone that can be replaced by the subtree for that family. The approach using the distance matrix is good for this. Within the constraint tree for the backbone, the two exemplars will be defined as a monophyletic group in the newick syntax, i.e. ...(F1,F2),....
For the relationships between the families (F,G,H,I), we want constraints from the OpenTree backbone. The challenge here is that F1 and F2 might not be in the OpenTree backbone. Instead, any single one of the tips from the families that did occur in there (e.g. any of the ones that went into the family-level constraint) will do. This might give a constraint like ((F,G),(H,I));. That constraint then needs to be updated so that for each family, the exemplars are grafted in, e.g. (((F1,F2),(G1,G2)),((H1,H2),(I1,I2)));.

Tip1: you can copy the statements that end with ;, such as (((F1,F2),(G1,G2)),((H1,H2),(I1,I2))); and paste them directly into figtree to see the tree. Do this to check the validity of the trees. They can't just be single Y splits.

Tip2: you can check the alignments in mesquite so that you see the amino acids, not the nucleotides, like so: https://www.youtube.com/watch?v=YsotPrE0KwE - do this to make sure the alignments are still OK.

Change raxml prefix

To better organize the raxml ouput, add a directory for each family within the raxml/ directory

Roadmap next release

refine input sequence curation as per BGE 10.1 criteria
make OpenTol coverage denser by ssp expansion/contraction
ensure outgroup selection avoids sequences within ingroup distances
test exemplar selection: tallest, shallowest, average
cleanup log file proliferation
run on Odonata, then Mammalia, then Lepidoptera
fix rooting/grafting
fix database import
finish workflow/documentation.md

Change alignment method

Use HMMer to align sequences to a COI-5P model with pairwise alignment for scalability. With the output of HMMer the sequences can be checked if its forward or reverse.

Get refs.hmm from zip file https://github.com/psomervuo/FinPROTAX and save it in repo
HMM align using profile and fasta file
Use stockholm file to check which one of the is the forward sequence (most * in alignment)

Check barcode length

Do a check for barcode length when making FASTA files (minimum 600 for COI-5P), make it adjustable in config.yaml

ERROR: Cannot find tree species

When making a subtree (create_tree.py) using a newick tree from OTL this error pops up for the first ott id (ott406633). The id is present in both the newick constraint tree and the input fasta file.

Additional indexes needed

To be able to generate area phylogenies, such as the tree for the Netherlands, the country field in the barcode should be indexed.

Some bold entries do not have a species name (retrieve_barcody.py)

Some queried data from BOLD systems do not contain a species name (or are not on species level).

Do we want those in a later stadium? If so an alternative name needs to be made so that the FASTA header is usable.

Unit tests

The pipeline needs to be testable so that the CI action can run the tests that are now commented out. This can be accomplished along these lines.

Remove dashes (-) FASTA files

HMMer can’t align sequences which have dashes in them, remove the dashes before putting them into a FASTA file.

MACSE to HMM

Hmm does not cut the headers short as MACSE did. Is it safe to delete replaceme alignment id script/rule?
Also the MACSE files are as of now still used in the snakemake file so this can be changed to hmm generated files. Wait untill we have a matk/rcbl model too?

Incompatible dependency graphs across rules

The failures in the github CI/CD workflow indicate that different tools end up resolving into different dependency versions (e.g. in libxml2). The way to address this would be to insulate the snakemake rules from each other and have them sit in separate conda environments or docker containers.

Correspondence between constraint tree and alignment

This is not so much an issue but an elaboration that perhaps should end up in various reports and documentation files. This is about how the tips in a constraint tree must logically correspond with the sequences in an alignment. When errors occur surrounding this topic, then this might be a reference in trying to resolve the issue. Here are some possible patterns in which tree and alignment relate to each other:

There is no constraint tree at all. This is actually the most common way in which raxml is used. You could rephrase this situation as one where there is a constraint tree but it has zero tips. That may sound trivial but it's worth keeping in mind that the tool normally places sequences in a tree based only on the signal that's in the DNA so there's no need for any tips to be in a constraint.
There's a constraint tree with some taxa from the alignment. In this case the constraint tree has a subset of the tips that are in the alignment. This is fine because it just means that there is a bit more signal besides the DNA, for example to make it so that a few tips are forced to be next to each other. This is basically what we're going for.
The constraint tree has tips that are not in the alignment. This is impossible so it produces errors. The tree searcher is now being asked to place tips in a tree but it has no DNA data to do so. Tips in the constraint tree absent from the alignment must be pruned.
The constraint tree or the alignment has multiple entries with the same name. This is also impossible. The solution here is to make the constraint tree have a polytomy (e.g. ((A1,A2,A3,A4),...)) and make the alignment also have A1, A2, A3 and A4 (instead of multiple, identical As).

I think that's all the possible arrangements :)

Create v1.1.0

Checklist bank request

Worked last week, now got a "HTTP 405 Method Not Allowed" message.

Rough to do list/workflow

Make rough ERD
BOLD data dump -> SQLite (with marker code/kingdom filter as commandline input), do not index yet
Filter and rearrange data into ERD tables (barcode records -> atleast family taxonomy available, …)
Index data (unique ID for every barcode and taxon)
Map opentol ID to taxonomy table (with family name as constraint!)
Get barcode data into chunks (group by family name) and make per family a fasta file with barcodes (header with own DB id from barcode table) | OUTPUT FASTA?
Make alignment per family (MUSCLE? HMER? For HMER nt -> aa) | OUTPUT PYLIP?
Make tree per family/alignment, use opentol tree as constraint (get tree with opentol id) | INPUT PHYLIP & OUTPUT NEWICK?

RAxML placement and branch length optimization

It should be possible to use RAxML slightly different. Summarizing this thread, the idea would be to:

Do phylogenetic placement of any family-level sequences that are not in the opentol
Then do branch length optimization
This obviates the need for outgroup selection and any ensuing rerooting issues.

COI-5P filtering in API request not working (retrieve_bold.py)

Other markers are kept in the records. At this point it is filtered using the pandas dataframe. Only 1536 barcodes remain (needs to be looked at if it is correct).

Implementation status beyond raxml

The current implementation runs until the raxml target. At present, all subsequent steps need to be done manually.

Barcode database has no indexes or primary keys

The file BOLD_COI-5P_barcodes.db has no indexes. For example, the column barcode.taxon_id should be indexed so that joins with taxon.taxon_id go MUCH faster. Also, taxon.taxon_id seems like it should be a primary key but it is not unique.

Put ott_id in fasta header in retrieve_bold.py

To account for the name clipping of the header during in MUCLE alignment output, change the FASTA header to ott_id (OpenTol taxnomic id) while matching with OpenTol.

Documentation README /src

Explain each rule from snakefile_phylogeny in readme

tree inference & data management

It is probably a good idea to run the steps incrementally, by which I mean: writing the alignments is a task that you can kill and then it resumes where it left of, until all families have an alignment. Same for the constraint trees: you probably want to have a tree for every alignment and store this in a folder structure. What you want to avoid is having to run everything in one giant loop that can never fail.

With that in mind, it's probably better to not have a temporary tree that has a throwaway name here, but rather have a system where the raxml run assumes that there is a constraint tree for every alignment (and they probably have the same name but with different extensions).

Make families.txt's for both COI-5P familys and matK_rbcL

Some families are too large right now, we can specify which families (like all mammalia families) we want to use. Save a list with family names to family_COI-5P.txt and family_matK_rbcL.txt which can be used as wildcards in snakemake.

Add date/time stamp to log messages

import logging

# Create a logger
logger = logging.getLogger(__name__)

# Set the log level
logger.setLevel(logging.DEBUG)

# Create a console handler
handler = logging.StreamHandler()

# Define the format for the log messages,
# including the date and time stamp
log_format = '%(asctime)s - %(name)s - %(levelname)s - %(message)s'

# Set the date format
date_format = '%Y-%m-%d %H:%M:%S'

# Create a formatter using the specified format and date format
formatter = logging.Formatter(log_format, datefmt=date_format)

# Set the formatter for the handler
handler.setFormatter(formatter)

# Add the handler to the logger
logger.addHandler(handler)

# Example log message
logger.debug('This is a debug message.')

Spurious pandas dependency errors

Sometimes snakemake claims pandas is not installed. Current fix: sudo apt-get install python3 pandas (in the terminal).