Coder Social home page Coder Social logo

naturalis / barcode-constrained-phylogeny Goto Github PK

View Code? Open in Web Editor NEW
1.0 9.0 3.0 3.52 MB

Pipeline for building topologically-constrained phylogenies from DNA barcode data

Home Page: https://naturalis.github.io/barcode-constrained-phylogeny/

License: Apache License 2.0

Python 97.61% Shell 2.39%
dna-barcoding metabarcoding phylogenetic-diversity phylogenetic-placement phylogenetics

barcode-constrained-phylogeny's Introduction

workflow License: Apache-2.0 DOI

Logo

Bactria: BarCode TRee Inference and Analysis

This repository contains code and data for building very large, topologically-constrained barcode phylogenies through a divide-and-conquer strategy. Such trees are useful as reference materials for curating barcode data by detecting rogue terminals (indicating incorrect taxonomic annotation) and in the comparable calculation of alpha and beta biodiversity metrics across metabarcoding assays.

The input data for the approach we develop here currently comes from BOLD data dumps. The international database BOLD Systems contains DNA barcodes for hundreds of thousands of species, with multiple barcodes per species. The data dumps we use here are TSV files whose columns conform to the nascent BCDM (barcode data model) vocabulary. As such, other data sources that conform to this vocabulary could in the future be used as well, such as UNITE.

Theoretically, such data could be filtered and aligned per DNA marker to make phylogenetic trees. However, there are two limiting factors: building very large phylogenies is computationally intensive, and barcodes are not considered ideal for building big trees because they are short (providing insufficient signal to resolve large trees) and because they tend to saturate across large patristic distances.

concept

Both problems can be mitigated by using the Open Tree of Life as a further source of phylogenetic signal. The BOLD data can be split into chunks that correspond to Open Tree of Life clades. These chunks can be made into alignments and subtrees. The OpenTOL can be used as a constraint in the algorithms to make these. The chunks are then combined in a large synthesis by grafting them on a backbone made from exemplar taxa from the subtrees. Here too, the OpenTOL is a source of phylogenetic constraint.

In this repository this concept is developed for the COI-5P marker, but the aim is to achieve equivalent functionality for plant barcoding markers (matK, rbcL) and for some part of the ITS region.

Installation

The pipeline and its dependencies are managed using conda. On a Linux-like system, you can follow these steps to set up the bactria Conda environment using the environment.yml file (for standalone executables that the pipeline needs) and a requirements.txt file (for Python packages that the pipeline scripts use):

  1. Clone the Repository:
    Clone the repository containing the environment files to your local machine:
    git clone https://github.com/naturalis/barcode-constrained-phylogeny.git
    cd barcode-constrained-phylogeny
  2. Create the Conda Environment: Create the bactria Conda/Mamba environment using the environment.yml file with the following command:
    mamba env create -f workflow/envs/environment.yml
    This command will create a new Conda environment named bactria with the packages specified in the environment.yml file. This step is largely a placeholder because most of the dependency management is handled at the level of individual pipeline steps, which each have their own environment specification.
  3. Activate the Environment: After creating the environment, activate it using the conda activate command:
    mamba activate bactria
  4. Verify the Environment: Verify that the bactria environment was set up correctly and that all packages were installed using the conda list command:
    mamba list
    This command will list all packages installed in the active conda environment. You should see all the packages specified in the environment.yml file and the requirements.txt file.

It is recommended that the mamba environment is configured to use strict channel priorities. This is crucial for having robust and correct environments (for details, see here). Consider configuring strict priorities by executing conda config --set channel_priority strict.

How to configure

The pipeline is configured using the config.yaml file. With the settings in this file you can affect, among other things:

  • Which higher taxonomic group to build the tree for. The example configuration shows how this is defined for the order Primates.
  • Where the BOLD data package TSV file is located. Note that the pipeline assumes data packages formatted according to the BCDM syntax from 2024 onwards.
  • Where the OpenToL synthetic Newick tree file is located. This has to be the tree file with only IDs as tip and node labels, not taxon names. (The file name should follow the same pattern as the example configuration.)
  • How many families are in the higher taxon for which to build the tree. At time of of writing this is a crucial variable that must match the number of families in the taxon according to BOLD, as pipeline parallelization is based on this. (Note that we are working on a better solution for this.)
  • Which marker gene to use. Note that any other value besides COI-5P (the default) means you need to provide an HMM file for that marker.
  • How verbose the pipeline needs to be in its log files.

How to run

The pipeline is implemented using snakemake, which is available within the conda environment that results from the installation.

How to run the entire pipeline:

snakemake -j {number of threads} --use-conda

Snakemake rules can be performed separately:

snakemake --until {Rule} -j {number of threads} --use-conda

Enter the same number at {number of threads} as you filled in previously in src/config.yaml. In {Rule} insert the rule to be performed.

Here is an overview of all the rules in the Snakefile:

graphviz (1) (zoomed view is available here)

More detailed documentation of the individual rules is provided here.

Repository layout

Below is the top-level layout of the repository. This layout is in line with community standards and must be adhered to. All of these subfolders contains further explanatory READMEs to explain their contents in more detail.

  • config - configuration files
  • doc - documentation and background literature
  • logs - where log files are written during pipeline runtime
  • resources - external data resources (from BOLD and OpenTree) are downloaded here
  • results - intermediate and final results are generated here
  • workflow - script source code and driver snakefile

License

© 2023-2024 Naturalis Biodiversity Center

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

barcode-constrained-phylogeny's People

Contributors

annemiekeschonthaler avatar jwesseling avatar n-scheffer avatar naomivanes avatar ni-s avatar rvosa avatar

Stargazers

 avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

barcode-constrained-phylogeny's Issues

Data thinning

We don't need all sequences to end up on the tree. There are several reasons why data thinning would be a good idea:

  • unevenness in the tip density introduces spurious uncertainty when running pplacer (Lena's thesis)
  • data thinning uses less drive space
  • data thinning reduces the running time of the alignment and tree-building processes

On the other hand, the placement of a sequence and its branch length could be an indicator used in reference library
curation. A sequence with a suspiciously long terminal branch or a problematic placement (e.g. not next to another
BIN within the same species, or not in the expected place within the family) might be a misidentification or a
sequencing error. Hence, data thinning should be optional. If enabled, it should be something along the following lines:

  • one sequence >= config['minlength'] per BIN
  • probably the centroid or otherwise the most common sequence

Reroot unrooted raxml subtree

When the outgroup of the subtree is not monophyletic with respect to the ingroup, RAxML issues a warning but then merely writes out the unrooted tree. We can still root this with fairly defensible logic but this needs to be applied such that there is a tree file for the next steps to work with.

Handling empty FASTA files

if file is empty in the results/fasta/family/ directory it will still be used and iterated over in the family fasta steps. Current fix: delete the file.

Use logging, not print statements

import logging

# Create a custom logger
logger = logging.getLogger(__name__)

logger.warning('This is a warning message')
logger.error('This is an error message')

hmmalign parameterization

The current wrapper around hmmalign has resulted in unaligned sequences (e.g. in Naomi's report). It's possible that this is because the produced output, in Stockholm format, is not parsed and interpreted correctly. What looks less error prone is to use a different output format. For example, with hmmalign --outformat afa [options] [hmm] [query] the result is in aligned FASTA. A quick check with some variable sequences (quite distantly related primates, with COI sequences of different lengths) resulted in pretty good alignments nonetheless:

Screenshot 2023-11-08 at 16 58 19

Open Tree of Life - extracting subtrees

Here is a SQLite representation of the opentree 13.4 supertree labelled with ott IDs. From this database, the DBTree toolkit can be used to extract subtrees with the megatree-pruner command. This will produce a newick tree. This tree still needs to be expanded so that tip labels that occur multiple times in the alignment are expanded into polytomies to disambiguate for raxml (see #27)

Submit to WorkflowHub

At time of writing, the pipeline appears technically ready so submission to WFH except for the testing, which might be doable by generating tests from the results (?)

OpenToL 'broken' taxa

When requesting a subtree by its tip ott IDs, there are cases where one or more of the IDs are for species that OpenToL views as 'broken'. For example, Alouatta seniculus, or Callicebus personatus.

This is because at the subspecies level they are entangled with other species (respectively Alouatta sara and Callicebus coimbrai). In the requested subtree, such broken tips are only identified by a label of the form mrcaott126846ott126847, which means we can't immediately figure out which species triggered this.

The solution appears to be to:

  1. parse the label to get the IDs (126846 and 126847)
  2. requesting the induced subtree for those IDs
  3. fetching the tip labels

Whichever label corresponds for the first parts with a species in the input (unaligned.fa) that hasn't otherwise been seen in previous output should be the one to keep and to graft onto the tree.

Rule family_fasta

Not sure how to run this one: for me it finishes in a second without extra output. Do I have to specify which family or sth?

raxml with parwise alignments

raxml only takes msa file inputs, how the rule raxml is right now it will not work with the new hmm alignments.
Does it work on raxml branch with hmm alignments @NI-S ?

Picking exemplars

When picking the shallowest exemplars for the backbone, exemplar clades where shallow tips have near zero distance to their subtree root result in a backbone where nearly all change is allocated to the root branch of the of the exemplar clade. When grafting the subtrees onto the basal split, the subtrees become "too tall". Here's the grafted example:

Screenshot 2024-01-11 at 12 33 27

Here's the same topology but with the branch lengths re-estimated by raxml:

Screenshot 2024-01-11 at 12 34 17

A follow up experiment should be to pick the tallest exemplars instead. It's possible that this will make them "too shallow" instead, especially when there is a lot of homoplasy within the clade.

Outgroup selection with blast

Plan:

  • progressively index unaligned.fa, only those sequences that have ott IDs with makeblastdb -in {unaligned.fa} -dbtype nucl -out {blastdb} -parse_seqids
  • when add outgroups, blast while ignoring the ingroup and only keep n outgroups, e.g. blastn -query {unaligned.fa} -db {blastdb} -out {outfile.txt} -outfmt 6 -max_target_seqs n -negative_seqidlist {exclude_ids.txt}
  • incorporate operations in snakefile
  • include outgroups in constraint tree

Documentation

I put some placeholders in the top-level README.md for bits of information that should be added. It would be good if that document could be updated. The idea is that this document is the 'landing page' for the project, so (potential) users will want to know quite a bit.

Skip `temporary_database.db` step

A rule in the Snakefile has as an output target a file (temporary_database.db) that is renamed later on. Once that renaming has happened, the output target has disappeared and so snakemake decides to rerun everything to recover that target. This needs to be fixed by bypassing this renaming stuff and going straight for the final database file name.

FIX: No constraint tree needed for families with one ott

Raxml gives an error when a family alignment with only one ott is being used (multiple barcodes from the same ott).
Change raxml rule from shell command line to python script that checks if a constraint tree newick file is empty first. If so run raxml without constraint tree, else with a constraint tree.

The backbone logic

The following consideration factor into the inference of the backbone:

  • For each family F, the two tips F1, F2 that are most distant must be the exemplars for that family. This ensures that they will be on opposite sides of the root so that they form a fork Y in the backbone that can be replaced by the subtree for that family. The approach using the distance matrix is good for this. Within the constraint tree for the backbone, the two exemplars will be defined as a monophyletic group in the newick syntax, i.e. ...(F1,F2),....
  • For the relationships between the families (F,G,H,I), we want constraints from the OpenTree backbone. The challenge here is that F1 and F2 might not be in the OpenTree backbone. Instead, any single one of the tips from the families that did occur in there (e.g. any of the ones that went into the family-level constraint) will do. This might give a constraint like ((F,G),(H,I));. That constraint then needs to be updated so that for each family, the exemplars are grafted in, e.g. (((F1,F2),(G1,G2)),((H1,H2),(I1,I2)));.

Tip1: you can copy the statements that end with ;, such as (((F1,F2),(G1,G2)),((H1,H2),(I1,I2))); and paste them directly into figtree to see the tree. Do this to check the validity of the trees. They can't just be single Y splits.

Tip2: you can check the alignments in mesquite so that you see the amino acids, not the nucleotides, like so: https://www.youtube.com/watch?v=YsotPrE0KwE - do this to make sure the alignments are still OK.

Change raxml prefix

To better organize the raxml ouput, add a directory for each family within the raxml/ directory

Roadmap next release

  • refine input sequence curation as per BGE 10.1 criteria
  • make OpenTol coverage denser by ssp expansion/contraction
  • ensure outgroup selection avoids sequences within ingroup distances
  • test exemplar selection: tallest, shallowest, average
  • cleanup log file proliferation
  • run on Odonata, then Mammalia, then Lepidoptera
  • fix rooting/grafting
  • fix database import
  • finish workflow/documentation.md

Change alignment method

Use HMMer to align sequences to a COI-5P model with pairwise alignment for scalability. With the output of HMMer the sequences can be checked if its forward or reverse.

  1. Get refs.hmm from zip file https://github.com/psomervuo/FinPROTAX and save it in repo
  2. HMM align using profile and fasta file
  3. Use stockholm file to check which one of the is the forward sequence (most * in alignment)

Check barcode length

Do a check for barcode length when making FASTA files (minimum 600 for COI-5P), make it adjustable in config.yaml

ERROR: Cannot find tree species

When making a subtree (create_tree.py) using a newick tree from OTL this error pops up for the first ott id (ott406633). The id is present in both the newick constraint tree and the input fasta file.

Additional indexes needed

To be able to generate area phylogenies, such as the tree for the Netherlands, the country field in the barcode should be indexed.

Remove dashes (-) FASTA files

HMMer can’t align sequences which have dashes in them, remove the dashes before putting them into a FASTA file.

MACSE to HMM

Hmm does not cut the headers short as MACSE did. Is it safe to delete replaceme alignment id script/rule?
Also the MACSE files are as of now still used in the snakemake file so this can be changed to hmm generated files. Wait untill we have a matk/rcbl model too?

Incompatible dependency graphs across rules

The failures in the github CI/CD workflow indicate that different tools end up resolving into different dependency versions (e.g. in libxml2). The way to address this would be to insulate the snakemake rules from each other and have them sit in separate conda environments or docker containers.

Correspondence between constraint tree and alignment

This is not so much an issue but an elaboration that perhaps should end up in various reports and documentation files. This is about how the tips in a constraint tree must logically correspond with the sequences in an alignment. When errors occur surrounding this topic, then this might be a reference in trying to resolve the issue. Here are some possible patterns in which tree and alignment relate to each other:

  1. There is no constraint tree at all. This is actually the most common way in which raxml is used. You could rephrase this situation as one where there is a constraint tree but it has zero tips. That may sound trivial but it's worth keeping in mind that the tool normally places sequences in a tree based only on the signal that's in the DNA so there's no need for any tips to be in a constraint.
  2. There's a constraint tree with some taxa from the alignment. In this case the constraint tree has a subset of the tips that are in the alignment. This is fine because it just means that there is a bit more signal besides the DNA, for example to make it so that a few tips are forced to be next to each other. This is basically what we're going for.
  3. The constraint tree has tips that are not in the alignment. This is impossible so it produces errors. The tree searcher is now being asked to place tips in a tree but it has no DNA data to do so. Tips in the constraint tree absent from the alignment must be pruned.
  4. The constraint tree or the alignment has multiple entries with the same name. This is also impossible. The solution here is to make the constraint tree have a polytomy (e.g. ((A1,A2,A3,A4),...)) and make the alignment also have A1, A2, A3 and A4 (instead of multiple, identical As).

I think that's all the possible arrangements :)

Rough to do list/workflow

  • Make rough ERD
  • BOLD data dump -> SQLite (with marker code/kingdom filter as commandline input), do not index yet
  • Filter and rearrange data into ERD tables (barcode records -> atleast family taxonomy available, …)
  • Index data (unique ID for every barcode and taxon)
  • Map opentol ID to taxonomy table (with family name as constraint!)
  • Get barcode data into chunks (group by family name) and make per family a fasta file with barcodes (header with own DB id from barcode table) | OUTPUT FASTA?
  • Make alignment per family (MUSCLE? HMER? For HMER nt -> aa) | OUTPUT PYLIP?
  • Make tree per family/alignment, use opentol tree as constraint (get tree with opentol id) | INPUT PHYLIP & OUTPUT NEWICK?

RAxML placement and branch length optimization

It should be possible to use RAxML slightly different. Summarizing this thread, the idea would be to:

  1. Do phylogenetic placement of any family-level sequences that are not in the opentol
  2. Then do branch length optimization
    This obviates the need for outgroup selection and any ensuing rerooting issues.

Barcode database has no indexes or primary keys

The file BOLD_COI-5P_barcodes.db has no indexes. For example, the column barcode.taxon_id should be indexed so that joins with taxon.taxon_id go MUCH faster. Also, taxon.taxon_id seems like it should be a primary key but it is not unique.

tree inference & data management

It is probably a good idea to run the steps incrementally, by which I mean: writing the alignments is a task that you can kill and then it resumes where it left of, until all families have an alignment. Same for the constraint trees: you probably want to have a tree for every alignment and store this in a folder structure. What you want to avoid is having to run everything in one giant loop that can never fail.

With that in mind, it's probably better to not have a temporary tree that has a throwaway name here, but rather have a system where the raxml run assumes that there is a constraint tree for every alignment (and they probably have the same name but with different extensions).

Make families.txt's for both COI-5P familys and matK_rbcL

Some families are too large right now, we can specify which families (like all mammalia families) we want to use. Save a list with family names to family_COI-5P.txt and family_matK_rbcL.txt which can be used as wildcards in snakemake.

Add date/time stamp to log messages

import logging

# Create a logger
logger = logging.getLogger(__name__)

# Set the log level
logger.setLevel(logging.DEBUG)

# Create a console handler
handler = logging.StreamHandler()

# Define the format for the log messages,
# including the date and time stamp
log_format = '%(asctime)s - %(name)s - %(levelname)s - %(message)s'

# Set the date format
date_format = '%Y-%m-%d %H:%M:%S'

# Create a formatter using the specified format and date format
formatter = logging.Formatter(log_format, datefmt=date_format)

# Set the formatter for the handler
handler.setFormatter(formatter)

# Add the handler to the logger
logger.addHandler(handler)

# Example log message
logger.debug('This is a debug message.')

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.