metaTag: functional and taxonomical annotation of metagenomes through phylogenetic tree placement

Home Page: https://robaina.github.io/MetaTag/

License: Apache License 2.0

bioinformatics metagenomics-toolkit marine-microbiology microbial-ecology metagenomics evolutionary-placement taxonomy-assignment function-annotation

metatag's Introduction

Project status: Active – the project has reached a stable, usable state and is being actively developed.


💡 What is MetaTag?

This repository contains tools to assign taxonomic and functional annotations to short reads through evolutionary placement on phylogenetic trees.

🔧 Setup

The easiest way to use MetaTag is through the provided Docker container. To use it, pull the image:

docker pull ghcr.io/robaina/metatag:latest

Then run the container interactively:

docker run -i ghcr.io/robaina/metatag:latest

Otherwise, you can install MetaTag like this:

  1. Fork the repository (click on "Fork") and clone your fork, or simply clone the main branch directly:
git clone https://github.com/Robaina/MetaTag.git
  2. Change into the MetaTag directory and create the conda environment if you haven't already:
conda env create -n metatag -f envs/metatag-dev.yml
  3. Build and install MetaTag:
conda activate metatag
(metatag) poetry build && pip install dist/metatag*.whl

Installation test

To check that everything is working properly, you can run a test that performs the entire workflow on a minimal dataset. To run it, call the bash script:

conda activate metatag
(metatag) bash tests/run_test.sh

or through Python's unittest module:

conda activate metatag
(metatag) python -m unittest tests/test_pipeline.py

It should produce a final tree with the query sequences placed on it, as well as a number of intermediate files, without any errors.

🚀 Usage

There are two main ways to use MetaTag: through the command-line interface (CLI) or through the Python API. You can find an example of API usage in the following notebook:

🔄 Dependencies

MetaTag would not work without these awesome projects:

Thanks!

:octocat: Contributing

Contributions are always welcome! If you don't know where to start, you may find an interesting issue to work on here. Please read our contribution guidelines first.

✒️ Citation

If you use MetaTag in your research, please cite it as follows:

Robaina Estévez, S., Fernández González, N., & González Hernández, J. M. (2023). metaTag: functional and taxonomical annotation of metagenomes through phylogenetic tree placement (Version v0.1.1) [Computer software]. https://doi.org/10.5281/zenodo.7949949

📄 License

This project is licensed under the terms of the Apache 2.0 license.

metatag's People

Contributors

jmglezh, micronuria, robaina


metatag's Issues

Add a warning when mixing protein and nucleotide sequences

Due to a problem with SqueezeMeta results, I was inadvertently using a fasta file containing both protein and DNA query sequences while working on the placement of environmental sequences in a tree. Because there is no warning either in SqueezeMeta or in the TRAITS pipeline, I did not realize the mistake until I was reviewing the results in the tree, and only because one of the DNA sequences was placed in a paralogous cluster. To my surprise, the DNA sequences are aligned in the query alignment (using PaPaRa).

I think it would be useful to add a checkpoint in the preprocess.py script to halt the run and emit a warning if there are mixed sequence types in the fasta files.
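A minimal sketch of such a checkpoint (the helper names and the 95% heuristic are illustrative, not part of the pipeline):

from Bio import SeqIO

DNA_CHARS = set("ACGTUN-")

def looks_like_dna(sequence: str, threshold: float = 0.95) -> bool:
    """Heuristic: a protein sequence is very unlikely to consist almost
    entirely of nucleotide characters."""
    sequence = sequence.upper()
    return sum(c in DNA_CHARS for c in sequence) / max(len(sequence), 1) >= threshold

def assert_single_sequence_type(fasta_file: str) -> None:
    """Halt with an informative error if a FASTA mixes DNA and protein records."""
    seen = set()
    for record in SeqIO.parse(fasta_file, "fasta"):
        seen.add("dna" if looks_like_dna(str(record.seq)) else "protein")
        if len(seen) > 1:
            raise ValueError(
                f"{fasta_file} mixes DNA and protein sequences; "
                "split the input before placement."
            )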

I would like to take care of this myself; it would just take me much more time (way more) than it would take you.

Filtering out placements

We need a way to deal with queries that are placed in different clusters. Currently, we remove those sequences. However, we are losing valuable information in some cases. Particularly, when two queries are placed in different clusters that contain the same "good" function.

In these cases, we would like to preserve these placements. We could take the placement with the highest LWR and, as before, remove queries that were placed in clusters with different functions.
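A sketch of the proposed rule (the data structures are illustrative): keep the highest-LWR placement when all hit clusters share the same function, and drop the query otherwise:

def filter_multi_cluster_placements(placements, cluster_function):
    """placements: dict of query id -> list of (cluster_id, lwr) tuples.
    cluster_function: dict of cluster_id -> function label.
    Returns a dict of query id -> the single retained (cluster_id, lwr)."""
    kept = {}
    for query, hits in placements.items():
        functions = {cluster_function[cluster] for cluster, _ in hits}
        if len(functions) == 1:  # all clusters agree on the function
            kept[query] = max(hits, key=lambda hit: hit[1])  # highest LWR wins
        # queries hitting clusters with different functions are dropped, as before
    return kept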

Better placement count output table

This issue proposes to change the format of the output table generated by:

https://github.com/Robaina/TRAITS/blob/b370d9336e2e33a8479928945c5e4dc38ab24ce9/code/phyloplacement/placement.py#L145-L148

  1. Merge counts from "Unspecified" and any empty taxonomic level, such as "f__" when counting at the family level (see the sketch below).
  2. Include a header with info about the sample name.
  3. Possibly compute counts for all taxa at the same time?
  4. Also, selecting "good" clusters by ID and by score threshold is redundant; either one or the other would make more sense.
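For point 1, a pandas sketch of the merging step (the "taxon" and "count" column names are assumptions for illustration):

import pandas as pd

def merge_unspecified(counts: pd.DataFrame) -> pd.DataFrame:
    """Fold empty taxonomic levels such as 'f__' into 'Unspecified'."""
    counts = counts.copy()
    taxon = counts["taxon"].str.strip()
    empty = taxon.str.fullmatch(r"[a-z]__", na=False) | taxon.eq("")
    counts.loc[empty, "taxon"] = "Unspecified"
    return counts.groupby("taxon", as_index=False)["count"].sum()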

Label placement

Hi @Robaina

I am trying to label some Arctic sequences introduced in the reference tree from NrtA. I have already defined clusters and scores. My command and the subsequent error is:

python /data/mcm/rlaso/Programs/TRAITS/code/labelplacements.py \
--jplace epa_result.jplace \
--labels /data/mcm/rlaso/Traits/Nitrate_transport/NrtA_structure_and_non_synthenic/data/K15576/ref_database_nrtA_processed_id_dict.pickle /data/mcm/rlaso/Traits/Nitrate_transport/NrtA_structure_and_non_synthenic/data/Arctic/K15576_arctic_processed_id_dict.pickle /data/mcm/rlaso/Traits/Nitrate_transport/NrtA_structure_and_non_synthenic/data/Swapni/NrtA_sequence_Swapni_processed_id_dict.pickle \
--ref_clusters ../cluster_id.tsv --ref_cluster_scores ../cluster_score.tsv --prefix arct_

Traceback (most recent call last):
  File "/data/mcm/rlaso/Programs/TRAITS/code/labelplacements.py", line 103, in <module>
    main()
  File "/data/mcm/rlaso/Programs/TRAITS/code/labelplacements.py", line 91, in main
    assignLabelsToPlacements(
  File "/data/mcm/rlaso/Programs/TRAITS/code/phyloplacement/placement.py", line 402, in assignLabelsToPlacements
    ref_clusters = parseTreeClusters(ref_clusters_file, cluster_as_key=False, sep='\t')
  File "/data/mcm/rlaso/Programs/TRAITS/code/phyloplacement/placement.py", line 311, in parseTreeClusters
    return dict(df.values)
ValueError: dictionary update sequence element #0 has length 10; 2 is required

I am not sure where the problem is. In principle, I use the same dictionaries for relabelling the tree, so they should be fine.

Build a general sequence label parser class to extract genome IDs

We are using different databases and label formats, so it would make sense to build a general label parser to extract the genome ID from any label that we use. The parser would guess the database from the label format and return the genome ID, or an exception/None if the ID cannot be recovered or if there is a format conflict among databases.

This issue would fix #61
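A possible shape for the parser (the regular expressions are placeholders; the real patterns would come from the label formats of the databases in use):

import re
from typing import Optional

class SequenceLabelParser:
    """Guess the source database from a label's format and extract the genome ID."""

    PATTERNS = {
        "mardb": re.compile(r"(MMP\d+)"),            # placeholder pattern
        "oceandna": re.compile(r"(OceanDNA_b\d+)"),  # placeholder pattern
    }

    def extract_genome_id(self, label: str) -> Optional[str]:
        matches = {
            db: match.group(1)
            for db, pattern in self.PATTERNS.items()
            if (match := pattern.search(label))
        }
        if not matches:
            return None  # ID could not be recovered
        if len(matches) > 1:
            raise ValueError(f"Format conflict among databases for label: {label}")
        return next(iter(matches.values()))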

Test phylotree taxonomy assignment with classified reads

Short-read placement may fail to place sequences in the right position to resolve taxonomy. To test the accuracy of this method, we could place test sequences that have already been (taxonomically) classified. We could artificially fragment these sequences to simulate short reads.
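A small sketch of the fragmentation step (read length and number of fragments per sequence are illustrative parameters):

import random
from Bio import SeqIO

def fragment_records(fasta_file, read_length=150, frags_per_seq=10, seed=42):
    """Yield (read_id, fragment) pairs simulating short reads drawn from
    already-classified sequences."""
    rng = random.Random(seed)
    for record in SeqIO.parse(fasta_file, "fasta"):
        sequence = str(record.seq)
        if len(sequence) <= read_length:
            continue
        for i in range(frags_per_seq):
            start = rng.randrange(len(sequence) - read_length)
            yield f"{record.id}_frag{i}", sequence[start:start + read_length]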

Assign GTDB taxonomy to UniProt reference sequences

Dear all,
I have worked on assigning GTDB taxonomy to the UniProt reference sequences that we might add when building a tree. I have a script, create_table.py, to modify the merged_taxonomy.tsv file to include the NCBI taxIDs and the corresponding GTDB taxonomy, and a few lines in ncbitogtdb.py to capture the NCBI taxID from the UniProt fasta headers.

I am not able to integrate those into the quite complex, but wonderful, scripts that we already have in the project. So, please, integrate these files/code into them. I have attached both scripts:
ncbitogtdb.py.zip
create_table.py.zip

To build the new merged_taxonomy.tsv, you will need the metadata.tsv files from GTDB. They are quite big, so it is better to download them directly from the GTDB repository of the latest release:
https://data.gtdb.ecogenomic.org/releases/latest/
For Archaea:
ar122_metadata.tar.gz
For Bacteria:
bac120_metadata.tar.gz

Notice that there are other files with a similar name but called taxonomy. Do not use those.
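A sketch of building the NCBI-to-GTDB lookup from the extracted metadata files (the "ncbi_taxid" and "gtdb_taxonomy" column names follow the GTDB metadata releases, but verify them against the release you download):

import pandas as pd

def build_ncbi_to_gtdb(metadata_files):
    """Map NCBI taxIDs to GTDB taxopaths from the GTDB metadata tables."""
    frames = [
        pd.read_csv(path, sep="\t", usecols=["ncbi_taxid", "gtdb_taxonomy"],
                    low_memory=False)
        for path in metadata_files
    ]
    merged = pd.concat(frames, ignore_index=True).drop_duplicates("ncbi_taxid")
    return dict(zip(merged["ncbi_taxid"], merged["gtdb_taxonomy"]))

ncbi_to_gtdb = build_ncbi_to_gtdb(["ar122_metadata.tsv", "bac120_metadata.tsv"])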

Missing documentation in third-party tools

Many of the third-party tools employed by the code don't have proper documentation (such as the wrappers in wrappers.py).

Minimal documentation should be added in these cases.

Allow query cluster assignment without taxonomy

So far, labelplacements.py only works for MARdb sequences with valid MMP identifiers (so that a taxonomic path can be retrieved from the GTDB database). However, this prevents cluster assignment for query sequences when the reference sequences don't come from MARdb.

This issue will modify:

https://github.com/Robaina/TRAITS/blob/7ca208a0bb593dea9ebec5c20a99877e7d60ad3f/code/phyloplacement/placement.py#L371-L389

to allow taxonomy-free cluster assignment.

To this end, we can create a fake taxopath containing only the cluster name (perhaps additional fields if required; it doesn't matter). gappa examine assign should then assign clusters to placed queries as if it had a valid taxopath.
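A minimal sketch of writing such a taxon file (a tab-separated mapping from each reference label to its "taxopath", here just the cluster name):

def write_cluster_taxon_file(ref_clusters, output_file):
    """ref_clusters: dict of reference sequence label -> cluster name.
    Writes the two-column taxon file consumed by gappa examine assign."""
    with open(output_file, "w") as out:
        for label, cluster in ref_clusters.items():
            out.write(f"{label}\t{cluster}\n")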

Problems with GTDB taxonomy assignments to MARdb sequences

I have encountered two problems when using the relabeltree.py --taxonomy option to assign taxonomy to MARdb sequences:

  1. Some genomes from the initial MARdb files that Jose sent are missing the MARdb ID, and therefore those sequences cannot be classified by their ID. To solve this, we need to:
    a) First, figure out whether all teams are working with the same files (complete and partial genomes, QC'd by Jose).
    b) Check with José whether the MAR IDs were removed during the QC he did - I will do that once we figure out the first part.

  2. For those sequences with MAR IDs, the script is assigning the NCBI taxonomy and not the GTDB one. For example:

One of the genera in my tree is classified as in NCBI (Beta), but in GTDB it is Gamma. This happens with several Betas that have been reclassified as Gammas, for example in the tree.

The label color is the GTDB web classification and the square on the right is the taxonomy applied by the script (green: Zproteobacteria, pink: Gamma, orange: Beta).
I have another example with a Streptomyces that has the NCBI taxonomy in the tree label and not the GTDB one.
Are we using an older release?

Error in makedatabase_struct.py

I am trying to redo the old trees with the new databases, using an updated version of the pipeline. I also updated the environment, using conda update and the YAML file. I am getting an error with makedatabase_struct.py. Here is the code:

python3 /data/mcm/rlaso/Programs/TRAITS/code/makedatabase_struct.py --in /data/mcm/databases/TRAITs_reference_database/final_ref_database.fasta --outdir /data/mcm/rlaso/Traits/Phylogenetic_trees/Nitrogen_cycle/Nrt/data/reference_data_NrtA_K15576/ --hmms /data/mcm/databases/kofam/profiles/K15576.hmm /data/mcm/databases/kofam/profiles/K15577.hmm --hmm_struc "K15576 4 K15577" --hmmsearch_args "-T 501.70,-T 278.07" --target_hmm "K15576"
* Making peptide-specific reference database...
Running Hmmer...
Filtering results by HMM structure...
Filtering Fasta...
RuntimeError: get seq count and length error

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/data/mcm/rlaso/Programs/TRAITS/code/makedatabase_struct.py", line 157, in <module>
    main()
  File "/data/mcm/rlaso/Programs/TRAITS/code/makedatabase_struct.py", line 115, in main
    filterFastaByHMMstructure(
  File "/data/mcm/rlaso/Programs/TRAITS/code/phyloplacement/database/manipulation.py", line 459, in filterFastaByHMMstructure
    filterFASTAbyIDs(input_fasta, record_ids=record_ids,
  File "/data/mcm/rlaso/Programs/TRAITS/code/phyloplacement/database/manipulation.py", line 76, in filterFASTAbyIDs
    fa = pyfastx.Fasta(input_fasta)
SystemError: <class 'Fasta'> returned a result with an error set

Previously, makedatabase_struct.py worked for me, so I am unsure whether the error is in the new code or in the new conda environment.

Export merged label dict to pickle

Export the merged ID-label dictionary (created when inputting several reference sequence files) to a pickle file for further use. Required, for instance, by #6 and also by relabeltree.py.
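A minimal sketch of the export (function and file names are illustrative):

import pickle

def export_merged_label_dict(label_dicts, output_file="merged_id_dict.pickle"):
    """Merge the per-input ID-label dictionaries and persist them for reuse,
    e.g. by relabeltree.py."""
    merged = {}
    for label_dict in label_dicts:
        merged.update(label_dict)
    with open(output_file, "wb") as handle:
        pickle.dump(merged, handle)
    return merged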

Include Oceans Microbiome taxonomy database in taxonomy assigner

So far, we have relied on the taxonomy info that the MAR database assigns to its genomes. This issue proposes to expand the code to include the taxonomy info from the Oceans Microbiomics Database, which ultimately comes from GTDB.

Specifically, we have to adapt

https://github.com/Robaina/TRAITS/blob/08e0e1a11ac3d2daf9bd3268a04eaeb48b9707d6/code/phyloplacement/database/parsers/mardb.py#L80-L86

to include the new taxonomy info.

Also, consider relocating it to a separate module dedicated to taxonomy assignment.

This issue partially solves #40.
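A sketch of a lookup with a fallback across the two sources (the dictionary names are illustrative):

def assign_taxonomy(genome_id, mardb_taxonomy, omd_taxonomy):
    """Prefer the MAR database taxonomy and fall back to the Ocean Microbiomics
    one; both arguments are dicts of genome_id -> taxopath."""
    taxopath = mardb_taxonomy.get(genome_id) or omd_taxonomy.get(genome_id)
    if taxopath is None:
        raise KeyError(f"No taxonomy found for genome {genome_id}")
    return taxopath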

Small function to be added to the main Jupyter notebook for cluster definition

Hi,

As I was working with the Jupyter notebook to define the clusters, I realized a small piece of functionality was missing, so I wrote some lines to help, which you can add to the notebook in the main branch.

Once the user has manually defined the clusters based on internal nodes (in iTOL, for instance), there was no quick way (or at least I have not seen one) to build a dictionary with the user-defined cluster names and the leaves corresponding to them. So here is my solution.

  1. First, create a tsv file with two columns: the first with the cluster name, for instance NasA_Gammaproteobacteria or AioA_paralog, and the second with the internal node name, for instance IN_2_100 or IN_65_87.
    You should create this file while checking the tree to define the clusters.

  2. Then, using pandas, import that tsv file into a dictionary, and from it create another one, populating the items with the getAllDescendantsOfTargetNode function. I am using a dictionary to import the information in case we might want to join several branches into the same kind of cluster. Imagine that you have a group of paralogs inserted in the middle of a larger branch of orthologs, resulting in two groups of orthologs; in that case, you might want to consider these two groups as a single one.

Here is the code

'''
Create a dictionary with cluster names as keys
and all leaves belonging to that cluster,
starting from a tsv file that contains the cluster names
in the first column and the internal node names in the
second column.
'''

import pandas as pd

# Import the tsv file into a dictionary: {cluster_name: internal_node_name}
dict_from_tsv = pd.read_csv('test.tsv', sep='\t', header=None,
                            index_col=0, squeeze=True).to_dict()

# Expand each internal node into the leaves that descend from it
tree_clusters = {
    cluster: phylotree.getAllDescendantsOfTargetNode(node)
    for cluster, node in dict_from_tsv.items()
}

Two output files after preprocess.py

When running preprocess.py on a folder containing several files, two files are created: merged_data.fasta and the output file that the user specifies, e.g. cleaned.faa. The file sizes are different; what are the differences, and which one should be used in the rest of the analysis?

Allow multiple hmms in makedatabase.py

This is a convenient enhancement to allow multiple HMMs (e.g. TIGRFAMs) in code/makedatabase.py. This way, makedatabase.py can be called a single time instead of once per HMM.

We should allow selection of the total database size as well as individual sizes for each HMM.
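Until this lands, a small driver can call makedatabase.py once per HMM (the paths mirror invocations quoted elsewhere on this page; adjust the flags to your version of the script):

import subprocess

hmms = ["hmms/K00367.hmm", "hmms/K00370.hmm"]  # illustrative paths
thresholds = ["-T 1033.77", "-T 1075.37"]      # per-HMM hmmsearch arguments

for hmm, threshold in zip(hmms, thresholds):
    subprocess.run(
        ["python3", "code/makedatabase.py",
         "--in", "data/databases/final_ref_database.faa",
         "--outdir", "results",
         "--hmm", hmm,
         "--hmmsearch_args", threshold],
        check=True,
    )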

Using built HMMs or from COGs

I am trying to build the tree for NrtC. NrtC contains the TIGR01184 domain, which is also present in the neighboring protein NrtD, but NrtC is longer and has an additional domain that hits COG1116.

I ran TIGR01184 against the MAR database smoothly, but now I wanted to run an HMM built from COG1116 to obtain the sequences and compare them with the TIGR results, and it does not work.

First, I used the hmm model obtained from this website. The command was:

(traits) rlaso@elbrus:/data/mcm/rlaso/Traits/Nrt$ python3  /data/mcm/rlaso/Programs/TRAITS/code/makedatabase.py \
> --in /data/mcm/rlaso/Traits/Databases/MAR_faa/fullMARHQ_cleaned.faa \
> --outdir /data/mcm/rlaso/Traits/Nrt/results/COG1116/ \
> --hmm /data/mcm/rlaso/Traits/Nrt/hmms/COG1116.hmm \
> --prefix "ref_nrtC" --relabel
* Making peptide-specific reference database...
Running Hmmer...

Error: NC bit thresholds unavailable on model NOG.COG1116.clustalo_raw

Parsing Hmmer output file...
Traceback (most recent call last):
  File "/data/mcm/rlaso/Programs/TRAITS/code/makedatabase.py", line 118, in <module>
    main()
  File "/data/mcm/rlaso/Programs/TRAITS/code/makedatabase.py", line 81, in main
    filterFASTAByHMM(
  File "/data/mcm/rlaso/Programs/TRAITS/code/phyloplacement/database/manipulation.py", line 110, in filterFASTAByHMM
    if not hmmer_hits.id.values.tolist():
  File "/data/mcm/rlaso/Programs/Miniconda/envs/traits/lib/python3.9/site-packages/pandas/core/generic.py", line 5487, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'id'

I understood the problem to be that the COG model (see COG1116_hmm.txt) does not have the NC value that is present in TIGR HMMs.
COG1116_hmm.txt

I tried to fix it by adding the corresponding line, with the value inferred from the ThresholdBitScore of the COG model. I know this value would not be quite "right", since the COG model is based on BLAST and the TIGR ones on HMMs, but I wanted to give it a try:
NC 207.409

An error still occurred, but it was not informative enough (bad file format in HMM file):

(traits) rlaso@elbrus:/data/mcm/rlaso/Traits/Nrt/hmms$ python3  /data/mcm/rlaso/Programs/TRAITS/code/makedatabase.py --in /data/mcm/rlaso/Traits/Databases/MAR_faa/fullMARHQ_cleaned.faa --outdir /data/mcm/rlaso/Traits/Nrt/results/COG1116/ --hmm /data/mcm/rlaso/Traits/Nrt/hmms/COG1116.hmm --prefix "ref_nrtC" --relabel
* Making peptide-specific reference database...
Running Hmmer...

Error: bad file format in HMM file /data/mcm/rlaso/Traits/Nrt/hmms/COG1116.hmm
Parsing Hmmer output file...
Traceback (most recent call last):
  File "/data/mcm/rlaso/Programs/TRAITS/code/makedatabase.py", line 118, in <module>
    main()
  File "/data/mcm/rlaso/Programs/TRAITS/code/makedatabase.py", line 81, in main
    filterFASTAByHMM(
  File "/data/mcm/rlaso/Programs/TRAITS/code/phyloplacement/database/manipulation.py", line 110, in filterFASTAByHMM
    if not hmmer_hits.id.values.tolist():
  File "/data/mcm/rlaso/Programs/Miniconda/envs/traits/lib/python3.9/site-packages/pandas/core/generic.py", line 5487, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'id'

I then tried to build the HMM myself, using the whole raw alignment of the COG from the eggNOG website and hmmbuild:

hmmbuild -n COG1116 COG1116_manual.hmm COG1116_alignment_eggnote.txt 
# hmmbuild :: profile HMM construction from multiple sequence alignments
# HMMER 3.3.2 (Nov 2020); http://hmmer.org/
# Copyright (C) 2020 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# input alignment file:             COG1116_alignment_eggnote.txt
# output HMM file:                  COG1116_manual.hmm
# name (the single) HMM:            COG1116
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

# idx name                  nseq  alen  mlen eff_nseq re/pos description
#---- -------------------- ----- ----- ----- -------- ------ -----------
1     COG1116               3486  2241   949    51.01  0.590 

# CPU time: 1.34u 0.02s 00:00:01.36 Elapsed: 00:00:01.37

However, the NC error happened again, and adding the NC value gave the same error.

I am unsure how to calculate this NC (noise cutoff), since it is not reported by hmmbuild, and I am not much of an expert in this regard. Does anyone know how to do it? I just found this [paper](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6288931).

Would it be possible, if not, to extract the sequences from the HMM search without the NC threshold?
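For what it's worth, in HMMER3 ASCII profiles the Pfam-style cutoff lines (GA/TC/NC) carry two bit scores each, a full-sequence and a per-domain threshold, so a single-value NC line may itself be what triggers the "bad file format" error. A hedged sketch that inserts a two-value NC line before the STATS header lines (placement and trailing semicolon follow the layout of Pfam HMM files; double-check against your HMMER version):

def add_nc_threshold(hmm_in, hmm_out, seq_bits, dom_bits=None):
    """Copy a HMMER3 ASCII profile, inserting an NC cutoff line with both a
    full-sequence and a per-domain bit score before the first STATS line."""
    dom_bits = seq_bits if dom_bits is None else dom_bits
    inserted = False
    with open(hmm_in) as fin, open(hmm_out, "w") as fout:
        for line in fin:
            if line.startswith("STATS") and not inserted:
                fout.write(f"NC    {seq_bits:.2f} {dom_bits:.2f};\n")
                inserted = True
            fout.write(line)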

Assign lowest possible taxonomy to placements by traversing tree

Currently, the taxonomy and function of placed sequences are assigned according to their landing cluster. However, clusters may sometimes be too broad to provide a meaningful, non-trivial taxonomy for queries. In such cases, it may be better to assign taxonomy directly based on the reference sequence closest to the query's placement. To this end, we need to traverse the tree from the point of placement to find the closest reference sequences. Biopython's Bio.Phylo provides some limited tools; perhaps other, more general tools for parsing DAGs (networkx?) would be more helpful.
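A brute-force sketch of that traversal with Bio.Phylo (fine for trees of a few thousand leaves; the function and argument names are illustrative):

from Bio import Phylo

def closest_reference(tree_file, placement_node_name, reference_names):
    """Return the reference leaf with the smallest patristic distance to the
    node where the query was placed."""
    tree = Phylo.read(tree_file, "newick")
    placement = next(tree.find_clades(name=placement_node_name))
    best = min(
        (leaf for leaf in tree.get_terminals() if leaf.name in reference_names),
        key=lambda leaf: tree.distance(placement, leaf),
    )
    return best.name, tree.distance(placement, best)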

Remove outliers

I have been trying to introduce Nrt Arctic sequences into the Nrt reference tree. The procedure works well, but there are some outliers. When I try to remove them using removetreeoutliers.py, nothing changes. The command line is:

python3 /data/mcm/rlaso/Programs/TRAITS/code/removetreeoutliers.py \
 --tree /data/mcm/rlaso/Traits/Nrt/results/Arctic/epa_result.newick \
 --outdir /data/mcm/rlaso/Traits/Nrt/results/Arctic/ \
 --aln /data/mcm/rlaso/Traits/Nrt/results/Arctic/TIGR01184_COG1116_arctic_90filtered_processed.faln

The trees and alignment are attached; the extension is changed to .txt to allow uploading to GitHub:
epa_result_relabel.newick.txt
epa_result_shrink_relabel.newick.txt
TIGR01184_COG1116_arctic_90filtered_processed.faln.txt

I am not sure whether this is a bug, or whether the stringency of the removal can be modified.

ValueError in makedatabase.py

I am trying to make a database with a mix of HMMs: one Pfam, which does not require additional arguments, and two KOfams, which do. This is part of a test I am doing for something else, so I have changed the hmmsearch call in wrappers.py, but I have only changed the type of output file, from tblout to domtblout. I do not think this is causing the error.

I got a ValueError: at some point the pipeline tries to convert the Pfam name to a float. Probably I am making a mistake in the module call. This is the call and the error (ignore the last \ in the call):

(traits) nfernandez@elbrus:/data/mcm/nfernandez/TRAITS$  python3 ./code/makedatabase.py \
>  --in data/databases/final_ref_database.faa \
>  --outdir genes/prueba/results \
>  --hmms genes/prueba/hmms/PF18582.hmm genes/prueba/hmms/K00367.hmm genes/prueba/hmms/K00370.hmm \
>  --hmmsearch_args "None","-T 1033.77","-T 1075.37"\
> 
* Making peptide-specific reference database...
 * Processing hmm PF18582.hmm with additional arguments: --cut_nc
Running Hmmer...
Parsing Hmmer output file...
Traceback (most recent call last):
  File "/data/mcm/nfernandez/opt/envs/traits/lib/python3.9/site-packages/Bio/File.py", line 72, in as_handle
    with open(handleish, mode, **kwargs) as fp:
TypeError: expected str, bytes or os.PathLike object, not TextIOWrapper

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/mcm/nfernandez/TRAITS/./code/makedatabase.py", line 194, in <module>
    main()
  File "/data/mcm/nfernandez/TRAITS/./code/makedatabase.py", line 132, in main
    filterFASTAByHMM(
  File "/data/mcm/nfernandez/TRAITS/code/phyloplacement/database/manipulation.py", line 114, in filterFASTAByHMM
    hmmer_hits = parseHMMsearchOutput(hmmer_output)
  File "/data/mcm/nfernandez/TRAITS/code/phyloplacement/database/manipulation.py", line 62, in parseHMMsearchOutput
    for queryresult in SearchIO.parse(handle, 'hmmer3-tab'):
  File "/data/mcm/nfernandez/opt/envs/traits/lib/python3.9/site-packages/Bio/SearchIO/__init__.py", line 306, in parse
    yield from generator
  File "/data/mcm/nfernandez/opt/envs/traits/lib/python3.9/site-packages/Bio/SearchIO/HmmerIO/hmmer3_tab.py", line 33, in __iter__
    yield from self._parse_qresult()
  File "/data/mcm/nfernandez/opt/envs/traits/lib/python3.9/site-packages/Bio/SearchIO/HmmerIO/hmmer3_tab.py", line 98, in _parse_qresult
    cur = self._parse_row()
  File "/data/mcm/nfernandez/opt/envs/traits/lib/python3.9/site-packages/Bio/SearchIO/HmmerIO/hmmer3_tab.py", line 51, in _parse_row
    hit["evalue"] = float(cols[4])  # evalue (full sequence)
ValueError: could not convert string to float: 'PF18582.4'
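The switch from tblout to domtblout is actually a plausible cause here: the pipeline still parses with Biopython's 'hmmer3-tab' format, and in domtblout the fifth column holds the query accession (here 'PF18582.4') where tblout has the full-sequence e-value, which would explain the failed float conversion. A sketch of selecting the matching SearchIO format:

from Bio import SearchIO

def parse_hmmsearch_output(hmmer_output, domains=False):
    """Pick the SearchIO format that matches how hmmsearch was invoked:
    --tblout -> 'hmmer3-tab', --domtblout -> 'hmmer3-domtab'."""
    fmt = "hmmer3-domtab" if domains else "hmmer3-tab"
    hit_ids = []
    for queryresult in SearchIO.parse(hmmer_output, fmt):
        hit_ids.extend(hit.id for hit in queryresult.hits)
    return hit_ids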

labelplacements.py error in gappa

Hi, I am trying to apply labelplacements.py to my Amt tree, but I keep getting an error related to gappa:

(traits) rlaso@elbrus:/data/mcm/rlaso/Traits/Phylogenetic_trees/Nitrogen_cycle/Amt/results/Arctic$ python /data/mcm/rlaso/Programs/TRAITS/code/labelplacements.py   --jplace epa_result_modified.jplace   --labels /data/mcm/rlaso/Traits/Phylogenetic_trees/Nitrogen_cycle/Amt/data/reference_data_Amt_TIGR00836/ref_amtref_database_id_dict.pickle /data/mcm/rlaso/Traits/Phylogenetic_trees/Nitrogen_cycle/Amt/data/McDonald_2016/mcdonald2016_prokaryotes_processed_id_dict.pickle /data/mcm/rlaso/Traits/Phylogenetic_trees/Nitrogen_cycle/Amt/data/McDonald_2016/mcdonald2016_plant_processed_id_dict.pickle /data/mcm/rlaso/Traits/Phylogenetic_trees/Nitrogen_cycle/Amt/data/McDonald_2016/mcdonald2016_rh_processed_id_dict.pickle /data/mcm/rlaso/Traits/Phylogenetic_trees/Nitrogen_cycle/Amt/data/reviewed_sequences/SwissProt_Amt_prokaryotes_processed_id_dict.pickle   --ref_clusters ../cluster_id.tsv   --ref_cluster_scores ../cluster_score.tsv   --prefix arctic
                                              ....      ....
                                             '' '||.   .||'
                                                  ||  ||
                                                  '|.|'
     ...'   ....   ... ...  ... ...   ....        .|'|.
    |  ||  '' .||   ||'  ||  ||'  || '' .||      .|'  ||
     |''   .|' ||   ||    |  ||    | .|' ||     .|'|.  ||
    '....  '|..'|'. ||...'   ||...'  '|..'|.    '||'    ||:.
    '....'          ||       ||
                   ''''     ''''   v0.7.1 (c) 2017-2021
                                   by Lucas Czech and Pierre Barbera

Invocation:                        gappa examine assign --jplace-path epa_result_modified.jplace --taxon-file temp_oxulnwpxnw --out-dir
                                   /data/mcm/rlaso/Traits/Phylogenetic_trees/Nitrogen_cycle/Amt/results/Arctic --file-prefix arctic --allow-file-overwriting
                                   --per-query-results --best-hit
Command:                           gappa examine assign

Input:
  --jplace-path                    epa_result_modified.jplace
  --taxon-file                     temp_oxulnwpxnw
  --root-outgroup
  --taxonomy
  --ranks-string                   superkingdom|phylum|class|order|family|genus|species

Settings:
  --sub-taxopath
  --max-level                      0
  --distribution-ratio             -1
  --consensus-thresh               1
  --resolve-missing-paths          false

Output:
  --out-dir                        /data/mcm/rlaso/Traits/Phylogenetic_trees/Nitrogen_cycle/Amt/results/Arctic
  --file-prefix                    arctic
  --file-suffix
  --cami                           false
  --sample-id
  --krona                          false
  --sativa                         false
  --per-query-results              true
  --best-hit                       true

Global Options:
  --allow-file-overwriting         true
  --verbose                        false
  --threads                        52
  --log-file

Run the following command to get the references that need to be cited:
`gappa tools citation Czech2020-genesis-and-gappa`

Started 2022-03-29 21:09:28

Found 1 jplace file
Running the assignment
Not all leafs in the reference tree were taxonomically labelled!(1000 / 1164)
Please check tree leaf label and taxon file taxa name congruency!
Segmentation fault (core dumped)
Traceback (most recent call last):
  File "/data/mcm/rlaso/Programs/TRAITS/code/labelplacements.py", line 115, in <module>
    main()
  File "/data/mcm/rlaso/Programs/TRAITS/code/labelplacements.py", line 100, in main
    assignLabelsToPlacements(
  File "/data/mcm/rlaso/Programs/TRAITS/code/phyloplacement/placement.py", line 413, in assignLabelsToPlacements
    parseGappaAssignTable(
  File "/data/mcm/rlaso/Programs/TRAITS/code/phyloplacement/placement.py", line 298, in parseGappaAssignTable
    table = pd.read_csv(input_table, sep='\t')
  File "/data/mcm/rlaso/Programs/Miniconda/envs/traits/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/data/mcm/rlaso/Programs/Miniconda/envs/traits/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 586, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/data/mcm/rlaso/Programs/Miniconda/envs/traits/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 482, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/data/mcm/rlaso/Programs/Miniconda/envs/traits/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 811, in __init__
    self._engine = self._make_engine(self.engine)
  File "/data/mcm/rlaso/Programs/Miniconda/envs/traits/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1040, in _make_engine
    return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
  File "/data/mcm/rlaso/Programs/Miniconda/envs/traits/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 69, in __init__
    self._reader = parsers.TextReader(self.handles.handle, **kwds)
  File "pandas/_libs/parsers.pyx", line 549, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file


The corresponding cluster files are here:
cluster_id.txt
cluster_score.txt

I thought the mistake could be related to the fact that I didn't provide a number to cluster_score, but I changed the strings to numbers and the same error appeared again. Then I read the error more carefully; I think it is related to pandas, but I am not sure what the issue is exactly.
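The gappa output ("Not all leafs in the reference tree were taxonomically labelled! (1000 / 1164)") points to a mismatch between the tree leaf labels and the taxon file; pandas only fails afterwards because gappa crashed before writing the assignment table. A quick congruency check, assuming a newick copy of the reference tree and a two-column, tab-separated taxon file:

from Bio import Phylo

def report_unlabelled_leaves(tree_file, taxon_file):
    """List reference tree leaves that have no entry in the gappa taxon file."""
    tree = Phylo.read(tree_file, "newick")
    leaves = {leaf.name for leaf in tree.get_terminals()}
    with open(taxon_file) as fin:
        labelled = {line.split("\t", 1)[0].strip() for line in fin if line.strip()}
    missing = leaves - labelled
    print(f"{len(missing)} / {len(leaves)} leaves lack a taxon entry")
    return missing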

Missing GTDB taxonomy in some reference sequences

Once the reference database was extended, I updated the database, the repo, and the conda environment to include the new changes to relabeltree.py --taxonomy for assigning the GTDB taxonomy to the DB sequences included in a tree. I ran the run_test.sh script and everything worked fine, but I got some errors when using it with one of my trees:

First, the script did not work:

$ python3 ./code/relabeltree.py \
>    --tree  genes/nosZ/results/tree_before_placement/ref_database_shrink.newick \
>    --aln genes/nosZ/results/tree_before_placement/ref_database_shrink.faln \
>    --labels \
>    genes/nosZ/results/nosZ_ref_database_id_dict.pickle \
>    genes/nosZ/results/nosz_hallin18_short_ids_id_dict.pickle \
>    --label_prefixes "refdb_" "hal_" \
>    --taxonomy
* Relabelling tree...
Traceback (most recent call last):
  File "/media/disk5/nfernandez/TRAITS/./code/relabeltree.py", line 135, in <module>
    main()
  File "/media/disk5/nfernandez/TRAITS/./code/relabeltree.py", line 116, in main
    tree_label_dict, export_label_dict = assignTaxonomyToLabels(label_dict)
  File "/media/disk5/nfernandez/TRAITS/./code/relabeltree.py", line 89, in assignTaxonomyToLabels
    taxopath = taxonomy.assignTaxonomyToLabel(label)
  File "/media/disk5/nfernandez/TRAITS/code/phyloplacement/taxonomy.py", line 112, in assignTaxonomyToLabel
    return self._taxodata.loc[genome_id].item()
  File "/media/disk5/nfernandez/opt/miniconda3/envs/traits/lib/python3.9/site-packages/pandas/core/generic.py", line 5487, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'item'

After Robaina fixed the first problem, I encountered a second one: once the tree was relabelled, several sequences were missing their taxonomy. I checked the following again:

  • updated the project repo
  • updated the conda environment
  • tested the script with the test data, and everything worked fine.

This is the script I ran:
python3 ./code/relabeltree.py \
   --tree  genes/nosZ/results/tree_before_placement/ref_database_shrink.newick \
   --aln genes/nosZ/results/tree_before_placement/ref_database_shrink.faln \
   --labels \
   genes/nosZ/results/nosZ_ref_database_id_dict.pickle \
   genes/nosZ/results/nosz_hallin18_short_ids_id_dict.pickle \
   --label_prefixes "refdb_" "hal_" \
   --taxonomy

But I reproduced the problem, and many reference sequences were missing the taxonomy, as you can see in the tsv produced by relabeltree.py --taxonomy (screenshot attached).

I checked whether the sequence IDs of those missing the taxonomy were included in the merged_taxonomy.tsv file, and that was not the problem:

$ grep "OceanDNA_b4030" merged_taxonomy.tsv
OceanDNA_b4030  d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Cytophagales;f__Cyclobacteriaceae;g__ELB16-189;s__

Add taxonomical representation to makedatabase.py

Add an optional argument to write a tsv file with the assigned taxonomy of the reference sequences, as well as a pie chart depicting the frequencies at the chosen taxonomic level.

This feature would be useful to get a sense of the diversity captured in the reference database.
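A sketch of the proposed output, assuming a two-column label/taxopath tsv and GTDB-style semicolon-separated taxopaths:

import matplotlib.pyplot as plt
import pandas as pd

def plot_taxonomy_pie(taxonomy_tsv, level_index=1, out_png="taxonomy_pie.png"):
    """Plot taxon frequencies at one level of the taxopath
    (with GTDB paths, level_index=1 corresponds to phylum)."""
    table = pd.read_csv(taxonomy_tsv, sep="\t", header=None,
                        names=["label", "taxopath"])
    counts = table["taxopath"].str.split(";").str[level_index].value_counts()
    ax = counts.plot.pie(autopct="%1.0f%%")
    ax.set_ylabel("")
    plt.savefig(out_png, bbox_inches="tight")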

Error with tree_model in buildtree.py

When building a tree with buildtree.py, if you want to specify the model with the --tree_model parameter, the script does not recognize the argument. For example:

$ python3 ./code/buildtree.py \
   --in genes/nas/results/ref_seqs_final.faa \
   --outdir genes/nas/results/ref_tree_nas_final \
   --msa_method "muscle" \
   --tree_method "iqtree" \
   --tree_model "LG+F+I+G4"
usage: buildtree.py [-h] --in DATA [--outdir OUTDIR] [--msa_method {muscle,mafft}] [--tree_method {iqtree,fasttree}]
[--tree_model {iqtest,modeltest,a model name}]
buildtree.py: error: argument --tree_model: invalid choice: 'LG+F+I+G4' (choose from 'iqtest', 'modeltest', 'a model name')
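The usage message suggests that the literal string 'a model name' is registered as an argparse choice; dropping choices for this option would let any substitution model string through while still documenting the two sentinel values. A minimal sketch:

import argparse

parser = argparse.ArgumentParser()
# No `choices` list: accept 'iqtest', 'modeltest', or any free-form model string.
parser.add_argument(
    "--tree_model",
    default="iqtest",
    help="'iqtest', 'modeltest', or a substitution model name, e.g. LG+F+I+G4",
)

args = parser.parse_args(["--tree_model", "LG+F+I+G4"])
print(args.tree_model)  # -> LG+F+I+G4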

Enable verbosity level control

Currently, stdout is filled with output from the multiple tools employed along the pipeline. It would be helpful to control the verbosity level, e.g. writing only progress indicators to stdout and the full output to a log file.
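A sketch of one common pattern using the standard logging module (the logger name and file are illustrative): concise progress messages go to the console, while the full output is kept in a log file:

import logging

def setup_logging(verbose=False, log_file="metatag.log"):
    """Console shows INFO-level progress (or DEBUG with --verbose);
    the log file always captures everything."""
    logger = logging.getLogger("metatag")
    logger.setLevel(logging.DEBUG)

    console = logging.StreamHandler()
    console.setLevel(logging.DEBUG if verbose else logging.INFO)
    logger.addHandler(console)

    logfile = logging.FileHandler(log_file)
    logfile.setLevel(logging.DEBUG)
    logger.addHandler(logfile)
    return logger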

Informative error message for hmmsearch failure in makedatabase.py

When hmmsearch doesn't retrieve any results, makedatabase.py outputs this error:

$ python3 ./code/makedatabase.py \
>  --in /data/mcm/nfernandez/nosZ/test_cytC/data/cytc533_RefSeq_seqs_clean.fasta \
>  --hmm /data/mcm/nfernandez/nosZ/test_cytC/hmms/NosZ.hmm \
>  --outdir /data/mcm/nfernandez/nosZ/test_cytC/results
* Making peptide-specific reference database...
Running Hmmer...
Parsing Hmmer output file...
Filtering Fasta...
Traceback (most recent call last):
  File "/data/mcm/nfernandez/TRAITS/./code/makedatabase.py", line 118, in <module>
    main()
  File "/data/mcm/nfernandez/TRAITS/./code/makedatabase.py", line 81, in main
    filterFASTAByHMM(
  File "/data/mcm/nfernandez/TRAITS/code/phyloplacement/database/manipulation.py", line 111, in filterFASTAByHMM
    filterFASTAbyIDs(input_fasta, record_ids=hmmer_hits.id.values,
  File "/data/mcm/nfernandez/opt/envs/traits/lib/python3.9/site-packages/pandas/core/generic.py", line 5487, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'id'

A more informative error could be displayed instead.
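A minimal sketch of such a guard, placed before the DataFrame's id column is accessed (the names follow the traceback above, but the exact hook point is an assumption):

def check_hmmer_hits(hmmer_hits, hmm_file):
    """Exit with a clear message when hmmsearch returned no hits, instead of
    letting the empty DataFrame raise an AttributeError downstream."""
    if hmmer_hits.empty or "id" not in hmmer_hits.columns:
        raise SystemExit(
            f"hmmsearch returned no hits for {hmm_file}; "
            "check the HMM model and the input FASTA."
        )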
