robaina / pynteny

Query sequence database by HMMs arranged in predefined synteny structure

Home Page: https://robaina.github.io/Pynteny

License: Apache License 2.0

Python 88.03% TeX 10.91% Dockerfile 1.06%

bioinformatics genomics hmmer metagenomics prokaryotic-genomes synteny hmm python synteny-block computational-biology

pynteny's Introduction

Synteny-aware hmm searches made easy

1. 💡 What is Pynteny?

Pynteny is a Python tool to search for synteny blocks in (prokaryotic) sequence data through HMMs of the ORFs of interest and HMMER. By leveraging genomic context information, Pynteny can be employed to decrease the uncertainty of functional annotation of unlabelled sequence data due to the effect of paralogs. Pynteny can be accessed (i) through the command line or (ii) as a Python module.

Get more info in the documentation pages!

Check out the Pynteny paper in the Journal of Open Source Software!

2. 🔧 Setup

Install with conda:

Pynteny requires Python 3.10. The easiest way to handle dependencies is by creating a dedicated conda environment:

conda create -n pynteny -c bioconda -c conda-forge python=3.10 pynteny
conda activate pynteny

Check that installation worked fine:

(pynteny) pynteny --help

2.1. Installing on Windows

Pynteny is designed to run on Linux machines. However, it can be installed within the Windows Subsystem for Linux via conda.

2.2. Installing on MacOS with the latest ARM64 architecture

Pynteny doesn't currently support the ARM64 architecture of Apple silicon processors (e.g. MacBook M1 and M2). If you are on such a machine, you can install Pynteny using the workaround below (based on this post):

CONDA_SUBDIR=osx-64 conda create -n pynteny_x86 python=3.10
conda activate pynteny_x86
conda config --env --set subdir osx-64
conda install -c bioconda pynteny

3. 🚀 Usage

Consider the following toy example of a syntenic block:

Here, we are interested in four genes which colocate according to the pattern above: genes A-C show consecutive locations in the positive strand, followed by three (untargeted) genes and followed by gene D, which is located in the negative strand.

Pynteny can be run either as a command line tool or as a Python module. To run pynteny in the command line, execute:

conda activate pynteny
pynteny <subcommand> <options>

There are a number of available subcommands, which can be explored in the documentation pages.

For instance, to first download the PGAP database, which contains a collection of profile HMMs as well as metadata:

pynteny download --outdir data/hmms --unpack

Next, to build a labelled peptide database from DNA assembly data:

pynteny build \
    --data assembly.fa \
    --outfile labelled_peptides.faa

Finally, to search the peptide database for the syntenic structure displayed above: >gene_A 0 >gene_B 0 >gene_C 3 <gene_D, and using the downloaded PGAP database:

pynteny search \
    --synteny_struc ">gene_A 0 >gene_B 0 >gene_C 3 <gene_D" \
    --data labelled_peptides.faa \
    --outdir results/ \
    --gene_ids
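
To make the structure string easier to read, here is a minimal Python sketch that splits it into strand, gene name and the number given before each gene. It is not Pynteny's own parser: the class `SyntenyElement`, the function `parse_synteny_structure` and the strand names `pos`/`neg`/`any` are hypothetical, and the reading of the number as the maximum count of untargeted genes between two targets is an assumption based on the toy example above.

```python
from dataclasses import dataclass


@dataclass
class SyntenyElement:
    name: str     # gene / HMM name, e.g. "gene_A"
    strand: str   # "pos" for ">", "neg" for "<", "any" when no symbol is given
    max_gap: int  # number that precedes this gene in the string (0 for the first)


def parse_synteny_structure(structure: str) -> list[SyntenyElement]:
    """Split a structure string into one element per target gene."""
    elements = []
    gap = 0  # the first gene has no preceding gap
    for token in structure.split():
        if token.isdigit():
            # A bare number applies to the gene that follows it
            gap = int(token)
            continue
        if token.startswith(">"):
            strand, name = "pos", token[1:]
        elif token.startswith("<"):
            strand, name = "neg", token[1:]
        else:
            strand, name = "any", token
        elements.append(SyntenyElement(name, strand, gap))
        gap = 0
    return elements


parsed = parse_synteny_structure(">gene_A 0 >gene_B 0 >gene_C 3 <gene_D")
for element in parsed:
    print(element)
```

In this reading, gene_D is the only element on the negative strand and the only one with a non-zero gap.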

4. 📔 Examples

Here are some Jupyter Notebooks with examples to show how Pynteny works:

You can find more notebooks in the examples directory. Find more info in the documentation.

5. 🔄 Dependencies

Pynteny would not work without these awesome projects:

Thanks!

6. Contributing

Contributions are always welcome! If you don't know where to start, you may find an interesting issue to work on here. Please read our contribution guidelines first.

7. ✒️ Citation

If you use this software, please cite it as below:

Semidán Robaina Estévez. (2023). Pynteny: synteny-aware hmm searches made easy (Version 1.0.0). Zenodo. https://zenodo.org/record/7696204

pynteny's Issues

Download PFAM database and split models into separate files

PFAM database as an alternative to PGAP. The problem is that all HMMs are merged into a single file in this database. They have to be split into different files first: https://stackoverflow.com/questions/70789571/how-to-split-hmm-databse-pfam-a-hmm-into-individual-files

Here is the link to PFAM database

Check HMM name retrieval from synteny structure

It seems like pynteny is taking the HMM name from the matched file name instead of the HMM name provided by the user. This causes discrepancies when assigning codes to HMMs

Enable usage of file name as genome ID in pynteny build

Actually, this is still an issue when using a directory of fasta/gbk files to build the peptide database: the genome ID will coincide with the contig, so the genome of origin may be lost if this info is not within the contig name.

We could add an option to use the file name as a genome ID in pynteny build. This would solve the issue. The user would have to ensure that file names would carry genome ID info.
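
A minimal sketch of the idea, assuming plain FASTA input. The function name `prefix_with_genome_id` and the `__` separator are hypothetical and not part of Pynteny; the file stem stands in for the genome ID.

```python
from pathlib import Path


def prefix_with_genome_id(fasta_file: Path, sep: str = "__") -> list[str]:
    """Return the file's lines with the file stem prepended to every header."""
    genome_id = fasta_file.stem  # e.g. "MG1655" for "MG1655.fasta"
    lines = []
    with open(fasta_file, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                # Header line: keep the original contig name after the genome ID
                line = f">{genome_id}{sep}{line[1:]}"
            lines.append(line)
    return lines
```

With this approach, contigs from different files stay distinguishable even when their names collide, as long as the file names are unique.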

PyOpenSci REVIEW - Static classes

Convert static classes to modules

Build.getTestDataPath() no longer works after relocating tests outside package

We could include minimal example data files within the package to recover the previous functionality

Provide meaningful use case

Search for a meaningful use case to show how pynteny works

PyOpenSci REVIEW - Minor changes

Update pynteny in bioconda and check installation

There have been reports of conda failing to solve dependencies when installing from bioconda. Conda also seems to fail at solving dependencies in MacOS.

Check if updating pynteny in bioconda solves the issue with the dependencies. Further restricting conda package versions in the environment.yml may also help resolve dependencies (perhaps using conda lock)

Removing config file upon package uninstall

Config file is generated after install within the package directory. Hence, it is not targeted for removal during uninstall. Following this suggestion, we could create an empty config file that gets installed and then overwrite it.

Create a devcontainer for codespaces?

GitHub codespaces allow specifying a development container (with required dependencies installed) through a metadata file within a special directory (.devcontainer) in the root. This feature is useful to set up homogeneous development environments very easily (even directly in the browser). Would be a nice addition.

File manager not showing up correctly in Firefox

It seems that CSS selectors are not working in Firefox as they do in Chrome

Add codecov to github action

PyOpenSci - REVIEW: enhance documentation and contributing.md

Addressing:

Documentation

I recommend adding "Installation" & "Getting Started" sections in the documentation. The rationale is as follows: any part of the project may be a potential user's first contact. This includes the README file, the package description on Pypi (not applicable here) / a conda channel, and the documentation.

Agreed!

I would advise adding instructions on how to locally build the documentation in CONTRIBUTING.md in the section

Yes, that would be necessary. Will do.

3. Improving the documentation
The following mention is incorrect in the example notebook:

To follow this example, you don't need to download _E. coli's_ genome, since it has been already downloaded during Pynteny's installation.

The MG1655.gb file is stored in the tests folder, which is not included in the package. Therefore, when the package is installed using conda, it is not downloaded.

Ensure file inputs are objects of class Path

For instance in class FASTA and LabelledFASTA but also throughout the code.

Save log to text file

Implement database build from already generated peptide database

This is a potentially useful feature in cases where the user has already generated a peptide database but wants to be sure record labels are formatted correctly for Pynteny search.

Deal with genome ID when building peptide database from genbank files

Currently, Pynteny build displays contig IDs in the labels of the peptide database when input files are in GenBank format. However, it is useful to know the accession ID of the genome from which the contigs came, particularly in cases where we start from already annotated datasets (not the typical use case of Pynteny, which is unannotated assembly data).

Since the genome ID is not always included within the GenBank file, the only option I see right now is to allow adding a tag with the genome ID to each peptide sequence. These tags would be provided by the user (one per GenBank file) and inherited by all contigs coming from the same genome / GenBank file.

Check docstrings

Add missing docstrings and check format

Pynteny search results table: sorting when "--unordered"

Entries in Pynteny's main results table are sorted by gene number within each contig. However, this sorting doesn't make much sense when "--unordered" is set.

Python version lower than 3.10?

Currently, Pynteny requires Python >=3.10, could we make it at least >= 3.8?

PyOpenSci REVIEW: pynteny.download and config default paths

Addressing reviewer comments:

CLI

The default paths are quite inconvenient: if I run the CLI from the installed package rather than the source directory, the config file as well as the downloaded database end up in /home/<user>/miniconda3/envs/Pynteny/lib/python3.10/site-packages/

Right, that's inconvenient. What about adding the option to choose the directory where to download the database / write the config file? I think this may fix the issue.

pynteny/subcommands.py

wget is quite old and does not seem to be active. How about replacing it with a more maintained alternative? e.g. httpx, requests

Will try with requests.

As I said in the general comments, I would prefer if the default dirs were not relative to the package files. When installing pynteny inside a virtual env, this means that the database could be downloaded inside the venv, in a totally different place from the current working dir. I suggest using ~/.pynteny
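
One way to address this could look like the sketch below. It only illustrates the suggestion: the function name `get_config_path` and the file name `config.json` are hypothetical, and the fallback to `~/.pynteny` follows the reviewer's proposal rather than any confirmed Pynteny behavior.

```python
from pathlib import Path
from typing import Optional


def get_config_path(config_dir: Optional[Path] = None) -> Path:
    """Return the config file path inside a user-chosen or default directory."""
    # Assumption: fall back to ~/.pynteny when no directory is given
    base = Path(config_dir) if config_dir is not None else Path.home() / ".pynteny"
    base.mkdir(parents=True, exist_ok=True)
    return base / "config.json"  # hypothetical file name
```

Because the location no longer depends on where the package is installed, the config file stays outside site-packages and is not tied to a particular environment.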

pynteny/utils.py

As I said in the comments above, using the file location is quite inconvenient because when installed by conda it ends up in the venv directory

Is there a reason for not using https://docs.python.org/3/library/tarfile.html?highlight=tar#tarfile.is_tarfile?
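
For reference, a check based on the standard library could be sketched as follows. The helper name `extract_if_tar` is hypothetical; `tarfile.is_tarfile` and `tarfile.open` are the standard-library calls the reviewer points to.

```python
import tarfile
from pathlib import Path


def extract_if_tar(archive: Path, dest: Path) -> bool:
    """Extract 'archive' into 'dest' if it is a tar file; report whether it was."""
    if not tarfile.is_tarfile(archive):
        return False
    dest.mkdir(parents=True, exist_ok=True)
    # tarfile.open detects the compression (gz, bz2, xz) automatically
    with tarfile.open(archive) as tar:
        tar.extractall(dest)
    return True
```

Note that `tarfile.is_tarfile` raises if the file does not exist, so callers may still want to check for that case first.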

Prepare GitHub Action for CI

Would be useful to set an Action to build and test Pynteny upon pushes. Actions servers already have miniconda installed, so it should be possible to set a conda environment with dependencies and then install pynteny: https://autobencoder.com/2020-08-24-conda-actions/

Allow initializing preprocess with multiple GenBank/FASTA files

Enable creating position-labelled database from multiple GenBank files or multiple fasta files

Add option to match hmm hits on the same strand only (either one of the two)

Either a symbol in synteny_struct or a flag in command. Currently, if no constraints are set in strand location, results may include hits in opposite strands.

Note that in this case, if on the negative strand, hmms may be found in reverse order

Deal with cases where hmms in hmm groups don't map to the same sequences

Remove missing (from PGAP directory) entries from PGAP meta data upon download

GitHub Action "test" suddenly failing for linux

Error when building conda environment with mamba

Double check peptide labels in database

There may be an issue with labels containing underscores since pynteny currently uses underscores as delimiters in labels (contig, gene position, etc). We could change the delimiter or store contig and positional information of each label in a dictionary. While storing this info outside the label seems like a good option, there may be problems tracking labels down after the search is completed.
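
The problem can be illustrated with a small sketch. The label layout used here (`<contig>_<gene number>_<strand>`) is a simplified, hypothetical format rather than Pynteny's actual one, and `parse_label` is an illustrative helper.

```python
def parse_label(label: str) -> dict:
    """Parse '<contig>_<gene number>_<strand>' even if the contig has underscores."""
    # Split from the right so underscores inside the contig name are preserved
    contig, gene_number, strand = label.rsplit("_", maxsplit=2)
    return {"contig": contig, "gene_number": int(gene_number), "strand": strand}


label = "NZ_CP012345_17_pos"

# A naive split yields four fields instead of three for this label
naive_fields = label.split("_")
print(len(naive_fields))

# Splitting from the right recovers the intended fields
print(parse_label(label))
```

Splitting from the right only helps while the trailing fields themselves never contain underscores; a less common delimiter or a separate lookup table avoids that restriction.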

Change repo structure to avoid issues with testing

Adopting /src layout as recommended here to avoid testing with repo modules instead of the installed package.

Use python logger instead of print statements

Refactor print statements and employ python's logging module instead
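
A minimal sketch of what this could look like with the standard `logging` module. The function name `init_logger`, the logger name and the message format are assumptions, and the optional log file is an illustrative extra.

```python
import logging
import sys
from pathlib import Path
from typing import Optional


def init_logger(name: str = "pynteny", logfile: Optional[Path] = None) -> logging.Logger:
    """Create a logger that writes to stdout and, optionally, to a text file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicated messages on repeated calls
    formatter = logging.Formatter("%(asctime)s | %(levelname)s: %(message)s")

    stream_handler = logging.StreamHandler(sys.stdout)
    stream_handler.setFormatter(formatter)
    logger.addHandler(stream_handler)

    if logfile is not None:
        file_handler = logging.FileHandler(logfile, encoding="utf-8")
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)
    return logger
```

A call such as `init_logger().info("Searching database")` would then replace the corresponding `print` statement.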

Report of multi-threading in pynteny build not working

Add minimal graphical example of a syntenic block

Some people may not know what synteny is. A graphical representation in README / example file of syntenic block could help visualize what synteny is (and so what is the benefit of using Pynteny). The sox operon could work.

Update docs

Create online documentation

This issue ports Pynteny's wiki pages to an online documentation service: currently considering readthedocs.

Download and prepare PFAM-A database

Is your feature request related to a problem? Please describe.
Add support to download the PFAM-A database, besides the PGAP database, to be able to use PFAM models alongside TIGRFAM.

Describe the solution you'd like
Add support to download PFAM database, also automatically split original multi-HMM file into separate files (as required by pynteny search). This can be achieved in bash like this:

#!/bin/bash
# Input hmm file path as a param to this script

# Split after every record terminator line ("//"); slashes must be escaped in the pattern
csplit --digits=2 --quiet --prefix=hmm "$1" "/\/\//+1" "{*}"

while read -r id filename
do
    mv "$filename" "$id".hmm
done < <(awk '$1 == "ACC" {print $2,FILENAME; nextfile}' hmm*)

Ideally, though, this would be integrated into a dedicated Python function.
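
A Python counterpart of the bash approach could be sketched as follows. This is only an illustration: the function name `split_hmm_database` is hypothetical, and it assumes the HMMER3 text format in which each record ends with a `//` line and carries an `ACC` line holding its accession.

```python
from pathlib import Path


def split_hmm_database(hmm_file: Path, outdir: Path) -> list[Path]:
    """Write each record of a multi-HMM file to '<accession>.hmm' in 'outdir'."""
    outdir.mkdir(parents=True, exist_ok=True)
    written = []
    record, accession = [], None
    with open(hmm_file, encoding="utf-8") as handle:
        for line in handle:
            record.append(line)
            fields = line.split(maxsplit=1)
            if fields and fields[0] == "ACC":
                accession = fields[1].strip()
            elif line.strip() == "//":  # end of the current record
                if accession is not None:
                    outfile = outdir / f"{accession}.hmm"
                    outfile.write_text("".join(record), encoding="utf-8")
                    written.append(outfile)
                # Records without an ACC line are skipped in this sketch
                record, accession = [], None
    return written
```

Like the bash version, this names each output file after the accession found in the record.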

Optional prepend file name when multiple input files

Add code snippets from examples in README.md

Upload Streamlit-Aggrid to bioconda or remove dependency

Streamlit-Aggrid hosted in a private conda channel. Should be installable from a public channel. Either upload to bioconda (contact developer) or remove it (static results table in web application)

Add search optional parameter to reuse hmmer search result tables

Currently, the tables are reused by default. Add a flag to make this behavior optional.

PyOpenSci: Some more minor changes

Addressing a few more minor changes.

There are a few discrepancies in types, some of them could be fixed by converting argparse.ArgumentParser instances to CommandArgs before using them
My personal recommendation is to define the text encoding whenever a text file descriptor is opened. Even if this is not feasible at the moment, this is a tiny step toward Windows compatibility, and it removes a little bit of uncertainty.
The default paths are quite inconvenient: if I run the CLI from the installed package rather than the source directory, the config file as well as the downloaded database end up in /home/<user>/miniconda3/envs/Pynteny/lib/python3.10/site-packages/
The following mention is incorrect in the example notebook:

To follow this example, you don't need to download _E. coli's_ genome, since it has been already downloaded during Pynteny's installation.

The MG1655.gb file is stored in the tests folder, which is not included in the package. Therefore, when the package is installed using conda, it is not downloaded.

Since the author is using pathlib, they might use / or Path.joinpath instead of os.path.join
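
Two of the suggestions above, explicit text encoding and `pathlib` path joining, can be sketched together. The helper name `write_and_read` is hypothetical and only serves the illustration.

```python
from pathlib import Path


def write_and_read(outdir: Path, name: str, text: str) -> str:
    """Write 'text' to outdir/name and read it back with an explicit encoding."""
    outfile = outdir / name  # pathlib's "/" instead of os.path.join(outdir, name)
    with open(outfile, "w", encoding="utf-8") as handle:
        handle.write(text)
    with open(outfile, encoding="utf-8") as handle:
        return handle.read()
```

Passing `encoding="utf-8"` removes the dependency on the platform's default encoding, which differs between Linux and Windows.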

Add meta info to hits when provided

Add metadata (gene symbol, EC number, product, etc) to sequence hits when hmm metadata file available (default behavior if using PGAP database)

Change docstring format to Google format

Currently, docstrings are not following any particular format. However, docstrings should follow a standard format to allow documentation builders to parse them. Both mkdocs and sphinx support the Google docstring format. Since this format is also quite readable, I think it makes sense to apply it to Pynteny's docstrings.
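
For illustration, a Google-style docstring looks like the sketch below. The function `count_hits` is a hypothetical example and does not exist in Pynteny.

```python
def count_hits(labels: list[str], contig: str) -> int:
    """Count peptide labels that start with a given contig prefix.

    Args:
        labels (list[str]): Record labels of the peptide database.
        contig (str): Contig name, expected at the start of each label
            and followed by an underscore.

    Returns:
        int: Number of labels that start with "<contig>_".
    """
    return sum(label.startswith(f"{contig}_") for label in labels)
```

The `Args` and `Returns` sections are the parts that documentation builders typically parse into parameter tables.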

Incomplete documentation for --unordered

The parameter "--unordered" of pynteny search currently takes the largest maximum gene distance in the provided synteny structure. This is not stated in the documentation

Remove default install dir of PGAP database, make --outdir param required in pynteny download

Restructure Project

This issue restructures Pynteny's project. Specifically:

directory "tests" has been moved outside of the package to locate it at the root of the directory. As a consequence, Pynteny tests subcommand is also removed since tests are no longer installed with the package.
source files are now directly located within directory "pynteny" and pynteny/src has been removed.

Final output hmm name tags may be misleading when employing alternative hmm names for same target

Double-check final output, some of the hmms in each hmm group may not have returned any hit yet they appear in results (HMM1 | HMM2)

Making Pynteny pip-installable

So far, Pynteny depends on bioconda packages and has been distributed as a conda package. However, there are pip-installable alternatives to all dependencies. Particularly, PyHMMER and Pyrodigal could be used instead of the native HMMER and Prodigal (which are not pip-installable).

It is worth exploring those two packages to see how well they would integrate into Pynteny, to make Pynteny also pip-installable.

Refactor code to meet PEP8

Following recommendations in pyOpenSci/software-submission#67 (comment)

Include prodigal translation in Database object

Include translation by prodigal as a method of the object Database, which should return an object of type LabelledFASTA

PyOpenSci REVIEW - Extract / Refactor nested functions

Reviewer suggested extracting all nested functions found in the codebase as this would simplify the code. I partially agree with this statement. Some nested functions are justified in my opinion, since these are small, helper functions that are only required by the "parent" function.

I propose extracting some of the nested functions and refactoring others, particularly the ones found in class LabelledFASTA.from_genebank