robaina / pynteny

Query sequence database by HMMs arranged in predefined synteny structure

Home Page: https://robaina.github.io/Pynteny

License: Apache License 2.0

Python 88.03% TeX 10.91% Dockerfile 1.06%
bioinformatics genomics hmmer metagenomics prokaryotic-genomes synteny hmm python synteny-block computational-biology

pynteny's Introduction

Synteny-aware hmm searches made easy

Project Status: Active. The project has reached a stable, usable state and is being actively developed.

1. 💡 What is Pynteny?

Pynteny is a Python tool to search for synteny blocks in (prokaryotic) sequence data through HMMs of the ORFs of interest and HMMER. By leveraging genomic context information, Pynteny can be employed to decrease the uncertainty of functional annotation of unlabelled sequence data due to the effect of paralogs. Pynteny can be accessed (i) through the command line or (ii) as a Python module.

Get more info in the documentation pages!

Check out the Pynteny paper in the Journal of Open Source Software!

2. 🔧 Setup

Install with conda:

  1. Pynteny requires Python 3.10. The easiest way to handle dependencies is by creating a dedicated conda environment:
conda create -n pynteny -c bioconda -c conda-forge python=3.10 pynteny
conda activate pynteny
  2. Check that the installation worked:
(pynteny) pynteny --help

2.1. Installing on Windows

Pynteny is designed to run on Linux machines. However, it can be installed within the Windows Subsystem for Linux via conda.

2.2. Installing on macOS with the latest ARM64 architecture

Pynteny doesn't currently support the ARM64 architecture of Apple silicon processors (e.g. MacBook M1 and M2). If that is your case, you can install Pynteny using the workaround below (based on this post):

CONDA_SUBDIR=osx-64 conda create -n pynteny_x86 python=3.10
conda activate pynteny_x86
conda config --env --set subdir osx-64
conda install -c bioconda pynteny

3. 🚀 Usage

Consider the following toy example of a syntenic block:

synteny example

Here, we are interested in four genes that colocate according to the pattern above: genes A-C occupy consecutive positions on the positive strand, followed by three (untargeted) genes and then by gene D, which is located on the negative strand.

Pynteny can be run either as a command line tool or as a Python module. To run pynteny in the command line, execute:

conda activate pynteny
pynteny <subcommand> <options>

There are a number of available subcommands, which can be explored in the documentation pages.

For instance, to first download the PGAP database, which contains a collection of profile HMMs as well as metadata:

pynteny download --outdir data/hmms --unpack

Next, to build a labelled peptide database from DNA assembly data:

pynteny build \
    --data assembly.fa \
    --outfile labelled_peptides.faa

Finally, to search the peptide database for the syntenic structure displayed above, >gene_A 0 >gene_B 0 >gene_C 3 <gene_D, using the downloaded PGAP database:

pynteny search \
    --synteny_struc ">gene_A 0 >gene_B 0 >gene_C 3 <gene_D" \
    --data labelled_peptides.faa \
    --outdir results/ \
    --gene_ids
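
To make the notation concrete, here is a minimal Python sketch that splits such a synteny string into its parts. It assumes, consistent with the toy example above, that `>` and `<` mark the positive and negative strand and that each integer is the maximum number of untargeted genes allowed between two neighbouring targets. `parse_synteny_structure` is a hypothetical helper written for illustration, not part of Pynteny's API, and this sketch requires an explicit strand symbol on every target.

```python
def parse_synteny_structure(structure: str) -> dict:
    """Split a synteny string into gene names, strands and maximum gaps."""
    tokens = structure.split()
    genes, strands, max_gaps = [], [], []
    for i, token in enumerate(tokens):
        if i % 2 == 0:
            # Even positions hold targets such as ">gene_A" or "<gene_D"
            if token[0] not in "<>":
                raise ValueError(f"Missing strand symbol in {token!r}")
            strands.append("pos" if token[0] == ">" else "neg")
            genes.append(token[1:])
        else:
            # Odd positions hold the maximum number of genes in between
            max_gaps.append(int(token))
    if len(genes) != len(max_gaps) + 1:
        raise ValueError("Structure must alternate targets and distances")
    return {"genes": genes, "strands": strands, "max_gaps": max_gaps}


parsed = parse_synteny_structure(">gene_A 0 >gene_B 0 >gene_C 3 <gene_D")
print(parsed["genes"])     # ['gene_A', 'gene_B', 'gene_C', 'gene_D']
print(parsed["strands"])   # ['pos', 'pos', 'pos', 'neg']
print(parsed["max_gaps"])  # [0, 0, 3]
```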

4. 📔 Examples

Here are some Jupyter Notebooks with examples to show how Pynteny works:

You can find more notebooks in the examples directory. Find more info in the documentation.

5. 🔄 Dependencies

Pynteny would not work without these awesome projects:

Thanks!

6. Contributing

Contributions are always welcome! If you don't know where to start, you may find an interesting issue to work on here. Please read our contribution guidelines first.

7. ✒️ Citation

If you use this software, please cite it as below:

Semidán Robaina Estévez. (2023). Pynteny: synteny-aware hmm searches made easy (Version 1.0.0). Zenodo. https://zenodo.org/record/7696204

pynteny's People

Contributors

batalex, robaina

pynteny's Issues

Enable usage of file name as genome ID in pynteny build

Actually, this is still an issue when using a directory of fasta/gbk files to build the peptide database: the genome ID will coincide with the contig ID, so the genome of origin may be lost if this information is not within the contig name.

We could add an option to use the file name as a genome ID in pynteny build. This would solve the issue. The user would have to ensure that file names would carry genome ID info.
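
As a rough illustration of the proposal, the sketch below derives a genome ID from the input file name and prefixes it to a record label. The function names and the `__` separator are assumptions made for this example; they do not describe how pynteny build is implemented.

```python
from pathlib import Path


def genome_id_from_file(path: str) -> str:
    """Use the file name, without directory or final extension, as genome ID."""
    # Path.stem drops only the last suffix, so "GCF_000005845.2.fasta"
    # keeps its version number: "GCF_000005845.2"
    return Path(path).stem


def add_genome_id(label: str, path: str, sep: str = "__") -> str:
    """Prefix a record label with the genome ID taken from the file name."""
    return f"{genome_id_from_file(path)}{sep}{label}"


print(add_genome_id("contig_12", "genomes/MG1655.gbk"))  # MG1655__contig_12
```

With this approach the user remains responsible for giving each file a name that carries the genome ID.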

Update pynteny in bioconda and check installation

There have been reports of conda failing to solve dependencies when installing from bioconda. Conda also seems to fail at solving dependencies on macOS.

Check if updating pynteny in bioconda solves the issue with the dependencies. Further restricting conda package versions in the environment.yml may also help resolve dependencies (perhaps using conda lock)

Create a devcontainer for codespaces?

GitHub codespaces allow specifying a development container (with required dependencies installed) through a metadata file within a special directory (.devcontainer) in the root. This feature is useful to set up homogeneous development environments very easily (even directly in the browser). Would be a nice addition.

PyOpenSci - REVIEW: enhance documentation and contributing.md

Addressing:

Documentation

  • I recommend adding "Installation" & "Getting Started" sections in the documentation. The rationale is as follows: any part of the project may be a potential user's first contact. This includes the README file, the package description on Pypi (not applicable here) / a conda channel, and the documentation.

Agreed!

  • I would advise adding instructions on how to locally build the documentation in CONTRIBUTING.md, in the section "3. Improving the documentation".

Yes, that would be necessary. Will do.

  • The following mention is incorrect in the example notebook:

To follow this example, you don't need to download _E. coli's_ genome, since it has been already downloaded during Pynteny's installation.

The MG1655.gb file is stored in the tests folder, which is not included in the package. Therefore, when the package is installed using conda, it is not downloaded.

Deal with genome ID when building peptide database from genbank files

Currently, pynteny build displays contig IDs in the labels of the peptide database when input files are in GenBank format. However, it is useful to know the accession ID of the genome from which the contigs came, particularly when we start from already annotated datasets (not the typical use case of Pynteny, which is unannotated assembly data).

Since the genome ID is not always included within the GenBank file, the only option I see right now is to allow adding a tag with the genome ID to each peptide sequence. These tags would be provided by the user (one per GenBank file) and inherited by all contigs coming from the same genome / GenBank file.

PyOpenSci REVIEW: pynteny.download and config default paths

Addressing reviewer comments:

CLI

  • The default paths are quite inconvenient: if I run the CLI from the installed package rather than the source directory, the config file as well as the downloaded database end up in /home/<user>/miniconda3/envs/Pynteny/lib/python3.10/site-packages/

Right, that's inconvenient. What about adding the option to choose the directory where to download the database / write the config file? I think this may fix the issue.

pynteny/subcommands.py

  • wget is quite old and does not seem to be active. How about replacing it with a more maintained alternative? e.g. httpx, requests

Will try with requests.

  • As I said in the general comments, I would prefer if the default dirs were not relative to the package files. When installing pynteny inside a virtual env, this means that the database could be downloaded inside the venv, in a totally different place from the current working dir. I suggest using ~/.pynteny
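
One possible way to combine both suggestions (a user-chosen directory and a default under ~/.pynteny) is sketched below. The environment variable `PYNTENY_DATA_DIR` and both function names are hypothetical, introduced only for this example.

```python
import os
from pathlib import Path
from typing import Optional


def default_data_dir(env_var: str = "PYNTENY_DATA_DIR") -> Path:
    """Return a per-user data directory, independent of the install location."""
    # Hypothetical override: an environment variable takes precedence
    override = os.environ.get(env_var)
    if override:
        return Path(override).expanduser()
    return Path.home() / ".pynteny"


def resolve_outdir(outdir: Optional[str] = None) -> Path:
    """Prefer an explicit directory, else fall back to the per-user default."""
    path = Path(outdir).expanduser() if outdir else default_data_dir()
    path.mkdir(parents=True, exist_ok=True)
    return path
```

Because the default no longer depends on the package location, the database and config file would land in the same place whether Pynteny runs from the source tree, a conda environment, or a virtual env.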

pynteny/utils.py

Double check peptide labels in database

There may be an issue with labels containing underscores since pynteny currently uses underscores as delimiters in labels (contig, gene position, etc). We could change the delimiter or store contig and positional information of each label in a dictionary. While storing this info outside the label seems like a good option, there may be problems tracking labels down after the search is completed.
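
A third option is to keep the underscore delimiter but parse labels from the right, so that underscores inside the contig name are harmless. The sketch below assumes a hypothetical label layout `<contig>_<gene position>_<start>_<end>_<strand>`; the actual layout used by Pynteny may differ.

```python
from typing import NamedTuple


class GeneLabel(NamedTuple):
    contig: str
    gene_pos: int
    start: int
    end: int
    strand: str


def parse_label(label: str) -> GeneLabel:
    """Parse '<contig>_<gene_pos>_<start>_<end>_<strand>' from the right."""
    # Only the last four underscores act as delimiters; any underscore
    # further to the left is kept as part of the contig name
    contig, gene_pos, start, end, strand = label.rsplit("_", 4)
    return GeneLabel(contig, int(gene_pos), int(start), int(end), strand)


label = parse_label("NZ_CP0123_scaffold_7_42_1050_2200_neg")
print(label.contig)    # NZ_CP0123_scaffold_7
print(label.gene_pos)  # 42
```

This only works while the positional fields themselves never contain underscores, which is why storing them outside the label remains worth considering.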

Add minimal graphical example of a syntenic block

Some people may not know what synteny is. A graphical representation of a syntenic block in the README / example file could help visualize what synteny is (and so what the benefit of using Pynteny is). The sox operon could work.

Create online documentation

This issue ports Pynteny's wiki pages to an online documentation service: currently considering readthedocs.

Download and prepare PFAM-A database

Is your feature request related to a problem? Please describe.
Add support to download the PFAM-A database, besides the PGAP database, to be able to use PFAM models alongside TIGRFAM.

Describe the solution you'd like
Add support to download the PFAM database and automatically split the original multi-HMM file into separate files (as required by pynteny search). This can be achieved in bash like this:

#!/bin/bash
# Input hmm file path as a param to this script

csplit --digits=2  --quiet --prefix=hmm $1 "////+1" "{*}"

while read -r id filename
do
    mv "$filename" "$id".hmm
done < <(awk '$1 == "ACC" {print $2,FILENAME; nextfile}' hmm*)

Ideally, however, this would be integrated into a dedicated Python function.
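
A Python equivalent of the bash snippet above could look like the sketch below. It assumes HMMER3 text format, in which each profile ends with a `//` line and carries its accession on an `ACC` line. `split_hmms` is a hypothetical name, not an existing Pynteny function.

```python
from pathlib import Path
from typing import List


def split_hmms(hmm_file: str, outdir: str) -> List[Path]:
    """Write each record of a multi-HMM file to '<accession>.hmm'."""
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    written, record, accession = [], [], None
    with open(hmm_file, encoding="utf-8") as handle:
        for line in handle:
            record.append(line)
            fields = line.split()
            if len(fields) > 1 and fields[0] == "ACC":
                # Same test as the awk filter: first field equals "ACC"
                accession = fields[1]
            elif line.strip() == "//":
                # "//" closes the current record
                if accession is None:
                    raise ValueError("HMM record without an ACC line")
                target = out / f"{accession}.hmm"
                target.write_text("".join(record), encoding="utf-8")
                written.append(target)
                record, accession = [], None
    return written
```

Unlike the csplit approach, this reads the file once and never creates intermediate `hmmNN` files that need renaming.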

PyOpenSci: Some more minor changes

Addressing a few more minor changes.

  • There are a few discrepancies in types, some of them could be fixed by converting argparse.ArgumentParser instances to CommandArgs before using them

  • My personal recommendation is to define the text encoding whenever a text file descriptor is opened. Even if this is not feasible at the moment, this is a tiny step toward Windows compatibility, and it removes a little bit of uncertainty.

  • The default paths are quite inconvenient: if I run the CLI from the installed package rather than the source directory, the config file as well as the downloaded database end up in /home/<user>/miniconda3/envs/Pynteny/lib/python3.10/site-packages/

  • The following mention is incorrect in the example notebook:

To follow this example, you don't need to download _E. coli's_ genome, since it has been already downloaded during Pynteny's installation.

The MG1655.gb file is stored in the tests folder, which is not included in the package. Therefore, when the package is installed using conda, it is not downloaded.

  • Since the author is using pathlib, they might use / or Path.joinpath instead of os.path.join
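
The two stylistic points about text encoding and pathlib can be illustrated together. The snippet below is a generic sketch; `write_config` and the file name `config.json` are hypothetical and not taken from Pynteny's code.

```python
import tempfile
from pathlib import Path


def write_config(config_dir: Path, text: str) -> Path:
    """Write a text file with an explicit encoding, using pathlib joins."""
    # The "/" operator replaces os.path.join(config_dir, "config.json")
    config_file = config_dir / "config.json"
    # An explicit encoding avoids depending on the platform default
    with open(config_file, "w", encoding="utf-8") as handle:
        handle.write(text)
    return config_file


with tempfile.TemporaryDirectory() as tmp:
    path = write_config(Path(tmp), '{"author": "Estévez"}')
    print(path.read_text(encoding="utf-8"))  # {"author": "Estévez"}
```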

Add meta info to hits when provided

Add metadata (gene symbol, EC number, product, etc.) to sequence hits when an HMM metadata file is available (default behavior if using the PGAP database).

Change docstring format to Google format

Currently, docstrings are not following any particular format. However, docstrings should follow a standard format to allow documentation builders to parse them. Both mkdocs and sphinx support the Google docstring format. Since this format is also quite readable, I think it makes sense to apply it to Pynteny's docstrings.
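
For reference, a Google-style docstring has the shape shown below. The function itself is a made-up example and is not part of Pynteny.

```python
from typing import Iterable


def count_long_labels(labels: Iterable[str], min_length: int = 1) -> int:
    """Count the labels that are at least `min_length` characters long.

    Args:
        labels: Sequence labels to inspect.
        min_length: Minimum number of characters a label must have.

    Returns:
        The number of labels that satisfy the length threshold.

    Raises:
        ValueError: If `min_length` is negative.
    """
    if min_length < 0:
        raise ValueError("min_length must be non-negative")
    return sum(1 for label in labels if len(label) >= min_length)
```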

Incomplete documentation for --unordered

The parameter "--unordered" of pynteny search currently takes the largest maximum gene distance in the provided synteny structure. This is not stated in the documentation.
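
To illustrate what "largest maximum gene distance" means here, the sketch below extracts that value from a synteny string. It only mirrors the behaviour described in this issue and is not Pynteny's code; `largest_max_distance` is a hypothetical helper.

```python
def largest_max_distance(synteny_structure: str) -> int:
    """Return the largest inter-gene distance listed in a synteny string."""
    # Distances are the standalone integer tokens between gene targets
    distances = [
        int(token) for token in synteny_structure.split() if token.isdigit()
    ]
    if not distances:
        raise ValueError("No distances found in synteny structure")
    return max(distances)


print(largest_max_distance(">gene_A 0 >gene_B 0 >gene_C 3 <gene_D"))  # 3
```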

Restructure Project

This issue restructures Pynteny's project. Specifically:

  1. directory "tests" has been moved outside of the package to locate it at the root of the directory. As a consequence, Pynteny tests subcommand is also removed since tests are no longer installed with the package.

  2. source files are now directly located within directory "pynteny" and pynteny/src has been removed.

Making Pynteny pip-installable

So far, Pynteny depends on bioconda packages and has been distributed as a conda package. However, there are pip-installable alternatives to all dependencies. Particularly, PyHMMER and Pyrodigal could be used instead of the native HMMER and Prodigal (which are not pip-installable).

It is worth exploring those two packages to see how well they would integrate into Pynteny, to make Pynteny also pip-installable.

PyOpenSci REVIEW - Extract / Refactor nested functions

Reviewer suggested extracting all nested functions found in the codebase as this would simplify the code. I partially agree with this statement. Some nested functions are justified in my opinion, since these are small, helper functions that are only required by the "parent" function.

I propose extracting some of the nested functions and refactoring others, particularly the ones found in LabelledFASTA.from_genebank.
