Detecting the structural, evolutionary and functional relationships between proteins and protein families requires comparative analysis of their amino acid sequences through multiple sequence alignment. While direct analysis of the resulting data makes it possible to identify conserved regions, it does not give a broad view of the proteins' hierarchical organization or of their distribution in sequence, structural or functional space. Our project aims to develop a novel strategy for visualizing and interpreting multiple sequence alignments using Poincaré maps, a method that has demonstrated its efficiency for hierarchy detection in multi-dimensional data sets.

PoincaréMSA

PoincaréMSA is a tool for protein family visualisation starting from a multiple sequence alignment (either provided by the user or built by a homology search for a target sequence). It is available as a set of interactive Google Colab notebooks, and the underlying algorithm is described in Susmelj et al. [1]. PoincaréMSA takes a multiple sequence alignment (MSA) as input and builds its projection onto a Poincaré disk using the method developed by Klimovskaia et al. [2]. For a detailed tutorial and contact information, please see: https://www.dsimb.inserm.fr/POINCARE_MSA

About

PoincareMSA builds an interactive projection of an input protein multiple sequence alignment (MSA) using a method developed by Susmelj et al. [1] based on Poincaré maps [2]. The projection reproduces both the local proximities of protein sequences and the hierarchy contained in the data. As a result, sequences located closer to the center of the projection correspond to proteins that share the most general functional properties and/or appeared at earlier stages of evolution.
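The role of the center follows from the geometry of the Poincaré disk, where distances grow rapidly towards the boundary. The sketch below computes the standard hyperbolic distance between two points of the unit disk. It is only an illustration of the geometry: the function name is hypothetical and is not taken from the PoincaréMSA code base.

```python
import math


def poincare_distance(u, v):
    """Hyperbolic distance between two 2D points strictly inside the unit disk."""
    sq_norm_u = u[0] ** 2 + u[1] ** 2
    sq_norm_v = v[0] ** 2 + v[1] ** 2
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("both points must lie strictly inside the unit disk")
    sq_diff = (u[0] - v[0]) ** 2 + (u[1] - v[1]) ** 2
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


if __name__ == "__main__":
    # The same Euclidean step of 0.1 is much longer near the boundary.
    print(poincare_distance((0.0, 0.0), (0.1, 0.0)))
    print(poincare_distance((0.8, 0.0), (0.9, 0.0)))
```

Because the same Euclidean step costs more hyperbolic distance near the edge, points near the center are comparatively close to everything else, which is consistent with their reading as the most general members of the family.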

Colab version

We provide three different Google Colab notebooks for interactive visualization of multiple sequence alignments:

  • PoincareMSA_colab.ipynb takes as input an MSA in .mfasta format provided by the user. The user can also provide an annotation in .csv format, which is used for coloring, as well as a list of UniProt IDs used to automatically fetch taxonomy information for coloring.
  • PoincareMSA_colab_examples.ipynb builds PoincareMSA projections from the example alignments available in the examples directory.
  • PoincareMSA_colab_MMseqs2.ipynb searches for homologs of a target sequence, filters the resulting alignment, and projects it with PoincaréMSA.

Version for local installation

To get a local copy of the software run:

git clone git@github.com:DSIMB/PoincareMSA.git
cd PoincareMSA

The program is implemented in Python 3.7, using the PyTorch library for Poincaré disk construction and Plotly for interactive visualisation of the resulting projections.

If you are working on Linux, you can use a conda environment to get all the necessary libraries:

conda env create -f env_poincare.yml
conda activate env_poincare

Otherwise, install the following dependencies:

pytorch 1.7.1
sklearn 0.24.1
numpy 1.19.2
pandas 1.2.3
scipy 1.6.0
seaborn 0.11.1
plotly 5.8.0
jax / jaxlib 0.3.25
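When installing the dependencies by hand, it can help to compare the installed versions against the list above. The following is a minimal sketch and not part of the repository. It assumes the usual PyPI distribution names (torch for pytorch, scikit-learn for sklearn) and assumes that both jax and jaxlib are pinned to 0.3.25. It relies on importlib.metadata, which requires Python 3.8 or later; on Python 3.7 the importlib_metadata backport offers the same interface.

```python
from importlib import metadata

# Versions copied from the dependency list; the distribution names are assumed.
REQUIRED = {
    "torch": "1.7.1",
    "scikit-learn": "0.24.1",
    "numpy": "1.19.2",
    "pandas": "1.2.3",
    "scipy": "1.6.0",
    "seaborn": "0.11.1",
    "plotly": "5.8.0",
    "jax": "0.3.25",
    "jaxlib": "0.3.25",
}


def find_mismatches(required):
    """Return {name: (wanted, installed)} for packages that are missing or differ."""
    mismatches = {}
    for name, wanted in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in find_mismatches(REQUIRED).items():
        print(f"{name}: wanted {wanted}, found {installed or 'not installed'}")
```

The comparison is an exact string match, so a build tag such as a CUDA suffix on the PyTorch version is reported as a mismatch even when the release number agrees.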

Python notebooks

The best way to try PoincaréMSA is to launch the Python notebooks with the provided examples. To launch a particular example, copy the corresponding Jupyter notebook to the repository root.

For example, to run PoincaréMSA on the kinase dataset, execute:

cp examples/kinases/PoincareMSA_kinases.ipynb ./
jupyter-notebook PoincareMSA_kinases.ipynb

The notebook can then be easily modified to work with any user-provided dataset.

Alternatively, the user can run the projection generation step by step, as described below.

Command line step-by-step version

Data preparation

The user is invited to provide their MSA in the classical .mfasta format. Each sequence of the alignment is translated into a profile using a position-specific scoring matrix (PSSM), following the pseudo-count algorithm of Henikoff & Henikoff. The related scripts are located in the scripts/prepare_data/ directory, and a driver script named create_projection.sh is provided for every example.
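To make the preparation step more concrete, here is a simplified sketch of what filtering an alignment by gap content and building a column profile can look like. It is not the code in scripts/prepare_data/: all function names are hypothetical, the treatment of the gap threshold as an inclusive upper bound is an assumption, and the profile uses plain additive pseudo-counts as a stand-in for the Henikoff & Henikoff scheme used by the tool.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
GAP_CHARS = "-."


def read_mfasta(text):
    """Parse aligned FASTA text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def filter_gapped_columns(sequences, max_gap_fraction=0.9):
    """Drop alignment columns whose fraction of gaps exceeds max_gap_fraction."""
    n_seqs, length = len(sequences), len(sequences[0])
    keep = [
        col for col in range(length)
        if sum(seq[col] in GAP_CHARS for seq in sequences) / n_seqs <= max_gap_fraction
    ]
    return ["".join(seq[col] for col in keep) for seq in sequences]


def column_frequencies(sequences, pseudocount=1.0):
    """Per-column amino acid frequencies with additive pseudo-counts (gaps ignored)."""
    profile = []
    for col in range(len(sequences[0])):
        counts = Counter(seq[col] for seq in sequences if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return profile


if __name__ == "__main__":
    records = read_mfasta(">seq1\nA-C\n>seq2\nA-D\n>seq3\nAG-\n")
    filtered = filter_gapped_columns([seq for _, seq in records], max_gap_fraction=0.5)
    print(filtered)
    print(column_frequencies(filtered)[0]["A"])
```

The sketch stops at column-level frequencies. How the tool turns them into one profile per sequence is not shown here and should be taken from the preparation scripts themselves.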

The resulting PSSM profiles, one per protein of the MSA, are stored in the directory fastas0.9, where 0.9 indicates the threshold on the fraction of gaps per position used to filter the initial alignment. To build a Poincaré disk from this data, run the following command from the scripts/build_poincare_map/ directory:

python main.py --input_path path_to_files/fastas0.9/ --output_path output_dir --knn 5 --gamma 2 --batchsize 4 --epochs 1000

which will create an output .csv file with the protein coordinates in the final projection and .png images reflecting the learning process. The .csv file can then be used to build an interactive visualisation.

Examples of use

We provide several examples of PoincareMSA usage for different protein families in the examples folder. Each example comes with a bash script that reproduces the results starting from the MSA and labels contained in data.

References

When using PoincaréMSA, please cite the following research:

[1] A. K. Susmelj, Y. Ren, Y. Vander Meersche, J.-C. Gelly, T. Galochkina. Poincaré maps for visualization of large protein families, Briefings in Bioinformatics, bbad103 (2023). https://doi.org/10.1093/bib/bbad103

The projection construction is adapted from the original code: https://github.com/facebookresearch/PoincareMaps developed for RNA sequence data visualization as described in the following paper:

[2] A. Klimovskaia, D. Lopez-Paz, L. Bottou et al. Poincaré maps for analyzing complex hierarchies in single-cell data. Nat Commun 11, 2966 (2020). https://doi.org/10.1038/s41467-020-16822-4

Contact

For scientific collaboration please contact Dr. Tatiana Galochkina at [email protected] and Dr. Jean-Christophe Gelly at [email protected].
