lbm-epfl / pesto

Geometric deep learning method to predict protein binding interfaces from a protein structure.

Home Page: https://pesto.epfl.ch

License: Other

Jupyter Notebook 89.40% Python 10.36% Shell 0.19% TeX 0.06%

deep-learning geometric-deep-learning protein-binding-site protein-protein-interface protein-structure pytorch

pesto's Introduction

PeSTo: parameter-free geometric deep learning for accurate prediction of protein binding interfaces

PeSTo (Protein Structure Transformer) is a parameter-free geometric deep learning method to predict protein interaction interfaces from a protein structure. It is available for free without registration as an online tool (pesto.epfl.ch).

Installation

Download the source code and examples by cloning the repository.

git clone https://github.com/LBM-EPFL/PeSTo.git
cd PeSTo

The primary requirements for PeSTo are GEMMI to parse PDB files and PyTorch for the deep learning framework. During training, h5py is used to store the processed data in an optimized format. The predicted interfaces can be visualized using PyMOL or ChimeraX.

Using Anaconda

All the specific dependencies are listed in pesto.yml and can be installed using Anaconda. Create and activate the environment with:

conda env create -f pesto.yml
conda activate pesto

Or install the dependencies manually:

conda create -n pesto python=3.9
conda activate pesto
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
conda install numpy scipy pandas matplotlib scikit-learn h5py tqdm
conda install gemmi tensorboard -c conda-forge

Using virtualenv

Alternatively, it is possible to create a local environment using virtualenv and install all the dependencies.

virtualenv pesto
source pesto/bin/activate
pip install -r requirements.txt

Application

A set of Jupyter notebooks and Python scripts is available to apply our trained model. Start a JupyterLab session with:

jupyter-lab

Interface predictions

The PeSTo model can be applied to PDB files using the apply_model.ipynb notebook. Specify the path to the folder containing the PDB files using the data_path variable. We recommend using the latest model (i_v4_1), but other pre-trained variants are available. The predictions can be run on CPU or GPU.
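As a minimal sketch of the folder-based input described above, the snippet below collects the PDB files found under a folder. The helper name find_pdb_files, the recursive search and the case-insensitive extension match are assumptions made for illustration; they are not taken from apply_model.ipynb. Only the data_path variable name comes from the notebook description.

```python
from pathlib import Path


def find_pdb_files(data_path):
    """Return a sorted list of .pdb files found under data_path (recursive).

    Hypothetical helper: the notebook may locate its input files differently.
    """
    root = Path(data_path)
    if not root.is_dir():
        raise NotADirectoryError(f"{data_path} is not a folder")
    # Assumption: accept both ".pdb" and ".PDB" file extensions.
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.suffix.lower() == ".pdb"
    )
```

Calling find_pdb_files(data_path) before running the model is a quick way to check that the folder actually contains the structures you expect.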

The predictions for the interfaces are stored in the b-factor field of the PDB files using a value from 0 (no interface) to 1 (interface). The predicted interfaces can be visualized with a color gradient per residue. This can be done in PyMOL with:

spectrum b, blue_white_red, all, 0, 1

Or in ChimeraX with:

color bfactor palette "#2B59C3:#D1D1D1:#D7263D" range 0,1
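Because the predictions live in the b-factor field, they can also be read back programmatically. The sketch below parses the fixed-width ATOM/HETATM records of a PDB file with plain Python and collects one score per residue. The function names, the choice to keep the maximum value over the atoms of a residue, and the 0.5 cutoff are assumptions for illustration; the repository does not prescribe a threshold here.

```python
def read_interface_scores(pdb_lines):
    """Map (chain, residue number, insertion code, residue name) to a score.

    Reads the b-factor field (columns 61-66 of the PDB format), where the
    prediction is stored as a value between 0 and 1.
    """
    scores = {}
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        key = (line[21], int(line[22:26]), line[26].strip(), line[17:20].strip())
        score = float(line[60:66])
        # Assumption: if atoms of one residue differ, keep the highest score.
        scores[key] = max(score, scores.get(key, 0.0))
    return scores


def predicted_interface(scores, threshold=0.5):
    """Return residues whose score reaches the (assumed) threshold."""
    return [key for key, score in scores.items() if score >= threshold]
```

For a file on disk, pass the open file object directly, for example read_interface_scores(open(path)), since the function only iterates over lines.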

Reproducibility

We provide all the Jupyter notebooks and scripts used to obtain and process the data, train and evaluate the model. The latest model (i_v4_1) is used for the benchmarks and results shown in the paper.

Interface prediction

All bioassemblies used are downloaded from RCSB PDB. The subunits are split into training, testing and validation datasets according to 30% sequence similarity clusters (processing/split_dataset.ipynb). Finally, we preprocess the structures, detect the interfaces within complexes and store the features and labels in an optimized HDF5 format (processing/build_dataset.py).
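The idea behind a cluster-based split is that all subunits of a sequence cluster end up in the same dataset, so similar sequences never appear on both sides of the train/test boundary. The sketch below illustrates this under stated assumptions: the function name split_by_cluster, the input format (a mapping from subunit ID to cluster ID) and the 70/15/15 fractions are hypothetical and are not taken from processing/split_dataset.ipynb.

```python
import random


def split_by_cluster(cluster_of, fractions=(0.7, 0.15, 0.15), seed=0):
    """Assign whole sequence clusters to train/test/validation sets.

    cluster_of: dict mapping a subunit ID to its cluster ID.
    fractions: assumed train/test/validation proportions, counted in clusters.
    """
    # Group subunits by cluster so that a cluster is never divided.
    clusters = {}
    for subunit, cluster in cluster_of.items():
        clusters.setdefault(cluster, []).append(subunit)

    # Shuffle cluster IDs reproducibly, then cut the list into three parts.
    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)
    n_train = int(fractions[0] * len(cluster_ids))
    n_test = int(fractions[1] * len(cluster_ids))
    parts = {
        "train": cluster_ids[:n_train],
        "test": cluster_ids[n_train:n_train + n_test],
        "validation": cluster_ids[n_train + n_test:],
    }
    return {
        name: sorted(s for c in ids for s in clusters[c])
        for name, ids in parts.items()
    }
```

Splitting by cluster rather than by subunit trades exact set sizes for a cleaner separation: the proportions hold for clusters, not for individual subunits.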

The model folder contains the scripts to train the model as well as the selected pre-trained models in model/save. The benchmark and comparison can be reproduced with the interface_*.ipynb notebooks.

MD analysis

Scripts and functions to perform predictions and analysis on MD are found in the md_analysis folder. Molecular dynamics trajectories are loaded using MDTraj. A utility tool called data_manager was developed to easily locate simulations within a defined folder tree structure. We also developed analysis tools based on MDTraj (mdtraj_utils).

Interfaceome

The interfaceome folder contains the Jupyter notebooks and python scripts used to download, process and analyse the data. All the AlphaFold-predicted structures used can be downloaded freely from the AlphaFold Protein Structure Database. Only the corresponding UniProt data is downloaded (interfaceome/download_uniprot.py). We also download the PAE from the AlphaFold Protein Structure Database (interfaceome/download_af_pae.py).

Other available pre-trained models

We provide 4 variants of the trained PeSTo models:

i_v3_0 is composed of 16 geometric transformers and uses both atom element and residue type information
i_v3_1 is composed of 16 geometric transformers and uses both atom element and residue type information but only predicts protein-protein interfaces
i_v4_0 is composed of 16 geometric transformers and uses only atom element
i_v4_1 is composed of 32 geometric transformers and uses only atom element

Web server

PeSTo can be used without installation through our web server, freely available at pesto.epfl.ch. PDB IDs, UniProt IDs or PDB files are accepted. The predictions are fast and can be visualized directly in the browser or downloaded as PDB files.

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Reference

Krapp, L.F., Abriata, L.A., Cortés Rodriguez, F. et al. PeSTo: parameter-free geometric deep learning for accurate prediction of protein binding interfaces. Nat Commun 14, 2175 (2023). https://doi.org/10.1038/s41467-023-37701-8

pesto's Issues

Questions about training

Could you please share the details of the training process? Thank you!

missing files

Hi,
The following files are missing from this repo:
training_exclusion_lists = [
    # "data/lists/ppdb5_set.txt",
    # "data/lists/masif-site_test_set.txt",
    # "data/lists/skempi_v2.txt",
    # "data/lists/memcplxdb.txt",
    # "data/lists/excluded.txt"
]
Could you upload them?
Best,
Zhangzhi

PeSTo/md_analysis/data_manager/data_manager.py, "meta" does not exist

I have tried to run PeSTo/md_analysis/apply_model_md.ipynb.
The folder "model" does not exist here, but I copied it from the root directory.
The from X import statements need to refer to "model."

In
from CLoNe.clone import CLoNe
"CLoNe" does not exist

The data "meta" does not exist, so the data_manager.py just throws a
FileNotFoundError: [Errno 2] No such file or directory: 'database/meta'

Also, the symlink for the directory
datasets/
points to a non-existent location

Could you upload the data, for testing?

AttributeError: module 'numpy' has no attribute 'object'

Dear authors,

Thanks for developing this very useful tool. I am testing PeSTo, and got an error with numpy:

model/save/i_v4_1_2021-09-07_11-21/src/structure.py", line 99, in tag_hetatm_chains
structure = tag_hetatm_chains(structure)
AttributeError: module 'numpy' has no attribute 'object'

This is perhaps a numpy version issue. When I created the environment, conda automatically installed numpy 1.25. Can you please tell me which numpy version works? Or, even better, specify the versions in the pesto.yml file? Thanks.

Datasets about protein-lipid interactions

Hello,
Thank you for sharing such good work on PeSTo.
I'm doing some work on protein-lipid interactions.
Could you tell me how I can get the protein-lipid data used in PeSTo?
Thanks!
CJ Wu

interface_ppi_benchmark.ipynb, incorrect paths and missing benchmark data

Hello, I noticed that some module calls do not match the directory structure in the .ipynb file.
E.g.
from config import config_model, config_data
should be (according to the github structure)
from model.config import config_model, config_data

Also, the benchmark data is missing from the GitHub repository. Where could I retrieve it from?
dataset = Dataset("datasets/contacts_rr5A_64nn_8192.h5")

interface_ppi_profiling_analysis.ipynb, missing

The results file below is missing from the repository, so the .ipynb notebook cannot be run:

results/interface_ppi_cuda_profiling.csv

Is there a ready-made Docker image?

Is there a ready-made Docker image?
The conda install is very time-consuming.

Question regarding extracting protein representation from pre-train model

Hi author,
Huge fan of your work. I am currently trying to apply your code to a downstream task. Specifically, I am working with a protein PDB file and aim to extract the latent representation of the protein using your pretrained model. I was wondering if this is possible.
If it is, could you kindly show me which script I should run and which line of code stores the representation of the protein?
Thank you so much for your time.

Question: which prediction file(s) to use

Dear Authors,

Thanks for developing this very useful tool. I did a test on a PDB structure. The tool generated 5 files (with names of *_i[0-4].pdb). The b-factor column in each of the five files is different. I didn't find any documentation about these files. Please advise on how to use the files for interface inference, such as the meaning of the b-factor column in each file, and cutoffs (like the mean values in a window and the size of the window to call an interface, etc.). Thanks!

interface_ppi_confidence.ipynb, missing data and incorrect module calling

Modules are not called according to the repository structure, e.g.:
from model import Model
should be
from model.model import Model

Also, the example data is missing:
dataset = Dataset("datasets/contacts_rr5A_64nn_8192.h5")

I would suggest rerunning all .ipynb files in a fresh conda environment with a freshly cloned repository, to ensure they work.

The long-range context seems to affect local interface prediction scores

Hi!
I've seen that the prediction scores may completely change depending on the long-range context... For example, the same protein domain is predicted to have different interfaces if a second (native) domain is included in the calculation, even though the 2nd domain is on the opposite face and 2 nm away from the initially predicted interface...

Similarly, I've found that mutating high-scoring residues can lead to the prediction of new interfaces with equally good scores at a completely different (and distant) surface of the protein...

I wonder if these issues are "artifacts" of the attention layer, which may tend to prioritize and rank different interfaces, while just pointing to "the best of the best" with a given confidence... If that is the case, then the computed score may not have a coherent or "absolute" scale; it will depend on how other interactions are being evaluated in the system... Hence, there seems to be room for finding even more hidden interaction interfaces...

Best

.yml file

Hi,

I am trying to install the environment with your .yml file, but it does not seem to work. Do you have any idea why?