shiwentao00 / graphsite-classifier

Ligand-binding site classification with deep graph neural networks.

License: MIT License

Python 100.00%

deeplearning pytorch biochemistry graph-neural-network binding-sites binding-site-classification protein molecular-biology

graphsite-classifier's Introduction

Graphsite-classifier

Graphsite-classifier is a deep graph neural network that classifies ligand-binding sites on proteins. It is implemented with PyTorch and PyTorch Geometric. During training, the binding sites are transformed on the fly into graphs that contain both spatial and chemical features. A customized graph neural network (GNN) classifier is then trained on the graph representations of the binding pockets. The following figure illustrates the application pipeline:

For more details, please refer to our paper. If you find this repo useful in your work, please cite our paper :)
GraphSite: Ligand Binding Site Classification with Deep Graph Learning
Wentao Shi, Manali Singha, Limeng Pu, Gopal Srivastava, Jagannathan Ramanujam, and Michal Brylinski
Biomolecules 12, no. 8 (2022): 1053

Dataset

The dataset used in the experiment can be accessed via this OSF repo. The dataset consists of 21,125 binding pockets which are grouped into 14 classes. The details of the classes are described here. There are three files needed for training:

clusters.yaml: contains the initial clustering of the binding sites. Multiple clusters will be merged into one class before training.
pocket-dataset.tar.gz: contains all binding site data in this project.
pops-dataset.tar.gz: contains the contact surface area data used as a node feature.
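
The merging of clusters into classes can be pictured with a minimal sketch. The layout of the real clusters.yaml is not described here, so the sketch assumes that each cluster is a list of pocket names and that a separate, hypothetical `merge_plan` lists which cluster indices form each class. Both the function and its arguments are illustrative and are not part of the repository.

```python
def merge_clusters(clusters, merge_plan):
    """Merge clusters into classes.

    clusters:   list of clusters, each a list of pocket names (assumed layout).
    merge_plan: list of classes, each a list of cluster indices (hypothetical).
    """
    classes = []
    for cluster_ids in merge_plan:
        merged = []
        for idx in cluster_ids:
            merged.extend(clusters[idx])
        classes.append(merged)
    return classes


# Toy data, not taken from the real dataset.
clusters = [["p1", "p2"], ["p3"], ["p4", "p5"]]
merge_plan = [[0, 2], [1]]
classes = merge_clusters(clusters, merge_plan)
print([len(c) for c in classes])
```

Here clusters 0 and 2 end up in the first class and cluster 1 forms the second class on its own.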

If you want to generate your own data, the procedures and scripts to create the .mol2, .pops, and .profile files can be seen here.

Usage

Dependency

The training and inference scripts depend on the following Python packages:

PyTorch
PyTorch Geometric
Numpy
PyYAML
BioPandas
Pandas
Scikit-learn
Matplotlib
SciPy
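
A quick way to see which of these packages are missing from the current environment is sketched below. It only checks whether each module can be found and does not verify versions. The import names are the usual ones for these packages (for example, PyYAML is imported as `yaml` and Scikit-learn as `sklearn`); the helper name `missing_packages` is illustrative.

```python
import importlib.util

# Display name -> import name for the packages listed above.
REQUIRED = {
    "PyTorch": "torch",
    "PyTorch Geometric": "torch_geometric",
    "NumPy": "numpy",
    "PyYAML": "yaml",
    "BioPandas": "biopandas",
    "Pandas": "pandas",
    "Scikit-learn": "sklearn",
    "Matplotlib": "matplotlib",
    "SciPy": "scipy",
}


def missing_packages(required=REQUIRED):
    """Return the display names of packages whose module cannot be found."""
    return [name for name, module in required.items()
            if importlib.util.find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("Missing:", ", ".join(missing) if missing else "none")
```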

Training

The entire graph neural network implementation is in ./gnn. The training configuration is in ./gnn/train_classifier.yaml. To use the default architecture and hyperparameters for training, which we recommend, the user only has to make the following modifications:

set cluster_file_dir to the path of the clusters.yaml file you downloaded.
set pocket_dir to the path of the uncompressed pocket-dataset.tar.gz you downloaded.
set pop_dir to the path of the uncompressed pops-dataset.tar.gz you downloaded.
set trained_model_dir to the directory where you want the trained model to be saved.
set loss_dir and confusion_matrix_dir to the directories where you want to save other training results.

If you want to experiment with the model, feel free to tune the hyperparameters and try other models.
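
Before launching training, the paths set above can be checked with a short sketch. It assumes the configuration has already been loaded into a dict (for instance with `yaml.safe_load` from PyYAML) and that every listed value should point to a path that already exists; the function name `check_config` is hypothetical and not part of the repository.

```python
import os

# Keys named in the list above.
REQUIRED_KEYS = [
    "cluster_file_dir",
    "pocket_dir",
    "pop_dir",
    "trained_model_dir",
    "loss_dir",
    "confusion_matrix_dir",
]


def check_config(config):
    """Return a list of human-readable problems found in the config dict."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in config:
            problems.append(f"missing key: {key}")
        elif not os.path.exists(config[key]):
            problems.append(f"path does not exist: {key} = {config[key]}")
    return problems
```

An empty return value means every key is present and every path exists; anything else lists what to fix in the YAML file.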

After the training configurations are set, simply run:

cd ./gnn
python train_classifier.py

Inference

The inference script requires 3 input arguments:

unseen_data_dir: directory of unseen data. For each pocket, there should be 3 associated files: .mol2, .pops, and .profile. For example, a pocket on protein 6af2A needs the following 3 files:

6af2A.pops
6af2A.profile
6af2A00.mol2

unseen_data_classes: a yaml file containing 14 lists which represent the classes of data. If there is no data in a class, it should correspond to an empty list. See unseen-pocket-lists.yaml as an example.
trained_model: the path to the trained model.

After the inference data are prepared, run the following script to test the model:

python inference.py -unseen_data_dir ../unseen-data/unseen_pdb/ -unseen_data_classes ../unseen-data/unseen-pocket-list_new.yaml -trained_model ../trained_models/trained_classifier_model_63.pt
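
The two data inputs can be sanity-checked before running inference. The sketch below rests on several assumptions that are not confirmed by the repository: that the loaded YAML is a top-level list of 14 lists of pocket names, that a pocket name is the protein/chain name followed by a two-character pocket index, and that all files sit directly in unseen_data_dir. All function names are hypothetical.

```python
import os

NUM_CLASSES = 14  # number of classes stated in this README


def check_class_lists(classes):
    """Raise ValueError unless `classes` is a list of 14 lists (empty allowed)."""
    if not isinstance(classes, list) or len(classes) != NUM_CLASSES:
        raise ValueError(f"expected a list of {NUM_CLASSES} lists")
    for i, pockets in enumerate(classes):
        if not isinstance(pockets, list):
            raise ValueError(f"class {i} is not a list")


def required_files(pocket_name):
    """File names expected for one pocket.

    Assumption: dropping the last two characters of the pocket name gives
    the protein/chain name (hypothetical example: "1abcA00" -> "1abcA").
    """
    protein = pocket_name[:-2]
    return [f"{protein}.pops", f"{protein}.profile", f"{pocket_name}.mol2"]


def missing_files(data_dir, classes):
    """Map each pocket to the expected files that are absent from data_dir."""
    report = {}
    for pockets in classes:
        for pocket in pockets:
            absent = [name for name in required_files(pocket)
                      if not os.path.isfile(os.path.join(data_dir, name))]
            if absent:
                report[pocket] = absent
    return report
```

Calling `check_class_lists` first and then `missing_files` gives an empty dict when every pocket has its three files.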

graphsite-classifier's Issues

Error : torch_sparse/_version_cuda.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKSs

After installing the dependencies, I tried to train the Graphsite-classifier model and got the following error.
Please advise on this matter.
Thank you.

Traceback (most recent call last):
  File "/home/nmrbox/spenumutchu/workflow2023/56Graphsite_classifer/Graphsite-classifier-master/gnn/train_classifier.py", line 8, in <module>
    from dataloader import read_cluster_file_from_yaml
  File "/home/nmrbox/spenumutchu/workflow2023/56Graphsite_classifer/Graphsite-classifier-master/gnn/dataloader.py", line 6, in <module>
    from torch_geometric.data import Data, Dataset
  File "/home/nmrbox/spenumutchu/anaconda3/envs/56Graphsite-classifier_py37/lib/python3.9/site-packages/torch_geometric/__init__.py", line 4, in <module>
    import torch_geometric.data
  File "/home/nmrbox/spenumutchu/anaconda3/envs/56Graphsite-classifier_py37/lib/python3.9/site-packages/torch_geometric/data/__init__.py", line 1, in <module>
    from .data import Data
  File "/home/nmrbox/spenumutchu/anaconda3/envs/56Graphsite-classifier_py37/lib/python3.9/site-packages/torch_geometric/data/data.py", line 20, in <module>
    from torch_sparse import SparseTensor
  File "/home/nmrbox/spenumutchu/anaconda3/envs/56Graphsite-classifier_py37/lib/python3.9/site-packages/torch_sparse/__init__.py", line 18, in <module>
    torch.ops.load_library(spec.origin)
  File "/home/nmrbox/spenumutchu/anaconda3/envs/56Graphsite-classifier_py37/lib/python3.9/site-packages/torch/_ops.py", line 255, in load_library
    ctypes.CDLL(path)
  File "/home/nmrbox/spenumutchu/anaconda3/envs/56Graphsite-classifier_py37/lib/python3.9/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /home/nmrbox/spenumutchu/anaconda3/envs/56Graphsite-classifier_py37/lib/python3.9/site-packages/torch_sparse/_version_cuda.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKSs
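
An undefined-symbol error when loading a compiled extension such as torch_sparse often means the extension binary was built against a different PyTorch build than the one installed. That is a common cause, not a confirmed diagnosis for this report. A minimal sketch for collecting the installed versions to include in a report like this one, using only the standard library (the package list and function name are illustrative):

```python
from importlib import metadata

# Distribution names as published on PyPI; adjust the list as needed.
PACKAGES = ["torch", "torch-geometric", "torch-sparse"]


def installed_versions(packages=PACKAGES):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```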

Generate Files

I am trying to use the procedures and scripts to create the .mol2, .pops, and .profile files, but separate_pdb_chains.py does not seem to actually separate the chains as it should. It is possible I am just using it wrong, but I cannot get it to separate the PDB files into chains, as it appears this script should do.
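
For comparison, a minimal sketch of chain separation is shown here. It is not the repository's separate_pdb_chains.py and may differ from it in what it keeps; it relies only on the fixed-column PDB format, where the chain identifier of ATOM and HETATM records is in column 22. The function name is hypothetical.

```python
from collections import defaultdict


def split_pdb_by_chain(pdb_text):
    """Group ATOM/HETATM records by chain ID (column 22 of the PDB format).

    Returns a dict mapping each chain ID to PDB text that holds only that
    chain's records, terminated by an END line.
    """
    chains = defaultdict(list)
    for line in pdb_text.splitlines():
        # Column 22 is index 21 when counting from zero.
        if line.startswith(("ATOM", "HETATM")) and len(line) > 21:
            chains[line[21]].append(line)
    return {chain: "\n".join(lines) + "\nEND\n"
            for chain, lines in chains.items()}
```

Passing the contents of a PDB file returns one text block per chain, which can then be written to separate files. If this produces the expected chains while the repository script does not, the difference is worth including in the report.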

Regarding the fpocket parameters

Hi,

I am attempting to replicate the binding pockets that you generated using the default fpocket parameters, but I have noticed significant differences in the results. I am not sure what the optimal parameters to use are. Could you provide some guidance on this matter?

Thank you.

The dataset was deleted

Sorry, while reproducing your work, I found that the dataset pointed to by the URL has been deleted. Could you provide your dataset again? Thank you very much.

unseen-pocket-lists.yaml

In step 2 of the inference section, it says that I am supposed to create a file that looks something like "unseen-pocket-lists.yaml". I have looked for this example file and I can't seem to find it anywhere. Could this please be added so I can make the yml file correctly? Thanks

Error while writing the Mol files

I am having issues generating input files for the binding pockets (Step-7 and Step-8) as outlined in the instructions at https://github.com/shiwentao00/Graphsite-classifier/blob/master/docs/data_curation/readme.md.

I am getting an error message while attempting to screen small molecules for unknown binding pockets, and my troubleshooting has led me to believe the issue may be with generating the mol2 files using step 7 and step 8 in the link above. I have not been able to replicate the exact input mol2 files, for example using the PDB code 1a0f with a trained data set.

It looks like obabel is not writing all of the bond connectivity information in the @<TRIPOS>BOND section at the end of the file when converting the PDB into a mol2 file (I could be wrong about this).

I would greatly appreciate it if you could offer any advice on this matter.
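
One way to test the suspicion about missing bond records is to compare the bond count declared in the mol2 header with the number of records actually present. The sketch below assumes a single-molecule file in the Tripos mol2 layout, where the second line after @<TRIPOS>MOLECULE holds the atom count followed by the bond count. The function name is hypothetical.

```python
def mol2_bond_counts(mol2_text):
    """Return (declared, found) bond counts for a single-molecule mol2 file.

    declared: the bond count on the counts line of @<TRIPOS>MOLECULE
              (None if the line omits it).
    found:    the number of non-empty records under @<TRIPOS>BOND.
    """
    declared, found = None, 0
    section, line_in_section = None, 0
    for raw in mol2_text.splitlines():
        line = raw.strip()
        if line.startswith("@<TRIPOS>"):
            # A new record type starts; remember its name.
            section, line_in_section = line[len("@<TRIPOS>"):], 0
            continue
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        line_in_section += 1
        if section == "MOLECULE" and line_in_section == 2:
            fields = line.split()
            declared = int(fields[1]) if len(fields) > 1 else None
        elif section == "BOND":
            found += 1
    return declared, found
```

If `declared` and `found` differ for a converted file, bond records were lost or truncated during conversion.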