Graph Network for protein-protein interface including language model features
With Anaconda
- Clone the repository
$ git clone https://github.com/DeepRank/DeepRank-GNN-esm.git
$ cd DeepRank-GNN-esm
- Install either the CPU or GPU version of DeepRank-GNN-esm
$ conda env create -f environment-cpu.yml && conda activate deeprank-gnn-esm-cpu-env
OR
$ conda env create -f environment-gpu.yml && conda activate deeprank-gnn-esm-gpu-env
- Install the command line tool
$ pip install .
- Run the tests to make sure everything is working
$ pytest tests/
We provide a command-line interface for DeepRank-GNN-ESM that can be used to score protein-protein complexes. The command-line interface can be used as follows:
$ deeprank-gnn-esm-predict -h
usage: deeprank-gnn-esm-predict [-h] pdb_file
positional arguments:
pdb_file Path to the PDB file.
optional arguments:
-h, --help show this help message and exit
Example, score the 1B6C
complex
# download it
$ wget https://files.rcsb.org/view/1B6C.pdb -q
# make sure the environment is activated
$ conda activate deeprank-gnn-esm-gpu-env
(deeprank-gnn-esm-gpu-env) $ deeprank-gnn-esm-predict 1B6C.pdb
2023-06-28 06:08:21,889 predict:64 INFO - Setting up workspace - /home/DeepRank-GNN-esm/1B6C-gnn_esm_pred
2023-06-28 06:08:21,945 predict:72 INFO - Renumbering PDB file.
2023-06-28 06:08:22,294 predict:104 INFO - Reading sequence of PDB 1B6C.pdb
2023-06-28 06:08:22,423 predict:131 INFO - Generating embedding for protein sequence.
2023-06-28 06:08:22,423 predict:132 INFO - ################################################################################
2023-06-28 06:08:32,447 predict:138 INFO - Transferred model to GPU
2023-06-28 06:08:32,450 predict:147 INFO - Read /home/DeepRank-GNN-esm/1B6C-gnn_esm_pred/all.fasta with 8 sequences
2023-06-28 06:08:32,459 predict:157 INFO - Processing 1 of 2 batches (4 sequences)
2023-06-28 06:08:34,061 predict:157 INFO - Processing 2 of 2 batches (4 sequences)
2023-06-28 06:08:36,462 predict:200 INFO - ################################################################################
2023-06-28 06:08:36,470 predict:205 INFO - Generating graph, using 79 processors
Graphs added to the HDF5 file
Embedding added to the /home/DeepRank-GNN-esm/1B6C-gnn_esm_pred/graph.hdf5 file
2023-06-28 06:09:03,345 predict:220 INFO - Graph file generated: /home/DeepRank-GNN-esm/1B6C-gnn_esm_pred/graph.hdf5
2023-06-28 06:09:03,345 predict:226 INFO - Predicting fnat of protein complex.
2023-06-28 06:09:03,345 predict:234 INFO - Using device: cuda:0
# ...
2023-06-28 06:09:07,794 predict:280 INFO - Predicted fnat for 1B6C: 0.342
2023-06-28 06:09:07,803 predict:290 INFO - Output written to /home/DeepRank-GNN-esm/1B6C-gnn_esm_pred/output.csv
From the output above you can see that the predicted fnat for the 1B6C complex is 0.342, this information is also written to the output.csv
file.
The command above will generate a folder in the current working directory, containing the following:
1B6C-gnn_esm_pred
├── 1B6C.A.pt
├── 1B6C.B.pt
├── 1B6C.pdb
├── GNN_esm_prediction.csv
├── GNN_esm_prediction.hdf5
├── graph.hdf5
└── output.csv
-
Generate fasta sequence in bulk, use script 'get_fasta.py'
usage: get_fasta.py [-h] pdb_dir output_fasta_name positional arguments: pdb_dir Path to the directory containing PDB files output_fasta_name Name of the combined output FASTA file options: -h, --help show this help message and exit
-
Generate embeddings in bulk from combined fasta files, use the script provided inside esm-2 package,
$ python esm_2_installation_location/scripts/extract.py \ esm2_t33_650M_UR50D \ all.fasta \ tests/data/embedding/1ATN/ \ --repr_layers 0 32 33 \ --include mean per_tok
Replace 'esm_2_installation_location' with your installation location, 'all.fasta' with fasta sequence generated above, 'tests/data/embedding/1ATN/' with the output folder name for esm embeddings
- Example code to generate residue graphs in hdf5 format:
from deeprank_gnn.GraphGenMP import GraphHDF5 pdb_path = "tests/data/pdb/1ATN/" pssm_path = "tests/data/pssm/1ATN/" embedding_path = "tests/data/embedding/1ATN/" nproc = 20 outfile = "1ATN_residue.hdf5" GraphHDF5( pdb_path = pdb_path, pssm_path = pssm_path, embedding_path = embedding_path, graph_type = "residue", outfile = outfile, nproc = nproc, #number of cores to use tmpdir="./tmpdir")
- Example code to add continuous or binary targets to the hdf5 file
import h5py import random hdf5_file = h5py.File('1ATN_residue.hdf5', "r+") for mol in hdf5_file.keys(): fnat = random.random() bin_class = [1 if fnat > 0.3 else 0] hdf5_file.create_dataset(f"/{mol}/score/binclass", data=bin_class) hdf5_file.create_dataset(f"/{mol}/score/fnat", data=fnat) hdf5_file.close()
- Example code to use pre-trained DeepRank-GNN-esm model
from deeprank_gnn.ginet import GINet from deeprank_gnn.NeuralNet import NeuralNet database_test = "1ATN_residue.hdf5" gnn = GINet target = "fnat" edge_attr = ["dist"] threshold = 0.3 pretrained_model = deeprank-GNN-esm/paper_pretrained_models/scoring_of_docking_models/gnn_esm/treg_yfnat_b64_e20_lr0.001_foldall_esm.pth.tar node_feature = ["type", "polarity", "bsa", "charge", "embedding"] device_name = "cuda:0" num_workers = 10 model = NeuralNet( database_test, gnn, device_name = device_name, edge_feature = edge_attr, node_feature = node_feature, target = target, num_workers = num_workers, pretrained_model = pretrained_model, threshold = threshold) model.test(hdf5 = "tmpdir/GNN_esm_prediction.hdf5")