This is a course project that enhances a docking scoring function. I take deltavinarf20 as the baseline and add more non-bonded interaction information: hydrogen_bond, water_bridge, halogen_bond, salt_bridge, pi_cation_interaction and pi_stack.
For a quick setup, install the dependencies listed in requirements.txt
(NOTE: do not forget to modify the system path).
You also need to preinstall several codebases:
- deltavina (https://github.com/chengwang88/deltavina.git) is the repo of deltavinarf20; it provides only the model inference code.
- vina4dv (https://github.com/chengwang88/vina4dv.git) is a fork of AutoDock Vina, which is required by deltavina.
- plip (https://github.com/pharmai/plip.git) is a tool that extracts the non-bonded interactions between a ligand and a protein.
For feature preparation:
Step 1, convert protein.pdb to protein.pdbqt using mgltools:
pythonsh /path-you-install-mgltools/MGLToolsPckgs/AutoDockTools/Utilities24/prepare_receptor4.py -r xxxx_protein.pdb -o xxxx_protein.pdbqt
Step 2, calculate the vina score and the deltavinarf20 score of each complex in the PDBbind refined-set using the deltavina repo:
/path-you-install-deltavina/deltavina/bin/dvrf20.py -r xxxx_protein.pdb -l xxxx_ligand.mol2
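Steps 1 and 2 have to be repeated for every complex, so it can help to generate both commands from the complex directory. The sketch below is only an illustration: it assumes each complex sits in a directory named after its id that contains xxxx_protein.pdb and xxxx_ligand.mol2, and build_commands is a hypothetical helper that is not part of this repo. The install roots are passed in because they depend on your machine.

```python
from pathlib import Path


def build_commands(complex_dir, mgltools_root, deltavina_root):
    """Return the Step 1 and Step 2 command lines for one complex.

    Assumption: complex_dir is named after the complex id and holds
    <id>_protein.pdb and <id>_ligand.mol2.
    """
    complex_dir = Path(complex_dir)
    pdb_id = complex_dir.name
    protein = complex_dir / f"{pdb_id}_protein.pdb"
    ligand = complex_dir / f"{pdb_id}_ligand.mol2"

    prepare_script = (Path(mgltools_root) / "MGLToolsPckgs" / "AutoDockTools"
                      / "Utilities24" / "prepare_receptor4.py")
    score_script = Path(deltavina_root) / "deltavina" / "bin" / "dvrf20.py"

    # Step 1: protein.pdb -> protein.pdbqt
    prepare = ["pythonsh", str(prepare_script),
               "-r", str(protein),
               "-o", str(protein.with_suffix(".pdbqt"))]
    # Step 2: vina score and deltavinarf20 score
    score = [str(score_script), "-r", str(protein), "-l", str(ligand)]
    return prepare, score


# Possible usage (paths are placeholders, nothing is executed here):
#   import subprocess
#   for d in sorted(Path("refined-set").iterdir()):
#       if d.is_dir() and d.name != "index":
#           for cmd in build_commands(d, "/path-you-install-mgltools",
#                                     "/path-you-install-deltavina"):
#               subprocess.run(cmd, check=True)
```

Returning argument lists instead of shell strings avoids quoting problems when a path contains spaces.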
Step 3, extract features using plip by running plip_extract_feature.py
(this also parses the xml file generated by plip)
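To give an idea of what parsing the plip xml file can look like, here is a minimal sketch that counts interactions per type. It is not the code in plip_extract_feature.py. It assumes the report groups interactions under container elements such as hydrogen_bonds and salt_bridges with one child element per interaction; check this against a report produced by your plip version. The SAMPLE string is hand-written for illustration.

```python
import xml.etree.ElementTree as ET

# Feature name -> container tag assumed to hold one child per interaction.
INTERACTION_GROUPS = {
    "hydrogen_bond": "hydrogen_bonds",
    "water_bridge": "water_bridges",
    "halogen_bond": "halogen_bonds",
    "salt_bridge": "salt_bridges",
    "pi_cation_interaction": "pi_cation_interactions",
    "pi_stack": "pi_stacks",
}


def count_interactions(xml_text):
    """Count interactions of each type across all binding sites."""
    root = ET.fromstring(xml_text)
    counts = {name: 0 for name in INTERACTION_GROUPS}
    for name, tag in INTERACTION_GROUPS.items():
        for group in root.iter(tag):
            counts[name] += len(list(group))  # number of child elements
    return counts


# Hand-written sample, not the output of a real plip run.
SAMPLE = """
<report>
  <bindingsite id="1">
    <interactions>
      <hydrogen_bonds>
        <hydrogen_bond id="1"/>
        <hydrogen_bond id="2"/>
      </hydrogen_bonds>
      <water_bridges/>
      <salt_bridges>
        <salt_bridge id="1"/>
      </salt_bridges>
      <pi_stacks/>
      <pi_cation_interactions/>
      <halogen_bonds/>
    </interactions>
  </bindingsite>
</report>
"""

print(count_interactions(SAMPLE))
```

Counts are the simplest feature; distances or angles stored on each interaction element could be aggregated the same way if the model needs them.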
Step 4, extract the vina score features by running scoring_core_deltavinarf20.py
Final step, run jupyter-notebook scoring_model.ipynb
to:
- combine the plip features with the vina score features as model input
- prepare the ground truth from refined-set/index/INDEX_refined_data.2019
- train models using sklearn
- run inference and calculate the mse loss
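The notebook steps above can be sketched roughly as follows. This is not the content of scoring_model.ipynb: combine_features is a hypothetical helper, the choice of RandomForestRegressor is an assumption (any sklearn regressor would fit the same pattern), and all numbers are random synthetic data, so the printed MSE says nothing about the real model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error


def combine_features(plip_features, vina_features, affinities):
    """Join both feature tables and the labels on the complex id.

    Each argument is a dict keyed by complex id. Only ids present in all
    three are kept, in sorted order.
    """
    ids = sorted(set(plip_features) & set(vina_features) & set(affinities))
    X = np.array([list(plip_features[i]) + list(vina_features[i]) for i in ids],
                 dtype=float)
    y = np.array([affinities[i] for i in ids], dtype=float)
    return ids, X, y


# Synthetic stand-ins for the real features and labels.
rng = np.random.default_rng(0)
demo_ids = [f"c{i:03d}" for i in range(40)]
plip = {i: rng.integers(0, 5, size=6).tolist() for i in demo_ids}  # 6 counts
vina = {i: rng.normal(size=2).tolist() for i in demo_ids}          # 2 scores
labels = {i: float(rng.uniform(2, 11)) for i in demo_ids}          # fake pKd

ids, X, y = combine_features(plip, vina, labels)

# Train on the first 30 rows, evaluate on the last 10.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X[:30], y[:30])
mse = mean_squared_error(y[30:], model.predict(X[30:]))
print(f"test MSE on synthetic data: {mse:.3f}")
```

In the real workflow the train and test rows come from different sets (refined-set and core-set) rather than from one sliced array.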
PDBbind (http://www.pdbbind.org.cn/) is a dataset derived from the PDB database. It contains a set of protein-ligand pairs, each identified by an id (like 1a28). For each entry, it takes the complex with the same id from the PDB database, separates the receptor from the ligand, and provides the experimentally measured binding affinity as pKd (-log Kd or -log Ki, the fourth column in refined-set/index/INDEX_refined_data.2019).
After cleaning, the train set (from refined-set) includes 3602 complexes and the test set (from core-set) includes 263 complexes.
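As an illustration of how the ground truth can be read from the index file, the sketch below takes the fourth whitespace-separated column as the affinity. It assumes that the first column is the complex id and that lines starting with # are comments; verify both against your copy of INDEX_refined_data.2019. read_affinities and pkd_from_molar are hypothetical helpers, and the sample rows are made up.

```python
import math


def pkd_from_molar(k):
    """Convert a Kd or Ki given in mol/L to pKd = -log10(K)."""
    return -math.log10(k)


def read_affinities(lines):
    """Map complex id -> affinity taken from the fourth column."""
    affinities = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and assumed comment lines
        fields = line.split()
        affinities[fields[0]] = float(fields[3])
    return affinities


# Made-up rows for illustration, not real PDBbind entries.
SAMPLE = [
    "# id  resolution  year  affinity  measurement",
    "aaaa  2.00  2001  2.00  Kd=10mM",
    "bbbb  1.80  2005  6.00  Ki=1uM",
]

print(read_affinities(SAMPLE))
print(pkd_from_molar(0.01))  # 10 mM expressed in mol/L
```

With a real file, pass the open file object instead of SAMPLE, for example read_affinities(open("refined-set/index/INDEX_refined_data.2019")).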