hnljq / gnina

This project forked from gnina/gnina

A deep learning framework for molecular docking

License: Apache License 2.0

gnina's Introduction

gnina (pronounced NEE-na) is a fork of smina, which is a fork of AutoDock Vina.

gnina is not recommended for production use (yet) in molecular modeling tasks. However, it is suitable as a platform for researching structure-based deep learning approaches as described in our paper.

Help

Reminder: gnina is not yet intended for production use. However, if you would like to evaluate it or use it as a research platform, please subscribe to our slack team.

Citation

If you find gnina useful, please cite our paper(s):

Protein–Ligand Scoring with Convolutional Neural Networks (Primary citation) M Ragoza, J Hochuli, E Idrobo, J Sunseri, DR Koes. J. Chem. Inf. Model, 2017
link arXiv

Ligand pose optimization with atomic grid-based convolutional neural networks M Ragoza, L Turner, DR Koes. Machine Learning for Molecules and Materials NIPS 2017 Workshop, 2017 arXiv

Visualizing convolutional neural network protein-ligand scoring J Hochuli, A Helbling, T Skaist, M Ragoza, DR Koes. Journal of Molecular Graphics and Modelling, 2018 link arXiv

Convolutional neural network scoring and minimization in the D3R 2017 community challenge J Sunseri, JE King, PG Francoeur, DR Koes. Journal of computer-aided molecular design, 2018 link PubMed

Installation

Ubuntu 16.04

apt-get install build-essential git wget libboost-all-dev libeigen3-dev libgoogle-glog-dev libprotobuf-dev protobuf-compiler libhdf5-serial-dev libatlas-base-dev python-dev librdkit-dev python-numpy python-pip

Build and Install Libmolgrid

Follow NVIDIA's instructions to install the latest version of CUDA. Note we are in the process of transitioning to CUDA 9.1.

git clone https://github.com/gnina/gnina.git
cd gnina
mkdir build
cd build
cmake ..
make
make install

Note that the scripts provided in gnina/scripts have additional Python dependencies that must be installed.

CentOS 7

By default, the program will not build on a computer whose GPU has compute capability < 3.5. If you force a different architecture, the program will compile, but it will not run on that computer because of the GPU architecture difference.

Add the EPEL repository

sudo yum install epel-release

Follow NVIDIA's instructions to install the latest version of CUDA. Or:

wget http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-repo-rhel7-9.1.85-1.x86_64.rpm
sudo rpm -i cuda-repo-rhel7-9.1.85-1.x86_64.rpm
sudo yum clean all
sudo yum install cuda

Install dependencies
These are necessary to build RDKit, Caffe, and gnina.

sudo yum groupinstall 'Development Tools'
sudo yum install boost-devel.x86_64 eigen3-devel.noarch protobuf-compiler.x86_64 protobuf-devel.x86_64 hdf5-devel.x86_64 cmake git wget openbabel-devel.x86_64 openbabel.x86_64 leveldb-devel.x86_64 snappy-devel.x86_64 opencv-devel.x86_64 gflags-devel.x86_64 glog-devel.x86_64 lmdb-devel.x86_64 readline-devel.x86_64 zlib-devel.x86_64 bzip2-devel.x86_64 sqlite-devel.x86_64 python-devel.x86_64 numpy.x86_64 atlas-devel.x86_64 atlas.x86_64 atlas-static.x86_64

Build and Install Libmolgrid

Install cmake 3.8
The cmake installed by yum in CentOS 7 (cmake version 2.8.12.2) produces a lot of errors. It is better to use an updated version.

cd /home/$USER/bin
wget https://cmake.org/files/v3.8/cmake-3.8.0-Linux-x86_64.tar.gz
tar -xvf cmake-3.8.0-Linux-x86_64.tar.gz
export CMAKE_HOME=/home/$USER/bin/cmake-3.8.0-Linux-x86_64
export PATH=$CMAKE_HOME/bin:$PATH

Install RDKit Release_2017_03_1 and compile gnina

Install RDKit
It is better to keep everything inside the gnina directory.

cd /home/$USER/bin
git clone https://github.com/gnina/gnina.git
cd gnina
wget https://github.com/rdkit/rdkit/archive/Release_2017_03_1.tar.gz
tar -xvf Release_2017_03_1.tar.gz
cd rdkit-Release_2017_03_1
export RDBASE=`pwd`
export LD_LIBRARY_PATH=$RDBASE/lib:$LD_LIBRARY_PATH
mkdir build
cd build

If you are using Anaconda Python, then you need to check that all the Python variables are set correctly, or set them manually.

export ANACONDA_PY_HOME=/home/$USER/anaconda2  # or /home/$USER/miniconda2
cmake -DPYTHON_EXECUTABLE=$ANACONDA_PY_HOME/bin/python -DPYTHON_INCLUDE_DIR=$ANACONDA_PY_HOME/include/python2.7 -DPYTHON_LIBRARY=$ANACONDA_PY_HOME/lib/libpython2.7.so -DPYTHON_NUMPY_INCLUDE_PATH=$ANACONDA_PY_HOME/lib/python2.7/site-packages/numpy/core/include ..
make
ctest
make install

If you are using the system Python that ships with CentOS:

cmake ..
make 
ctest
make install

Fix RDKit Libraries
Compiling RDKit from source adds the package name to each library file name.
For example, libSmilesParse.so (Ubuntu package) becomes libRDKitSmilesParse.so (compiled on CentOS).
We need to create additional symbolic links that match the Ubuntu names.

cd $RDBASE/lib
for i in $(ls -1 *.so.1.2017.03.1); do name=`basename $i .so.1.2017.03.1`; namef=`echo $name | sed 's/RDKit//g'`; ln -s $i ${namef}.so.1; ln -s ${namef}.so.1 ${namef}.so; done
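The renaming rule used by the loop above can be sketched in Python as a quick way to preview which link names would be created. This is only an illustration: the function name ubuntu_style_links is hypothetical, and the version suffix is the one from the RDKit release used in this section.

```python
def ubuntu_style_links(lib_file, version_suffix=".so.1.2017.03.1"):
    """Return the two link names the shell loop would create for lib_file.

    Hypothetical helper: it only computes names and does not touch the filesystem.
    """
    if not lib_file.endswith(version_suffix):
        raise ValueError("unexpected library file name: " + lib_file)
    # Strip the version suffix, e.g. libRDKitSmilesParse
    name = lib_file[: -len(version_suffix)]
    # Drop the package name that the source build adds, e.g. libSmilesParse
    short = name.replace("RDKit", "")
    # First link points at the versioned file, second link points at the first
    return short + ".so.1", short + ".so"


so1_link, so_link = ubuntu_style_links("libRDKitSmilesParse.so.1.2017.03.1")
print(so1_link, so_link)
```

Here libSmilesParse.so.1 would link to the versioned file and libSmilesParse.so would link to libSmilesParse.so.1, mirroring the two ln -s calls.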

Continue with gnina compilation
We need to set the variables for the ATLAS libraries.
Use libsatlas.so for serial libraries or libtatlas.so for threaded libraries.
We also need to set the variables for the HDF5 compilers to avoid a conflict with those provided by Anaconda Python.

If you are using Anaconda Python, then you need to check that all the Python variables are set correctly, or set them manually.

cd /home/$USER/bin/gnina
mkdir build
cd build
cmake -DPYTHON_EXECUTABLE=$ANACONDA_PY_HOME/bin/python -DPYTHON_INCLUDE_DIR=$ANACONDA_PY_HOME/include/python2.7 -DPYTHON_LIBRARY=$ANACONDA_PY_HOME/lib/libpython2.7.so -DAtlas_BLAS_LIBRARY=/usr/lib64/atlas/libtatlas.so -DAtlas_CBLAS_LIBRARY=/usr/lib64/atlas/libtatlas.so -DAtlas_LAPACK_LIBRARY=/usr/lib64/atlas/libtatlas.so -DHDF5_CXX_COMPILER_EXECUTABLE=/usr/bin/h5c++ -DHDF5_C_COMPILER_EXECUTABLE=/usr/bin/h5cc -DHDF5_DIFF_EXECUTABLE=/usr/bin/h5diff ..
make 
make install

If you are using the system Python that ships with CentOS:

cd /home/$USER/bin/gnina
mkdir build
cd build
cmake -DAtlas_BLAS_LIBRARY=/usr/lib64/atlas/libtatlas.so -DAtlas_CBLAS_LIBRARY=/usr/lib64/atlas/libtatlas.so -DAtlas_LAPACK_LIBRARY=/usr/lib64/atlas/libtatlas.so ..
make
make install

If you are building for systems with different GPUs, include -DCUDA_ARCH_NAME=All.

Training

Scripts to aid in training new CNN models can be found at https://github.com/gnina/scripts and sample models at https://github.com/gnina/models.

The input layer should be a MolGridData layer. For example:

layer {
  name: "data"
  type: "MolGridData"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  molgrid_data_param {
    source: "TRAINFILE"
    batch_size:  20
    dimension: 23.5
    resolution: 0.5
    shuffle: true
    balanced: true
    random_rotation: true
    random_translate: 2
    root_folder: "/home/dkoes/CSAR/"
  }
}

This layer performs GPU-accelerated grid generation on the fly, which means it can apply random rotations and translations to the input (essential for training). The input file (TRAINFILE) contains one example per line, each consisting of a label, a receptor file, and a ligand file:

1 set2/297/rec.gninatypes set2/297/docked_0.gninatypes # text after a hash is ignored
1 set2/297/rec.gninatypes set2/297/docked_1.gninatypes
1 set2/297/rec.gninatypes set2/297/docked_2.gninatypes 
1 set2/297/rec.gninatypes set2/297/docked_3.gninatypes 
0 set2/297/rec.gninatypes set2/297/docked_4.gninatypes 
0 set2/297/rec.gninatypes set2/297/docked_5.gninatypes 
...

Although the receptor and ligand can be specified as any normal molecular data file, we strongly recommend (for training at least) that molecular structure files be converted to gninatypes files with the gninatyper executable. These are much smaller files that incur less I/O. Relative file paths will be prepended with the root_folder parameter in MolGridData, if applicable.
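To make the input file format concrete, here is a minimal Python sketch that reads one line of such a file. It is not gnina's own parser: the names Example and parse_types_line are hypothetical, the label is assumed to be an integer as in the sample above, and the handling of root_folder simply follows the description in the text (relative paths get the prefix, absolute paths are left alone).

```python
import posixpath
from typing import NamedTuple, Optional


class Example(NamedTuple):
    label: int
    receptor: str
    ligand: str


def parse_types_line(line: str, root_folder: str = "") -> Optional[Example]:
    """Parse 'label receptor ligand'; return None for blank or comment-only lines."""
    # Text after a hash is ignored
    text = line.split("#", 1)[0].strip()
    if not text:
        return None
    fields = text.split()
    if len(fields) < 3:
        raise ValueError("expected label, receptor and ligand: " + line)
    label, receptor, ligand = fields[0], fields[1], fields[2]
    # posixpath.join leaves absolute paths untouched and prefixes relative ones
    return Example(int(label),
                   posixpath.join(root_folder, receptor),
                   posixpath.join(root_folder, ligand))


ex = parse_types_line(
    "1 set2/297/rec.gninatypes set2/297/docked_0.gninatypes # text after a hash is ignored",
    root_folder="/home/dkoes/CSAR/")
print(ex)
```

A check like this can be useful for validating a types file before starting a long training run, for example by confirming that every referenced file exists.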

The provided models are templated with TRAINFILE and TESTFILE arguments, which the train.py script will substitute with actual files. The train.py script can be called with a model and a prefix for testing and training files:

cd models/refmodel3
train.py -m refmodel3.model -p ../data/csar/all -d ../data/csar

This will perform cross-validation using the alltrain[0-2].types and alltest[0-2].types files. Note that refmodel3.model requires the file models/refmodel3/ligmap.old to be in the current directory.
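The file naming used for cross-validation can be sketched as follows. This is an illustration of the <prefix>[train|test][num].types pattern only, not code taken from train.py; the function name fold_files is hypothetical, and the default of three folds is an assumption based on the alltrain[0-2] example above.

```python
def fold_files(prefix, folds=range(3)):
    """Return (train_file, test_file) pairs for each fold number."""
    return [(f"{prefix}train{n}.types", f"{prefix}test{n}.types") for n in folds]


for train_file, test_file in fold_files("../data/csar/all"):
    print(train_file, test_file)
```

With the prefix ../data/csar/all, the first pair is ../data/csar/alltrain0.types and ../data/csar/alltest0.types.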

There are quite a few options to train.py for modifying training:

usage: train.py [-h] -m MODEL -p PREFIX [-n NUMBER] [-i ITERATIONS] [-s SEED]
                [-t TEST_INTERVAL] [-o OUTPREFIX] [-g GPU] [-c CONT] [-k] [-r]
                [--avg_rotations] [--keep_best] [--dynamic] [--solver SOLVER]
                [--lr_policy LR_POLICY] [--step_reduce STEP_REDUCE]
                [--step_end STEP_END] [--step_when STEP_WHEN]
                [--base_lr BASE_LR] [--momentum MOMENTUM]
                [--weight_decay WEIGHT_DECAY] [--gamma GAMMA] [--power POWER]
                [--weights WEIGHTS]

Train neural net on .types data.

optional arguments:
  -h, --help            show this help message and exit
  -m MODEL, --model MODEL
                        Model template. Must use TRAINFILE and TESTFILE
  -p PREFIX, --prefix PREFIX
                        Prefix for training/test files:
                        <prefix>[train|test][num].types
  -n NUMBER, --number NUMBER
                        Fold number to run, default is all
  -i ITERATIONS, --iterations ITERATIONS
                        Number of iterations to run,default 10,000
  -s SEED, --seed SEED  Random seed, default 42
  -t TEST_INTERVAL, --test_interval TEST_INTERVAL
                        How frequently to test (iterations), default 40
  -o OUTPREFIX, --outprefix OUTPREFIX
                        Prefix for output files, default <model>.<pid>
  -g GPU, --gpu GPU     Specify GPU to run on
  -c CONT, --cont CONT  Continue a previous simulation from the provided
                        iteration (snapshot must exist)
  -k, --keep            Don't delete prototxt files
  -r, --reduced         Use a reduced file for model evaluation if exists(<pre
                        fix>[_reducedtrain|_reducedtest][num].types)
  --avg_rotations       Use the average of the testfile's 24 rotations in its
                        evaluation results
  --keep_best           Store snapshots everytime test AUC improves
  --dynamic             Attempt to adjust the base_lr in response to training
                        progress
  --solver SOLVER       Solver type. Default is SGD
  --lr_policy LR_POLICY
                        Learning policy to use. Default is inv.
  --step_reduce STEP_REDUCE
                        Reduce the learning rate by this factor with dynamic
                        stepping, default 0.5
  --step_end STEP_END   Terminate training if learning rate gets below this
                        amount
  --step_when STEP_WHEN
                        Perform a dynamic step (reduce base_lr) when training
                        has not improved after this many test iterations,
                        default 10
  --base_lr BASE_LR     Initial learning rate, default 0.01
  --momentum MOMENTUM   Momentum parameters, default 0.9
  --weight_decay WEIGHT_DECAY
                        Weight decay, default 0.001
  --gamma GAMMA         Gamma, default 0.001
  --power POWER         Power, default 1
  --weights WEIGHTS     Set of weights to initialize the model with
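To give a feel for the default learning rate settings, the sketch below evaluates Caffe's inv learning rate policy, base_lr * (1 + gamma * iter) ^ (-power), with the defaults listed above. It assumes that train.py passes --base_lr, --gamma and --power to the Caffe solver unchanged and that --dynamic is not in use; the function name inv_learning_rate is hypothetical.

```python
def inv_learning_rate(iteration, base_lr=0.01, gamma=0.001, power=1.0):
    """Learning rate at a given iteration under Caffe's 'inv' policy."""
    return base_lr * (1.0 + gamma * iteration) ** (-power)


for it in (0, 1000, 10000):
    print(it, inv_learning_rate(it))
```

Under these assumptions the rate is halved after 1,000 iterations and has dropped to roughly one eleventh of base_lr by the default 10,000 iterations.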

The DUD-E docked poses used in the original paper can be found here.
We will make additional datasets (beyond what is available in models/data) available as they are requested. Feel free to contact us.

User Grids

In some cases it may be desirable to incorporate additional grid-based input into the training data. In this case it is necessary to pre-generate grids from the molecular data and user-supplied grids with gninagrid and use the NDimData input layer.

layer {
  name: "data"
  type: "NDimData"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  ndim_data_param {
    source: "TRAINFILE"
    batch_size: 10
    shape {
      dim: 34
      dim: 48
      dim: 48
      dim: 48
    }
    shuffle: true
    balanced: true
    rotate: 24
  }
}

As with the MolGridData layer, TRAINFILE contains an example on each line, consisting of a label and one or more binmap files generated using gninagrid:

1 CS12.48.19.binmap.gz CS12_0.48.18.binmap.gz
0 CS12.48.19.binmap.gz CS12_1.48.18.binmap.gz
0 CS12.48.19.binmap.gz CS12_2.48.18.binmap.gz
0 CS12.48.19.binmap.gz CS12_3.48.18.binmap.gz

As an example, imagine we want to incorporate three additional grids, cdk_gist-dipole-dens.dx, cdk_gist-dipolex-dens.dx, and cdk_gist-gO.dx into the input. We would run gninagrid:

gninagrid  -r rec.pdb -l CDK2_CS12_docked.sdf.gz -g cdk_gist-dipole-dens.dx -g cdk_gist-dipolex-dens.dx -g cdk_gist-gO.dx -o CS12 --separate

Since --separate is passed, this will produce separate receptor (which includes the user provided grids) and ligand files:

-rw-rw-r-- 1 dkoes dkoes 8404992 Apr 28 12:55 CS12.48.19.binmap
-rw-rw-r-- 1 dkoes dkoes 7962624 Apr 28 12:55 CS12_0.48.18.binmap
-rw-rw-r-- 1 dkoes dkoes 7962624 Apr 28 12:55 CS12_1.48.18.binmap
...

The receptor file has 16 channels for the regular protein atom types and 3 for the provided grids. The grid dimensions, resolution, and positioning are determined from the provided grids (which must all match). To save (a lot of) space, the binmap files can be gzipped:

gzip *.binmap

Note that it is up to the user to ensure that the dimensions (including total number of channels) of the input files match the dimensions specified in the NDimData layer.
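One way to sanity-check dimensions before training is to compare file sizes against the expected grid size. The sketch below rests on assumptions inferred from the listing above rather than from a documented format: that an uncompressed binmap holds one 4-byte value per grid point per channel with no header, and that the two numbers in a name such as CS12.48.19.binmap are the grid points per side and the channel count. The function name binmap_bytes is hypothetical.

```python
def binmap_bytes(points_per_side, channels, bytes_per_value=4):
    """Expected size in bytes of an uncompressed cubic grid file (assumed layout)."""
    return points_per_side ** 3 * channels * bytes_per_value


# Receptor file: 48 points per side, 16 atom type channels + 3 user grids
print(binmap_bytes(48, 19))
# Ligand file: 48 points per side, 18 channels
print(binmap_bytes(48, 18))
```

These values agree with the sizes in the directory listing (8404992 and 7962624 bytes), which is what suggests the assumed layout.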
