Scripts for training and validating neural network models for gnina

train.py - Takes a model template and train/test file(s)
predict.py - Takes a model, learned weights, and an input file and outputs probabilities of binding

Training

usage: train.py [-h] -m MODEL -p PREFIX [-d DATA_ROOT] [-n FOLDNUMS] [-a]
                [-i ITERATIONS] [-s SEED] [-t TEST_INTERVAL] [-o OUTPREFIX]
                [-g GPU] [-c CONT] [-k] [-r] [--avg_rotations] [--keep_best]
                [--dynamic] [--solver SOLVER] [--lr_policy LR_POLICY]
                [--step_reduce STEP_REDUCE] [--step_end STEP_END]
                [--step_when STEP_WHEN] [--base_lr BASE_LR]
                [--momentum MOMENTUM] [--weight_decay WEIGHT_DECAY]
                [--gamma GAMMA] [--power POWER] [--weights WEIGHTS]
                [-p2 PREFIX2] [-d2 DATA_ROOT2] [--data_ratio DATA_RATIO]

Train neural net on .types data.

optional arguments:
  -h, --help            show this help message and exit
  -m MODEL, --model MODEL
                        Model template. Must use TRAINFILE and TESTFILE
  -p PREFIX, --prefix PREFIX
                        Prefix for training/test files:
                        <prefix>[train|test][num].types
  -d DATA_ROOT, --data_root DATA_ROOT
                        Root folder for relative paths in train/test files
  -n FOLDNUMS, --foldnums FOLDNUMS
                        Fold numbers to run, default is '0,1,2'
  -a, --allfolds        Train and test file with all data folds,
                        <prefix>.types
  -i ITERATIONS, --iterations ITERATIONS
                        Number of iterations to run,default 10,000
  -s SEED, --seed SEED  Random seed, default 42
  -t TEST_INTERVAL, --test_interval TEST_INTERVAL
                        How frequently to test (iterations), default 40
  -o OUTPREFIX, --outprefix OUTPREFIX
                        Prefix for output files, default <model>.<pid>
  -g GPU, --gpu GPU     Specify GPU to run on
  -c CONT, --cont CONT  Continue a previous simulation from the provided
                        iteration (snapshot must exist)
  -k, --keep            Don't delete prototxt files
  -r, --reduced         Use a reduced file for model evaluation if exists(<pre
                        fix>[_reducedtrain|_reducedtest][num].types)
  --avg_rotations       Use the average of the testfile's 24 rotations in its
                        evaluation results
  --keep_best           Store snapshots everytime test AUC improves
  --dynamic             Attempt to adjust the base_lr in response to training
                        progress
  --solver SOLVER       Solver type. Default is SGD
  --lr_policy LR_POLICY
                        Learning policy to use. Default is inv.
  --step_reduce STEP_REDUCE
                        Reduce the learning rate by this factor with dynamic
                        stepping, default 0.5
  --step_end STEP_END   Terminate training if learning rate gets below this
                        amount
  --step_when STEP_WHEN
                        Perform a dynamic step (reduce base_lr) when training
                        has not improved after this many test iterations,
                        default 10
  --base_lr BASE_LR     Initial learning rate, default 0.01
  --momentum MOMENTUM   Momentum parameters, default 0.9
  --weight_decay WEIGHT_DECAY
                        Weight decay, default 0.001
  --gamma GAMMA         Gamma, default 0.001
  --power POWER         Power, default 1
  --weights WEIGHTS     Set of weights to initialize the model with
  -p2 PREFIX2, --prefix2 PREFIX2
                        Second prefix for training/test files for combined
                        training: <prefix>[train|test][num].types
  -d2 DATA_ROOT2, --data_root2 DATA_ROOT2
                        Root folder for relative paths in second train/test
                        files for combined training
  --data_ratio DATA_RATIO
                        Ratio to combine training data from 2 sources
  --test_only           Don't train, just evaluate test nets once

MODEL is a caffe model file and is required. It should have a MolGridDataLayer for each phase, TRAIN and TEST. The source parameter of these layers should be placeholder values "TRAINFILE" and "TESTFILE" respectively.

PREFIX is the prefix of pre-specified training and test files. For example, if the prefix is "all" then there should be files "alltrainX.types" and "alltestX.types" for each X in FOLDNUMS. FOLDNUMS is a comma-separated list of ints, for example 0,1,2. With the --allfolds flag set, a model is also trained and tested on a single file that hasn't been split into train/test folds, for example "all.types" in the previous example.

The model trained on "alltrain0.types" will be tested on "alltest0.types". Each model is trained for up to ITERATIONS iterations and tested each TEST_INTERVAL iterations.

The train/test files are of the form 1 set1/159/rec.gninatypes set1/159/docked_0.gninatypes where the first value is the label, the second the receptor, and the third the ligand. Additional whitespace delimited fields are ignored. gninatypes files are created using gninatyper. The receptor and label paths in these files can be absolute, or they can be relative to a path provided as the DATA_ROOT argument. To use this option, the root_folder parameter in each MolGridDataLayer of the model file should be the placeholder value "DATA_ROOT". This can also be hard-coded into the model.

The prefix of all generated output files will be OUTPREFIX. If not specified, it will be the model name and the process id. While running, OUTPREFIX.X.out files are generated. A line is written at every TEST_INTERVAL and consists of the test auc, the train auc, the loss, and the learning rate at that iteration.

The entire train and test sets are evaluated every TEST_INTERVAL. If they are very large this may be undesirable. Alternatively, if -r is passed, pre-specified reduced train/test sets can be used for monitoring.

Once each fold is complete OUTPREFIX.X.finaltest is written with the final predicted values.

After the completion of all folds, OUTPREFIX.test and OUTPREFIX.train are written which contain the average AUC and and individual AUCs for each fold at eash test iteration. Also, a total finaltest file of all the predictions. Graphs of the training behavior (OUTPREFIX_train.pdf) and final ROC (OUTPREFIX_roc.pdf) are also created as well as caffe files.

The GPU to use can be specified with -g.

Previous training runs can be continued with -c. The same prefix etc. should be used.

A large number of training hyperparameters are available as options. The defaults should be pretty reasonable.

wingring47 / scripts Goto Github PK

scripts's Introduction

Scripts for training and validating neural network models for gnina

Training

scripts's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent