This project is a fork of molecularai/qsartuna.

License: Apache License 2.0

QPTUNA: QSAR using Optimization for Hyper-parameter Tuning

Build predictive models for CompChem with hyper-parameters optimized by Optuna.

Background

This library searches for the best ML algorithm and molecular descriptor for the given data.

The search itself is done using Optuna.
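For readers who have not used Optuna before, here is a minimal stand-alone illustration of plain Optuna (not Qptuna code): Optuna repeatedly calls an objective function with sampled parameters and keeps track of the best trial.

import optuna

def objective(trial):
    x = trial.suggest_float("x", -10, 10)
    return -(x - 2) ** 2  # maximized at x = 2

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)  # should be close to {'x': 2}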

The three-step process

QPTUNA is structured around three steps:

  1. Hyper-parameter Optimization: Train many models with different parameters using Optuna. Only the training dataset is used here. Training is usually done with cross-validation.

  2. Build (Training): Pick the best model from Optimization, and optionally evaluate its performance on the test dataset.

  3. "Prod-build:" Re-train the best-performing model on the merged training and test datasets. This step has a drawback that there is no data left to evaluate the resulting model, but it has a big benefit that this final model is trained on the all available data.

JSON-based Command-line interface

Let's look at a trivial example of modelling molecular weight using a training set of 50 molecules.

Configuration file

We start with a configuration file in JSON format. It contains four main sections:

  • data - location of the data file, columns to use.
  • settings - details about the optimization run.
  • descriptors - which molecular descriptors to use.
  • algorithms - which ML algorithms to use.

Below is an example of such a file (it is also available in the source code repository):

{
  "task": "optimization",
  "data": {
    "training_dataset_file": "tests/data/DRD2/subset-50/train.csv",
    "input_column": "canonical",
    "response_column": "molwt"
  },
  "settings": {
    "mode": "regression",
    "cross_validation": 5,
    "direction": "maximize",
    "n_trials": 100,
    "n_startup_trials": 30
  },
  "descriptors": [
    {
      "name": "ECFP",
      "parameters": {
        "radius": 3,
        "nBits": 2048
      }
    },
    {
      "name": "MACCS_keys",
      "parameters": {}
    }
  ],
  "algorithms": [
    {
      "name": "RandomForestRegressor",
      "parameters": {
        "max_depth": {"low": 2, "high": 32},
        "n_estimators": {"low": 10, "high": 250},
        "max_features": ["auto"]
      }
    },
    {
      "name": "Ridge",
      "parameters": {
        "alpha": {"low": 0, "high": 2}
      }
    },
    {
      "name": "Lasso",
      "parameters": {
        "alpha": {"low": 0, "high": 2}
      }
    },
    {
      "name": "XGBRegressor",
      "parameters": {
        "max_depth": {"low": 2, "high": 32},
        "n_estimators": {"low": 3, "high": 100},
        "learning_rate": {"low": 0.1, "high": 0.1}
      }
    }
  ]
}

The data section specifies the location of the dataset file. In this example it is a relative path to the tests/data folder in the source code repository.
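The training file itself is a plain CSV with, at minimum, the input (SMILES) column and the response column named in the configuration. A hypothetical two-row illustration of the expected shape:

canonical,molwt
CCO,46.07
c1ccccc1,78.11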

The settings section specifies that:

  • we are building a regression model,
  • we want to use 5-fold cross-validation,
  • we want to maximize the value of the objective function (maximization is the standard for scikit-learn models),
  • we want to have a total of 100 trials,
  • and the first 30 trials ("startup trials") should be random exploration (to not get stuck early on in one local minimum).

We specify two descriptors and four algorithms, and the optimization is free to pair any specified descriptor with any of the algorithms, as sketched below.
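Under the hood, this corresponds roughly to Optuna sampling one categorical choice for the descriptor, one for the algorithm, and then that algorithm's hyper-parameters. A simplified sketch of the idea in plain Optuna (not Qptuna's actual internals; the descriptor computation and cross-validation scoring are elided):

import optuna

def objective(trial):
    descriptor = trial.suggest_categorical("descriptor", ["ECFP", "MACCS_keys"])
    algorithm = trial.suggest_categorical(
        "algorithm", ["RandomForestRegressor", "Ridge", "Lasso", "XGBRegressor"]
    )
    if algorithm == "RandomForestRegressor":
        max_depth = trial.suggest_int("max_depth", 2, 32)
        n_estimators = trial.suggest_int("n_estimators", 10, 250)
    # ... compute the descriptor, run cross-validation, return the mean CV score ...
    return 0.0  # placeholder score

# The first 30 trials are random exploration, matching n_startup_trials above.
sampler = optuna.samplers.TPESampler(n_startup_trials=30)
study = optuna.create_study(direction="maximize", sampler=sampler)
study.optimize(objective, n_trials=100)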

When we have our data and our configuration, it is time to start the optimization.

Running from the command line

QPTUNA can be deployed on a cluster using a Singularity container.

The commands below assume the use of a Singularity container. If QPTUNA is installed locally, without containerization, just skip the container part.

To run commands inside the container, Singularity uses the following syntax:

singularity exec <container.sif> <command>

We can create an environment variable to hold the path to our container. Using Bash syntax:

export QPTUNA_CONTAINER=/path/to/qptuna.sif

We can run the three-step process from the command line with the following command:

singularity exec ${QPTUNA_CONTAINER} \
  /opt/qptuna/.venv/bin/qptuna-optimize \
  --config examples/optimization/regression_drd2_50.json \
  --best-buildconfig-outpath ~/qptuna-target/best.json \
  --best-model-outpath ~/qptuna-target/best.pkl \
  --merged-model-outpath ~/qptuna-target/merged.pkl

Since optimization can be a long process, we should avoid running it on the login node and submit it to the SLURM queue instead.

Submitting to SLURM

We can submit the job to the queue by giving sbatch the following script:

#!/bin/sh
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=5
#SBATCH --mem-per-cpu=4G
#SBATCH --time=100:0:0

# This script illustrates how to run one configuration from Qptuna examples.
# The example we use is in examples/optimization/regression_drd2_50.json.

# The example uses relative paths to the data files, so change directory first.
cd /path/to/qptuna/examples/and/data

singularity exec ${QPTUNA_CONTAINER} \
  /opt/qptuna/.venv/bin/qptuna-optimize \
  --config examples/optimization/regression_drd2_50.json \
  --best-buildconfig-outpath ~/qptuna-target/best.json \
  --best-model-outpath ~/qptuna-target/best.pkl \
  --merged-model-outpath ~/qptuna-target/merged.pkl

When the script is complete, it will create pickled model files inside your home directory under ~/qptuna-target/.

Using the model

When the model is built, run inference:

singularity exec ${QPTUNA_CONTAINER} \
  /opt/qptuna/.venv/bin/qptuna-predict \
  --model-file ~/qptuna-target/merged.pkl \
  --input-smiles-csv-file tests/data/DRD2/subset-50/test.csv \
  --input-smiles-csv-column "canonical" \
  --output-prediction-csv-file ~/qptuna-target/prediction.csv
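The pickled model can also be used directly from Python. A minimal sketch, assuming the pickled object exposes a predict_from_smiles method as in the upstream QSARtuna:

import os
import pickle

model_path = os.path.expanduser("~/qptuna-target/merged.pkl")
with open(model_path, "rb") as f:
    model = pickle.load(f)

# Assumed interface: predict directly from a list of SMILES strings.
print(model.predict_from_smiles(["CCO", "c1ccccc1"]))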

Run from Python/Jupyter Notebook

Create a conda environment with Jupyter and install Qptuna there:

poetry build
python -m pip install dist/qptuna-2.4.2.tar.gz

Then you can use Qptuna inside your Notebook:

from qptuna.three_step_opt_build_merge import (
    optimize,
    buildconfig_best,
    build_best,
    build_merged,
)
from qptuna.config import ModelMode, OptimizationDirection
from qptuna.config.optconfig import (
    OptimizationConfig,
    SVR,
    RandomForest,
    Ridge,
    Lasso,
    PLS,
    XGBregressor,
)
from qptuna.datareader import Dataset
from qptuna.descriptors import ECFP, MACCS_keys, ECFP_counts

##
# Prepare hyperparameter optimization configuration.
config = OptimizationConfig(
    data=Dataset(
        input_column="canonical",
        response_column="molwt",
        training_dataset_file="tests/data/DRD2/subset-50/train.csv",
    ),
    descriptors=[ECFP.new(), ECFP_counts.new(), MACCS_keys.new()],
    algorithms=[
        SVR.new(),
        RandomForest.new(),
        Ridge.new(),
        Lasso.new(),
        PLS.new(),
        XGBregressor.new(),
    ],
    settings=OptimizationConfig.Settings(
        mode=ModelMode.REGRESSION,
        cross_validation=3,
        n_trials=100,
        direction=OptimizationDirection.MAXIMIZATION,
    ),
)

##
# Run Optuna Study.
study = optimize(config, study_name="my_study")

##
# Get the best Trial from the Study and make a Build (Training) configuration for it.
buildconfig = buildconfig_best(study)
# Optional: write out JSON of the best configuration.
import json
print(json.dumps(buildconfig.json(), indent=2))

##
# Build (re-Train) and save the best model.
build_best(buildconfig, "target/best.pkl")

##
# Build (Train) and save the model on the merged train+test data.
build_merged(buildconfig, "target/merged.pkl")
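Since optimize() returns an Optuna study, the standard Optuna introspection API should work on it. A short sketch, assuming the returned object is a regular optuna.study.Study:

# Best objective value and the winning (descriptor, algorithm, hyper-parameter) choice.
print(study.best_value)
print(study.best_params)

# All trials as a pandas DataFrame, best first.
df = study.trials_dataframe()
print(df.sort_values("value", ascending=False).head())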

Generating plain Scikit-learn models (models compatible with current Reinvent)

The script optbuild.py has an option --model-persistence-mode that you can set to plain_sklearn to get models that have no dependency on OptunaAZ and that are compatible with the current Reinvent GUI.

singularity exec ${QPTUNA_CONTAINER} \
  /opt/qptuna/.venv/bin/qptuna-optimize \
  --config examples/optimization/regression_drd2_50.json \
  --best-model-outpath target/best-plain.pkl \
  --best-buildconfig-outpath target/best-plain.json \
  --model-persistence-mode plain_sklearn 

In Python, you can pass persist_as=ModelPersistenceMode.PLAIN_SKLEARN to build_best() and build_merged().
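For example, continuing the notebook session above (the import location of ModelPersistenceMode is an assumption here; adjust it to match your installed version):

from qptuna.config import ModelPersistenceMode  # assumed import path

build_best(buildconfig, "target/best-plain.pkl",
           persist_as=ModelPersistenceMode.PLAIN_SKLEARN)
build_merged(buildconfig, "target/merged-plain.pkl",
             persist_as=ModelPersistenceMode.PLAIN_SKLEARN)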
