Maxsmi: data augmentation for molecular property prediction using deep learning

Project description
Installation using conda
- Prerequisites
- How to install
How to use maxsmi
- Examples
  - How to train and evaluate a model using augmentation
  - How to make predictions
Documentation
Repository structure and important files

Project description

SMILES augmentation for deep learning based molecular property and activity prediction.

Accurate prediction of molecular properties and activities is one of the main goals of computer-aided drug design, where deep learning has become an important tool. Because neural networks are data-hungry and both physico-chemical and bioactivity data sets remain scarce, data augmentation has become a powerful aid to accurate prediction.

This repository provides the code base for exploiting data augmentation, using the fact that one compound can be represented by several different SMILES (simplified molecular-input line-entry system) strings.

Augmentation strategies

- No augmentation
- Augmentation with duplication
- Augmentation without duplication
- Augmentation with reduced duplication
- Augmentation with estimated maximum
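The difference between keeping and removing duplicates can be sketched in a few lines of Python. This is an illustration only, not the code in augmentation_strategies.py: the hand-written pool of ethanol SMILES stands in for a real random SMILES generator, and the helper names sample_smiles and drop_duplicates are hypothetical.

```python
import random

# Four valid SMILES strings that all describe ethanol. This hand-written pool
# is an assumption for illustration; it stands in for a random SMILES generator.
ETHANOL_SMILES = ["CCO", "OCC", "C(O)C", "C(C)O"]


def sample_smiles(pool, n, seed=0):
    """Draw n SMILES at random from the pool, with replacement."""
    rng = random.Random(seed)
    return [rng.choice(pool) for _ in range(n)]


def drop_duplicates(smiles_list):
    """Keep the first occurrence of each SMILES, preserving order."""
    return list(dict.fromkeys(smiles_list))


draws = sample_smiles(ETHANOL_SMILES, n=10)

# With duplication: every draw is kept, so repeated strings stay in the set.
with_duplication = draws

# Without duplication: repeated strings are removed, so at most 4 remain here.
without_duplication = drop_duplicates(draws)

print(len(with_duplication), len(without_duplication))
```

With a small molecule such as ethanol, only a few distinct SMILES exist, so the two strategies quickly diverge in size as the number of draws grows.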

Data sets

- Physico-chemical data from MoleculeNet, available as part of DeepChem
  - ESOL
  - FreeSolv
  - lipophilicity
- Bioactivity data on the EGFR kinase, retrieved from Kinodata

Deep learning models

- 1D convolutional neural network (CONV1D)
- 2D convolutional neural network (CONV2D)
- Recurrent neural network (RNN)

Our study shows that data augmentation improves prediction accuracy regardless of the deep learning model and the size of the data set. The best strategy leads to the Maxsmi models, which are available here for making predictions on novel compounds for the tasks covered by the provided data sets.

Installation using conda

Prerequisites

Anaconda and Git should be installed. See Anaconda's website and Git's website for download.

How to install

Clone the GitHub repository:

git clone https://github.com/volkamerlab/maxsmi.git

Change directory:

cd maxsmi

Create the conda environment:

conda env create -n maxsmi -f devtools/conda-envs/test_env.yaml

Activate the environment:

conda activate maxsmi

Install the maxsmi package:

pip install -e .

How to use maxsmi

Examples

How to train and evaluate a model using augmentation

To get an overview of all available options:

python maxsmi/full_workflow.py --help

To train a model with the ESOL data set, augmenting the training set 5 times and the test set 2 times, training for 5 epochs:

python maxsmi/full_workflow.py --task="ESOL" --aug-strategy-train="augmentation_without_duplication" --aug-nb-train=5 --aug-nb-test=2 --nb-epochs 5
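To compare several augmentation numbers, the ESOL command above can be generated in a loop. The sketch below only builds and prints the command lines, using the flags shown in this README; the helper name build_command and the chosen values 1, 5 and 10 are assumptions for illustration.

```python
import shlex


def build_command(task, strategy, aug_nb_train, aug_nb_test, nb_epochs):
    """Assemble the argument list for one run of maxsmi/full_workflow.py."""
    return [
        "python",
        "maxsmi/full_workflow.py",
        f"--task={task}",
        f"--aug-strategy-train={strategy}",
        f"--aug-nb-train={aug_nb_train}",
        f"--aug-nb-test={aug_nb_test}",
        f"--nb-epochs={nb_epochs}",
    ]


# One command per training augmentation number (values chosen arbitrarily).
commands = [
    build_command("ESOL", "augmentation_without_duplication", n, 2, 5)
    for n in (1, 5, 10)
]

for command in commands:
    # Print a copy-pastable shell line; pass `command` to subprocess.run
    # from the repository root to actually launch the training.
    print(shlex.join(command))
```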

To evaluate without ensemble learning, add the flag --eval-strategy=False as below:

Note: With ensemble learning, one prediction is computed per compound; without ensemble learning, one prediction is computed per SMILES.

python maxsmi/full_workflow.py --task="ESOL" --aug-strategy-train="augmentation_without_duplication" --aug-nb-train=5 --aug-nb-test=2 --nb-epochs 5 --eval-strategy=False
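The distinction between per SMILES and per compound predictions can be illustrated as follows. This is a minimal sketch, assuming the per compound value is obtained by averaging the predictions of all SMILES that belong to the same compound; the compound names, the prediction values and the helper per_compound_predictions are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per SMILES predictions: (compound, SMILES, predicted value).
per_smiles_predictions = [
    ("ethanol", "CCO", -0.5),
    ("ethanol", "OCC", -0.7),
    ("propane", "CCC", -1.9),
    ("propane", "C(C)C", -2.1),
]


def per_compound_predictions(rows):
    """Group predictions by compound and average them (assumed aggregation)."""
    grouped = defaultdict(list)
    for compound, _smiles, value in rows:
        grouped[compound].append(value)
    return {compound: mean(values) for compound, values in grouped.items()}


# Without ensemble learning there are four predictions, one per SMILES.
# With ensemble learning there are two predictions, one per compound.
ensemble = per_compound_predictions(per_smiles_predictions)
print(ensemble)
```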

To train a model with all chosen arguments:

Note: This command sets the number of epochs to 250, which is also the default. Please allow time for the model to train.

python maxsmi/full_workflow.py --task="FreeSolv" --string-encoding="smiles" --aug-strategy-train="augmentation_with_duplication" --aug-strategy-test="augmentation_with_reduced_duplication" --aug-nb-train=5 --aug-nb-test=2 --ml-model="CONV1D" --eval-strategy=True --nb-epochs=250

To train a model with early stopping (this command may take a while to run):

python maxsmi/full_workflow_earlystopping.py --aug-nb-train=3 --aug-nb-test=2

How to make predictions

These predictions use the precalculated Maxsmi models (best performing models in the study).

To predict the affinity of a compound for the EGFR kinase, e.g. the compound given by the SMILES CC1CC1, run:

python maxsmi/prediction_unlabeled_data.py --task="affinity" --smiles_prediction="CC1CC1"

To predict the lipophilicity of the drug semaxanib, run:

python maxsmi/prediction_unlabeled_data.py --task="lipophilicity" --smiles_prediction="O=C2C(\c1ccccc1N2)=C/c3c(cc([nH]3)C)C"

Documentation

The maxsmi package documentation is available here.

Repository structure and important files

|-- LICENSE
|-- README.md
|-- devtools
|-- docs
|-- maxsmi
|   |-- augmentation_strategies.py      <- SMILES augmentation strategies
|   |-- full_workflow.py                <- Training and evaluation of deep learning models
|   |-- full_workflow_earlystopping.py  <- Training using early stopping
|   |-- output_                         <- Saved outputs for results analysis
|   |-- prediction_models               <- Weights for Maxsmi models
|   |-- prediction_unlabeled_data.py    <- Maxsmi models available for user prediction
|   |-- results_analysis                <- Notebooks for results analysis
|   |-- tests

Copyright

Acknowledgements

Project based on the Computational Molecular Science Python Cookiecutter version 1.4.
