This project forked from volkamerlab/maxsmi

maxsmi: a guide to SMILES augmentation. Find the optimal SMILES augmentation for accurate molecular prediction.

Home Page: https://maxsmi.readthedocs.io/en/latest/

License: MIT License

Maxsmi: data augmentation for molecular property prediction using deep learning

Table of contents

  • Project description
  • Installation using conda
    • Prerequisites
    • How to install
  • How to use maxsmi
    • Examples
      • How to train and evaluate a model using augmentation
      • How to make predictions
  • Documentation
  • Repository structure and important files

Project description

SMILES augmentation for deep-learning-based molecular property and activity prediction.

Accurate prediction of molecular properties or activities is one of the main goals of computer-aided drug design, in which deep learning has become an important part. Since neural networks are data hungry and both physico-chemical and bioactivity data sets remain scarce, augmentation techniques have become a powerful aid for accurate predictions.

This repository provides the code basis to exploit data augmentation using the fact that one compound can be represented by various SMILES (simplified molecular-input line-entry system) strings.

Augmentation strategies

  • No augmentation
  • Augmentation with duplication
  • Augmentation without duplication
  • Augmentation with reduced duplication
  • Augmentation with estimated maximum
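The difference between the duplication-based strategies can be sketched in a few lines of Python. This is a minimal illustration of how the strategy names read, not the code in augmentation_strategies.py: the pool of ethanol SMILES is written by hand, the function names are hypothetical, and random draws from a fixed pool stand in for a real random SMILES generator. The reduced-duplication and estimated-maximum strategies are not covered here.

```python
import random

# Hand-written, equivalent SMILES strings for ethanol (illustration only).
ETHANOL_SMILES = ["CCO", "OCC", "C(O)C"]


def augment_with_duplication(pool, n, rng):
    """Draw n SMILES at random; the same string may appear several times."""
    return [rng.choice(pool) for _ in range(n)]


def augment_without_duplication(pool, n, rng):
    """Draw n SMILES at random, then keep each distinct string only once."""
    draws = augment_with_duplication(pool, n, rng)
    return list(dict.fromkeys(draws))  # de-duplicate, keeping first-seen order


# Use the same seed twice so both strategies see the same random draws.
with_dup = augment_with_duplication(ETHANOL_SMILES, 5, random.Random(0))
without_dup = augment_without_duplication(ETHANOL_SMILES, 5, random.Random(0))

print(with_dup)     # always 5 entries, repeats allowed
print(without_dup)  # at most 3 entries, no repeats
```

With duplication, the augmented set always has the requested size; without duplication, its size is bounded by the number of distinct SMILES that were drawn.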

Data sets

  • Physico-chemical data from MoleculeNet, available as part of DeepChem
    • ESOL
    • FreeSolv
    • lipophilicity
  • Bioactivity data on the EGFR kinase, retrieved from Kinodata

Deep learning models

  • 1D convolutional neural network (CONV1D)
  • 2D convolutional neural network (CONV2D)
  • Recurrent neural network (RNN)

The results of our study show that data augmentation improves the accuracy independently of the deep learning model and the size of the data. The best strategy leads to the Maxsmi models, which are available here for predictions on novel compounds on the provided data sets.

Installation using conda

Prerequisites

Anaconda and Git should be installed. See Anaconda's website and Git's website for download.

How to install

  1. Clone the GitHub repository:
git clone https://github.com/volkamerlab/maxsmi.git
  2. Change directory:
cd maxsmi
  3. Create the conda environment:
conda env create -n maxsmi -f devtools/conda-envs/test_env.yaml
  4. Activate the environment:
conda activate maxsmi
  5. Install the maxsmi package:
pip install -e .

How to use maxsmi

Examples

How to train and evaluate a model using augmentation

To get an overview of all available options:

python maxsmi/full_workflow.py --help

To train a model with the ESOL data set, augmenting the training set 5 times and the test set 2 times, training for 5 epochs:

python maxsmi/full_workflow.py --task="ESOL" --aug-strategy-train="augmentation_without_duplication" --aug-nb-train=5 --aug-nb-test=2 --nb-epochs 5
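When several augmentation numbers are to be compared, the command above can be assembled programmatically. The sketch below only builds and prints the command lines; it uses the flags shown in this README, while the helper name build_training_command and the swept values 1, 5 and 10 are assumptions made for illustration.

```python
import shlex


def build_training_command(task, strategy, nb_train, nb_test, nb_epochs):
    """Return the full_workflow.py call as an argument list (hypothetical helper)."""
    return [
        "python",
        "maxsmi/full_workflow.py",
        f"--task={task}",
        f"--aug-strategy-train={strategy}",
        f"--aug-nb-train={nb_train}",
        f"--aug-nb-test={nb_test}",
        f"--nb-epochs={nb_epochs}",
    ]


# Sweep the training augmentation number; the values are illustrative.
commands = [
    build_training_command("ESOL", "augmentation_without_duplication", n, 2, 5)
    for n in (1, 5, 10)
]

for command in commands:
    # shlex.join renders the list as a single shell-safe command line.
    print(shlex.join(command))
```

Each printed line can be run from the repository root with the maxsmi environment activated.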

To evaluate without ensemble learning, add the --eval-strategy=False flag as below:

Note: with ensemble learning, one prediction is computed per compound; without ensemble learning, one prediction is computed per SMILES.

python maxsmi/full_workflow.py --task="ESOL" --aug-strategy-train="augmentation_without_duplication" --aug-nb-train=5 --aug-nb-test=2 --nb-epochs 5 --eval-strategy=False
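The distinction between per-SMILES and per-compound predictions can be pictured as follows. This is a sketch under stated assumptions: the prediction values are made up, the function name per_compound is hypothetical, and the per-compound value is taken to be the mean of the per-SMILES values, which this README does not specify.

```python
from collections import defaultdict
from statistics import mean

# Made-up per-SMILES predictions: (compound name, SMILES, predicted value).
per_smiles = [
    ("ethanol", "CCO", -0.5),
    ("ethanol", "OCC", -0.75),
    ("propane", "CCC", -1.5),
    ("propane", "C(C)C", -2.0),
]


def per_compound(predictions):
    """Group per-SMILES predictions by compound and average them (assumed aggregation)."""
    grouped = defaultdict(list)
    for compound, _smiles, value in predictions:
        grouped[compound].append(value)
    return {compound: mean(values) for compound, values in grouped.items()}


ensemble = per_compound(per_smiles)
print(ensemble)
```

Without ensemble learning, the evaluation would work on the four rows of per_smiles directly; with it, on the two aggregated values.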

To train a model with every argument set explicitly:

Note: This command trains for 250 epochs, which is also the default. Please allow time for the model to train.

python maxsmi/full_workflow.py --task="FreeSolv" --string-encoding="smiles" --aug-strategy-train="augmentation_with_duplication" --aug-strategy-test="augmentation_with_reduced_duplication" --aug-nb-train=5 --aug-nb-test=2 --ml-model="CONV1D" --eval-strategy=True --nb-epochs=250

To train a model with early stopping (this command could take time to be executed):

python maxsmi/full_workflow_earlystopping.py --aug-nb-train=3 --aug-nb-test=2

How to make predictions

These predictions use the precalculated Maxsmi models (best performing models in the study).

To predict the affinity of a compound against the EGFR kinase, e.g. given by the SMILES CC1CC1, run:

python maxsmi/prediction_unlabeled_data.py --task="affinity" --smiles_prediction="CC1CC1"

To predict the lipophilicity of the drug semaxanib, run:

python maxsmi/prediction_unlabeled_data.py --task="lipophilicity" --smiles_prediction="O=C2C(\c1ccccc1N2)=C/c3c(cc([nH]3)C)C"
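SMILES strings often contain characters such as backslashes, parentheses and brackets that a shell may treat specially. One way to sidestep quoting issues is to pass the string as a single element of an argument list from Python. The sketch below only builds the command; the helper name build_prediction_command is hypothetical, and the flags are those shown above.

```python
def build_prediction_command(task, smiles):
    """Return the prediction call as an argument list (hypothetical helper)."""
    return [
        "python",
        "maxsmi/prediction_unlabeled_data.py",
        f"--task={task}",
        f"--smiles_prediction={smiles}",
    ]


# A raw string keeps the backslash in the SMILES exactly as written.
SEMAXANIB = r"O=C2C(\c1ccccc1N2)=C/c3c(cc([nH]3)C)C"

command = build_prediction_command("lipophilicity", SEMAXANIB)
print(command)

# To execute it (assumption: run from the repository root with the
# maxsmi environment active), pass the list to subprocess without a shell:
#   import subprocess
#   subprocess.run(command, check=True)
```

Because the list is handed over without going through a shell, the SMILES reaches the script unchanged.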

Documentation

The maxsmi package documentation is available at https://maxsmi.readthedocs.io/en/latest/.

Repository structure and important files

|-- LICENSE
|-- README.md
|-- devtools
|-- docs
|-- maxsmi
|   |-- augmentation_strategies.py      <- SMILES augmentation strategies
|   |-- full_workflow.py                <- Training and evaluation of deep learning models
|   |-- full_workflow_earlystopping.py  <- Training using early stopping
|   |-- output_                         <- Saved outputs for results analysis
|   |-- prediction_models               <- Weights for Maxsmi models
|   |-- prediction_unlabeled_data.py    <- Maxsmi models available for user prediction
|   |-- results_analysis                <- Notebooks for results analysis
|   |-- tests

Copyright

Copyright (c) 2020, Talia B. Kimber at VolkamerLab.

Acknowledgements

Project based on the Computational Molecular Science Python Cookiecutter version 1.4.
