
conv_qsar_fast's Introduction

conv_qsar_fast

QSAR/QSPR using descriptor-free molecular embedding

Requirements

This code relies on Keras as the machine learning framework, with Theano as its computational back-end, and on RDKit for parsing molecules from SMILES strings. Plotting is done with matplotlib. All other required packages should be installed as dependencies of Keras, Theano, or RDKit.

Basic use

This code implements the tensor-based convolutional embedding strategy described in placeholder for QSAR/QSPR tasks. The model architecture, training schedule, and data source are defined in a configuration file, and models are trained using cross-validation (CV). The basic architecture is as follows (an illustrative Keras-style sketch appears after the list):

  • Pre-processing to convert a SMILES string into an attributed graph, then into an attributed adjacency tensor
  • Convolutional embedding layer, which takes a molecular tensor and produces a learned feature vector
  • Optional dropout layer
  • Optional hidden densely-connected neural network layer
  • Optional dropout layer
  • Optional second hidden densely-connected neural network layer
  • Linear dense output layer
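For orientation, here is a minimal Keras-style sketch of this layer stack. It is illustrative only: the first Dense layer stands in for the repository's custom convolutional embedding layer (which operates on the molecular tensor), and all sizes, activations, and the optimizer are assumed values rather than the repository's defaults.

from keras.models import Sequential
from keras.layers import Dense, Dropout

embedding_size = 512  # length of the learned feature vector (assumed)
hidden_size = 50      # hidden layer width (assumed)

model = Sequential()
# Stand-in for the convolutional embedding layer; in conv_qsar_fast this is a custom
# layer that maps an attributed adjacency tensor to a learned feature vector.
model.add(Dense(embedding_size, activation='tanh', input_dim=1024))
model.add(Dropout(0.2))                            # optional dropout
model.add(Dense(hidden_size, activation='tanh'))   # optional hidden layer
model.add(Dropout(0.2))                            # optional dropout
model.add(Dense(hidden_size, activation='tanh'))   # optional second hidden layer
model.add(Dense(1, activation='linear'))           # linear dense output layer
model.compile(loss='mse', optimizer='adam')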

Models are built, trained, and tested with the command

python conv_qsar_fast/main/main_cv.py conv_qsar_fast/inputs/<input_file>.cfg

Numerous example input files, corresponding to the models described in placeholder, are included in inputs. These include models to be trained on full datasets, 5-fold CVs with internal validation and early stopping, 5-fold CVs without internal validation, models initialized with weights from other trained models, and multi-task models predicting on multiple data sets. Note that for multi-task models, output_size must be increased and a custom loss function is required so that NaN values are filtered out when not all inputs x have the full set of outputs y.
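As an illustration of the kind of custom loss this requires (a sketch, not the repository's implementation), the function below uses the Keras backend to mask out NaN targets so that missing outputs contribute nothing to the loss:

import keras.backend as K

def masked_mse(y_true, y_pred):
    # NaN != NaN, so this comparison is False exactly where a target is missing
    mask = K.cast(K.equal(y_true, y_true), K.floatx())
    # replace missing targets with the prediction so their squared error is zero
    y_true_filled = K.switch(K.equal(y_true, y_true), y_true, y_pred)
    sq_err = K.square(y_pred - y_true_filled) * mask
    return K.sum(sq_err) / K.maximum(K.sum(mask), 1.0)

Such a function would then be passed to model.compile(loss=masked_mse, ...) in place of a built-in loss.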

Data sets

Four data sets are available in this version of the code, contained in data:

  1. Abraham octanol solubility data, from Abraham and Admire's 2014 paper.
  2. Delaney aqueous solubility data, from Delaney's 2004 paper.
  3. Bradley double plus good melting point data, from Bradley's open science notebook initiative.
  4. Tox21 data from the Tox21 Data Challenge 2014, describing toxicity against 12 targets.

Because certain entries could not be unambiguously resolved into chemical structures, and because duplicates were found in the data sets, the effective data sets after processing are exported by scripts/save_data.py as coley_abraham.tdf, coley_delaney.tdf, coley_bradley.tdf, coley_tox21.tdf, and coley_tox21-test.tdf.
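To illustrate the two filtering steps mentioned above (a hypothetical check, not the repository's script): RDKit returns None for SMILES it cannot parse, and comparing canonical SMILES reveals duplicate structures.

from rdkit import Chem

raw_smiles = ['CCO', 'OCC', 'C1=CC=CC=C1', 'not_a_smiles']  # hypothetical entries
canonical = set()
for s in raw_smiles:
    mol = Chem.MolFromSmiles(s)
    if mol is None:
        print('could not resolve:', s)    # unparseable entry, dropped from the data set
        continue
    canonical.add(Chem.MolToSmiles(mol))  # canonical form; 'CCO' and 'OCC' collapse to one entry
print(len(canonical), 'unique structures')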

Model interpretation

This version of the code includes a general method for interpreting non-linear models: individual atom and bond attributes are set to their average value in the molecular tensor representation. The extent to which this hurts performance indicates how strongly the trained model depends on that atom/bond feature. As long as the configuration file defines a model that loads previously-trained weights, the testing routine is run with

python conv_qsar_fast/main/test_index_removal.py conv_qsar_fast/inputs/<input_file>.cfg

It is assumed that the trained model used molecular_attributes, as the indices for removal are hard-coded into this script.
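A minimal sketch of the averaging idea (illustrative only; the actual tensor layout and removal indices are defined in the script): replace one attribute slice of every molecular tensor with its dataset-wide mean, then re-evaluate the trained model on the modified inputs.

import numpy as np

def average_out_attribute(tensors, attribute_index):
    """tensors: list of (n_atoms, n_atoms, n_attributes) molecular tensors (assumed layout)."""
    # dataset-wide mean of this attribute across all molecules
    mean_value = np.mean([t[:, :, attribute_index].mean() for t in tensors])
    blunted = []
    for t in tensors:
        t = t.copy()
        t[:, :, attribute_index] = mean_value  # erase the information carried by this attribute
        blunted.append(t)
    return blunted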

Suggestions for modification

Overall architecture

The range of possible architectures (beyond what is enabled with the current configuration file style) can be extended by modifying build_model in main/core.py. See the Keras documentation for ideas.
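For example (illustrative only; the actual function signature and configuration keys in main/core.py may differ), an extra hidden layer and a different optimizer could be added before recompiling:

from keras.layers import Dense
from keras.optimizers import RMSprop

def extend_model(model):
    # append another hidden layer and recompile with a different optimizer
    model.add(Dense(32, activation='relu'))
    model.compile(loss='mse', optimizer=RMSprop())
    return model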

Data sets

Additional .csv data sets can be incorporated by adding an additional elif statement to main/data.py. As long as one column corresponds to SMILES strings and another to the property target, the existing code can be used with minimal modification.
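A sketch of the kind of parsing such a branch might do (the column names 'smiles' and 'target' and the file path are assumptions, not the repository's conventions):

import csv
from rdkit import Chem

smiles_list, targets = [], []
with open('data/my_new_dataset.csv') as f:
    for row in csv.DictReader(f):
        mol = Chem.MolFromSmiles(row['smiles'])
        if mol is None:
            continue                          # skip entries that cannot be parsed
        smiles_list.append(row['smiles'])
        targets.append(float(row['target']))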

Atom-level or bond-level attributes

Additional atom- or bond-level attributes can be included by modifying utils/neural_fp.py, specifically the bondAttributes and atomAttributes functions. Because molecules are already stored as RDKit molecule objects, any property calculable in RDKit can easily be added.
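For example (an illustrative feature set, not the functions' current contents), per-atom properties exposed by RDKit can be returned directly:

from rdkit import Chem

def extra_atom_attributes(atom):
    """Additional per-atom features computed from an RDKit Atom object."""
    return [
        atom.GetTotalNumHs(),        # hydrogen count
        int(atom.GetIsAromatic()),   # aromaticity flag
        atom.GetFormalCharge(),      # formal charge
    ]

mol = Chem.MolFromSmiles('c1ccccc1O')  # phenol, as a quick check
print([extra_atom_attributes(a) for a in mol.GetAtoms()])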

conv_qsar_fast's People

Contributors

connorcoley

conv_qsar_fast's Issues

Compatible with Py3?

This is a very interesting cheminformatics approach. I was trying to learn from and use the source code. I managed to fix a couple of errors reported by Python 3 and got the code running with Keras 2.1.6 and Theano 1.0.3.
But the example cases always stopped at the 10th epoch due to early stopping. Training did not appear to run correctly, and the test performance was very poor compared to what was reported in the paper. Any suggestions for making it compatible with Python 3? Thanks.

Examples not training well

I'm running Python 3.6 and getting some strange output when training the examples.
It seems like the library was written for Python 2.x. After wrapping a few range calls in list, things execute fine, except that the example input .cfg runs stop very early (~12-20 epochs for the Ab-oct example) and generate some pretty horrendous models:
[attached training plot: 06-30-2018_01-31 train]

Any thoughts on where to start troubleshooting?

Also, what versions of RDKit, Theano, and Python was this written with?

train_model for fingerprint data

Thanks, Connor, for publishing this project; it is a fascinating take on QSAR approaches.
I noticed that train_model in core.py assumes that all inputs are molecular tensors, so fingerprint-based models fail because their inputs are single arrays. For example, the command

python conv_qsar_fast/main/main_cv.py conv_qsar_fast/inputs/tox21_Morgan/tox21_ahr.cfg

fails with an error message along the lines of:

Error when checking model input: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 1 arrays but instead got the following list of 3 arrays: [array([[1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
        0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0...

svm_cv.py won't work with tanimoto kernel

Dear professor,

This is just a reference for me in the future and also for people who encounter the same problem as I did.

It seems the code won't work with the Tanimoto kernel. This is caused by the inputs being Python lists. Adding a conditional statement to convert the inputs from lists to ndarrays, for both training and testing, should work:

For training:

		if kwargs['kernel'] not in ['tanimoto']:
			model.fit([x[0] for x in data[0]['mols']], data[0]['y'])
		else:
			# convert the list-type inputs to numpy arrays before fitting
			train_x = np.array([x[0] for x in data[0]['mols']])
			train_y = np.array(data[0]['y'])
			model.fit(train_x, train_y)

For testing:

	if kwargs['kernel'] not in ['tanimoto']:		
		predicted_y = model.predict(test_x)
	else:
		test_x = np.array(test_x)
		predicted_y = model.predict(test_x)

Training failed

Hi, when I run the script "python conv_qsar_fast/main/main_cv.py conv_qsar_fast/inputs/De-aq.cfg", some minor errors show up. I fixed them and then it runs normally. However, the final results are odd: the training loss never decreases in any CV fold and the predicted values all stay the same. I ran it several times but got the same results. I don't know why.
