pulimeng / etoxpred

A simple tool to predict the general toxicity and calculate the synthetic accessibility (SA) score for small molecules.

License: GNU General Public License v3.0

Python 100.00%

toxicity-prediction sascore-prediction

etoxpred's Introduction

eToxPred

eToxPred is a tool to reliably estimate the toxicity and calculate the synthetic accessibility of small organic compounds. This is a newer implementation: the libraries used have been updated, and the Deep Belief Network (DBN) for SA score prediction has been replaced by the exact SA score calculation. For the older implementation, please refer to the folder /stale.

This README file is written by Limeng PU.

If you find this tool useful, please cite our paper: Pu, L., Naderi, M., Liu, T. et al. eToxPred: a machine learning-based approach to estimate the toxicity of drug candidates. BMC Pharmacol Toxicol 20, 2 (2019). https://doi.org/10.1186/s40360-018-0282-6

Prerequisites:

Python 3.7.*
Pandas 1.0 or higher
scikit-learn 0.23.*
rdkit 2020.03.1

Usage:

The software package contains 2 parts:

SAscore calculation
Toxicity prediction

To use the trained models for predictions:

Download and extract the package.
Run eToxPred with python etoxpred_predict.py --datafile tcm600_nr.smi --modelfile etoxpred_best_model.joblib --outputfile results.csv

--datafile specifies the input .smi file which stores the SMILES data.
--modelfile specifies the location of the trained model.
--outputfile specifies the output file to store the predicted SAscores and Tox-scores. If this option is omitted, the code saves the output to ./results.csv.
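
For readers working in a notebook or another Python program, the command above can be launched as a child process instead of being typed into a shell. The following is only a minimal sketch: the helper names build_predict_command and run_prediction are hypothetical and not part of eToxPred, and it assumes the working directory is the extracted package folder.

```python
import subprocess
import sys


def build_predict_command(datafile, modelfile, outputfile="results.csv",
                          script="etoxpred_predict.py"):
    """Return the argument list for the prediction command from the README.

    Hypothetical helper: only the three flags documented in the README are used.
    """
    return [
        sys.executable, script,
        "--datafile", datafile,
        "--modelfile", modelfile,
        "--outputfile", outputfile,
    ]


def run_prediction(datafile, modelfile, outputfile="results.csv"):
    """Run the script in a child process; raises CalledProcessError on failure."""
    subprocess.run(build_predict_command(datafile, modelfile, outputfile),
                   check=True)
    return outputfile


# Example (assumes the current directory is the extracted package):
# run_prediction("tcm600_nr.smi", "etoxpred_best_model.joblib")
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces. Typing the bare command into a notebook cell fails with a syntax error because it is a shell command, not Python code.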

The trained toxicity model is provided as etoxpred_best_model.tar.gz. Please untar it before use. For those who wonder, the best parameter setup is n_estimators 550, min_samples_split 16, min_samples_leaf 3, and max_features 10.
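
On systems without a tar command (for example a default Windows setup), the archive can be unpacked from Python. This is a small sketch using only the standard library; the function name extract_model is hypothetical, and the path check is an extra precaution rather than something the package requires.

```python
import tarfile
from pathlib import Path


def extract_model(archive="etoxpred_best_model.tar.gz", dest="."):
    """Extract a .tar.gz archive into dest and return the member names."""
    dest_path = Path(dest).resolve()
    with tarfile.open(archive, "r:gz") as tar:
        # Refuse members that would be written outside the destination folder.
        for member in tar.getmembers():
            target = (dest_path / member.name).resolve()
            if target != dest_path and dest_path not in target.parents:
                raise ValueError("unsafe path in archive: " + member.name)
        names = tar.getnames()
        tar.extractall(dest_path)
    return names


# Example (assumes the archive sits in the current directory):
# extract_model("etoxpred_best_model.tar.gz")
```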

To use the package to train your own models:

Prepare the training dataset. The dataset contains three parts: the SMILES string, the name of the compound, and the label. The label is 0 or 1, where 0 means safe and 1 means toxic. The dataset has to be stored in a .smi file, where each field is separated by a tab, in the format: [SmilesString\tID\tLabel].
Train the Extra Trees (ET) classifier for toxicity prediction. The code provided performs a randomized parameter search. It returns the best result (depending on the chosen metric), the parameters (.json format), and the model (.joblib format). Run etoxpred_train.py with python etoxpred_train.py --datafile your_training_set.smi --paramfile params.json --outputfile best_model --iter 3 --scorer balanced_accuracy.

--datafile specifies the path to your training dataset in the aforementioned format.
--paramfile specifies the parameter file, which contains the parameters to search during training and their ranges/distributions. An example file is provided, namely param.json. More parameters can be added according to https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html.
--outputfile specifies the output file to store the best model. If this option is omitted, the code saves the output to best_model.joblib.
--iter is the number of iterations to run the randomized search.
--scorer is the metric used to evaluate the performance of each run. It defaults to balanced_accuracy. Recommended metrics include accuracy, balanced_accuracy, f1, and roc_auc.
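
To illustrate what a randomized parameter search over an Extra Trees classifier looks like, here is a minimal sketch built on scikit-learn's RandomizedSearchCV. It is not the code in etoxpred_train.py: the function name, the 3-fold cross-validation, and the EXAMPLE_PARAMS search space are all assumptions made for illustration, and the layout of the real param.json may differ.

```python
def random_search_extra_trees(X, y, param_distributions, n_iter=3,
                              scorer="balanced_accuracy", seed=0):
    """Randomized search over ExtraTreesClassifier hyperparameters.

    Returns (best_model, best_params, best_score). Hypothetical helper.
    """
    # Imported here so the sketch only needs scikit-learn when it is called.
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.model_selection import RandomizedSearchCV

    search = RandomizedSearchCV(
        ExtraTreesClassifier(random_state=seed),
        param_distributions=param_distributions,
        n_iter=n_iter,        # plays the role of --iter
        scoring=scorer,       # plays the role of --scorer
        cv=3,                 # assumed fold count, not taken from the README
        random_state=seed,
    )
    search.fit(X, y)
    return search.best_estimator_, search.best_params_, search.best_score_


# Hypothetical search space; the values are illustrative only.
EXAMPLE_PARAMS = {
    "n_estimators": [50, 100, 200],
    "min_samples_split": [2, 8, 16],
    "min_samples_leaf": [1, 3],
}
```

Here X would be the feature matrix derived from the SMILES strings and y the 0/1 labels. The returned model could then be stored with joblib.dump to obtain a .joblib file.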

Datasets:

An example test dataset that can be used for prediction (in the .smi format) is provided in tcm600_nr.smi. The ready-to-use dataset for ET training is provided in trainig_set.smi. A much larger dataset for training can be found at https://osf.io/m4ah5/. The general format is SmilesString\tID\tToxicity. The results of the testing set TCM6000_NR are also provided in tcm_results.csv.
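
The tab-separated format described above can be checked before training with a few lines of standard-library Python. This is a sketch only: the SmiRecord type and read_labeled_smi function are hypothetical names, and it assumes every line carries exactly the three fields SmilesString, ID, and a 0/1 label.

```python
from typing import List, NamedTuple


class SmiRecord(NamedTuple):
    smiles: str
    compound_id: str
    label: int  # 0 = safe, 1 = toxic


def read_labeled_smi(path) -> List[SmiRecord]:
    """Parse a labeled .smi file with lines of the form SMILES<TAB>ID<TAB>label."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.rstrip("\r\n")
            if not line:
                continue  # skip blank lines
            fields = line.split("\t")
            if len(fields) != 3:
                raise ValueError(f"line {line_number}: expected 3 tab-separated fields")
            smiles, compound_id, label = fields
            if label not in ("0", "1"):
                raise ValueError(f"line {line_number}: label must be 0 or 1")
            records.append(SmiRecord(smiles, compound_id, int(label)))
    return records
```

Failing early on a malformed line is usually easier to debug than an error raised later inside the training script.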

etoxpred's Issues

{Critical} Value Error: node array from pickle has incompatible dtype

Hello, Thanks for this wonderful repo!

When I git clone the repo and run it using python etoxpred_predict.py --datafile tcm600_nr.smi --modelfile etoxpred_best_model.joblib --outputfile results.csv (after extracting the tar archive), I get this error:

PS C:\Users\Sri Raghu\Desktop\swift-trials\ind-projects\eToxPred> python etoxpred_predict.py --datafile tcm600_nr.smi --modelfile etoxpred_best_model.joblib --outputfile results.csv
...loading models
C:\Users\Sri Raghu\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\base.py:376: InconsistentVersionWarning: Trying to unpickle estimator ExtraTreeClassifier from version 0.23.2 when using version 1.4.0. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
  warnings.warn(
Traceback (most recent call last):
  File "C:\Users\Sri Raghu\Desktop\swift-trials\ind-projects\eToxPred\etoxpred_predict.py", line 74, in <module>
    predict(opt)
  File "C:\Users\Sri Raghu\Desktop\swift-trials\ind-projects\eToxPred\etoxpred_predict.py", line 59, in predict
    clf = load(opt.modelfile)
          ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Sri Raghu\AppData\Local\Programs\Python\Python312\Lib\site-packages\joblib\numpy_pickle.py", line 658, in load
    obj = _unpickle(fobj, filename, mmap_mode)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Sri Raghu\AppData\Local\Programs\Python\Python312\Lib\site-packages\joblib\numpy_pickle.py", line 577, in _unpickle
    obj = unpickler.load()
          ^^^^^^^^^^^^^^^^
  File "C:\Users\Sri Raghu\AppData\Local\Programs\Python\Python312\Lib\pickle.py", line 1205, in load
    dispatch[key[0]](self)
  File "C:\Users\Sri Raghu\AppData\Local\Programs\Python\Python312\Lib\site-packages\joblib\numpy_pickle.py", line 402, in load_build
    Unpickler.load_build(self)
  File "C:\Users\Sri Raghu\AppData\Local\Programs\Python\Python312\Lib\pickle.py", line 1710, in load_build
    setstate(state)
  File "sklearn\\tree\\_tree.pyx", line 865, in sklearn.tree._tree.Tree.__setstate__
  File "sklearn\\tree\\_tree.pyx", line 1571, in sklearn.tree._tree._check_node_ndarray
ValueError: node array from the pickle has an incompatible dtype:
- expected: {'names': ['left_child', 'right_child', 'feature', 'threshold', 'impurity', 'n_node_samples', 'weighted_n_node_samples', 'missing_go_to_left'], 'formats': ['<i8', '<i8', '<i8', '<f8', '<f8', '<i8', '<f8', 'u1'], 'offsets': [0, 8, 16, 24, 32, 40, 48, 56], 'itemsize': 64}
- got     : [('left_child', '<i8'), ('right_child', '<i8'), ('feature', '<i8'), ('threshold', '<f8'), ('impurity', '<f8'), ('n_node_samples', '<i8'), ('weighted_n_node_samples', '<f8')]

I tried reducing scikit-learn to the base version -> didn't work
Installed MS Visual Studio tools and C++ tools -> didn't work

Kindly help me through this. Much appreciated.

Trained Model

I want to use the trained model to predict the toxicity of new molecules. How can I use your tool? Can you please explain? Should I use the joblib file or the py file? How do I give input files to the model for testing? The README is not very clear. Kindly elaborate.

About theano

Is it a must to use Theano? Theano is no longer supported, so people use TensorFlow instead, especially Keras users. I read your code; you used Theano for the graph. Do you have alternatives? Many thanks.

Installation Issues

I'm facing some issues with installing this software, more precisely clashes in the versions of the different libraries used (despite using conda, a well-known package management system).

Can we please get more details on the development environment in which the software was created?
Ideally, a requirements.txt or Dockerfile file.

Thanks.

Extracting necessary files to use prediction in Jupyter Notebook.

Hello, I am a little new to Python and was wondering if there is a more detailed description of how to extract the necessary files for this package to work.

I downloaded the entire package (.py, .smi, and .pkl files) but am struggling to get the predictor started.

What is the best way to extract the package in the Jupyter environment?
I have tried to copy and paste the necessary code into a single notebook, but get errors each time. Running python etoxpred_predict.py --datafile tcm600_nr.smi --modelfile etoxpred_best_model.joblib --outputfile results.csv gives me an invalid syntax error. Should I be changing anything in this line of code?

Datasets Availability

Can you provide datasets for specific toxicities? They are not available at the dataset link mentioned in the paper.

How to use the tool in a Python notebook/Google Colab?

How do I use your tool in Python to predict the toxicity of new compounds?

64 bit sklearn decision tree on 32 bit python

Hello,

I recently downloaded all dependencies (as you have listed) to all same versions other than openbabel 2.3.2 instead of 2.3.1. When I run the code, I get the following error...
Traceback (most recent call last):
  File "etoxpred.py", line 121, in <module>
    predicted_values,proba = predict(X,'SA_trained_model_cpu.pkl','Tox_trained_model.pkl') # if cuda is not installed, use the trained_model_cpu
  File "etoxpred.py", line 113, in predict
    xtree = joblib.load(tox_model)
  File "C:\Python27\lib\site-packages\sklearn\externals\joblib\numpy_pickle.py", line 575, in load
    obj = _unpickle(fobj, filename, mmap_mode)
  File "C:\Python27\lib\site-packages\sklearn\externals\joblib\numpy_pickle.py", line 507, in _unpickle
    obj = unpickler.load()
  File "C:\Python27\lib\pickle.py", line 864, in load
    dispatch[key](self)
  File "C:\Python27\lib\pickle.py", line 1139, in load_reduce
    value = func(*args)
  File "sklearn\tree\_tree.pyx", line 584, in sklearn.tree._tree.Tree.__cinit__ (sklearn\tree\_tree.c:7533)
ValueError: Buffer dtype mismatch, expected 'SIZE_t' but got 'long long'

The internet says this comes from mismatched bit versions. I could download a 64-bit Python, but openbabel only works on 32-bit Python. Let me know if you get any issues when trying to run it.

Documentation

Could you maybe provide documentation on how to include your library in a self-written program? That would help a lot. :-)

Where are these two files "sa_results.txt" and "tox_results.txt"

I am retraining your models and want to compare my retrained models with yours. You said "The results of our experiments in terms of SAscores and Tox-scores are also provided in sa_results.txt and tox_results.txt", but I couldn't find them.