
gitter-lab / pria_lifechem


Virtual screening on PriA-SSB and RMI-FANCM with the LifeChem library

License: MIT License

Citation

If you use this software or the new high-throughput screening data, please cite:

Shengchao Liu+, Moayad Alnammi+, Spencer S. Ericksen, Andrew F. Voter, Gene E. Ananiev, James L. Keck, F. Michael Hoffmann, Scott A. Wildman, Anthony Gitter. Practical model selection for prospective virtual screening. Journal of Chemical Information and Modeling 2018.

+ denotes co-first authors.

Installation

We recommend creating a conda environment to manage the dependencies. First install Anaconda if it is not already installed. Then, clone this pria_lifechem repository:

git clone https://github.com/gitter-lab/pria_lifechem.git
cd pria_lifechem

Create and activate a conda environment named pria using the conda_env.yml file:

conda env create -f conda_env.yml
source activate pria

Finally, install pria_lifechem with pip.

pip install -e .

To use the package again later, run source activate pria to re-activate the conda environment. The package is currently supported only on Linux. The provided conda environment does not include a Theano GPU backend. To use Theano with a GPU, see the Theano guide.

The IRV models were trained using a customized fork of DeepChem. See the separate installation instructions in that repository.

Note: The random forest results in the paper were obtained using Python 3.4 and sklearn=0.18.1. The random forest code is still compatible with conda_env.yml, but the results may differ because the package versions differ.

dataset

The dataset subdirectory contains a description of the expected file format and an example dataset that has been split into five folds.

The complete high-throughput screening data are available on PubChem (AID:1272365 and AID:1159607). Pre-processed, merged versions of the data are available on Zenodo (doi:10.5281/zenodo.1411506). The Zenodo files are:

  • pria_rmi_cv.tar.gz: The LifeChem compounds used for cross validation with PriA-SSB and RMI-FANCM split into five folds.
  • pria_rmi_pcba_cv.tar.gz: These same compounds merged with 128 tasks from PubChem split into five folds.
  • pria_prospective.csv.gz: The separate LifeChem compounds used for prospective testing with PriA-SSB.
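
As a rough illustration of how five-fold data like this is commonly consumed, the sketch below enumerates cross-validation splits in which each fold serves once as the held-out test set. The file names are hypothetical placeholders rather than the names used inside the Zenodo archives, and the helper cross_validation_splits is not part of pria_lifechem.

```python
def cross_validation_splits(fold_files):
    """Yield (train_files, test_file) pairs, holding out one fold at a time."""
    for i, test_file in enumerate(fold_files):
        train_files = fold_files[:i] + fold_files[i + 1:]
        yield train_files, test_file


# Hypothetical fold file names; the real archives may use a different scheme.
fold_files = ["fold_{}.csv".format(i) for i in range(5)]

for train_files, test_file in cross_validation_splits(fold_files):
    print("test:", test_file, "| train:", ", ".join(train_files))
```

Each of the five iterations trains on four folds and evaluates on the remaining one, so every compound is used for testing exactly once.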

pria_lifechem

The pria_lifechem subdirectory contains:

  • scripts to prepare and load datasets
  • a script to evaluate trained models
  • a models subdirectory with code and instructions for training models
  • an analysis subdirectory to reproduce figures from the manuscript

json

The json subdirectory contains json config files with the model hyperparameters.
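
A config file of this kind can be read with the standard json module. The sketch below is only an assumption about how such a file might be loaded: the file name example_model.json and the keys n_estimators and max_features are made up for illustration and may not match the keys in the repository's config files.

```python
import json
import os
import tempfile


def load_config(path):
    """Read a JSON config file and return its contents as a dict."""
    with open(path) as handle:
        return json.load(handle)


# Hypothetical hyperparameters, written to a temporary file for the demo.
example = {"n_estimators": 100, "max_features": "log2"}

with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = os.path.join(tmp_dir, "example_model.json")
    with open(config_path, "w") as handle:
        json.dump(example, handle, indent=2)
    config = load_config(config_path)

print(config["n_estimators"], config["max_features"])
```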

output

The output subdirectory contains scripts for post-processing the output files.

pria_lifechem's Issues

Automated Zenodo data download

Now that our screening datasets are available on Zenodo, we could write a Python script that downloads the files to the dataset subdirectory and creates the directory structure expected by the model training code.

Error when training: "invalid literal for int() with base 10: ' '"

When I used
python sklearn_randomforest.py --config_json_file=../../json/sklearn_randomforest.json --model_dir=$results_and_model_directory --dataset_dir=$path_to_dataset --process_num=$process --stage=0

to train random forest models, I got the following error message:
Traceback (most recent call last):
File "sklearn_randomforest.py", line 151, in
process_num = int(given_args.process_num)
ValueError: invalid literal for int() with base 10: ' '

I found that given_args.process_num is a single space ' ', so it cannot be converted to an integer. Any advice? Many thanks.
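
The error can be reproduced in isolation, since int() cannot parse a string that contains only whitespace. That suggests the value reaching --process_num was blank, so checking that $process holds an integer before running the command is a reasonable first step. The sketch below is not the repository's code. It shows one way an argument like this could be validated with argparse so that a bad value is rejected with a usage message that names the option. Only the --process_num flag comes from the command above; parse_process_num is a hypothetical helper.

```python
import argparse


def parse_process_num(argv):
    """Parse --process_num as an integer, ignoring any other arguments."""
    parser = argparse.ArgumentParser()
    # type=int makes argparse reject non-numeric values with a usage error.
    parser.add_argument("--process_num", type=int, required=True)
    args, _unknown = parser.parse_known_args(argv)
    return args.process_num


print(parse_process_num(["--process_num=3"]))

# Reproduce the reported failure directly.
try:
    int(" ")
except ValueError as error:
    print("ValueError:", error)
```

With type=int, a blank value causes argparse to exit with a message pointing at --process_num instead of failing later with a bare traceback.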

Why didn't you do model ensemble?

Thank you for sharing your code.
You selected several of the better models and compared their performance, but in the end you chose only the single best model as your final model. Why didn't you ensemble the models instead? An ensemble might perform better than your best single model.
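
For readers unfamiliar with the idea raised here, a simple ensemble averages the scores that several models assign to each compound and ranks compounds by the mean. The sketch below illustrates only that idea. It is not taken from the repository, and the scores and the helper average_ensemble are made up for the example.

```python
def average_ensemble(score_lists):
    """Average per-compound scores across models (one score list per model)."""
    n_models = len(score_lists)
    return [sum(scores) / n_models for scores in zip(*score_lists)]


# Made-up scores for four compounds from three hypothetical models.
model_scores = [
    [0.9, 0.2, 0.4, 0.1],
    [0.7, 0.4, 0.6, 0.1],
    [0.8, 0.3, 0.2, 0.4],
]

ensemble = average_ensemble(model_scores)

# Rank compound indices from highest to lowest ensemble score.
ranking = sorted(range(len(ensemble)), key=lambda i: ensemble[i], reverse=True)
print(ranking)
```

Whether such an average beats the best single model depends on how correlated the models' errors are, so it would need to be evaluated on held-out data like any other candidate.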
