paccmann_rl

Pipeline to reproduce the results of the PaccMann^RL paper.

Description

In the repo we provide a conda environment and instructions to reproduce the pipeline described in the manuscript:

Train a multimodal drug sensitivity predictor (source code)
Train a generative model for omic profiles, also known as the PVAE (source code)
Train a generative model for molecules, also known as the SVAE (source code)
Train PaccMann^RL (source code)

Requirements

conda>=3.7
The following data from here:
- The processed split data from the folder splitted_data
- The processed gene expression data from GDSC: data/gene_expression/gdsc-rnaseq_gene-expression.csv
- The processed SMILES of the drugs from GDSC: data/smiles/gdsc.smi
- A pickled SMILESLanguage object (data/smiles_language_chembl_gdsc_ccle.pkl)
- A pickled list of genes representing the panel considered in the paper (data/2128_genes.pkl)
- A pickled pandas DataFrame containing expression values and metadata for the cell lines considered in the paper (data/gdsc_transcriptomics_for_conditional_generation.pkl)
The git repos linked in the previous section

NOTE: please refer to the README.md and to the manuscript for details on the datasets used and the preprocessing applied.

Setup

Install the environment

Create a conda environment:

conda env create -f conda.yml

Activate the environment:

conda activate paccmann_rl

Download data

Download the data reported in the requirements section. From now on, we will assume that they are stored in the root of the repository in a folder called data, following this structure:

data
├── 2128_genes.pkl
├── gdsc-rnaseq_gene-expression.csv
├── gdsc.smi
├── gdsc_transcriptomics_for_conditional_generation.pkl
├── smiles_language_chembl_gdsc_ccle.pkl
└── splitted_data
    ├── gdsc_cell_line_ic50_test_fraction_0.1_id_997_seed_42.csv
    ├── gdsc_cell_line_ic50_train_fraction_0.9_id_997_seed_42.csv
    ├── tcga_rnaseq_test_fraction_0.1_id_242870585127480531622270373503581547167_seed_42.csv
    ├── tcga_rnaseq_train_fraction_0.9_id_242870585127480531622270373503581547167_seed_42.csv
    ├── test_chembl_22_clean_1576904_sorted_std_final.smi
    └── train_chembl_22_clean_1576904_sorted_std_final.smi

1 directory, 11 files

NOTE: no worries, the data folder is in the .gitignore.
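
Before starting a long training run, it can help to confirm that the data folder matches the layout above. The following is a minimal sketch and not part of the repository: the names EXPECTED_FILES and missing_data_files are hypothetical, and the file list is copied from the tree shown above.

```python
from pathlib import Path

# Relative paths copied from the directory tree above (hypothetical constant name).
EXPECTED_FILES = [
    "2128_genes.pkl",
    "gdsc-rnaseq_gene-expression.csv",
    "gdsc.smi",
    "gdsc_transcriptomics_for_conditional_generation.pkl",
    "smiles_language_chembl_gdsc_ccle.pkl",
    "splitted_data/gdsc_cell_line_ic50_test_fraction_0.1_id_997_seed_42.csv",
    "splitted_data/gdsc_cell_line_ic50_train_fraction_0.9_id_997_seed_42.csv",
    "splitted_data/tcga_rnaseq_test_fraction_0.1_id_242870585127480531622270373503581547167_seed_42.csv",
    "splitted_data/tcga_rnaseq_train_fraction_0.9_id_242870585127480531622270373503581547167_seed_42.csv",
    "splitted_data/test_chembl_22_clean_1576904_sorted_std_final.smi",
    "splitted_data/train_chembl_22_clean_1576904_sorted_std_final.smi",
]


def missing_data_files(data_root="data"):
    """Return the expected files that are absent under data_root."""
    root = Path(data_root)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_data_files()
    if missing:
        print("Missing files:")
        for name in missing:
            print(f"  data/{name}")
    else:
        print("All expected data files are present.")
```

Run it from the root of the repository. An empty result means every file listed in the tree was found; it does not check file contents.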

Clone the repos

To get the scripts to run each of the components, create a code folder and clone the repos:

mkdir code && cd code && \
  git clone https://github.com/PaccMann/paccmann_predictor && \
  git clone https://github.com/PaccMann/paccmann_omics && \
  git clone https://github.com/PaccMann/paccmann_chemistry && \
  git clone https://github.com/PaccMann/paccmann_generator && \
  cd ..

NOTE: no worries, the code folder is in the .gitignore.
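
To confirm that the four clones succeeded, one option is to look for the training script that each repo is expected to provide. This is a hedged sketch, not a tool shipped with the repos: EXPECTED_SCRIPTS and missing_scripts are hypothetical names, and the script paths are the ones used in the Pipeline section of this README.

```python
from pathlib import Path

# Training script per repo, as invoked in the Pipeline section (hypothetical constant name).
EXPECTED_SCRIPTS = {
    "paccmann_predictor": "examples/train_paccmann.py",
    "paccmann_omics": "examples/train_vae.py",
    "paccmann_chemistry": "examples/train_vae.py",
    "paccmann_generator": "examples/train_paccmann_rl.py",
}


def missing_scripts(code_root="code"):
    """Return the repos whose training script is not found under code_root."""
    root = Path(code_root)
    return [
        repo
        for repo, script in EXPECTED_SCRIPTS.items()
        if not (root / repo / script).is_file()
    ]


if __name__ == "__main__":
    for repo in missing_scripts():
        print(f"Training script not found for: code/{repo}")
```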

Pipeline

Everything is now set to run the full pipeline.

NOTE: the workload required to run the full pipeline is intensive, and running all the steps on a desktop or laptop might not be straightforward. For this reason, we also provide pretrained models that can be downloaded and used to run the different steps.

NOTE: in the following, we assume a folder models has been created in the root of the repository. No worries, the models folder is in the .gitignore.

Multimodal drug sensitivity predictor

(paccmann_rl) $ python ./code/paccmann_predictor/examples/train_paccmann.py \
    ./data/splitted_data/gdsc_cell_line_ic50_train_fraction_0.9_id_997_seed_42.csv \
    ./data/splitted_data/gdsc_cell_line_ic50_test_fraction_0.1_id_997_seed_42.csv \
    ./data/gdsc-rnaseq_gene-expression.csv \
    ./data/gdsc.smi \
    ./data/2128_genes.pkl \
    ./data/smiles_language_chembl_gdsc_ccle.pkl \
    ./models/ \
    ./code/paccmann_predictor/examples/example_params.json paccmann

PVAE

(paccmann_rl) $ python ./code/paccmann_omics/examples/train_vae.py \
    ./data/splitted_data/tcga_rnaseq_train_fraction_0.9_id_242870585127480531622270373503581547167_seed_42.csv \
    ./data/splitted_data/tcga_rnaseq_test_fraction_0.1_id_242870585127480531622270373503581547167_seed_42.csv \
    ./data/2128_genes.pkl \
    ./models/ \
    ./code/paccmann_omics/examples/example_params.json pvae

SVAE

(paccmann_rl) $ python ./code/paccmann_chemistry/examples/train_vae.py \
    ./data/splitted_data/train_chembl_22_clean_1576904_sorted_std_final.smi \
    ./data/splitted_data/test_chembl_22_clean_1576904_sorted_std_final.smi \
    ./data/smiles_language_chembl_gdsc_ccle.pkl \
    ./models/ \
    ./code/paccmann_chemistry/examples/example_params.json svae

PaccMann^RL

(paccmann_rl) $ python ./code/paccmann_generator/examples/train_paccmann_rl.py \
    ./models/svae \
    ./models/pvae \
    ./models/paccmann \
    ./data/smiles_language_chembl_gdsc_ccle.pkl \
    ./data/gdsc_transcriptomics_for_conditional_generation.pkl \
    ./code/paccmann_generator/examples/example_params.json \
    paccmann_rl breast

NOTE: this will create a biased_models folder containing the conditional generator and the baseline SMILES generator used. In this case: breast_paccmann_rl and baseline. No worries, the biased_models folder is in the .gitignore.
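
The four training steps depend on each other: the PaccMann^RL step reads ./models/svae, ./models/pvae and ./models/paccmann, so the predictor, the PVAE and the SVAE have to be trained (or downloaded) first. The sketch below assembles the commands from this section in that order and prints them as a dry run. It is an illustration only: pipeline_commands is a hypothetical helper, and the parameter name site for the final argument (breast in the command above) is an assumption made here for readability.

```python
import shlex

DATA = "./data"
SPLIT = f"{DATA}/splitted_data"
MODELS = "./models"
CODE = "./code"
TCGA_ID = "242870585127480531622270373503581547167"


def pipeline_commands(site="breast"):
    """Return the four training commands from this README, in execution order."""
    language = f"{DATA}/smiles_language_chembl_gdsc_ccle.pkl"
    genes = f"{DATA}/2128_genes.pkl"
    return [
        # 1. Multimodal drug sensitivity predictor
        ["python", f"{CODE}/paccmann_predictor/examples/train_paccmann.py",
         f"{SPLIT}/gdsc_cell_line_ic50_train_fraction_0.9_id_997_seed_42.csv",
         f"{SPLIT}/gdsc_cell_line_ic50_test_fraction_0.1_id_997_seed_42.csv",
         f"{DATA}/gdsc-rnaseq_gene-expression.csv", f"{DATA}/gdsc.smi",
         genes, language, f"{MODELS}/",
         f"{CODE}/paccmann_predictor/examples/example_params.json", "paccmann"],
        # 2. PVAE
        ["python", f"{CODE}/paccmann_omics/examples/train_vae.py",
         f"{SPLIT}/tcga_rnaseq_train_fraction_0.9_id_{TCGA_ID}_seed_42.csv",
         f"{SPLIT}/tcga_rnaseq_test_fraction_0.1_id_{TCGA_ID}_seed_42.csv",
         genes, f"{MODELS}/",
         f"{CODE}/paccmann_omics/examples/example_params.json", "pvae"],
        # 3. SVAE
        ["python", f"{CODE}/paccmann_chemistry/examples/train_vae.py",
         f"{SPLIT}/train_chembl_22_clean_1576904_sorted_std_final.smi",
         f"{SPLIT}/test_chembl_22_clean_1576904_sorted_std_final.smi",
         language, f"{MODELS}/",
         f"{CODE}/paccmann_chemistry/examples/example_params.json", "svae"],
        # 4. PaccMann^RL, which consumes the three models trained above
        ["python", f"{CODE}/paccmann_generator/examples/train_paccmann_rl.py",
         f"{MODELS}/svae", f"{MODELS}/pvae", f"{MODELS}/paccmann", language,
         f"{DATA}/gdsc_transcriptomics_for_conditional_generation.pkl",
         f"{CODE}/paccmann_generator/examples/example_params.json",
         "paccmann_rl", site],
    ]


if __name__ == "__main__":
    # Dry run: print each command instead of executing it.
    for command in pipeline_commands():
        print(" ".join(shlex.quote(arg) for arg in command))
```

To actually execute the steps, each list could be passed to subprocess.run(command, check=True) inside the activated paccmann_rl environment, so that a failing step stops the sequence.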

References

If you use paccmann_rl in your projects, please cite the following:

@misc{born2019paccmannrl,
    title={PaccMann^RL: Designing anticancer drugs from transcriptomic data via reinforcement learning},
    author={Jannis Born and Matteo Manica and Ali Oskooei and Joris Cadow and María Rodríguez Martínez},
    year={2019},
    eprint={1909.05114},
    archivePrefix={arXiv},
    primaryClass={q-bio.BM}
}
